I Introduction
Sampling-based motion planners are efficient tools for planning in high-dimensional spaces and difficult environments. They can be used to plan motions for robotic manipulation tasks, autonomous car maneuvers, and many other problems. An important aspect of a sampling-based motion planner is its sampling distribution. Planners such as Probabilistic Road Map (PRM) [1], Rapidly-Exploring Random Trees (RRT) [2], Expansive Space Trees (EST) [3], Fast Marching Trees (FMT*) [4], Batch Informed Trees (BIT*) [5], and their many variants iteratively build trees to connect samples drawn from their sampling distributions. Thus, the distribution strongly affects how the search progresses. Traditionally, planners draw random state samples from a uniform distribution (often with a slight goal bias). However, for many classes of environments, a different probability distribution over the state space can reduce planning time. For example, in environments with sparse obstacles, it can be useful to heavily bias the samples towards the goal region, as the path to the goal will be relatively straight. The natural questions to ask are “How heavily should the goal be biased?” or, more generally, “What is the best probability distribution to draw from?” In previous literature, many researchers have found good heuristics [6] [7] to modify the probability distributions for specific environments. However, these heuristics do not work generally and may not apply to new environment types. In fact, a heuristic can increase the planning time dramatically if it is unsuited to the problem at hand.
In this work, we present a systematic way to generate effective probability distributions automatically for different types of environments. The first issue encountered is how to choose a good representation for probability distributions. The sampling distributions can be very complicated in shape and may not be easily representable as common distributions such as Gaussians or mixtures of Gaussians. Instead, the distribution is represented with rejection sampling, a powerful method that can implicitly model intricate distributions. The process of accepting or rejecting samples is formulated as a Markov Decision Process. This way, policy gradient methods from the reinforcement learning literature can be used to optimize the sampling distribution for any planning cost, such as the number of collision checks or the size of the planning tree. The method uses past searches in similar environments to learn the characteristics of good planning distributions. The learned rejection sampling model is then applied to new instances of the environment.
The contributions of this paper are to 1) present an adaptive approach for generating good probability distributions for different environments that improve the performance of sampling-based planners and 2) analyze the learned policies against previous heuristic approaches. The method is shown to imitate previous heuristic approaches on a simple 2D problem and finds good, intuitive heuristics for tabletop manipulation tasks with robotic arms. The paper is organized as follows: Section II discusses previous research on modifying sampling distributions. Section III gives a formal statement of the problem. Section IV describes our method. Section V gives specific implementation details. Section VI details experiments in simulation as well as a real robot experiment, and conclusions are presented in Section VII.
II Related Work
There are a number of methods that use rejection sampling to bias the sampling distribution. Boor et al. [8] introduce a method to bias random samples towards obstacles for the PRM planner. For every sampled state, an additional state is generated from a Gaussian distribution around the first state. A sample is only accepted if exactly one of the two points is in collision. Urmson and Simmons [9] proposed a method to compute lower cost RRT paths. Each node in their tree is given a heuristic “quality” that estimates how good a path passing through that node will be. Rejection sampling is used to sample points near high quality nodes. This method is mostly superseded by RRT* [10], but it is a useful example of how rejection sampling has been used to improve path quality. Yershova et al. [6] introduce Dynamic-Domain RRT, which rejects samples that are too far from the tree. The idea is that drawing samples on the other side of an obstacle is wasteful since it will lead to a collision, so sampling is restricted to an area close to the tree. Shkolnik and Tedrake [7] introduce BallTree, which does the opposite of Dynamic-Domain RRT and rejects samples that are too close to the tree. The idea is that many nodes in the tree are wasted in exploring areas that are close to the tree. Shkolnik and Tedrake [11] also present another heuristic to improve RRT performance for differentially constrained systems: samples are rejected when the reachability region of the nearest neighbor is further from the random sample than the nearest neighbor itself, since extending towards such a sample would not actually encourage exploration.
There are also methods that do not utilize rejection sampling. Zhang and Manocha [12] modify random samples by moving points in the obstacle space to the nearest point in free space. The effect of this method is that small “tunnels” that are surrounded by obstacles will be sampled more frequently. As they have noted, this is effective for environments with narrow passages, which are particularly hard for traditional planners to solve due to the small probability of sampling within the narrow passage. Diankov et al. [13] grow a backward tree in the task space and bias samples in the forward configuration space tree towards it. The backward task space tree can be found much more easily in manipulation tasks and can effectively guide the forward configuration space tree. Yang and Brock [14] propose a method to quickly compute an approximation to the medial axis of a workspace. Their goal is to generate PRM samples that are close to the medial axis, as it is a good heuristic for planning in environments with narrow tunnels. This has also been explored in [15].
While the previous work has yielded good results for certain environments, it is not generally applicable. There has been some work on automating the improvement of sampling for different environments. Zucker et al. [16] introduce a method to optimize workspace sampling. The workspace is discretized and features such as visibility are computed for each discrete cell. The workspace sampling is improved using the REINFORCE algorithm [17]. This method performs well in the environment it is optimized in, but new environments can potentially have a high preprocessing cost to compute the features. In addition, discretizing the workspace may be infeasible for certain problem domains. Gammell et al. [18] introduced Informed RRT*, which improves RRT* performance by restricting samples to an ellipsoid that contains all samples that could possibly improve the path length after an initial path is found. Kunz et al. [19] and Yi et al. [20] improve the informed sampling technique. This technique does not improve the speed at which the first path is found. More recently, Ichter et al. [21] used a Variational Autoencoder to learn an explicit sampling distribution for FMT*.
Our approach differs from previous work by introducing a general method for sampling-based planners that neither relies on a human-created heuristic nor requires any discretization of the workspace. In addition, this method can be combined with most previous approaches to further improve performance.
III Problem Statement
The problem this paper addresses is to reduce the computational cost of path planning in certain types of environments by modifying the sampling distributions. For clarity, let us consider planning trajectories for a robotic arm in typical tabletop environments.
Following the notation from [10], the state space of a planning problem is denoted as $\mathcal{X}$. For a given environment, let $\mathcal{X}_{obs}$ denote the obstacle space, the subset of $\mathcal{X}$ that the robot cannot move in. A map is thus uniquely defined by its $\mathcal{X}_{obs}$. A specific environment type, $E$, is a probability distribution over possible obstacle spaces $\mathcal{X}_{obs}$. For a 7-DOF robotic arm, $\mathcal{X}$ is the 7-dimensional configuration space, and $E$ assigns higher probability to environments that look like scattered objects on a table.
Let $Y = \{Y_1, Y_2, \ldots\}$ be a sequence of random variables, where $Y_i$ represents the $i$-th random sample of the state space drawn during the planning process (note that the random variables do not need to be identically distributed, as shown in Fig. 7). In standard sampling-based planners, the $Y_i$ are independent and identically distributed. Given a specific map $\mathcal{X}_{obs}$ and a sequence of random state space samples $Y$, let $Z(\mathcal{X}_{obs}, Y)$ be a random variable representing the planning cost, measured by the number of collision checks, the size of the search tree, and the number of random samples drawn during the planning process. $Z$ is a random variable because it depends on the random samples $Y$ that are drawn. The problem this paper addresses is the following optimization problem:
$$\min_{Y} \; \mathbb{E}_{\mathcal{X}_{obs} \sim E}\left[ \mathbb{E}\left[ Z(\mathcal{X}_{obs}, Y) \right] \right] \tag{1}$$
Eq. 1 succinctly describes the following: given a distribution of maps $E$, find the sequence of distributions $Y$ that minimizes the expected computational cost of the search, $\mathbb{E}[Z]$. For a robotic arm, this amounts to finding the probability distribution that minimizes the number of collision checks, the size of the search tree, and the number of random samples drawn in common tabletop environments.
IV Learning Sampling Distributions
It is difficult to represent the sequence of distributions $Y$ from Eq. 1 explicitly. The distribution may be very complicated and not easily representable with simple distributions. In addition, there may not be an easy explicit map available (often there is just an oracle that returns whether a collision has occurred). A way to implicitly represent a complicated distribution is with rejection sampling, similar to techniques presented in [8], [6], [7], [11]. In our method, random samples are drawn from some explicitly given distribution $\nu$ (usually the uniform distribution with a peak at the goal). For each random sample drawn, a probability of rejection is computed. The sample is then either passed to the planner or rejected. The end result is that unfavorable samples are discarded with high probability, so computation time is not wasted in attempting to add the node to the tree or in checking it for collisions. This can improve performance because the sampling operation is usually cheap, whereas collision checking and tree extension are much more expensive. For example, in the robotic arm experiments described later, the policy has learned that samples with large distances between joints and obstacles are unfavorable, as they do not progress the search. The policy is learned offline and is applied to new environments that are similar in nature (for example, in a grasping task, a desk with different objects in different locations).
More formally, the probability of rejecting a sample $x$ is denoted as $\pi(a_{reject} \mid x)$, where $a_{reject}$ is the action of rejecting a sample. The function $\pi$ is learned offline (discussed in Section IV-B). Thus, $\pi$ can implicitly represent a probability measure $\mu$, the distribution that is effectively being sampled when applying rejection sampling. For any measurable set $B \subseteq \mathcal{X}$,
$$\mu(B) = \frac{\int_{B} \left(1 - \pi(a_{reject} \mid x)\right) d\nu(x)}{\int_{\mathcal{X}} \left(1 - \pi(a_{reject} \mid x)\right) d\nu(x)} \tag{2}$$
This is valid as long as $\int_{\mathcal{X}} \left(1 - \pi(a_{reject} \mid x)\right) d\nu(x)$ is some finite positive number. This will be easily satisfied if $\pi(a_{reject} \mid x) < 1$ for all $x$.
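The rejection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`rejection_sample`, `uniform_proposal`, `toy_reject_prob`), the 1-D state space, and the hand-written rejection rule are all hypothetical stand-ins for the proposal distribution and the learned rejection probability. The sketch only shows that the accepted samples follow the proposal reweighted by the acceptance probability and renormalized.

```python
import random


def rejection_sample(draw_proposal, reject_prob, rng, max_tries=10_000):
    """Draw one sample from the implicit distribution defined by rejection.

    draw_proposal(rng) -> sample from the explicit proposal distribution.
    reject_prob(x)     -> probability in [0, 1) of discarding sample x.
    Returns (accepted_sample, number_of_proposals_drawn).
    """
    for tries in range(1, max_tries + 1):
        x = draw_proposal(rng)
        # Accept with probability 1 - reject_prob(x).
        if rng.random() >= reject_prob(x):
            return x, tries
    raise RuntimeError("no sample accepted; is reject_prob bounded below 1?")


# Hypothetical toy problem: uniform proposal on [0, 1].
def uniform_proposal(rng):
    return rng.random()


# Hypothetical rule: discard samples in the left half 90% of the time.
def toy_reject_prob(x):
    return 0.9 if x < 0.5 else 0.0


rng = random.Random(0)
samples = [rejection_sample(uniform_proposal, toy_reject_prob, rng)[0]
           for _ in range(5000)]
frac_left = sum(s < 0.5 for s in samples) / len(samples)
# Reweighting predicts mass (0.5 * 0.1) / (0.5 * 0.1 + 0.5 * 1.0) = 1/11
# on the left half of the interval.
print(f"fraction of accepted samples in left half: {frac_left:.3f}")
```

In a planner, `reject_prob` would be replaced by the learned policy evaluated on features of the planner state, and only accepted samples would trigger tree extension and collision checks.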
IV-A Rejection Sampling as a Markov Decision Process
The process of rejecting samples during the planning algorithm is modeled as a Markov Decision Process (MDP), so that traditional reinforcement learning methods can be easily applied to optimize the rejection sampling. Following the notation in [22], an MDP consists of a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$ (in some literature, the MDP also includes a discount factor $\gamma$). $\mathcal{S}$ is the set of all possible states the system can be in. $\mathcal{A}$ is the set of actions that can be taken. $\mathcal{P}^{a}_{ss'}$ are the transition probabilities, and $\mathcal{R}^{a}_{s}$ is the reward for taking action $a$ in state $s$. A typical goal in an MDP is to select a policy $\pi$, mapping states to actions, that maximizes the sum of rewards over a time horizon $T$, $\sum_{t=0}^{T} r_t$.
In the setting of sampling-based planners, a state $s_t$ consists of the environment $\mathcal{X}_{obs}$, the current state of the planner, and a randomly generated sample $x_{rand}$ from the distribution $\nu$. The action space is $\mathcal{A} = \{a_{accept}, a_{reject}\}$. Upon taking action $a_{accept}$, the sample is passed to the planner. Upon taking action $a_{reject}$, the sample is rejected. In both cases, a new random sample is included in the new state $s_{t+1}$. A reward $r_t$ is given based on $(s_t, a_t)$. The cost defined for $Z$ simply becomes the negative reward. An MDP model of the rejection sampling applied to RRT is described pictorially in Fig. 2. Note that algorithms that use batches of samples, such as PRM or BIT*, can utilize this simply by drawing and rejecting samples until there are enough for a batch.
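The accept/reject loop can be sketched as a small environment with a `step` function. This is an illustrative assumption, not the paper's code: `ToyPlanner` is a hypothetical 1-D stand-in for RRT, and the reward constants are placeholders chosen only to show that rejected samples incur a small sampling cost while accepted samples also pay for tree growth and collision checks.

```python
import random
from dataclasses import dataclass, field

ACCEPT, REJECT = 0, 1


@dataclass
class ToyPlanner:
    """Hypothetical stand-in for RRT: stores accepted samples as 'nodes'."""
    nodes: list = field(default_factory=lambda: [0.0])
    goal: float = 1.0

    def add_sample(self, x):
        self.nodes.append(x)
        return 1  # pretend each accepted sample costs one collision check

    def done(self):
        return any(abs(n - self.goal) < 0.05 for n in self.nodes)


class RejectionSamplingMDP:
    """State = (planner, current random sample); actions = ACCEPT / REJECT."""

    def __init__(self, planner, rng):
        self.planner, self.rng = planner, rng
        self.x_rand = rng.random()

    def state(self):
        return self.planner, self.x_rand

    def step(self, action):
        reward = -0.01  # placeholder cost of drawing the sample
        if action == ACCEPT:
            checks = self.planner.add_sample(self.x_rand)
            reward -= 1.0 + checks  # one node added plus its collision checks
        # In both cases a fresh random sample becomes part of the next state.
        self.x_rand = self.rng.random()
        return self.state(), reward, self.planner.done()


env = RejectionSamplingMDP(ToyPlanner(), random.Random(1))
total, done, steps = 0.0, False, 0
while not done and steps < 10_000:
    _, x = env.state()
    action = ACCEPT if x > 0.9 else REJECT  # hand-written policy for the demo
    _, r, done = env.step(action)
    total += r
    steps += 1
print(f"steps={steps}, nodes={len(env.planner.nodes)}, return={total:.2f}")
```

A learned policy would replace the hand-written threshold rule, and the sum of rewards collected in the loop is the quantity that reinforcement learning would try to maximize.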
The policy is defined as $\pi(a \mid s)$, the probability of taking action $a$ in state $s$. Furthermore, $\pi$ is restricted to a class of functions with parameters $\theta$ that take as input a feature vector $\phi(s)$ instead of the raw state $s$. The policy is referred to as $\pi_\theta(a \mid \phi(s))$. In this paper, the function is represented as a neural net where $\theta$ represents the weights in the network. By implicitly defining the probabilities in Eq. 2 with the policy $\pi_\theta$, $Y$ can be written as a function of $\theta$. The optimization problem in Eq. 1 can be rewritten as
$$\min_{\theta} \; \mathbb{E}_{\mathcal{X}_{obs} \sim E}\left[ \mathbb{E}\left[ Z(\mathcal{X}_{obs}, Y(\theta)) \right] \right] \tag{3}$$
where all $Y_i$ share the same parameters $\theta$ but may be different distributions due to the different states the planner will be in. Furthermore, to keep notation consistent with the reinforcement learning literature, the planning cost $Z$ is redefined as
$$Z(\mathcal{X}_{obs}, Y(\theta)) = -\sum_{t=0}^{T} r_t \tag{4}$$
where the rewards have been chosen to reflect the negative cost represented by $Z$. Specific reward functions for the experiments are described in Section V. Finally, the expectation can be approximated with samples of typical environments that $E$ contains:
$$\max_{\theta} \; \sum_{\mathcal{X}_{obs} \in \hat{E}} \mathbb{E}\left[ \sum_{t=0}^{T} r_t \right] \tag{5}$$
where $\hat{E}$ is a set of $\mathcal{X}_{obs}$ that are representative of the environment type $E$.
IV-B Optimizing the Probability Distributions
Many methods from the reinforcement learning literature have been developed to solve the optimization problem posed in Eq. 5. These methods can be roughly split into two categories: 1) value-based methods such as Q-learning [23], which try to estimate the expected sum of rewards at a given state, and 2) policy gradient methods, which attempt to directly optimize the policy. This paper utilizes policy gradient methods, in particular the REINFORCE algorithm introduced by Williams [17] and later extended to function approximation by Sutton et al. [22]. The rationale for choosing policy gradient methods over value-based methods is that the policy has an explicit form that is fast to evaluate, which is vital as the policy is used in the inner loop of sampling-based planners.
In REINFORCE with function approximation, the policy is improved iteratively by taking gradient ascent steps, $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $J(\theta)$ is $\mathbb{E}\left[\sum_{t=0}^{T} r_t\right]$, the quantity being maximized in Eq. 5. For multiple environments, this can be achieved by iteratively taking gradient ascent steps for every environment or by using the average gradient over all the environments. The likelihood ratio policy gradient presented in [22] can be written as
$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid \phi(s_t)) \left( R_t - b(s_t) \right) \right] \tag{6}$$
where $R_t = \sum_{t'=t}^{T} r_{t'}$ and $b(s_t)$ is an estimate of the value function used as a baseline to reduce variance [22]. Given an environment and policy $\pi_\theta$, the expectation in Eq. 6 can be estimated by running the planner $N$ times with $\pi_\theta$ and collecting samples of tuples $(s_t, a_t, r_t)$ to calculate $\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid \phi(s_t)) \left( R_t - b(s_t) \right)$ for each rollout, then averaging over the rollouts.
During training, another neural network is fitted to represent $b$ with weights $\omega$. Utilizing the samples in each iteration of the policy gradient ascent, an iteration of gradient descent is run on $\omega$ to minimize the loss function
$$L(\omega) = \sum_{t} \left( b_\omega(s_t) - R_t \right)^2 \tag{7}$$
to update the baseline $b_\omega$. The steps of the algorithm are detailed in Algorithm 1.
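The likelihood ratio estimate with a baseline can be sketched for the simplest possible policy. The following is an illustration under stated assumptions, not the paper's training code: the policy is a hypothetical logistic model over a single scalar feature rather than a neural network, the baseline is a user-supplied function (zero by default), and the function names are invented for this example.

```python
import math


def accept_prob(theta, phi):
    """Logistic policy: probability of accepting a sample with feature phi."""
    w, b = theta
    return 1.0 / (1.0 + math.exp(-(w * phi + b)))


def rewards_to_go(rewards):
    """R_t = r_t + r_{t+1} + ... for every step t of one rollout."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def reinforce_gradient(rollouts, theta, baseline=lambda phi: 0.0):
    """Average over rollouts of sum_t grad log pi(a_t | phi_t) * (R_t - b).

    Each rollout is a list of (phi, action, reward) tuples,
    with action 1 = accept and action 0 = reject.
    """
    grad = [0.0, 0.0]
    for rollout in rollouts:
        returns = rewards_to_go([r for _, _, r in rollout])
        for (phi, action, _), ret in zip(rollout, returns):
            p = accept_prob(theta, phi)
            dlogp = action - p  # d log pi / d logit for a Bernoulli policy
            advantage = ret - baseline(phi)
            grad[0] += dlogp * phi * advantage  # derivative w.r.t. w
            grad[1] += dlogp * advantage        # derivative w.r.t. b
    return [g / len(rollouts) for g in grad]


# One hand-made rollout: accept a sample (reward -1.0), then reject (-0.5).
rollouts = [[(1.0, 1, -1.0), (2.0, 0, -0.5)]]
theta = (0.0, 0.0)
grad = reinforce_gradient(rollouts, theta)
step_size = 0.1
theta_new = tuple(t + step_size * g for t, g in zip(theta, grad))
print(grad, theta_new)
```

A gradient ascent step moves `theta` along `grad`; in the paper's setting the same estimate would be backpropagated through the policy network, and the baseline would be a second network fitted to the observed rewards-to-go.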
One downside of policy gradient methods is that they are susceptible to local optima, as the objective function is not convex. To mitigate this, several different policies are initialized and the best policy is chosen. Different features should also be tested, since the performance depends on what information is available to the policy.
IV-C Probabilistic Completeness
It is intuitive that this process of rejection sampling preserves probabilistic completeness for RRT. Following the original proof in [2], the existence of an attraction sequence of length $k$ between the start and goal positions is assumed. The proof then reduces to showing that there is a minimum probability of transitioning from one ball in the attraction sequence to the next. Treating the transition as a biased coin flip with success rate $p$, the question of whether a path is found in $n$ steps turns into the question of whether, out of $n$ coin flips, $k$ are successful. In [2], $p$ is given as
$$p = \min_{i} \frac{\mu(A_i)}{\mu(\mathcal{X}_{free})} \tag{8}$$
where $A_i$ is the $i$-th element in the attraction sequence. The rejection sampling modifies $\mu$ and not $A_i$. Denoting the modified measure by $\mu'$ and setting a lower threshold $\epsilon$ for the probability of acceptance of a sample, so that $\pi(a_{accept} \mid x) \geq \epsilon$, we can write
$$\frac{\mu'(A_i)}{\mu'(\mathcal{X}_{free})} = \frac{\int_{A_i} \pi(a_{accept} \mid x) \, d\mu(x)}{\int_{\mathcal{X}_{free}} \pi(a_{accept} \mid x) \, d\mu(x)} \tag{9}$$
$$\geq \frac{\int_{A_i} \epsilon \, d\mu(x)}{\int_{\mathcal{X}_{free}} \pi(a_{accept} \mid x) \, d\mu(x)} \tag{10}$$
$$\geq \frac{\int_{A_i} \epsilon \, d\mu(x)}{\int_{\mathcal{X}_{free}} 1 \, d\mu(x)} \tag{11}$$
$$= \epsilon \, \frac{\mu(A_i)}{\mu(\mathcal{X}_{free})} \tag{12}$$
$$\geq \epsilon \, p \tag{13}$$
Thus, when evaluating the modified success rate $p'$ for the learned distribution,
$$p' = \min_{i} \frac{\mu'(A_i)}{\mu'(\mathcal{X}_{free})} \geq \epsilon \, p \tag{14}$$
One key difference between the original proof and our method is that the samples drawn are no longer independent, as the acceptance or rejection of a sample can influence future samples. However, every sample lands in $A_i$ with probability at least $\epsilon p$ regardless of the history. Thus, the probability that the modified distribution draws $k$ successful samples in $n$ tries is lower-bounded by the probability of drawing $k$ successful independent samples out of $n$ from a biased coin flip with success rate $\epsilon p$.
This method therefore simply scales the probability in the original proof by a constant factor, which does not change the proof in any way, preserving probabilistic completeness.
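The coin-flip argument can be checked numerically. The sketch below is only an illustration with made-up numbers: the values of the success rate, the acceptance threshold, and the attraction sequence length are assumptions, and `prob_at_least_k` is a hypothetical helper that evaluates the binomial tail directly. It shows that scaling the success rate by a constant slows convergence but still drives the success probability to one as the number of samples grows.

```python
import math


def prob_at_least_k(n, k, p):
    """Probability of at least k successes in n independent biased coin flips."""
    below = sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k))
    return 1.0 - below


p = 0.1          # assumed success rate of the original sampler
epsilon = 0.05   # assumed lower bound on the acceptance probability
k = 5            # assumed length of the attraction sequence

for n in (100, 1_000, 10_000):
    original = prob_at_least_k(n, k, p)
    scaled = prob_at_least_k(n, k, epsilon * p)
    print(f"n={n:6d}  original={original:.4f}  scaled={scaled:.4f}")
```

The scaled column lags behind the original one for small `n`, which matches the intuition that rejection sampling can only cost a constant factor in the completeness bound.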
V Implementation Details
This section briefly describes the details of the reward function and the policy neural network so that experiments may be replicated.
V-A Reward Function
The reward function is chosen to reflect the computation time of the planning algorithm:
$$r_t = -\left( c_{sample} + \lambda_{node} \, n_{node,t} + \lambda_{col} \, n_{col,t} \right) \tag{15}$$
$c_{sample}$ is a small value that represents the cost of sampling. $n_{node,t}$ is the number of nodes added to the tree in iteration $t$ and $n_{col,t}$ is the number of collision checks performed in iteration $t$. $\lambda_{node}$ and $\lambda_{col}$ are simply scaling factors whose values are fixed across the experiments in this paper. Note that the total reward is simply the negative of the scaled total number of nodes plus the scaled total number of collision checks plus the scaled total number of samples drawn from $\nu$. The reward function is designed to reflect the operations that take the majority of the time: extending the tree and collision checking. The reward function can be made more elaborate, or be nonlinear, but this form is used for simplicity. In practice, this method can be made more accurate by measuring the time of each operation (collision check, node expansion, etc.) to compute the weighting factors $\lambda_{node}$ and $\lambda_{col}$. In addition, the rewards are normalized by their running statistics so that all problem types can have similar reward ranges.
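A per-step reward of this form, together with running normalization, could be sketched as follows. The default weights and the sampling cost are placeholder assumptions (the values used in the experiments are not reproduced here), and the normalizer uses Welford's online algorithm as one reasonable choice for tracking running statistics; the paper does not specify which scheme is used.

```python
import math


def step_reward(n_nodes_added, n_collision_checks,
                sample_cost=0.01, node_weight=1.0, collision_weight=1.0):
    """Negative weighted cost of one planner iteration (placeholder weights)."""
    return -(sample_cost
             + node_weight * n_nodes_added
             + collision_weight * n_collision_checks)


class RunningNormalizer:
    """Tracks running mean and variance with Welford's online algorithm."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x, eps=1e-8):
        # Population standard deviation; fall back to 1.0 with too little data.
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0
        return (x - self.mean) / (std + eps)


normalizer = RunningNormalizer()
raw = [step_reward(0, 0), step_reward(1, 3), step_reward(1, 10)]
for r in raw:
    normalizer.update(r)
print([round(normalizer.normalize(r), 3) for r in raw])
```

A rejected sample only pays `sample_cost`, so a policy is rewarded for discarding samples that would have triggered many collision checks without advancing the search.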
V-B Policy and Value Networks
In this work, the policy $\pi_\theta$ is a neural network that outputs probabilities of acceptance and rejection. A neural network is chosen to represent the policy because of the flexibility of the functions it can represent. Initial results showed that a simple model like logistic regression can be insufficient in complicated environments. In addition, with neural networks, there is no need to select basis functions to introduce nonlinearities.
The network used is a relatively small perceptron with two hidden layers (inference must be fast, as this function is run many times in the inner loop of the algorithm). For reference, the network evaluated a sample in around 3.59 microseconds using only the CPU of a typical laptop. The input $\phi(s)$ is passed through two hidden layers with 32 and 16 neurons and rectified linear activations. There is a batchnorm operation [24] after each hidden layer. The output of the second batchnorm layer is passed to a final fully connected layer with 2 outputs that represent the logits for accepting or rejecting the sample. The logits are fed into a softmax operation to obtain the probabilities. Additionally, the logits are modified so that all probabilities lie between 0.05 and 0.95. This ensures that $\pi_\theta(a_{reject} \mid \phi(s)) < 1$, which guarantees that $\mu$ is a valid probability distribution. It also allows the policy to always have a small chance of accepting or rejecting, which is useful for exploration in the reinforcement learning algorithm. The policy network is shown in Fig. 4.
The neural network for the baseline $b$ is similar to the policy network. The only difference is that the output layer is a single neuron representing the value. All networks are trained with the Adam optimizer [25].
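The forward pass of such a policy network can be sketched in plain Python. This is a simplified illustration, not the released implementation: batch normalization is omitted, the weights are randomly initialized rather than trained, the four input features are arbitrary, and the probability bound is enforced by clamping the softmax output, which is only one possible way to keep probabilities within [0.05, 0.95].

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def policy_probs(features, params, p_min=0.05, p_max=0.95):
    """Return (p_accept, p_reject) for one feature vector."""
    h = features
    for weights, biases in params[:-1]:
        h = relu(dense(h, weights, biases))
    logit_accept, logit_reject = dense(h, *params[-1])
    # Two-way softmax written as a sigmoid of the logit difference,
    # with the difference clipped to avoid overflow in exp.
    diff = max(-50.0, min(50.0, logit_reject - logit_accept))
    p_accept = 1.0 / (1.0 + math.exp(diff))
    p_accept = min(p_max, max(p_min, p_accept))
    return p_accept, 1.0 - p_accept


rng = random.Random(0)
# Hidden layers of 32 and 16 units, then 2 output logits.
params = [init_layer(4, 32, rng), init_layer(32, 16, rng), init_layer(16, 2, rng)]
p_accept, p_reject = policy_probs([0.3, 1.2, 0.8, 2.0], params)
print(f"p_accept={p_accept:.3f}, p_reject={p_reject:.3f}")
```

Keeping the network this small matters because the function is evaluated once per drawn sample; the clamp guarantees that no sample is rejected or accepted with certainty.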
The implementation code is available at https://github.com/chickensouple/learning_implicit_distributions
VI Experiments
Experiments are conducted in three sets of environments. First, we test the algorithm on three different planners in a simulated Flytrap environment. This allows us to analyze the learned policies and behavior in detail in a simplified world. Next, the algorithm is tested on a pendulum environment to analyze its performance with dynamical systems. Then, we apply the algorithm to a more complicated 7 degree of freedom robotic arm to show performance on a real system.
VI-A 2D Flytrap
The first experiment run is that of the 2D Flytrap. This environment is used as a benchmark in [6] and [7] as an example of a hard planning problem. It is difficult to solve because of the thin tunnel that must be sampled in order to find a path to the goal. The training and testing environments are shown in Fig. 6. Three different planners are tested on the environment: RRT with Connect function [26], Bidirectional RRT (BiRRT) [26], and EST. An example of the training curve is shown in Fig. 3.
For RRT, the feature used is the distance from the sample to the nearest tree node minus the distance of that tree node to its nearest obstacle. For BiRRT, the feature used is the distance from the sample to the tree currently being expanded minus the distance of the nearest node in that tree to its nearest obstacle. For EST, there are a few choices for how to modify the sampling. In this experiment, we chose to modify the probability of picking nodes in the tree for expansion (the alternative being to modify the probability of how to pick nodes to expand to), since the choice of node has a larger effect on the algorithm’s performance. The features used are two dimensional: the distance from the node to its nearest obstacle, as well as the number of nodes within a certain radius (this is the same as used in the original EST paper [3]).
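The RRT feature can be computed with a few lines of geometry. The sketch below assumes a 2D state space and circular obstacles given as (center, radius) pairs, purely for illustration; the function names are hypothetical and the actual experiments may compute obstacle distances differently.

```python
import math


def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def obstacle_clearance(point, circles):
    """Distance from a point to the nearest circular obstacle (center, radius)."""
    return min(max(0.0, dist(point, center) - radius)
               for center, radius in circles)


def rrt_feature(sample, tree_nodes, circles):
    """Distance to the nearest tree node minus that node's obstacle clearance."""
    nearest = min(tree_nodes, key=lambda node: dist(node, sample))
    return dist(sample, nearest) - obstacle_clearance(nearest, circles)


tree_nodes = [(0.0, 0.0), (1.0, 0.0)]
circles = [((1.0, 2.0), 0.5)]  # one obstacle of radius 0.5 centered at (1, 2)

far_feature = rrt_feature((4.0, 0.0), tree_nodes, circles)   # 3.0 - 1.5
near_feature = rrt_feature((1.5, 0.0), tree_nodes, circles)  # 0.5 - 1.5
print(far_feature, near_feature)
```

A positive value means the sample lies farther from the tree than the nearest node's free-space ball extends, and a negative value means it lies inside that ball.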
For each planner, the original policy of always accepting samples is compared against the policy trained on the environment shown in Fig. 5(a). The results in Fig. 5 show the statistics over 100 runs. The average of each metric tracked for the planners is compared.
For all planners, the number of collision checks is reduced while the number of samples drawn is increased. For RRT, the number of collision checks is reduced by a factor of about five. The tradeoff between collision checks and number of samples saves overall execution time. In addition, decreasing the tree size and reducing collision checks does not decrease the quality of the paths found. For each planner, the path found by the trained policy is equivalent in length or sometimes shorter, despite not explicitly optimizing for path length.
Next, the policies learned for RRT are analyzed. The learned policy rejects samples that are far away from the tree with higher probability. This is similar to the strategy suggested by Dynamic-Domain RRT [6], which in its ideal version rejects all samples that are further away from the tree than the closest obstacle. However, for Flytrap environments where the space outside of the Flytrap is not a large fraction of the space, the strategy suggested by BallTree [7] is more effective. BallTree rejects all samples that are closer to the tree than the nearest obstacle. It is curious that for very similar types of environments, the policies that work better for each are almost complete opposites! This shows a need to use the data itself to tune a rejection sampling policy. When training on the different sized environment shown in Fig. 5(c), the policies learned to exhibit behaviour similar to BallTree. The policy trained in the larger environment rejects samples further from the tree, and the policy trained in the smaller environment rejects samples that are closer to the tree, as shown in Fig. 8. The distributions encountered during the search process are visualized in Fig. 7 by sampling a uniform grid in the state space and using Eq. 2 to compute discretized probabilities for sampling each point.
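The two opposing heuristics and the grid-based visualization can be sketched together. The code below is an idealized illustration, assuming the scalar RRT feature (distance to the nearest tree node minus that node's obstacle clearance) and a uniform proposal over a handful of grid cells; the function names and feature values are hypothetical and the hard 0/1 rules stand in for the soft probabilities a learned policy would output.

```python
def dynamic_domain_accept_prob(feature):
    """Idealized Dynamic-Domain rule: reject samples farther from the tree
    than the nearest node's obstacle clearance (feature > 0)."""
    return 0.0 if feature > 0.0 else 1.0


def ball_tree_accept_prob(feature):
    """Idealized BallTree rule: reject samples closer to the tree
    than the nearest node's obstacle clearance (feature < 0)."""
    return 0.0 if feature < 0.0 else 1.0


def discretized_distribution(grid_features, accept_prob):
    """Normalize acceptance probabilities over a uniform grid of cells."""
    weights = [accept_prob(f) for f in grid_features]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("every grid cell is rejected")
    return [w / total for w in weights]


# Feature values at five hypothetical grid cells.
grid_features = [-1.0, -0.5, 0.5, 1.0, 2.0]
dd = discretized_distribution(grid_features, dynamic_domain_accept_prob)
bt = discretized_distribution(grid_features, ball_tree_accept_prob)
print(dd)
print(bt)
```

The two resulting distributions place their mass on complementary sets of cells, which mirrors the observation that the better policy depends on the size of the environment.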
VI-B Pendulum Task
In addition to the Flytrap environment, experiments were done on a planar pendulum to test the effectiveness of the method on a dynamical system. The pendulum starts at the bottom and needs to reach the top. It is control limited, so it must plan a path that increases its energy until it can swing up. In this experiment, we used a steering function that randomly samples control actions and time durations. This is a common steering function that may be used in more complicated systems [27]. The results are shown in Fig. 5. The number of collision checks is not included for this particular experiment, as there are no obstacles to collide with. The features used are 1) the difference between the goal angle and the current angle and 2) the difference in angular velocities. The policy learns to reject samples that are not likely to lead to the goal state, which saves the execution time otherwise spent computing the steering function.
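A steering function of this kind can be sketched as follows. All physical parameters, the torque limit, the maximum duration, and the integration scheme are assumptions made for illustration; the paper does not list them, and the function names are hypothetical.

```python
import math
import random


def pendulum_step(theta, omega, torque, dt, g=9.81, length=1.0, mass=1.0):
    """One semi-implicit Euler step; theta = 0 is the hanging-down position."""
    alpha = -(g / length) * math.sin(theta) + torque / (mass * length**2)
    omega = omega + alpha * dt
    theta = theta + omega * dt
    return theta, omega


def random_steer(state, rng, max_torque=2.0, max_duration=0.5, dt=0.01):
    """Apply a randomly sampled constant torque for a randomly sampled duration."""
    torque = rng.uniform(-max_torque, max_torque)
    duration = rng.uniform(dt, max_duration)
    theta, omega = state
    for _ in range(int(duration / dt)):
        theta, omega = pendulum_step(theta, omega, torque, dt)
    return (theta, omega), torque, duration


rng = random.Random(0)
new_state, torque, duration = random_steer((0.0, 0.0), rng)
print(new_state, torque, duration)
```

With the assumed torque limit below the gravity torque at the horizontal position, a single control cannot lift the pendulum to the top, so a planner has to chain several such propagations to pump energy into the swing. Every call integrates the dynamics, which is the cost a rejection policy avoids for unpromising samples.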
VI-C 7 Degree of Freedom Arm
The algorithm is also tested on the 7 degree of freedom arm of the Thor robot (Fig. 9). This experiment is used to validate the method in a higher dimensional space and in a realistic environment. Thor is given tasks to move its arm to various difficult-to-reach places in assorted tabletop environments. The environments consist of crevices for Thor to reach into and obstacles that block passages. The base planner used is BiRRT, with a four-dimensional feature space (EST and RRT were not used as the planning took too long). The first three features are the distances of various joints to the closest obstacle, and the last feature is the distance of the current configuration to the goal. Two very different environments were used for training, and a third environment distinct from the first two was used for testing.
The results of the arm experiments are shown in Fig. 10. The figure details the statistics over successful plans from 100 runs of the planner. Our algorithm had a 97% success rate in finding a path, while the original had 96%, when the number of samples drawn is limited to 100,000 (this difference is too small to support any claims). Similar to the Flytrap experiments, a policy is learned that trades off extra samples for a vastly reduced number of collision checks and nodes in the tree. On the test environment, the number of nodes in the tree is more than 5 times smaller and the planner uses 2.7 times fewer collision checks. In addition, the variance of the results is greatly reduced when using the learned distribution.
Next, the policies learned for the Thor arm are examined to see what aspects of the environment they are exploiting. We note that the probability of accepting a sample increases as 1) the distance of the configuration to the goal decreases, or 2) the workspace distance of the later joints to an obstacle decreases. This policy makes a lot of intuitive sense. Samples are concentrated near the surface of the table and objects, probing the surface for a good configuration.
VII Conclusions
Sampling distributions in sampling-based motion planners are a vital component of the algorithm that affects how many times computationally expensive subroutines such as collision checks are run. While the method presented can improve planning times by modifying the sampling distribution, it is not the whole solution for all problem types. In maps where the thin tunnel issue is more pronounced, rejection sampling does not alleviate the main dilemma of how to sample the thin tunnel. However, this method can be easily combined with existing techniques such as [12, 14, 28] to improve performance.
In addition, this paper does not directly address the problem of finding the optimal solution. The authors believe that an offline method for generating sampling distributions is not the solution for that aspect of planning. Instead, this method can be applied to existing optimal planners. It can, for instance, be used to find the first solution faster in Informed RRT* [18] or BIT* [5] to improve upon existing methods.
In conclusion, this paper presents a general way to obtain good rejection sampling schemes for sampling-based motion planners. The process can be seen as a way of encoding prior knowledge of the environments into the rejection policy by learning from previous searches in similar environments, and it is shown to be effective in practice.
Acknowledgment
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1321851. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors would like to thank Bhoram Lee and Min Wen for thoughtful conversations about the paper.
References
[1] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, “Probabilistic roadmaps for path planning in high-dimensional configuration spaces,” IEEE Transactions on Robotics and Automation, vol. 12, no. 4, pp. 566–580, 1996.
[2] S. M. LaValle and J. J. Kuffner Jr, “Randomized kinodynamic planning,” The International Journal of Robotics Research, vol. 20, no. 5, pp. 378–400, 2001.
[3] D. Hsu, J.-C. Latombe, and R. Motwani, “Path planning in expansive configuration spaces,” in Robotics and Automation, 1997. Proceedings., 1997 IEEE International Conference on, vol. 3, pp. 2719–2726, IEEE, 1997.
[4] L. Janson, E. Schmerling, A. Clark, and M. Pavone, “Fast marching tree: A fast marching sampling-based method for optimal motion planning in many dimensions,” The International Journal of Robotics Research, vol. 34, no. 7, pp. 883–921, 2015.
[5] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Batch informed trees (BIT*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 3067–3074, IEEE, 2015.
[6] A. Yershova, L. Jaillet, T. Siméon, and S. M. LaValle, “Dynamic-domain RRTs: Efficient exploration by controlling the sampling domain,” in Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on, pp. 3856–3861, IEEE, 2005.
[7] A. Shkolnik and R. Tedrake, “Sample-based planning with volumes in configuration space,” arXiv preprint arXiv:1109.3145, 2011.
[8] V. Boor, M. H. Overmars, and A. F. Van Der Stappen, “The Gaussian sampling strategy for probabilistic roadmap planners,” in Robotics and Automation, 1999. Proceedings. 1999 IEEE International Conference on, vol. 2, pp. 1018–1023, IEEE, 1999.
[9] C. Urmson and R. Simmons, “Approaches for heuristically biasing RRT growth,” in Intelligent Robots and Systems, 2003 (IROS 2003). Proceedings. 2003 IEEE/RSJ International Conference on, vol. 2, pp. 1178–1183, IEEE, 2003.
[10] S. Karaman and E. Frazzoli, “Incremental sampling-based algorithms for optimal motion planning,” Robotics Science and Systems VI, vol. 104, p. 2, 2010.
[11] A. Shkolnik, M. Walter, and R. Tedrake, “Reachability-guided sampling for planning under differential constraints,” in Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pp. 2859–2865, IEEE, 2009.
[12] L. Zhang and D. Manocha, “An efficient retraction-based RRT planner,” in Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pp. 3743–3750, IEEE, 2008.
[13] R. Diankov, N. Ratliff, D. Ferguson, S. Srinivasa, and J. Kuffner, “BiSpace planning: Concurrent multi-space exploration,” Proceedings of Robotics: Science and Systems IV, vol. 63, 2008.
[14] Y. Yang and O. Brock, “Adapting the sampling distribution in PRM planners based on an approximated medial axis,” in Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, vol. 5, pp. 4405–4410, IEEE, 2004.
[15] S. A. Wilmarth, N. M. Amato, and P. F. Stiller, “MAPRM: A probabilistic roadmap planner with sampling on the medial axis of the free space,” in Robotics and Automation, 1999. Proceedings. 1999 IEEE International Conference on, vol. 2, pp. 1024–1031, IEEE, 1999.
[16] M. Zucker, J. Kuffner, and J. A. Bagnell, “Adaptive workspace biasing for sampling-based planners,” in Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pp. 3757–3762, IEEE, 2008.
[17] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” in Reinforcement Learning, pp. 5–32, Springer, 1992.
[18] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Informed RRT*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic,” in Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, pp. 2997–3004, IEEE, 2014.
[19] T. Kunz, A. Thomaz, and H. Christensen, “Hierarchical rejection sampling for informed kinodynamic planning in high-dimensional spaces,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 89–96, IEEE, 2016.
[20] D. Yi, R. Thakker, C. Gulino, O. Salzman, and S. Srinivasa, “Generalizing informed sampling for asymptotically optimal sampling-based kinodynamic planning via Markov chain Monte Carlo,” arXiv preprint arXiv:1710.06092, 2017.
[21] B. Ichter, J. Harrison, and M. Pavone, “Learning sampling distributions for robot motion planning,” arXiv preprint arXiv:1709.05448, 2017.
[22] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
[23] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[24] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, pp. 448–456, 2015.
[25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[26] J. J. Kuffner and S. M. LaValle, “RRT-connect: An efficient approach to single-query path planning,” in Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE International Conference on, vol. 2, pp. 995–1001, IEEE, 2000.
[27] Y. Li, Z. Littlefield, and K. E. Bekris, “Asymptotically optimal sampling-based kinodynamic planning,” The International Journal of Robotics Research, vol. 35, no. 5, pp. 528–564, 2016.
[28] S. Choudhury, J. D. Gammell, T. D. Barfoot, S. S. Srinivasa, and S. Scherer, “Regionally accelerated batch informed trees (RABIT*): A framework to integrate local information into optimal path planning,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 4207–4214, IEEE, 2016.