I Introduction
Consider motion planning for robots such as UAVs [16], autonomous ships [3], and spacecrafts [22]. The planning solution needs to satisfy two criteria. First, the solution path must be feasible, meaning that the path must be collisionfree and satisfy kinodynamic constraints, e.g. velocity and acceleration bounds even in the presence of sensor noise. Second, the path needs to be efficient, i.e. near optimal with respect to objectives such as time to reach the goal. For example, a motion plan for a carlike robot should avoid obstacles, reach the goal promptly, not make impossibly sharp turns, and maintain enough clearance to compensate for sensor noise.
Current state of the art kinodynamic motion planners search the robot’s feasible state space by building a tree data structure of possible robot motions rooted at the robot’s current state. The methods iteratively use a local planner to attempt to grow the tree until the goal is reached. While some treebased methods grow the tree by randomly propagating actions, others guide the tree growth to focus state space expansion thus requiring the local planner to be a steering function, a control policy that guides a robot to a specific goal in obstaclefree space, while satisfying the kinodynamic constraints. For example, consider a carlike robot needing to translate a small distance to the left, a motion resembling parallel parking. This motion plan is difficult, even in the absence of obstacles, and requires a steering function to steer the car to the goal. Computing the steering function requires solving an optimal control problem, and is generally NPHard [27]. To date, only very limited robot dynamics such as linear [26] and differential drive [19] systems have optimal steering functions.
Besides the existence of steering functions, there are two additional difficulties facing efficient treebased kinodynamic motion planning. First, treebased methods that use steering functions require identifying the state in the tree from which to grow. This requires a function that compare the distance between states and return those that are expected to be easily solved by the steering function. An effective distance function for kinodynamic planning is the Time To Reach (TTR) between states using an optimal steering function [19]. TTR, however, is often expensive to compute as it involves numerically integrating the steering function [19]. Second, neither the steering functions nor the related TTR are informed by sensors, and, as a result, do not account for potential obstacles. For example, if a goal is occluded by a wall, the steering function is not able to see the wall due to the lack of sensory input, and TTR would return a value as if an agent could go through the wall.
Recently, deep Reinforcement Learning (RL) emerged as a promising near optimal steering function for kinodynamic systems [12]. In addition, deep RL algorithms can learn policies that map noisy lidar or camera observations directly to robot actions, thus enabling obstacle avoidance while navigating between states for differential drive robots [4, 5]. With the recent development of AutoRL [4]
, which uses evolutionary algorithms to largely eliminate the need to handtune hyperparameters, network structure and reward functions.
This combination offers the promise of deep RL being employed for local planning, i.e., providing both steering function and obstacle avoidance. However, RL policies often lack longterm planning capabilities [17] and get trapped in environments with complex obstacles [6].To address the lack of available steering functions, good distance functions for aiding tree growth, and obstacleawareness facing kinodynamic motion planning, we propose RLRRT, which combines RL and samplingbased planning. It works in three steps. First, we learn an obstacleavoiding pointtopoint (P2P) policy with AutoRL. This is a mapless, goalconditioned policy that maps sensor readings to control. These P2P policies generalize to new environments without retraining [4]. Second, we train a supervised obstacleaware reachability estimator that predicts the time it takes the P2P policy to guide the robot from a start to goal state in presence of obstacles, using local observations such as lidar. The key insight is that the AutoRL policy and the estimator implicitly learn the topology of the obstacles. This allows reasonably accurate estimates of time to reach in new environments. Lastly, presented with a motion planning problem in a new envrionment, in a RRT setting, we use the RL policy as a local planner and the reachability estimator as the distance function. The combination of these two learning solutions offers two primary advantages. First, by using RL policies as an obstacle avoiding local planner, RLRRT can be applied to a variety of kinodynamic systems without optimal steering functions. Second, by using the obstacleaware reachability estimator, RLRRT can prune out randomly sampled states that are unreachable from the tree, e.g., the policy is expected to be unsuccessful, and identify nodes with small TTR to the sampled state. In the example of a car in front of a wall, the RL policy will go around the wall, and the estimator will predict that the time to reach will be longer because of the wall.
We evaluate RLRRT in two environments with three kinodynamic robots. Results indicate that AutoRL policies are effective obstacleavoiding local planners. The obstacleaware reachability estimators, one for each robot, are 7480% accurate in identifying if a goal state is reachable. Compared to a state of the art steering function free method, SST [14], RLRRT is up to 2.3 times more likely to identify a path within a fixed time budget and the identified path is up to 4.5 times shorter. RLRRT typically identifies dynamicallyfeasible paths in very few iterations – 11 in this case – thanks to intelligent node selection and the obstacleavoiding local planner (Figure (a)a). The enclosed video demonstrates RLRRT tree construction and trajectory execution on a physical differential drive robot.
Ii Related Work
Steering functionbased kinodynamic planners, such as kinodynamic RRT* [26] and DFMT [23] rely on a steering function to “pull” the tree to achieve rapid exploration [21] and a proper distance function [26, 19, 27]. RLRRT uses AutoRL [4] to learn steering functions, thus bypassing the challenging twopoint boundary value problem.
Steering function freebased approaches, such as EST [21] and SST [14], propagate random actions from a selected node. These methods can be applied to a variety of robot dynamics, although they tend to “wander” [1], thus they can take a long time to identify a solution.
Recent research has offered several solutions for P2P obstacleavoidance policies on a differential drive robot from raw sensory input, including learning from demonstration [20], curriculum learning [28], and reinforcement learning [25, 4]. Other research offers hierarchical solutions to navigation, where the RL agent executes a path identified by another planner, e.g., from a grid [5], PRMs [6, 8], or manually selected waypoints [11]. However, none of those methods are designed for kinodynamic robots, leading to failures at milestones due to dynamic constraints [8].
Designing an effective distance function for samplingbased kinodynamnic motion planning is challenging [19]. The commonly used Euclidean and weighted Euclidean distance for configuration space planning is inefficient as kinodynamic robot states have limited reachability [13]. The minimum TTR between states is a highly effective distance function [19, 27] but is often too computationallyexpensive to be used as a distance function [19]. While learned TTR of a nearoptimal differential drive steering function [19] can overcome the computational complexity, this approach still requires a nearoptimal steering function. Indirect optimal control has also been used to generate training samples composed of minimum TTR and optimal control actions along trajectories [27]. However, this approach currently only works for low dimensional systems such as inverted pendulum and does not handle limited action bounds. Our approach addresses these challenges by bypassing the necessity of a nearoptimal steering function via RL. Unlike previous methods, we also take into account obstacle avoidance, which can significantly change the minimum TTR.
Iii Methods
RLRRT is a kinodynamic motion planner that learns local planner and distance function w.r.t the individual robot dynamics. It has three main steps. First, we learn an obstacleavoiding point to point policy with AutoRL [4]. Next, since the RL policy avoids obstacles, we can use the policy to generate obstacleaware reachability training samples by repeatedly rolling out the learned policy. An obstacleaware reachability estimator is trained to predict the time to reach between two robot states in the presence of obstacles. Policy and estimator training is done once per robot in training environments. Third, during planning, RLRRT creates dynamicallyfeasible motion plans using the RL policy as the local planner and the reachablity estimator as the distance function. Note, that the training and planning simulators require simulated depth measurements (e.g. lidar) around the robot, which can be thought of as analogous to motion planning with information about clearance.
Iiia AutoRL Local Planner
We train a RL agent to perform a P2P task that avoids obstacles. The output of the training is a policy that is used as a local planner, an execution policy, and a data generation source for the obstacleaware reachability estimator. Using one RL policy for both local planning and path execution is inspired by [8]. This allows the planner to account for potential noise during path execution.
To train a policy robust against noise, we model the RL policy is a solution for a continuous state, continuous action, partially observable Markov decision process (POMDP) given as a tuple
of observations, state, actions, dynamics, reward, scalar discount,, and observation probability.
The observations are the last three measurements of the noisy robot lidar and potentially noisy relative goal position and robot velocity. We define states as the true robot configuration and its derivative. A blackbox robot dynamics simulator, which maps statesaction pairs to states, is an input to the RL training environment. Another blackbox simulator maps the robot state to noisy lidar observations w.r.t. obstacles. The goal is to train the agent to reach a goal state, , within radius, Note that AutoRL identifies a policy that maps noisy sensor and state observations to action. We explore simulated lidar measurement noise in this work and left state estimation and process noise to future work. AutoRL training is required only once for a given robot.AutoRL [4] over DDPG [15], used for learning the RL agent policy, takes as input: observations, actions, dynamics, goal definition, , and a parametrized reward, . The agent is trained to maximize the probability of reaching the goal without collision. This is achieved by using evolutionary algorithms over populations of agents to find a dense reward that maximizes successful goal reaching. Each generation of agents is trained with a new reward, selected based on the previous experience. At the end, the fittest agent that performs P2P tasks best, is selected as the P2P policy. In this work, all three agents use the same observations, goal definitions, and neural network architectures, but differ in the robot dynamics and reward features used.
As an example, we explain the training of the Asteroid robot here (details of the robot are in the Appendix). Details for the Differential Drive and Car robot can be found in [4] and [8].
The observation is a vector of
noisy lidar returns concatenated with the relative planar position of the goal, the robot velocity and orientation ( dimensional vector). The state is the planar position, velocity and orientation of the robot. The action is the amount of forward thrust and turn rate. The parameterized reward includeswhere is 1 when the agent reaches the goal and 0 otherwise, is the negative Euclidean distance to the goal, is 1 when the agent collides with obstacles and 0 otherwise, is the distance to the closest obstacle, is the agent speed when the clearane is below 0.25m, is a constant penalty step with value 1, and is sum of displacement between the current and positions 3, 6 and 9 steps before. is the weight vector tuned by AutoRL.
IiiB ObstacleAware Reachablity Estimator
We further improve upon work in [19] by learning the TTR of an obstacleavoiding P2P RL policy learned in Section IIIA. Our obstacleaware reachability estimator provides the following benefits: 1) It does not need an engineered nearoptimal steering function for each robot dynamics. This allows TTR learning for robot systems without nearoptimal steering functions. 2) Due to the presence of obstacles, the minimum TTR between states is a function of both robot dynamics and obstacles. Since RL policies can also learn to avoid obstacles, the obstacleaware reachability estimator can provide additional benefit over TTR estimators that consider only obstacle dynamics such as [19].
IiiB1 Training data collection
Algorithm 1 summarizes the training data collection. First, for each episode, we initialize the robot with randomly chosen start and goal states (Alg. 1 line 2). Next, we execute the policy until the episode terminates (lines 411) due to reaching the goal, collision, or reaching a time horizon . During execution, we record the robot observation at each time step (line 8) and compute and record the TTR cost (lines 910). The TTR cost is set to at every time step.
To classify whether the robot can reach the goal, we use a simple heuristic that penalizes trajectories that do not reach the goal. If the robot is in collision or the time horizon is reached (elapsedTime equals to
), the TTR cost of that time step is set to , and the episode is terminated immediately by setting isDone to true.After an episode terminates, we compute the cumulative future TTR cost for all states along the trajectory, i.e., remaining costtogo to the end of the trajectory (line 12). The observation and cumulative future cost of each time step form a training sample and is recorded (line 14). The process repeats for episodes. We designed the TTR cost heuristic such that if the robot reaches the goal, the cumulative future cost of each state along the trajectory is the TTR between that state and the goal. Conversely, if the robot failed to reached the goal due to collision or the episode reaches time horizon, all cumulative future cost along the trajectory will be larger than . By employing a common machine learning technique that uses a regressor and a threshold value as a classifier [10], we can quickly classify whether a goal state can be reached during planning.
IiiB2 Reachability Estimator Network
We train the obstacleaware reachability estimator network with the training data collected above. The network input is the robot observation
and the output is the estimated TTR. We use a simple threelayer fullyconnected network with [500, 200, 100] hidden neurons with each a dropout probability of 0.5. We use the L2 loss between estimated TTR and the Vvalue label from the training data.
IiiC RlRrt
Alg. 2 describes RLRRT. While the standard RRT algorithm was utilized, modifications were made to efficiently utilize the obstacleaware reachability estimator and the obstacleavoiding RL local planner.
Within RLRRT, the obstacleaware reachability estimator can provide insight into the best samples to enhance tree growth. However, as we began to use the estimator, it became clear that the obstacleaware reachability estimator can take longer than the standard Euclidean distance metric to compute (about 0.5 ms vs. 7 s for Euclidean). Therefore, to enhance computation time in large trees, the estimator was integrated into a hierarchical nearest neighbor selector. Similar to [2], the method first identifies candidate nodes closest to using Euclidean distance (Alg. 2, line 8), and subsequently these choices are filtered by the obstacleaware TTR between each candidate node and . To alleviate noise in the TTR estimator, we take the average of the TTR between the selected node and =10 target states around , i.e., within a hypercube of =0.3 units (line 10). The node with the lowest average TTR is selected for RRT extension (line 9). In addition, the obstacleaware reachability estimator can also be used to check whether the randomly sampled state is reachable from the nearest node . Recall that the TTR reward in Section IIIB is setup such that any unreachable from .state has an associated Vvalue larger than . As the result, the estimated TTR can be used to prune out that are unreachable from the tree within . However, since the estimated TTR is not exact, we made the pruning probabilistic, i.e., if is deemed unreachable, it will be pruned with probability (line 10). If is pruned, it is rejected and a new is sampled (line 6).
After the nearest node is selected, RLRRT uses the RL policy as the local planner (lines 1524). Specifically, an observation which includes simulated lidar, robot state, and goal information is made at every policy time step (line 17). This observation is fed to the RL policy, which produces an action that can be used to forward propagate the dynamics to a new state (line 18). This process repeats and a new node storing is created, and added to the tree every seconds (line 21), until is in collision, a maximum extension time is reached (line 20), or is reached (line 20).
RLRRT terminates when either the tree reaches the goal or after a fixed amount of computation time is exhausted (line 3). If the tree reaches the goal, a dynamicallyfeasible motion plan can be returned (line 25).
Iv Evaluation
To demonstrate RLRRT, we evaluate our method on three kinodynamic robots in two environments unseen during training, and we experimentally verify the method on a physical differential drive Fetch robot from Fetch Robotics.
Iva Setup
The three robots we evaluate are: Car, Asteroid, and Fetch. Car is a kinematic car with inertia [18] with a maximum steering angle and a 1.0 maximum acceleration and speed of 1.0 . Asteroid has similar dynamics to those found in the popular video game Asteroid, and we chose it since it is highly kinodynamic, unintuitive for a human to control, and has no known optimal steering function. The details are available in the supplemental materials. The Fetch robot has a radius of , 1.0 m/s maximum speed and 2.0 rad/s turn rate.
The sensor noise is simulated by a zero mean Gaussian with a standard deviation of 0.1 m.
We use the Fetch robot as a differential drive platform for onrobot experiments.All pointtopoint policies are trained in the environment depicted in Figure (a)a. We evaluate RL policies and plan in two office building environments, Map 1 (Figure (a)a) and Map 2 (Figure (c)c), which are roughly 15 and 81 times larger than the training environment, respectively. Map 1 is is generated from a floor plan, while Map 2 is generated using a noisy SLAM of the Fetch physical testbed where we ran the experiments. These environments include parts that are cluttered, as seen in Map 1, and very narrow corridors, such seen in Map 2.
We compare RLRRT to SST [14], a state of the art steering function free kinodynamic motion planner. For Fetch robot, we also compare to RRT with Dynamic Window Approach (DWA) [7] as local planner (denoted RRTDW). Additionally, we test disabling the clearance term of DWA, essentially turning it into a MPCbased steering function (denoted RRTS). All experiment are repeated 50 times. Besides AutoRL training, all computation was done
on an Intel Xeon E51650 @ 3.6GHz using TensorFlow 1.x (Google release) and Python 2.7. AutoRL policies were implemented with Google Vizier
[9] and TFAgents [24].IvB AutoRL Policy Performance
We use pretrained P2P policies for Fetch [4] and Car [8] robots. Their short description is available in the Appendix. The Asteroid P2P policy is original to this paper. All agents are trained with AutoRL over DDPG [4]. The goals are randomly placed within . We train 100 agents in parallel over 10 generations as in [4]. The training took roughly 7 days.
Figure 2 shows the success rate of the P2P agents compared to goal distance. Notice that when the goal distance is or farther than the trained policy, the performance degrades. We also notice that the Car policy is best performing, while the Asteroid policy is the most challenging. These results show that AutoRL produces, without handtuning, effective local planners, i.e., both a steering function and an obstacle avoidance policy for a variety of robot dynamics.
IvC Reachability Estimator Performance
The obstacleaware reachability estimator is trained in the training environment with goals sampled within from the initial states, twice the distance used for P2P training. The estimator network was trained on 1000 episodes with about 100,000 samples. Data generation takes about 10 minutes. The reachability thresholds are 20 seconds for differential drive and Asteroid, and 40 seconds for Car.
Each estimator was trained over 500 epochs and took about 30 minutes.
Robot  Confusion Matrix  Prec.  Recall  Accur.  
True (%)  (%)  (%)  (%)  
Fetch  Predicted  42.7  21.6  66.4  92.2  74.8 
(%)  3.6  32.1  
Car  Predicted  44.5  14.2  75.8  90.2  81.0 
(%)  4.8  36.5  
Asteroid  Predicted  26.5  16.3  61.9  73.4  74.1 
(%)  9.6  47.6 
Accuracy of the models is between 70% and 80% (Table I). Notice that a high recall means that the estimator misses fewer nodes, and suggests that the paths RLRRT produces should be nearoptimal. On the other hand, relatively low precision implies that RLRRT will explore samples that end up not being useful. This means that we can speedup RLRRT further by learning a more precise predictor.
The reachability estimator overestimates the TTR of reachable states across all robots (Fig. 3). However, overestimation disappears when trained and evaluated only on reachable states (see Fig. 1 in Appendix for more detail). This suggests that the overestimation of TTR is likely due to the TTR cost heuristic uses a penalty for states unreachable within . We leave identifying better TTR cost heuristics and estimator network architectures for future work.
In general, the estimator captures the regions of start states that cannot reach the goal (blue dot) (Fig. 4). This is most visible at the bottom right region of the environment, which has a TTR larger than the 40s horizon which indicates that the policy failed to escape that region. We also see that the estimated TTR captures the dynamics of Car robot, i.e., since the goal orientation is facing right, it takes less time to reach the goal from the left, top or bottom than from the right. Note that the network is never trained on trajectories that start inside of obstacles and thus cannot accurately predict TTR starting from those states, an event which should not occur in samplingbased planning.
IvD Planning Results
RLRRT finds a solution faster than SST for all three robots in both environments (Fig. (a)a, (b)b, (c)c). Note that Car shows the best improvement over the baseline (up to 2.3 times faster), which matches the high success rate of the P2P Car policy. Conversely, the least improvement is for Asteroid, which is the most challenging for the RL agent. Figure (a)a also shows that RLRRT finds a solution faster than steering functionbased methods, where DWA was used as the steering function (yellow, RRTS) and obstacleavoiding steering function (red, RRTDW). These results are expected as RLRRT learns a obstacleavoiding local planner that can often go through very narrow corridors and move around corners (Figure (a)a). In comparison, DWA often gets stuck around corners. To separate the impact of the RL local planner as compared to the reachability estimator, we tested RLRRT without the estimator and use Euclidean distance to identify the nearest state in the tree instead. Figures (a)a, (b)b and (c)c show that RLRRT without the reachability estimator (magenta curves) performs worse than RLRRT for all robots. This is expected as the reachability estimator prunes potentially infeasible treegrowth, thereby biasing growth towards reachable regions. Also, the reachabilty estimator encodes the TTR and is thus more informative than the Euclidean distance for kinodynamic robots such as Asteroid.
The finish time of trajectories identified by RLRRT are significantly shorter (up to 6 times shorter) than SST for all robots (Fig. (d)d, (e)e, (f)f) and comparable to RRTDWA and RRTS on differential drive. This is expected as SST does not use steering functions. Instead, it randomly propagates actions, resulting in a “jittery” behavior (visible in Figure (a)a) and long finish time. The comparable finish time with steering functionbased methods show that RLRRT learns a nearoptimal steering function.
IvE Physical Robot Experiments
In order to verify that the RLRRT produces motion plans that can be used on real robots, we executed the motion plans on the Fetch robot (Figure. (b)b) in Map 2 environment. We ran 10 different motion plans, repeated 3 times. Figure (c)c presents one such trajectory. The straight line distance between the start and goal is 20.8 m. In green are tree nodes for a path, and the blue line is the executed robot path with the P2P AutoRL policy. We notice two things. First, the path is similar to the one humans would take. The shortest path leads through cubicle space, which is cluttered. Because the P2P policy does not consistently navigate the cubicle space, the TTR estimates are high in that region and the tree progress slowly in that area. At the same time, in the uncluttered space near the start position (left and right) the tree grows quickly. The executed trajectory (in blue) stays close to the planned path. Enclosed video contains the footage of the robot traversing the path.
V Discussion
Deep actorcritic RL methods approximate the cumulative future reward, i.e., statevalue function with the critic net. Intuitively, the statevalue function captures the progress towards the goal and may be used as a distance function during planning. Here we show that this is not the case when proxy rewards are used. AutoRL uses proxy rewards (shown in Section IIIA) since they significantly improve learning performance, especially for tasks with sparse learning sigals such as navigation [4]. Fig (a)a shows examples of two Asteroid trajectories and Fig. (b)b shows the corresponding the estimated TTR (solid lines) and negative of DDPG statevalue function extracted form the critic net (dashed lines). The obstacleaware reachability estimator correctly predicted the TTR while the DDPG’s critic net has a significant local maximum, thus unsuitable as a distance function. This finding motivated the supervised reachability estimator.
One limitation of RLRRT is that the obstacleaware reachability estimator approximates reachability using only local information such as simulated lidar measurements around the robot. However, the true reachability is often impacted significantly by largescale obstacle structures. Figure 7 demonstrates this limitation. The ground truth shows that the Car policy generally fails to reach the goal outside of the center box due to the complex mazelike obstacles (Figure (b)b). The reachability estimator failed to predict this as some regions outside of the center box are incorrectly predicted as reachable (Figure (a)a). On the other hand, we also demonstrated that the estimator performs well when the training and planning environments are similar (Figure 4). This suggests that the reachability estimator should to be trained in environments similar to the planning environment or perform online adaptation/learning during planning. We leave the latter to future work.
Vi Conclusions
This paper contributes RLRRT, a kinodynamic planner which works in three steps: 1) learning obstacleavoiding local planner; 2) training an obstacleaware reachability estimator for the learned local planner; and 3) using the estimator as the distance function and to bias sampling in RRT. Unlike traditional kinodynamic motion planners, RLRRT learns a suitable steering and distance function. The robot is trained once, and the policy and estimator transfer to the new envrionments. We evaluated the method on three kinodynmic robots in two simulated environments. Compared to the baselines, RRT plans faster and produces shorter paths. We also verified RLRRT on a physical differential drive robot. For future work, following PRMRL, we plan to improve the noise robustness of RLRRT by Monte Carlo rollouts during tree extensions. We also plan to identify better TTR cost heuristics, network architectures and online adaptation of the reachability estimator.
Acknowledgment
We thank TsangWei Edward Lee for assisting with robot experiments, and Brian Ichter for the helpful feedback. Tapia and Chiang are partially supported by the National Science Foundation under Grant Numbers IIS1528047 and IIS1553266 (Tapia, CAREER). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
 [1] R. Allen and M. Pavone. A realtime framework for kinodynamic planning with application to quadrotor obstacle avoidance. In AIAA Guidance, Navigation, and Control Conference, page 1374, 2016.

[2]
H.T. L. Chiang, A. Faust, S. Satomi, and L. Tapia.
Fast swept volume estimation with deep learning.
In Proc. Int. Workshop on Algorithmic Foundations of Robotics (WAFR), page To appear, 2018.  [3] H.T. L. Chiang and L. Tapia. Colregrrt: An rrtbased colregscompliant motion planner for surface vehicle navigation. Robotics and Automat. Lett., 3(3):2024–2031, 2018.
 [4] L. Chiang, A. Faust, M. Fiser, and A. Francis. Learning navigation behaviors endtoend with autorl. IEEE Robotics and Automation Letters (RAL), 2019.
 [5] T. Fan, X. Cheng, J. Pan, P. Long, W. Liu, R. Yang, and D. Manocha. Getting robots unfrozen and unlost in dense pedestrian crowds. Robotics and Automat. Lett., 2019.
 [6] A. Faust, O. Ramirez, M. Fiser, K. Oslund, A. Francis, J. Davidson, and L. Tapia. PRMRL: Longrange robotic navigation tasks by combining reinforcement learning and samplingbased planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 5113–5120, Brisbane, Australia, 2018.
 [7] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robot. & Automation Mag., 4(1):23–33, 1997.
 [8] A. Francis, A. Faust, H.T. L. Chiang, J. Hsu, J. C. Kew, M. Fiser, and T.W. E. Lee. Longrange indoor navigation with prmrl. arXiv preprint arXiv:1902.09458, 2019.
 [9] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for blackbox optimization. In Proc. of ACM Intl. Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.
 [10] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
 [11] Y. Kato, K. Kamiyama, and K. Morioka. Autonomous robot navigation system with learning based on deep qnetwork and topological maps. In 2017 IEEE/SICE International Symposium on System Integration (SII), pages 1040–1046, Dec 2017.
 [12] A. Layek, N. A. Vien, T. Chung, et al. Deep reinforcement learning algorithms for steering an underactuated ship. In IEEE Int. Conf. on Multisensor Fusion and Integration for Intell. Sys. (MFI), pages 602–607. IEEE, 2017.
 [13] Y. Li and K. E. Bekris. Learning approximate costtogo metrics to improve samplingbased motion planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 4196–4201. IEEE, 2011.
 [14] Y. Li, Z. Littlefield, and K. E. Bekris. Asymptotically optimal samplingbased kinodynamic planning. Int. J. Robot. Res., 35(5):528–564, 2016.
 [15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [16] S. Liu, M. Watterson, K. Mohta, K. Sun, S. Bhattacharya, C. J. Taylor, and V. Kumar. Planning dynamically feasible trajectories for quadrotors using safe flight corridors in 3d complex environments. Robotics and Automat. Lett., 2(3):1688–1695, 2017.
 [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [18] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli. A survey of motion planning and control techniques for selfdriving urban vehicles. IEEE Trans. on intel. vehicles, 1(1):33–55, 2016.
 [19] L. Palmieri and K. O. Arras. Distance metric learning for rrtbased motion planning with constanttime inference. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 637–643. IEEE, 2015.
 [20] M. Pfeiffer, M. Schaeuble, J. I. Nieto, R. Siegwart, and C. Cadena. From perception to decision: A datadriven approach to endtoend motion planning for autonomous ground robots. Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 1527–1533, 2017.
 [21] J. M. Phillips, N. Bedrossian, and L. E. Kavraki. Guided expansive spaces trees: A search strategy for motionand costconstrained state spaces. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), volume 4, pages 3968–3973. IEEE, 2004.

[22]
A. Richards, T. Schouwenaars, J. P. How, and E. Feron.
Spacecraft trajectory planning with avoidance constraints using mixedinteger linear programming.
J. of Guidance, Control, and Dynamics, 25(4):755–764, 2002.  [23] E. Schmerling, L. Janson, and M. Pavone. Optimal samplingbased motion planning under differential constraints: the drift case with linear affine dynamics. In IEEE Conf. on Decision and Control (CDC), pages 2574–2581. IEEE, 2015.
 [24] Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro, Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Chris Harris, Vincent Vanhoucke, Eugene Brevdo. TFAgents: A library for reinforcement learning in tensorflow. https://github.com/tensorflow/agents, 2018.
 [25] L. Tai, G. Paolo, and M. Liu. Virtualtoreal deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 31–36, Sept 2017.
 [26] D. J. Webb and J. Van Den Berg. Kinodynamic rrt*: Asymptotically optimal motion planning for robots with linear dynamics. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 5054–5061. IEEE, 2013.
 [27] W. J. Wolfslag, M. Bharatheesha, T. M. Moerland, and M. Wisse. RRTcolearn: towards kinodynamic planning without numerical trajectory optimization. Robotics and Automat. Lett., 3(3):1655–1662, 2018.
 [28] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard. Deep reinforcement learning with successor features for navigation across similar environments. In Proc. IEEE Int. Conf. Intel. Rob. Syst. (IROS), pages 2371–2378. IEEE, 2017.
Supplemental Material:
Rewards for the P2P
Via P2P for differential drive robots
The P2P agent was developed in [4]. The reward is:
(1) 
where is 1 when the agent reaches the goal and 0 otherwise, is the negative Euclidean distance to the goal, is 1 when the agent collides with obstacles and 0 otherwise, is the distance to the closest obstacle, is a constant penalty step with value 1, and is the negative angular speed.
ViB P2P for kinodynamic car robots
Supplemental Material:
Asteroid
Asteroid has a similar dynamics to those found in the popular video game Asteroid.
(3)  
(4)  
(5) 
is the thruster acceleration action ranged from [0.5, 1.0] while is the turn rate action ranged from [0.5, 0.5] rad/s. is the first order drag coefficient, resulting in a maximum speed of 1.0 . .
Supplemental Material:
Time To Reach Estimators
The obstacleaware reachability estimator combines rechable state classification and TTR estimation in order to bias treegrowth towards reachable regions and identifying nearest neighbors. Here we explore estimating only the TTR by training a TTR estimator that is trained only by trajectories that reached the goal. Fig. 8 shows the predicted TTR and the ground truth for various robots. Unlike the reachability estimator (Fig. 4 in the main paper), the TTR estimator does not overestimate TTR. This suggests that the overestimation of the reachability estimator is caused by the TTR cost heuristic penalizing unreachable states.
Comments
There are no comments yet.