I Introduction
Long-range indoor robot navigation requires human-scale robots (Fig. 1(a)) to move safely over building-scale distances (Fig. 1(b)). To robustly navigate over long distances while adapting to new environments, we factor the problem into long-range path planning and end-to-end local control, while assuming mapping and localization are provided. Long-range path planning finds collision-free paths to distant goals not reachable by local control [38]. End-to-end local control produces feasible controls to follow ideal paths while avoiding obstacles (e.g., [35, 22]) and compensating for noisy sensors and localization [12]. We enable end-to-end local control to inform long-range path planning through sampling-based planning.
Sampling-based planners, such as Probabilistic Roadmaps (PRMs) [34] and Rapidly Exploring Random Trees (RRTs) [37, 39], plan efficiently by approximating the topology of the configuration space with a graph or tree constructed by sampling points in free space and connecting these points if there is a collision-free local path between them. Typically, these local paths are created by line-of-sight tests or an inexpensive local planner, and are then connected in sequence to form the full collision-free path.
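As a concrete illustration, a minimal PRM construction over a 2D occupancy grid with a straight-line local planner might look like this (a sketch under simplifying assumptions, not the paper's implementation):

```python
import math
import random

def collision_free(p, q, occupied, step=0.1):
    """Straight-line local planner: densely sample the segment p-q and
    reject it if any sample falls in an occupied grid cell."""
    dist = math.dist(p, q)
    n = max(1, int(dist / step))
    for i in range(n + 1):
        t = i / n
        x = p[0] + t * (q[0] - p[0])
        y = p[1] + t * (q[1] - p[1])
        if (int(x), int(y)) in occupied:
            return False
    return True

def build_prm(occupied, width, height, n_samples=200, radius=3.0):
    """Classic PRM: sample free configurations, then connect nearby pairs
    whose straight-line local path is collision-free."""
    nodes = []
    while len(nodes) < n_samples:
        p = (random.uniform(0, width), random.uniform(0, height))
        if (int(p[0]), int(p[1])) not in occupied:  # keep free samples only
            nodes.append(p)
    edges = []
    for i, p in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            q = nodes[j]
            if math.dist(p, q) <= radius and collision_free(p, q, occupied):
                edges.append((i, j))
    return nodes, edges
```

PRM-RL keeps this overall skeleton but replaces the straight-line `collision_free` test with repeated rollouts of the RL local planner.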
Regardless of how a path is generated, executing it requires handling sensor noise, unmodeled dynamics, and environment changes. Recently, reinforcement learning (RL) agents [36] have solved complex robot control problems [61], generated trajectories under task constraints [20], demonstrated robustness to noise [19], and learned complex skills [51, 49], making them good choices for dealing with task constraints. Many simple navigation tasks require only low-dimensional sensors and controls, like lidar and differential drive, and can be solved with easily trainable networks [63, 25, 7]. However, as we increase the complexity of the problem by requiring longer episodes or providing only sparse rewards [18], RL agents become more difficult to train, and RL does not always transfer well to new environments [30, 29].
Long-range navigation presents all of these challenges. Rewards are sparse and episodes are long, making long-range agents hard to train. On complex maps, short-range agents are vulnerable to local minima like wide barriers and narrow passages. Even within deployment categories, environments have vast variety: the SunCG dataset has 45,000 distinct house-like environments [58], and the US alone has over 5.6 million office buildings [9].
We present PRM-RL, an approach to long-range navigation which combines PRMs and RL to overcome each other's shortfalls. In PRM-RL, an RL agent learns a local point-to-point task, incorporating system dynamics and sensor noise independent of long-range environment structure. The agent's learned behavior then influences roadmap construction; PRM-RL builds a roadmap by connecting two configuration points only if the agent consistently navigates the point-to-point path between them collision-free, thereby learning the long-range environment structure. PRM-RL roadmaps perform better than roadmaps based on pure connectivity because they respect robot dynamics. RL agents perform better with roadmap guidance, avoiding local minima even in environments they were not trained for. PRM-RL thus combines PRM efficiency with RL robustness, creating a long-range navigation planner that not only avoids local-minima traps, but is resilient to environmental changes and transfers well to new environments, as shown in Figure 2, where a policy trained on a small training environment scales to a four-building complex almost two hundred times larger in map area.
This paper contributes PRM-RL as a hierarchical kinodynamic planner for navigation in large environments for robots with noisy sensors. This paper is an extension of [21], which contributes the original PRM-RL method. Here, we investigate PRM-RL in the navigation context and make the following contributions beyond the original paper: 1) Algorithm 2 for PRM-RL roadmap building; 2) Algorithm 3 for robot deployment; 3) PRM-RL application to kinodynamic planning on a car model with inertia; 4) in-depth analysis of PRM-RL, including: 4.1) the correlation between the quality of the local planner and the overall hierarchical planner; 4.2) the impact of improving planners and changing parameters on PRM-RL computational time complexity; 4.3) the impact of a robust local planner on the effective topology of the configuration space. All the evaluations and experiments are new and original to this paper. We evaluate PRM-RL with a more effective local planner [12], compare it in simulation against several baselines in seven different buildings, and deploy it to two physical robot testbeds.
Overall, we show improved performance, better roadmap generation, and easier on-robot transfer, including relative success rate increases over [21] and over [11], while maintaining good performance despite increasing noise. We also show that only adding edges when agents can consistently navigate them makes roadmaps cheaper to build and improves navigation success; denser roadmaps have higher simulated success rates, but at substantial roadmap construction cost. Floor plans are not always available or up to date, but we show that roadmaps built from SLAM maps close the simulation-to-reality gap by producing planners that perform in simulation virtually as well as they do on the robot. SLAM-based PRM-RL enables real robot deployments with hundred-meter collision-free trajectories at two different sites on two different robots with high success rates. We also show that PRM-RL functions well on robots with dynamic constraints, with a high success rate in simulation.
While this paper focuses on navigation, we believe the analysis and empirical findings will be of interest to the wider motion planning research community for two reasons. First, PRM-RL as presented here is an example of a hierarchical motion planner that can incorporate models of sensor, localization, and control uncertainties into the roadmap construction phase, resulting in a planner that performs as well in simulation as it does on robots. Second, we present a comprehensive analysis of the trade-offs in performance and computational complexity, and of the interplay between local and global planners, that is not specific to navigation.
II Related Work
II-A Probabilistic roadmaps
PRMs [34] have been used in a wide variety of planning problems, from robotics [26, 45] to molecular folding [3, 53, 60]. They have also been integrated with reinforcement learning for state space dimensionality reduction [43, 53] by using PRM nodes as the state space for the reinforcement learning agent. In contrast, our work applies reinforcement learning to the full state space as a local planner for PRMs. In prior work for an aerial cargo delivery task, we trained RL agents to track paths generated from PRMs constructed using a straight-line local planner [20]. PRMs have also been modified to work with moving obstacles [27, 55], noisy sensors [44], and localization errors [2, 4]. Safety PRM [44] uses probabilistic collision checking with a straight-line planner, associating with all nodes and edges a measure of potential collision. All those methods address one source of error at a time. In contrast, PRM-RL uses an RL-based local planner capable of avoiding obstacles and handling noisy sensors and dynamics, and at the node connection phase the RL local planner performs Monte Carlo path rollouts with deterministic collision checking. We only add edges if the path can be consistently navigated within a tunable success threshold.
PRMs are easy to parallelize, through parallel edge connections [5], parallel sampling [8], or building subregional roadmaps in parallel [15]. To speed up the building of our large-scale roadmaps, we use an approach similar to [5] across different computers in a cluster. Individual Monte Carlo rollouts of an edge connection can be parallelized across multiple processors or run sequentially to allow for early termination.
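The edge-level parallelism described above can be sketched as follows; `connects` is a hypothetical stand-in for the per-edge Monte Carlo check, which in PRM-RL would itself run simulated rollouts of the RL local planner:

```python
from concurrent.futures import ThreadPoolExecutor

def connects(edge):
    """Hypothetical stand-in for the Monte Carlo edge check: returns
    True if the local planner reliably navigates the edge. Here a toy
    distance criterion replaces the simulation rollouts."""
    (x0, y0), (x1, y1) = edge
    return abs(x1 - x0) + abs(y1 - y0) < 5.0

def filter_edges_parallel(candidate_edges, check, workers=4):
    """Candidate edges are independent, so their checks can run
    concurrently, in the spirit of parallel PRM edge connection."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        keep = list(pool.map(check, candidate_edges))
    return [e for e, ok in zip(candidate_edges, keep) if ok]
```

In a cluster setting, the same partition-and-check structure applies, with edge subsets distributed across machines rather than threads.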
II-B Reinforcement learning in motion planning
Reinforcement learning has recently gained popularity in solving motion planning problems for systems with unknown dynamics [36], and has enabled robots to learn tasks that were previously difficult or impossible [1, 10, 40]. For example, Deep Deterministic Policy Gradient (DDPG) [41] is a current state-of-the-art algorithm that works with very high-dimensional state and action spaces and can learn to control robots from unprocessed sensor observations [40].
Deep RL has been successfully applied to robot navigation, including visual navigation with simplified navigation controllers [7, 14, 25, 50, 56, 64], more realistic controllers in game-like environments [6, 13, 48], and extracting navigation features from realistic environments [10, 23]. In a local planner setting similar to ours (a differential drive robot with 1D lidar sensing), several approaches have emerged recently using asynchronous DDPG [59], expert demonstrations [54], DDPG [42], curriculum learning [62], and AutoRL [12]. While any of these obstacle-avoidance agents can be used as a local planner in PRM-RL, we choose AutoRL for its simplicity of training, because AutoRL automates the search for the reward and the neural network architecture.
II-C Hierarchical planners with reinforcement learning
Several recent works feature hierarchical planners combined with reinforcement learning, over a grid [17] or manually selected waypoints [32]. In these works, the roadgraph points are connected with respect to a straight-line planner, and the reinforcement learning agent is used as an execution policy at runtime. We use the obstacle-avoidance RL policy both as the execution policy and as the local planner that serves as the basis for connecting edges in the roadgraph. This approach results in roadmap connectivity that is tuned to the abilities of the particular robot.
III Problem Statement
PRM-RL addresses the general problem of generating a valid executable path for a robotic system given task constraints. Given the collision-free portion of the configuration space and, optionally, a task predicate that must be satisfied, a valid point is a point in the free space that satisfies the task predicate. Define a path as a sequence of waypoints; we assume time is discrete for the purposes of exposition, but this is not strictly required. A valid path consists only of valid waypoints.
A robot's observation space is generally not identical to its configuration space, but is produced by a sensor process that can be modeled as a combination of inherent sensor dynamics and a source of noise. Similarly, actions in the robot's action space have a state-dependent effect which can also be modeled as a combination of inherent robot dynamics and a source of noise.
Given a valid start configuration and a policy, an executed trajectory is the sequence of configuration states that results from drawing actions from the policy and its noise processes. A trajectory is considered a failure if it produces an invalid point, by exiting the free space or failing to satisfy the task predicate, or if it exceeds a task-specific time limit. Given a valid observable goal configuration, a non-failed trajectory is considered a success if it reaches a point sufficiently close to the goal with respect to a task-dependent threshold, at which point the task is considered complete.
With respect to a given path-interpreting policy, a valid executable path is one which the policy can reliably execute to achieve task completion: guiding the agent from the start state of the path to within the goal threshold of the goal state within the time limit, along an executed trajectory which remains in free space and satisfies the task predicate. Because noise makes execution stochastic, a path is defined to be reliable if the policy's probability of task completion using it exceeds a task-dependent success threshold. In other words, PRM-RL is a tool for generating paths which an RL agent can reliably execute without violating the constraints of its task. In [21] we demonstrated PRM-RL's success on tasks with constraints; in this work, we focus solely on the navigation task, which collapses the task predicate to remaining within the free space, and collapses the full configuration space available to the robot to a task space limited to the robot's position and orientation.
IV Methods
The PRM-RL method has three stages: training an environment-independent local planner policy with RL; creating a roadmap specific to that local planner and an environment; and deploying that roadmap and policy to the environment for querying, trajectory generation, and execution. Fig. 4 illustrates the method.
First, in the training stage (Fig. 4(a)), to enable a robot to perform a specific task, we train a local planner policy with RL. For indoor navigation, that task is short-range point-to-point navigation, end-to-end from sensors to actions. This task is independent of the environment in which the robot will eventually operate. The RL agent learns to perform the task in an environment comparable to the deployment environment, but with a small state space to make training tractable. This is a Monte Carlo process: we train multiple policies and select the fittest one for the next stage of PRM-RL, regardless of the learning algorithm used.
Next, in the creation phase (Fig. 4(b), upper), to prepare a robot to work in a specific environment, we use this best policy as a local planner to build a PRM with respect to the target site. Obstacle maps, such as floor plans or SLAM maps, can be used for any robot we wish to deploy in a building, as long as the policy transfers well to the real world. This is a one-time setup per robot and environment. Points are sampled from free space as in a standard PRM, but PRM-RL adds an edge to the roadmap only when the agent can navigate it in simulation with probability above a success threshold over repeated trials. Rather than being determined by the geometry of free space, the resulting roadmap is tuned to the capabilities of the particular robot, so different robots over the same floor plan may generate different roadmaps with different connectivity.
Finally, in the deployment phase (Fig. 4(b), lower), to perform the task in the environment, the constructed roadmap is queried to generate trajectories, which are executed by the same RL agent used to generate the roadmap. Querying a roadmap can work as in a standard PRM; alternatively, during construction we can record additional data about executed trajectories in the roadmap to enable other trajectory metrics, such as minimal energy consumption or shortest time, which are not available with geometry-only approaches. At execution time, the RL agent navigates to the first waypoint. Once the agent is close enough to the waypoint, the next waypoint becomes its new goal; the process repeats until the agent has traversed the whole trajectory.
IV-A RL Agent Training
To maintain independence from environments, we train the RL agent to perform its task without knowledge of the workspace topology: the transfer function that leads the system to task completion is conditioned only on what the robot can observe and what it has been commanded to do. Formally, we learn policies to control an agent that we model as a partially observable Markov decision process (POMDP), represented as a tuple of observations, actions, dynamics, noise, reward, and discount. Observations, actions, dynamics, and noise are determined by the characteristics of the robot and are continuous and multidimensional. Reward and discount are determined by the requirements of the task: the discount is a scalar factor, whereas the reward has a more complicated structure, including a true scalar objective representing the task and a weighted dense scalar reward based on observable features. We assume the presence of a simplified black-box simulator without knowing the full nonlinear system dynamics.

PRMs require a local planner to connect points; in its basic form, PRM-RL's local planner must navigate a robot from a start position to a goal position without collision; we call this the point-to-point (P2P) task. The task space for navigation is the set of robot poses, that is, a robot at a position and orientation. The navigable portion of the task space is typically many times the robot's radius: for the spaces we consider, meter-scale robots in hundred-meter buildings.
IV-A1 P2P for differential drive robots
Our primary focus in this paper is on approximately cylindrical differential drive robots with lidar, represented by a unicycle or Type (2,0) model [57], whose configuration space is given by planar position and orientation. The observations are pairs of 1D lidar vectors and goal observations collected over a short history of steps. The actions are a 2-dimensional continuous action vector that encodes the robot's linear and angular velocity. The dynamics and noise are implicit in the real world, but are encoded separately in simulation in the system dynamics and an added noise model.

The true objective of the P2P agent is to maximize the probability of reaching the goal without collisions,
    O = E[ I(||p − p_g|| < d_g) ],    (1)
where I(·) is the indicator function, p the robot position, p_g the goal pose, and d_g the goal size; the zero-collision property is enforced by terminating episodes on collisions. The goal observation is the relative goal position in polar coordinates, which is readily available from localization. The reward for P2P for differential drive robots is:
    R = θ_g r_goal + θ_gd r_goalDist + θ_c r_collision + θ_cl r_clearance + θ_s r_step + θ_t r_turning,    (2)
where r_goal is 1 when the agent reaches the goal and 0 otherwise, r_goalDist is the negative Euclidean distance to the goal, r_collision is 1 when the agent collides with obstacles and 0 otherwise, r_clearance is the distance to the closest obstacle, r_step is a constant per-step penalty, r_turning is the negative angular speed, and the θ's are learned weights. We train this model with AutoRL [12] over the DDPG [41] algorithm. AutoRL automates hyperparameter search in reinforcement learning using an evolutionary approach. It takes as inputs a true objective used to evaluate the agent, a parameterized dense reward that the agent uses to train itself, and, optionally, neural network architecture and algorithm hyperparameters. To train an agent, AutoRL typically optimizes these hyperparameters in phases: first, given an arbitrary or hand-tuned architecture, it trains several populations of RL agents with different reward parameters and optimizes over the true objective; optionally, a second phase repeats the process with the dense reward fixed while searching over neural network architectures instead.
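The weighted-sum structure of Eq. (2) can be sketched as follows; the term names, goal radius, and sign conventions are illustrative choices rather than the paper's exact parameterization:

```python
import math

def p2p_reward(pos, goal, collided, clearance, angular_speed, theta):
    """Dense P2P reward as a weighted sum of per-step terms. The weights
    `theta` play the role of the AutoRL-tuned parameters; here penalty
    signs are folded into the terms for readability."""
    goal_radius = 0.3  # assumed goal size, in meters
    terms = {
        "goal":      1.0 if math.dist(pos, goal) < goal_radius else 0.0,
        "goal_dist": -math.dist(pos, goal),      # negative distance to goal
        "collision": -1.0 if collided else 0.0,  # sign folded into the term
        "clearance": clearance,                  # distance to closest obstacle
        "step":      -1.0,                       # constant per-step penalty
        "turning":   -abs(angular_speed),        # negative angular speed
    }
    return sum(theta[name] * value for name, value in terms.items())
```

The car-model reward of Eq. (3) has the same shape, with the distance and turning terms swapped for a distance-delta term and a backward-speed penalty.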
IV-A2 P2P for car-like robots
To confirm that PRM-RL generalizes to other classes of agents, we also model car-like robots, represented by a kinematic single-track model [52] whose configuration space is restricted by the limited turning range of the front wheels. The true objective of P2P does not change for the car model, but because the turn radius of the car is limited and the car must perform more complex maneuvers, we choose a slightly different reward model:
    R = θ_g r_goal + θ_gdd r_goalDistDelta + θ_c r_collision + θ_s r_step + θ_b r_backward,    (3)
where all terms are the same as for differential drive, except that r_goalDistDelta rewards the delta change of the Euclidean distance to the goal, r_backward is the negative of the backward speed (and 0 for forward motion), and r_clearance and r_turning are dropped. We train this model with hyperparameter tuning via Vizier [24] over the DDPG [41] algorithm, in a different training regime in which the car model is allowed to collide up to 10 times during training but is still evaluated on the true objective of zero collisions.
IV-B PRM Construction
The basic PRM method works by sampling robot configurations in the robot's configuration space, retaining only collision-free samples as nodes in the roadmap. PRMs then attempt to connect each sample to its nearest neighbors using a local planner. If an obstacle-free path between two nodes exists, an edge is added to the roadmap.
We modify the basic PRM by changing the way nodes are connected. Formally, we represent PRMs as graphs modeled by a tuple of nodes and edges. Nodes are always in free space, and edges always connect valid nodes, but we do not require that the line of sight between connected nodes lies entirely in free space, allowing edges that "go around" corners. Since we are primarily interested in robustness to noise and adherence to the task, we only connect two configurations if the RL agent can consistently perform the point-to-point task between them.
The robot's state space is a superset of the configuration space because it represents not just the configuration of the robot but also its possible observations and other task-dependent state. Therefore, we sample multiple variations around the start and goal configuration points, and add an edge only if the success rate exceeds a threshold. Note this means PRM-RL trajectories are not guaranteed to be collision-free paths; this is not a limitation of the method, but rather the unavoidable result of sensor noise. A standard PRM would not produce a collision-free trajectory for noisy or changing environments either. Nevertheless, when discussing the results, we estimate a lower bound on the probability of collision.
Algorithm 1 describes how PRM-RL adds edges to the PRMs. We sample multiple points from the state space corresponding to the start and goal in the configuration space, and attempt to connect the two points over repeated trials. An attempt is successful only if the agent reaches sufficiently close to the goal point. To compute the total length of a trajectory, we sum the distances over all steps plus the remaining distance to the goal. The length we associate with the edge is the average length of the successful attempts. The algorithm recommends adding the edge to the roadmap if the success probability is above the threshold.
The worst-case number of collision checks in Algorithm 1 grows with the number of attempts made per candidate edge, because multiple attempts are required to estimate each edge's success probability. While the trials for an edge could be parallelized across processors, an alternative approach is to serialize trajectory checking within Algorithm 1, terminating early if the tests fail too many times. Mathematically, for a given success threshold p_s and a desired number of attempts n, at least ⌈p_s n⌉ trials must succeed; therefore, we can terminate as soon as that many trials have succeeded, or as soon as the number of failures exceeds the complementary budget (1 − p_s) n. This can provide substantial savings if p_s is high, as shown in Section V-D.
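The early-termination rule described above can be sketched as follows, with `rollout` standing in for one simulated execution of the RL local planner along the candidate edge:

```python
import math

def add_edge(rollout, n_attempts=20, threshold=0.9):
    """Monte Carlo edge check with early termination. `rollout` runs the
    local planner once between the candidate edge's endpoints (with
    resampled noise) and returns (success, path_length). The edge is
    accepted if at least ceil(threshold * n_attempts) rollouts succeed,
    and its length is the mean length of the successful rollouts."""
    needed = math.ceil(threshold * n_attempts)
    successes, failures, lengths = 0, 0, []
    for _ in range(n_attempts):
        ok, length = rollout()
        if ok:
            successes += 1
            lengths.append(length)
        else:
            failures += 1
        if successes >= needed:              # threshold met: accept early
            return True, sum(lengths) / len(lengths)
        if failures > n_attempts - needed:   # threshold unreachable: reject early
            return False, None
    return False, None
```

With a high threshold, hopeless edges are rejected after only a few failures, which is where most of the savings come from.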
We use a custom kinematic 3D simulator which provides drive models for agents and supports visual sensors such as cameras and lidars. The simulator also provides parameterized noise models for actions, observations, and robot localization, which we find improves model training and robustness. Because the simulation is kinematic, stepping is fast compared to full-physics simulations, speeding up both RL training and roadmap building.
For larger roadmaps, an effective strategy is to parallelize calls to Algorithm 1. Algorithm 2 describes the roadmap building procedure, in which an RL agent is trained once and used on several environments. When building a roadmap for an environment, we first sample the obstacle map at the given density and store all candidate edges; two nodes within the RL policy's range are considered a candidate. Next, we partition the candidate edges into subsets for distributed processing. Each candidate edge is then considered, and is added to the roadmap, along with its nodes, if the AddEdge Monte Carlo rollouts return a positive outcome.
IV-C Navigation
Finally, Algorithm 3 describes the navigation procedure, which takes a start and a goal position. As in traditional PRMs, these are added to the roadmap if not already there. The roadmap is then queried, yielding a list of waypoints. If no waypoints are returned, the algorithm returns the start and goal as the path, giving the RL agent the opportunity to attempt to navigate on its own. During execution, the RL agent is given one waypoint at a time as a subgoal, clearing waypoints sequentially as it gets within goal distance, until the final destination is reached or the step budget is exceeded.
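A minimal sketch of this waypoint-following loop, with `query_roadmap` and `policy_step` as hypothetical stand-ins for the roadmap query and one control step of the RL policy:

```python
import math

def navigate(query_roadmap, policy_step, start, goal,
             goal_dist=0.3, max_steps=1000):
    """Waypoint-following execution in the spirit of Algorithm 3: query
    the roadmap for waypoints (falling back to direct navigation if the
    query returns nothing), then hand the RL agent one waypoint at a
    time, advancing whenever the robot is within goal_dist of it."""
    waypoints = query_roadmap(start, goal) or [start, goal]
    pos, idx = start, 0
    for _ in range(max_steps):
        if math.dist(pos, waypoints[idx]) < goal_dist:
            idx += 1
            if idx == len(waypoints):
                return True, pos              # final destination reached
        pos = policy_step(pos, waypoints[idx])  # one RL step toward the subgoal
    return False, pos                          # step budget exceeded
```

Note that the policy never sees the whole path, only its current subgoal, which is what lets the same short-range agent execute arbitrarily long trajectories.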
V Results
We evaluate PRM-RL's performance on both floormaps and SLAM maps, with respect to noise resilience, and with on-robot experiments. Section V-A describes the robot and training setup, evaluation environments, and baselines. Section V-B evaluates performance on the floormaps against the baselines. With performance established against the baselines, the following sections examine PRM-RL's characteristics in more depth. Section V-C assesses navigation robustness to noise, Section V-D studies the role of the success threshold and sampling density, and Section V-H the applicability of PRM-RL to robots with dynamic constraints. Since one of our goals is to assess PRM-RL for real-world navigation, the remainder of the results focuses on on-robot evaluation and sim2real transfer. Section V-E evaluates PRM-RL on noisy SLAM maps, and Section V-G analyzes the sim2real transfer and on-robot experiments.
V-A Methodology
V-A1 Robot setup
We use two different robot kinematic models: differential drive [38] and a simple car model [38]. Both are controlled with linear and angular velocities commanded at a fixed rate, receive goal observations from off-the-shelf localization, and are represented as circles. The obstacle observations come from 1D lidar data (Fig. 5), with the field of view resampled to a fixed number of rays. Following [49], we use an observation trace of the last 3 frames to provide a simple notion of time to the agent. We use differential drive robots like the Fetch [47] for physical experiments.
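The 3-frame observation trace can be sketched as a small ring buffer; the class and its padding behavior are illustrative, not the paper's implementation:

```python
from collections import deque

class ObservationTrace:
    """Keep the last k sensor frames (k=3 in the paper) so the policy
    input carries a simple notion of time. Pads with the first frame
    until the buffer fills; the padding choice is an assumption."""
    def __init__(self, k=3):
        self.frames = deque(maxlen=k)

    def observe(self, frame):
        if not self.frames:
            # First observation: fill the whole trace with it.
            self.frames.extend([frame] * self.frames.maxlen)
        else:
            self.frames.append(frame)  # oldest frame falls off the left
        return [ray for f in self.frames for ray in f]  # flattened policy input
```

The flattened vector is what would be concatenated with the goal observation and fed to the policy network.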
Reward           Description                            Min      Max     Diff Drive   Car Model
r_goal           1 when goal reached and 0 otherwise     0.0    100.0        14.30         0.82
r_goalDist       Negative Euclidean distance to goal     0.0      1.0         0.17          N/A
r_goalDistDelta  Delta of Euclidean distance to goal     0.0      5.0          N/A          2.03
r_collision      1 on collision and 0 otherwise       −100.0      0.0       −31.75        −1.80
r_clearance      Distance to closest obstacle            0.0      1.0         0.45          N/A
r_step           Constant per-step penalty              −1.0      0.0        −0.34        −0.10
r_turning        Negative angular speed                  0.0      1.0         0.41          N/A
r_backward       Negative backward speed                −1.0      0.0          N/A        −0.64
V-A2 Obstacle-avoidance local planner training
We use the P2P agent from [12] for the differential drive robot, and train an agent with AutoRL over DDPG [12] for the car robot. In both cases, the true objective is reaching the goal within its radius, allowing the RL agent to cope with noisy sensors and dynamics. Table I lists the learned reward hyperparameters. The DDPG actor and critic are feed-forward fully-connected networks. Actor networks are three layers deep, while the critic consists of a two-layer observation network joined with the action network by two fully connected layers, resulting in actor layer widths of 50, 20, and 10, and critic widths of 10 and 10. Appendix VII contains the training hyperparameter details.
The training environment (Fig. 3(a)) is the smallest of our maps. To simulate imperfect real-world sensing, the simulator adds zero-mean Gaussian noise to its observations.
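Observation noise injection of this kind can be sketched as follows; the noise scale and sensor range are assumed values, since the exact parameters are not reproduced in this excerpt:

```python
import random

def noisy_lidar(true_ranges, sigma=0.1, max_range=5.0):
    """Add zero-mean Gaussian noise to each lidar ray, clamped to the
    sensor's valid range. sigma and max_range are assumed values, not
    the paper's exact noise parameters."""
    return [min(max(r + random.gauss(0.0, sigma), 0.0), max_range)
            for r in true_ranges]
```

Training against perturbed rays like these is what lets the learned policy tolerate the sensor errors measured in Section V-C.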
Environment         Type       Dimensions   Visual
Training            Floor map  by           Fig. 3(a)
Building 1          Floor map  by           Fig. 3(d)
Building 2          Floor map  by           Fig. 3(c)
Building 3          Floor map  by           Fig. 3(b)
Building Complex    Floor map  by           Fig. 2
Physical Testbed 1  SLAM       by           Fig. 6(a)
Physical Testbed 2  SLAM       by           N/A (private)
V-A3 Evaluation environments
Table II lists our simulation environments, all derived from real-world buildings. Training and Buildings 1–3, depicted in Fig. 3, and the Building Complex (Fig. 2) are metric maps derived from real building floor plans. They are between 12 and 200 times larger than the training environment by area. Physical Testbed 1 (Fig. 6) and Physical Testbed 2 are derived from SLAM maps of the robot deployment environments.
V-A4 Roadmap building
For simplicity, we use uniform random sampling to build roadmaps. We connect PRM edges using a success threshold evaluated over 20 attempts, with a maximum connection distance based on the effective navigation radius of our P2P agents per [12], except where stated otherwise.
Label     Local Planner (LP)  Execution Policy  Obstacle-Avoid LP  Obstacle-Avoid Execution  Monte Carlo Rollouts  Description
AutoRL    N/A                 AutoRL            N/A                Yes                       No                    The AutoRL policy [12] without guidance by a PRM.
PRM-SL    Straight Line       Straight Line     No                 No                        No                    Straight-line PRMs [34] with a straight-line execution policy.
PRM-GAPF  Straight Line       Guided APF        No                 Yes                       No                    Straight-line PRMs [34] executed by guided APF similar to [11].
ICRA18    DDPG                DDPG              Yes                Yes                       Yes                   PRM-RL with DDPG for roadmap and execution ([21]).
PRM-RL    AutoRL              AutoRL            Yes                Yes                       Yes                   PRM-RL with AutoRL for roadmap and execution (ours).
V-A5 Baselines
We select four baselines, shown in Table III. The baselines differ in the local planner, which is used for building the roadmap, and the execution policy, which guides the robot. The baseline selection criteria were a policy's obstacle-avoidance ability and its stochasticity. Recall that PRM-RL uses a stochastic policy capable of obstacle avoidance for connecting nodes in the roadmap, and relies on Monte Carlo rollouts of that policy.
The selected baselines are AutoRL [12], PRM-SL [34], PRM-GAPF (a modification of [11]), and our original PRM-RL with a hand-tuned RL policy [21]. The AutoRL [12] baseline policy is not guided by a PRM. PRM-SL [34] uses roadmaps built with a straight-line planner and a straight-line execution policy. PRM-GAPF uses PRMs built with a straight-line local planner, and an execution policy guided by an artificial potential field over the PRM-SL path, similar to [11]. ICRA18 [21] is PRM-RL with hand-tuned DDPG as the planner. Where not otherwise specified, PRM-RL refers to our current approach. We do not compare PRM-RL with RRTs because they are one-time planners and are relatively expensive to build on the fly for long-range trajectories.
For comparisons to baselines, unless otherwise stated, each roadmap is evaluated on 100 queries with start and goal positions sampled over a range of separations.
V-B PRM-RL Performance on Floorplan Maps
Figure 7 shows the success rate and path length comparison over 10,000 queries. PRM-RL achieves the highest average success rate over all three maps, outperforming pure AutoRL, PRM-SL, PRM-GAPF, and ICRA18. Numerical results are in Table IV in Appendix VII. Outperforming the non-guided AutoRL policy is not surprising, as it does not have knowledge of the obstacle map; note that AutoRL's performance degrades rapidly on maps other than the training environment, while PRM-guided policies show less degradation and occasional improvement.
We can draw two observations from these results. First, they provide evidence that PRM-RL successfully enables a local planner trained on one environment to perform well in environments it was not trained on, and furthermore can enable it to perform better on the original training environment than it can on its own. This is not surprising, as our local planners were not designed to travel long distances; guiding the local policy via the PRM can more than double its success rate. Second, the PRM success rate is tightly correlated with the success of the local planner. The P2P policy used in PRM-RL outperforms the one used in ICRA18 by 31% and APF by 48%, as reported in [12]. Extended to PRMs, the performance lifts are 19.9% and 33%. This is strong evidence that investing in a high-fidelity local planner increases PRM-RL's performance on the overall navigation task.
[Fig. 8: PRM-RL (blue) degrades more gracefully than PRM-GAPF (green) under (a) lidar noise, (b) position noise, and (c) action noise. Individual maps are marked T, 1, 2, 3, and S for Training, Buildings 1–3, and the SLAM map; thick lines mark means, and their light borders mark standard deviations.]

V-C PRM-RL Robustness to Noise
Real robot sensors and actuators are not perfect, so it is important for navigation methods to be resilient to noise. Fig. 8 shows the evaluation of PRM-RL and PRM-GAPF under simulated zero-mean Gaussian noise sources with standard deviations in the following ranges:

Lidar sensor noise from , over three times the radius of the robot.

Goal position noise from , over twice the radius of the goal target.

Action noise of velocity  and angular velocity .
As lidar and goal position noise increases, PRMRL shows only a slight degradation, typically no more than an drop in success even at . Action noise shows a steeper degradation, from to depending on the environment. In contrast, PRMGAPF degrades steeply with respect to lidar noise, achieving success rates of and dropping from the peak score. Localization noise for PRMGAPF, while not directly comparable to the goal noise collected for PRMRL, also dropped off to success scores of . While PRMGAPF seems to be somewhat more resistant to increased action noise, dropping off only from its highest score, it nonetheless never achieves more than success rate and was almost always worse than PRMRL.
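In simulation, noise sources of this kind can be injected as additive Gaussian perturbations on observations and actions. The following is a minimal sketch; the function names and the evaluation loop are illustrative, not the paper's implementation, and `evaluate` stands in for running one navigation episode at a given noise level.

```python
import random

def add_gaussian_noise(values, sigma, mean=0.0, rng=random):
    """Perturb each element of an observation or action vector with
    Gaussian noise of the given mean and standard deviation sigma."""
    return [v + rng.gauss(mean, sigma) for v in values]

def sweep_noise(evaluate, sigmas, trials=100):
    """Estimate success rate as a function of noise level.

    evaluate(sigma) -> bool runs one (noisy) navigation episode and
    reports success; returns a dict mapping sigma to success rate."""
    return {s: sum(evaluate(s) for _ in range(trials)) / trials
            for s in sigmas}
```

Sweeping `sigmas` over the lidar, goal, and action ranges listed above would reproduce the shape of the degradation curves in Fig. 8.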
PRMRL is resilient to lidar, localization, and actuator noise on the order of tens of centimeters, which is larger than the typical lidar, localization, and actuator errors we observe on our robot platforms, indicating that PRMRL is a feasible approach for dealing with the kinds of errors our robots actually encounter. This is similar to the trend in noise sensitivity reported in [12], and suggests that the overall navigation resilience to noise is correlated with that of the local planner.
VD The Impact of Sampling Density and Success Thresholds
To deploy PRMRL on real robots, we need to choose sampling density and success threshold parameters that provide the best performance.
Figure (a) shows that PRMRL success rate increases steadily up to a sampling density of samples per meter, which is roughly twice the size of our robot, and then levels off. The trend appears to be robot specific rather than environment specific. At the same time, collision checks increase rapidly with sampling density (Fig. (b)); while traditional PRM theory predicts that performance would continue to improve with increased sampling [33, 28], these results raise the possibility that beyond this density PRMRL performance is bound by robot noise, and that sampling beyond this density provides little benefit at rapidly increasing cost. For that reason, we suggest finding the optimal sampling density in the small training environment at a fraction of the cost, and using it when building larger roadmaps.
To determine the appropriate connection success threshold, we survey navigation success over our training environment and a SLAM map (Fig. 10). These results show that the agent’s ability to navigate the map increases as the threshold increases (Fig. (a)). Because our connection algorithm terminates earlier for higher thresholds when it detects failures, collision checks drop as the success threshold rises (Fig. (b)). Ultimately, for larger roadmaps, a success threshold of 100% not only produces the most reliable roadmaps but also requires fewer collision checks to build them.
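The early-terminating connection rule described above can be sketched as follows; `rollout` stands in for simulating the local policy between two candidate waypoints, and the attempt count and threshold defaults are illustrative rather than the paper's exact settings.

```python
import math

def try_connect(rollout, n_attempts=20, threshold=1.0):
    """Attempt to certify an edge with Monte Carlo rollouts.

    rollout() -> bool simulates the policy driving between the two
    candidate waypoints. The edge is accepted only if at least
    ceil(threshold * n_attempts) rollouts succeed; we stop as soon as
    enough failures accumulate that the threshold is unreachable.
    Higher thresholds tolerate fewer failures and therefore terminate
    sooner on bad edges, which is why a 100% threshold can need fewer
    collision checks overall."""
    needed = math.ceil(threshold * n_attempts - 1e-9)  # epsilon guards float round-up
    allowed_failures = n_attempts - needed
    successes = failures = 0
    for _ in range(n_attempts):
        if rollout():
            successes += 1
        else:
            failures += 1
            if failures > allowed_failures:
                return False  # threshold no longer reachable; stop early
    return successes >= needed
```

With `threshold=1.0`, a single failed rollout rejects the edge immediately, so unreliable edges are discarded at minimal cost.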
For the evaluations in simulation reported earlier in this paper, we chose parameters that enabled comparison with [21]: a sampling density of samples per meter and an effective success connection threshold of . For the two roadmaps in Fig. 10, this results in an average collision check savings over .
However, if comparison to a baseline is not important, these results suggest choosing map densities as high as samples per meter with as high a success connection threshold as possible. For the physical deployments we discuss later, we choose a success connection threshold of and generate maps at and samples per meter, measuring performance empirically; as predicted by the simulation results, we observe better performance on denser maps.
VE PRMRL Performance on SLAM Maps
Floorplans are not always available or uptodate, but many robots can readily generate SLAM maps and update them as buildings change. To assess the feasibility of using SLAM maps, we evaluate PRMRL on SLAM maps generated from some of the same buildings as our floorplan roadmaps. Figure 6 illustrates a sample roadmap, corresponding to Physical Testbed 1, part of floor two of Building 1 in (d) and the upper center section of the largescale roadmap in Figure 2. This SLAMderived roadmap has 195 nodes and 1488 edges, built with 2.1 million collision checks.
We compare PRMRL’s performance with PRMGAPF on this map. PRMGAPF achieves a success rate of , whereas PRMRL’s success rate is , a relative increase. Close inspection of the failure cases indicated that some of these connections are genuinely hard for our current policies due to features of this map. For example, the inset in Figure (b) shows an area which appears navigable at first glance but which PRMRL refuses to connect: closer inspection reveals pixels in the SLAM map corresponding to a pillar in reality, which humans can easily walk around but which is challenging for our robot due to the width of its base.
These results show that while the performance of PRMRL with an AutoRL policy on SLAM roadmaps is not as good as its average on the floorplan roadmaps in our tests, it is better than the baselines on the floorplan maps, and is roughly equivalent to PRMRL with the ICRA2018 policy on these floorplan maps. These results indicate PRMRL performs well enough to merit tests on roadmaps intended for real robots at physical sites, which we discuss in the following two sections.
VF Scaling PRMRL to LargeScale Maps
Our robot deployment sites are substantially larger than our simulated test maps, raising the question of how PRMRL would scale up. For example, the SLAM map discussed in the previous section is only part of one building within a 4building complex; where the SLAM map is by , a map of the 4building complex is by . To assess PRMRL’s performance on largescale maps, we built and tested roadmaps covering the complete footprint of both deployment sites.
Figure 2 depicts a large floorplan roadmap from the 4building complex. This roadmap has 15.9 thousand samples and 1.4 million candidate edges prior to connection attempts, of which 689 thousand edges were confirmed at a success threshold. This roadmap took 4 days to build using 300 workers in a cluster and required 1.1 billion collision checks. PRMRL successfully navigates this roadmap of the time evaluated over 1,000 random navigation attempts with a maximum path distance of (note our other experiments use , which generally will not cross the skybridge in this larger map; for reference, using our default evaluation settings, PRMRL navigates this roadmap of the time).
For our other candidate robot deployment site, we use a large SLAM map, by . We constructed a roadmap with 2069 nodes and 53,800 edges, collected with 42 million collision checks at the higher success threshold of . PRMRL successfully navigated this of the time, evaluated over 1,000 random navigation attempts; as on our smaller SLAM map, examination of the failure cases indicates that the more complex geometry recorded by SLAM proves problematic for our current policies.
These results indicate that PRMRL’s simulated performance on roadmaps continues to surpass the average success threshold we observed previously in [21], making it worthwhile to test on real robots at the physical sites.
VG Transfer of PRMRL to Physical Robots
We empirically evaluate PRMRL in two physical environments on two differentialdrive robots. First, we evaluate PRMRL with the AutoRL policy for a differentialdrive robot in Physical Testbed 1 (Fig. (a)). We collected 43 trajectories accumulating over traveled. PRMRL exhibited an overall success rate of over 39 runs; the longest successful trajectory was meters. Fig. (c) shows a sample of 10 of these runs. The video is available at https://youtu.be/xNOWX5gKvQ.
Second, we evaluate PRMRL in Physical Testbed 2. For a variant of the roadmap generated at success rate and a density of samples per meter, we collected 31 trajectories over of travel, shown in Fig. (a); the longest successful trajectory was . We cannot directly compare this evaluation to our simulated runs because the estop policies designed to protect the robot do not match our simulation termination conditions; nevertheless, we recorded 14 successful runs out of 31, a success rate. We then tested a variant of the roadmap generated at success rate and a density of samples per meter over 49 runs and of travel, for an improved success rate, shown in Fig. (b); the longest successful trajectory was . These runs include a complete circuit in which PRMRL successfully navigated a loop of waypoints over . Interestingly, this much more successful map showed qualitatively worse behavior on some paths by attempting to navigate through PRM waypoints blocked by temporary obstructions.
Figure 12 summarizes these results. Despite the more aggressive episode termination policies onrobot (nearcollisions are treated as failures onrobot) and the more limited available search space, we nonetheless observe similar results: over several different roadmaps constructed at different densities and success criteria, PRMRL built with noisy SLAM maps achieves success in simulation and success onrobot, closing the sim2real gap. This makes SLAMbased PRMRL simulation results a good predictor of onrobot performance.
VH PRMRL with Dynamic Constraints
To demonstrate that PRMRL can guide more complicated robots, we develop a drive model for a simulated F1/10 car [16] with dynamics following [52]. Average success over the four maps in Figure 3 is with a standard deviation of ; average success in simulation on Physical Testbed 1 is . Figure (a) illustrates a roadmap built with this model over our training map with samples per meter connected with a success rate; this roadmap has 32 nodes and 313 edges connected after 403 thousand edge checks. On this roadmap, PRMRL exhibits an success rate, including cases where the car needs to make a complex maneuver to turn around (Fig. (b)). These results are as good as the onrobot results for the dense map of our larger robot deployment, indicating that PRMRL is a viable candidate for further testing on more complicated robot models.
VI Analysis
In the previous section, we empirically established correlations between the local planner’s competence and PRMRL’s resilience to noise; we also explored the contributions of sampling densities, success thresholds, and obstacle map sources to the success of overall navigation. We concluded that 1) using an execution policy that is resilient to noise and avoids obstacles as a local planner improves the overall success of the hierarchical planner; 2) a success threshold of 100% improves overall navigation success; 3) the upper bound on navigation success depends not on sampling density but on policy performance and robot type; and 4) using realistic obstacle maps, such as SLAM maps, as a basis for building roadmaps yields simulation results closer to reality.
This section provides a deeper analysis of those empirical findings. Section VIA analyzes the impact of the local planner’s obstacle avoidance and noise resilience on roadmap construction. Section VIB examines the computational complexity of PRMRL, and Section VIC discusses causes of failure for trajectories over multiple waypoints.
VIA PRMRL Roadmap Connectivity
Unlike straightline planners, RL agents can often go around corners and smaller obstacles; Figure 14 shows how this effectively transforms the configuration space to make obstacles smaller. While the agent never traverses these corner points, as they are not in , they nevertheless do not serve to block the agent’s path, unlike central portions of a larger body which might block or trap the control agent in a local minimum. If we model this phenomenon as an effective reduction in radius of a circular obstacle with respect to a policy , and further model the connection region as a disc filled with randomly sampled obstacles from to in total area density , we can estimate an upper bound on connection success as the idealized case in which obstacles do not occlude and the chance of connection is just the complementary probability of encountering an obstacle over a region of space, ; this corresponds to the looser bound in the RL case. Combining these, a conservative estimate of the proportion of samples connected by PRMRL to those connected by PRMSL is:
(4) 
This simplified model indicates that as obstacle density increases, it becomes harder to connect points, but as RL’s ability to cut corners increases, PRMRL has a growing advantage over PRMSL in connecting these difficult roadmaps. Hence, in environments with dense obstacles, it makes sense to invest in execution policies that are highly effective at avoiding obstacles, and to use them as local planners. Alternatively, it suggests that policies that can learn strategies for dealing with frequently occurring obstacle topologies, like box canyons and corners, are a fruitful area for future work.
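The idealized obstacle model above can be probed numerically. The sketch below is an assumption-laden Monte Carlo stand-in, not the paper's Eq. (4): obstacles are modeled as non-occluding discs placed uniformly at a target area density, and the RL policy's corner-cutting is modeled as an effective reduction of each obstacle's radius.

```python
import math
import random

def connect_prob(rho, r, shrink, region=10.0, seg_len=5.0, trials=2000, rng=None):
    """Monte Carlo estimate of the chance that a straight segment of
    length seg_len avoids randomly placed disc obstacles.

    rho: target obstacle area density; r: obstacle radius; shrink:
    effective radius reduction granted to the agent (0 models a
    straight-line planner). Illustrative parameters throughout."""
    rng = rng or random.Random(0)
    area = region * region
    n_obstacles = int(rho * area / (math.pi * r * r))
    r_eff = max(r - shrink, 0.0)
    hits = 0
    for _ in range(trials):
        # Segment along the x-axis through the middle of the region.
        y0 = region / 2.0
        x0 = (region - seg_len) / 2.0
        blocked = False
        for _ in range(n_obstacles):
            cx, cy = rng.uniform(0, region), rng.uniform(0, region)
            # Distance from the disc center to the closest segment point.
            dx = min(max(cx - x0, 0.0), seg_len)
            dist = math.hypot(cx - (x0 + dx), cy - y0)
            if dist < r_eff:
                blocked = True
                break
        hits += not blocked
    return hits / trials
```

Comparing `connect_prob(rho, r, 0.0)` against `connect_prob(rho, r, shrink)` for a positive shrink reproduces the qualitative claim: the denser the obstacles, the larger the relative connection advantage of the corner-cutting agent.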
Conversely, PRMRL does not connect nodes in a roadmap where a policy cannot navigate reliably. This is the key difference from standard PRMs, and is the cause of the upper limit on performance improvements as the roadmaps increase in density: the roadmaps are policybound rather than samplingbound. One question to ask is why local control policies cannot simply learn to “drive safe” under the guidance of an onthefly path planner like A*. However, an analysis of how noise impacts the behavior of policies indicates that policies which do not memorize the environment may overestimate their ability to navigate, because the hazards they can observe locally may not paint an accurate picture of the hazards of the global environment.
To see why, suppose a policy has learned to navigate safely in the presence of noise by maintaining a distance d from walls. Modeling time as discrete and assuming the robot is travelling at a constant speed, so that on each time slice the robot moves a constant distance Δ forward, let us further model sources of variability as transient Gaussian noise orthogonal to the robot’s travel with zero mean and standard deviation σ; this results in a probability of collision per step of p = Φ(−d) = ½ erfc(d/(σ√2)) (the cumulative distribution function Φ of the Gaussian noise model evaluated at −d, expressed in terms of the complementary Gauss error function erfc). Figure (a) shows that when the robot is traveling in a narrow corridor, it is twice as likely to collide as it does when hugging a wall, even though it may be maintaining distance d from any given wall at all times. Over a path of length L, a conservative lower bound on the chance of collision rises exponentially with the number of steps it takes to traverse the path,

P_collide ≥ 1 − (1 − p)^⌈L/Δ⌉,   (5)
causing the narrow corridor case to become unsafe faster than the wallhugging case, as shown in the blue region of Figure (b). This means that an RL policy that judges its safety based on locally observable features can overestimate the safety of a path in the region between where both PRMRL and RL would succeed and where both PRMRL and RL would fail (Figure (c)); the same would be true of RL guided by PRMSL based on clearance, such as in gridbased sampling [17] and Safety PRM [44]. In this case, RL or RL guided by PRMSL can make erroneous navigation choices, whereas PRMRL simply does not add that link to the roadmap. While in theory an agent could be designed to cope with this specific issue, other scenarios can present similar problems: no local prediction of collision probability can give a true guarantee of trajectory safety. While this is an idealized analysis, our experience is that realworld environments can be more pathological: for example, at one site curtains behind a glass wall induced a persistent 0.5 meter error in the robot’s perceived location, causing it to drive dangerously close to the wall for long distances despite the presence of a safety layer. PRMRL provides an automated method for excluding those unreliable connections from the roadmap, resulting in roadmaps that are more sparsely connected but transfer more reliably onrobot.
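Under the assumptions of this analysis (clearance d, lateral noise σ, and a fixed step size, all symbols matching the idealized model above rather than robot code), the per-step and per-path collision probabilities can be computed directly:

```python
import math

def p_collide_step(d, sigma, walls=1):
    """Per-step collision probability for a robot holding clearance d
    from `walls` nearby walls, under zero-mean Gaussian lateral noise
    with standard deviation sigma: walls * Phi(-d/sigma), written via
    the complementary error function."""
    return walls * 0.5 * math.erfc(d / (sigma * math.sqrt(2)))

def p_collide_path(d, sigma, path_len, step, walls=1):
    """Conservative lower bound on the chance of at least one collision
    over a path of length path_len traversed in steps of size `step`,
    assuming independent noise per step: 1 - (1 - p)^(path_len/step)."""
    p = p_collide_step(d, sigma, walls)
    n_steps = math.ceil(path_len / step)
    return 1.0 - (1.0 - p) ** n_steps
```

The narrow-corridor case is `walls=2`, exactly twice the per-step risk of hugging a single wall, and the path bound climbs toward 1 as the path lengthens, which is the exponential unsafety illustrated above.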
VIB PRMRL Computational Complexity
The computational complexity of connecting a node to a PRMRL roadmap is a constant factor higher than for PRMSL, because point connection tests for PRMRL require a sequence of simulated Monte Carlo rollouts, whereas PRMSL requires a single straightline collision test. In addition, the empirical evaluations show that there is a sample density beyond which the navigation success rate of PRMRL does not improve. Here we assess the computational complexity of building a PRMRL roadmap with known RL policy range , optimal sampling density , and workspace volume .
Approaches exist to construct PRMs with nodes in edge tests, but they generally achieve this by imposing various limits, e.g., connecting the k nearest neighbors or making the radius of connection an inverse function of the number of nodes [31]. PRMRL instead attempts to connect all nodes within a given radius, both to give the RL agent the highest chance of success and to satisfy user expectations on navigation behavior, making it more similar to Simplified PRM [33], which requires edge calculations. However, for practical purposes most RL policies cannot reliably connect points separated by more than some effective range, which can be determined empirically (see [12]). This can be used to restrict connection attempts to a local neighborhood, reducing the edge calculations to if a spatial data structure is used for the nearest neighbor lookup.
To ensure coverage, PRMRL samples the configuration space to a given density rather than choosing a fixed number of points , so the number of points sampled is the workspace volume multiplied by the density, where is the workspace volume and is the optimal sampling density. Next, the complexity of connecting a single node to the roadmap given connection attempts is , because the search volume for a policy with effective range is , where is the workspace dimensionality, and therefore the number of expected samples in that volume is . Therefore, the computational complexity of building a roadmap is
(6) 
dropping the term because it is a constant. Eq. (6) highlights the following sources of computational cost:

Complexity is linear with the workspace volume and success threshold.

The roadmap building time is in sampling density, matching the empirical observations in Fig. (b); it is therefore worthwhile to assess the optimal sampling density before investing in building the roadmaps.

The complexity is in the policy range, so sophisticated local planners that can reliably cover long distances increase the computational cost of the roadmap (unless a lower arbitrary connection limit is chosen).
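The counting argument above can be written out as a rough cost model. The symbol roles (workspace volume, sampling density, effective policy range, workspace dimensionality, rollouts per edge) follow the text, but the function below is an illustrative sketch rather than the paper's exact accounting.

```python
import math

def roadmap_cost(volume, density, policy_range, dim=2, attempts=20):
    """Illustrative cost model for building a PRMRL roadmap.

    Returns (sample count, expected candidate edge checks, expected
    Monte Carlo rollouts). Edge checks scale with density^2 because
    both the number of nodes and the expected neighbors per node grow
    linearly with density."""
    n_points = math.ceil(volume * density)  # samples to reach the density
    # Volume of a dim-dimensional ball of radius policy_range:
    ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * policy_range ** dim
    neighbors = density * ball            # expected candidates per node
    edge_checks = n_points * neighbors    # quadratic in density overall
    rollouts = edge_checks * attempts     # each candidate edge gets MC rollouts
    return n_points, edge_checks, rollouts
```

Doubling the density doubles the sample count but quadruples the candidate edge checks, matching the rapid growth in collision checks observed empirically.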
Note that each node connection is an independent operation, so with a sufficient number of processors the roadmap building can be parallelized up to the expected number of samples. Thus, given processors, the effective time complexity is , possibly alleviating some of the cost of increased sampling.
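Because each edge certification touches no shared state, the parallelization described above amounts to mapping a certification function over the candidate edges. A minimal sketch follows; a thread pool stands in for the multi-machine cluster used in practice, and `certify` is a stand-in for the Monte Carlo edge check.

```python
from concurrent.futures import ThreadPoolExecutor

def connect_all(candidate_edges, certify, workers=8):
    """Certify candidate edges independently in parallel.

    certify(edge) -> bool runs the Monte Carlo rollouts for one edge;
    since edges share no state, they can be farmed out to workers and
    the surviving edges collected in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        verdicts = list(pool.map(certify, candidate_edges))
    return [e for e, ok in zip(candidate_edges, verdicts) if ok]
```

In a real deployment the worker pool would be a distributed cluster rather than threads, but the independence property that makes the farm-out valid is the same.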
Lastly, note that when using early termination, increasing the success threshold often, but not always, reduces the required number of connection attempts . In the worstcase scenario where we require , early termination can at best cut to , but as increases, the number of failures needed to exclude an edge, , drops toward 1. Conversely, if navigation is successful, then the full samples need to be collected for an edge; the distribution of successes and failures thus has a large effect on the cost. One way to control this cost is to reduce the max connection distance to less than the effective policy navigation distance ; in this case the agent is more often expected to succeed, and can potentially be reduced. We have observed that these tradeoffs can significantly affect the cost of a run, but they must be studied empirically for the environments of interest.
VIC PRMRL Trajectory Execution
Because PRMRL construction estimates the probability of success before adding an edge, we can estimate the expected probability of success of a longrange path that passes through several waypoints. Recall that to add an edge to the roadmap, we collect Monte Carlo rollouts and require an observed proportion of successes, typically of and . Given that the expected probability of success of a Bernoulli trial with observed successes out of samples is
(7) 
[46], the actual probability of successful navigation over an edge with successful samples is , and similarly for thresholds of . Extrapolating over the sequence of edges in a PRMRL path, the expected success rate is , where is the number of waypoints. In [21] we observe PRMRL paths with waypoints averaged over our three deployment maps, yielding an estimated probability of success of for the threshold and for the threshold. Therefore, for the lengths of paths we observe in our typical deployment environments, the threshold improves PRMRL’s theoretical performance to the point that it is more likely to succeed than not, which is what we observe empirically.
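Assuming the estimator of [46] takes the Laplace rule-of-succession form (an assumption on our part; the exact closed form of Eq. (7) is not reproduced above), the per-edge and per-path estimates compound as follows:

```python
def expected_edge_success(successes, samples):
    """Rule-of-succession estimate of an edge's true success
    probability after observing `successes` out of `samples` Monte
    Carlo rollouts: (s + 1) / (n + 2). One standard small-sample
    Bernoulli estimator; the paper's exact form follows [46]."""
    return (successes + 1) / (samples + 2)

def expected_path_success(successes, samples, n_waypoints):
    """Expected success over a path of n_waypoints edges, assuming
    independent edges all certified at the same observed proportion."""
    return expected_edge_success(successes, samples) ** n_waypoints
```

For illustration, with an assumed 20 rollouts per edge, a 20-of-20 edge compounds over a 10-waypoint path to roughly a 0.63 success estimate, while an 18-of-20 edge drops the same path below 0.25, showing why the stricter threshold keeps long paths more likely to succeed than not.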
VII Conclusion
We presented PRMRL, a hierarchical planning method for longrange navigation that combines samplingbased path planning with an RL agent as a local planner in very large environments. The core of the method is that roadmap nodes are connected only when the RL agent can connect them consistently in the face of noise and obstacles. This extension of [21] contributed roadmap construction and robot deployment algorithms, along with analyses of roadmap connectivity, computational complexity, and navigation performance. We evaluated the method on a differential drive robot and a car model with inertia, on floormaps from five buildings, two noisy obstacle maps, and two physical testbeds. We showed that 1) the navigation quality and resilience to noise of the execution policy directly transfer to the hierarchical planner; 2) a 100% success threshold in roadmap construction yields both the highest quality and most computationally efficient roadmaps; and 3) building roadmaps from the noisy SLAM maps that the robot uses at execution time virtually closes the sim2real gap, yielding simulation success rates very similar to those observed on robots. PRMRL with SLAM maps embeds into the roadmap the information that the robot uses at execution time, providing a better estimate of onrobot performance. Failure modes include pathologies of the local policy, poorly positioned samples, and sparsely connected roadmaps. In future work, we will examine improved policies with more resistance to noise, better sampling techniques that position samples strategically, and techniques for improving map coverage with better localization and obstacle maps.
[Training Hyperparameters] Both the actor and critic use the AdamOptimizer with , , ; the actor’s learning rate is and the critic’s is . The actor uses DQDA gradient clipping and the critic uses with the Huber loss for temporal difference errors. Our number of initial stabilization steps is , after which we use a soft target network update of on every step. Our training batch size is and our replay buffer has a capacity of million. We let training run for million steps, but save policies every steps and select the best policy over the course of the run.
[Tabular results for PRMRL success rate]
Env.      Navigation Success (%)
          AutoRL   PRMSL   PRMGAPF   ICRA18   PRMRL
Bldg. 1   9.0      15      35        43       69
Bldg. 2   9.3      18      32        38       63
Bldg. 3   4.0      8       39        66       74
Avg.      7.4      13.7    35.3      49.0     68.7
Impr. %   825      402     94.3      40.1     N/A
[Table of Symbols]
Symbol  Units or Domain  Meaning 

Configuration space of dimension  
Free portion of configuration space  
State space of robot plus task state  
Task space for navigation  
Free portion of the task space  
Physical workspace of dimension  
Observation space of dimension  
Action space of dimension  
Task dynamics  
Noise model or  
Reward model  
True objective  
Dense reward  
Named reward component  
Discount  
Indicator function  
Task predicate  
Sensor w/ dyn. & noise  
Action w/ dyn. & noise  
Waypoint on path  
Point along trajectory  
Start state  
Goal state  
meters  Goal success distance  
Max trajectory execution steps  
Diff drive action lin, ang vel.  
Line from to in graph  
Num. edge connection attempts  
Num. observed connection successes  
Edge connection success threshold  
meters  Max attempted edge connection dist.  
meters  Policy effective nav. distance  
Policy effective obst. shrinkage  
Sampling density per meter  
points  Number of points to sample  
Probability of successful navigation  
Number of waypoints on a path  
Volume of the workspace  
Workspace dimensionality  
Number of processors  
noise for 
Acknowledgments
The authors thank Jeffrey Bingham, James Davidson, Brian Ichter, Ken Oslund, Peter Pastor, Oscar Ramirez, Lydia Tapia, Alex Toshev, and Vincent Vanhoucke for their helpful discussions and contributions to the PRMRL project.
References
 [1] D. Abel, A. Agarwal, F. Diaz, A. Krishnamurthy, and R. E. Schapire. Exploratory gradient boosting for reinforcement learning in complex domains. arXiv preprint arXiv:1603.04119, 2016.
 [2] A. Aghamohammadi, S. Chakravorty, and N. Amato. FIRM: Samplingbased feedback motion planning under motion uncertainty and imperfect measurements. Int. J. Robot. Res., 33(2):268–304, 2014.
 [3] I. AlBluwi, T. Siméon, and J. Cortés. Motion planning algorithms for molecular simulations: A survey. Computer Science Review, 6(4):125 – 143, 2012.
 [4] R. Alterovitz, T. Simeon, and K. Goldberg. The stochastic motion roadmap: A sampling framework for planning with Markov motion uncertainty. pages 246–253, Atlanta, GA, USA, June 2007.
 [5] N. M. Amato and L. K. Dale. Probabilistic roadmap methods are embarrassingly parallel. In Proceedings 1999 IEEE International Conference on Robotics and Automation (Cat. No.99CH36288C), volume 1, pages 688–694 vol.1, May 1999.
 [6] S. Bhatti, A. Desmaison, O. Miksik, N. Nardelli, N. Siddharth, and P. H. Torr. Playing doom with slamaugmented deep reinforcement learning. arXiv preprint arXiv:1612.00380, 2016.
 [7] S. Brahmbhatt and J. Hays. Deepnav: Learning to navigate large cities. arXiv preprint arXiv:1701.09135, 2017.
 [8] J. Ichnowski and R. Alterovitz. Parallel samplingbased motion planning with superlinear speedup. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1206–1212, Oct 2012.
 [9] A Look at the U.S. Commercial Building Stock: Results from EIA’s 2012 Commercial Buildings Energy Consumption Survey (CBECS). https://www.eia.gov/consumption/commercial/reports/2012/buildstock/, 2012.

 [10] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
 [11] H.T. Chiang, N. Malone, K. Lesser, M. Oishi, and L. Tapia. Pathguided artificial potential fields with stochastic reachable sets for motion planning in highly dynamic environments. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 2347–2354, 2015.
 [12] L. Chiang, A. Faust, M. Fiser, and A. Francis. Learning navigation behaviors endtoend with autorl. IEEE Robotics and Automation Letters (RAL), 2019.
 [13] A. Dosovitskiy and V. Koltun. Learning to act by predicting the future. arXiv preprint arXiv:1611.01779, 2016.
 [14] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
 [15] C. Ekenna, S. A. Jacobs, S. Thomas, and N. M. Amato. Adaptive neighbor connection for prms: A natural fit for heterogeneous environments and parallelism. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1249–1256, Nov 2013.
 [16] F1/10 autonomous racing competition. http://f1tenth.org/about, 2018. Accessed: 20181203.
 [17] T. Fan, X. Cheng, J. Pan, P. Long, W. Liu, R. Yang, and D. Manocha. Getting robots unfrozen and unlost in dense pedestrian crowds. IEEE Robotics and Automation Letters (RAL), 2019.
 [18] A. Faust, H.T. Chiang, and L. Tapia. PEARL: PrEference appraisal reinforcement learning for motion planning. arXiv:1811.12651.
 [19] A. Faust, N. Malone, and L. Tapia. Preferencebalancing motion planning under stochastic disturbances. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 3555–3562, 2015.
 [20] A. Faust, I. Palunko, P. Cruz, R. Fierro, and L. Tapia. Automated aerial suspended cargo delivery through reinforcement learning. Artificial Intelligence, 247:381 – 398, 2017. Special Issue on AI and Robotics.
 [21] A. Faust, O. Ramirez, M. Fiser, K. Oslund, A. Francis, J. Davidson, and L. Tapia. PRMRL: longrange robotic navigation tasks by combining reinforcement learning and samplingbased planning. CoRR, abs/1710.03937, 2017.
 [22] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine, 4(1):23–33, 1997.

 [23] A. Giusti, J. Guzzi, D. C. Cireşan, F.L. He, J. P. Rodríguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters (RAL), 1(2):661–667, 2016.
 [24] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for blackbox optimization. In Proc. of ACM International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.
 [25] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. CoRR, abs/1702.03920, 2017.
 [26] K. Hauser, T. Bretl, J.C. Latombe, and B. Wilcox. Motion planning for a sixlegged lunar robot. In The Seventh International Workshop on the Algorithmic Foundations of Robotics, pages 16–18, 2006.
 [27] D. Hsu, R. Kindel, J.C. Latombe, and S. M. Rock. Randomized kinodynamic motion planning with moving obstacles. Int. J. Robot. Res., 21(3):233–256, 2002.
 [28] D. Hsu, J.C. Latombe, and H. Kurniawati. On the probabilistic foundations of probabilistic roadmap planning. In Robotics Research, pages 83–97. Springer, 2007.
 [29] Deep Reinforcement Learning Doesn’t Work Yet. https://www.alexirpan.com/2018/02/14/rlhard.html, 2018.
 [30] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
 [31] S. Karaman and E. Frazzoli. Samplingbased algorithms for optimal motion planning. The international journal of robotics research, 30(7):846–894, 2011.
 [32] Y. Kato, K. Kamiyama, and K. Morioka. Autonomous robot navigation system with learning based on deep qnetwork and topological maps. In 2017 IEEE/SICE International Symposium on System Integration (SII), pages 1040–1046, Dec 2017.
 [33] L. E. Kavraki and J.C. Latombe. Probabilistic roadmaps for robot path planning, 1998.
 [34] L. E. Kavraki, P. Švestka, J. C. Latombe, and M. H. Overmars. Probabilistic roadmaps for path planning in highdimensional configuration spaces. IEEE Trans. Robot. Automat., 12(4):566–580, August 1996.
 [35] O. Khatib. Realtime obstacle avoidance for manipulators and mobile robots. The international journal of robotics research, 5(1):90–98, 1986.
 [36] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
 [37] J. Kuffner and S. LaValle. RRTconnect: An efficient approach to singlequery path planning. In IEEE International Conference on Robotics and Automation, volume 2, pages 995 –1001 vol.2, 2000.
 [38] S. M. LaValle. Planning Algorithms. Cambridge University Press, Cambridge, U.K., 2006. Available at http://planning.cs.uiuc.edu/.
 [39] S. M. Lavalle and J. J. Kuffner. Rapidlyexploring random trees: Progress and prospects. In Algorithmic and Computational Robotics: New Directions, pages 293–308, 2000.
 [40] S. Levine, C. Finn, T. Darrell, and P. Abbeel. Endtoend training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
 [41] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
 [42] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018, pages 6252–6259, 2018.
 [43] N. Malone, A. Faust, B. Rohrer, R. Lumia, J. Wood, and L. Tapia. Efficient motion-based task learning for a serial link manipulator. Transactions on Control and Mechanical Systems, 3(1), 2014.
 [44] N. Malone, K. Manavi, J. Wood, and L. Tapia. Construction and use of roadmaps that incorporate workspace modeling errors. In Proc. IEEE Int. Conf. Intel. Rob. Syst. (IROS), pages 1264–1271, Nov 2013.
 [45] N. Malone, B. Rohrer, L. Tapia, R. Lumia, and J. Wood. Implementation of an embodied general reinforcement learner on a serial link manipulator. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 862–869, May 2012.
 [46] N. D. Megill and M. Pavicic. Estimating Bernoulli trial probability from a small sample. CoRR, abs/1105.1486, 2011.
 [47] M. Wise, M. Ferguson, D. King, E. Diehr, and D. Dymesich. Fetch & Freight: Standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots held at the 2016 International Joint Conference on Artificial Intelligence, 2016.
 [48] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
 [49] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [50] A. Mousavian, A. Toshev, M. Fiser, J. Kosecka, and J. Davidson. Visual representations for semantic target driven navigation. CoRR, abs/1805.06066, 2018.
 [51] K. Mülling, J. Kober, and J. Peters. A biomimetic approach to robot table tennis. Adaptive Behavior, 19(5):359–376, 2011.
 [52] B. Paden, M. Cáp, S. Z. Yong, D. S. Yershov, and E. Frazzoli. A survey of motion planning and control techniques for self-driving urban vehicles. CoRR, abs/1604.07446, 2016.
 [53] J.-J. Park, J.-H. Kim, and J.-B. Song. Path planning for a robot manipulator based on probabilistic roadmap and reinforcement learning. In International Journal of Control, Automation, and Systems, pages 674–680, 2008.
 [54] M. Pfeiffer, M. Schaeuble, J. I. Nieto, R. Siegwart, and C. Cadena. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 1527–1533, 2017.
 [55] S. Rodríguez, J.-M. Lien, and N. M. Amato. A framework for planning motion in environments with moving obstacles. In Proc. IEEE Int. Conf. Intel. Rob. Syst. (IROS), pages 3309–3314, 2007.
 [56] A. Seff and J. Xiao. Learning from maps: Visual common sense for autonomous driving. arXiv preprint arXiv:1611.08583, 2016.
 [57] B. Siciliano and O. Khatib. Springer handbook of robotics. Springer, 2016.
 [58] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [59] L. Tai, G. Paolo, and M. Liu. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 31–36, Sept 2017.
 [60] L. Tapia, S. Thomas, and N. M. Amato. A motion planning approach to studying molecular motions. Communications in Information and Systems, 10(1):53–68, 2010.
 [61] A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine. Collective robot reinforcement learning with distributed asynchronous guided policy search. CoRR, abs/1610.00673, 2016.
 [62] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard. Deep reinforcement learning with successor features for navigation across similar environments. In Proc. IEEE Int. Conf. Intel. Rob. Syst. (IROS), pages 2371–2378. IEEE, 2017.
 [63] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. arXiv preprint arXiv:1609.05143, 2016.
 [64] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3357–3364. IEEE, 2017.