Learning Navigation Behaviors End to End

by   Hao-Tien Lewis Chiang, et al.

A longstanding goal of behavior-based robotics is to solve high-level navigation tasks using end to end navigation behaviors that directly map sensors to actions. Navigation behaviors, such as reaching a goal or following a path without collisions, can be learned from exploration and interaction with the environment, but are constrained by the type and quality of a robot's sensors, dynamics, and actuators. Traditional motion planning handles varied robot geometry and dynamics, but typically assumes high-quality observations. Modern vision-based navigation typically considers imperfect or partial observations, but simplifies the robot action space. With both approaches, the transition from simulation to reality can be difficult. Here, we learn two end to end navigation behaviors that avoid moving obstacles: point to point and path following. These policies receive noisy lidar observations and output robot linear and angular velocities. We train these policies in small, static environments with Shaped-DDPG, an adaptation of the Deep Deterministic Policy Gradient (DDPG) reinforcement learning method which optimizes reward and network architecture. Over 500 meters of on-robot experiments show , these policies generalize to new environments and moving obstacles, are robust to sensor, actuator, and localization noise, and can serve as robust building blocks for larger navigation tasks. The path following and point and point policies are 83


page 1

page 2

page 6

page 9


Learning Composable Behavior Embeddings for Long-horizon Visual Navigation

Learning high-level navigation behaviors has important implications: it ...

Indoor Point-to-Point Navigation with Deep Reinforcement Learning and Ultra-wideband

Indoor autonomous navigation requires a precise and accurate localizatio...

Vision-Based Goal-Conditioned Policies for Underwater Navigation in the Presence of Obstacles

We present Nav2Goal, a data-efficient and end-to-end learning method for...

End-to-End Partially Observable Visual Navigation in a Diverse Environment

How can a robot navigate successfully in a rich and diverse environment,...

Deep Reactive Planning in Dynamic Environments

The main novelty of the proposed approach is that it allows a robot to l...

Towards navigation without precise localization: Weakly supervised learning of goal-directed navigation cost map

Autonomous navigation based on precise localization has been widely deve...

Learning with Stochastic Guidance for Navigation

Due to the sparse rewards and high degree of environment variation, rein...

I Introduction

Assistive robots, last-mile delivery, warehouse navigation, and robots in offices are all applications that require robust robot navigation in dynamic environments (Fig. 1). While good methods exist for robot navigation sub-tasks such as localization and mapping [18], motion planning [22], and control [14], current local navigation methods typically must be tuned for each new robot and environment [8]. For example, vision-based navigation is robust to noisy sensors, but typically relies on high-level motion primitives such as “go straight” and “turn left” that abstract away robot dynamics [30]. On the other hand, motion planning handles robot dynamics, but often requires near-perfect sensor observations. For the reliable navigation, the two stacks needs to be integrated with each other which is difficult and brittle. Robot navigation at scale requires methods that create behaviors useful to existing navigation stacks, and easily transfer from simulation to physical robots and new large-scale environments with minimum tuning, while robustly avoiding obstacles given noisy sensors and actuators.

(a) Narrow corridor
(b) Unstructured dynamic
Fig. 1: Executing navigation behaviors on a Fetch robot: (a) path following in a narrow corridor (b) point to point navigation in an unstructured dynamic environment.

We define navigation behaviors as intelligent agents that depend on the robot’s sensors, actuators, dynamics and geometry, without relying on foreknowledge of its environment [2, 3]. Many robot navigation tasks can be accomplished using two basic navigation behaviors: local planning, i.e., point to point, which creates trajectories from robot’s current configuration to a given target [13], and path following, which stays near a given guidance path [9]. Both behaviors need to produce dynamically feasible trajectories that are: robust to many kinds of noise and inaccuracies, only rely on observations coming from primitive sensors, and avoid unexpected obstacles. In this paper, we present a reliable method to implement these navigation behaviors by learning end to end polices that directly map sensors to controls, and we show that these policies transfer from simulation to physical robots and new environments while exhibiting robust obstacle avoidance.

(a) Training, 23 by 18m
(b) Building 1, 183 by 66m
(c) Building 2, 60 by 47m
(d) Building 3, 134 by 93m
Fig. 2: (a) Training and (b, c, d) evaluation environments with 40 moving obstacles (white circles). The evaluation environments are 6.8 to 30.1 times bigger than the training, generated from real office building plans.

To do so, we rely on deep Reinforcement Learning (RL), which has shown great progress in visual navigation [35] and video games [29]. RL learns policies which map between observations and controls to maximize the cumulative return [34], and deep RL has enabled learning tasks that have been difficult or impossible to program manually, from matching human performance at Atari [29] to locomotion of humanoid figures [16]. In particular, we focus on DDPG [25], a state of the art continuous state and action deep reinforcement learning algorithm. However, training even simple tasks can fail if rewards are sparse - that is, if the success conditions of the true objective are hard to discover [4]; this is true for navigation in large spaces. Reward shaping [17, 33, 21] addresses this problem by introducing a proxy reward function that is less sparse than the true objective. However, poorly chosen shaped rewards can lead to pathologies where agents learn to exploit the reward function or to terminate the episode early [31]. Designing good reward functions is well understood for some areas (e.g., video games [29]), but for most tasks, including navigation, it remains a black art.

Even with a well-behaved reward function, many state-of-the-art deep RL methods for continuous state and action spaces, including DDPG, are unstable learners. To overcome this, hyperparameter tuning can help determine the best network architecture and algorithmic parameters for a class of problems. Gradient-free methods, such as Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

[19], can transform an algorithm that fails to converge into one that produces near-optimal policies. This strategy suggests learning the reward function from a space of possible reward functions using large-scale hyperparameter optimization over the true objective is a possibility.

As our solution to these problems, we present Shaped-DDPG, an algorithm that combines deep RL with gradient-free hyperparameter optimization to shape both the rewards and network architectures of basic navigation policies. Shaped-DDPG takes a parameterized reward function along with the true objective; it then uses large-scale hyperparameter optimization to shape the reward. With this reward fixed, it then optimizes network layer sizes to identify the most successful policies for the given navigation task. Trained in simulation in small environments with only static obstacles, Shaped-DDPG learns robust point to point and path following policies that cope with both static and dynamic obstacles.

We evaluate these policies in three large building environments (Fig. 2) with respect to various noise levels and number of moving obstacles. Our simulation results show the policies have much higher success rate, up to 83% and 56% higher over the baseline, respectively, and are robust to noise. In experiments on a real robot (Fig. 1), we observe reliable obstacle avoidance in dynamic environments and collision-free navigation along 80+ m paths.

Ii Related Work

Practical robot navigation typically uses a layered navigation stack [26]: given an occupancy map generated by a mapping algorithm such as SLAM [7], a global planner such as PRM [22] or RRT [24] finds a near-optimal path that a local planner (e.g., [14]) executes. This “stacked” approach has demonstrated reliable operation over long distances [27], but often requires manual tuning for each robot and environment [36]. Like these approaches, we assume reasonably accurate localization, but we couple sensing directly with controls.

A study that classifies planners for dynamic environments by methodology and required obstacle information

[10] reports that the typical methodologies include Artificial Potential Field (APF), velocity obstacle, and sampling-based approaches. The best performing planners such as DRT [9] are sampling-based methods that plan in robot’s state-time space [10]. However, these methods require knowledge of obstacle dynamics, which is difficult to obtain when the robot has limited sensors, like 1D lidar. RL was used for navigation in dynamic environments [12], but it assumed perfect obstacle and robot knowledge while here we use only a noisy 1D lidar.

Machine learning for navigation has seen a recent boom, particularly in point-to-point navigation with vision [30, 35]. Some approaches discretize state and action spaces to enable simple RL algorithms [1]

, whereas others learn low-level controllers end to end using deep learning

[8]. Our prior work on PRM-RL uses a point to point policy trained with DDPG as the local planner for PRM [13]. Like these approaches, we map from sensors to velocity control, but we traine with Shaped-DDPG that tunes the agent’s reward and network architectures to improve the policy performance.

Learning by optimizing multiple objective functions, rules, or perceptual features has proven effective for variants of the navigation problem [11, 6]; other work uses hierarchical reinforcement learning to break the problem apart into tractacble components [5]

. Unlike these methods, we learn the proxy reward function and neural network architecture instead of imposing it.

Iii Methods

To learn point to point and path following navigation behaviors, we chose continuous reinforcement learning with partial observations. Section III-A

describes the robot’s Partially Observable Markov Decision Process (POMDP) shared between both policies. Sections

III-B and III-C describe the true task objectives, specific rewards, and additional observations specific to each behavior. Section III-D presents the Shaped-DDPG algorithm.

Iii-a POMDP Setup

We model the agent as an tuple representing a partially observable Markov decision process (POMDP) with continuous observations and actions. The first three are determined by the robot. The observations, are

pairs of 1-D Lidar vectors,

and goal set, observed over the last steps. The agent is a unicyle robot and is controlled by a 2-dimensional continuous action vector, that encodes the robot’s linear and angular velocity. The dynamics, , is encoded in the simulator or implicit in the real world. The remaining factors encode the task: is a scalar discount factor, whereas the structure of the reward R is more complicated and we discuss it next.

The goal of a navigation agent is to complete a true navigation objective. For point-to-point navigation this is arriving at a goal location, while for path following this is traversing the entire path by reaching all its waypoints. We can formalize this problem as learning a policy that maximizes the probability of reaching the true objective,



where means that true objective is reachable from the state under control of policy At the same time, RL learns a policy that maximizes the cumulative reward. We could use the true objective as a reward, but it is sparse, and there are other requirements, such as dynamical feasibility, smooth trajectories, avoiding obstacles, and sensory/motor limitations. We can formulate these requirements as parameterized atomic rewards which provide more timely feedback to aid learning. To that end we represent rewards with


where is a parameterized atomic reward, and becomes a hyperparameter to be tuned.

Fig. 3: Training end-to-end policies with Shaped-DDPG. (right) Shaped-DDPG shapes the reward function w.r.t. task’s true objective with a fixed network architecture, then shapes the network architecture with a fixed reward function. (middle right) Shaped-DDPG runs parameterized vanilla DDPG agents in parallel. After each agent’s training is complete, its performance is evaluated w.r.t. the true goal objective and stored in the database. Gradient-free optimization selects the next set of reward and network parameters, and a new agent is spawned. At the same time, the agent outputs the current best policy. (left middle) Each individual agent is initialized with reward and network weights and trained with vanilla DDPG. (left) The actor and critic of the vanilla DDPG are multi-layered feed-forward networks with parameterized layer widths.

Iii-B Point to Point (P2P) Task

The goal of the P2P behavior is to navigate a robot to a goal position without collision. We assume the robot is well-localized using traditional methods. The P2P behavior can be used as a local planner for sampling-based planners [13] in order to navigate large environments. We require the agent to navigate to goals that are not within clear line of sight, but not far enough to require higher-level knowledge of the environment. For example, we expect it to navigate around a corner, but not to a room in a maze of hallways.

The true objective of P2P is to maximize the probability of reaching the goal during an episode,


where is an indicator function, is the goal pose, and the goal size. The goal observation is the relative goal position in polar coordinates, which is readily available from localization. The reward for P2P is:


where is a constant penalty step with value 1, is the negative Euclidean distance to the goal, is 1 when the agent collides with obstacles and 0 otherwise, is the negative angular speed, is the distance to the closest obstacle, and is 1 when the agent reaches the goal and 0 otherwise.

Iii-C Path Following (PF) Task

The goal of the PF behavior is to follow a guidance path represented by a sequence of waypoints in the workspace. We assume the guidance path is collision free, and can be generated by off-the-shelf path planners like PRMs or A*, or even manually, because it does not need to be dynamically feasible. In navigation stack terms, PF is trajectory tracking.

While PF can be used with higher-level planners, it has several complications. Since waypoints in the guidance path are intermediate goals that the agent should approach with a speed and orientation appropriate to reach the next point, the spacing and number of waypoints matters. In addition, when the agent veers off the path near a waypoint to avoid an obstacle, it becomes less clear if it should continue towards the next waypoint, or prioritize getting back on the path. We use Shaped-DDPG to tune these parameters as well.

Real-world navigation involves long guidance paths (100+ m) with many waypoints and varied separation, but varied input size and non-uniformity challenge neural networks. To address this problem, we linearly interpolate the original guidance path

to generate a new guidance path consisting of new waypoints with a constant separation between consecutive waypoints. The waypoint is considered reached when the robot is within of the waypoint and the previous waypoint is also reached.

The true objective of PF is to reach as many waypoints per episode as possible:


where the goal observation, , is a partial path consisting of the first un-reached waypoint and the next waypoints. For example, suppose , , , the robot is at and consists of , the observation is . The reward for PF is:


where is a constant penalty step with value 1. is the Euclidean distance to the first un-reached waypoint. is 1 when the agent collide with obstacles and 0 otherwise. is a penalty with value 1 when the the agent is within m away from obstacles and 0 otherwise.

Iii-D Training

With observations, action space, true objectives, and rewards defined, training a vanilla DDPG just requires selecting actor and critic network architectures. Network architecture affects the quality of the trained agent: the capacity of the network determines what the agent can learn. For actor and critic, we choose feed-forward fully-connected networks (Fig. 3, left most) and fix network depth, leaving the size of each layer as our tunable hyperparameter. We choose shallow and wide networks because their fast inference time makes them feasible for on-board high frequency robot control.

Let for

be a feed-forward fully-connected neural network with RELU units and

layers, where layer contains neurons. Denote the learnable weights of the feed forward network with The size of the layers plays role on how well an agent will perform.

Next, we formalize our DDPG parametrization. Let be a DDPG algorithm with actor , critic and reward given in (2). Denote as the true objective the trained actor achieved for the corresponding

We automate DDPG hyperparameter selection with off-the-shelf gradient-free optimization [15]. The search space grows exponentially in the tuning parameters, so we split shaping in two phases, reward shaping and network shaping. First, we find the best reward function w.r.t. task’s true objective for a fixed actor and critic. Then, we find the best actor and critic w.r.t. to the previously selected reward.

For reward shaping, actor and critic network shapes are fixed sizes where is a closed interval in n-dimensional space bounded by points We run trials, at most in parallel. At each trial , we initialize the reward function from based on all completed trials according to a black-box optimization algorithm [15]. The first trials select reward weights randomly. Next, we train an asynchronous instance of After the agent is trained, its policy is evaluated in accordance to the true task objective (1). For P2P that is (3) and for PF it is (5). Upon completion of all trials, the best reward

corresponds to the trial with the highest true objective.

Next, we repeat a similar process to find the best actor and critic w.r.t. to . In this case, the optimization objective is to maximize the cumulative reward. This time, at each trial we train asynchronously and evaluate the objective w.r.t. (2). For P2P that is Eq. (4), and for PF it is Eq. (6). Lastly the best actor and critic widths corresponds to the trial with the best return,

and the final policy trained by Shaped-DDPG is

Fig. 3 outlines the Shaped-DDPG algorithm. It shapes parameterized DDPG in two phases, first finding the best reward while keeping the actor and critic fixed, and then finding the best actor and critic for the best reward. The shaping runs DDPG agents with different parameterization in parallel, for up to total trials. Each agent’s performance is evaluated w.r.t. the true goal objective for reward shaping, and cumulative reward for the network shaping. The next set of reward and network parameters is selected with a blackbox gradient-free optimization method to maximize the objective, and a new agent is spawned. The training stops when all the trials are completed, and the most fit agent is accepted.

Iv Results

Iv-a Setup

The training environment, generated from a real building floor plan, is 23m by 18m and contains no moving obstacles (Fig. (a)a

). The simulated differential drive robot is a circle with a 0.3 m radius, with a 64 beam 1D lidar with 220 degrees field of view with Gaussian distributed noise

(Fig. 3). The robot’s action space are the linear and angular velocities with bounds of m/s and radian/s, respectively. To simulate robot process noise, we added Gaussian distributed process noise and to both the linear and angular velocities. We assume the robot can localize itself but has a Gaussian distributed localization noise . Unless otherwise specified, the noise is set to m, m/s, rad/s, m.

The actor networks (Figs. 3 leftmost) are three-layers deep for both tasks. The critic consists of a two-layer observation embedding then joined with the action network by two fully connected layers. The inputs to the actor and critic are observations from the last 3 time-steps concatenated together. The agent operates at Hz. The training episodes for both P2P and PF tasks are 100 seconds long. In the evaluation, we extend the episode durations to accommodate longer trajectories. For the P2P task, the goal size is 0.5 m.

We train the P2P and PF agents for 1000 trials, running 100 agents in parallel, each for 5 million training steps. Each agent takes about 12 hours to train, with the complete Shaped-DDPG completing in several days. We use Vizier [15] with Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [19] for the hyperparameter tuning. For reproducibility, the parameters found by Shaped-DDPG are included at the end of our video111https://youtu.be/Fs0Mvl8IhrE.

We compare Shaped-DDPG to two methods on the opposite sides of the spectrum. First are Artificial Potential Fields (APF) [23], which we chose because it is popular, well compared [10] and utilizes only clearance information which can be obtained from 1D lidar. Many popular motion planners for dynamic environments compared in [10] are not suitable since they require knowledge of obstacle velocity or dynamics while we use only a 1D lidar sensor. To avoid the local minima problem that often plagues APF methods, we implemented path-guidance similar to [9], where the attractive potential is computed along a guidance path. Our second comparison method is hand-tuned DDPG [25].

We test the Path Following and P2P policies in three, previously unseen large environments (Figs. (b)b, (c)c, (d)d). For the point to point policy the start and goal are randomly selected to be between 5 and 10 maters apart. For the path following, the starts and goals are randomly chosen requiring at least 35 meters Euclidean distance between a start and goal. A guidance path is generated by a PRM once the start and goal positions are chosen.

Iv-B Training

(a) Reward shaping generations.
(b) Reward shaping best policy learning curve
(c) Network shaping generations.
(d) Network shaping learning curve.
Fig. 4: Results of Shaped-DDPG for the path following policy.

Fig. 4 shows the impact of the Shaped-DDPG on the path finding task. Across both reward and network shaping there are certain parameter values where the agent does not perform well (Figs. (a)a and (c)c), although in during the network shaping there are fewer bad trials. The learning curve of the non-shaped objective resembles that the one depicted in Fig. (d)d. The best agent after reward shaping (Fig. (b)b) shows signs on catastrophic forgetting that plagues DDPG training. Notice, that the best agent after both shaping phases (Fig. (d)d) does not exhibit such forgetfulness, and the training could have been terminated sooner.

The path following shaped policy’s true objective of 0.56, while non-shaped policy reaches objective of 0.26. The point to point agent trained with Shaped-DDGP has training objective of 0.90, while the hand-tuned one is 0.54. This is a promising result for the utility of Shaped-DDPG.

Iv-C Generalization to New Environments

(a) Success rate
(b) Path length
Fig. 5: Path following: Shaped-DDGP (light green, labeled S), hand-tuned DDPG (dark green, labeled NS), and APF (blue, labeled APF) path Following (a) success rate and (b) path length. B1, B2, B3 stands for building 1 to 3, respectively.

Iv-C1 Pf

Shaped-DDPG has near perfect success rate across all three environments (Fig. (a)a), and is the only method that can transfer to unseen, large, real building sized environments. The hand-tuned policy has lower success rate (up to 22% lower than Shaped-DDPG), while APF has the lowest success rates across all buildings (up to 85% lower). For the APF method we observed that the failures are caused by collision with the wall and getting stuck in local minima despite the use of a guidance path. Path guidance for APF methods does not completely solve the local minima problem as the path can be very close to obstacles, creating local minima [9]. The average path length for successful run in these environments ranges from 42 to 91 m (Fig. (b)b), which is representative to typical navigation scenarios in office buildings. APF trajectories are the shorest, due to APF succeeding only for short guidance paths. We will discussion this further in Section IV-D.

(a) Shaped-DDPG
(b) Handed tuned DDPG
Fig. 6: Sample trajectories (red) and guidance paths (black) of the Path Following policy found by (a) Shaped-DDPG and (b) hand-tuned DDPG.

Curiously, the hand-tuned policy creates longer trajectories length than shaped one. Inspecting the trajectories (Fig. 6), it is clear that DDPG learns completely different behaviors under the two parameterizations. The shaped policy exhibits smooth, forward moving behavior while the hand-tuned policy alternates between forward and reverse motion, resulting in a twirling behavior with longer path length. Since the 220 degree field of view lidar cannot detect obstacles in the back, the twirling behavior has a lower success rate.

(a) Success rate
(b) Finish time
Fig. 7: P2P: Shaped-DDPG (red), hand-tuned DDPG (dark red), and APF (blue) (a) success rate and (b) path length.

Iv-C2 P2p

The P2P generalization results are consistent to the path following’s. The shaped policy has highest success rates (up to 56% higher than non-shaped and up to 41% higher than APF) across all environments, and exhibits smooth forward motion, while the failure cases are mostly due to complex static obstacles. APF has slightly higher success rates than the non-shaped policy (Fig. (a)a). Once again, the non-shaped policy exhibits twirling behavior, leading to low success rate and high path lengths (Fig. (b)b).

Iv-D Moving Obstacle Avoidance

We evaluate the shaped PF and P2P policies in a large environment (Fig. (c)c) among up to 40 moving obstacles. The moving obstacles motion follows the social force model (SFM) [20] to mimic pedestrian motion. We use a path guidance similar to [9], to create a guidance path for each moving obstacle and compute the desired velocity component in SFM, alleviate the SFM limitation of moving obstacles getting trapped in local minima [32].

Iv-D1 Pf

Shaped-DDPG policy outperforms APF in all numbers of obstacles (Fig. (a)a). Note that despite the SFM avoids collision with the robot, simply executing the path using APF without reacting to obstacles (no repulsive potential) results in no success. This implies the robot must also part-take in collision avoidance with moving obstacles.

(a) Path Following
(b) 0 moving obstacles
(c) 20 moving obstacles
Fig. 8: Success rate over (a) number of moving obstacle, (b) guidance path length with no obstacles, and (c) guidance path length with 20 moving obstacles for Path Following (green bars) and APF (blue).

PF policy success is uniform across path guidance lengths and decreases sligthy in the presence of moving obstacles, while APF degrades rapidly with guidance path length (Figs. (b)b and (c)c).

Iv-D2 P2p

(a) P2P, 0 moving obstacles
(b) P2P, 20 moving obstacles
Fig. 9: P2P policy (red) and APF (blue) success rates (a) without and (b) with 20 moving obstacles over distance to the goal.

Unlike Path Following, P2P policy’s success degrades with the path length increase (Fig. 9) and its success rate is not affected with the number of obstacles (Fig. (b)b). At the same time the APF’s behavior is similar to the path following case. This indicates that the the main source of failure of P2P policy is the inability to navigate among complex static obstacles rather than collision with moving obstacles. This is not surprising, as it is a behavior that it was not designed to do.

Iv-E Robustness to Noise

Fig. 10 isolates one noise source at a time in order to investigate the impact on performance in an environment with 20 moving obstacles. Both PF and P2P policies policies are resilient to noise, even when the lidar noise is more than three times the radius of the robot ( m). On the other hand, lidar and process noise heavily influence the success of APF, as expected, because APF uses obstacle clearance to compute the repulsive potential. In addition, the action is computed by taking the gradient of the potential, such greedy approach often guides the robot to collision or local minima.

(a) Path Following LiDar noise
(b) P2P LiDar noise
(c) Path Following Localization
(d) P2P Localization
(e) Path Following Process
(f) P2P Process
Fig. 10: Success rate over (top) lidar, (middle) localization noise, and (bottom) action noise. (left) Path Following compared to the APF (blue bars) and (right) P2P compared to APF navigating from random starts to goals, 5-10 m apart.

Iv-F Physical Robot Experiments

We deployed the PF and P2P policies on the Fetch robot [28]. The localization and orientation observations are provided by the ROS navigation stack. The enclosed video demonstrates the Fetch robot executing these policies.

(a) Path following
(b) P2P
Fig. 11: (a) Path Following and (b) P2P on a real robot in an office building. The actual robot trajectories (magenta), the guidance path (black) and the trajectory in simulation (green for Path Following and red for P2P) overlaid over the 2D lidar map.

First, we investigate the difference between simulation and reality. For the PF task (Fig. (a)a), we manually specify a sequence of waypoints (black dots) as the guidance path (80.6 m in length). For the P2P task (Fig. (a)a) start and goal are about 13.4 maters apart. We ran the experiment three times for each policy, for a total of 300 meters. The robot reaches the goal without collision in all three runs. The magenta curve in Figs. (a)a and (b)b shows one of the three actual robot trajectory. These trajectories are very close to simulated ones (green for PF and red for P2P) but exhibits more turning behavior than simulation. This may be caused by the delay between sensing and action execution.

Next, we tested the policy’s performance in a very narrow corridor (only 0.3 m wider than the robot, which is 0.3 cm in diameter) and with moving obstacles qualitatively. We executed the PF policy in the very narrow corridor shown in Fig. (a)a four times with a guidance path specified roughly at the center of the corridor. We found that the robot can reliably navigate in the corridor all four times. We also executed the P2P policy in the in a highly un-structured dynamic environment with a person playing with a dog (Fig. (b)b). We found the robot stops and avoids obstacles reliably. Notice in the attached video how the agent adapts when blocked, moves away from the goal, and around the obstacles. Overall, the on-robot experiments totaled about 500 meters, and the only failure case we observed was caused by the obstacles undetectable by the 1D lidar such as human feet and dog tails since they are lower than the lidar.

V Discussion

Shaped-DDGP is not sample efficient, the 1000 agents were each trained by experience totaling at 12 days of simulation time, or 32 years of collective experience. However, it learns high-quality navigation behaviors that transfer well across environments and are easy to deploy on the robot. For navigation agents, the extra training cost can be justified to provide better quality policies.

The end to end learning with Shaped-DDPG effectively creates a tightly-coupled vertical navigation stack, combining portions of the perception, planning and controls. The end to end POMDP set-up enables robustness to noise. The comparison with the two agents on the opposite side of the spectrum, traditional APF and hand-tuned DDPG agents, highlights the benefits of the end to end learning. Hand tuned DDPG agent, although produces sub-optimal trajectories, is still robust to noise. The traditional APF method is not designed for the tight integration between the controls and sensor, and is brittle in the presence of the noise.

The robustness to noise has an additional perk of overcoming the guidance path imperfections. We have observed during the hand-tuned training that path following is very sensitive to the waypoint spacing and determining appropriate waypoint radius. It is only after including these parameters into the reward tuning that the path following agents learn high-quality behaviors. This is likely due to the tuning finds the optimal distance from a waypoint w.r.t. to robot’s noise and abilities, essentially deciding to give a credit to reaching a waypoint for what is feasible on the robot.

The P2P policy found by Shaped-DDPG is more robust to local minima than APF. It is likely due to the agent learned that following a wall usually leads to completing the goal. As demonstrated in the enclosed video, the agent is also willing to move away from the goal to avoid local minima. The failure point of P2P policy remains the inability to avoid large-scale local minima, such as moving between rooms, as these behaviors are not what it was not designed to do.

Vi Conclusions

This paper presents two learning building blocks for end to end navigation at scale, P2P and PF behaviors. The policies are trained using Shaped-DDPG, an adaptation of DDPG algorithm that automatically tunes reward and network architecture parameters. The resulting policies, although computationally expensive to train, exhibit more desirable behaviors compared to DDPG with hand-crafted hyperparameters and an APF-based method. They generalize to new environments with moving obstacles, are robust to noise and are deployed to a physical Fetch robot without tuning.


The authors would like to thank J. Chase Kew, Oscar Ramirez, Alexander Toshev, and Vincent Vanhoucke for helpful discussions.


  • [1] N. Altuntaş, E. Imal, N. Emanet, and C. N. Öztürk. Reinforcement learning-based mobile robot navigation. Turkish Journal of Electrical Engineering & Computer Sciences, 24(3):1747–1767, 2016.
  • [2] E. A. Antonelo and B. Schrauwen. On learning navigation behaviors for small mobile robots with reservoir computing architectures. IEEE Transactions on Neural Networks and Learning Systems, 26(4):763–780, 2015.
  • [3] R. C. Arkin. Behavior-based robotics. 1998.
  • [4] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Proc. of Neural Information Processing Systems, pages 1471–1479, 2016.
  • [5] B. Bischoff, D. Nguyen-Tuong, I. Lee, F. Streichert, A. Knoll, et al. Hierarchical reinforcement learning for robot navigation. In Proc. of European Symposium on Artificial Neural Networks, 2013.
  • [6] W. Böhmer. Robot navigation using reinforcement learning and slow feature analysis. CoRR, abs/1205.0986, 2012.
  • [7] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. Transactions on Robotics, 32(6):1309–1332, 2016.
  • [8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In

    Proc. of IEEE International Conference on Computer Vision

    , pages 2722–2730, 2015.
  • [9] H.-T. Chiang, N. Malone, K. Lesser, M. Oishi, and L. Tapia. Path-guided artificial potential fields with stochastic reachable sets for motion planning in highly dynamic environments. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 2347–2354, 2015.
  • [10] H.-T. L. Chiang, B. HomChaudhuri, L. Smith, and L. Tapia. Safety, challenges, and performance of motion planners in dynamic environments. In Proc. of International Symposium of Robotics Research, pages 1–16, 2017.
  • [11] F. Fathinezhad, V. Derhami, and M. Rezaeian. Supervised fuzzy reinforcement learning for robot navigation. Applied Soft Computing, 40:33–41, 2016.
  • [12] A. Faust, H.-T. Chiang, N. Rackley, and L. Tapia. Avoiding moving obstacles with stochastic hybrid dynamics using pearl: preference appraisal reinforcement learning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 484–490. IEEE, 2016.
  • [13] A. Faust, O. Ramirez, M. Fiser, K. Oslund, A. Francis, J. Davidson, and L. Tapia. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), page to appear, 2018.
  • [14] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robotics Automation Magazine, 4(1):23–33, Mar 1997.
  • [15] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In Proc. of ACM International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.
  • [16] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. In International Conference on Machine Learning, pages 1311–1320, 2017.
  • [17] M. Grzes and D. Kudenko. Multigrid reinforcement learning with reward shaping. In Proc. of International Conference on Artificial Neural Networks, pages 357–366, 2008.
  • [18] R. L. Guimarães, A. S. de Oliveira, J. A. Fabro, T. Becker, and V. A. Brenner. ROS navigation: Concepts and tutorial.
  • [19] N. Hansen, A. Ostermeier, and A. Gawelczyk. On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. In

    Proc. of the International Conference on Genetic Algorithms

    , pages 57–64, 1995.
  • [20] D. Helbing and P. Molnar. Social force model for pedestrian dynamics. Physical review E, 51(5):4282, 1995.
  • [21] K. Judah, A. P. Fern, P. Tadepalli, and R. Goetschalckx. Imitation learning with demonstrations and shaping rewards. In C. E. Brodley and P. Stone, editors, AAAI, pages 1890–1896, 2014.
  • [22] L. E. Kavraki, P. Švestka, J. C. Latombe, and M. H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Automat., 12(4):566–580, 1996.
  • [23] O. Khatib. Real-time obstacle avoidance for manipulators and mobile robots. The international Journal of Robotics Research, 5(1):90–98, 1986.
  • [24] S. M. LaValle and J. James J. Kuffner. Randomized kinodynamic planning. The International Journal of Robotics Research, 20(5):378–400, 2001.
  • [25] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • [26] D. V. Lu, D. Hershberger, and W. D. Smart. Layered costmaps for context-sensitive navigation. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 709–715, 2014.
  • [27] E. Marder-Eppstein, E. Berger, T. Foote, B. Gerkey, and K. Konolige. The office marathon: Robust navigation in an indoor office environment. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 300–307, 2010.
  • [28] D. K. E. D. Melonee Wise, Michael Ferguson and D. Dymesich. Fetch & freight: Standard platforms for service robot applications. In

    Workshop on Autonomous Mobile Service Robots held at the 2016 International Joint Conference on Artificial Intelligence

    , 2016.
  • [29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [30] A. Mousavian, T. Alex, M. Fiser, J. Kosecka, and J. Davidson. Visual representations for semantic target driven navigation. In (submitted to) ECCV, 2018.
  • [31] M. Olsen, N. Siegelmann-Danieli, and H. Siegelmann. Robust artificial life via artificial programmed death. Artificial Intelligence, 172(6):884 – 898, 2008.
  • [32] J. H. Reif and H. Wang. Social potential fields: A distributed behavioral control for autonomous robots. Robotics and Autonomous Systems, 27(3):171–194, 1999.
  • [33] P. Su, D. Vandyke, M. Gašić, N. Mrkšić, T. Wen, and S. Young.

    Reward shaping with recurrent neural networks for speeding up on-line policy learning in spoken dialogue systems.

    In Proc. of Special Interest Group on Discourse and Dialogue, pages 417–421, 2015.
  • [34] R. Sutton and A. Barto. Reinforcement Learning: an Introduction. MIT Press, 1998.
  • [35] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. Value iteration networks. In Proc. of Neural Information Processing Systems, pages 2154–2162, 2016.
  • [36] K. Zheng. Ros navigation tuning guide. arXiv preprint arXiv:1706.09068, 2017.