Track Advancement of SLAM 跟踪SLAM前沿动态【IROS 2019 SLAM updated】
We present a navigation system that combines ideas from hierarchical planning and machine learning. The system uses a traditional global planner to compute optimal paths towards a goal, and a deep local trajectory planner and velocity controller to compute motion commands. The latter components of the system adjust the behavior of the robot through attention mechanisms such that it moves towards the goal, avoids obstacles, and respects the space of nearby pedestrians. Both the structure of the proposed deep models and the use of attention mechanisms make the system's execution interpretable. Our simulation experiments suggest that the proposed architecture outperforms baselines that try to map global plan information and sensor data directly to velocity commands. In comparison to a hand-designed traditional navigation system, the proposed approach showed more consistent performance.READ FULL TEXT VIEW PDF
In this paper, we present a goal-driven autonomous mapping and explorati...
How can a delivery robot navigate reliably to a destination in a new off...
Autonomous navigation in dynamic environment heavily depends on the
The problem of autonomous navigation is to generate a set of navigation
We exploit the complementary strengths of vision and proprioception to
We present NEO, a fast and purely reactive motion controller for manipul...
This paper considers the integration of gap-based local navigation metho...
Track Advancement of SLAM 跟踪SLAM前沿动态【IROS 2019 SLAM updated】
Autonomous robot navigation in known environments encompasses two main problems: 1) finding a safe path for a robot to reach a desired goal location, and 2) following the path while adapting to environmental conditions . While global planners can efficiently find optimal motion paths , translating these paths into robot commands – which is traditionally the job of a reactive controller – can be challenging. Reactive controllers not only need to take into account kinodynamic constraints , but also consider plan execution and adaptation to the environment, e.g., to avoid obstacles  and respect social conventions  (Fig. 1).
At the core of classical approaches to reactive control is a hand-designed objective function that must balance navigation criteria to output motion commands [3, 6, 7]. While successful in simple situations, these approaches can be difficult to tune for dynamic human environments. One reason is that relevant criteria, like social norms, are hard to define mathematically. Even when models exist, complex interactions may emerge between model parameters and the resulting navigation behavior . Some navigation criteria may even be contradictory at times, e.g., reaching a goal in a crowded environment without violating personal space.
to improve reactive robot control. Our approach does not require hand-specifying all the parameters of the reactive controller; instead, most parameters are optimized based on example navigation data through imitation learning[14, 15, 8]. Similar to , we assume that localization information is available during robot operation, and focus on studying mechanisms to combine high-level planning and learning for low-level motion control. In contrast to , though, we use learning not only for controlling robot velocities, but also for predicting a local motion plan, which guides the velocities output by our approach. Our main insight is that by adding structure to the learning component of our system we can guide the learning process to find an appropriate complex mapping from the high-level plan to motion commands, without incurring in additional annotation costs. As suggested by our experimental evaluation, the added structure can improve overall navigation behavior in comparison to mapping a goal or a global plan directly to commands [10, 11]. Predicting a local plan also facilitates system interpretability upon execution.
The proposed local trajectory planner and velocity controller of our approach adjust the behavior of the robot through simple, deep attention mechanisms. These mechanisms enable the robot to dynamically focus on different tasks: obstacle avoidance, social interaction, or following navigation plans. As a result, the robot is able to account for changing elements in the environment. It behaves in a manner that resembles complex rules that govern human use of space [16, 17].
In summary, this work has several main contributions. First, we introduce a new system for autonomous navigation which combines planning and learning. The system’s learning component predicts a local plan and motion commands. Second, we propose an attention mechanism for multimodal data fusion. Finally, we conduct controlled experiments on a simulated platform to evaluate the proposed system.
Autonomous navigation has long been studied within robotics. Early navigation methods focused on path planning , while more recent approaches tend to leverage machine learning to make navigation systems less brittle to new environmental conditions [19, 20]
. In some cases, machine learning is used to create systems that improve as they explore more of the environment, e.g., with reinforcement learning or via knowledge transfer in lifelong learning settings . In other cases, machine learning is used to model preferences for the navigation task with the help of a teacher . For instance, prior approaches have used structured prediction  or inverse reinforcement learning  to find cost functions that encode preferences for navigating through parts of an environment. Closer to our work, some methods have focused on directly learning motion policies with imitation learning, e.g., [14, 15, 8]. As discussed in , directly mapping states to actions can be more efficient than learning a cost function for imitation. Due to limited space, we encourage readers interested in more details of these different approaches to refer to [18, 20, 21, 23, 27].
, we use deep learning to parameterize a motion policy. This approach allows us to forgo hand-engineered features for sensor data. In contrast to most of these efforts, though, we do not aim to solve the whole navigation problem with a single function trained in an end-to-end fashion. Instead, we leverage planning in conjunction with deep learning for autonomous navigation.
Inspired by IntentionNet , we use a global planner to solve for the general direction that a robot should follow to reach a desired destination in a known environment, as well as use deep learning for motion control. But different to , (1) our approach explicitly considers the presence of people nearby the robot, (2) processes raw lidar measurements instead of RGB images, (3) represents global plans via trajectories, and (4) enforces additional structure on the learning component of the navigation system. Our rationale behind these considerations are as follows. First, providing information about people’s motion directly to our navigation system facilitates interactions in human environments. Second, providing raw lidar measurements to our system reduces the complexity of the input space and facilitates system development through simulation in comparison to using raw images. Lidar can also help with obstacle avoidance, as in , because it measures depth directly and typically has a wider field of view than cameras. Third, representing global plans via trajectories, instead of rendering them on maps, reduces even further the dimensionality of the input space. Fourth, adding structure to the learning component of our navigation system offers an opportunity to add supervision and facilitate interpretability. Our approach does not aim to build an environment map as the robot navigates , but assumes that a map is given for high-level, global planning.
The study of how people use and share space, or proxemics , is relevant for social robot navigation, e.g., see  for a recent review. Generally, most methods for social navigation explicitly model interactions among agents or social conventions [5, 31, 7, 32, 33, 34, 35, 36, 37, 38, 39]. Other methods, like ours, implicitly model these aspects and let navigation strategies emerge through imitation [11, 40, 41].
We leverage simulations to facilitate training motion policies and conduct system evaluations. Similar to work by Tai et al.  and Müller et al. , we avoid providing complex sensory input, like images, to our method. Instead, our deep model takes as input lidar and information derived from complex signals, like people’s position nearby the robot.
Building on a long tradition in robotics [43, 44, 45, 46, 11], we use a hierarchical approach for locomotion. Our particular hierarchy is composed of three levels: Global Planning, Local Planning, and Velocity Control. As shown in Fig. 2, the Global Planner takes as input a 2D map of the environment with static obstacles, e.g., walls, the current location of the robot, and the coordinates of a desired destination (or goal) in the map. Based on these inputs, the global planner computes a path in the map for the robot to reach the desired goal location. The Local Planner then focuses on a shorter time horizon: what are the immediate navigation steps that the robot should take to follow the global plan in an acceptable manner given observed environmental conditions? These conditions include dynamic elements, such as people, as well as static obstacles sensed in the vicinity of the robot. Finally, the Velocity Controller commands motions to the robot based on the navigation plans.
Explicitly or manually defining socially appropriate navigation behavior is hard. Human behavior is malleable and social norms often change from one social context to another [16, 17]. Thus, we resort to imitation learning to model appropriate behavior. Our main assumption in this work is that expert demonstrations are available to learn what constitutes good navigation patterns for a robot.
We implement the Local Planner and the Velocity Controller of our navigation approach as Neural Networks and optimize their parameters based on expert motion data. While it would have been possible to also implement the Global Planner as a differentiable function and optimize for the whole system in a complete end-to-end fashion, we opt for combining machine learning with traditional planning in this work. This combination aims to take advantage of the benefits of both approaches, while keeping the implementation of our system practical. We use planning because it is fast and reliable for computing obstacle-free routes in static environments, and we use deep networks because they have a great capacity for reasoning about raw sensor data and modeling complex phenomena like social behavior [48, 49]. Note that the learning component of our hierarchical navigation approach has more structure than similar prior work [10, 11]. We use a differentiable Local Planner to provide additional supervision to the learning component of our system and facilitate interpreting its execution.
As a first step when navigation starts, our system computes a collision free path from the initial location of the robot to the desired goal location in a static map of the environment. The path is computed with Dijkstra’s shortest path algorithm  and saved as a reference during navigation.
The Global Planner of our system provides a down-sampled version of the full global plan to the Local Planner. Down-sampling preserves the desired motion direction and highlights the relevant part of the global plan for successive processing. Let the output of the Global Planner be a down-sampled path , with each a waypoint belonging to the original full plan. We enforce that the first waypoint in is the closest waypoint in the full plan to the current location of the robot, and that the remaining waypoints are at least 0.5 meters from the previous one in . If not enough waypoints that satisfy these constraints remain in the full plan, then the goal is used instead.
Let the inputs to the local planner be the down-sampled global plan , the lidar range measurements , the robot’s linear and angular velocities from odometry, and a set of observed human trajectories with the 2D coordinates of person for the last time-steps relative to the current pose of the robot. Features for these inputs are computed by:
The models projects into a higher dimensional space through two fully connected ReLU layers, similar to the Global Plan features. The result is a feature vector .
In the spirit of , our model uses a PointNet network  to compute features for a set of past human trajectories. This choice frees us from needing to specify an order in which human coordinates should be input to our model. More specifically, the local planner organizes observed human trajectories into a matrix
, where each row corresponds to the data of one person. The rows are independently projected into a higher dimensional space of 1024 dimensions through a three layer perceptron with ReLU activations. The resulting matrix in
is then applied a max pooling operation and flattened to convert it into a vector inthat encodes information about the motion of the people nearby the robot. This data is finally transformed by a fully connected layer with ReLU activation into the feature vector , which has the same dimensionality as the other features.
In initial experiments, we considered combining the above features through concatenation [36, 11] or outer product operations [53, 54]. We found, however, that it was hard for the robot to pay attention to multiple sources of information with concatenation. For example, the robot would follow the global plan and ignore obstacles observed through its lidar in some cases. We also found that, as in , the outer product led to very high-dimensional features that made training prohibitively expensive or prone to over-fitting.
Fusion of Input Features. Inspired by prior work in text translation , we propose to fuse input features with a simple concatenation-based attention mechanism:
with coefficients that reflect the relative importance of the features. We compute the attention coefficients by first concatenating the input features and transforming them into a higher dimensional space, with a parameter matrix. Then, the desired coefficients are given by with trainable parameters.
The Local Planner finally generates a plan by transforming the combined features
through 3 fully connected layers with 512, 256 and 64 units and ReLU activations, and one final linear transformation. The plan is an ordered set of posessampled at 1Hz that represent a time-dependent trajectory that the robot could follow to adapt to the dynamic environment within the next 5 seconds. guides the robot in the direction of the global plan, and encodes temporal information to enable variations in speed as necessary while navigating.
We provide the Velocity Controller three inputs: , the lidar features , and the global plan features computed during local planning. While might suffice in cases where the prediction of the local plan is appropriate for the current environmental conditions, there might be cases when is problematic, e.g., bringing the robot close to collision due to a prediction error, or getting the robot stuck due to local minima. In these cases, range information and the global plan can help the robot avoid obstacles in close proximity and navigate towards the goal.
The Velocity Controller first computes features for the local plan by projecting it to a higher dimensional space through two fully connected ReLU layers with and units, respectively. The result are local plan features with the same dimensionality of features and .
The Controller combines features with attention, as in eq. (1), except that only three attention coefficients are needed in this case. Let these coefficients be denoted by , , . Then, the combined features .
Finally, the Velocity Controller projects the combined features
to a 2D space through 2 fully connected ReLU layers with 512 and 128 units, followed by a fully connected ReLU layer and a linear activation function. The two outputs are the linear velocity () and angular velocity () that command the robot to move as desired, as shown in Fig. 2.
The joint network of the Local Planner and the Velocity Controller can be seen as a policy that maps states into actions based on examples from an expert. The state is composed of the inputs to the Local Planner; the actions are the low-level, velocity commands that are output by our navigation system. Under this formulation, our objective is to compute a policy that minimizes the expected loss with respect to the expert policy under the distribution of states induced by the expert:
This particular view of the imitation problem follows close related work [10, 11] and is typically known as Behavioral Cloning [14, 8, 40].111It is possible to frame the objective as finding a policy that minimizes the loss under its distribution of states . Even though this is computationally more expensive, prior work suggests that it can improve imitation [57, 58]. This alternative formulation is interesting future work.
Our main assumption is that expert motion trajectories are available for supervised learning.
We learn the parameters of the Neural Networks of our system through back-propagation with three Adam optimization procedures  aimed at providing an easier path to learning, similar to curriculum learning  or shaping 
. Unless otherwise noted, we use L2 loss with a batch size of 300 samples. We pick the best parameters after 100 epochs.
Training Procedure. First, we compute lidar features with the help of the navigation model proposed by Pfeiffer et al. . We train their model by minimizing the loss on robot velocity. We then load the pre-trained weights for into the Residual Network of our Local Planner (Fig. 2) and optimize for the rest of this component. Our insight is that we can use expert motion trajectories to supervise the Local Planner without collecting additional data. Our ground truth for the local plan at a given time-step is the pose of the robot at s, s, etc., in the future, which is readily available after collecting expert trajectories. Finally, we load the Local Planner into a network with all of the learning components of our navigation system, and train the Velocity Controller.
We use a simulated JackRabbot 2 robot for data collection and evaluation. The robot is a differential-drive mobile manipulator with a forward-facing 2D SICK lidar and an Occam stereo camera, among other sensors. JackRabbot’s software stack uses the Robot Operating System (ROS)  for inter-process communication and logging, OpenSlam’s Gmapping for creating environment maps 
, ROS’s Adaptive Monte Carlo Localization algorithm for pose estimation, and YOLO  and DeepSort  for visually detecting and tracking pedestrians all around. Pedestrian tracks are converted from 2D to 3D with calibrated cameras and lidars.
We use the Gazebo simulator for our experiments. To test navigation algorithms on the simulated robot, we use the robot’s true pose from Gazebo instead of its localization algorithm, the true pedestrian motion relative to the robot instead of its vision pipeline, and a simulated forward-facing lidar. These changes allowed us to systematically study robot behavior without potential confounds due to perception errors.
To evaluate navigation in varied scenarios, we added obstacles to the simulation, e.g., 0.5-0.75 m boxes and cylinders, and implemented a Social Forces model  for the pedestrians in Gazebo. The Social Forces model enables each pedestrian to adapt their path to reach navigation sub-goals.
Overall, we consider 7 different environmental conditions in our experiments (Table I). Six conditions (E1-E6) use the same map of the foyer of a university building, which is depicted in Fig. 1. The other condition uses the map of a laboratory environment. We use four different environmental conditions for training or tuning parameters (E1-E4) and four for testing (E4-E7). In particular, we use E5 and E6 for testing generalization to new environmental conditions in a known map, and E7 for testing generalization to a new environment. Each scenario indicated under the “# Train” and “# Test” columns of Table I corresponds to a different pair of start-end locations for the robot, and navigation sub-goals for the pedestrians. The scenarios used for testing in E4 are different than those used for training or tuning methods.
|Env.||Map||# G||# SP||# MP||# Train||# Test|
We collect a dataset of 1446 expert motion trajectories – each corresponding to a different scenario – for the robot with Gazebo. The robot was tele-operated with a gamepad controller for 600 scenarios in E1-E4. The remaining 764 expert motion trajectories were selected from a set of runs of the ROS’ Navigation Stack with social costs enabled 
in E4, as tele-operation is time consuming and deep learning benefits from big amounts of data. We tried various approaches for combining tele-operated and auto-generated data, e.g., transfer learning, but found that using all of it for training at once led to best performance in general.
We compare our system with three baselines:
Deep controller that maps lidar range measurements and a 2D navigation sub-goal to velocity commands . The sub-goal moves along the global plan. It is generally 2 m ahead of the robot until JackRabbot approaches the destination and the sub-goal becomes the goal. Input features are fused with concatenation as in .
The GC and TC baselines can be considered ablation models of the proposed deep local trajectory planner and velocity controller. These baselines use the same lidar features and global plan features as the proposed system.
For a given set of test scenarios, we consider the following metrics:
Total running time for all the scenarios.
Total distance that the robot traversed considering all the scenarios.
Average linear velocity that the robot reported considering all the scenarios.
Percentage of the scenarios in which the robot reached the goal.
Percentage of the scenarios in which the robot failed catastrophically and tipped over, e.g., after a collision.
Number of events in which the robot’s lidar sensed an obstacle closer than 0.3 m. The count considers all scenarios; it is not averaged.
Number of events in which the robot collided with a pedestrian (their distance was less than 0.5 m). The count considers all scenarios.
Number of events in which the robot violated personal space  and was less than 1.2 m from a person. The count considers all scenarios.
We conduct a survey to gather qualitative perceptions of navigation performance. Participants rate navigation behavior on three 5-point Likert scales: Aggressive, Natural, and Efficient navigation.
We consider that the robot reached the goal if its distance to the destination is less than 0.5 m. For all the scenarios, we limit the time that the robot can take to reach the destination to 1.5 min. Finally, the robot has a safe, maximum limit of 0.6 m/s and 1.2 rad/s on its absolute linear and angular speeds.
|Model||Env.||Running Time||Distance||Linear Vel.||RG||F||C||PC||PS|
|Nav. Stack||E4||60.1 min||802.7 m||0.22 m/s||91%||0/100||64||1||82|
|Goal Cont. (GC)||E4||55.4 min||901.9 m||0.27 m/s||89%||2/100||167||12||108|
|Traj. Cont. (TC)||E4||49.8 min||993 m||0.33 m/s||85%||7/100||129||7||112|
|Proposed System||E4||54.2 min||923.3 m||0.28 m/s||95%||0/100||102||1||76|
|Nav. Stack||E5||56.8 min||892.2 m||0.26 m/s||98%||0/100||0||1||28|
|Goal Cont. (GC)||E5||46.4 min||930.2 m||0.33 m/s||98%||0/100||52||1||27|
|Traj. Cont. (TC)||E5||45.3 min||961.5 m||0.35 m/s||97%||1/100||61||0||43|
|Proposed System||E5||59.6 min||941.7 m||0.26 m/s||93%||0/100||27||1||34|
|Nav. Stack||E6||38.7 min||424.8 m||0.18 m/s||82%||0/50||67||0||80|
|Goal Cont. (GC)||E6||34.0 min||333.0 m||0.16 m/s||74%||1/50||178||1||65|
|Traj. Cont. (TC)||E6||39.4 min||472.2 m||0.2 m/s||66%||1/50||152||4||65|
|Proposed System||E6||34.4 min||484.4 m||0.23 m/s||98%||0/50||112||1||62|
|Nav. Stack||E7||25.2 min||276.26 m||0.18 m/s||66%||0/50||12||0||17|
|Goal Cont. (GC)||E7||45.9 min||540.01 m||0.19 m/s||62%||0/50||21||2||28|
|Traj. Cont. (TC)||E7||46.2 min||416.81 m||0.15 m/s||64%||1/50||18||1||28|
|Proposed System||E7||39.4 min||503.95 m||0.21 m/s||84%||0/50||16||0||35|
We first conducted an evaluation on the Foyer map, which was used for training or tuning of algorithms, to evaluate the capacity of the models on a previously-seen environmental condition (E4) and in new conditions (E5 and E6). As indicated in Table I, we considered 100 test scenarios according to each of E4 and E5, and 50 test scenarios according to E6.
Table II presents quantitative results for this experiment. Overall, all methods increased speed as the environments were simpler, suggesting that they adjusted navigation behavior based on environmental conditions. Unfortunately, though, collisions were observed throughout the experiment.
In terms of relative performance on the familiar condition E4, the proposed approach reached the goal the most. For new environmental conditions, the Navigation Stack was the best model in the simplest condition (E5), while the proposed approach was the best model in the more complex condition (E6). In E5, the performance of our system is worse than in the other two environments. We observed that in simulation, our model paid too much attention to the single pedestrian, and halted multiple times to let the pedestrian pass and as a result, could not reach the goal within 90 seconds. Remarkably, the performance of our approach varied only by when reaching the goal in the foyer, while the performance of the Nav. Stack varied by about . It is difficult to manually find a fixed set of model parameters for all environmental conditions.
The GC and TC baselines showed their best performance in the simple condition E5, even in comparison to E4. This result suggests that the controllers were able to follow the global plan, but had trouble adapting to dynamic environments, which were more common in E4 and E6.
We tested generalization capabilities in a new map of a university laboratory (environmental condition E7). The results are presented in the last four rows of Table II. Our proposed model reached the goal the most often, followed by the navigation stack. There is a marked difference in the goal reach percentages of the proposed system and the navigation stack. This result again highlights the difficulty in manually tuning static model parameters for navigation with ROS’ navigation stack.
We surveyed 35 people through Amazon Mechanical Turk in regards to their perception of the motion of the robot in 48 videos of scenarios from environmental conditions E4, E5, E6, and E7 (Table I). We then ran REstricted Maximum Likelihood (REML)  analyses on how Aggressive, Natural, and Efficient the motion looked with Method (Nav. Stack, GC, TC, Proposed) as main effect and participant as random effect. The results for Efficiency were significant, F, p. A Tukey post-hoc test  showed that the proposed navigation approach was perceived as significantly more efficient (M=, SE=) than the Nav. Stack and TC (M=, SE=; M=, SE=). The results for Aggressiveness and Naturalness were significant as well with p. The post-hoc showed that the TC baseline led to the most aggressive (M=, SE=) and least natural (M=, SE=) navigation behavior in comparison to all other methods. In particular, the ratings for our approach in terms of aggressiveness and naturalness were M=, SE=, and M=, SE=, respectively.
We inspected the attention coefficients from the learned components of the proposed navigation system and found that they correlate consistently with specific task conditions, providing a mechanism to interpret the execution of the model. For instance, as the robot navigated nearby obstacles, the attention coefficient for the lidar features of the Velocty Controller tended to grow, while the attention coefficient for the global plan features tended to get lower (left image of Fig. 3). Meanwhile, the opposite result was observed when the robot approached the goal (right image of Fig. 3). These results suggest that the proposed attention mechanisms helped the robot extract the right information from the many inputs to the components of our deep model. More visualizations can be seen in the supplementary video.
We conducted experiments in relatively open simulated environments, but future work should also evaluate the performance of the proposed approach in more narrow spaces, like corridors. Further, it is important to conduct a systematic evaluation of our model in real-world scenarios. While we have attempted such preliminary tests and our system has worked effectively several times, it has also failed due to errors in people tracking and noise in real lidar measurements, e.g., due to material reflections, that were not typical of our simulation. We continue to work to address these issues.
We proposed a new navigation system that combines traditional planning with modern deep learning techniques. The learning components of the system were structured as a local planner and a velocity controller. This structure made our model interpretable. First, by visualizing the local plan, one could get a sense of how the robot would move and react to dynamic elements of the environment. Second, by inspecting the attention coefficients of our model, one could see the type of information that the robot was considering during navigation. Overall, our evaluation of the proposed model suggested that it was effective for navigation in complex dynamic environments. The performance of the proposed model was more consistent than the performance of a traditional two-layer navigation system with social costs . Moreover, the learning component of our approach was more successful at reaching the goal than other deep controllers that mapped sensor data and a global plan directly to velocity commands, e.g., . Finally, our results reinforce the idea that imitation learning can facilitate modeling socially appropriate behavior from example navigation data [40, 41].
The Toyota Research Institute provided funds to assist with this research, but this paper solely reflects the opinions and conclusions of its authors and not of any Toyota entity.
AAAI/ACM Conf. on Artificial Intelligence, Ethics, and Society (AIES), 2018.
A. Vemula, K. Muelling, and J. Oh, “Social attention: Modeling attention in human crowds,” inProceedings of the International Conference on Robotics and Automation (ICRA) 2018, May 2018.
R. R. Corbeil and S. R. Searle, “Restricted maximum likelihood (reml) estimation of variance components in the mixed model,”Technometrics, vol. 18, no. 1, pp. 31–38, 1976.