Learning to Navigate from Simulation via Spatial and Semantic Information Synthesis with Noise Model Embedding

by Gang Chen, et al.

While training an end-to-end navigation network in the real world is usually costly, simulation provides a safe and cheap environment for this training stage. However, training neural network models in simulation raises the problem of how to effectively transfer the model from simulation to the real world (sim-to-real). In this work, we regard the environment representation as a crucial element in this transfer process and propose a visual information pyramid (VIP) model to systematically investigate a practical environment representation. A novel representation composed of spatial and semantic information synthesis is then established accordingly, with noise model embedding particularly considered. To explore the effectiveness of this representation, we compare its performance with representations popularly used in the literature in both simulated and real-world scenarios. The results suggest that our environment representation stands out. Furthermore, an analysis of the feature maps is conducted to investigate the effectiveness through the network's internal reactions, which could be illuminating for future research on end-to-end navigation.





I Introduction

The fundamental objective of mobile robot navigation is to arrive at a goal position without collision. The mobile robot is supposed to be aware of obstacles and move freely in different working scenarios. Mathematically modeling all the situations that a mobile robot may encounter is hardly possible, while end-to-end learning provides a promising data-driven solution to this high-dimensional problem. End-to-end learning maps sensor data directly to control outputs and has proved promising in coping with many scenarios [9, 12, 19].

As a data-driven approach, end-to-end learning often requires a large amount of training data. While collecting training data in the real world is usually costly, collecting data in simulation is much more convenient. Therefore, training the model in simulation and transferring it directly to the real world, namely sim-to-real learning, is an attractive approach. Many researchers have studied sim-to-real learning for robot manipulators [36, 45]. The working environment of a robot manipulator is usually fixed and easy to model in simulation. However, the working environments of a mobile robot are often diversified, and building all of them faithfully in simulation is hardly possible. Under such circumstances, how to transfer a network model trained in a limited number of roughly simulated environments to various real-world scenarios becomes a key concern.

Fig. 1: Method overview: different environment representations are compared to evaluate their influence on sim-to-real navigation tasks.

To fulfill an effective transfer process, high generalization ability of the network is demanded. A crucial factor for this generalization ability is the environment representation [11]. As vision sensors provide abundant information about the field of view, state-of-the-art works use the RGB image [37], the depth image [43] and the segmented semantic image [27] as the representation in end-to-end navigation. However, no systematic analysis approach has been proposed to compare these representations and explore a better one. Inspired by the visual information abstraction behavior of human operators, we propose the VIP model for vision-based end-to-end navigation and derive three criteria for a feasible representation in sim-to-real learning. Accordingly, an environment representation composed of spatial and semantic information synthesis is designed. The spatial information is presented by a noise-model-embedded depth image, while the semantic information is expressed with a categorized detection image. A training dataset is then obtained from expert operations in a coarse simulated scenario to compare the performance of this representation with others popularly used in the literature. Eight network models with different representations are trained and evaluated in two ways. First, in the commonly utilized approach, the models are tested in both simulated and real-world testing scenarios to get quantitative results. Furthermore, a fast and intuitive comparison approach is presented, which reveals the internal reactions of the network by constructing feature maps from its hidden convolutional layers. Both ways indicate that our representation behaves best, which in turn supports our VIP model.

The contributions of this work are:

  • Proposed the VIP model and three criteria for the design of environment representation in vision-based sim-to-real navigation.

  • Designed a representation with spatial and semantic information synthesis. Noise model for the real-world sensor is particularly considered.

  • Presented a fast evaluation and analysis approach through constructing feature maps in CNN layers.

The remaining content is organized as follows: Section 2 describes the related work. Section 3 presents the design process of our representation and the learning paradigm. Section 4 gives the experiment results. The conclusion is drawn in Section 5.

II Related Work

End-to-end learning dates back to the 1980s [33] and has proved to be a promising approach in navigation tasks for mobile robots [28]. Benefiting from the development of deep neural networks in recent years, the performance of end-to-end learning-based navigation has improved greatly. The fundamental ability in navigation is obstacle avoidance, and end-to-end learning networks have achieved compelling obstacle avoidance performance in many scenes, such as highways [7], trails [40] or corridors [12]. A global direction command given by a high-level planner is also considered in the literature [9, 13] to help robots make turns at intersections. A more complicated situation is an environment with dynamic obstacles like pedestrians, where the mobile robot must act more subtly and rapidly to avoid collision [32, 4].

In these learning-based navigation works, training data is important. However, operating a mobile robot to collect training data in the real world is inconvenient and time-consuming, and any damage to the environment or the robot itself could cause a lot of trouble. To improve the efficiency of the data collection process, some researchers use data acquired from cameras mounted on a person [14, 40] or a car [25] to imitate the behaviors of a mobile robot. Another effective approach is sim-to-real learning. Tai et al. [42] adopt a few sparse distance points measured by laser range finders as the network input and achieve good sim-to-real transferability in indoor environments. In vision-based navigation, several works use the RGB image as the environment representation for sim-to-real learning networks. The RGB image can be fed directly to a single end-to-end network to get the control commands [3] or be divided into grids first to learn the best heading direction [37]. The results are excellent in simulation but less satisfying in real-world tests. A special approach is to utilize two auto-encoders to generate a realistic RGB image from a simulated image [31]. This approach only suits a fixed number of simple scenarios, since a refined mapping from simulated images to real images is quite difficult.

Besides the RGB image, the depth image has also been adopted in sim-to-real learning for mobile robots. The depth image is easy to acquire from a stereo camera or an RGB-D camera and has proved to be an effective environment representation when training with data collected in the real world [41, 19]. Few works have tried to train the network with simulated depth images. One recent work applies depth-image-based sim-to-real learning in a pedestrian-rich scenario [43]. The performance is excellent in simulation but barely satisfying in the real world, due to the lack of a model of the noise in real depth images. The navigation model based on the depth image in [18] behaves poorly for the same reason.

Motivated by the traditional free-space searching and path planning paradigm, some works utilize a semantic image showing the free space area to form the environment representation. One implicit approach is to adopt an image segmentation network as semantic feature extraction layers and add new layers to output the control commands [46]. Another explicit approach with better performance is to generate a semantic segmentation image first and then feed it to another network to get waypoints [27] or velocity outputs [18, 26, 17]. The above works behave well when the simulated training environment is elaborate and the testing scenario is not cluttered. For more practical sim-to-real navigation, a better environment representation is still demanded.

III Methods

To systematically analyze the vision-based environment representation and explore a feasible representation for sim-to-real learning, one theoretical model and three criteria are proposed in this section. Then a representation composed of spatial and semantic information synthesis is designed accordingly, in which the noise model is considered. Finally, the utilized training approach and network architecture are described.

III-A Design Criteria

Consider a human operator who controls a mobile robot remotely based on a first-person-view (FPV) camera. The perception of the operator comes from an RGB image composed of basic intensity information (denoted I) on each pixel and textural information (T) given by the distribution of the intensity. From this detailed low-level information, the human operator can abstract high-level spatial information (P) and semantic information (S) through his perceptual experience of the world, and he controls the robot based on these two kinds of high-level information. In addition, artificial neural networks have been proved to have the ability to abstract P from I and T [24, 16] or from their combination with S [23, 15], as well as the ability to abstract S from I and T [49, 10]. We define this abstraction process as the visual information pyramid (VIP), as illustrated by the pyramid in Fig. 2.

The spatial information P tells the position of all the objects in the environment. The importance of P is intuitive and has also been proved in the neuroscience area [44]. Works have been conducted to apply P in obstacle avoidance [19, 41, 30]. Meanwhile, the semantic information S enables the robot to distinguish objects in order to perform complex behaviors [2, 1] or handle potential obstacles [38].

In vision-based sim-to-real learning, an intuitive principle is to make the simulated input image in the training stage as similar as possible to the real-world input image in the testing stage. Constructing sophisticated simulation environments to imitate real working environments is difficult and expensive. Therefore, utilizing a coarse simulation environment together with a feasible environment representation, one that keeps the useful information while narrowing the difference between simulation and the real world, is a better solution. The low-level intensity and textural information I and T differ a lot between simulation and the real world and from place to place, while they matter little in navigation. On the contrary, the high-level spatial and semantic information P and S differ little but are significant in navigation. Therefore, the first two criteria for designing a feasible environment representation for a high-performance sim-to-real navigation network are:

  • The representation should express the spatial information P and the semantic information S as explicitly as possible.

  • The representation should contain little dispensable information like the intensity information I or the textural information T.

which can be expressed as:

R ⊃ P,  R ⊃ S;  R ⊅ I,  R ⊅ T

where the operator ⊃ denotes that a kind of information is explicitly contained in an environment representation R, and ⊅ denotes that it is not.

Furthermore, observation results are usually perfect in simulation but noisy in the real world. To further narrow the gap between simulation and the real world, a noise model of the environment representation must be considered. Hence the third criterion is:

  • A noise model N for the representation satisfying the following condition can be built:

N(R_sim) ≈ R_real

that is, applying the noise model to the simulated representation should approximate the representation observed in the real world.
III-B Environment Representation

The RGB image, the depth image and the segmented semantic image are usually used in former end-to-end navigation works. Let R_rgb, R_d and R_s denote the environment representations composed of these three images respectively, and let the operator ⊃* denote that a kind of information is not explicitly contained in a representation but can be inferred or partially inferred. Then, with I, T, P and S denoting the intensity, textural, spatial and semantic information, we can get:

R_rgb ⊃ I, T;  R_rgb ⊃* P, S
R_d ⊃ P;  R_d ⊃* S
R_s ⊃ S;  R_s ⊃* P
In regard to the sim-to-real learning approach, the RGB-image representation does not fit the first two criteria we proposed: it is difficult for an end-to-end navigation network to learn the high-level spatial information P and semantic information S directly from pairs of RGB images and control commands. The depth-image representation explicitly contains P and can be quickly acquired by an RGB-D camera or a stereo camera, but the contained S is obscure. The segmented-semantic-image representation acquired from deep learning explicitly describes S, while P is only coarsely given by the layout of the segmented objects in the image. Besides, the segmented semantic image is usually generated by deep learning methods, and its noise is unpredictable.

Fig. 2: The proposed VIP theoretical model for vision-based navigation and our environment representation.

In our environment representation, the spatial information P and the semantic information S are synthesized to satisfy the first two criteria. As shown in Fig. 2, a depth image considering a real-world noise model is adopted to give P. Additionally, a semantic image generated from object detection results, such as those from Yolo V3 [34], is deployed to present S.

Although object segmentation approaches could generate more elaborate semantic information containing the free space area, they are currently not fast and accurate enough to run on on-board computers. Fast object detection results from Yolo V3 are sufficient to realize object-distinguishing obstacle avoidance. To further decrease the noise caused by false detections and increase the generalization ability, the semantic labels of the detected objects are graded into six categories according to their collision risk level. Pedestrians have the highest level, which means the robot should keep a large distance and stay slow when pedestrians show up. The final semantic image is a gray-scale image with different intensities for different categories: the higher the risk level, the larger the intensity the region is filled with. If two detected objects overlap, the object with the higher risk level is shown. This semantic image is named the categorized detection image to distinguish it from the segmented semantic image.
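As an illustration, the rendering of detection results into a categorized detection image can be sketched as follows. The category names, risk levels and intensity mapping here are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

# Assumed risk grading: higher level = higher collision risk.
RISK_LEVEL = {"pedestrian": 6, "cyclist": 5, "chair": 3, "table": 3, "door": 2, "wall": 1}

def categorized_detection_image(detections, shape=(192, 256)):
    """detections: iterable of (label, x1, y1, x2, y2) boxes in pixel coords."""
    img = np.zeros(shape, dtype=np.uint8)
    # Draw low-risk boxes first so a higher-risk box overwrites any overlap.
    for label, x1, y1, x2, y2 in sorted(detections, key=lambda d: RISK_LEVEL.get(d[0], 1)):
        level = RISK_LEVEL.get(label, 1)
        img[y1:y2, x1:x2] = int(255 * level / 6)  # higher risk -> brighter region
    return img
```

Sorting the boxes by risk before rasterizing implements the overlap rule from the text with a single pass.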

Depth images obtained in the real world are quite noisy. One obvious type of noise lies on the edges of the objects in view, and it is usually subject to a Gaussian distribution. Our work uses the Kinect V2 RGB-D camera, and the related noise model for the edges of objects has been studied in a previous work. The mean of the noise is the true depth, and the standard deviation can be described as a function σ(z) of the true depth z, scaled by a random coefficient added to cover the extreme situations described in that work. Edges of objects are detected by the Canny algorithm [6].
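A minimal sketch of this edge-noise injection, with a placeholder σ(z) standing in for the cited Kinect V2 model and a simple gradient threshold substituting for the Canny detector to stay dependency-free:

```python
import numpy as np

def add_edge_noise(depth, sigma_fn=lambda z: 0.01 * z, grad_thresh=0.5, seed=0):
    """Perturb depth values on object edges with zero-mean-shifted Gaussian noise.

    sigma_fn is an assumed depth-dependent standard deviation, not the
    actual model from the cited prior work.
    """
    rng = np.random.default_rng(seed)
    gy, gx = np.gradient(depth.astype(float))
    edges = np.hypot(gx, gy) > grad_thresh      # crude stand-in for Canny
    noisy = depth.astype(float)                  # astype makes a copy
    z = noisy[edges]
    # Noise mean is the true depth; std follows the depth-dependent model.
    noisy[edges] = rng.normal(loc=z, scale=sigma_fn(z))
    return noisy, edges
```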

Furthermore, we found that the depth near the border of the image is often unmeasurable. The situation varies a lot across scenarios and lighting conditions, so this noise is hard to model accurately. Considering its uncertainty, a mask following a combination of a Gaussian distribution and a uniform distribution is added to randomly remove some values on the border of the image. The algorithm is described in Fig. 3, where the input ratio of the masked depth values is sampled from a predefined interval and the mask sizes are 36 and 24 pixels along the two border directions of the depth image. Finally, salt-and-pepper noise is added randomly over the whole image. A comparison between the original simulated depth image, the noised simulated image, and two real-world depth images is shown in Fig. 4.

Input: D (the original depth image), r (the ratio of the masked depth values)
Output: D' (the processed depth image with noise near the border)
 1: D' ← D
 2: for each border pixel (u, v) do
 3:     sample s from a combination of a Gaussian and a uniform distribution
 4:     compute the mask depth m for this pixel from s and r
 5:     if u < m or u > W − m then
 6:         D'(u, v) ← 0
 7:     end if
 8:     if v < m or v > H − m then
 9:         D'(u, v) ← 0
10:     end if
11:     update the count of masked values
12: end for

Fig. 3: The algorithm of adding a noise mask on the border of a depth image.

Fig. 4: A comparison of the simulated depth image, the simulated depth image with noise and two real-world depth images (left to right).
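The border mask and the salt-and-pepper step might be sketched as below. For simplicity the Gaussian/uniform mixture over mask positions is reduced to uniform sampling, and the noise probabilities are assumptions; the 36/24-pixel border widths follow the text:

```python
import numpy as np

def add_border_mask(depth, ratio, bw=36, bh=24, seed=0):
    """Randomly invalidate (zero out) a fraction `ratio` of the border pixels."""
    rng = np.random.default_rng(seed)
    h, w = depth.shape
    out = depth.copy()
    border = np.ones((h, w), dtype=bool)
    border[bh:h - bh, bw:w - bw] = False          # the interior stays intact
    ys, xs = np.nonzero(border)
    idx = rng.choice(len(ys), size=int(ratio * len(ys)), replace=False)
    out[ys[idx], xs[idx]] = 0.0                   # 0 = unmeasured depth
    return out

def add_salt_pepper(depth, p=0.01, seed=1):
    rng = np.random.default_rng(seed)
    out = depth.copy()
    m = rng.random(depth.shape)
    out[m < p / 2] = 0.0                          # pepper: dropped readings
    out[m > 1 - p / 2] = depth.max()              # salt: saturated readings
    return out
```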

For comparison, network models with eight representations in four types are tested in our experiments: Type 1 = {RGB image, RGB image with noise model}, Type 2 = {depth image [43], depth image with our noise model}, Type 3 = {segmented image from FC-DenseNet [20], segmented image from PSPNet [47, 17]}, and Type 4 = {our representation consisting of a depth image and a categorized detection image, our representation with the noise model}. The noise added to the RGB image follows the approach in [9], including changes in contrast, tone and brightness and the addition of Gaussian noise, Gaussian blur, and salt-and-pepper noise. As for the generation of the two segmented images, PSPNet is trained with the ADE20k dataset [48] as in [17] and FC-DenseNet is trained with the CamVid dataset [5]. In simulation, the parameters are refined with a dataset we labeled to increase the accuracy. The segmentation results are also categorized.
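A rough sketch of RGB augmentation along the lines of [9]; the jitter ranges are assumptions, and the Gaussian blur step is omitted to keep the sketch dependency-free:

```python
import numpy as np

def augment_rgb(img, seed=0):
    """Apply contrast/tone/brightness jitter plus pixel noise to a uint8 RGB image."""
    rng = np.random.default_rng(seed)
    out = img.astype(float)
    out = (out - 128.0) * rng.uniform(0.8, 1.2) + 128.0    # contrast jitter
    out = out + rng.uniform(-20.0, 20.0, size=3)            # per-channel tone shift
    out = out + rng.uniform(-20.0, 20.0)                    # global brightness shift
    out = out + rng.normal(0.0, 5.0, size=out.shape)        # Gaussian pixel noise
    m = rng.random(out.shape[:2])
    out[m < 0.005] = 0.0                                    # pepper
    out[m > 0.995] = 255.0                                  # salt
    return np.clip(out, 0.0, 255.0).astype(np.uint8)
```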

III-C Learning Approach

In order to prove the effectiveness of our environment representation, the training data for the different representations should come from the same operation process of the robot. Hence we built a dataset from an expert's FPV operation in a simulated indoor environment in Gazebo [22] and utilized the imitation learning paradigm to train the network models with the different imported representations.

There are two assumptions in imitation learning. One is that the expert performs correctly under all encountered situations. The other is that the learning network receives an input containing all the necessary information that leads the expert to his action. The first assumption can be satisfied by carefully operating the robot to acquire good moving paths. However, learning how to recover from mistakes is also quite important [35]. Therefore, we randomly initialized many bad situations, such as hitting an obstacle, as initial states to obtain recovery samples, without recording the operations that led the robot into these bad situations.

The second assumption is usually fulfilled by inputting the same view that the expert had to the network. In our data collection stage, the view for the expert is an RGB image. According to our VIP model, the expert infers the spatial information and the semantic information from the RGB image and operates mainly based on these two kinds of information. Our environment representation composed of a depth image and a categorized detection image contains these two kinds of necessary information and satisfies the assumption.

To accomplish a navigation task in an environment with intersections, which are very common in many working scenarios, a global direction command also needs to be considered. Following a similar approach to [39], the direction commands consist of move forward, turn left, turn right and stop. The expert receives the direction command at each intersection from an arrow on the screen when collecting training data, while the network takes the direction command as an input in vector form. Denote by θ the parameters of the network and by a_t = (v_t, ω_t) the expert action at a discrete time t, which includes the linear velocity v_t and the angular velocity ω_t. Assume the network can be represented by a function F(o_t, c_t; θ), where o_t describes the input environment representation and c_t is the global direction command. The objective of our imitation learning can then be expressed as:

θ* = argmin_θ Σ_t ‖F(o_t, c_t; θ) − a_t‖²
III-D Network Architecture

As has been proved in previous works on goal-directed imitation learning with image input [9, 39], utilizing convolutional layers to generate a vector from the input image and concatenating the vector with global direction commands is an effective network structure. The convolutional layers work as an encoder that extracts valuable features from the image. A similar modularized structure is adopted in our network. Detailed structure can be found in Fig. 5.

Fig. 5: The detailed structure of our network.

The input consists of the environment representation and the direction command. As mentioned, four types of environment representations are considered for comparison, based on the RGB image, the depth image, the segmented semantic image and our representation with both a depth image and a categorized detection image. In the first three cases, only Encoder 1 is used to extract the features from the RGB image, the depth image or the segmented semantic image. The result is flattened and fed to two dense layers with 512 neurons each to get a feature vector. After that, the feature vector is concatenated with a command vector and then connected to another three dense layers to generate the final action. In the fourth case, with both the depth image and the categorized detection image, two encoders are utilized. Encoder 1 stays the same except that the connected dense layers have 480 neurons each. Encoder 2 is added to extract features from the semantic image; the dense layers after Encoder 2 have only 32 neurons.

The depth image and the categorized detection image in the fourth case are processed by the two encoders separately. As a result, the information in the two images is not connected at the pixel level, and the final output can be treated as an overlay of the influences of the categorized detection image and the depth image. We also tried to connect the two images at the pixel level by regarding them as two channels of one image. However, the result was terrible, because our categorized detection image usually contains less valuable information than the depth image, especially when no object is detected. Thus the semantic image is weighted less through a separate, smaller branch.
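The late-fusion step can be illustrated with a minimal sketch, using placeholder feature vectors in place of the convolutional encoders and an assumed one-hot command encoding:

```python
import numpy as np

COMMANDS = ["forward", "left", "right", "stop"]

def one_hot(command):
    """Encode a direction command as a one-hot vector."""
    v = np.zeros(len(COMMANDS))
    v[COMMANDS.index(command)] = 1.0
    return v

def fuse(depth_features, semantic_features, command):
    """Concatenate the two encoder outputs (480-d and 32-d) with the command vector."""
    assert depth_features.shape == (480,) and semantic_features.shape == (32,)
    return np.concatenate([depth_features, semantic_features, one_hot(command)])
```

The 516-dimensional fused vector would then feed the final dense layers; the much smaller semantic branch is what keeps the detection image from dominating the depth cues.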

All the input images share the same fixed resolution, and the networks are light and fast enough for real-time navigation tasks. The final output is an action vector containing the linear and angular velocity control signals. Our loss function can be described as:

L = Σ_t [ (v̂_t − v_t)² + γ (ω̂_t − ω_t)² ] + λ ‖W‖²

The last component in this equation is the regularization term, where λ is applied to the weights W in the dense layers. The ranges of v and ω are normalized before training. γ is a parameter that balances the errors of the linear and angular velocities; in practice, it is tuned empirically.
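A minimal numpy sketch of this loss, with the balancing parameter γ and the regularization weight λ as assumed hyperparameters:

```python
import numpy as np

def navigation_loss(pred, target, dense_weights, gamma=1.0, lam=1e-4):
    """pred, target: arrays of shape (batch, 2) holding (v, omega) pairs."""
    v_err = np.mean((pred[:, 0] - target[:, 0]) ** 2)   # linear-velocity error
    w_err = np.mean((pred[:, 1] - target[:, 1]) ** 2)   # angular-velocity error
    reg = lam * sum(np.sum(W ** 2) for W in dense_weights)  # L2 on dense-layer weights
    return v_err + gamma * w_err + reg
```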

Moreover, a 50% dropout is applied after the convolutional layers and after the first dense layer following the concatenation. ReLU nonlinearities are used for all hidden layers. The models were trained by the Adam solver [21] with a mini-batch size of 40 and a small initial learning rate.

III-E Evaluation Approach

The commonly utilized evaluation approach for an end-to-end navigation network model is to conduct experiments in a testing environment and assess it by indicators like collision-free moving time [13] or the number of interventions [9]. For the sim-to-real paradigm, models trained in simulation should be tested on the real-world robot system to get the evaluation result. However, in the early stages of research, testing the models directly in the real world can be dangerous and time-consuming. One way is to first test the models in a simulated environment different from the training environment; the limitation is that the input observation is still simulated. Hence we propose a fast and intuitive approach to analyze the reaction of the network models to real observations. A typical real-world scenario with obstacles is set up, and input images are collected directly from the RGB-D camera on the robot. The images are then fed to the network model trained in simulation, and a feature map can be constructed from the state of a middle convolutional layer. The feature map intuitively reflects the reaction of the model towards different obstacles and reveals the effectiveness of different environment representations. Details can be found in the feature map analysis part of the experiments section.
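The feature-map construction itself is simple once the activations of a middle convolutional layer have been extracted (a framework-dependent step assumed done beforehand): average over the channels and rescale to a gray image.

```python
import numpy as np

def feature_map(activations):
    """activations: (H, W, C) tensor taken from a middle conv layer."""
    fmap = activations.mean(axis=-1)          # collapse the channel axis
    lo, hi = fmap.min(), fmap.max()
    if hi > lo:
        fmap = (fmap - lo) / (hi - lo)        # normalize to [0, 1]
    return (255.0 * fmap).astype(np.uint8)    # gray image for visual inspection
```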

IV Experiments

We mainly focus on indoor scenarios in our experiments. All the training data came from a very coarse simulated environment built in Gazebo [22]. The results were first evaluated by the navigation performance in both simulation and the real world. Then our fast feature-map-based evaluation approach was utilized to investigate the effectiveness through internal reactions. The following presents our system and results.

IV-A System Setup

Fig. 6: The settings of the simulated training environment, the simulated testing environment and the real world testing environment (from Row (a) to Row (c)).

Two simulated indoor environments were built in Gazebo to train and test the models respectively. Compared to the training environment, the testing environment has a different building structure, and the appearance of some objects also differs. The settings of the simulated testing environment are shown in Fig. 6, and a map is presented in Fig. 8.

In the training process, the expert controlled the simulated turtlebot with a joystick in FPV. No global route was planned, and the expert followed a direction command generated randomly at each intersection. In the testing environment, the mobile robot followed a global route that covered the whole map.

Fig. 7: The physical system to test the models.

The physical system of the mobile robot is composed of a mobile platform, a bottom controller, an on-board computer and an RGB-D camera, as shown in Fig. 7. Since planning a global topological route is not our focus, a joystick is simply utilized to send a direction command at each intersection. The on-board computer is equipped with an NVIDIA GTX 1060 GPU. The mobile robot moves at a speed of about 0.6 m/s.

To quantitatively evaluate the models trained in simulation, a place in our lab building was chosen as the real-world testing environment. The corridors vary in width along the route, and the mobile robot has to pass corridors with a total length of about ninety meters. The start point and the end point share the same place. We cluttered the corridors with some chairs and foam boards. Voluntary pedestrians confronted, crossed or overtook the mobile robot to test its ability to react to dynamic obstacles; the volunteers had no idea which model was running during the tests. Fig. 9 shows a laser-scanned grid map of the real-world testing environment.

IV-B Evaluation

Eight models with different environment representations were trained for 400 epochs each on one hour of simulated training data. We evaluated the models quantitatively in both the simulated and the real-world testing environments, and then analyzed the internal reactions of the networks.

In the quantitative evaluations, the performance was first measured through 12 trials for each model in the simulated testing environment and 5 trials in the real-world testing environment. All our models ran at a frequency of over 20 Hz in both the simulation and the real-world tests. The basic obstacle avoidance ability was evaluated by the average number of interventions; an intervention happened when the robot hit an obstacle or got stuck in a certain situation. We also evaluated the ability to react to dynamic obstacles such as pedestrians.

Previous works evaluated this ability by the minimum distance to the pedestrians [8] or the number of successful avoidances of pedestrians [4]. The limitation of these evaluation indicators is that they depend heavily on the behaviors of the pedestrians: pedestrians naturally avoid being hit, and some may intentionally or occasionally walk very close to the robot. Therefore, a more objective indicator free from the behaviors of pedestrians is necessary. Intuitively, when encountering a pedestrian, the robot decelerates, waits or turns to avoid collision; in all these actions, the linear velocity of the robot decreases. Thus the statistical average percentage of linear velocity decrease, compared to the situation with no pedestrian nearby, was adopted to evaluate the ability to react to moving obstacles. The velocity was acquired from the ground truth in simulation and from the motor encoders on the wheels in the real world. Furthermore, a score given by the pedestrians after each real-world test was adopted as a subjective evaluation. Details are given below.
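This indicator can be computed from a velocity log paired with a per-sample pedestrian-nearby flag; the log format here is an assumption:

```python
import numpy as np

def velocity_decrease_pct(v, pedestrian_nearby):
    """Average drop of linear velocity while a pedestrian is nearby,
    as a percentage of the pedestrian-free average velocity."""
    v = np.asarray(v, dtype=float)
    near = np.asarray(pedestrian_nearby, dtype=bool)
    v_free = v[~near].mean()   # average speed with no pedestrian nearby
    v_near = v[near].mean()    # average speed while a pedestrian is nearby
    return 100.0 * (v_free - v_near) / v_free
```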

Fig. 8: A map of the simulated testing environment and typical trajectories of models with different environment representations.
Model                       Interventions   Time (min)   Vel. decrease
RGB                         2.5             7.3          41.5%
RGB + noise [39] *          46.0            14.0         -
Depth [43]                  5.6             10.2         42.2%
Depth + our noise           0.6             5.8          32.4%
Segmented (FC-DenseNet) *   32.7            10.4         -
Segmented (PSPNet) [17] *   26.7            8.1          -
Ours w/o noise              0.9             6.2          44.8%
Ours + noise                0.2             6.1          36.1%
TABLE I: Results in the simulated testing environment.

IV-B1 Simulation Tests

Typical trajectories in the simulated testing environment and the quantitative results are shown in Fig. 8 and Tab. I respectively. Representations marked with a star indicate that too many collisions happened even without dynamic obstacles, so the velocity decrease was not calculated.

The interventions of the models trained with segmented images are more than 100 times those of the best model. The model trained with non-noise depth images [43] behaves terribly as well, as does the model trained with non-noise RGB images. When the augmented noisy data is added to the RGB input, the model behaves even worse and collides over 40 times in a single trial; changing the percentage of augmented data was also tested but helped little. On the contrary, augmented noise in the depth images improves the obstacle avoidance performance significantly (see Tab. I). In our analysis, the augmented noise on the RGB image [39], which is designed to improve the generalization ability, never occurs in the simulated testing environment and thus brings negative effects. Meanwhile, depth images are much less diversified than RGB images, leading to a high possibility of overfitting; the augmented noise effectively prevents this overfitting.

Compared to the models trained with only depth images, adding a categorized detection image improves the ability to avoid dynamic obstacles as expected. Moreover, the number of interventions is reduced remarkably, because the semantic image raises stronger responses to the detected objects, such as furniture and pedestrians, and thus improves the obstacle avoidance ability. One drawback is that the average time to finish one test increases slightly due to the velocity decrease. Considering the trade-off between the finishing time and the ability to navigate safely among moving obstacles, the models with our representations stand out in the simulation tests.

Fig. 9: A laser-scanned grid map of the real-world testing environment and trajectories of the two valid models.
Model    Interventions   Time   Velocity decrease   Score
ours     3.0             2.8    7.2%                3.3
ours     0.8             3.0    12.2%               4.0
Others   20.0            -      -                   -
TABLE II: Results in the real-world testing environment.

Iv-B2 Real-world Tests

Real-world tests are much more challenging. The mobile robot moved almost randomly when using the models with or . When using the models with or , in which the noise in the depth image is not considered, the robot simply stays still or keeps steering. The models with segmented semantic images, and , behave better, but the number of interventions is still over 20. Only the results of the remaining two models, which use noise model embedding, are competitive. Fig. 9 and Tab. II present the typical trajectories and quantitative results. On the ninety-meter-long real-world testing route with obstacles and pedestrians, the model with our shows a striking result of less than one intervention per trial on average. The model with also works in the real world, but its performance is less impressive. The relative performance of the two models shows the same tendency in the simulated and real-world environments.

Fig. 10: A box plot of the velocity decrease percentage (Subplot (a)) and typical velocity command curves when the mobile robot confronts a pedestrian (Subplot (b)).

Fig. 11: Six testing environments for qualitative analysis.

Fig. 12: The feature maps of the middle CNN layer for different kinds of environment representations.

In the evaluation of the ability to avoid moving obstacles, the model with performs much better in both the velocity decrease percentage and the subjective score given by pedestrians. Subplot (a) in Fig. 10 illustrates the velocity decrease percentage of the two models. Compared to the values in simulation, the decrease percentage in the real world is less satisfying but acceptable, because the behavior of pedestrians in the real world is much more complicated. Subplot (b) presents two typical velocity command curves output by the network models when the mobile robot confronts a pedestrian in the real world. The curves are aligned by the distance to the pedestrian to show the difference in reaction distance. An obvious decrease in the output linear velocity command can be seen for both models when a pedestrian is close, but the model with the semantic image input responds much earlier than the model without it.
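For reference, the velocity decrease percentage plotted in Fig. 10 can be computed in one plausible way: average the commanded linear velocity over a trial and report its relative drop from the cruise speed. This is an assumed definition consistent with the surrounding text; the paper's exact formula may differ.

```python
import numpy as np

def velocity_decrease_pct(v_cmd, v_max):
    """Relative drop of the mean commanded linear velocity from cruise speed.

    `v_cmd` is the sequence of linear velocity commands over a trial and
    `v_max` the cruise speed; returns a fraction in [0, 1]. This is an
    assumed definition, not necessarily the paper's.
    """
    return float(1.0 - np.mean(v_cmd) / v_max)
```

Under this definition, a larger value means the robot slowed down more strongly for nearby pedestrians during the trial.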

Experiments in the real world show that the model with our , performs the best. The noise model for the depth image plays an important role. Besides, tests in a simulated environment different from the training environment can provide a prior evaluation of the models before real-world tests. We further tested our best model qualitatively in the six environments shown in Fig. 11, three indoor and three outdoor.

The model never encountered similar environments in the simulated training environment, but it is still able to avoid the obstacles. Due to the influence of sunlight, the outdoor depth images are quite noisy, yet our model still performs well. A video of the experiments can be found at: https://youtu.be/ucGyuMjlgEk.

Iv-B3 Feature Map Analysis

To study the internal effect of the environment representations, we set up a scenario with a pedestrian and two walls of different appearances and analyzed the feature maps of the different environment representations in this scenario. The analyzed network models use , and . The feature map is constructed by averaging all the channels in the middle CNN layer and mapping the result to a gray-scale image. The middle layer is chosen because it extracts the features useful for navigation while remaining concrete enough to interpret. The feature maps presented in Fig. 12 are recolored for a clearer view. From row to row, the pedestrian appears and comes closer and closer to the mobile robot.
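The feature-map construction described above (channel averaging followed by mapping to a gray-scale image) can be sketched as follows; the min-max normalization to [0, 255] is an assumed display detail, and recoloring would be applied afterwards.

```python
import numpy as np

def feature_map_image(activations):
    """Collapse a (C, H, W) middle-layer activation into a grayscale image.

    As described above: average over the channel axis, then min-max
    normalize to [0, 255] for display. The normalization scheme is an
    assumption for visualization purposes.
    """
    fmap = activations.mean(axis=0)          # (H, W) channel average
    lo, hi = fmap.min(), fmap.max()
    norm = (fmap - lo) / (hi - lo + 1e-8)    # guard against flat maps
    return (norm * 255).astype(np.uint8)
```

With a framework such as PyTorch, the `activations` array would typically be captured from the chosen middle convolutional layer via a forward hook during inference.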

The network trained with RGB images shows no obvious reaction to the pedestrian, no matter how close the pedestrian is. Meanwhile, the white wall on the left is clearly visible in the feature map, while the colorful wall on the right is not. The reason is that in the simulated training environment the walls are mostly white and the people wear dark blue pants. We further tested two more situations for the RGB-trained model. When the pedestrian wears dark blue pants like the people in the simulation, some features clearly appear. When the positions of the two walls are swapped, the area containing the colorful wall still shows no features. This indicates that the model trained with RGB image and control command pairs works mainly at the intensity level: spatial and semantic information is hardly learned, and objects in the real-world environment with different appearances elicit little response.

The feature map of the model trained with depth images shows better results. The effect of the walls on both sides can be seen clearly. The response to the pedestrian becomes obvious when the distance is shorter than 2 meters. When the distance is over 3 meters, the pedestrian is still clearly present in the depth image, but the network hardly responds, meaning the network suppresses the influence of far-away obstacles.

The categorized semantic image is relatively simple but elicits the strongest response to the pedestrian. The corresponding feature map shows a rectangle representing the person, similar to the input semantic image. The main difference is that the intensity at the edge is higher than inside. This is reasonable, because the edge indicates the geometric outline of the pedestrian in the image, which matters for collision avoidance.

The segmented image elicits reactions to both the pedestrian and the two walls. However, since spatial information is ambiguous in the segmented image, the responses to near and far obstacles are similar. Besides, the feature map is quite noisy, which introduces uncertainty into obstacle avoidance.

Compared to the RGB and segmented images, the depth image and the categorized semantic image elicit much more obvious reactions in the real-world scenario. This accounts for the results of the quantitative tests and provides a fast evaluation approach. It could also be instructive for future research on end-to-end navigation.

V Conclusion

As reinforcement learning-based navigation draws great attention and various learning networks are proposed, it is important to rethink the design of a proper environment representation to realize effective sim-to-real transfer. This work systematically investigates the environment representation, from a theoretical model to an evaluation approach. A representation composed of spatial and semantic information synthesis is designed accordingly, with a noise model for real-world observations particularly considered. With merely one hour of training data collected from a very coarse simulated environment, the network model trained with our representation successfully navigates the robot in various real scenarios with obstacles. The feature map analysis also confirms the effectiveness of this representation. In future work, we will adopt this representation in reinforcement learning-based sim-to-real navigation to improve the generalization ability. Through a trial-and-error process in simulation, we hope the final model will serve real-world navigation tasks well.


  • [1] H. Bai, S. Cai, N. Ye, D. Hsu, and W. S. Lee (2015) Intention-aware online pomdp planning for autonomous driving in a crowd. In IEEE International Conference on Robotics and Automation (ICRA), pp. 454–460. Cited by: §III-A.
  • [2] T. Bastos, R. Alex, and R. Castro Freitas (1999-07) An efficient obstacle recognition system for helping mobile robot navigation. Cited by: §III-A.
  • [3] H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull (2018) A data-efficient framework for training and sim-to-real transfer of navigation policies. arXiv preprint arXiv:1810.04871. Cited by: §II.
  • [4] J. Bi, T. Xiao, Q. Sun, and C. Xu (2018) Navigation by imitation in a pedestrian-rich environment. arXiv preprint arXiv:1811.00506. Cited by: §II, §IV-B.
  • [5] G. J. Brostow, J. Fauqueur, and R. Cipolla (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97. Cited by: §III-B.
  • [6] J. Canny (1987) A computational approach to edge detection. In Readings in Computer Vision, pp. 184–203. Cited by: §III-B.
  • [7] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) Deepdriving: learning affordance for direct perception in autonomous driving. In IEEE International Conference on Computer Vision (ICCV), pp. 2722–2730. Cited by: §II.
  • [8] Y. F. Chen, M. Everett, M. Liu, and J. P. How (2017) Socially aware motion planning with deep reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1343–1350. Cited by: §IV-B.
  • [9] F. Codevilla, M. Miiller, A. López, V. Koltun, and A. Dosovitskiy (2018) End-to-end driving via conditional imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §I, §II, §III-B, §III-D, §III-E.
  • [10] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pp. 2366–2374. Cited by: §III-A.
  • [11] L. Frommberger (2007) Generalization and transfer learning in noise-affected robot navigation tasks. In Portuguese Conference on Artificial Intelligence, pp. 508–519. Cited by: §I.
  • [12] D. Gandhi, L. Pinto, and A. Gupta (2017) Learning to fly by crashing. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3948–3955. Cited by: §I, §II.
  • [13] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian (2017) Intention-net: integrating planning and deep learning for goal-directed autonomous navigation. In Conference on Robot Learning (CoRL), pp. 185–194. Cited by: §II, §III-E.
  • [14] A. Giusti, J. Guzzi, D. C. Cireşan, F. He, J. P. Rodríguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al. (2015) A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters 1 (2), pp. 661–667. Cited by: §II.
  • [15] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik (2015) Indoor scene understanding with rgb-d images: bottom-up segmentation, object detection and semantic segmentation. International Journal of Computer Vision 112 (2), pp. 133–149. Cited by: §III-A.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §III-A.
  • [17] Z. Hong, C. Yu-Ming, S. Su, T. Shann, Y. Chang, H. Yang, B. H. Ho, C. Tu, Y. Chang, T. Hsiao, et al. (2018) Virtual-to-real: learning to control in visual semantic segmentation. arXiv preprint arXiv:1802.00285. Cited by: §II, §III-B, TABLE I.
  • [18] Z. Hong, C. Yu-Ming, S. Su, T. Shann, Y. Chang, H. Yang, B. H. Ho, C. Tu, Y. Chang, T. Hsiao, et al. (2018) Virtual-to-real: learning to control in visual semantic segmentation. arXiv preprint arXiv:1802.00285. Cited by: §II, §II.
  • [19] S. Hornauer, K. Zipser, and S. Yu (2018) Imitation learning of path-planned driving using disparity-depth images. In European Conference on Computer Vision (ECCV), pp. 542–548. Cited by: §I, §II, §III-A.
  • [20] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11–19. Cited by: §III-B.
  • [21] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations (ICLR). Cited by: §III-D.
  • [22] N. Koenig and A. Howard (2004) Design and use paradigms for gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vol. 3, pp. 2149–2154. Cited by: §III-C, §IV.
  • [23] H. Koppula, A. Anand, T. Joachims, and A. Saxena (2011) Semantic labeling of 3d point clouds for indoor scenes. In International Conference on Neural Information Processing Systems, Cited by: §III-A.
  • [24] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: §III-A.
  • [25] A. Loquercio, A. I. Maqueda, C. R. Del-Blanco, and D. Scaramuzza (2018) Dronet: learning to fly by driving. IEEE Robotics and Automation Letters 3 (2), pp. 1088–1095. Cited by: §II.
  • [26] A. Mousavian, A. Toshev, M. Fiser, J. Kosecka, A. Wahid, and J. Davidson (2018) Visual representations for semantic target driven navigation. arXiv preprint arXiv:1805.06066. Cited by: §II.
  • [27] M. Mueller, A. Dosovitskiy, B. Ghanem, and V. Koltun (2018) Driving policy transfer via modularity and abstraction. In Conference on Robot Learning (CoRL), pp. 1–15. Cited by: §I, §II.
  • [28] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun (2005) Off-road obstacle avoidance through end-to-end learning. In Advances in neural information processing systems, pp. 739–746. Cited by: §II.
  • [29] C. V. Nguyen, S. Izadi, and D. Lovell (2012) Modeling kinect sensor noise for improved 3d reconstruction and tracking. In International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), pp. 524–530. Cited by: §III-B.
  • [30] H. Oleynikova, D. Honegger, and M. Pollefeys (2015) Reactive avoidance using embedded stereo vision for mav flight. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 50–56. Cited by: §III-A.
  • [31] X. Pan, Y. You, Z. Wang, and C. Lu (2017) Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952. Cited by: §II.
  • [32] N. Patel, A. Choromanska, P. Krishnamurthy, and F. Khorrami (2017) Sensor modality fusion with cnns for ugv autonomous driving in indoor environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1531–1536. Cited by: §II.
  • [33] D. A. Pomerleau (1989) Alvinn: an autonomous land vehicle in a neural network. In Advances in neural information processing systems, pp. 305–313. Cited by: §II.
  • [34] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §III-B.
  • [35] S. Ross, G. J. Gordon, and J. A. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §III-C.
  • [36] A. A. Rusu, M. Večerík, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell (2017) Sim-to-real robot learning from pixels with progressive nets. In Conference on Robot Learning (CoRL), pp. 262–270. Cited by: §I.
  • [37] F. Sadeghi and S. Levine (2017) CAD2RL: real single-image flight without a single real image. Robotics: Science and Systems (RSS). Cited by: §I, §II.
  • [38] M. Sadou, V. Polotski, and P. Cohen (2004) Occlusions in obstacle detection for safe navigation. In Intelligent Vehicles Symposium, Cited by: §III-A.
  • [39] A. Sauer, N. Savinov, and A. Geiger (2018) Conditional affordance learning for driving in urban environments. In Conference on Robot Learning (CoRL), pp. 237–252. Cited by: §III-C, §III-D, §IV-B1, TABLE I.
  • [40] N. Smolyanskiy, A. Kamenev, J. Smith, and S. Birchfield (2017) Toward low-flying autonomous mav trail navigation using deep neural networks for environmental awareness. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4241–4247. Cited by: §II, §II.
  • [41] L. Tai, S. Li, and M. Liu (2016) A deep-network solution towards model-less obstacle avoidance. In IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 2759–2764. Cited by: §II, §III-A.
  • [42] L. Tai, G. Paolo, and M. Liu (2017) Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 31–36. Cited by: §II.
  • [43] L. Tai, J. Zhang, M. Liu, and W. Burgard (2018) Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1111–1117. Cited by: §I, §II, §III-B, §IV-B1, TABLE I.
  • [44] T. Hafting, M. Fyhn, S. Molden, M. Moser, and E. I. Moser (2005) Microstructure of a spatial map in the entorhinal cortex. Nature 436 (7052), pp. 801–806. Cited by: §III-A.
  • [45] U. Viereck, A. t. Pas, K. Saenko, and R. Platt (2017) Learning a visuomotor controller for real world robotic grasping using simulated depth images. In Conference on Robot Learning (CoRL), pp. 291–300. Cited by: §I.
  • [46] H. Xu, Y. Gao, F. Yu, and T. Darrell (2017) End-to-end learning of driving models from large-scale video datasets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3530–3538. Cited by: §II.
  • [47] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §III-B.
  • [48] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba (2019) Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127 (3), pp. 302–321. Cited by: §III-B.
  • [49] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1851–1858. Cited by: §III-A.