Autonomous Navigation in Complex Environments with Deep Multimodal Fusion Network

by   Anh Nguyen, et al.
Imperial College London

Autonomous navigation in complex environments is a crucial task in time-sensitive scenarios such as disaster response or search and rescue. However, complex environments pose significant challenges for autonomous platforms to navigate due to their challenging properties: constrained narrow passages, unstable pathway with debris and obstacles, or irregular geological structures and poor lighting conditions. In this work, we propose a multimodal fusion approach to address the problem of autonomous navigation in complex environments such as collapsed cites, or natural caves. We first simulate the complex environments in a physics-based simulation engine and collect a large-scale dataset for training. We then propose a Navigation Multimodal Fusion Network (NMFNet) which has three branches to effectively handle three visual modalities: laser, RGB images, and point cloud data. The extensively experimental results show that our NMFNet outperforms recent state of the art by a fair margin while achieving real-time performance. We further show that the use of multiple modalities is essential for autonomous navigation in complex environments. Finally, we successfully deploy our network to both simulated and real mobile robots.


page 1

page 3

page 5

page 6


Autonomous Navigation with Mobile Robots using Deep Learning and the Robot Operating System

Autonomous navigation is a long-standing field of robotics research, whi...

TTM: Terrain Traversability Mapping for Autonomous Excavator Navigation in Unstructured Environments

We present Terrain Traversability Mapping (TTM), a real-time mapping app...

Autonomous Off-road Navigation over Extreme Terrains with Perceptually-challenging Conditions

We propose a framework for resilient autonomous navigation in perceptual...

Real-Time Sonar Fusion for Layered Navigation Controller

Navigation in varied and dynamic indoor environments remains a complex t...

Kilometer-scale autonomous navigation in subarctic forests: challenges and lessons learned

Challenges inherent to autonomous wintertime navigation in forests inclu...

Learning-based 3D Occupancy Prediction for Autonomous Navigation in Occluded Environments

In autonomous navigation of mobile robots, sensors suffer from massive o...

Dynamic-Aware Autonomous Exploration in Populated Environments

Autonomous exploration allows mobile robots to navigate in initially unk...

I Introduction

Autonomous navigation is a long-standing field of robotics research, which provides an essential capability for mobile robot to execute a series of tasks on the same environments performed by human everyday. In general, the task of autonomous navigation is to control a robot navigate around the environment without colliding with obstacles. It can be seen that navigation is an elementary skill for intelligent agents, which requires decision-making across a diverse range of scales in time and space. In practice, autonomous navigation is not a trivial task since the robot needs to close the perception-control loop under the uncertainty in order to obtain the autonomy.

Recently, the learning-based approaches (e.g., deep learning models, etc.) have demonstrated the ability to directly derive end-to-end policies which map raw sensor data to control commands 

[1, 2]. This end-to-end approach also reduces the complexity of the implementation and effectively utilizes input data from different sensors (e.g., depth camera, laser) thereby reducing cost, power and computational time. One more advantage is that the end-to-end relationship between input data and control outputs can result in an arbitrarily non-linear complex model (i.e., sensor to actuation) which has yielded surprisingly encouraging results in different control problems such as lane following [3], autonomous driving [4], and Unmanned Aerial Vehicles (UAV) control [5]. However, most of the previous works are only tested in man-made normal environments, while navigating in complex environments such as collapsed houses/cities suffered from a disaster (e.g. an earthquake) or a natural caves still remains as an open problem.

Fig. 1: A visualization of a collapsed city, a complex environment from our simulation. Top: The collapsed city from a top view; Bottom: A mobile robot’s view.

Unlike normal environments (e.g., man-made road) which have clear visual clues in normal condition, complex environments such as collapsed cities or natural caves pose significant challenges for autonomous navigation [6] [7]. The main reason is that complex environments usually have very challenging visual and physical properties. For example, the collapsed cities may have constrained narrow passages, vertical shafts, unstable pathways with debris and broken objects (Fig. 1); or the natural caves often have irregular geological structures, narrow passages, and poor lighting condition. Autonomous navigation with intelligent robots in complex environments, however, is a crucial task, especially in time-sensitive scenarios such as disaster response, search and rescue, etc. Recently, the DARPA Subterranean Challenge [8] was organized to explore novel methods for quickly mapping, navigating, and searching in complex underground environments such as human-made tunnel systems, urban underground, and natural cave networks.

Inspired by the DARPA Subterranean Challenge, in this work, we propose a learning-based system for end-to-end mobile robot navigation in complex environments such as collapsed cities, collapsed houses, and natural caves. To overcome the difficulty in data collection, we build a large-scale simulation environment which allows us to collect the training data and deploy the learned control policy. We then propose a new Navigation Multimodal Fusion Network (NMFNet) that effectively learns the visual perception from sensor fusion and allows the robot to autonomously navigate in complex environments. To summarize, our main contributions are as follows:

  • We introduce new simulation models that can be used to record large-scale datasets for autonomous navigation in complex environments.

  • We present a new deep learning method that fuses both laser data, 2D images, and 3D point cloud to improve the navigation ability of the robot in complex environments.

  • We show that the use of multiple visual modalities is essential to learn a robust robot control policy for autonomous navigation in complex environments in order to deploy in real-world scenarios.

The remainder of this paper is organized as follows: Section II discusses the related background. Section III presents the visual multimodal input used in our method. Section IV introduces our new multimodal fusion network and its architecture. In section V, we present our extensively experimental results. Section VI concludes the paper and discusses the future work.

Ii Related Work

Multiple sensor fusion for autonomous robot navigation is a popular research topic in robotics [9]

. Traditional methods tackle this problem using algorithms based on Kalman Filter 

[10]. The advantage of this approach is the ability to fuse data from different sensors and sensor types such as visual, inertial, GPS, or pressure sensors. Lynen et al. [11] proposed a method based on Extended Kalman Filter (EKF) for Micro Aerial Vehicle (MAV) navigation. In [12]

, the authors developed an algorithm based on EKF to estimated the state of an UAV in multi-environments in real-time. Mascaro et al. 


proposed a graph-optimization method to fuse data from multi sensors for UAV pose estimation. Apart from the traditional localization and navigation task, multimodal fusion is also used other applications such as object detection 

[14] or semantic segmentation [15, 16] in changing environments. In both [14, 16] multimodal data from visual sensors are combined and learned in a deep learning framework to deal with challenging lighting conditions.

More recently, many methods have been proposed to directly learn control policies from raw sensory data. These methods can be divided into two main categories: reinforcement learning 


and supervised learning 

[18, 19, 20]

. With the rise of deep learning, Convolution Neural Networks (CNN) was widely used to train an end-to-end perception system 

[21, 22, 23, 24, 25, 26]. In [27], Bojarski et al. proposed the first end-to-end navigation system for autonomous car using 2D images. Smolyanskiy et al. [28] extended this idea for flying robots using three cameras as the input. Similarly, the authors of DroNet [29]

used CNN to learn the steering angle and predict the collision probability given the RGB image as the input. Gandhi et al. 

[30] introduced a navigation method for UAV by learning from negative and positive crashing trials. In [31] [32]

, CNN and Variational Autoencoder were combined to estimate the steering control signal. Monajjemi et al. 

[5] proposed a new method for agile UAV control. More recently, the authors in [33] proposed to combine the navigation map with visual input to learn a deterministic control signal.

Reinforcement learning algorithms have been widely used to learn general policies from robot experiences [17, 34, 35]. In [36], the authors introduced a continuous control framework using deep reinforcement learning. Zhu et al. [37] addressed the target-driven navigation problem given an input picture of a target object. Wortsman et al.[38] introduced a self-adaptive visual navigation system using meta-learning. The authors in [39] used semantic information and spatial relationships to let a robot navigate to target objects. In [40], an end-to-end regression system was introduced for UAV racing in simulation. The authors in [41][42] proposed to train the reinforcement policy in simulation environments, then transfer the learned policy to the real-world. In [43] [44], the authors combined deep reinforcement learning with CNN in an effort to leverage the advantages of both techniques. Piotr et al. [7] proposed a method with augmented memory to train autonomous agents to navigate within large and visually rich environments (complicated 3D mazes).

While reinforcement learning methods learn the general control policies with nice mathematical formulation, they require many trial and error experiments which are dangerous and not realistic in real safety-critical robotic platforms [39] [40]. On the order hand, supervised learning methods use pre-collected data to learn the control policies. The supervision data can be obtained from the real human expert trajectories [29, 30] or traditional controllers[45]. This is time-consuming and costly but doable with the real robots. Therefore, the supervised learning approach is usually more favorable over the reinforcement learning method when working with real robot platforms. However, it is not trivial to handle the domain-shift between expert guidance and the real robot trajectories in supervised learning methods.

Fig. 2: An illustration of three visual modalities used in our network. Top row: The RGB image (left) and the distance map from laser scanner (right). Bottom row: The 3D point cloud (left) and the overlay of the distance map in 3D point cloud (right).

In this work, we choose the end-to-end supervised learning approach for the ease of deploying and testing in real robot systems. We first simulate the complex environments in physics-based simulation engine and collect a large-scale for supervised learning. We then proposed NMFNet, an effective deep learning framework to fuse visual input and allow the robots to navigate autonomously in complex environments.

Iii Multimodal Input

Complex environments such as natural cave networks or collapsed cities pose significant challenges for autonomous navigation due to their irregular structures, unexpected obstacles, and the poor lighting condition inside the environments. To overcome these natural difficulties, we use three visual input data in our method: RGB image , point cloud , and distance map obtaining from the laser sensor. Intuitively, the use of all three visual modalities ensures that the robot’s perception system has meaningful information from at least one modality during the navigation under challenging conditions such as lighting changes, sensor noise in depth channels due to reflective materials, or motion blur, etc.

In practice, the RGB images and point clouds are captured using a depth camera mounted in front of the robot while the distance map is reconstructed from the laser data. In complex environments, while the RGB images and point clouds can provide the visual semantic information for the robot, the robot may need more useful information such as the distance map due to the presence of various obstacles. The distance map is reconstructed from the laser data as follows:


where is the coordinate of point on 2D distance map. is the coordinate of robot. is the distance from the laser sensor to the obstacle, and is the incremental angle of the laser sensor.

To keep the low latency between three visual modalities, we use only one laser scan to reconstruct the distance map. The scanning angle of the laser sensor is set to to cover the front view of the robot. This will help the robots aware of the obstacles from its far left/right hand side, since these obstacles may not be captured in the front camera which provides the RGB images and point cloud data. We notice that all three modalities are synchronized at each timestamp to ensure the robot is observing the same viewpoint at each control state. Fig. 2 shows a visualization of three visual modalities used in our method.

Iv Methodology

As motivated by the recent trend in autonomous driving [28, 29, 33], our goal is to build a framework that can directly map the input sensory data , to the output steering commands . To this end, we design NMFNet with three branches to handle three visual modalities. The architecture of our network is illustrated in Fig. 4.

Iv-a 2D Features

Learning meaningful features from 2D images is the key to success in many vision tasks. In this work, we use ResNet8 to extract deep features from the input RGB image and laser distance map. The ResNet8 has

residual blocks, each block consists of a convolutional layer, ReLU, skip links and batch normalization operations. A detailed visualization of ResNet8 architecture can be found in Fig. 

3. As in [29]

, we choose ResNet8 to extract deep features from the 2D images since it is a light weight architecture and can achieve competitive performance while being robust again the vanishing/exploding gradient problems during training.

Fig. 3: A detailed visualization of ResNet8.
Fig. 4: An overview of our NMFNet architecture. The network consists of three branches: The first branch learns the depth features from the distance map. The second branch extracts the features from RGB images, and the third branch handles the point cloud data. We then employ the 2D-3D fusion to predict the steering angle.

Iv-B 3D Point Cloud

While the robot is navigating in complex environments, relying on 2D visual information maybe not enough. For example, the RGB images from the front camera of the robot are widely used in many end-to-end visual navigation systems [29, 33], however, in environments such as natural caves, the lighting condition can be a challenge to obtain clear visual images. Therefore, we propose to use point cloud as another visual input for autonomous navigation in complex environments.

Specifically, we use the point cloud associated with the RGB images from the front camera of the robot. Although the point cloud from depth camera is ordered, it contains many points (e.g., in our camera setting, we have points in each cloud), including many missing points. In practice, due to the memory constraints, it is impractical to learn the geometric information from the huge point clouds [46]. Therefore, we remove all the missing points and randomly select points to represent the cloud, hence the point cloud becomes unordered. The point cloud is expected to provide more geometry information of the environment for the network. However, extracting features from the unordered cloud is not a trivial task since the network needs to be invariant to all permutations of the input set.

To extract the point cloud feature vector, our network has to learn a model that is invariant to input permutation. As motivated by 

[46], we extract the features from the unordered point cloud by learning a symmetric function on transformed elements. Given an unordered point set with , we can define a symmetric function that maps a set of unordered points to a feature vector as follow:


where is a vector max operator that takes input vectors and returns a new vector of the element-wise maximum; and are usually presented by neural networks.

In practice, and function are approximated by an affine transformation matrix with a mini multi-layer preception network (i.e., T-net) and a matrix multiplication operation. Given the unordered input points, we apply this transformation twice to learn the geometric feature from the pount cloud: input transformation and feature transformation. The input transformation uses raw point cloud as input and regresses to a

matrix. It consists of a three multi-layer perceptron network with layer output sizes are 64, 128, 1024, respectively. The output matrix is first initialized as an identity matrix and all layers have ReLU and batch normalization (except the last layer). We then feed the output of the first transformation to the second transformation network which has the same architecture and generates a

matrix as output. This matrix is also initialized as an identity and presents the learned features from the point cloud.

Iv-C Multimodal Fusion

Given the features from the point cloud branch and the RGB image branch, we first do an early fusion by concatenating the features extracted from the input cloud with the deep features extracted from the RGB image. The intuition is that since both the RGB image and the point cloud are captured using a camera with the same viewpoint, fusing their features will let the robot aware of both visual information from RGB image and geometry clue from point cloud data. This concatenated feature is then fed through two

convolutional layers. Finally, we combine the features from 2D-3D fusion with the extracted features from the distance map branch. The steering angle is predicted from a final fully connected layer keeping all the features from the multimodal fusion network.

Iv-D Training

To train an end-to-end navigation network, two popular approaches are used: classification loss [32] and regression loss[29]

. Methods use classification loss first bin the ground-truth steering angles into small and discrete groups, then learn possible controls as a classification problem using a softmax loss function. In practice, we have observed the instability during training due to the highly imbalanced statistic in the dataset. Therefore, we employ the regression loss to train the network end-to-end using the mean squared error (MSE)

loss function between the ground-truth human actuated control, , and the predicted control from the network :


Apart from training with normal data, we also employ the training method using domain randomisation [47]. As shown in [47], this simple technique can effectively improve the generalization of the network when only simulation data are available for training.

V Experiments

V-a Dataset

Data Collection Unlike the traditional autonomous navigation problem for autonomous car or UAVs that can collect data in real-world setting [28, 29, 30], it is not a trivial task to build complex environments such as collapsed cities or a collapsed houses in real life. Therefore, we create the simulation models of these environments in Gazebo and collect the visual data from simulation. In particular, we collect the data from three types of complex environment:

  • Collapsed house: The house suffered from an accident or a disaster (e.g. an earthquake) with random objects on the ground.

  • Collapsed city: Similar to the collapsed house, but for the outdoor environment. In this scenario, the road has many debris from the collapsed house/wall.

  • Natural cave: A long natural tunnel in a poor brightness condition with irregular geological structures.

To build the simulation environments, we first create the 3D model of normal daily objects in indoor and outdoor environments (e.g. beds, tables, lamps, computers, tools, trees, cars, rocks, etc.), including broken objects (e.g. broken vases, broken dishes, and debris). These objects are then manually chosen and placed in each environment to create the entire simulated environment.

For each environment, we use a mobile robot model equipped with a laser sensor and a depth camera mounted on top of the robot to collect the visual data. The robot is controlled manually to navigate around each environment. We collect the visual data when the robot is moving. All the visual data are synchronized with a current steering signal of the robot at each timestamp.

Data Statistic In particular, we create 3D object models to build the complex environments. These objects are used to build environments in total (i.e., instances for each environment). In average, the collapsed house environments are built with approximately objects in an area of . The collapsed city has objects and spread in while the natural cave environments are built with objects in approximately area. We manually control the robot in hours to collect the data.

In total, we collect around visual data triples ( for each environment type, resulting a large-scale dataset with records of synchronized RGB image, point cloud, laser distance map, and ground-truth steering angle. Around of the dataset are collected when we use domain randomisation by apply random texture to the environments (Fig. 5). For each environment, we use data for training and data for testing. All the 3D environments and our dataset will be made publicly available to encourage further research.

Fig. 5: Different robot’s views in our simulation of complex environments: collapsed city, natural cave, and collapsed house. Top row: RGB images from robot viewpoint in normal setting. Other rows: The same viewpoint when applying domain randomisation to the simulated environments.

V-B Implementation

We implement our network using Tensorflow framework 


. The network is optimized using stochastic gradient descent with the fix

learning rate and momentum. The input RGB image and distance map size are () and (), respectively, while the point cloud data are sampled to points. We train the network with the batch size of and the training time is approximately hours on an NVIDIA 2080 GPU.

V-C Baseline

We compare our method with the following recent state-of-the-art methods in autonomous navigation: DroNet [29], VariationNet [31]. We also present the result for Inception-V3 to serve as a baseline of deep architecture. We note that DroNet uses ResNet8 as the backbone to predict the steering angle and collision probability from RGB input. Since our dataset does not have collision ground-truth, we disable the collision probability branch of DroNet, and only use the regression branch to predict the result. Intuitively, DroNet architecture is similar to our RGB branch. All the baselines are trained with the data from domain randomisation. We show the results of our NMFNet under two settings: with domain randomisation (NMFNet with DR), and without using training data from domain randomisation (NMFNet without DR).

V-D Results

(a) Input Image
(b) RGB
(c) RGB + Point Cloud
(d) Fusion
Fig. 6: The activation map when different modalities are used to train the network. From left to right: (a) The input RGB image. (b) The activation map of the network uses only the RGB image as input. (c) The activation map of the network uses both RGB image and point cloud as input. (d) The activation map of the network uses fusion input (both RGB, point cloud and distance map). Overall, the network uses fusion input with all three modalities produces the most reliable heat map for navigation.

Table I summarizes the regression results using Root Mean Square Error (RMSE) of our NMFNet and other state-of-the-art methods. From the table, we can see that our NMFNet outperforms other methods by a significant margin. In particular, our NMFNet trained with domain randomisation data achieves RMSE which is a clear improvement over other methods using only RGB images such as DroNet [29]. This also confirms that using multi visual modalities input as in our fusion network is the key to successfully navigate in complex environments.

Input House City Cave Average
DroNet [29] RGB 0.938 0.664 0.666 0.756
Inception-V3 [49] RGB 1.245 1.115 1.121 1.16
VariationNet [31] RGB 1.510 1.290 1.507 1.436
NMFNet without DR Fusion 0.482 0.365 0.367 0.405
NMFNet with DR Fusion 0.480 0.365 0.321 0.389
TABLE I: RMSE Scores on the Test Set

Within three complex environment types, we observe that the RMSE of the collapsed house results are higher than the collapsed city and the natural cave. A possible reason is that the collapsed house environment is much smaller than others while having more objects. Therefore, it would be more challenging for the robot to navigate in the collapsed house without colliding with the objects. From Table I, we also notice that by employing domain randomisation, our NMFNet with DR shows a good improvement in comparison with the setting without domain randomisation (NMFNet without DR). On the other hand, the setup with VariationNet approach [31] has the highest error in all three complex environments while the Inception-V3 shows reasonable results.

V-E Contribution of Visual Modalities

To understand the contribution of each modality to the results, we perform the following experiment: We first train a network that uses only a single modality (either RGB, distance map, or point cloud) as the input. Technically, each network in this experiment is a branch of our NMFNet. We then perform similar experiments using networks with two branches (i.e., RGB + distance map, RGB + point cloud, and distance map + point cloud) as the input. All the networks use the training set with extra data from domain randomisation.

Table II shows the RMSE scores when different modalities are used to train the system. We first notice that the network that uses only point cloud data as the input does not converge. This confirms that learning meaningful features from point cloud data is challenging, especially in complex environments. On the other hand, we achieve a surprisingly good result when the distance map modality is used as the input. The other combinations between two modalities show reasonable accuracy, however, we achieve the best result when the network is trained end-to-end using the fusion from all three modalities: rgb, distance map from laser camera, and point cloud from the depth camera. To further verify the contribution of each modality, we employ Grad-CAM [50] to visualize the activation map of the network when different modality is used. Fig. 6 shows the qualitative visualization under three input settings: RGB, RGB + point cloud, and fusion. From Fig. 6, we can see that from a same viewpoint, the network that uses fusion data makes the most logical decision since its attention lays on feasible regions for navigation, while other networks trained with only RGB image or RGB + point cloud show more noisy attention.

We also note that the inference time of our NMFNet is approximately 100ms on an NVIDIA 2080 GPU. This allows our method to be used in a wide range of robotic applications. More qualitative results including the deployment of NMFNet on BeetleBot [51] can be found at

Input House City Cave Average
RGB 0.938 0.664 0.666 0.756
Distance Map 0.730 0.579 0.602 0.637
Point Cloud - - - -
RGB + Point Cloud 0.718 0.499 0.783 0.667
RGB + Distance Map 0.568 0.452 0.503 0.508
Distance Map + Point Cloud 0.631 0.474 0.592 0.566
Fusion 0.480 0.365 0.321 0.389
TABLE II: RMSE Scores of Networks using Different Modality

Vi Conclusions and Future Work

We propose NMFNet, an end-to-end and real-time deep learning framework for autonomous navigation in complex environments. Our network has three branches and effectively learns the visual fusion input data. Furthermore, we show that the use of mutilmodal sensor data is essential for autonomous navigation in complex environments. The extensively experimental results show that our NMFNet outperforms recent state-of-the-art methods by a fair margin while achieving real-time performance and generalizing well on unseen environments.

Currently, our NMFNet shows limitation on scenarios when the robot has to cross small debris or obstacles. In the future, we would like to quantitatively evaluate and address this problem. Another interesting direction is to combine our method with uncertainty estimation [25] or a goal-driven navigation task [26] for more wide-range of applications.