## 1 Introduction

In recent years, autonomous robotic systems have been deployed in many scenarios (e.g., autonomous driving, manufacturing and surveillance). Drones are one class of such systems, distinguished by their ability to fly. Navigation is essential to drone flight, as it enables effective exploration and recognition of unknown environments. Yet drone navigation remains challenging, especially when planning a path to the target/destination that is as short as possible while avoiding collisions with objects in unexplored space. Conventional navigation relies heavily on human expertise: an operator intuitively designs the flight trajectory based on the spatial layout within the visible range. The resulting navigation system lacks global knowledge of the scene, leading to unsatisfactory or even failed path planning.

To better leverage global information about the 3D environment, research on drone navigation has focused on collecting and memorizing environmental information during navigation. Typically, existing works [henriques2018mapnet, bansal2019combining, bian2019unsupervised] employ mapping techniques to construct 2D/3D maps of vacant/occupied space. The resulting map contains rich geometric relationships between objects, which help navigation. There are also navigation approaches based on visual information [chen2019behavioral, gupta2017cognitive, bansal2019combining], which save the computational overhead of constructing maps. Nonetheless, these works condition the accuracy of navigation purely on either geometric or visual information.

In this paper, we use a 2.5D height map for autonomous drone navigation. A growing number of computer applications use height maps to represent the boundaries of objects (e.g., buildings or furniture). Nonetheless, the quality of a given height map is not guaranteed, as the mapping process is likely to involve incomplete or out-of-date information. Here, we advocate fusing geometric and visual information for a more robust construction of the height map. Recent research [tatarchenko2019single, chen2019learning] on 3D object/scene understanding has also demonstrated that the geometric relationships between objects and the visual appearance of scenes are closely correlated. We thus propose a *Visual-Geometric Fusion Network* (VGF-Net) that dynamically updates the height map during drone navigation using newly captured images (see Figure 1).

More specifically, as illustrated in Figure 2, the network takes an initial rough height map together with a sequence of RGB images as input. We use convolutional layers to compute the visual and geometric information and renew the height map. Next, we apply a simultaneous localization and mapping (SLAM) [mur2017orb] module to extract a sparse set of 3D keypoints from the image sequence. These keypoints are used along with the renewed height map to construct a novel *Visual-Geometric Representation*, which is passed to a *Directional Attention Model*. This attention model exchanges visual and geometric information among objects in the scene, providing object relationships that are used to refine the height map and the corresponding keypoints simultaneously, which in turn supports path planning [koenig2002d] at each navigation moment. Compared to dense point clouds, which require time-consuming depth estimation [chaurasia2013depth] and costly processing, the sparse keypoints we use are fast to compute yet capture useful geometric information without much redundancy. As the drone flies over more places, our network acquires and fuses more visual and geometric information, increasing the precision of the height map and consequently the reliability of autonomous navigation.

We train and evaluate our method on a benchmark of seven large-scale urban scenes and six complex indoor scenes for height map construction and drone navigation. The experimental results and comparative statistics demonstrate the effectiveness and robustness of the proposed VGF-Net.

## 2 Related Work

There is a large body of research on navigation systems that allow robots to explore the real world. Below, we mainly survey drone navigation and environment mapping, as they are the most relevant to our work: in both, the navigation system is driven by environment data.

### 2.1 Drone Navigation

Modern drone systems are generally equipped with various sensors (e.g., RGB-D camera, radar and GPS), which help the hardware achieve accurate perception of the real world. Typically, the data captured by the sensors is used for mapping (i.e., the construction of a map), providing comprehensive information for planning the drone's path. During navigation, traditional methods [henriques2018mapnet, savinov2018semi] compute the trajectory of the drone based on prescribed maps. However, constructing a precise map is generally expensive and time-consuming. Thus, recent works [chen2019behavioral, gupta2017cognitive, bansal2019combining] simplify the construction of the map to facilitate cheaper navigation.

Advances in deep learning have significantly improved the robustness of visual navigation, leading to the emergence of many navigation systems that do not rely on given maps. Kim et al. [kim2015deep] and Padhy et al. [padhy2018deep] use a classification neural network to predict the direction (e.g., right, left or straight) of the moving drone. Furthermore, Loquercio et al. [loquercio2018dronet] and Mirowski et al. [mirowski2018learning] use neural networks to compute the flying angle and the risk of collision, which provide more detailed information for controlling the drone's flight. Note that the above methods learn the actions of the drone from human annotations. The latest works employ deep reinforcement learning [tai2017virtual, zhu2017target, wang2019autonomous] to optimize the network, enabling more flexible solutions for autonomous drone navigation in novel environments.

Our approach uses a rough 2.5D height map to increase the success rate of navigation in complex scenes with varied spatial layouts of objects. Unlike existing methods that perform mapping before navigation, we update the height map in real time during navigation, largely alleviating the negative impact of problematic mapping results.

### 2.2 Mapping Technique

The mapping technique is fundamental in the drone navigation. The techniques of 2D mapping have been widely used in the navigation task. Henriques et al. [henriques2018mapnet] and Savinov et al. [savinov2018semi] use 2D layout map to store useful information, which is learned by neural networks from the image data of 3D scenes. Chen et al. [chen2019behavioral] use the 2D topological map, which can be constructed using the coarse spatial layout of objects, to navigate the robot in an indoor scene. Different from the methods that consider the 2D map of an entire scene, Gupta et al. [gupta2017cognitive] unify the mapping and 2D path planning to rapidly adjust the navigation with respect to the surrounding local environment. Bansal et al. [bansal2019combining] utilize sparse waypoints to represent the map, which can be used to generate a smooth pathway to the target object or destination.

Compared to 2D mapping, 3D mapping provides much richer spatial information for the navigation system. Wang et al. [wang2017stereo] use visual odometry to capture the geometric relationship between 3D points, which is important for reconstructing the 3D scene. Engel et al. [engel2014lsd, engel2017direct] integrate the tracking of keypoints into the mapping process, harnessing temporal information to produce a more consistent mapping of the global environment. Furthermore, Huang et al. [huang2020clustervo, ClusterSLAM] use a probabilistic Conditional Random Field model and a noise-aware motion affinity matrix to effectively track both moving and static objects. Wang et al. [piecewiseWang] use planes as a geometric constraint to reconstruct the whole scene. Besides 3D points, depth information is also important to 3D mapping. During the mapping process, Tateno et al. [tateno2017cnn] and Ma et al. [ma2018sparse] use neural networks to estimate the depth map of a single image, for a faster construction of the 3D map. However, the fidelity of depth estimation is bounded by the scale of the training data. To address this, Kuznietsov et al. [kuznietsov2017semi], Godard et al. [godard2017unsupervised] and Bian et al. [bian2019unsupervised] train the depth estimation network in a semi-supervised/unsupervised manner, where the consistency between images is learned.

Nowadays, a vast number of real-world 3D models and applications are emerging, such as Google Earth, so abundant height map data is available for training drone navigation systems. Nonetheless, the accuracy and timeliness of such data cannot be guaranteed, which makes it hard to use directly in practice. We exploit a fused visual-geometric representation to dynamically update the given height map during navigation, yielding a significant increase in the success rate of autonomous drone navigation in various novel scenes.

## 3 Overview

The core idea behind our approach is to fuse visual and geometric information for the construction of the height map. This is done by our *Visual-Geometric Fusion Network* (VGF-Net), which computes a visual-geometric representation that captures the visual and geometric consistency between the 3D keypoints and the object boundaries characterized in the height map. VGF-Net uses the fused representation to refine the keypoints and the height map at each moment during drone navigation. Below, we outline the architecture of VGF-Net.

As illustrated in Figure 2, at the moment (), the network takes the RGB image and the associated height map as input. The image is fed to convolutional layers to compute the visual representation . The height map is also input to the convolutional layers for the geometric representation . The visual and geometric representations are fused to compute the residual update map that updates the height map to , providing more consistent information for the subsequent steps.

Next, we use the SLAM [mur2017orb] module to compute a sparse set of 3D keypoints , based on the images . We project these keypoints to the renewed height map . For the keypoint , we compute a set of distances , where denotes the distance from the keypoint to the nearest object boundary along the direction (see Figure 3(a)). Intuitively, the keypoint, which is extracted around the objects in the 3D scene, is also near to the boundaries of the corresponding objects in the height map. This relationship between the keypoint and the object can be represented by the visual and geometric information in the scene. Specifically, this is done by fusing the visual representation , geometric representation (learned from the renewed height map ) and the distances to form a novel *Visual-Geometric* (VG) representation for the keypoint . For all keypoints, we compute a set of VG representations .
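The directional distances described above can be illustrated with a small ray-marching routine. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `directional_distances`, the default of 8 evenly spaced directions, the unit step size and the use of `math.inf` when no object is hit are all hypothetical choices. It also assumes that ground cells have height 0 and that any cell with a larger height belongs to an object.

```python
import math
import numpy as np


def directional_distances(height_map, keypoint_rc, num_directions=8):
    """Distance from a projected keypoint to the first object cell per direction.

    height_map: 2D array; 0 is ground, values > 0 are objects (assumed convention).
    keypoint_rc: (row, col) of the keypoint on the height map.
    Returns a list with one Euclidean distance per direction
    (math.inf when the ray leaves the map without hitting an object).
    """
    rows, cols = height_map.shape
    r0, c0 = keypoint_rc
    distances = []
    for k in range(num_directions):
        # Evenly spaced directions around the keypoint (hypothetical choice).
        angle = 2.0 * math.pi * k / num_directions
        dr, dc = math.sin(angle), math.cos(angle)
        hit = math.inf
        # March one cell at a time until an object or the map border is reached.
        for step in range(1, rows + cols + 1):
            r = int(round(r0 + step * dr))
            c = int(round(c0 + step * dc))
            if not (0 <= r < rows and 0 <= c < cols):
                break
            if height_map[r, c] > 0:
                hit = math.hypot(r - r0, c - c0)
                break
        distances.append(hit)
    return distances
```

For instance, with a single object cell three columns to the right of the keypoint, the distance for the first direction is 3 and the other rays report no hit.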

Finally, we employ a *Directional Attention Model* (DAM), which takes input as the VG representations , to learn a residual update map to refine the height map . The DAM produces a new height map that respects the importance of each keypoint to the object boundaries in different directions (see Figure 3(b)). Meanwhile, we use DAM to compute a set of spatial offsets to update the keypoints, whose locations are imperfectly estimated by the SLAM. We use the height map for dynamic path planning [koenig2002d] at the moment, and meanwhile input the image and the height map to VGF-Net at this moment for next update. As drone flies, the network achieves more accurate information and works more robustly for simultaneous drone navigation and height mapping.

## 4 Method

We now introduce our VGF-Net in more detail. The network extracts visual and geometric information from the RGB images, the associated 2.5D height map and 3D keypoints. In what follows, we formally define the information fusion that produces the visual-geometric representation, which is then used for the refinement of the height map and keypoints.

### 4.1 Residual Update Strategy

The VGF-Net refines the height map and keypoints iteratively, as the drone flies to new places and captures new images. We divide this refinement process into separate moments. At the moment, we feed the RGB image and the height map into the VGF-Net, computing the global visual representation and the geometric representation as:

(1) |

where and denote the two sets of convolutional layers. Note that the value of each location on represents the height of object, and we set the height of ground to be 0. We concatenate the representations and for computing a residual update map , which is used to update the height map as:

(2) |

where

(3) |

Here, is a renewed height map, and denotes a set of convolutional layers. Compared to directly computing a new height map, the residual update strategy (as formulated by Eq. (2)) adaptively reuses the information of . More importantly, we learn the residual update map from the new content captured at the moment. It facilitates a more focused update on the height values of regions that are unexplored before the moment. The height map is fed to an extra set of convolutional layers to produce the representation , which will be used for the construction of the visual-geometric representation.
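The residual update of Eq. (2) can be sketched in a few lines. The snippet below is a simplified illustration, not the actual network: `predict_residual` is a hypothetical stand-in for the learned convolutional layers, the feature maps are assumed to be stored channel-first, and they are assumed to already share the spatial resolution of the height map.

```python
import numpy as np


def residual_update(prev_height_map, visual_feat, geometric_feat, predict_residual):
    """Renew a height map by adding a predicted residual to the previous map.

    prev_height_map: (H, W) array, the height map from the previous moment.
    visual_feat, geometric_feat: (C1, H, W) and (C2, H, W) feature maps.
    predict_residual: callable mapping (C1 + C2, H, W) -> (H, W); a placeholder
        for the learned convolutional layers (hypothetical).
    """
    # Concatenate the two representations along the channel axis.
    fused = np.concatenate([visual_feat, geometric_feat], axis=0)
    residual = predict_residual(fused)
    # The previous map is reused; only the residual is newly predicted.
    return prev_height_map + residual
```

Because the previous map enters the sum unchanged, a residual of zero leaves already-correct regions untouched, which is the property the residual strategy relies on.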

### 4.2 Visual-Geometric Representation

We conduct the visual-geometric information fusion to further refine the height map. To capture the geometric relationship between objects, we use a standard SLAM [mur2017orb] module to extract a sparse set of 3D keypoints from the sequence of images . Given the keypoint in the camera coordinate system, we project it to the 2.5D space as:

(4) |

Here, is decided by a pre-defined scale factor, which could be calculated at the initialization of the SLAM system or by GPS adjustment. and translate the origin of the 3D point set from the camera to the height map coordinate system. In the height map coordinate system, the drone is located at , where represent the width of the height map.

Note that the first two dimensions of indicate the location on the height map, and the third dimension indicates the corresponding height value. The set of keypoints are used for constructing the visual-geometric representations.

Next, for each keypoint , we compute its distances to the nearest objects in different directions. Here, we refer to objects as the regions that have larger height values than the ground (with height value of 0) in the height map . As illustrated in Figure 3(a), we compute the Euclidean distance along the direction, from to the first location, where the height value is larger than 0. We compute a set of distances for directions, then use (see Eq. (1)), and this distance set to form the VG representation as:

(5) |

where

(6) |

Here, denotes the feature vectors located in in the map . In Eq. (5), is represented as a weighted map with the resolution equal to the geometric representation ( by default), where acts as a weight of importance that is determined by the distance from the keypoint to the nearest object boundary along the direction. As formulated in Eq. (5) and Eq. (6), a longer distance decays the importance. In addition, we use independent sets of fully connected layers (i.e., and in Eq. (5)) to learn important information from and . This allows content that is far from to still have an impact on . We construct the VG representation for each keypoint in , and each VG representation captures the visual and geometric information around the corresponding keypoint. Based on the VG representations, we propagate the information of the keypoints to each location on the height map, where the corresponding height value is refined. We also learn temporal information from the VG representations to refine the spatial locations of keypoints at the moment, as detailed below.

### 4.3 Directional Attention Model

We use DAM to propagate the visual and geometric information, from each keypoint to each location on the height map, along different directions. More formally, for a location on the height map , we conduct the information propagation that yields a new representation as:

(7) |

Along the second dimension of the representation , we perform max pooling to yield as:

(8) |

As illustrated in Eq. (7), summarizes the influence of all keypoints along direction. We perform max pooling on the set (see Eq. (8)), attending to the most information along a direction to form the representation (see Figure 3(b)). To further refine the height map, we use the representation to compute another residual update map , which is added to the height map to form a new height map as:

(9) |

where

(10) |

Again, denotes a set of convolutional layers. We make use of the new height map for the path planning at the moment.
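The pooling step of Eq. (8) can be illustrated in isolation. The sketch below assumes that the per-location output of Eq. (7) is stored as an array with channels along the first axis and directions along the second; this layout and the function name `pool_over_directions` are assumptions made for illustration, and Eq. (7) itself is not reproduced.

```python
import numpy as np


def pool_over_directions(per_direction_feat):
    """Max-pool a (channels, num_directions) array along the direction axis.

    For each channel, only the strongest response over all directions is kept,
    giving a (channels,) vector for one height-map location.
    """
    return per_direction_feat.max(axis=1)
```

Taking the maximum rather than the mean keeps the single most informative direction per channel, which matches the stated goal of attending to the most informative content along a direction.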

We refine not only the 2.5D height map but also the 3D keypoints at the moment. Assume that we use SLAM to produce a new set of keypoints . We remark that the keypoint sets at the and moments are not necessarily the same. To refine the new keypoint , we use DAM to compute the representation as:

(11) |

In this way, DAM distills the information of keypoints at the moment, which is propagated to the next moment. Again, we use max pooling to form the spatial offset for updating keypoint as:

(12) |

We take the average of the updated keypoints and the estimated keypoints in place of the original one to construct the VG representation at the moment.

### 4.4 Training Details

We use the loss function for training the VGF-Net as:

(13) |

where is the ground-truth height map. In practice, each mini-batch for the standard SGD solver consists of 8 pairs of RGB image and height map (). We set the height and width of each RGB image () and the height map (). The training set contains nearly 24000 images randomly sampled from 3 scenes, and we test the model on 24000 samples drawn from the other 3 scenes. Details about the dataset can be found in Sec. 5. We train the network for 30 epochs and use the final snapshot of the network parameters for testing. The learning rate is set to 0.001 for the first 15 epochs and then decayed to 0.0001 for a more stable optimization.
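The step schedule described above can be written as a small helper. This is a framework-agnostic sketch: the function name `learning_rate` and the zero-based epoch indexing are assumptions, while the values (30 epochs, switch after 15 epochs, 0.001 then 0.0001) come from the text.

```python
def learning_rate(epoch, total_epochs=30, switch_epoch=15,
                  base_lr=1e-3, decayed_lr=1e-4):
    """Return the learning rate for a zero-based epoch index.

    Epochs 0..14 use base_lr; epochs 15..29 use decayed_lr.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError(f"epoch must be in [0, {total_epochs}), got {epoch}")
    return base_lr if epoch < switch_epoch else decayed_lr
```

In a training loop, the returned value would be assigned to the optimizer at the start of each epoch.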

By default, the backbone of and is a ResNet-18, while the remaining and are two stacked convolutional layers with max-pooling and batch normalization.

Note that one of our contributions is to learn the spatial offsets of 3D keypoints without explicitly using any ground-truth data. This is done by modeling the computation of the spatial offsets as a differentiable function of the VG representation. In this way, we enable end-to-end learning of the spatial offsets, where the related network parameters are optimized by back-propagated gradients. This significantly reduces the effort of data annotation, while allowing the network training to be flexibly driven by data.

When constructing the VG representation, we set the number of directions for each keypoint, and the number of keypoints at each moment. We remark that these hyper-parameters are chosen based on the validation results.

## 5 Results and Discussion

| scene | area () | objects (#) | model size () | texture images (#) | texture size () |
|---|---|---|---|---|---|
| New York | 7.4 | 744 | 86.4 | 762 | 122 |
| Chicago | 24 | 1629 | 146 | 2277 | 227 |
| San Francisco | 55 | 2801 | 225 | 2865 | 322 |
| Las Vegas | 20 | 1408 | 108 | 1756 | 190 |
| Shenzhen | 3 | 1126 | 50.3 | 199 | 72.5 |
| Suzhou | 7 | 168 | 191 | 395 | 23.7 |
| Shanghai | 37 | 6850 | 308 | 2285 | 220 |

| method | average error () | | | accuracy w.r.t. error (%) | | |
|---|---|---|---|---|---|---|
| | San Francisco | Shenzhen | Chicago | San Francisco | Shenzhen | Chicago |
| w/o fusion | 4.57 | 4.57 | 4.49 | 68.95% | 68.02% | 70.05% |
| w/ fusion | 2.37 | 2.93 | 3.41 | 85.09% | 83.63% | 78.44% |
| w/ fusion and memory | 2.81 | 3.44 | 4.02 | 79.86% | 79.20% | 72.86% |
| w/ fusion, memory and exchange | 2.35 | 3.04 | 3.80 | 80.54% | 82.36% | 74.73% |
| full strategy | 1.98 | 2.72 | 3.10 | 85.71% | 86.13% | 80.46% |

| method | accuracy w.r.t. error (%) | | | accuracy w.r.t. error (%) | | |
|---|---|---|---|---|---|---|
| | San Francisco | Shenzhen | Chicago | San Francisco | Shenzhen | Chicago |
| w/o fusion | 75.02% | 74.08% | 76.86% | 83.96% | 83.96% | 85.71% |
| w/ fusion | 89.20% | 87.39% | 84.12% | 93.87% | 92.25% | 91.18% |
| w/ fusion and memory | 86.35% | 84.56% | 80.36% | 93.00% | 91.31% | 89.51% |
| w/ fusion, memory and exchange | 86.13% | 86.43% | 81.41% | 93.33% | 91.85% | 89.94% |
| full strategy | 89.22% | 88.90% | 85.30% | 94.10% | 92.56% | 91.67% |

### 5.1 Description of Experimental Dataset

To promote the related research on drone navigation, we newly collect a 3D urban navigation dataset. This dataset contains 7 models of different city scenes (see Figure 4).

Note that New York, Chicago, San Francisco and Las Vegas are Google Earth models that we downloaded. They resemble the real-world scenes in appearance, but most of the objects they contain are buildings. Shenzhen, Suzhou and Shanghai were built manually from maps by professional modelers and contain rich 3D objects (e.g., buildings, trees, street lights and road signs) and other stuff (e.g., ground, sky and sea). These 3D scenes cover various spatial configurations of objects, building styles and weather conditions, and thus provide challenging data for evaluating a navigation system. The models are input to a renderer to produce sequences of RGB images. All RGB images and the associated 2.5D height maps are used to form a training set (i.e., New York, Las Vegas and Suzhou) and a testing set (i.e., San Francisco, Shenzhen and Chicago). We provide more detailed statistics of the dataset in Table 1.

VGF-Net takes a rough, imperfect height map as input and outputs an accurate one. To create such training inputs, we use 5 types of manipulation (i.e., translation, height increase/decrease, size dilation/contraction, creation and deletion) to disturb the object boundaries in the ground-truth height map. Each disturbance increases or decreases the height values by 10 at certain map locations. See Figure 5 for an illustration of our manipulations.
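Two of the five manipulations, height increase/decrease and deletion, can be sketched as follows. This is an illustrative approximation rather than the authors' data pipeline: the rectangular `region` format, the function names, the choice to change only occupied cells and the clamping of decreased heights at ground level 0 are all assumptions. Only the step of 10 comes from the text.

```python
import numpy as np


def change_height(height_map, region, delta=10.0):
    """Raise or lower object heights inside a rectangular region.

    region: (row_start, row_end, col_start, col_end), end-exclusive (hypothetical).
    Only cells that already belong to an object (height > 0) are changed, and
    results are clamped at the ground level 0 (both are assumptions).
    """
    r0, r1, c0, c1 = region
    out = np.array(height_map, dtype=float)  # copy; the input stays untouched
    patch = out[r0:r1, c0:c1]                # view into the copy
    occupied = patch > 0
    patch[occupied] = np.maximum(patch[occupied] + delta, 0.0)
    return out


def delete_object(height_map, region):
    """Remove an object by resetting the region to ground level."""
    r0, r1, c0, c1 = region
    out = np.array(height_map, dtype=float)
    out[r0:r1, c0:c1] = 0.0
    return out
```

Applying such functions to a ground-truth map yields the kind of imperfect input the network is trained to correct, while the untouched ground truth serves as the target.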

### 5.2 Different Strategies of Information Fusion

The residual update, the VG representation and DAM are the critical components of VGF-Net, and together they define the strategy of information fusion. Below, we conduct an ablation study by removing these components and examining the effect on the accuracy of height mapping (see Table 2).

First, we report the performance of height mapping using visual information only, with any fusion of visual and geometric information disabled. Here, the visual information is learned from RGB images (see the entries “w/o fusion” in Table 2). Visual information alone is insufficient for reconstructing height maps, which requires modeling the geometric relationships between objects, so this variant performs worse than the other methods, which use geometric information.

Next, we examine the effectiveness of the residual update strategy. At each moment, the residual update allows VGF-Net to reuse the mapping result produced earlier. This strategy, in which useful visual and geometric content is distilled and memorized across all moments, improves the reliability of height mapping. Accordingly, removing the residual update from VGF-Net (see the entries “w/ fusion” in Table 2) degrades the performance of height mapping relative to the full model (see the entries “full strategy”).

We further study the effect of the VG representation on the performance. The VG representation can be regarded as an information link: it contains fused visual and geometric information, which is exchanged among objects. Without the VG representation, we use independent sets of convolutional layers to extract the visual and geometric representations from the image and the height map, respectively. The representations are simply concatenated to compute the residual update map (see the entries “w/ fusion and memory” in Table 2). This effectively cuts off the communication between objects and leads to performance drops on almost all scenes, compared to our full strategy of information fusion.

We find that using the memory of height values lags behind the second method, which does not use memory (see the entries “w/ fusion” in Table 2). We attribute this to the fact that information fusion with memory easily accumulates errors in the height map over time. Thus, it is critical to compute the VG representation based on the memorized information, enabling the information exchange between objects (see the entries “w/ fusion, memory and exchange”). This exchange provides richer object relationships that address the error accumulation problem and significantly assist height mapping at each moment.

Finally, we investigate the importance of DAM (see the entries “w/ fusion, memory and exchange” in Table 2). We solely remove DAM from the full model, by directly using VG representations to compute the residual update map and spatial offsets for refining the height map and keypoints. Compared to this fusion strategy, our full strategy with DAM provides a more effective way to adjust the impact of each keypoint along different directions. Therefore, our method achieves the best results on all testing scenes.

### 5.3 Sensitivity to the Quality of Height Map

| outdoor test | w/ depth | | w/o depth |
|---|---|---|---|
| | ground-truth depth | estimated depth [bian2019unsupervised] | VGF-Net |
| San Francisco | 100% | 27% | 85% |
| Shenzhen | 100% | 34% | 83% |
| Chicago | 100% | 19% | 82% |

| indoor test | w/ depth | | | w/o depth | | |
|---|---|---|---|---|---|---|
| | LSTM [gupta2017cognitive] | CMP [gupta2017cognitive] | VGF-Net | LSTM [gupta2017cognitive] | CMP [gupta2017cognitive] | VGF-Net |
| S3DIS | 71.8% | 78.3% | 92% | 53% | 62.5% | 76% |

As demonstrated in the above experiment, iterative information fusion is important for achieving a more global understanding of the 3D scene and thereby improving the height map estimation. During the iterative procedure, however, problematic height values may be memorized and negatively affect the height maps produced at future moments. In this experiment, we investigate the sensitivity of different approaches to the quality of the height maps by controlling the percentage of height values that differ from the ground-truth height maps. Again, we produce these perturbed height maps by using the disturbance manipulations to change the object boundaries.

At each moment, the disturbed height map is input to the trained model to compute the new height map, which is compared to the ground-truth height map to calculate the average error. In Figure 6, we compare the average errors produced by the 4 information fusion strategies that learn geometric information from height maps (i.e., the entries “w/ fusion”, “w/ fusion and memory”, “w/ fusion, memory and exchange” and “full strategy” in Table 2). As we can see, heavier disturbances generally degrade all strategies.
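The two kinds of measurement reported in Table 2 can be sketched as follows. The exact definitions are not spelled out in the text, so this sketch assumes that the average error is the mean absolute per-cell difference and that accuracy is the percentage of cells whose absolute error is within a threshold. The function names are hypothetical, and `threshold` is left as a required argument because the thresholds used in the tables are not recoverable here.

```python
import numpy as np


def average_error(pred, gt):
    """Mean absolute difference between predicted and ground-truth heights."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(diff)))


def accuracy_within(pred, gt, threshold):
    """Percentage of cells whose absolute height error is at most threshold."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(diff) <= threshold) * 100.0)
```

Reporting both quantities is useful because a few large errors can dominate the average while leaving the thresholded accuracy nearly unchanged.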

The strategy “w/ fusion and memory” performs the worst among all approaches, showing very high sensitivity to the quality of height maps. This result further evidences our finding in Sec. 5.2, where we have shown the unreliability of the method with memory of height information but without information exchange. Compared to other methods, our full strategy yields better results. Especially, given a very high percentage (80%) of incorrect height values, our full strategy outperforms other methods by remarkable margins. These results clearly demonstrate the robustness of our strategy.

### 5.4 Comparison on the Navigation Task

The quality of the 2.5D height maps estimated by height mapping largely determines the accuracy of drone navigation. In this experiment, we compare VGF-Net to different mapping approaches, which we divide into two groups. Approaches in the first group use depth information for height mapping; the depth can be acquired by a scanner [gupta2017cognitive] or estimated from RGB images by a deep network [bian2019unsupervised]. The second group consists of approaches that use only RGB images to reconstruct the height map. Apart from an initial height map, which can be easily obtained from various resources, VGF-Net requires only image inputs, but it can also accept depth information when available, without any change to the architecture. On our outdoor dataset, we set the flight height of the drone to 10 to 30 and evaluate the success rate of 3D navigation. An excessive flight height (e.g., 100) always leads to successful navigation, making the evaluation meaningless. On the indoor dataset [armeni20163d] (see also Figure 7 and Figure 8), we report the success rate of 2D drone navigation with the flight height fixed to 0.5. All results can be found in Table 3.

As expected, using accurate depth information yields a perfect success rate of navigation (see the entry “ground-truth depth”). Here, the depth data is computed directly from the synthesized 3D urban scenes and involves no noise. In practice, however, hardware limitations make it difficult for a scanner to capture accurate depth data of outdoor scenes. A simple alternative is to use a deep network to estimate the depth from the RGB image (see the entry “estimated depth”). Even with an advanced method [bian2019unsupervised], depth estimation often produces erroneous depth values for height mapping, severely misleading the navigation process. Similar to depth information, the sparse 3D keypoints used in our approach also provide valuable geometric information about objects. More importantly, VGF-Net uses visual cues to assist the learning of geometric representations. As a result, our approach without depth produces better results than height mapping with estimated depth. We have shown an example trajectory for 3D drone navigation in Figure 1. We also show examples of height mapping in Figure 9, where a height map with redundant boundaries (see the first two rows of Figure 9) or missing boundaries (see the last two rows of Figure 9) is input to VGF-Net. Even when the input height maps are very noisy, our network still recovers the height information precisely.

Depth data of indoor scenes (see Figure 7) can be acquired more easily. When depth information is available, we simply input the RGB image along with the associated depth to VGF-Net to produce the height map. We compare VGF-Net to the recent approach of [gupta2017cognitive] (see the entries “LSTM” and “CMP”), which produces state-of-the-art indoor navigation accuracies. Our method achieves a better result under the same training and testing conditions. Without depth, our approach still gives the best result among all image-based methods. This demonstrates the generality of our approach and its ability to stably learn useful information from different data sources. In Figure 8, we show more navigation trajectories planned by our approach in an indoor testing scene.

## 6 Conclusions and Future Work

The latest progress in drone navigation is largely driven by actively sensing and selecting useful visual and geometric information about the surrounding 3D scene. In this paper, we have presented VGF-Net, which fuses visual and geometric information for simultaneous drone navigation and height mapping. Our network distills the fused information, learned from RGB image sequences and an initial rough height map, into a novel VG representation that better captures object/scene relations. Based on the VG representation, we propose DAM to establish information exchange among objects and to select essential object relationships in a data-driven fashion. Using the residual update strategy, DAM progressively refines the object boundaries in the 2.5D height map and the extracted 3D keypoints, and generalizes to various complicated outdoor/indoor scenes. The mapping module runs in nearly 0.2 s on a mobile GPU, and could be further optimized by compression and pruning for an embedded system.

VGF-Net ultimately outputs the residual update map and the spatial offsets, which are used to explicitly update the geometric information of objects (i.e., the 2.5D height map and the 3D keypoints). It should be noted that we currently use convolutional layers to learn an implicit representation from the fused information and to update the visual representation. The visual content of the RGB image sequence shows complex patterns, which together form the global object/scene relationship. However, these patterns may be neglected by the implicit representation during learning. Thus, in the near future, we would like to investigate a more controllable way to update the visual representation. Additionally, complex occlusion relations in real scenarios often lead to inaccurate height mapping in the occluded areas. In the future, we would like to use an uncertainty map of the environment, together with multi-view information, to improve both the accuracy and the efficiency of the mapping process. Moreover, since geometric modeling (triangulation of sparse keypoints) is commonly involved in the optimization pipeline of SLAM, jointly performing 3D keypoint detection and height mapping would be an interesting direction to explore.

## Acknowledgment

We would like to thank the anonymous reviewers for their constructive comments. This work was supported in part by NSFC Key Project (U2001206), Guangdong Outstanding Talent Program (2019JC05X328), Guangdong Science and Technology Program (2020A0505100064, 2018A030310441, 2015A030312015), DEGP Key Project (2018KZDXM058), Shenzhen Science and Technology Program (RCJC20200714114435012), and Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen University).
