Real-time SLAM system with deep features
In this paper, we present a deep learning-based network, GCNv2, for generation of keypoints and descriptors. GCNv2 is built on our previous method, GCN, a network trained for 3D projective geometry. GCNv2 is designed with a binary descriptor vector, like the ORB feature, so that it can easily replace ORB in systems such as ORB-SLAM. GCNv2 significantly improves the computational efficiency over GCN, which was only able to run on desktop hardware. We show how a modified version of ORB-SLAM using GCNv2 features runs on a Jetson TX2, an embedded low-power platform. Experimental results show that GCNv2 retains almost the same accuracy as GCN and that it is robust enough to use for control of a flying drone.
The ability to estimate position is key to most, if not all, robotics applications involving mobility. In this paper we focus on the problem of visual odometry (VO), i.e. relative motion estimation based on visual information. This is the cornerstone of vision-based SLAM systems, which is the setting in which we demonstrate our work. As in our previous work , we estimate the motion using only an RGB-D sensor, and our target platform is a drone operating in an indoor environment. The RGB-D sensor makes scale directly observable without the need for visual-inertial integration or the computational cost of inferring depth using a neural network as in [2, 3, 4]. This increases robustness, which is a key property when used on a drone, in particular in indoor environments where the margin for error is small and there are typically fewer obvious textures than in outdoor environments. Our method is designed to be applicable to any system simply by adding an RGB-D sensor, without the need for complicated calibration and synchronisation routines with additional sensors, such as cameras or IMUs. Fusion can instead take place at a lower rate and with less need for precise timing, which, for example, makes integration with the flight control system of a drone simpler.
Like many other areas of research, there is a trend in SLAM to investigate deep learning-based methods. In  a keypoint detector and descriptor called SuperPoint is presented. Experimental results show that this CNN-based method has more distinctive descriptors than classical descriptors such as SIFT, and a detector on par with them. However, evaluation on homography estimation in  shows that it only performs on par with other keypoint extractors, classical or learning-based. In our previous work , presented before SuperPoint, we introduced the Geometric Correspondence Network, GCN, specifically tailored for producing keypoints for camera motion estimation, achieving better accuracy than classical methods. However, due to the computational and structural limitations of GCN, it is difficult to achieve real-time performance in a fully operational system, e.g. on board a drone. Both keypoint extraction and matching are computationally too expensive. Furthermore, due to the multi-frame matching setup, integrating GCN into existing SLAM systems is non-trivial.
In this paper we introduce GCNv2, based on the conclusions from , to improve computational efficiency while still maintaining the high precision of GCN. Our contributions are:
GCNv2 maintains the accuracy of GCN, providing significant improvements in motion estimation in comparison to related deep learning-based feature extraction methods as well as classical methods.
The inference of GCNv2 can run on embedded low-power hardware, such as a Jetson TX2, compared to GCN which requires a desktop GPU for real-time inference.
We demonstrate the effectiveness and robustness of our work by using GCN-SLAM (built on ORB-SLAM2 with ORB replaced by GCNv2) on a real drone for control, and show that it handles situations where ORB-SLAM2 fails.
In this section we cover related work in two areas. First VO and SLAM methods are covered and then we focus specifically on work on deep learning-based methods for image correspondence.
In direct methods for VO and SLAM, motion is estimated by aligning frames based directly on the pixel intensities, with  being an early example. DVO (Direct Visual Odometry), presented in , adds a pose graph to reduce the error. DSO  is a direct and sparse method that adds joint optimisation of all model parameters. An alternative to frame-to-frame matching is to match each new frame to a volumetric representation as in KinectFusion , Kintinuous  and ElasticFusion .
In indirect methods, the first step in a typical pipeline is to extract keypoints, which are then matched to previous frames to estimate the motion. The matching is based on the keypoint descriptors and geometric constraints. The state-of-the-art in this category is still defined by ORB-SLAM2 [14, 6]. The ORB descriptor is a binary vector allowing high-performance matching.
Somewhere between direct and indirect methods we find the semi-direct approaches. SVO2  is a sparse method in this category, and can run at hundreds of Hertz. There are also semi-dense methods, in which category LSD-SLAM  was one of the first. RGBDTAM  combines both semi-dense photometric and dense geometric errors for pose estimation.
There are a number of recent deep learning-based mapping systems like [17, 18]. The focus in these methods is deep learning-based single view depth estimation to reduce the scale drift inherent in monocular systems. CNN-SLAM  feeds the depth into LSD-SLAM. In DVSO , depth is predicted in a similar way to , using a virtual stereo view. CodeSLAM  learns an optimizable representation from conditioned auto-encoding for 3D reconstruction. In S2D , we build on DSO  and exploit both depth and normals predicted by a jointly optimised CNN. Some work on end-to-end training for motion estimation also exists. Image reconstruction loss is used for unsupervised learning in [4, 22]. However, geometry-based optimization methods still outperform end-to-end systems as shown in .
There is an abundance of recent works that deploy variants of metric learning for training deep features for finding image correspondences [23, 24, 25, 26, 27, 28, 29, 30, 5]. Works in [31, 32] focus on improving learning-based detection with better invariances. Aimed at a different aspect, [33, 34, 35] use synthetic samples generated in a self-supervised manner to improve general feature matching.
Among the aforementioned methods, LIFT  in particular uses a patch-based method to perform both keypoint detection and descriptor extraction. SuperPoint  predicts the keypoints and descriptors using a single network together with the self-supervised strategy in . Notably,  shows that [5, 29, 30] work on par with classical methods like SIFT for motion estimation.
In GCN , we show that by learning keypoints and descriptors specifically targeting motion estimation, performance is improved – contrary to what is reported for other more general deep learning-based keypoint extractor systems [5, 30]. In this paper we introduce an extension to GCN, GCNv2. We demonstrate the applicability of these keypoints for SLAM and build on ORB-SLAM2 as it offers a comprehensive multi-threaded state-of-the-art indirect SLAM system with support for monocular as well as RGB-D cameras. ORB-SLAM2 complements the tracking front-end with a back-end that does both pose graph optimisation using g2o  and loop closure detection using a binary bag of words representation . To simplify this integration, we design the GCNv2 descriptor to have the same format as that of ORB.
In this section, we present the design of GCNv2, aimed at making it suitable for real-time SLAM applications running on embedded hardware rather than a powerful desktop computer. We first introduce the modifications to the network structure. Then, we present the training scheme for the binarized feature descriptor and keypoint detector.
The original GCN structure, proposed in , consists of two major parts: an FCN  structure with a ResNet-50 backbone and a bidirectional convolutional network. The FCN is adopted for dense feature extraction and the bidirectional network is used for temporal keypoint prediction. Although impressive tracking performance has been achieved using GCN compared with existing methods, GCN has practical limitations when it comes to the use in a real-time SLAM system with limited hardware. The network architecture requires relatively powerful computational hardware which renders it unable to run in real-time on board e.g. the Jetson TX2 used on our drone. Furthermore, the bidirectional structure requires matching between two or more frames at the same time. This significantly increases computational complexity for a window-based SLAM method, since keyframes then are updated dynamically based on the current camera position.
Inspired by SuperPoint , which uses a simple structure to perform detection using a single frame, we deploy a network with even fewer parameters and working on a lower scale than SuperPoint. Intuitively, the network performs an individual prediction for each grid cell of size pixels in the original image. In GCNv2 all the pooling layers are replaced by convolutions with kernel size
and padding. As in SuperPoint, the network takes 320x240 images as input. This is also the image size we used later for SLAM. Further details on the GCNv2 network specifics can be found in our publicly available source code (https://github.com/jiexiong2016/GCNv2_SLAM).
GCN-SLAM with GCNv2 runs at 20 Hz on Jetson TX2 and runs at around 80 Hz on a laptop with Intel i7-7700HQ and mobile version NVIDIA 1070. To achieve even higher frame rates we introduce a smaller version of GCNv2, called GCNv2-tiny, where we reduce the number of feature maps by half from the second layer and onward. GCN-SLAM with GCNv2-tiny runs at 40 Hz on a TX2 and is therefore well-suited for deployment on a drone.
Binarized Descriptor We trained the features of GCNv2 to be binary, both to accelerate the matching procedure and to match those of ORB. To binarize the features, we add a binary activation layer on top of the final output. It is essentially a hard sign function and is therefore not differentiable. The challenge is how to back-propagate the loss properly through this layer of the network. We used the method proposed in . The binary activation layer can be written as follows:

Forward: $\mathbf{b}(\mathbf{x}) = \operatorname{sign}(\mathbf{f}(\mathbf{x})) = \begin{cases} +1 & \mathbf{f}(\mathbf{x}) \geq 0 \\ -1 & \text{otherwise} \end{cases}$

Backward: $\frac{\partial L_{metric}}{\partial \mathbf{f}} = \frac{\partial L_{metric}}{\partial \mathbf{b}} \, \mathbf{1}_{|\mathbf{f}| \leq 1}$ (1)

where $\mathbf{x} = (u, v)$ is the 2D coordinates in the image and $\mathbf{f}(\mathbf{x})$ is the feature vector at a given location. $L_{metric}$ is the loss for metric learning. $\mathbf{1}$ is the indicator function. We found that it is more efficient to train the network with the above method rather than forcing the network to directly predict a binary output by minimizing a quantization loss as in . One possible reason is that forcing the values to be clustered around $\{-1, +1\}$ conflicts with the metric learning that follows, which uses a distance margin, making the training unstable.
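The binary activation described above can be illustrated with a small NumPy sketch. It is not taken from the GCNv2 source code: the function names `binarize_forward` and `binarize_backward` are hypothetical, and the clipping range of 1 in the backward pass is an assumption based on the commonly used straight-through estimator for hard sign activations.

```python
import numpy as np


def binarize_forward(f):
    """Hard sign: +1 where f >= 0, -1 elsewhere."""
    return np.where(f >= 0, 1.0, -1.0)


def binarize_backward(f, grad_out):
    """Straight-through gradient (assumed form): pass the incoming
    gradient unchanged where |f| <= 1 and block it elsewhere."""
    return grad_out * (np.abs(f) <= 1.0)


# Example: five raw feature values and a unit upstream gradient.
f = np.array([-2.0, -0.3, 0.0, 0.7, 1.5])
b = binarize_forward(f)
g = binarize_backward(f, np.ones_like(f))
print(b)  # binary features in {-1, +1}
print(g)  # gradient is zero for the two saturated inputs
```

In an autograd framework the same idea would typically be written as a custom function with these two passes, so that the metric-learning loss can still update the layers below the sign function.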
The number of feature maps is set to 256 to make the descriptor have the same bit width as ORB features so that the descriptor can be directly incorporated into existing ORB-based visual tracking systems.
Nested Metric Learning Pixel-wise metric learning is used for training the descriptor in a nearest-neighbour manner. The triplet loss for binarized features is as follows:

$L_{metric} = \sum_i \max\left(0,\; d(\mathbf{x}_i, \mathbf{x}_{i,+}) - d(\mathbf{x}_i, \mathbf{x}_{i,-}) + m\right)$ (2)

where $m$ is the distance margin for the truncation. $d(\cdot, \cdot)$ is equivalent to the squared Hamming distance for the 32-byte descriptor. We use the squared distance since we found it leads to faster and better convergence for training. $(\mathbf{x}_i, \mathbf{x}_{i,+})$ is a matching pair obtained using the ground truth camera poses from the training data as follows:
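$\mathbf{x}_{i,+} = \pi\left(\mathbf{R}_{gt}\, \pi^{-1}(\mathbf{x}_i, z_i) + \mathbf{t}_{gt}\right)$ (3)

where $\mathbf{R}_{gt}$ is the rotation matrix and $\mathbf{t}_{gt}$ is the translation vector; $\pi$ denotes the camera projection and $\pi^{-1}$ the back-projection of a pixel with measured depth $z_i$. $(\mathbf{x}_i, \mathbf{x}_{i,-})$ is a non-matching pair retrieved by negative sample mining. The mining procedure is described in algorithm 1. The exhaustive search will further penalize the already matched features with the relaxed criteria described in . The relaxed criteria are used to increase the tolerance to potentially noisy data.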
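A truncated triplet loss over binary descriptors, as described above, might be sketched as follows. This is an illustration under stated assumptions rather than the training code: the helper names are hypothetical, the margin value of 4 is chosen only for the example, and "squared Hamming distance" is taken literally as the Hamming distance squared. Negative mining is not shown; the negatives are simply given.

```python
import numpy as np


def hamming(a, b):
    """Number of differing entries between +/-1 descriptors (last axis)."""
    return np.sum(a != b, axis=-1)


def triplet_loss(anchor, positive, negative, margin):
    """Sum over samples of max(0, d_pos - d_neg + margin), with d taken
    as the squared Hamming distance (an assumption for this sketch)."""
    d_pos = hamming(anchor, positive).astype(float) ** 2
    d_neg = hamming(anchor, negative).astype(float) ** 2
    return float(np.maximum(0.0, d_pos - d_neg + margin).sum())


# Two toy 4-bit triplets. The first is already well separated and
# contributes nothing; the second violates the margin.
anchor = np.array([[1, 1, -1, -1], [1, -1, 1, -1]])
positive = np.array([[1, 1, -1, 1], [1, -1, -1, 1]])
negative = np.array([[-1, -1, 1, -1], [1, -1, 1, 1]])
print(triplet_loss(anchor, positive, negative, margin=4.0))
```

The truncation at zero means that triplets whose negative is already farther away than the positive by more than the margin no longer influence training.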
The training loss for keypoint detection is the same as in the original GCN. It is computed using two consecutive frames as follows:
A notable difference from SuperPoint is that GCNv2 specifically targets motion estimation. SuperPoint tries to accomplish SIFT-like corner detection and its performance gain can chiefly be attributed to its superior descriptor. However, a CNN is capable of generating more representative features with a larger receptive field than classical methods. We therefore generate the ground truth by detecting Shi-Tomasi corners in a grid and warp them to the next frame using eq. 3. This leads to better distribution of keypoints and the objective function directly reflects the ability to track the keypoints based on texture.
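Warping detected corners into the next frame with a known relative pose can be sketched as below. This is a minimal illustration assuming a pinhole camera with intrinsic matrix `K` and per-keypoint depth from the RGB-D sensor; the function name `warp_keypoints` and the intrinsics used in the example are hypothetical and not taken from the paper.

```python
import numpy as np


def warp_keypoints(uv, depth, K, R, t):
    """Back-project pixels (N x 2) with depth (N,), apply the rigid
    transform (R, t), and re-project with the pinhole matrix K."""
    ones = np.ones((uv.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv, ones]).T).T  # N x 3, z = 1
    pts = rays * depth[:, None]        # 3D points in the first camera
    pts_next = pts @ R.T + t           # 3D points in the second camera
    proj = (K @ pts_next.T).T
    return proj[:, :2] / proj[:, 2:3]  # back to pixel coordinates


# Example intrinsics for a 320x240 image (assumed values).
K = np.array([[500.0, 0.0, 160.0],
              [0.0, 500.0, 120.0],
              [0.0, 0.0, 1.0]])
uv = np.array([[160.0, 120.0], [200.0, 100.0]])
depth = np.array([2.0, 2.0])

# A pure sideways translation of 0.1 m shifts points at 2 m depth
# by fx * 0.1 / 2 = 25 pixels in u.
print(warp_keypoints(uv, depth, K, np.eye(3), np.array([0.1, 0.0, 0.0])))
```

Keypoints whose warped location falls outside the image or whose depth is invalid would need to be discarded before being used as training targets; that filtering is omitted here.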
The triplet loss in eq. 2 and cross entropy in eq. 4 are weighted by during training to provide a coarse normalization of the two terms. The learning rate for the adaptive gradient descent method, ADAM , used for training is started from
and halved every 40 epochs for a total of 100 training epochs. The weights of GCNv2 are randomly initialized. We mapped the squared Hamming distance to the unit sphere to perform the fast nearest neighbour matching as in . The margin for the triplet loss is set to . The weights of the weighted cross entropy are set to .
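The step schedule described above (halving every 40 epochs over 100 epochs) amounts to two halvings, at epochs 40 and 80. A small sketch of the arithmetic follows; the function name and the base rate of 1e-4 in the example are placeholders, not the values used for training.

```python
def learning_rate(epoch, base_lr, halve_every=40):
    """Step schedule: the rate is halved once per completed block of
    `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)


# Placeholder base rate; epochs 0-39 use it unchanged, 40-79 use half,
# and 80-99 use a quarter.
for epoch in (0, 39, 40, 79, 80, 99):
    print(epoch, learning_rate(epoch, base_lr=1e-4))
```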
One of the most important design choices for a keypoint-based SLAM system is the choice of keypoint extractor. The keypoints are often re-used at multiple stages in such systems. ORB features , the namesake of ORB-SLAM2, are a particularly well-suited candidate as ORBs are invariant to rotation and scale, cheap to compute compared to other keypoint detectors with equivalent properties such as SIFT or SURF, and have a binary feature vector to cater for fast matching.
As previously shown , GCN used in a naive motion estimation pipeline performs better than or on par with ORB-SLAM2 . Notably, this is without higher-order SLAM functionality such as pose graph optimization, global bundle adjustment, or loop detection. Incorporating GCN into a system with such functionality would therefore be likely to yield better results. However, as mentioned, GCN is prohibitively expensive for real-time use on embedded hardware which we target in this work. In what follows we show how we modify ORB-SLAM2 to incorporate GCNv2, in a system we call GCN-SLAM.
ORB-SLAM2’s motion estimation is based on frame-to-frame keypoint tracking and feature-based bundle adjustment. We will briefly describe the detection and description of these features. ORB-SLAM2 employs a scale pyramid where the input image is iteratively scaled down to enable multi-scale feature detection by running single-scale algorithms on the multiple rescaled images. For each scale level, the FAST corner detector is applied in a grid. If no detections are found in a cell, FAST is run again with a decreased threshold. After all detections have been gathered from all cells at a given level in the scale pyramid, a space partitioning algorithm is used to cull the keypoints first by their image coordinates, then by detection score. Finally, once typically 1000 keypoints have been selected in total, the viewing angle of each keypoint is computed, then each pyramid scale level is filtered with Gaussian blur, and the 256-bit ORB descriptor for each keypoint at each level is computed from the blurred image.
Our method computes both keypoint locations and descriptors simultaneously in a single forward-pass of the network, and as stated before, its end result is designed to be a drop-in replacement for the ORB feature extractor outlined above. In GCNv2, we input a single grey image frame to the network which outputs two matrices: a keypoint mask, and a feature descriptor matrix. The keypoint mask is thresholded to obtain a set of keypoint locations, their confidences, and their corresponding 256-bit feature descriptors. As in , we apply non-maximum suppression with a grid size of . As it is not possible to know the orientation of the detected features, we set the angle to zero. The two keypoint methods are illustrated in fig. 2.
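The post-processing of the two network outputs can be sketched in NumPy as follows. This is a simplified illustration, not the GCN-SLAM implementation: the function name, the threshold of 0.5 and the cell size of 8 are assumptions, and the suppression shown simply keeps the strongest response per grid cell. Packing the 256 binary values into 32 bytes mirrors the ORB descriptor layout mentioned in the text.

```python
import numpy as np


def extract_keypoints(score, desc, threshold=0.5, cell=8):
    """score: H x W keypoint mask, desc: H x W x D array of +/-1 values.
    Keeps the strongest above-threshold response in each cell and returns
    (N x 2 pixel coords as (u, v), N confidences, N x D/8 packed bytes)."""
    h, w = score.shape
    kps = []
    for y0 in range(0, h, cell):
        for x0 in range(0, w, cell):
            block = score[y0:y0 + cell, x0:x0 + cell]
            dy, dx = np.unravel_index(np.argmax(block), block.shape)
            if block[dy, dx] > threshold:
                kps.append((x0 + dx, y0 + dy))
    if not kps:
        empty = np.zeros((0, desc.shape[2] // 8), np.uint8)
        return np.zeros((0, 2), int), np.zeros(0), empty
    uv = np.array(kps)
    conf = score[uv[:, 1], uv[:, 0]]
    bits = (desc[uv[:, 1], uv[:, 0]] > 0).astype(np.uint8)
    return uv, conf, np.packbits(bits, axis=1)


# Toy example: three responses, two of which share a cell.
rng = np.random.default_rng(0)
score = np.zeros((16, 16))
score[3, 5], score[4, 6], score[10, 12] = 0.9, 0.8, 0.7
desc = np.where(rng.random((16, 16, 256)) > 0.5, 1, -1)
uv, conf, packed = extract_keypoints(score, desc)
print(uv, conf, packed.shape)
```

Each returned keypoint would then be handed to the tracker with its angle set to zero, as described above.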
Once keypoints and their respective descriptors are found, ORB-SLAM2 relies primarily on two methods for frame-to-frame tracking: first, by assuming constant velocity and projecting the previous frame’s keypoints into the current frame, and if that fails, by matching the keypoints of the current frame to the last-created keyframe using bag-of-words similarity. We have disabled the former so as to use only the latter keypoint-based reference frame tracking. We have also replaced the matching algorithm with a standard nearest-neighbor search in our experiments. These modifications are made to examine the performance of our keypoint extraction method, rather than that of ORB-SLAM2’s other tracking heuristics.
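A brute-force nearest-neighbour search over packed binary descriptors could look like the sketch below. It is only an illustration of the idea: the function names and the acceptance threshold `max_dist=64` are hypothetical, and the actual system may use additional checks (for example ratio or cross-consistency tests) that are not shown.

```python
import numpy as np


def hamming_matrix(a, b):
    """Pairwise Hamming distances between packed uint8 descriptors
    a (N x B) and b (M x B), computed via XOR and bit counting."""
    x = np.bitwise_xor(a[:, None, :], b[None, :, :])
    return np.unpackbits(x, axis=2).sum(axis=2)


def match_nearest(a, b, max_dist=64):
    """For each descriptor in a, find its nearest neighbour in b and keep
    the match if the distance is at most max_dist (assumed threshold).
    Returns a list of (index_in_a, index_in_b, distance)."""
    d = hamming_matrix(a, b)
    nn = d.argmin(axis=1)
    best = d[np.arange(len(a)), nn]
    return [(int(i), int(nn[i]), int(best[i]))
            for i in np.flatnonzero(best <= max_dist)]


# Toy example: b is a shuffled copy of a with one bit flipped.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(3, 32), dtype=np.uint8)
b = a[[2, 0, 1]].copy()
b[0, 0] ^= 1
print(match_nearest(a, b))
```

Because the descriptors are 32-byte binary strings, the distance computation reduces to XOR and bit counting, which is what makes this kind of exhaustive search affordable.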
Finally, we have left ORB-SLAM2’s loop closure and pose graph optimization intact, apart from having regenerated the bag-of-words vocabulary to suit GCNv2 feature descriptors by computing them on the training dataset presented in section V-A.
Table I (frame-to-frame tracking). Columns: Dataset (200 Frames) | GCN | ORB | SIFT | SURF | SuperPoint | GCNv2 | GCNv2-tiny | GCNv2-large
In this section, we present experimental results to justify our conclusions regarding the performance of our keypoint extraction method, and its embodiment in the GCN-SLAM system. We first introduce our training dataset, then four datasets on which we examined our method’s performance and compare to some related methods, and finally we outline the quantitative and qualitative conclusions of these results.
Note that our aim in this section is not to show that GCN-SLAM is better than ORB-SLAM2 but to show that GCNv2 is: i) better suited for accurate motion estimation, ii) computationally efficient, and iii) providing robustness for a SLAM system.
In the results below, evaluations on datasets were performed on a laptop with an Intel i7-7700HQ and a mobile version of the NVIDIA 1070. For real-world experiments we used an NVIDIA Jetson TX2 embedded computer for processing and an Intel RealSense D435 RGB-D camera sensor on our drone (see fig. 1).
The original GCN was trained using the TUM dataset from sensor fr2. It provides accurate pose through a motion capturing system. In GCNv2, we trained the network using a subset of the SUN-3D  dataset we created in our recent work . SUN-3D contains millions of real-world recorded RGB-D images in various typical indoor environments. A total frames were extracted by roughly one frame per second from the videos. It is very large and can potentially produce a more generalized network. However, the ground truth poses provided by SUN-3D are estimated by visual tracking with loop closure and so are relatively accurate globally, but have misalignments at frame level. To account for this local error, we extract SIFT features and use the provided poses as initial guesses for bundle adjustment to update the relative pose of each frame pair. In this sense, the training of GCNv2 is using self-annotated data with only vision information.
For comparison with the original GCN, we select the same sequences of the TUM datasets as in  and evaluate tracking performance with an open and a closed loop system. We use the absolute trajectory error (ATE) as the metric.
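For reference, the root-mean-square form of the absolute trajectory error can be sketched as below. This is a simplified version under explicit assumptions: the estimated and ground truth trajectories are already associated by timestamp and expressed in the same frame, whereas standard evaluation tools first align them with a rigid-body transform. The function name is hypothetical.

```python
import numpy as np


def ate_rmse(est, gt):
    """RMSE of the translational differences between two N x 3
    trajectories that are already time-associated and aligned."""
    err = np.linalg.norm(est - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))


# Example: a constant 5 cm offset gives an ATE of 0.05 m.
gt = np.zeros((4, 3))
est = gt + np.array([0.03, 0.04, 0.0])
print(ate_rmse(est, gt))
```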
Since we trained GCNv2 on a different dataset than the original GCN , we also show results using the original recurrent structure for comparison. We have therefore also created GCNv2-large, with ResNet-18 as the backbone and deconvolutional up-sampling for the feature maps. The bidirectional feature detector is moved to the lowest scale, as in the other two versions of GCNv2.
Frame-to-frame tracking results are shown in table I. The first columns, before SuperPoint, are from  where 640x480 images were used and GCN was trained on TUM fr2 data only. All versions of GCNv2 use the same image resolution as SuperPoint, i.e. 320x240. Consistent with the results reported in , SuperPoint performs on par with classical methods like SIFT. GCNv2 performs on par with GCN and, like GCN, significantly better than both SuperPoint and classical keypoints. It is even slightly better than GCN in two cases, likely due to using a much larger dataset for training. The exceptions are fr1_floor and fr1_360. These sequences require fine details, and the detection and descriptor extraction in GCNv2 is performed with a lower scale feature map for efficiency, though GCNv2-large handles one of these sequences. The smaller version of GCNv2, GCNv2-tiny, is only slightly less accurate than GCNv2.
In table II, we compare the closed loop performance of GCN-SLAM with our previous work, as well as ORB-SLAM2, Elastic Fusion, and RGBDTAM. GCN-SLAM successfully tracks the position in all sequences with an error similar to that of GCN, whereas ORB-SLAM2 fails on two sequences. GCNv2 has significantly reduced drift error compared to ORB-SLAM2 in the fast rotations of fr1_360. It is also noteworthy that for this particular sequence, the original GCN does significantly better than both ORB-SLAM2 and GCN-SLAM. ORB-SLAM2 is tracking well in all other sequences, and the errors of both GCN-SLAM and ORB-SLAM2 are small.
To further verify the robustness of using GCNv2 in a real-world SLAM system, we show results on datasets collected in our environment under different conditions: a) going up a corridor, turning 180 degrees and walking back with a handheld camera, b) walking in a circle on an outdoor parking lot with a handheld sensor in daylight, c) flying in an alcove with windows and turning 180 degrees, and d) flying in a kitchen and turning 360 degrees while using GCN-SLAM for positioning.
Since these datasets have no ground truth, the results are only qualitative. The datasets were chosen to show that our method handles difficult scenarios, is robust, and can be used for positioning of a real drone. Figure 3 shows the estimated trajectory of GCN-SLAM using ORB versus GCNv2 as keypoints. Note that both features are evaluated in exactly the same tracking pipeline for fair comparison, i.e. the choice of GCNv2 or ORB features is the only difference. Refer to the source code for further details. Using GCN-SLAM as a basis for drone control improves performance, as can be seen by comparing figs. 2(d) and 2(c). In fig. 2(c), the position is estimated using only an optical flow sensor whereas fig. 2(d) uses GCN-SLAM as a source of position, and it is clear that the drone is able to hold its position better as there is significantly less jitter in this trajectory. In all four datasets, tracking is maintained with GCNv2, but lost with ORB. We used a remote control to send setpoints to the flight control unit on the drone for control, using the built-in position holding mode.
In fig. 4 we compare the performance of our keypoint extractor to the original ORB keypoint extractor. We plot the number of inliers during tracking of the local map for our adapted SLAM system, first with ORB keypoints, and then with GCNv2 keypoints. As the figure illustrates, while there are many more ORB features, our method has a higher percentage of inliers. In addition, as shown in fig. 1, GCNv2 results in better distributed features compared with ORB.
Figure 5 shows the mesh reconstruction using GCN-SLAM output poses from two additional sequences. The left sequence was from an office and the right was acquired walking between two floors using stairs. TSDF volume integration from Open3D (http://www.open3d.org) was used to create the mesh. To show the accuracy of our method, the loop closure detection of GCN-SLAM is disabled.
In our previous work , we showed that our method, GCN, achieves better performance in visual tracking compared with existing deep learning and classical methods. However, it cannot be directly deployed into a real-time SLAM system in an efficient way due to its deep recurrent structure. In this paper, we addressed these issues by proposing a smaller, more efficient version of GCN, called GCNv2, that is readily adaptable to existing SLAM systems. We showed that GCNv2 can be effectively used in a modern feature-based SLAM system to achieve state-of-the-art tracking performance. The robustness and performance of the method was verified by incorporating GCNv2 into GCN-SLAM and using it on-board for positioning on our drone.
Limitations GCNv2 is trained mainly for projective geometry and not generic feature matching. As always with learning-based methods generalization is an important factor. GCNv2 works relatively well for outdoor scenes, as demonstrated in the experiments (Cf. fig. 2(b)). However, since no outdoor data was used for training, further improvements can likely be made. Our target here is an indoor setting and we did not investigate this further.
In this paper, we mainly improved the efficiency of GCN and achieved stable tracking performance on our platform, a drone with an NVIDIA Jetson TX2. However, since the original recurrent structure is removed, there is a trade-off between accuracy and running speed. In the future, we would like to further investigate using self-supervised learning to improve our system.
S. Wang, R. Clark, H. Wen, and N. Trigoni, “DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks,” in IEEE Intl. Conf. on Robotics and Automation (ICRA). IEEE, 2017.
Journal of Machine Learning Research, vol. 17, no. 1-32, p. 2, 2016.
S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, and J. Kautz, “Improving landmark localization with semi-supervised learning,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1546–1555.