For many years, Simultaneous Localization and Mapping (SLAM) has been the subject of technical research [survey]. But with vast improvements in computer processing speed and the availability of low-cost sensors, SLAM is now used for practical applications in a growing number of fields [survey2]. A basic SLAM system consists of map generation and concurrent robot localization.
One of the most crucial tasks for a SLAM system is loop closure: detecting when the robot returns to a previously visited position so that the map it creates can be corrected. Loop closure is a key element of SLAM; without it, local errors quickly accumulate into global errors [relocal]. Loop closure is composed of place recognition, which determines whether the current frame has already been seen, and loop correction [placerecog].
Place recognition requires recognizing the same place despite significant changes in appearance and viewpoint. Current state-of-the-art SLAM systems, such as ORB-SLAM3 [orbslam3], rely on Lidar or depth information to perform geometric validation. However, RGBD and stereo cameras are not available in every situation due to their high cost; they are several times more expensive than their monocular counterparts [gated]. The ability to rely only on monocular input makes SLAM systems even more versatile. For example, in rural areas, small monocular cameras are ideal due to their low power consumption and availability.
Many methods have been proposed for feature detection, and most fall into two categories. Methods like SuperPoint [superpoint], SIFT [sift], and ORB [orb] are good at detecting geometric features, but fail in organic environments [survey3]. On the other hand, methods like SalientDSO [sdso], Salient Point Detection [spd], and Feature Based SLAM [fbs] determine points of human interest, but only target key objects. In order to create loop closures that are reliable after large viewpoint shifts and in object-less environments, a combination of these two types of features can be used. Some multiple-feature detectors already exist [fusion], but these methods rely on two features from the same category. By creating features that represent both human and geometric importance (HGI), the benefits of both can be realized.
These ideas motivate our method, HGI-SLAM. Newer feature detection methods use machine learning [liftslam], and we continue with this technique. Although these methods can be slower than algorithmic methods, they outperform them in terms of accuracy and viewpoint invariance. Despite this, our method can be run in real time because the descriptor generation occurs in a separate thread.
We present HGI-SLAM, a novel combination of features for more accurate loop closures. By using only monocular camera data, we have made it as accessible as possible. The key contributions of this paper are:
A novel combination of geometric and salient feature generation for place recognition
A modified descriptor generation algorithm based on SIFT
An end-to-end framework using loop closure detections in a multi-threading setup. The detections are passed to the ORB-SLAM2 backbone by overriding multiple keyframes.
In addition to these, we also provide experimental results on the KITTI [kitti] and EuRoC [euroc] datasets to demonstrate the improved performance of the proposed loop closure approach.
The rest of the paper is organized as follows: Sec. II presents related work and the core elements that we build on. Sec. III describes our method of feature extraction, from raw images to loop closure detection, and details the majority of our novel contribution. Quantitative results for our loop closure detection and the combined system are given in Sec. IV, along with a qualitative analysis of features and timing results. We conclude the paper in Sec. V with a summary of our proposed framework and its future scope and applications.
II Related Work
Many past methodologies for loop closure rely on specific sets of features along with a bag of words model (BoW). Exceptions to BoW exist [rsom], but bag of words has been shown to be useful for retrieving similar images for loop closure detection [survey3].
Feature detection methods such as SIFT [sift], SURF [surf], and FAST [fast] detect sparse keypoints, but other types of feature detectors have been proposed (see [line] for line-based features). These features have been integrated into SLAM systems. For example, ORB-SLAM2 [orbslam2] uses FAST for keypoint selection and LDSO [ldso] uses DSO. Because sparse keypoints can be detected and matched quickly, our method uses them as features.
More recent methods of loop closure include deep-learning methods, such as SymbioLCD [symbioLCD] or Online VPR [onlineVPR]. These methods tend to be more computationally expensive, but can perform better [survey4]. There are also methods for Lidar-based SLAM, such as OverlapNet [overlapnet] and LCDNet [lcdnet]. These methods require Lidar, while our method focuses on using only a monocular camera.
HGI-SLAM is built on two major systems. The first is SuperPoint [superpoint], a self-supervised framework for interest point and descriptor detection. It is a fully convolutional model that is largely viewpoint invariant. The interest point detector, MagicPoint, performs substantially better than traditional corner detection approaches like FAST [fast] or Harris [harris] [magicpoint]. SuperPoint tends to produce denser and more correct matches compared to LIFT [lift], SIFT [sift], and ORB [orb].
The second system is SalientDSO [sdso]. SalientDSO (SDSO) is a way to incorporate semantic information in the form of visual saliency into Direct Sparse Odometry. SDSO generates a saliency heatmap using SALGAN [salgan], which introduced the use of a Generative Adversarial Network (GAN) for saliency prediction. The heatmap is then filtered based on scene parsing, and interest points are extracted from the image. SDSO is robust even with a small number of features, producing low drift in organic environments.
Both of these systems work exceedingly well in their optimal environment. HGI-SLAM combines the best of both systems while only requiring monocular vision.
The framework for HGI-SLAM contains the following steps. First, geometric and salient features are extracted in the form of keypoints and descriptors. This involves running SuperPoint [superpoint] and SDSO [sdso], then processing and optimizing these features. Since SDSO does not output descriptors, these are computed with a SIFT-like [sift] algorithm on the interest points and the original heatmap from SALGAN [salgan]. Next, the features are combined to train a BoW model to generate a vocabulary for future reference. Finally, the loop closures are detected in a thread running concurrently with ORB-SLAM2, and keyframes in ORB-SLAM2 are overridden.
III-A Geometric features
We adopt the keypoints and descriptors from SuperPoint by running it on the current frame. SuperPoint [superpoint] uses a fully convolutional neural network and homographic adaptation to find keypoints and descriptors. These points target corners and geometric features in the input image. They are also scale and rotation invariant to the extent found in the training dataset.
However, when SuperPoint is run on an image with heavy texture or high contrast, the keypoints cluster around that part of the image. In order to diversify the keypoints, we remove keypoints that are nearby and have similar descriptors.
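Let $K = \{k_1, \dots, k_N\}$ be the set of keypoints, where $k_i \in \mathbb{R}^2$. First, we compute the nearest keypoints as follows:

$$N_i = \{\, k_j \in K : j \neq i,\ \lVert k_i - k_j \rVert_2 < d \,\} \tag{1}$$

where $d$ is the threshold distance. If $|N_i| < n$, where $n$ is some maximum amount of nearby keypoints, then $k_i$ is kept. Otherwise the average cosine similarity [cossim] between the corresponding descriptors of keypoints in $N_i$ is computed:

$$s_i = \frac{1}{|N_i|} \sum_{k_j \in N_i} \frac{D_i \cdot D_j}{\lVert D_i \rVert \, \lVert D_j \rVert} \tag{2}$$

where $D_i$ denotes the descriptor of keypoint $k_i$.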
The keypoints that have an average similarity greater than a minimum are discarded. This process removes keypoints that are both nearby and similar. The values that were used were , , and .
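The filtering step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name and the parameters `dist_thresh`, `max_neighbors`, and `sim_thresh` are hypothetical, no values from the paper are used, and the sketch assumes that every neighbourhood is evaluated against the original keypoint set (not updated as keypoints are discarded) and that similarity is measured between a keypoint and each of its neighbours.

```python
import numpy as np

def diversify_keypoints(keypoints, descriptors, dist_thresh, max_neighbors, sim_thresh):
    """Return a boolean mask of keypoints to keep.

    keypoints:   (N, 2) array of pixel coordinates.
    descriptors: (N, D) array, one descriptor per keypoint.
    All three thresholds are placeholders chosen by the caller.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    descriptors = np.asarray(descriptors, dtype=float)

    # Pairwise Euclidean distances between keypoint locations.
    diff = keypoints[:, None, :] - keypoints[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Unit-normalise descriptors so that a dot product is the cosine similarity.
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    unit = descriptors / np.maximum(norms, 1e-12)
    cos = unit @ unit.T

    keep = np.ones(len(keypoints), dtype=bool)
    for i in range(len(keypoints)):
        neighbors = np.flatnonzero(dist[i] < dist_thresh)
        neighbors = neighbors[neighbors != i]
        if len(neighbors) < max_neighbors:
            continue  # sparse neighbourhood: the keypoint is kept
        if cos[i, neighbors].mean() > sim_thresh:
            keep[i] = False  # crowded and redundant: the keypoint is discarded
    return keep
```

Because neighbourhoods are evaluated simultaneously in this sketch, a tight cluster of near-identical keypoints is removed as a whole; a sequential variant that re-checks neighbourhoods after each removal would retain some representatives of the cluster.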
Unlike prior art, the descriptors of the keypoints go through one last optimization step before being used in the BoW model. The keypoints of three consecutive frames are combined (overlaid) and used for only the center frame. This saves storage space and removes unnecessary information, as consecutive frames are usually interchangeable. However, in order to retain information from these frames, similarity is again calculated using Eq. 2, this time between descriptors from the separate frames. Similar keypoints are removed until the number of keypoints equals the number in the original SuperPoint output.
III-B Salient features
The next step is to use the saliency heatmap generated by SALGAN [salgan] to create keypoints and descriptors. In the original paper, SalientDSO (SDSO) [sdso] uses this heatmap to create keypoints but not descriptors. The heatmap is used with PSPNet [psp] to determine the semantic label of objects in the scene. Keypoints are then selected by splitting the image into patches and choosing the pixels with the highest gradient. Our method follows a similar approach.
However, instead of using PSPNet [psp], we run a modified version of the point selection algorithm. This is because the limited number of semantic labels of PSPNet limits SDSO to indoor use.
To extract keypoints from the saliency heatmap, we first compute the gradient of the entire heatmap and store it in two matrices holding the magnitude and orientation of the gradients, respectively. Then a random patch of the original image is selected. The sampling weight is almost the same as in SDSO, except that we replace the median with the average of the gradient in that patch. We do this in order to keep the sampling probability consistent and to avoid the use of a region-adaptive threshold.
Using Eq. 3 we compute the probability of a patch being sampled as:
After a patch has been selected, we subdivide it three times and select points with higher gradient thresholds at every level. This is identical to the SDSO method [sdso].
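As a rough illustration of the patch selection step, the sketch below computes a gradient magnitude and orientation map from a saliency heatmap and turns per-patch average gradient magnitude into sampling probabilities. It rests on assumptions that are not taken from the paper: patches are non-overlapping squares, the sampling weight is simply the mean gradient magnitude of the patch, and the probability is that weight normalised over all patches. The function names are hypothetical, and the three-level subdivision is not shown.

```python
import numpy as np

def patch_sampling_probabilities(saliency, patch_size):
    """Return (probabilities, magnitude, orientation) for a saliency heatmap.

    probabilities has one entry per non-overlapping patch_size x patch_size patch.
    """
    saliency = np.asarray(saliency, dtype=float)
    gy, gx = np.gradient(saliency)        # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)      # reused later by the descriptor stage

    rows = saliency.shape[0] // patch_size
    cols = saliency.shape[1] // patch_size
    # Crop to a whole number of patches, then average the magnitude inside each one.
    cropped = magnitude[: rows * patch_size, : cols * patch_size]
    weights = cropped.reshape(rows, patch_size, cols, patch_size).mean(axis=(1, 3))

    total = weights.sum()
    if total == 0:                        # flat heatmap: fall back to uniform sampling
        probabilities = np.full_like(weights, 1.0 / weights.size)
    else:
        probabilities = weights / total   # assumed normalisation
    return probabilities, magnitude, orientation

def sample_patch(probabilities, rng):
    """Draw one (patch_row, patch_col) index according to the probabilities."""
    flat_index = rng.choice(probabilities.size, p=probabilities.ravel())
    return np.unravel_index(flat_index, probabilities.shape)
```

A patch index could then be drawn with `sample_patch(probabilities, np.random.default_rng())`; patches whose heatmap is flat receive zero probability under this assumed weighting.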
Now that we have a set of keypoints for each image, we are left with creating the descriptors. To do this, we adapt part of the SIFT [sift] algorithm into a novel descriptor generator. The original image is blurred with a Gaussian kernel of size five. Then, for each keypoint, a $16 \times 16$ region of the gradient magnitude and orientation is selected. For each of the sixteen subregions in this window, a histogram is generated. The bins are the orientations split into eight different directions, and the values are the magnitudes of the gradients.
The histogram is then smoothed using shifted cubic interpolation. Let be the cubic interpolation of the histogram. Then to smooth the histogram using a weight parameter, ,
Finally, each of the 128 raw descriptor values for the keypoint is normalized linearly to $[0, 1]$, multiplied by 255, and rounded to an integer. This means that each descriptor is a length-128 array $D$ where, for each $i$, $D_i \in \{0, 1, \dots, 255\}$.
These descriptors do not go through the same optimization step as the SuperPoint descriptors. Instead, they are computed only every three frames and then passed to the BoW model. A summary of salient descriptor generation is given in Algorithm 1.
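The histogram and normalization steps described above can be sketched as follows. This is an illustrative approximation, not the authors' descriptor generator: it assumes a 16 by 16 window split into a 4 by 4 grid of cells with eight orientation bins (which yields the 128 values mentioned in the text), it omits the Gaussian blur and the cubic-interpolation smoothing, and it assumes the keypoint lies far enough from the image border. The function name and parameters are hypothetical.

```python
import numpy as np

def salient_descriptor(magnitude, orientation, keypoint, window=16, cells=4, bins=8):
    """Build a cells*cells*bins integer descriptor around one (row, col) keypoint."""
    half = window // 2
    r, c = keypoint
    # Assumes the keypoint is at least `half` pixels away from every border.
    mag = magnitude[r - half : r + half, c - half : c + half]
    ori = orientation[r - half : r + half, c - half : c + half]

    cell = window // cells
    # Map orientations in [-pi, pi] to bin indices 0..bins-1.
    bin_idx = np.floor((ori + np.pi) / (2 * np.pi) * bins).astype(int) % bins

    histograms = []
    for i in range(cells):
        for j in range(cells):
            rows = slice(i * cell, (i + 1) * cell)
            cols = slice(j * cell, (j + 1) * cell)
            # Orientation histogram of the cell, weighted by gradient magnitude.
            hist = np.bincount(bin_idx[rows, cols].ravel(),
                               weights=mag[rows, cols].ravel(),
                               minlength=bins)
            histograms.append(hist)
    raw = np.concatenate(histograms)

    # Linear normalisation to [0, 1], then scaling to integers in 0..255.
    span = raw.max() - raw.min()
    scaled = (raw - raw.min()) / span if span > 0 else np.zeros_like(raw)
    return np.rint(scaled * 255).astype(np.uint8)
```

The `magnitude` and `orientation` inputs are expected to be the gradient maps of the saliency heatmap, with orientation in radians as returned by `np.arctan2`.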
III-C Loop Closure with Bag of Words
The set of salient and geometric descriptors is used with a BoW model to detect loop closures. The two types of features go into separate BoW models, due to their different representations and lengths. The BoW model provides an efficient lookup of the closest frames to the current frame as well as their distances. Let $d_s$ and $d_g$ correspond to the distance of the closest frames from the salient and geometric models, respectively. We compute the similarity between the candidate frame and the current frame as follows:
where $w_s$ and $w_g$ are weights that determine the relative contribution of salient and geometric features to the loop closure. If the salient and geometric distances differ, the similarity is lowered to account for the difference; that is, candidates on which the salient and geometric features agree score higher. Frames with a similarity above a similarity threshold are marked as loop closures.
The last step is to skip loop closure detections of nearby frames ( frames apart), and frames that have already been closed before.
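To make the decision logic concrete, the sketch below combines two per-model scores with weights, penalises disagreement between them, and then applies the two post-processing rules from the text (skipping nearby frames and frames that were already closed). The scoring formula here is a hypothetical stand-in, not the paper's similarity equation: it assumes each BoW model yields a similarity score in [0, 1], and the weights, penalty, threshold, and minimum frame gap are all illustrative placeholders.

```python
def fused_similarity(sal_score, geo_score, w_sal=0.5, w_geo=0.5, penalty=0.5):
    """Hypothetical fusion: weighted sum minus a penalty for disagreement."""
    return w_sal * sal_score + w_geo * geo_score - penalty * abs(sal_score - geo_score)

def detect_loop(current_idx, candidate_idx, sal_score, geo_score, closed,
                threshold=0.6, min_gap=50):
    """Return True when the candidate should be reported as a loop closure.

    closed is a set of candidate indices that were already used for a closure;
    it is updated in place when a new closure is accepted.
    """
    if current_idx - candidate_idx < min_gap:   # candidate is too close in time
        return False
    if candidate_idx in closed:                 # this loop was already closed
        return False
    if fused_similarity(sal_score, geo_score) <= threshold:
        return False
    closed.add(candidate_idx)
    return True
```

Under this assumed formula, two models that both report 0.9 give a fused score of 0.9, whereas scores of 0.9 and 0.3 give 0.3, so agreement between feature types is rewarded.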
III-D End-to-end framework
In order to create a complete SLAM system, we combine our loop closure with ORB-SLAM2. We start a concurrent loop closing thread and modify ORB-SLAM2's default loop closing thread. The current frame is passed to HGI, which detects loop closures and delivers the candidate frames back to ORB-SLAM2. This process is shown in Fig. 2.
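The hand-off between tracking and a concurrent detection thread can be illustrated with a small producer and consumer sketch. ORB-SLAM2 itself is a C++ system, so this Python version only illustrates the pattern under simplifying assumptions: frames are pushed into a queue, a single worker runs a caller-supplied `detect` function (a hypothetical placeholder for the HGI detector), and detected loop candidates are returned through a second queue.

```python
import queue
import threading

def run_loop_detector(frames_in, loops_out, detect):
    """Consume frames until a None sentinel arrives and publish detected loops."""
    while True:
        item = frames_in.get()
        if item is None:
            break
        frame_id, frame = item
        match = detect(frame_id, frame)      # placeholder for the HGI detector
        if match is not None:
            loops_out.put((frame_id, match))

def process_sequence(frames, detect):
    """Feed frames to a detection thread and collect (frame_id, match) pairs."""
    frames_in, loops_out = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=run_loop_detector,
                              args=(frames_in, loops_out, detect))
    worker.start()
    for frame_id, frame in enumerate(frames):
        frames_in.put((frame_id, frame))     # tracking would continue without waiting
    frames_in.put(None)                      # signal the end of the sequence
    worker.join()

    loops = []
    while not loops_out.empty():
        loops.append(loops_out.get())
    return loops
```

In a live system the tracking loop would poll the output queue while running instead of waiting for the worker to finish; the join here only keeps the example deterministic.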
After a loop closure is detected, the three ORB-SLAM2 keyframes before the target frame are replaced with frames processed by HGI. To store HGI frames efficiently, a frame is stored only if its similarity to the last frame added (computed using Eq. 8) is below a threshold. Loop closures can be attempted even if HGI frames are not stored, relying on ORB features as a fallback.
The loop closure of HGI combined with ORB-SLAM2 is HGI-SLAM, a complete SLAM framework with accurate loop closure detection.
Our method was evaluated in two ways. First, we present a quantitative analysis of HGI loop detection in terms of precision and recall. Then we compare the entire HGI-SLAM system to its constituent parts to show the improvement gained by combining feature types. For both parts we use the KITTI [kitti] sequences with loop closures and all EuRoC [euroc] sequences. We then provide a qualitative analysis, along with a direct similarity [cossim] comparison of feature types. Finally, we compare the runtime of our method against the base ORB-SLAM2 system.
IV-A Evaluation of Loop Closure Detection
For the following two evaluations we use the KITTI [kitti] and EuRoC [euroc] datasets. The KITTI dataset contains stereo sequences recorded from a car in urban and highway environments. Of the stereo images, we use only the first image to provide a monocular input. We also only ran the evaluation on sequences 00, 02, 05, 06, 07, and 09 as they contain loops. The EuRoC dataset contains 11 stereo sequences of different rooms and a large industrial environment. We again only use the first image sequences, and skip sequence V2-03 due to severe motion blur. We ran each method on these sequences to predict loop closure frames, and recorded the average precision and recall for each.
Precision and recall are defined as:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

where TP, FP, and FN are true positives, false positives, and false negatives, respectively.
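For concreteness, precision and recall can be computed from sets of predicted and ground-truth loop closure frames as in the sketch below. It assumes each loop closure is identified by a single hashable frame identifier and that a prediction counts as correct only on an exact match; the paper's matching tolerance is not specified here.

```python
def precision_recall(predicted, ground_truth):
    """Return (precision, recall) for predicted versus true loop closure frames."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)    # correctly detected loops
    fp = len(predicted - ground_truth)    # detections with no true loop
    fn = len(ground_truth - predicted)    # true loops that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, predicting frames {1, 2, 3, 4} against ground truth {2, 3, 4, 5, 6} gives three true positives, one false positive, and two false negatives, so the precision is 3/4 and the recall is 3/5.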
We compared HGI to other monocular methods for loop closure detection, SymbioLCD [symbioLCD] and Online VPR [onlineVPR]. ORB-DBOW2 is also shown as a baseline. Table I and Table II show that HGI had the highest average recall and precision for both datasets. Further, HGI has superior recall on most sequences, with SymbioLCD performing slightly better on sequence 07 of KITTI and V1-01 of EuRoC. Sequence V1-02 was the only sequence where HGI had both lower recall and lower precision, which is discussed in Section IV-B. Some specific frames and their keypoints are shown in Fig. 3.
IV-B Evaluation of Complete System
To evaluate the entire HGI-SLAM system, we ran each method on the same sequences as in Section IV-A, and recorded the absolute trajectory error.
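Absolute trajectory error is defined as

$$\text{ATE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert \hat{p}_i - p_i \rVert^2}$$

where $\hat{p}_i$ is the predicted position and $p_i$ is the ground truth [robust].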
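A minimal sketch of this metric, assuming it is the usual root-mean-square form over positions, is given below. The function assumes that the two trajectories are already time-associated and expressed in the same frame and scale; for monocular input this normally requires a similarity alignment beforehand, which is not shown. The function name is hypothetical.

```python
import numpy as np

def absolute_trajectory_error(predicted, ground_truth):
    """Root-mean-square position error between two aligned (N, 3) trajectories."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    if predicted.shape != ground_truth.shape:
        raise ValueError("trajectories must have the same shape")
    errors = np.linalg.norm(predicted - ground_truth, axis=1)  # per-pose distance
    return float(np.sqrt(np.mean(errors ** 2)))
```

With two poses whose position errors are 0 and 5, the result is the square root of 25/2, roughly 3.54.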
We compared HGI-SLAM to its constituent parts (salient, geometric) by using only that feature type for loop closure detection. ORB-SLAM2 on monocular input is shown as a baseline. Table III and Table IV show that HGI-SLAM has comparable ATE to ORB-SLAM2 on most sequences. HGI-SLAM outperforms ORB-SLAM2 in sequences 00, 02, 06, and 09. In sequences 05 and 07, both methods detect nearly all the loop closures, resulting in close ATE. Scale drift occurs in all of the methods due to the monocular input, but the loop corrections in HGI-SLAM improve the overall performance.
Neither salient nor geometric features individually performed better than HGI. Using only salient features tends to produce large tracking errors due to the limited number of salient objects in some scenes. Note that in sequence 09 of KITTI and V1-02 of EuRoC, salient features alone could not complete the tracking. This does not affect the performance of HGI-SLAM because salient features serve as a supplement to geometric features, which work well in most environments. Geometric features alone largely had an ATE slightly higher than HGI-SLAM or ORB-SLAM2.
IV-C Qualitative Analysis of Features
The claim in this paper is that the combination of geometric and salient features results in more accurate loop closure detection. The intuition behind this claim is that saliency represents information about objects in the scene, and geometric features represent high-level position information. This combination results in detections that can rely on two different types of features, providing more robustness and versatility. Fig. 4 captures the uniqueness of each feature type, as they are mostly dissimilar by cosine similarity [cossim].
Both salient features and geometric features were collected, and a histogram was created based on the similarity score between them. The graph is scaled so that the largest bin has a magnitude of one. Fig. 4 shows two peaks in the similarity distribution. The larger one, centered near the left, is due to the difference in the type of features. The smaller peak to the right is caused by the alignment of features in certain situations, specifically when salient objects are in front of a mostly featureless background.
We show through quantitative and qualitative analysis that HGI-SLAM detects loop closures better than either feature type alone, and better than bare ORB-SLAM2. Our proposed method can detect loop closures in scenes with more organic features, owing to the saliency component.
IV-D Runtime Analysis
In order to complete the evaluation of HGI-SLAM, we present timing results in Table V based on sequences mentioned in the previous section.
The timings in Table V represent the average runtime of each thread. The original tracking thread was only modified to collect data on the system. The original loop thread is where the loop closure frames are replaced, as described in Section III-D. The HGI-loop detection thread is where the loop detection of our method occurred.
Overall, our method does not significantly slow the ORB-SLAM2 base, and is capable of detecting loop closures in real time. The slowest parts of HGI are the heatmap generation from SALGAN and the descriptor generation described in Section III-B. The complete HGI-SLAM system runs at 30 fps on an Intel® Core i7-10510U CPU @ 1.80GHz with an NVIDIA® GeForce® MX230 graphics card.
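Table V: Average runtime of each thread.

| Thread | ORB-SLAM2 | HGI-SLAM |
|---|---|---|
| ORB-SLAM2 Tracking (Barely Modified) | 42.77 | 48.15 |
| ORB-SLAM2 Loop (Heavily Modified) | 96.34 | 104.90 |
| HGI Loop Detection | - | 86.29 |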
We have introduced HGI-SLAM, an approach to loop closure using human salient and geometric features. By combining geometric and salient features, our method is able to accurately detect loop closures using either objects or contours. In order to do this, we created a novel descriptor generation method and a fully integrated SLAM system based on ORB-SLAM2 [orbslam2]. By postprocessing loop detections, we remove many false positives and optimize the storage of keyframes. We provide quantitative evaluations of HGI on the KITTI [kitti] and EuRoC [euroc] datasets to show the benefits of this approach. Furthermore, HGI-SLAM performs better than using either type of feature alone, as shown in the evaluation of the complete system. We corroborate this with a qualitative analysis of the salient and geometric features.
Lastly, we believe that HGI-SLAM can be extended to further improve its loop closure. Combining the information of neighboring keyframes after loop closure may increase robustness to motion blur. Training the model in a larger variety of environments could further extend its versatility.