I Introduction
Augmentation of the feature matching process of VO/VSLAM systems with a local map matching subprocess aids data association and state optimization [1, 2]. Compared with a global map containing all historical 3D points, the local map includes only the subset of 3D points that are hypothesized to be currently visible. Conducting data association and downstream state optimization on a compact local map is more efficient than for the larger global map.
By matching 2D features from the current frame to the local map (which includes 3D points observed at earlier frames), extra long-baseline feature matchings can be extracted and utilized in state optimization; see Figure 1 (top-left), which depicts a histogram of matched local map points for ORB-SLAM, where the baseline is measured in terms of how long ago the features were seen (as opposed to how far apart spatially). These long-baseline matchings contribute to the accuracy and robustness of VO/VSLAM. Not surprisingly, VO/VSLAM systems employing a local map [3, 1] tend to be more accurate and robust than systems relying only on frame-to-frame tracking [4, 5, 6].
To increase the likelihood of finding and utilizing long-baseline feature matchings, it is natural to maintain a history of the 3D points observed earlier in time within the local map. Because there is a trade-off between local map size and search efficiency, specific properties or information have been utilized to guide the local map contents and keep the map compact yet relevant. The most commonly used property to guide the search for relevant 3D points is covisibility. Covisibility was introduced for loop closing in VSLAM [7], and later extended to pose tracking [8, 1, 9, 10]. The assumption behind covisibility is that if an earlier keyframe shares many 3D points with a recent keyframe (i.e., the two are covisible), then all 3D points observed by the earlier keyframe are likely to be seen as well. Covisibility information is cheap to obtain as a byproduct of earlier data association calculations, so it is an efficient heuristic for local map building. However, covisibility only utilizes a relatively weak temporal prior (i.e. seen before, likely to be seen now). A local map generated with covisibility could easily grow without bound, and thereafter introduce significant latency to VO/VSLAM. Figure 1 (middle row) includes a plot of the ORB-SLAM local map size versus time, where it is seen to occasionally grow to be one to two orders of magnitude more than the number of tracked features per frame (typically on the order of to ).
In this work, we propose to enhance the covisibility local map building step with a strong appearance prior, which leads to a compact yet relevant local map, as indicated in Figure 1 (middle row), where the proposed local map is bounded in size and can be up to an order of magnitude smaller than that of ORB-SLAM. The idea is straightforward: only those 3D points that are visually similar to currently extracted features are potentially useful in data association (and state optimization thereafter). To utilize the appearance prior efficiently, we propose to index descriptors of historical 3D points with Multi-Index Hashing (MIH) [11]. By querying historical 3D points from a series of hash tables, we can collect the subset of 3D points that are similar to current measurements in appearance/descriptor space. The visually similar 3D points are then verified with covisibility, and put together as the local map for the costly computations, e.g. data association and state optimization.
Furthermore, an online table selection algorithm is developed to choose a subset of hash tables that cover the most relevant 3D points. By only querying 3D points from the subset, the overhead of hash table queries is reduced, while the quality of the local map is preserved, as indicated by the comparable RPE in Fig 1 (top-right). The table selection process is rooted in the submodular property of the table selection metric (e.g. information gain of feature matchings obtained from each table). Because of this submodularity, a greedy algorithm can achieve near-optimal table selection outcomes efficiently. Figure 1 (bottom row) shows better bounding of the SLAM latency per frame, with fewer outliers, relative to a ms threshold.
The proposed appearance-enhanced local map building method is integrated into a state-of-the-art VO/VSLAM system, ORB-SLAM [1]. When evaluated on multiple public benchmarks, the size of the local map is significantly reduced. More importantly, the proposed method has lower latency than state-of-the-art VO/VSLAM systems, while remaining one of the best methods in terms of accuracy and robustness. Furthermore, the proposed local map building method is generic; it can be easily extended to other visual(-inertial) SLAM systems utilizing a local map, e.g., [3, 12].
II Related Works
This section reviews existing works that index 3D points in a map. Two closely related fields are explored: Vision-based Localization (VBL) & Visual SLAM (VSLAM). Differences between existing works and the proposed work are discussed.
VBL aims to retrieve the 6-DoF pose of a visual query (image or video) within a huge, pre-built spatial representation, e.g. a 3D point map. One key component of VBL is to index the spatial representation for efficient query. Covisibility was introduced to feature-based VBL [13, 14] as a cue to prioritize feature matching efforts. Researchers also proposed alternative indexing methods based on appearance/feature descriptors [15, 16]. Real-valued feature descriptors such as SIFT [17] and SURF [18] are typically indexed offline using a kd-tree. Appearance-based indexing has been shown to yield more accurate & robust query results, while covisibility is more computationally efficient. Combining both cues was first explored in [19], and further refined in [20, 21]. The work [21] replaced the kd-tree data structure with a faster & more flexible indexing method, the inverted multi-index. The appearance-based query results are then filtered with covisibility. Such a combination scheme is efficient: the VBL system runs in real time on a mobile device. Nevertheless, training the inverted index is still an offline process requiring a known 3D map.
Recently, binary feature descriptors such as BRISK [22] and ORB [23] have become popular in VBL since they are more efficient to extract. Conventional indexing data structures like kd-trees are better suited to real-valued descriptors than to binary ones, motivating the exploration of alternative indexing methods. For example, [24] proposed to index binary descriptors with randomized trees, which were trained offline from the pre-built 3D map. Hashing has been shown to be a good indexing solution [25, 26] in binary-descriptor VBL. Coarse-to-fine searching schemes are commonly applied in these VBL systems, where an initial hashing query provides the coarse results that are later refined by a linear scan.
Apart from compatibility with binary descriptors, two other properties of hashing make it particularly attractive for online & incremental pose estimation problems, e.g. VSLAM. First, a hashing index can be updated efficiently for online processes. It is then possible to generate a more compact and relevant index by updating hash tables, e.g., according to changes in the map & the visibility constraints. Second, hashing relaxes the requirement for database pre-training (or prior offline database generation), therefore enabling VSLAM systems to operate in general and unknown environments. Hashing has been applied to modules of VSLAM where real-time performance is not required. [27] indexed binary descriptors with Locality Sensitive Hashing (LSH) [28], and demonstrated good relocalization performance in a VSLAM system. [29] utilized Multi-Index Hashing (MIH) [11] in the loop closing module of VSLAM.
The proposed work is based on MIH, but with a key enhancement: an online table selection algorithm is developed to reduce the number of hashing queries, therefore enabling MIH to be used in VSLAM modules with real-time requirements, e.g. pose tracking. The local map queried with appearance/feature descriptors is further tailored with a covisibility check. The final local map is more compact than the ones generated with either covisibility or appearance only. Running data association and state optimization on the size-reduced local map is more efficient and leads to significant latency reductions in VSLAM. Furthermore, the quality of the local map (e.g. the amount of long-baseline feature matchings) is preserved in the compact local map. Therefore, the performance of VSLAM is preserved. Preliminary quantification of these benefits can be seen in Figure 1 for a single sequence.
III Local Map Building with Multi-Index Hashing
The proposed local map building method is illustrated in Fig 2. The modules of the proposed method are highlighted with shaded boxes, while those of a conventional VSLAM pipeline have clear boxes. This section describes the query and insertion stages of MIH. The hash table selection algorithm is introduced in the next section.
Query MIH. Assume that a frame with $N$ binary descriptors extracted is provided and that the MIH contains $m$ hash tables. Each binary descriptor triggers an MIH query. In an MIH query, the $b$-bit binary query descriptor is first separated into $m$ disjoint contiguous substrings. Each substring is queried against the corresponding hash table for an exact match. Query results from all hash tables are put together as the final query result. Repeating the MIH query for all $N$ binary descriptors from the input frame aggregates the set of 3D points that satisfy the appearance prior. Intersecting this set with the 3D point set collected with conventional covisibility yields the final local map.
Insert to MIH. Updating MIH according to changes in the map & visibility constraints is essential for efficient local map building. As a trade-off between update frequency and computation cost, MIH updates are triggered only for keyframes sent to the mapping thread. Updating MIH in the mapping thread avoids introducing overhead during real-time pose tracking.
For each keyframe, the covisible 3D points are inserted into the MIH. Similar to the query process, the $b$-bit binary descriptor of each of these 3D points is separated into $m$ disjoint contiguous substrings (one per hash table), each of length $b/m$. Each substring is then inserted into the corresponding hash table. For 3D points already in the hash tables, their entries are brought to the front of the bucket, making them more likely to be queried in the future.
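The query and insertion stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class name `MultiIndexHash`, the helper `build_local_map`, the integer encoding of descriptors, and the use of a bounded `deque` as the bucket are all hypothetical choices made for the example.

```python
from collections import deque


class MultiIndexHash:
    """Toy multi-index hash over fixed-length binary descriptors (stored as ints)."""

    def __init__(self, num_bits=256, num_tables=32, bucket_size=10):
        assert num_bits % num_tables == 0
        self.num_tables = num_tables
        self.sub_bits = num_bits // num_tables
        self.bucket_size = bucket_size
        # One dict per hash table: substring value -> bounded bucket of point ids.
        self.tables = [dict() for _ in range(num_tables)]

    def _substrings(self, descriptor):
        # Split the descriptor into disjoint contiguous substrings, lowest bits first.
        mask = (1 << self.sub_bits) - 1
        return [(descriptor >> (i * self.sub_bits)) & mask
                for i in range(self.num_tables)]

    def insert(self, point_id, descriptor):
        for table, key in zip(self.tables, self._substrings(descriptor)):
            bucket = table.setdefault(key, deque(maxlen=self.bucket_size))
            if point_id in bucket:
                bucket.remove(point_id)   # existing entry is refreshed below
            bucket.appendleft(point_id)   # front of bucket; oldest entry drops off when full

    def query(self, descriptor, table_ids=None):
        # Exact-match lookup of each substring in its own table; results are unioned.
        ids = range(self.num_tables) if table_ids is None else table_ids
        keys = self._substrings(descriptor)
        hits = set()
        for i in ids:
            hits.update(self.tables[i].get(keys[i], ()))
        return hits


def build_local_map(mih, frame_descriptors, covisible_points):
    """Appearance-based candidates, verified against the covisible point set."""
    candidates = set()
    for descriptor in frame_descriptors:
        candidates |= mih.query(descriptor)
    return candidates & set(covisible_points)
```

The optional `table_ids` argument is included so that a query can be restricted to a subset of tables; how such a subset might be chosen is a separate question from the indexing itself.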
Choice of hash table number. The number of hash tables $m$ has a strong impact on the performance-efficiency trade-off of MIH-based local map indexing. Recall the example of a frame with $N$ features extracted. Each feature triggers an MIH query consisting of queries to $m$ hash tables. Therefore, MIH-based local map building has a time complexity of $O(Nm)$, i.e., linear in $m$. Meanwhile, the space complexity of MIH is $O(m \, s \, 2^{b/m})$, where $b$ is the descriptor length in bits and $s$ is the bucket size in each hash table. The space complexity decreases exponentially with the table number $m$. Therefore, only a certain range of $m$ works in practical applications due to time & space complexity limits.
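To make the time and space trade-off concrete, the following sketch counts hash table queries per frame and the worst-case number of stored entries for a given table count. It assumes that each table has one bucket per possible substring value and that every bucket can fill up to the bucket size; the function name `mih_costs` and this capacity model are assumptions of the example rather than figures from the experiments.

```python
def mih_costs(num_features, num_tables, num_bits=256, bucket_size=10):
    """Return (table queries per frame, worst-case stored entries)."""
    sub_bits = num_bits // num_tables           # bits covered by each table
    queries_per_frame = num_features * num_tables
    buckets_per_table = 2 ** sub_bits           # one bucket per substring value
    max_entries = num_tables * buckets_per_table * bucket_size
    return queries_per_frame, max_entries


if __name__ == "__main__":
    for tables in (16, 32, 64):
        print(tables, mih_costs(num_features=800, num_tables=tables))
```

Under these assumptions, halving the table count from 32 to 16 halves the queries per frame but raises the worst-case capacity from 81,920 to 10,485,760 entries, which illustrates why only a limited range of table counts is practical.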
Apart from time & space complexity, the robustness of local map building against perturbations in binary descriptors is largely determined by the hash table number $m$. Assuming $r$ bits of the query descriptor are perturbed under a uniform distribution, the recall probability (i.e. the probability that the query succeeds with a perturbed string) is connected to the hash table number $m$ as per [29]:
$$P_{\text{recall}}(r, m) = 1 - \frac{m!\, S(r, m)}{m^{r}} \tag{1}$$
where $S(r, m)$ is the Stirling partition number [30].
When working with 256-bit binary descriptors such as ORB, the relationship described in Eq 1 is illustrated in Fig 3. The green and red dashed lines indicate example thresholds of bitwise perturbations in typical SLAM applications. At least 32 tables are needed for high recall probability within the example perturbation levels (vertical dashed lines). Using 64 tables is also possible, but with the drawback of higher overhead due to the linear growth in time complexity. In the proposed local map indexing method, 32 hash tables are maintained; each table covers an 8-bit descriptor substring.
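The dependence of recall on the table count can be explored numerically with a small sketch. It assumes a simplified model in which each perturbed bit falls independently and uniformly into one of the substrings, so that a query fails only when every substring contains at least one perturbed bit. The function name `recall_probability` is hypothetical, and the count of "every substring hit" assignments (table count factorial times the Stirling partition number) is evaluated by inclusion-exclusion.

```python
from fractions import Fraction
from math import comb


def recall_probability(num_tables, num_perturbed):
    """P(at least one substring is untouched) under the independent-uniform model."""
    m, r = num_tables, num_perturbed
    # Number of ways to assign r perturbed bits to m substrings so that every
    # substring receives at least one: m! * S(r, m), via inclusion-exclusion.
    all_hit = sum((-1) ** k * comb(m, k) * (m - k) ** r for k in range(m + 1))
    return 1 - Fraction(all_hit, m ** r)


if __name__ == "__main__":
    for m in (8, 16, 32, 64):
        print(m, [round(float(recall_probability(m, r)), 3) for r in (20, 40, 60, 80)])
```

Exact rational arithmetic is used because the alternating sum involves very large integers that would lose precision in floating point.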
Choice of bucket size. Another parameter affecting the performance-efficiency trade-off of MIH-based local map building is the bucket size of each hash table. A bucket in MIH is implemented as a ring buffer, where only the most recent 3D points are stored. For the purpose of long-baseline feature matching, it is necessary to keep the entries of 3D points observed earlier in time within the bucket. However, an oversized bucket will store entries of 3D points that are no longer visible or relevant. As a consequence, the resulting local map will be less compact and relevant, introducing overhead to data association. In what follows, the bucket size is set to 10 based on a parameter sweep.
IV Overhead Reduction with Hash Table Selection
For a frame with $N$ features extracted and a 32-table MIH, the number of hash table queries in local map building is $32N$. While querying all 32 hash tables provides robustness against severe perturbation, querying a subset of hash tables is more efficient when the bitwise perturbation level is low or medium. We propose an online table selection algorithm to identify the minimum subset of hash tables to be queried, which further improves the compactness of the local map without performance degradation.
Formulation. To begin, the metric used for table selection is introduced. Assume $\mathcal{F}$ is the full set of true feature matchings between the current frame and the full local map built with all 32 hash tables. For each hash table $T_j$, the true feature matchings that can be queried from it form a subset $\mathcal{F}_j$, where $\mathcal{F}_j \subseteq \mathcal{F}$. For each hash table $T_j$, the contribution towards the current state optimization can be assessed with the information matrix of the subset $\mathcal{F}_j$.
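The least squares objective of VO/VSLAM pose tracking is
$$\min_{x} \, \lVert h(x, p) - z \rVert^{2} \tag{2}$$
where $x$ is the pose of the camera, $p$ are the 3D feature points and $z$ are the corresponding 2D image measurements. The measurement function, $h(x, p)$, is a combination of the transformation (world-to-camera) and pinhole projection. To first-order approximation, the information matrix of the camera pose is
$$\Omega_x = \sum_{i \in \mathcal{F}} H_x(i)^{\top} \, \Omega_r(i) \, H_x(i) \tag{3}$$
where the sum runs over the set $\mathcal{F}$ of true matched features, and $H_x(i)$ and $\Omega_r(i)$ are the measurement Jacobian and residual information matrix of the corresponding true matched features. Denote by $\Omega_x(i) = H_x(i)^{\top} \Omega_r(i) H_x(i)$ the pose information matrix derived from a single feature match $i \in \mathcal{F}$.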
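As introduced for feature subset selection [31, 32], the logDet metric is especially suited for quantifying the contribution of matched features to VO/VSLAM. Therefore, the value of a hash table $T_j$ towards the current state optimization can be measured with
$$\log\det\Big(\sum_{i \in \mathcal{F}_j} \Omega_x(i)\Big) \tag{4}$$
where $\mathcal{F}_j$ is the subset of true feature matchings retrievable from $T_j$ and $\Omega_x(i)$ is the pose information matrix of match $i$.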
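As a rough illustration of the logDet value of a set of matches, the sketch below accumulates a pose information matrix from per-match Jacobians and evaluates its log-determinant with a Cholesky factorization. Several simplifications are assumptions of the example: the residual information of each match is taken as a scalar weight times the identity, a small diagonal prior keeps the matrix positive definite when few matches are available, and the names `pose_information` and `logdet` are hypothetical.

```python
import math


def pose_information(jacobians, weights, dim=6, prior=1e-6):
    """Sum of w_i * H_i^T H_i over matches, plus a small diagonal prior (assumed)."""
    omega = [[prior if r == c else 0.0 for c in range(dim)] for r in range(dim)]
    for H, w in zip(jacobians, weights):
        for row in H:                      # each row of the (2 x dim) Jacobian
            for r in range(dim):
                for c in range(dim):
                    omega[r][c] += w * row[r] * row[c]
    return omega


def logdet(matrix):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    n = len(matrix)
    L = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(s)     # raises ValueError if not positive definite
                total += 2.0 * math.log(L[i][j])
            else:
                L[i][j] = s / L[j][j]
    return total
```

The value of one hash table would then be `logdet(pose_information(...))` evaluated over the matches retrievable from that table. A pure-Python factorization is used only to keep the sketch dependency-free; a linear algebra library would normally be preferable.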
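There is a certain level of overlap between the true matched feature subsets of the hash tables. In the ideal scenario without any perturbation to the feature descriptors, the full set of true feature matchings can be retrieved from any one of the 32 hash tables, i.e. 100% overlap between subsets, $\mathcal{F}_j = \mathcal{F}$. In practice, perturbations reduce the subset overlap to less than 100%, and each hash table covers a subset of true feature matchings $\mathcal{F}_j \subset \mathcal{F}$. Therefore, selecting a subset of hash tables is equivalent to a maximum coverage problem, with the objective formulated as:
$$\max_{\mathcal{S} \subseteq \{1, \dots, 32\},\ |\mathcal{S}| \le k} \ \log\det\Big(\sum_{i \in \bigcup_{j \in \mathcal{S}} \mathcal{F}_j} \Omega_x(i)\Big) \tag{5}$$
where $\mathcal{S}$ indexes the selected hash tables, $\Omega_x(i)$ is the pose information matrix of match $i$, and $k$ is the cardinality constraint.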
Greedy Solution. The maximum coverage problem is studied in the field of computational theory, where it is known to have submodular properties. Of note, [33] states: let $f$ be a monotone submodular function; then greedy algorithms achieve a $(1 - 1/e)$ approximation guarantee to the optimum solution of Eq (5).
As proven in [34], logDet is submodular & monotone increasing. Solutions to the subset selection problem, and the equivalent hash table selection problem, can therefore be approximated using greedy algorithms. More importantly, a greedy algorithm is guaranteed to be near-optimal, with an approximation ratio of $1 - 1/e$. Based on this outcome, we present a greedy, online hash table selection algorithm in Alg 1. Two control parameters are fixed after parameter sweep: cardinality constraint , target contribution .
Notice that the above discussion assumes that the true feature matchings are known when performing hash table selection. We assume that the hash table contents are a slowly varying function of time. Therefore, the hash table subset selection algorithm runs at a lower rate than real-time pose tracking, and only updates the selected subset at keyframes. Between keyframes, the hash table subset queried for local map building is fixed.
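A generic greedy selection loop of the kind discussed above can be sketched as follows. This is not Alg 1 itself: the function name `greedy_table_selection`, the `value` callback, and the stopping rule based on a `target_ratio` of the value achieved with all tables are assumptions of the example. Any monotone submodular set function can be plugged in as `value`; the usage below uses plain match counts (maximum coverage) for simplicity, whereas a logDet-based value would require care with negative values in the ratio test.

```python
def greedy_table_selection(table_matches, value, max_tables, target_ratio=1.0):
    """Greedily pick hash tables whose matches add the most value.

    table_matches: dict mapping table id -> set of match ids retrievable from it.
    value:         callable mapping a set of match ids to a float (assumed
                   monotone and submodular).
    """
    full_value = value(set().union(*table_matches.values()))
    selected, covered = [], set()
    current = value(covered)
    remaining = set(table_matches)
    while remaining and len(selected) < max_tables:
        if full_value > 0 and current >= target_ratio * full_value:
            break                                   # target contribution reached
        # Table with the largest value after adding its matches (ties: lowest id).
        best = max(sorted(remaining),
                   key=lambda t: value(covered | table_matches[t]))
        gain = value(covered | table_matches[best]) - current
        if gain <= 0:
            break                                   # no table adds anything new
        selected.append(best)
        covered |= table_matches[best]
        current += gain
        remaining.remove(best)
    return selected, covered


if __name__ == "__main__":
    tables = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {1}}
    print(greedy_table_selection(tables, value=len, max_tables=2))
```

In this toy input, table 2 is chosen first because it covers the most matches, and table 0 second because it adds three matches not yet covered.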
V Experimental Results
This section evaluates the performance-efficiency trade-off of the proposed local map building algorithm on a state-of-the-art VSLAM system, ORB-SLAM [1]. Applying the proposed algorithm to the real-time tracking thread of ORB-SLAM reduces pose tracking latency. Meanwhile, tracking accuracy is either improved (on short sequences) or remains near the same level as canonical ORB-SLAM (on the long sequence), and the robustness is preserved (i.e. tracking failure is avoided).
Two public benchmarks are used to evaluate the proposed algorithm:

NewCollege [35], which contains a 43-minute stereo sequence collected with a robot traversing a campus and adjacent parks. There are multiple loops/revisits within the sequence. The sequence is well suited for evaluating the long-term performance & efficiency of a VSLAM system (with loop closure). Due to the lack of 6-DoF pose ground truth, offline Bundle Adjustment is executed with the stereo video, and the jointly optimized camera poses are taken as the ground truth. We only evaluate monocular VSLAM (e.g. with the left camera) against the ground truth in this experiment.

EuRoC [36], which contains 11 stereo-inertial sequences comprising 19 minutes of video, recorded in 3 different indoor environments. Compared with NewCollege, videos in EuRoC are well suited for evaluating the short-term performance & efficiency of VO (without loop closure). Ground-truth tracks are provided using motion capture systems (Vicon & Leica MS50). We evaluate only monocular VO implementations on EuRoC.
Two performance metrics are used in the experiment. When evaluating the short-term performance of VO on EuRoC, the absolute root-mean-square error (RMSE) between the ground truth track and the real-time VO estimate is used. When evaluating the long-term performance of VSLAM on NewCollege, the Relative Position Error (RPE) [37, 38] is chosen. Compared with absolute RMSE, RPE is less sensitive to the inevitable scale drift of monocular VSLAM. Therefore, it is better for evaluating monocular systems on long-term sequences.
The efficiency of VO/VSLAM is evaluated with the latency of real-time pose tracking per frame, which is defined as the time interval from receiving an image to publishing the pose estimate. Latency of mapping & loop closing is less of a concern in this work due to the relaxed time constraints of those processes.
Performance assessment involves a 10-run repeat for each configuration, i.e., the benchmark sequence, the VO/VSLAM approach and the parameter (number of features tracked per frame). Results for a tested VO/VSLAM configuration are discarded if at least one run experiences track loss. The experiments are conducted on a desktop equipped with an Intel i7 quad-core 4.20GHz CPU (passmark score of 2583 per thread) running the ROS Indigo environment.
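The aggregation protocol described above (repeat runs, discard a configuration on any track loss, report error and latency statistics) could be scripted along the following lines. The record layout, the function name `summarize_configuration`, and the choice to pool per-frame latencies across runs before taking quartiles are assumptions of this sketch rather than details stated in the text.

```python
from statistics import mean, quantiles


def summarize_configuration(runs):
    """Summarize repeated runs of one configuration.

    runs: list of dicts with keys 'track_lost' (bool), 'error' (float, e.g.
          RMSE or RPE of the run) and 'latencies_ms' (per-frame latencies).
    Returns None when any run lost track, mirroring the discard rule.
    """
    if any(run["track_lost"] for run in runs):
        return None
    latencies = [t for run in runs for t in run["latencies_ms"]]
    q1, _, q3 = quantiles(latencies, n=4)   # first quartile, median, third quartile
    return {
        "error_mean": mean(run["error"] for run in runs),
        "latency_q1": q1,
        "latency_avg": mean(latencies),
        "latency_q3": q3,
    }


if __name__ == "__main__":
    demo = [
        {"track_lost": False, "error": 0.10, "latencies_ms": [10.0, 12.0]},
        {"track_lost": False, "error": 0.30, "latencies_ms": [14.0, 16.0]},
    ]
    print(summarize_configuration(demo))
```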
V-A Online Table Selection vs Fixed Table Subsets
To demonstrate the benefit of online hash table selection (Alg 1), we performed additional 10-run repeats of MIH-based local map building with a predefined set of fixed hash table subsets, ranging from 1 table (MIH1/32) to all 32 tables (MIH32/32). Results of these tests are compared to MIH-based local map building with online hash table selection, i.e. MIHx/32 (x = 10).
The latency profiles of different hash table subsets are presented in Fig 4. MIHx/32 has the lowest data association latency among the compared hash table subsets. The latency of hash table queries is also lower with online hash table selection. For performance evaluation, we collected the average RPE (with a 10-sec window) and logged the average latency of each module in the real-time pose tracking process. Performance (RPE) and efficiency (latency) outcomes are summarized in Fig 5. MIHx/32 has the lowest latency for pose tracking while preserving the performance of VSLAM relative to the fixed table subsets.
V-B Comparison with State-of-the-Art VO/VSLAM
The latency reduction and strong performance of the proposed local map building algorithm are demonstrated by comparing with other state-of-the-art VO/VSLAM systems.
VSLAM. Two state-of-the-art VSLAM systems are chosen as baselines: DSO with loop closure (LDSO) [39] and ORB-SLAM (ORB) [1]. In addition to the proposed MIHx/32, we integrate two reference methods into ORB-SLAM that enhance covisibility local map building with simple heuristics. One heuristic is random sampling, i.e. Rnd. The other heuristic prioritizes map points with a long track history, denoted as Long, since feature points tracked for a long time are more likely to be mapped accurately.
To capture the performance-efficiency trade-off of VSLAM systems, we adjust the number of features/patches extracted per frame. All 5 VSLAM systems are configured to run 10 repeats on NewCollege, with feature/patch quantities ranging from 800 to 2000. The RPE under a 10-sec window versus the average latency per frame is depicted in Fig 6. Relative to ORB-SLAM, the proposed MIHx/32 leads to latency reduction for all configurations of feature number. Rnd also leads to latency reduction, but not as much as MIHx/32. The Rnd case with 800 features leads to track loss, so it is not plotted. Both LDSO and Long failed to track the full NewCollege sequence. The accuracy of MIHx/32 is comparable to the best performing ORB realizations, but with a lower deviation as indicated by the shorter error bars. Lastly, we report the accuracy & latency of the monocular VSLAM systems under the configuration of 800 features per frame in Table I. Three RPE metrics are computed using different sliding windows: 3-sec, 10-sec and 30-sec. In addition to the average RPE over the 10-run repeat, the standard deviation (STD) of the RPE is also reported in each cell of Table I. The two heuristics Rnd and Long are excluded since they both failed to track the full sequence. The best numbers (lowest average/STD of RPE, lowest latency) are highlighted in bold. The accuracy of MIHx/32 remains at a similar level to ORB (equal or within around 10%), as evaluated on all 3 RPE metrics. More importantly, the latency of the proposed method is lower and more consistent than that of baseline ORB. It is 21%, 33%, and 40% lower for the first quartile, average, and third quartile values.
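Table I: VSLAM (with loop closure). A dash marks a system that failed to track the full sequence.

| Metric             | LDSO | ORB         | MIHx/32     |
|--------------------|------|-------------|-------------|
| RPE (STD), 3-sec   | -    | 0.11 (2e-2) | 0.12 (8e-3) |
| RPE (STD), 10-sec  | -    | 0.08 (8e-3) | 0.08 (6e-3) |
| RPE (STD), 30-sec  | -    | 0.09 (5e-3) | 0.10 (1e-2) |
| Latency (ms), Q1   | -    | 13.2        | 10.4        |
| Latency (ms), Avg. | -    | 18.3        | 12.2        |
| Latency (ms), Q3   | -    | 21.5        | 13.3        |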
VO. Two state-of-the-art VO baselines are included: SVO [40] and DSO [41]. For fair comparison, the loop closing module is disabled on all ORB-SLAM variants: canonical ORB, MIHx/32, Rnd, and Long. All VO systems are configured to run 10 repeats on EuRoC under the example configuration (800 features per frame). The short-term performance of VO is evaluated with RMSE, while the efficiency is still assessed via per-frame tracking latency. Accuracy & latency results are summarized in Table II, where the best value (lowest RMSE, lowest latency) in each row is highlighted in bold. According to the upper part of Table II, DSO and the 2 local map building heuristics are not robust enough (e.g. frequent track loss). SVO tracks 9 of 11 sequences, but with the highest RMSE among all VO systems. Both the ORB baseline and the proposed MIHx/32 track 8 of 11 sequences. Additionally, MIHx/32 improves the accuracy relative to baseline ORB, with an RMSE average that is 41% lower.
The latency reduction of MIHx/32 is less significant for these short-term VO sequences, when compared with the previous long-term VSLAM outcomes. Nevertheless, MIHx/32 has the 2nd lowest average latency among all 6 VO systems, second only to SVO. When comparing the third quartile of latency, MIHx/32 is lower than SVO (by 3%), which suggests that tighter latency bounds can be achieved with the proposed local map building algorithm.
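Table II: VO (without loop closure). A dash marks a sequence on which the system failed to track.

RMSE

| Seq.        | SVO   | DSO   | ORB   | MIHx/32 | Rnd   | Long  |
|-------------|-------|-------|-------|---------|-------|-------|
| MH 01 easy  | 0.227 | 0.407 | 0.027 | 0.026   | 0.025 | -     |
| MH 02 easy  | 0.761 | -     | 0.034 | 0.031   | 0.034 | -     |
| MH 03 med   | 0.798 | 0.751 | 0.041 | 0.086   | 0.035 | -     |
| MH 04 diff  | 4.757 | -     | 0.699 | 0.293   | 0.746 | 0.329 |
| MH 05 diff  | 3.505 | -     | 0.346 | 0.197   | -     | -     |
| VR1 01 easy | 0.726 | 0.950 | 0.057 | 0.040   | 0.034 | -     |
| VR1 02 med  | 0.808 | 0.536 | -     | -       | -     | -     |
| VR1 03 diff | -     | -     | -     | -       | -     | -     |
| VR2 01 easy | 0.277 | 0.297 | 0.025 | 0.032   | 0.021 | -     |
| VR2 02 med  | 0.722 | 0.880 | 0.053 | 0.035   | 0.216 | -     |
| VR2 03 diff | -     | -     | -     | -       | -     | -     |
| Avg.        | 1.477 | 0.637 | 0.160 | 0.093   | 0.159 | 0.329 |

Latency (ms)

| Statistic | SVO  | DSO  | ORB  | MIHx/32 | Rnd  | Long |
|-----------|------|------|------|---------|------|------|
| Q1        | 7.4  | 5.8  | 13.9 | 11.4    | 12.0 | 11.3 |
| Avg.      | 12.6 | 16.4 | 18.4 | 15.7    | 16.0 | 17.7 |
| Q3        | 16.8 | 19.1 | 20.7 | 16.3    | 16.1 | 21.0 |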
VI Conclusion
This paper demonstrated how an appearance prior can be exploited to build a compact yet relevant local map in VSLAM. Working with the compact local map leads to latency reduction in time-sensitive VSLAM modules, i.e., pose tracking. Meanwhile, the accuracy and robustness of VSLAM are preserved, thanks to the preservation of long-baseline feature associations in the local map. On both long-term VSLAM and short-term VO applications, the proposed algorithm leads to significant latency reduction in real-time pose tracking, while keeping (if not improving) VO/VSLAM performance relative to the baseline variant and having the best performance relative to other state-of-the-art systems.
References
 [1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
 [2] R. Mur-Artal and J. D. Tardós, “Visual-inertial monocular SLAM with map reuse,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 796–803, 2017.
 [3] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
 [4] S. Shen, N. Michael, and V. Kumar, “Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft MAVs,” in IEEE International Conference on Robotics and Automation, 2015, pp. 5303–5310.
 [5] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
 [6] K. Mohta, K. Sun, S. Liu, M. Watterson, B. Pfrommer, J. Svacha, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, “Experiments in fast, autonomous, GPS-denied quadrotor flight,” in IEEE International Conference on Robotics and Automation, 2018, pp. 7832–7839.
 [7] C. Mei, G. Sibley, and P. Newman, “Closing loops without places,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010, pp. 3738–3744.

 [8] H. Strasdat, A. J. Davison, J. M. Montiel, and K. Konolige, “Double window optimisation for constant time visual SLAM,” in IEEE International Conference on Computer Vision, 2011, pp. 2352–2359.
 [9] M. Bürki, I. Gilitschenski, E. Stumm, R. Siegwart, and J. Nieto, “Appearance-based landmark selection for efficient long-term visual localization,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2016, pp. 4137–4143.
 [10] M. A. Nitsche, G. I. Castro, T. Pire, T. Fischer, and P. De Cristóforis, “Constrained-covisibility marginalization for efficient onboard stereo SLAM,” in European Conference on Mobile Robots (ECMR). IEEE, 2017, pp. 1–6.
 [11] D. Greene, M. Parnas, and F. Yao, “Multi-index hashing for information retrieval,” in 35th Annual Symposium on Foundations of Computer Science. IEEE, 1994, pp. 722–731.
 [12] R. Mur-Artal and J. D. Tardós, “Visual-inertial monocular SLAM with map reuse,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 796–803, 2017.
 [13] Y. Li, N. Snavely, and D. P. Huttenlocher, “Location recognition using prioritized feature matching,” in European Conference on Computer Vision. Springer, 2010, pp. 791–804.
 [14] S. Choudhary and P. Narayanan, “Visibility probability structure from sfm datasets and applications,” in European Conference on Computer Vision. Springer, 2012, pp. 130–143.
 [15] T. Sattler, B. Leibe, and L. Kobbelt, “Fast image-based localization using direct 2D-to-3D matching,” in International Conference on Computer Vision, 2011, pp. 667–674.

 [16] H. Lim, S. N. Sinha, M. F. Cohen, and M. Uyttendaele, “Real-time image-based 6-DOF localization in large-scale environments,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1043–1050.
 [17] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
 [18] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in European Conference on Computer Vision. Springer, 2006, pp. 404–417.
 [19] T. Sattler, B. Leibe, and L. Kobbelt, “Improving image-based localization by active correspondence search,” in European Conference on Computer Vision. Springer, 2012, pp. 752–765.
 [20] ——, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1744–1756, 2017.
 [21] S. Lynen, T. Sattler, M. Bosse, J. A. Hesch, M. Pollefeys, and R. Siegwart, “Get out of my lab: Large-scale, real-time visual-inertial localization,” in Robotics: Science and Systems, 2015.
 [22] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in IEEE International Conference on Computer Vision, 2011, pp. 2548–2555.
 [23] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to sift or surf,” in IEEE International Conference on Computer Vision, 2011, pp. 2564–2571.
 [24] Y. Feng, L. Fan, and Y. Wu, “Fast localization in large-scale environments using supervised indexing of binary features,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 343–358, 2016.
 [25] J. Cheng, C. Leng, J. Wu, H. Cui, and H. Lu, “Fast and accurate image matching with cascade hashing for 3d reconstruction,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1–8.
 [26] N.-T. Tran, D.-K. Le Tan, A.-D. Doan, T.-T. Do, T.-A. Bui, M. Tan, and N.-M. Cheung, “On-device scalable image-based localization via prioritized cascade search and fast one-many RANSAC,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1675–1690, 2019.
 [27] J. Straub, S. Hilsenbeck, G. Schroth, R. Huitl, A. Möller, and E. Steinbach, “Fast relocalization for visual odometry using binary features,” in IEEE International Conference on Image Processing, 2013, pp. 2548–2552.
 [28] L. Paulevé, H. Jégou, and L. Amsaleg, “Locality sensitive hashing: A comparison of hash function types and querying mechanisms,” Pattern Recognition Letters, vol. 31, no. 11, pp. 1348–1358, 2010.
 [29] L. Han and L. Fang, “MILD: Multi-index hashing for appearance based loop closure detection,” in IEEE International Conference on Multimedia and Expo, 2017, pp. 139–144.
 [30] R. L. Graham, D. E. Knuth, O. Patashnik, and S. Liu, “Concrete mathematics: a foundation for computer science,” Computers in Physics, vol. 3, no. 5, pp. 106–107, 1989.
 [31] L. Carlone and S. Karaman, “Attention and anticipation in fast visual-inertial navigation,” IEEE Transactions on Robotics, vol. 35, no. 1, pp. 1–20, 2019.
 [32] Y. Zhao and P. Vela, “Good feature selection for least squares pose optimization in VO/VSLAM,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018, pp. 3569–3574.
 [33] M. X. Goemans and V. Ramakrishnan, “Minimizing submodular functions over families of sets,” Combinatorica, vol. 15, no. 4, pp. 499–513, 1995.
 [34] M. Shamaiah, S. Banerjee, and H. Vikalo, “Greedy sensor selection: Leveraging submodularity,” in IEEE Conference on Decision and Control, 2010, pp. 2572–2577.
 [35] M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman, “The new college vision and laser data set,” The International Journal of Robotics Research, vol. 28, no. 5, pp. 595–599, May 2009.
 [36] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The EuRoC micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
 [37] J. Sturm, W. Burgard, and D. Cremers, “Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark,” in Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RSJ International Conference on Intelligent Robot Systems, 2012.
 [38] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
 [39] X. Gao, R. Wang, N. Demmel, and D. Cremers, “LDSO: Direct sparse odometry with loop closure,” IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2198–2204, 2018.
 [40] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “SVO: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2017.
 [41] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 4, 2017.