MILD: Multi-Index hashing for Loop closure Detection

02/28/2017 ∙ by Lei Han, et al. ∙ The Hong Kong University of Science and Technology

Loop Closure Detection (LCD) has proved to be extremely useful in globally consistent visual Simultaneous Localization and Mapping (SLAM) and appearance-based robot relocalization. Methods exploiting binary features in a bag-of-words representation have recently gained popularity for their efficiency, but suffer from low recall because high-dimensional binary feature descriptors lack well-defined centroids. In this paper, we propose a real-time LCD approach called MILD (Multi-Index Hashing for Loop closure Detection), in which image similarity is measured by direct feature matching to achieve high recall; with the aid of Multi-Index Hashing (MIH), this introduces no extra computational complexity. A theoretical analysis of the approximate image similarity measurement using MIH is presented, which reveals the trade-off between efficiency and accuracy from a probabilistic perspective. Extensive comparisons with state-of-the-art LCD methods demonstrate the superiority of MILD in both efficiency and accuracy.




1 Introduction

Visual Loop Closure Detection (LCD) tries to detect previously visited places based on appearance information of the scene. LCD can play an important part in globally consistent visual Simultaneous Localization and Mapping (SLAM) systems [1, 2] and appearance-based robot relocalization [3]. For visual SLAM, state-of-the-art approaches [2] only handle a local window of recently added frames, while earlier frames are marginalized out to limit computational complexity, resulting in the accumulation of state (position and orientation) error. LCD is introduced to identify places that have already been visited, thus creating an observation that links a historical state to the current state. The accumulated error can be effectively reduced based on this observation.

The most widely used LCD methods are local-feature-based methods, which model image similarity using hand-crafted features. Since [10], most methods [4, 5, 6, 7, 8, 9] have used the Bag of Words (BOW) scheme to represent an image: feature points are extracted from the image and clustered into centroids called visual words. A histogram of the visual words that appear is then used to represent the image, and the similarity of an image pair is computed from the difference between their visual-word histograms. One well-known drawback of BOW is the perceptual aliasing introduced in the clustering step when two dissimilar features are assigned to the same visual word. The performance of clustering depends on the quality of a dictionary trained beforehand [4] or online [5].

Conventional methods [4, 5, 6] using real-valued features like SIFT [11] or SURF [12] suffer from high computational complexity in feature extraction and feature classification. To deal with this problem, recent methods like BOBW [7], IBuILD [8] and ORBSLAM [9] use efficient binary features like ORB [13] or BRISK [14]. While binary-feature-based LCD methods can run in real time, their accuracy (typically measured by the precision and recall metrics [5]) is not satisfactory.

In this paper, MILD: Multi-Index hashing for appearance based Loop closure Detection is proposed as an appearance based LCD approach exploiting the efficiency of binary features. Instead of using BOW representation widely adopted by previous methods, image similarity is measured based on direct feature matching without introducing additional computational complexity with the aid of Multi-Index Hashing (MIH) [15]. Contributions of this paper include:

  • We propose a novel LCD system based on Multi-Index hashing (MILD). In particular, we do not explicitly find the exact nearest neighbor of each feature or use BOW representation for images. Instead, MIH is used to approximate the image similarity measurement, so that redundant computations between dissimilar features can be avoided.

  • The approximated image similarity measurement based on MIH is analyzed from a probabilistic perspective, which effectively reveals the trade-off between the accuracy and complexity in MILD, ensuring the superiority of MILD in high accuracy and low complexity compared with state-of-the-art algorithms.

  • The detection of multiple loop closures is enabled in MILD, while most of the previous works [5, 6, 16] assume that loop closure only occurs once in the candidate dataset for each query image.

Figure 1: Overview of the proposed MILD: Multi-Index Hashing for Loop closure Detection.

2 Related Work

In this work, we focus on LCD by local image feature. Approaches such as global image descriptor [17, 18] or exploiting illumination invariant components to improve image similarity measurement under different lighting conditions [19] are not discussed, but can be combined for a more robust LCD system.

The accuracy of binary-feature-based LCD methods is not satisfactory. The authors of [20, 21] investigate this problem and find that binary features are not straightforward to cluster using existing nearest neighbor search methods, due to the high dimensionality and the nature of the binary descriptor space. To overcome this deficiency, [21] projects binary features into a real-valued vector space and implements nearest neighbor search in this space.

An alternative way for LCD is direct feature match as proposed by [16, 22]. Instead of using BOW representation, [16] proposes to use raw features to represent an image directly (BoRF), which significantly improves the recall performance. [22] adopts Locality Sensitive Hashing (LSH) for fast approximate nearest neighbor search based on the SIFT feature. These methods suffer from high computational complexity and cannot scale well with the increase of candidate images.

We address this problem by Multi-Index Hashing (MIH) proposed by [15] to hash long binary codes for fast information retrieval. Recently [23] uses MIH for exact nearest neighbor search and tries to find the optimal substring length given the database size, code length and search radius to minimize the upper bound of the search cost. Experiments show that search cost grows rapidly with the increase of search radius.

As a method of nearest neighbor search, MIH has already been used in different applications such as image relocalization [24] and image search [25]. [24] follows the same procedure as [23] and reports that MIH is inefficient at finding the exact nearest neighbor for each feature, while [25] only explores the use of partial binary descriptors created in MIH as direct codebook indices and follows a traditional BOW method to measure image similarity.

In contrast to the previous methods, we do not explicitly find the exact nearest neighbor of each feature or use a BOW representation for images. Instead, MIH is used to approximate the image similarity function proposed in [26]. The accuracy and efficiency of this approximation are analyzed from a probabilistic perspective.

3 MILD: Multi-Index Hashing for Loop closure Detection

The framework of MILD is shown in Fig. 1. MILD can be divided into two stages. The first stage calculates the similarity between the current image and a candidate set constructed from all the previous images. Each image is represented by its set of binary local features. Here the ORB feature [13] is used for its computational efficiency and rotation invariance; each descriptor is a 256-bit binary sequence. In the second stage, given the image similarity, a Bayesian filter is applied to calculate the probability of loop closure for each candidate.

3.1 Image Similarity Measurement

We define the similarity of an image pair in terms of the binary feature similarity [26] between the features of the two images (Eqn. (1)). The binary feature similarity (Eqn. (2)) is determined by the Hamming distance between two binary features, a weighting parameter, and a pre-defined Hamming distance threshold.

A straightforward way to calculate the image similarity is a linear search over all the candidates. However, the computational cost may be prohibitive for large datasets. Since the number of repeated or highly similar features between the current image and previous images is limited, the valid similarity measurements are highly sparse. We therefore propose to use Multi-Index Hashing (MIH) to avoid invalid computations, since MIH is capable of distinguishing similar features. More analysis is provided in Section 3.3.
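The linear-search baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian-shaped weighting `exp(-d^2 / sigma^2)`, the parameter names `sigma` and `d0`, their default values, and the absence of any normalisation are all assumptions made for the example.

```python
import math


def hamming(a: int, b: int) -> int:
    """Hamming distance between two descriptors stored as Python ints."""
    return bin(a ^ b).count("1")


def feature_similarity(a: int, b: int, sigma: float = 30.0, d0: int = 50) -> float:
    """Thresholded similarity of two binary features.

    ASSUMPTION: Gaussian-shaped weighting of the Hamming distance, cut off
    at the threshold d0. sigma and d0 are hypothetical values.
    """
    d = hamming(a, b)
    if d > d0:
        return 0.0
    return math.exp(-(d * d) / (sigma * sigma))


def image_similarity_linear(query_feats, cand_feats, sigma=30.0, d0=50) -> float:
    """Exhaustive comparison of every query feature with every candidate feature.

    Cost grows with len(query_feats) * len(cand_feats) per candidate image,
    which is the expense MIH is meant to avoid.
    """
    return sum(
        feature_similarity(q, c, sigma, d0)
        for q in query_feats
        for c in cand_feats
    )
```

Most feature pairs in such a search lie beyond the distance threshold and contribute nothing, which is why the valid measurements are sparse.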

Figure 2: Framework of MIH. A binary feature is divided into m disjoint substrings. The k-th substring is the hash index of the k-th hash table. The image index and feature index are stored in the corresponding entries as a reference to the feature.

As illustrated in Fig. 2, in MIH a long binary feature is hashed once for each of its disjoint substrings. More precisely, if the Hamming distance between two features is smaller than r and each feature is divided into m disjoint substrings, then in at least one substring the Hamming distance between the two features is smaller than r/m [23]. This implies that two features with a small Hamming distance fall into the same entry in at least one hash table with probability close to 1. The image similarity measurement in Eqn. (1) can then be approximated using MIH, where the database is constructed online from the candidate set and the image similarity is measured during the query stage. In practice, database construction and query are performed with MIH simultaneously.
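The substring property can be checked numerically. The sketch below assumes 256-bit descriptors stored as Python ints and split into m = 16 substrings of s = 16 bits; the helper names are hypothetical. With fewer than 16 differing bits, at least one substring must be identical in both features, so the pair shares a hash entry.

```python
import random


def split_substrings(desc: int, m: int = 16, s: int = 16):
    """Split a descriptor into m disjoint substrings of s bits each."""
    mask = (1 << s) - 1
    return [(desc >> (k * s)) & mask for k in range(m)]


def min_substring_distance(a: int, b: int, m: int = 16, s: int = 16) -> int:
    """Smallest Hamming distance over corresponding substrings of a and b."""
    pairs = zip(split_substrings(a, m, s), split_substrings(b, m, s))
    return min(bin(x ^ y).count("1") for x, y in pairs)


def flip_random_bits(desc: int, n_flips: int, n_bits: int = 256, rng=random) -> int:
    """Return a copy of desc with n_flips distinct bit positions inverted."""
    for pos in rng.sample(range(n_bits), n_flips):
        desc ^= 1 << pos
    return desc


rng = random.Random(0)
a = rng.getrandbits(256)
b = flip_random_bits(a, 15, rng=rng)  # total distance 15 < r = 16
# r / m = 1, so some substring has distance < 1, i.e. it matches exactly
shares_entry = min_substring_distance(a, b) == 0
```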

  • Database construction: For every input image and its feature set, all features are hashed into the hash tables by separating each feature into disjoint substrings, where the k-th substring is the hash index of the k-th hash table.

  • Query: For a newly arrived query image and its binary feature set, the similarity between the query image and every candidate is first initialized. For each query feature, the stored features that fall into the same hash entry are collected, and the similarity in Eqn. (1) is approximated by evaluating the feature similarity over these collected features only (Eqn. (3)).

Examining Eqn. (3), the approximated similarity can be calculated by a single pass over the features of the query image during the hashing process. The collected features are a subset of all stored features. The probability that a stored feature falls into this collection (denoted as the recall probability) is related to its Hamming distance to the query feature and to the number of hash tables in MIH. The detailed analysis of the approximation error between the exact and the approximated similarity is provided in Section 3.3.

3.2 Bayesian Inference

Bayesian inference is used to select true loop closures based on the image similarity measurement and the temporal coherency of camera movement [5]. To enable the detection of multiple loop closures, we propose to extend the random variable representing the loop closure hypotheses at each time step to a set of binary random variables, one per past image, each representing the event that the current image closes the loop with that past image. In this way, the time evolution model is formulated as


where . Thus, the belief can be computed as


Recall that the image similarity measurement is given by Eqn. 3, the likelihood is computed as [5]


where the two parameters are the mean and standard deviation of the sequence of similarity measurements. Finally, the loop closure probability given all the previous similarity measurements can be computed as


where is defined as a fixed value to normalize the output loop closure probability. The candidates whose loop closure probability is larger than the threshold will be the detected loop closures.
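The likelihood and thresholding steps can be sketched as follows. The likelihood form used here (a score at least one standard deviation above the mean yields `(score - std) / mean`, otherwise 1) is an assumption in the style of [5]; the time evolution model is not modelled at all, and the `prior`, `normalizer` and `threshold` arguments are hypothetical inputs.

```python
import statistics


def likelihood(scores):
    """Per-candidate likelihood from similarity scores.

    ASSUMPTION: scores well above the mean are boosted, all others are
    left neutral (likelihood 1).
    """
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if mu == 0:
        return [1.0] * len(scores)
    return [(s - sd) / mu if s >= mu + sd else 1.0 for s in scores]


def detect_loop_closures(scores, prior, normalizer, threshold):
    """Return indices of candidates whose normalised belief exceeds threshold.

    prior is the predicted belief for each candidate (supplied by the caller),
    normalizer is a fixed constant, as described in the text.
    """
    like = likelihood(scores)
    belief = [l * p / normalizer for l, p in zip(like, prior)]
    return [i for i, b in enumerate(belief) if b > threshold]
```

Because each candidate is thresholded independently, more than one candidate can be reported for a single query image.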

3.3 Analysis of MIH

Suppose each binary feature is divided into disjoint substrings, one per hash table. The probability that a feature pair with a given Hamming distance falls into the same entry in at least one of the hash tables is denoted as the recall probability. This is equivalent to throwing independent balls (one per differing bit) randomly into bins (one per substring) and asking for the probability that at least one bin receives no ball, which, under the assumption of uniformly distributed Hamming errors, is a solved problem expressed through the Stirling partition number [27]. Fig. 3 shows how the recall probability changes with the Hamming distance, as well as the influence of the number of hash tables on the recall probability. As expected, a larger Hamming distance yields a smaller recall probability, while a larger number of hash tables makes the decrease of the recall probability more gradual.
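Under the balls-in-bins model, one standard way to write the recall probability is 1 - m! * S(d, m) / m^d, where d is the number of differing bits, m is the number of substrings, and S(d, m) is the Stirling number of the second kind (m! * S(d, m) counts the placements that leave no bin empty). Whether this matches the paper's exact expression is an assumption; the sketch below simply evaluates that textbook formula, and the function names are hypothetical.

```python
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: partitions of n items into k blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def recall_probability(d: int, m: int) -> float:
    """P(at least one of m substrings contains none of the d differing bits).

    ASSUMPTION: differing bits land in substrings uniformly and independently.
    """
    placements_hitting_every_bin = factorial(m) * stirling2(d, m)
    return 1.0 - placements_hitting_every_bin / m ** d
```

For example, `recall_probability(15, 16)` is exactly 1.0, since 15 differing bits cannot touch all 16 substrings.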

Figure 3: The effect of the Hamming distance and the number of hash tables on recall probability.

In LCD, for each feature in the query image, the stored features that describe the same place are referred to as inliers and the others as outliers. The computational complexity of the approximated similarity measurement is then proportional to the average probability that an outlier falls into the same entry as the query feature, while its accuracy can be modeled as the average probability that an inlier does. The unavoidable similarity computations for inliers are not counted in the complexity. Using the statistics of the distance distributions for inliers and outliers of the ORB feature [13], the Hamming distances of outliers and inliers can each be modeled as a Gaussian distribution. Based on this approximation, the accuracy and complexity can be calculated as


Given Eqn. (9), the influence of the number of hash tables on the trade-off between accuracy and complexity of MILD is further presented in Fig. 4. A higher accuracy indicates that the approximation error between the exact and the approximated image similarity is smaller, while a lower complexity indicates more efficiency. Although both accuracy and complexity grow monotonically with the number of hash tables, there exists an interval that achieves a good balance of high accuracy and low complexity. An appropriate number of hash tables can be chosen for different applications according to their relative emphasis on accuracy and complexity. For example, the setting used in our experiments (16 substrings of 16 bits each) guarantees relatively high accuracy and very low computational cost. Experiments show that MILD enables loop closure detection within 15 ms for a database containing more than 1000 images, which is efficient enough for a real-time LCD system.
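The trade-off can be explored numerically by averaging a recall probability over the inlier and outlier distance distributions. In the sketch below, the balls-in-bins recall formula, the discretised Gaussian weights, and especially the means and standard deviations in `INLIER` and `OUTLIER` are hypothetical placeholders, not the statistics used in the paper.

```python
from functools import lru_cache
from math import exp, factorial


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def recall_probability(d: int, m: int) -> float:
    """ASSUMED balls-in-bins model: some substring has no differing bit."""
    return 1.0 - factorial(m) * stirling2(d, m) / m ** d


def gaussian_weights(mean: float, std: float, n_bits: int = 256):
    """Gaussian evaluated on integer distances 0..n_bits, normalised to sum to 1."""
    w = [exp(-0.5 * ((d - mean) / std) ** 2) for d in range(n_bits + 1)]
    total = sum(w)
    return [x / total for x in w]


def expected_recall(mean: float, std: float, m: int, n_bits: int = 256) -> float:
    """Average recall probability under a Gaussian distance distribution."""
    weights = gaussian_weights(mean, std, n_bits)
    return sum(p * recall_probability(d, m) for d, p in enumerate(weights))


# HYPOTHETICAL (mean, std) of Hamming distance for inliers and outliers.
INLIER = (30.0, 10.0)
OUTLIER = (120.0, 20.0)


def tradeoff(ms=(4, 8, 16, 32)):
    """Return (m, accuracy, complexity) for each number of hash tables m."""
    return [(m, expected_recall(*INLIER, m), expected_recall(*OUTLIER, m)) for m in ms]
```

With these placeholder distributions, both quantities rise as m grows, but the inlier recall rises much earlier than the outlier recall, which is the kind of interval the text refers to.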

Figure 4: The effect of the number of hash tables on accuracy and complexity.

Figure 5: Experiments on NewCollege Dataset [4]: (a) image similarity score, (b) detected loop closures, (c) ground truth. The coordinates of each pixel represent the indices of the candidate image and the query image, respectively.

4 Experiments and Discussions

To evaluate the performance of MILD, we conduct extensive experiments on different datasets (footnote 1) and compare with state-of-the-art methods: Angeli [5], RTABMAP [6] and BOWP [28], which are based on SIFT/SURF features, as well as BOBW [7] and IBuILD [8], which use binary features (footnote 2). The implementation of MILD will be publicly available online.

Footnote 1: NewCollege [4] contains 1073 images of size . CityCentre [4] contains 1237 images of size . Lip6Indoor [5] has 388 images of size . Lip6Outdoor [5] has 1063 images of size .

Footnote 2: All the experiments are implemented on an Intel Core i7 @ 2.3 GHz processor with 8 GB RAM. Only one core is used to compare the computational efficiency of MILD with other algorithms. In MILD, 800 ORB features are extracted for each image. Each feature descriptor is divided into 16 substrings of 16 bits each. The feature Hamming distance threshold , and the loop closure probability threshold .

4.1 Subjective Analysis

For a better understanding of MILD, we particularly show intermediate results of MILD on NewCollege dataset [4] in Fig. 5, where the approximated image similarity measurement using MIH is illustrated in Fig. 5(a). Given the image similarity measurement, Bayesian inference is employed to select loop closures among candidates, as shown in Fig. 5(b). Compared with the ground truth of loop closures (Fig. 5(c)), the proposed MILD works effectively, as reflected by the fact that image similarity score in Fig. 5(a) is high when image pair is a true loop closure, and the detected loop closures in Fig. 5(b) highly resemble ground truth.

4.2 Objective Evaluation

The quantitative comparisons regarding accuracy (recall rate at 100% precision) and complexity on different datasets are presented in Table 1, where the performance of the compared methods is taken directly from the reference papers. Examining Table 1, we make the following observations:

  • Angeli [5], RTABMAP [6] and BOWP [28] require hundreds of milliseconds for LCD per image, due to the use of SIFT/SURF features. Although RTABMAP yields the best recall, it takes up to 700 ms to process one query image, which is around twenty times slower than MILD.

  • BOBW [7] and IBuILD [8], which use binary features, can be implemented in real time but suffer from low accuracy, which is around 50% lower than MILD.

  • In contrast, although we do not assume a single loop closure in the inference stage, which potentially introduces more outliers, MILD still achieves competitive performance in both accuracy and complexity: its accuracy is comparable to that of SIFT/SURF-based methods, and it can be successfully implemented in real time.

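Angeli [5]     recall   -       -       80%     71%
               time     -       -       460ms   753ms
RTABMAP [6]    recall   81%     89%     98%     95%
               time     700ms   700ms   100ms   400ms
BOWP [28]      recall   86%     77%     92%     94%
               time     441ms   393ms   69ms    120ms
BOBW [7]       recall   30.6%   55.9%   -       -
               time     20ms    20ms    -       -
IBuILD [8]     recall   38%     -       41.9%   25.5%
               time     -       -       -       -
MILD           recall   83%     87.3%   94.5%   93.4%
               time     36ms    35ms    7ms     9ms
Table 1: Comparisons with state-of-the-art algorithms (recall at 100% precision and processing time per query image; each column is one dataset)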
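The accuracy metric used in Table 1, recall at 100% precision, can be computed from a list of detection scores and ground-truth labels as sketched below. The function name and the convention that a detection is accepted when its score is strictly above the best-scoring false candidate are assumptions made for illustration.

```python
def recall_at_full_precision(scores, is_true_loop) -> float:
    """Largest recall achievable without accepting any false loop closure.

    scores: detection score per (query, candidate) pair.
    is_true_loop: matching ground-truth booleans.
    """
    n_true = sum(1 for t in is_true_loop if t)
    if n_true == 0:
        return 0.0
    false_scores = [s for s, t in zip(scores, is_true_loop) if not t]
    # Any threshold at or below the best false score would admit a false positive.
    cutoff = max(false_scores) if false_scores else float("-inf")
    hits = sum(1 for s, t in zip(scores, is_true_loop) if t and s > cutoff)
    return hits / n_true
```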

For memory requirement, MIH takes 32 bytes to store each feature descriptor and, in each hash table, 4 bytes per feature to store the corresponding image index and feature index. The only fixed overhead of MILD is the 2^s pointers for each hash table, where s is the substring length. In our experiments, s = 16 and there are 16 hash tables in total. For example, the minimum memory required for NewCollege dataset is MB, which is acceptable for modern mobile devices.
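The memory accounting can be reproduced with simple arithmetic. The sketch below uses the per-feature costs stated above (32 bytes per descriptor, 4 bytes per feature per hash table), 800 features per image and 16 tables indexed by 16-bit substrings; the 8-byte pointer size is an assumption, and the result is an estimate rather than the figure reported by the authors.

```python
def mih_memory_bytes(
    n_images: int,
    feats_per_image: int = 800,
    n_tables: int = 16,
    substring_bits: int = 16,
    desc_bytes: int = 32,
    ref_bytes: int = 4,
    pointer_bytes: int = 8,  # ASSUMPTION: 64-bit pointers
):
    """Return (per-feature storage, fixed table overhead) in bytes."""
    n_feats = n_images * feats_per_image
    per_feature = desc_bytes + ref_bytes * n_tables   # 32 + 4 * 16 = 96 bytes
    fixed = n_tables * (2 ** substring_bits) * pointer_bytes
    return n_feats * per_feature, fixed


# NewCollege: 1073 images
feature_bytes, fixed_bytes = mih_memory_bytes(1073)
```

The per-feature term dominates and grows linearly with the number of stored images, while the table overhead stays constant.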

5 Conclusions and Future Work

While MIH has recently shown large potential in exact nearest neighbor search [23], we extend its application to approximate nearest neighbor search and propose a novel Multi-Index Hashing scheme for the Loop closure Detection problem (MILD). Theoretical analysis reveals the trade-off between accuracy and efficiency of MIH in image similarity measurement. Experiments on public datasets show that MILD achieves competitive performance in terms of high accuracy and low complexity, compared with state-of-the-art LCD approaches.

In our work, a uniform distribution of binary codes is assumed, but in practice many features fall into the same entry in the hashing process; such entries are discarded for efficiency. It would be interesting to consider prior knowledge on the non-uniform distribution of different features to improve MILD.


  • [1] Hauke Strasdat, Local accuracy and global consistency for efficient visual slam, Ph.D. thesis, Citeseer, 2012.
  • [2] Jakob Engel, Thomas Schöps, and Daniel Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European Conference on Computer Vision. Springer, 2014, pp. 834–849.
  • [3] Brian Williams, Georg Klein, and Ian Reid, “Automatic relocalization and loop closing for real-time monocular slam,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 9, pp. 1699–1712, 2011.
  • [4] Mark Cummins and Paul Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
  • [5] Adrien Angeli, David Filliat, Stéphane Doncieux, and Jean-Arcady Meyer, “Fast and incremental method for loop-closure detection using bags of visual words,” IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1027–1037, 2008.
  • [6] Mathieu Labbe and Francois Michaud, “Appearance-based loop closure detection for online large-scale and long-term operation,” IEEE Transactions on Robotics, vol. 29, no. 3, pp. 734–745, 2013.
  • [7] Dorian Gálvez-López and Juan D Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
  • [8] Sheraz Khan and Dirk Wollherr, “Ibuild: Incremental bag of binary words for appearance based loop closure detection,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 5441–5447.
  • [9] Raúl Mur-Artal and Juan D Tardós, “Fast relocalisation and loop closing in keyframe-based slam,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 846–853.
  • [10] Josef Sivic and Andrew Zisserman, “Video google: A text retrieval approach to object matching in videos,” in Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003, pp. 1470–1477.
  • [11] David G Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [12] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, “Surf: Speeded up robust features,” in European conference on computer vision. Springer, 2006, pp. 404–417.
  • [13] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2564–2571.
  • [14] Stefan Leutenegger, Margarita Chli, and Roland Y Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2548–2555.
  • [15] Dan Greene, Michal Parnas, and Frances Yao, “Multi-index hashing for information retrieval,” in Foundations of Computer Science, 1994 Proceedings., 35th Annual Symposium on. IEEE, 1994, pp. 722–731.
  • [16] Hong Zhang, “Borf: Loop-closure detection with scale invariant visual features,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 3125–3130.
  • [17] Aude Oliva and Antonio Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
  • [18] Jana Kosecka, Liang Zhou, Philip Barber, and Zoran Duric, “Qualitative image based localization in indoors environments,” in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on. IEEE, 2003, vol. 2, pp. II–3.
  • [19] Will Maddern, Alex Stewart, Colin McManus, Ben Upcroft, Winston Churchill, and Paul Newman, “Illumination invariant imaging: Applications in robust vision-based localisation, mapping and classification for autonomous vehicles,” in Proceedings of the Visual Place Recognition in Changing Environments Workshop, IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 2014, vol. 2, p. 3.
  • [20] Marius Muja and David G Lowe, “Fast matching of binary features,” in Computer and Robot Vision (CRV), 2012 Ninth Conference on. IEEE, 2012, pp. 404–410.
  • [21] Simon Lynen, Michael Bosse, Paul Furgale, and Roland Siegwart, “Placeless place-recognition,” in 2014 2nd International Conference on 3D Vision. IEEE, 2014, vol. 1, pp. 303–310.
  • [22] Hossein Shahbazi and Hong Zhang, “Application of locality sensitive hashing to realtime loop closure detection,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2011, pp. 1228–1233.
  • [23] Mohammad Norouzi, Ali Punjani, and David J Fleet, “Fast exact search in hamming space with multi-index hashing,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 6, pp. 1107–1119, 2014.
  • [24] Youji Feng, Lixin Fan, and Yihong Wu, “Fast localization in large-scale environments using supervised indexing of binary features,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 343–358, 2016.
  • [25] Junjie Cai, Qiong Liu, Francine Chen, Dhiraj Joshi, and Qi Tian, “Scalable image search with multiple index tables,” in Proceedings of International Conference on Multimedia Retrieval. ACM, 2014, p. 407.
  • [26] Liang Zheng, Shengjin Wang, and Qi Tian, “Coupled binary embedding for large-scale image retrieval,” IEEE transactions on image processing, vol. 23, no. 8, pp. 3368–3380, 2014.
  • [27] Ronald L Graham, Concrete mathematics: a foundation for computer science, Pearson Education India, 1994.
  • [28] Nishant Kejriwal, Swagat Kumar, and Tomohiro Shibata, “High performance loop closure detection using bag of word pairs,” Robotics and Autonomous Systems, vol. 77, pp. 55–65, 2016.