Place recognition: An Overview of Vision Perspective

06/17/2017 ∙ by Chaoyang Zhu, et al. ∙ 0

Place recognition is one of the most fundamental topics in computer vision and robotics communities, where the task is to accurately and efficiently recognize the location of a given query image. Despite years of wisdom accumulated in this field, place recognition still remains an open problem due to the various ways in which the appearance of real-world places may differ. This paper presents an overview of the place recognition literature. Since condition invariant and viewpoint invariant features are essential factors to long-term robust visual place recognition system, We start with traditional image description methodology developed in the past, which exploit techniques from image retrieval field. Recently, the rapid advances of related fields such as object detection and image classification have inspired a new technique to improve visual place recognition system, i.e., convolutional neural networks (CNNs). Thus we then introduce recent progress of visual place recognition system based on CNNs to automatically learn better image representations for places. Eventually, we close with discussions and future work of place recognition.



There are no comments yet.


page 2

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Place recognition has attracted a significant amount of attention in computer vision and robotics communities, as evidenced by related citations and a number of workshops dedicated to improve long-term robot navigation and autonomy. It has a number of applications ranging from autonomous driving, robot navigation to augmented reality, geo-localizing archival imagery.

The process of identifying the location of a given image by querying the locations of images belonging to the same place in a large geotagged database, usually known as place recognition, is still an open problem. One major characteristic that separates place recognition from other visual recognition tasks is that place recognition has to solve condition invariant recognition to a degree that many other fields haven’t. How can we robustly identify the same real-world place undergoing major changes in appearance, e.g., illumination variation (Figure  1), change of seasons (Figure  2) or weather, structural modifications over time, and viewpoint change. To be clear, above changes in appearance are summarized as conditional variations, but excludes the viewpoint change. Moreover, how can we distinguish true images from similarly looking images without supervision? Since collecting geotagged datasets is time-consuming and labor intensive, and situations like indoor places do not have GPS information necessarily. Place recognition task has been traditionally cast as an image retrieval task Yu et al. [2015b] where image representations for places are essential. The fundamental scientific question is that what is the appropriate representation of a place that is informative enough to recognize real-world places, yet compact enough to satisfy the real-time processing requirement on a terminal, such as mobile phone and a robot.

At early stages, place recognition was dominated by sophisticated local-invariant feature extractors such as SIFT Lowe [2004] and SURF Bay et al. [2006], hand-crafted global image descriptors such as GIST Oliva and Torralba [2001, 2006], and bag-of-visual-words Philbin et al. [2007]; Sivic et al. [2003]

approach. These traditional feature extraction techniques have led to a great step towards the ultimate outcome.

Recent years have witnessed a prosperous advancement of visual content recognition using a powerful image representations extractor - Convolutional Neural Networks (CNNs) Krizhevsky et al. [2012]; Simonyan and Zisserman [2014]; Yu et al. [2017b, a], which sets state-of-the-art performance on many category-level recognition tasks such as object classification Krizhevsky et al. [2012]; Simonyan and Zisserman [2014]; Christian et al. [2014]

, scene recognition

Zhou et al. [2014]; Yuan et al. [2015]; Yu et al. [2015a], image classification Yu et al. [2012]. Principle ideas of CNNs can date back to 1980s, and the major two reasons why CNNs are so successful in computer vision are the advances of GPU-based computation power and data volume respectively. Recent studies show that general features extracted by CNNs can be transferable Sharif Razavian et al. [2014]

and generalized well to other visual recognition tasks. Semantic gap is a well-known problem in place recognition, where different semantics of places may share common low-level features extracted by SIFT, e.g., colors, textures. Convolutional Neural Networks may bridge this semantic gap by treating an image as an high-level feature vector, extracted through deep stacked layers.

This paper provides an overview of both traditional and deep learning based descriptive techniques widely applied to place recognition task, which by no means is exhaustive. The remainder of this paper is organized as follows : section 1 gives an introduction of place recognition literature. Section 2 talks about local and global image descriptors widely applied to place recognition. Section 3 presents a brief view of convolutional neural networks and corresponding techniques used in place recognition. Section 4 discusses the future work in place recognition.

Figure 1: TokyoTimeMachine dataset Arandjelović et al. [2016] : images from the same place with different illumination conditions : day, sunset, night.
Figure 2: Frames extracted from the Nordland dataset Sünderhauf et al. [2013] that belong to the same place in spring, summer, fall, winter.

2 Traditional Image Descriptors

2.1 Local Image Descriptors

Local feature descriptors such as SIFT Lowe [2004] and SURF Bay et al. [2006] have been widely applied to visual localization and place recognition tasks. They achieve an outstanding performance on viewpoint invariance. SIFT and SURF describes the local appearance of individual patches or key points within an image, for an image, represents the set of local invariant features, in which is the local feature. The length of depends on the number of key points of the image. These local invariant features are then usually aggregated into a compact single vector for the entire image, using techniques such as bag-of-visual-words Philbin et al. [2007]; Sivic et al. [2003], VLAD Jégou et al. [2010] and Fisher kernel Jaakkola et al. [1999]; Perronnin and Dance [2007].

Since local feature extraction consists of two phases : detection and description, a number of variations and extensions of techniques for the two phases are developed. For example, Mei et al. [2009] used FAST Rosten and Drummond [2006] to detect the interested patches of an image, which were then described by SIFT. Churchill and Newman [2013] used FAST as well during detection phase, whereas they use BRIEF Calonder et al. [2012] to describe the key points instead of SIFT.

State-of-the-art visual simultaneous localization and mapping (SLAM) systems such as FAB-MAP Cummins and Newman [2008, 2011] used bag-of-visual-words to construct the final image representation. Each image may contain hundreds of local features, which is impractical in large-scale and real-time processing place recognition task, moreover, they require an enormous amount of memory to store the high dimensional features. The bag-of-visual-words mimics the bag-of-words technique used in efficient text retrieval field. It typically needs to form a codebook with visual words, each visual word

is a centroid of a cluster, usually gained by k-means

Kanungo et al. [2002]. Then each local invariant feature is assigned to its nearest cluster centroid, i.e., the visual word. A histogram vector of dimension containing the frequency of each visual word being assigned can be formed this way. The bag-of-visual-words model and local image descriptors generally ignore the geometric structure of the image, that is, different orders of local invariant features in will not impact on the histogram vector, thus the resulting image representation is viewpoint invariant. There are several variations on how to normalize the histogram vector, a common choice is normalization. Components of the vector are then weighted by inverse document frequency (idf). However, the codebook is dataset dependent and needs to be retrained if a robot moves into a new region it has never seen before. The lack of structural information of the image can also weaken the performance of the model.

Fisher Kernel proposed by Jaakkola et al. [1999]; Perronnin and Dance [2007]

is a powerful tool in pattern classification combining the strengths of generative models and discriminative classifiers. It defines a generative probability model

, which is the probability density function with parameter

. Then one can characterize the set of local invariant features with the following gradient vector :


where the gradient of the log-likelihood describes the direction in which parameters should be modified to best fit the observed data intuitively. It transforms a variable length sample into a fixed length vector whose size is only dependent on the number of parameters in the model.

Perronnin and Dance [2007]

applied Fisher kernel in the context of image classification with a Gaussian Mixture Model to model the visual words. In comparison with the bag-of-visual-words representation, they obtain a

dimensional image representation of a local invariant feature set, while dimensional image representation using bag-of-visual-words. Thus, fisher kernel can provide richer information under the circumstance that their size of codebook is equal, or, fewer visual words are required by this more sophisticated representation.

Jégou et al. [2010] propose a new local aggregation method called Vectors of locally aggregated descriptors (VLAD), which becomes state-of-the-art technique compared to bag-of-visual-words and fisher vector. The final representation is computed as follows:


The VLAD vector is represented by where the indices and respectively index the visual word and the component of local invariant feature. The vector is subsequently normalized. We can see that the final representation stores the sum of all the residuals between local invariant features and its nearest visual word. One excellent property of VLAD vector is that it’s relatively sparse and very structured, Jégou et al. [2010] show that a principle component analysis is likely to capture this structure for dimensionality reduction without much degradation of representation. They obtain a comparable search quality to bag-of-visual-words and Fisher kernel with at least an order of magnitude less memory.

2.2 Global Image Descriptors

The key difference between local place descriptors and global place descriptors is the presence of detection phase. One can easily figure out that local place descriptors turns into global descriptors by predefining the key points as the whole image. WI-SURF Badino et al. [2012] used whole-image descriptors based on SURF features and BRIEF-GIST Sünderhauf and Protzel [2011] used BRIEF Calonder et al. [2012] features in a similar whole-image fashion.

A representative global descriptor is GIST Oliva and Torralba [2001, 2006]. It has been shown to suitably model semantically meaningful and visually similar scenes in a very compact vector. The amount of perceptual and semantic information that observers comprehend within a glance (around 200ms) refers to the gist of the scene, termed Spatial Envelope properties, it encodes the dominant spatial layout information of the image. GIST uses Gabor filters at different orientations and different frequencies to extract information from the image. The results are averaged to generate a compact vector that represents the ”gist” of a scene. Murillo and Kosecka [2009] applied GIST into large-scale omnidirectional imagery and obtains a nice segmentation of the search space into clusters, e.g., tall buildings, streets, open areas, mountains. Siagian and Itti [2009] followed a biological strategy which first computes the gist of a scene to produce a coarse localization hypothesis, then refine it by locating salient landmark points in the scene.

The performance of techniques described in section 2 mainly depends on the size of codebook , if too small, the codebook will not characterize the dataset well, on the contrast, if the size is too large, it will require huge computational resources plus time. While global image descriptors have their own disadvantages, they usually assume that images are taken from a same viewpoint.

3 Convolutional Neural Networks

Recently Convolutional Neural Networks achieve state-of-the-art performance on various classification and recognition tasks, e.g., handwriting digits recognition, object classification Krizhevsky et al. [2012]; Simonyan and Zisserman [2014]; Christian et al. [2014], scene recognition Zhou et al. [2014]; Yuan et al. [2015]. Features extracted from convolutional neural networks trained on very large datasets significantly outperforms SIFT on a variety of vision tasks Krizhevsky et al. [2012]; Fischer et al. [2014]. The core idea behind CNNs is the ability to automatically learn high-level features trained on a significant amount of data through deep stacked layers in an end-to-end manner. It works as a function that takes some inputs such as images and output the image representations characterized by a vector. A common CNN model for fine-tuning is vgg-16 Simonyan and Zisserman [2014], the architecture can be seen from Figure  3. For an intuitive understanding of what the CNN model learns in each layer, please check Sünderhauf et al. [2015]; Arandjelović et al. [2016] for heatmap graphical explanation.

Chen et al. [2014] was the first work to exploit CNN in place recognition system as a feature extractor. They used a pre-trained CNN called Overfeat Sermanet et al. [2013]

, which is originally proposed for the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and proved that advantages of deep learning can shift to place recognition task.

Fischer et al. [2014] provides an investigation on the performance of CNN and SIFT on a descriptor matching benchmark. Sünderhauf et al. [2015] comprehensively evaluates and compares the utility and viewpoint-invariant properties of CNNs. They’ve shown that features extracted from middle layers of CNNs have a good robustness against conditional changes, including illumination change, seasonal and weather changes, while features extracted from top layers are more robust to viewpoint change. Features learnt by CNNs are proved to be versatile and transferable, i.e., even though they were trained on a specific target task, they can be successfully deployed for other problems and often outperform traditional hand-engineered features Sharif Razavian et al. [2014]. However, their usage as black-box descriptors in place recognition has so far yielded limited improvements. For example, visual cues that are relevant for object classification may not benefit the place recognition task. Arandjelović et al. [2016]

design a new CNN architecture based on vgg16 and the VLAD representation, they remove the all the fully connected layers and plug the VLAD layers into it by making it differentiable. The loss function used in this paper is triplet loss, which can be seen in many other recognition tasks.

Zhou et al. [2014] gathered a huge scene-centric dataset called ”Places” containing more than 7 million images from 476 scene categories. However, scene recognition is fundamentally different from place recognition. Chen et al. [2017] creates for the first time a large-scale place-centric dataset called SPED containing over 2.5 million images.

Though CNNs has the power to extract high-level features, we are far away from making full use of it. How to gather a sufficient amount of data for place recognition, how to train a CNN model in an end-to-end manner to automatically choose the optimal features to represent the image, are still underlying problems to be solved.

Figure 3:

VGG-16 configuration (shown in columns). The depth of the configuration increases from the left(A) to the right(E), as more layers are added(the added layers are shown in bold). The convolutional layer parameters are denoted as ”conv(receptive field size)-(number of channels)”. The ReLU activation function is omitted for simplicity.

4 Discussion and Future Work

Place recognition has made great advances in the last few decades, e.g., the principles of how animals recognize and remember places and relationship between places in a neuroscience perspective Lowry et al. [2016], a new way of describing places using convolutional neural networks, a number of datasets specifically for places are put forward. However, we are still a long way from a robust long-term visual place recognition system which can be well applied to a variety of scenarios of real-world places. Hence, we highlight several promising avenues of ongoing and future research that are leading us closer to this ultimate outcome.

Place recognition is becoming a hot research field and benefiting from related ongoing works in other fields, especially the enormous successes achieved in computer vision through deep learning technique, e.g., image classification, object detection, scene recognition. While features extracted from pre-trained CNNs on other vision tasks are shown to be transferable and versatile, they’ve so far yielded unsatisfactory performance on place recognition task. There is a high possibility that we still don’t fully exploit the potential of CNNs, we can improve the performance in two aspects. First, gather a sufficient amount of place-centric data covering various environments including illumination, weather, structure, season and viewpoint change. One alternative and reliable source is Google Street View Time Machine. If you train a CNN on a small-size dataset, the model usually works awful on other datasets since place recognition is dataset-dependent. And one needs to retrain it when a new dataset is fed into it. Since one of advantages of CNNs is to extract representative features through Big data, state-of-the-art performance can be improved. Second, an optimized CNN architecture for place recognition task. Real-world images from cameras are usually high resolution, whereas in many cases one needs to downscale the original images. For example, the input size of vgg-16 is 224*224. Moreover, an architecture that is well-suited for object detection may not fit well into place recognition task since their visual cues are different. And designing a good loss function is essential to features. Developments focused on the above two problems will further improve the robustness and performance of existing deep learning-based place recognition techniques.

Place recognition system can also benefit from ongoing researches of object detection, scene classification Yu et al. [2013] and scene recognition Yu et al. [2015a]. Semantic context from scene, interpreted as the ”gist” of the scene, can help to partition the search space when comparing similarity between image representations, which ensures scalability and real-time processing towards real-world application. Note that different places may share a common semantic concept, which needs a furthermore and precise feature mapping procedure. Objects such as pedestrian and trees shoulsd be avoided, while objects like buildings and landmarks are important for long-term place recognition. Automatically determining and suppressing features that leads to confusion to visual place recognition systems will also improve the place recognition performance.


  • Arandjelović et al. [2016] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In

    IEEE Conference on Computer Vision and Pattern Recognition

    , 2016.
  • Badino et al. [2012] Hernán Badino, Daniel Huber, and Takeo Kanade. Real-time topometric localization. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1635–1642. IEEE, 2012.
  • Bay et al. [2006] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. Computer vision–ECCV 2006, pages 404–417, 2006.
  • Calonder et al. [2012] Michael Calonder, Vincent Lepetit, Mustafa Ozuysal, Tomasz Trzcinski, Christoph Strecha, and Pascal Fua. Brief: Computing a local binary descriptor very fast. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1281–1298, 2012.
  • Chen et al. [2014] Zetao Chen, Obadiah Lam, Adam Jacobson, and Michael Milford. Convolutional neural network-based place recognition. CoRR, abs/1411.1509, 2014. URL
  • Chen et al. [2017] Zetao Chen, Adam Jacobson, Niko Sünderhauf, Ben Upcroft, Lingqiao Liu, Chunhua Shen, Ian D. Reid, and Michael Milford. Deep learning features at scale for visual place recognition. CoRR, abs/1701.05105, 2017. URL
  • Christian et al. [2014] Szegedy Christian et al. Going deeper with convolutions. arxiv preprint. arXiv preprint arXiv:1409.4842, 2014.
  • Churchill and Newman [2013] Winston Churchill and Paul Newman. Experience-based navigation for long-term localisation. The International Journal of Robotics Research, 32(14):1645–1661, 2013.
  • Cummins and Newman [2008] Mark Cummins and Paul Newman. FAB-MAP: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 27(6):647–665, 2008. doi: 10.1177/0278364908090961.
  • Cummins and Newman [2011] Mark Cummins and Paul Newman. Appearance-only slam at large scale with fab-map 2.0. The International Journal of Robotics Research, 30(9):1100–1123, 2011.
  • Fischer et al. [2014] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox. Descriptor matching with convolutional neural networks: a comparison to sift. arXiv preprint arXiv:1405.5769, 2014.
  • Jaakkola et al. [1999] Tommi S Jaakkola, David Haussler, et al. Exploiting generative models in discriminative classifiers. Advances in neural information processing systems, pages 487–493, 1999.
  • Jégou et al. [2010] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3304–3311. IEEE, 2010.
  • Kanungo et al. [2002] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE transactions on pattern analysis and machine intelligence, 24(7):881–892, 2002.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Lowe [2004] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • Lowry et al. [2016] Stephanie Lowry, Niko Sünderhauf, Paul Newman, John J Leonard, David Cox, Peter Corke, and Michael J Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19, 2016.
  • Mei et al. [2009] Christopher Mei, Gabe Sibley, Mark Cummins, Paul M Newman, and Ian D Reid. A constant-time efficient stereo slam system. In BMVC, pages 1–11, 2009.
  • Murillo and Kosecka [2009] Ana C Murillo and Jana Kosecka. Experiments in place recognition using gist panoramas. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 2196–2203. IEEE, 2009.
  • Oliva and Torralba [2001] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 42(3):145–175, 2001.
  • Oliva and Torralba [2006] Aude Oliva and Antonio Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in brain research, 155:23–36, 2006.
  • Perronnin and Dance [2007] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.
  • Philbin et al. [2007] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.
  • Rosten and Drummond [2006] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. Computer vision–ECCV 2006, pages 430–443, 2006.
  • Sermanet et al. [2013] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • Sharif Razavian et al. [2014] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: An astounding baseline for recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2014.
  • Siagian and Itti [2009] Christian Siagian and Laurent Itti. Biologically inspired mobile robot vision localization. IEEE Transactions on Robotics, 25(4):861–873, 2009.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sivic et al. [2003] Josef Sivic, Andrew Zisserman, et al. Video google: A text retrieval approach to object matching in videos. In iccv, volume 2, pages 1470–1477, 2003.
  • Sünderhauf and Protzel [2011] Niko Sünderhauf and Peter Protzel. Brief-gist-closing the loop by simple means. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 1234–1241. IEEE, 2011.
  • Sünderhauf et al. [2013] Niko Sünderhauf, Peer Neubert, and Peter Protzel. Are we there yet? challenging seqslam on a 3000 km journey across all four seasons. In Proc. of Workshop on Long-Term Autonomy, IEEE International Conference on Robotics and Automation (ICRA), page 2013, 2013.
  • Sünderhauf et al. [2015] Niko Sünderhauf, Sareh Shirazi, Feras Dayoub, Ben Upcroft, and Michael Milford. On the performance of convnet features for place recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 4297–4304. IEEE, 2015.
  • Yu et al. [2012] Jun Yu, Dacheng Tao, and Meng Wang. Adaptive hypergraph learning and its application in image classification. IEEE Transactions on Image Processing, 21(7):3262–3272, 2012.
  • Yu et al. [2013] Jun Yu, Dacheng Tao, Yong Rui, and Jun Cheng. Pairwise constraints based multiview features fusion for scene classification. Pattern Recognition, 46(2):483–496, 2013.
  • Yu et al. [2015a] Jun Yu, Chaoqun Hong, Dapeng Tao, and Meng Wang. Semantic embedding for indoor scene recognition by weighted hypergraph learning. Signal Processing, 112:129–136, 2015a.
  • Yu et al. [2015b] Jun Yu, Dacheng Tao, Meng Wang, and Yong Rui. Learning to rank using user clicks and visual features for image retrieval. IEEE transactions on cybernetics, 45(4):767–779, 2015b.
  • Yu et al. [2017a] Jun Yu, Xiaokang Yang, Fei Gao, and Dacheng Tao. Deep multimodal distance metric learning using click constraints for image ranking. IEEE transactions on cybernetics, 2017a. doi: 10.1109/TCYB.2016.2591583.
  • Yu et al. [2017b] Jun Yu, Baopeng Zhang, Zhengzhong Kuang, Dan Lin, and Jianping Fan. iprivacy: image privacy protection by identifying sensitive objects via deep multi-task learning. IEEE Transactions on Information Forensics and Security, 12(5):1005–1016, 2017b.
  • Yuan et al. [2015] Yuan Yuan, Lichao Mou, and Xiaoqiang Lu. Scene recognition by manifold regularized deep learning architecture. IEEE transactions on neural networks and learning systems, 26(10):2222–2233, 2015.
  • Zhou et al. [2014] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva.

    Learning deep features for scene recognition using places database.

    In Advances in neural information processing systems, pages 487–495, 2014.