Place recognition survey: An update on deep learning approaches

06/19/2021, by Tiago Barros et al.

Autonomous Vehicles (AV) are becoming more capable of navigating in complex environments with dynamic and changing conditions. A key component that enables these intelligent vehicles to overcome such conditions and become more autonomous is the sophistication of the perception and localization systems. As part of the localization system, place recognition has benefited from recent developments in other perception tasks such as place categorization or object recognition, namely with the emergence of deep learning (DL) frameworks. This paper surveys recent approaches and methods used in place recognition, particularly those based on deep learning. The contributions of this work are twofold: surveying recent sensors, such as 3D LiDARs and RADARs, applied in place recognition; and categorizing the various DL-based place recognition works into supervised, unsupervised, semi-supervised, parallel, and hierarchical categories. First, this survey introduces key place recognition concepts to contextualize the reader. Then, sensor characteristics are addressed. The survey proceeds by elaborating on the various DL-based works, presenting summaries for each framework. Some lessons learned from this survey include: the importance of NetVLAD for supervised end-to-end learning; the advantages of unsupervised approaches in place recognition, namely for cross-domain applications; and the increasing tendency of recent works to seek not only higher performance but also higher efficiency.


I Introduction

Self-driving vehicles are increasingly able to deal with unstructured and dynamic environments, mainly due to the development of more robust long-term localization and perception systems. A critical aspect of long-term localization is to guarantee coherent mapping and bounded error over time, which is achieved by finding loops in revisited areas. Revisited places are detected in long-term localization systems by resorting to approaches such as place recognition and loop closure. Namely, place recognition is a perception-based approach that recognizes previously visited places based on visual, structural, or semantic cues.

Place recognition has been the focus of much research over the last decade. The efforts of the intelligent vehicle and machine vision communities, including those devoted to place recognition, resulted in great achievements, namely evolving towards systems that achieve promising performance under appearance-changing and extreme viewpoint-variation conditions. Despite the recent achievements, the fundamental challenges remain unsolved, which occur when:

  • two distinct places look similar (also known as perceptual aliasing);

  • the same places exhibit significant appearance changes over time due to day-night variation, weather, seasonal or structural changes (as shown in Fig. 2);

  • the same place is perceived from different viewpoints or positions.

Solving these challenges is essential to enable robust place recognition and consequently long-term localization.

Fig. 1: Generic place recognition pipeline with the following modules: place modeling, belief generation, and place mapping. Place modeling creates an internal place representation. Place mapping is concerned with maintaining a coherent representation of places over time. Belief generation, finally, generates loop candidates based on the current place model and the map.
Fig. 2: Illustration of the seasonal environment changes. Images taken from the Oxford Robotcar [1] and Nordland dataset [2].

The primary motivation for writing this survey is to provide an updated review of recent place recognition approaches and methods published since the previous surveys [3, 4]. The goal is, in particular, to focus on works that are based on deep learning.

Lowry et al. [3] presented a comprehensive overview of the existing visual place recognition methods up to 2016. The work summarizes and discusses several fundamentals to deal with appearance-changing environments and viewpoint variations. However, the rapid developments in deep learning (DL) and new sensor modalities (e.g., 3D LiDARs and RADARs) are setting unprecedented performances, shifting the place recognition state of the art from traditional (handcrafted-only) feature extraction towards data-driven methods.

A key advantage of these data-driven approaches is end-to-end training, which enables learning a task directly from the sensory data without requiring domain knowledge for feature extraction. Instead, features are learned during training, using Convolutional Neural Networks (CNNs). These feature extraction capabilities have ultimately been the driving force that has inspired recent works to use supervised, unsupervised, or combined (semi-supervised) learning approaches to improve performance. The influence of DL frameworks in place recognition is, in particular, observable in the vast amount of place recognition works published in the last few years that resort to such methods.

On the other hand, a disadvantage of DL methods is the requirement of a vast amount of training data. This requirement is particularly critical since the creation of suitable datasets is a demanding and expensive process. In this regard, place recognition has benefited considerably from the availability of autonomous vehicle datasets, which are becoming more and more realistic. Besides more realistic real-world conditions, data from new sensor modalities are also becoming available, for example, new camera types, 3D LiDARs, and, more recently, RADARs [5, 6]. This work does not address datasets, since this topic is already overviewed in other works, such as [7] for place recognition datasets and [8] for broader autonomous driving datasets.

The contribution of this work is to provide a comprehensive review of the recent methods and approaches, focusing in particular on:

  • the recently introduced sensors in the context of place recognition; an outline of their advantages and disadvantages is presented in Table I and illustrated in Fig. 4;

  • the categorization of the various DL-based works into supervised, unsupervised, semi-supervised, and other frameworks (as illustrated in Fig. 3), in order to provide the reader with a more comprehensive and meaningful understanding of this topic.

The remainder of this paper is organized as follows. Section II is dedicated to the key concepts regarding place recognition. Section III addresses the sensors used in place recognition. Section IV addresses the supervised place recognition approaches, which include pre-trained and end-to-end frameworks. Section V addresses the unsupervised place recognition approaches. Section VI addresses approaches that combine both supervised and unsupervised learning. Section VII addresses alternative frameworks that resort to parallel and hierarchical architectures. Lastly, Section VIII concludes the paper.

Fig. 3: A taxonomy of recent DL-based place recognition approaches.

II Key concepts of place recognition

This section introduces the fundamentals and key concepts of place recognition. Most of the concepts discussed here have already been presented in [3, 9, 10, 11], but they are concisely revisited in this section to contextualize the reader and thus facilitate the reading process.

Before diving into more detail, some fundamental questions have to be addressed: What is a ‘place’ in the place recognition context? How are places recognized and remembered? Moreover, what are the difficulties and challenges when places change over time?

II-A What is a place?

Places are segments of the physical world that can have any given scale: at the limit, a place may represent anything from a single location to an entire region of discrete locations [3] (see examples in Fig. 2). The segments’ physical bounds can be defined using different segmentation criteria: time step, traveled distance, or appearance. In particular, the appearance criterion is widely used in place recognition. In such a case, a new place is created whenever the appearance of the current location differs significantly from previously observed locations [3].

II-B How are places recognized and remembered?

Place recognition is the process of recognizing places within a global map, utilizing cues from the surrounding environment. This process is typically divided into three modules (as illustrated in Fig. 1): place modeling, belief generation, and place mapping.

II-B1 Place Modeling

Place modeling is the module that maps the data from a sensor space into a descriptor space. Sensory data from cameras, 3D LiDARs [12, 13] or RADARs [14] are used to model the surrounding environment, which is achieved by extracting meaningful features.

Feature extraction approaches have evolved immensely over the last decade. Classical approaches rely on handcrafted descriptors such as SIFT [15], SURF [16], Multiscale Superpixel Grids [17], HOG [18], or bag-of-words [19], which are mainly built based on the knowledge of domain experts (see [3] for further understanding). On the other hand, DL-based techniques, namely CNNs, are optimized to learn the best features for a given task [20]. With the increasing dominance of DL in the various perception tasks, place recognition also slowly benefited from these techniques, initially using pre-trained models from other tasks (e.g., object recognition [21] or place categorization [22, 23]), and more recently using end-to-end learning techniques trained directly on place recognition tasks [24, 25].

II-B2 Place Mapping

Place mapping refers to the process of maintaining a faithful representation of the physical world. To this end, place recognition approaches rely on various mapping frameworks and map update mechanisms. Regarding the mapping frameworks, three main approaches are highlighted: database [26, 27], topological [28, 29, 30] or topological-metric [31, 32].

Database frameworks are abstract map structures, which store arbitrary amounts of data without any relation between them. These frameworks are mainly used in pure retrieval tasks and resort, for retrieval efficiency, to k-d trees [33, 34], Chow Liu trees [35], or Hierarchical Navigable Small World (HNSW) graphs [36] to accelerate nearest neighbor search.
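
To make the database retrieval concrete, the sketch below (a minimal illustration, not any specific cited implementation) indexes place descriptors with a k-d tree and retrieves the nearest neighbors of a query descriptor; the descriptor dimensionality and the SciPy-based indexing are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical database of 10,000 places, each described by a 256-D descriptor.
rng = np.random.default_rng(0)
database = rng.standard_normal((10000, 256)).astype(np.float32)

# Build the k-d tree once, offline, over the stored descriptors.
index = cKDTree(database)

# At query time, retrieve the k closest places (Euclidean distance).
query = rng.standard_normal(256).astype(np.float32)
distances, candidate_ids = index.query(query, k=5)
print(candidate_ids, distances)
```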

Topological(-metric) maps, on the other hand, are graph-based frameworks, which represent the map through nodes and edges. The nodes represent places in the physical world, while the edges represent the relationships among the nodes (e.g., the similarity between two nodes). A node may represent one or several locations, defining, in the latter case, a region in the physical world. The topological-metric map differs from the purely topological one in how nodes relate: while pure topological maps store no metric information in the edges, in topological-metric maps, nodes may relate to other nodes through relative position, orientation, or metric distance [3]. An example of such a mapping approach is HTMap [37].
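
As an illustration of the difference between the two map types, the following sketch (node and edge attributes are hypothetical, chosen only for this example) stores place descriptors in graph nodes and attaches metric information to the edges, which is what distinguishes a topological-metric map from a purely topological one.

```python
import numpy as np
import networkx as nx

graph = nx.Graph()

# Each node is a place, holding its descriptor (and optionally a pose).
graph.add_node(0, descriptor=np.zeros(256))
graph.add_node(1, descriptor=np.ones(256))

# A purely topological map would only record that the places are connected;
# a topological-metric map additionally stores metric relations on the edge.
graph.add_edge(0, 1, similarity=0.87, metric_distance=12.4)  # metres

print(graph.edges[0, 1]["metric_distance"])
```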

Regarding map updating, database frameworks are usually not updated during operation time, while topological frameworks can be updated. Update strategies include simple methods, which update nodes as loops occur [28], or more sophisticated ones, where long- and short-term memory-based methods are used [38].

II-B3 Belief Generation

The belief generation module refers to the process of generating a belief distribution, which represents the likelihood or confidence of the input data matching a place in the map. This module is thus responsible for generating loop candidates based on the belief scores, which can be computed using methods based on frame-to-frame matching [39, 40, 41], sequences of frames [42, 43, 44, 14], hierarchical approaches, graphs [45], or probabilistic methods [46, 47, 48].

The frame-to-frame matching approach is the most common in place recognition. This approach usually computes the belief distribution by matching only one frame at a time; it uses KD-trees [33, 34] or Chow Liu trees [35] for nearest neighbor search, and the cosine [41], Euclidean [49], or Hamming [50] distance to compute the similarity score.
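
A minimal frame-to-frame matching sketch is shown below: a single query descriptor is scored against every map descriptor with the cosine similarity, and the best-scoring place becomes the loop candidate (the acceptance threshold is an arbitrary assumption).

```python
import numpy as np

def cosine_similarity(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query descriptor and a descriptor database."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return db @ q  # one score per stored place

rng = np.random.default_rng(1)
database = rng.standard_normal((5000, 256))
query = rng.standard_normal(256)

scores = cosine_similarity(query, database)
best = int(np.argmax(scores))
if scores[best] > 0.8:          # arbitrary acceptance threshold
    print("loop candidate:", best, "score:", scores[best])
```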

On the other hand, sequence-based approaches compute the scores based on sequences of consecutive frames, using, for example, cost flow minimization [51] to find matches in the similarity matrix. Sequence matching is also implementable in a probabilistic framework using Hidden Markov Models [47] or Conditional Random Fields [48].
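
The sketch below illustrates the core idea of sequence-based matching in a simplified, SeqSLAM-like form (not the exact cost-flow or probabilistic formulations cited above): given a query-versus-map similarity matrix, candidate scores are accumulated along short diagonal segments, so a match is only accepted when several consecutive frames agree.

```python
import numpy as np

def sequence_scores(sim: np.ndarray, seq_len: int = 5) -> np.ndarray:
    """Accumulate similarity along diagonals of length `seq_len`.

    sim[i, j] is the similarity between query frame i and map frame j.
    Returns, for the most recent query frame, one score per map frame.
    """
    n_query, n_map = sim.shape
    scores = np.full(n_map, -np.inf)
    for j in range(seq_len - 1, n_map):
        # Sum similarities of the last `seq_len` query frames matched to
        # map frames j-seq_len+1 .. j (a straight diagonal, constant velocity).
        diag = [sim[n_query - seq_len + k, j - seq_len + 1 + k] for k in range(seq_len)]
        scores[j] = float(np.sum(diag))
    return scores

rng = np.random.default_rng(2)
similarity_matrix = rng.random((5, 200))     # 5 recent query frames vs. 200 map frames
best_map_frame = int(np.argmax(sequence_scores(similarity_matrix)))
```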

Hierarchical methods combine multiple matching approaches in a single place recognition framework. For example, the coarse-to-fine architecture [52, 27] selects top candidates in a coarse tier, and from those, selects the best match in a fine tier.
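
A coarse-to-fine matcher can be sketched as follows; both tiers are generic placeholders (a cheap L2 search over global descriptors followed by a caller-supplied, more expensive re-ranking function), not any specific cited implementation.

```python
import numpy as np
from typing import Callable

def coarse_to_fine(query_desc: np.ndarray,
                   database: np.ndarray,
                   fine_score: Callable[[int], float],
                   top_k: int = 20) -> int:
    """Coarse tier: cheap L2 search over global descriptors.
    Fine tier: re-rank only the top_k candidates with a costlier scoring function."""
    l2 = np.linalg.norm(database - query_desc, axis=1)
    candidates = np.argsort(l2)[:top_k]                 # coarse tier
    return int(max(candidates, key=fine_score))         # fine tier

# Example usage with a dummy fine scorer (geometric verification would go here).
rng = np.random.default_rng(3)
db = rng.standard_normal((1000, 128))
q = rng.standard_normal(128)
best = coarse_to_fine(q, db, fine_score=lambda idx: -np.linalg.norm(db[idx] - q))
```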

II-C What are the major challenges?

Place recognition approaches are becoming more and more sophisticated as the environment and operating conditions become more similar to real-world situations. An example of this is the current state-of-the-art place recognition approaches, which can operate over extended areas in real-world conditions with unprecedented performance. Despite these achievements, the major place recognition challenges remain unsolved, namely: places with similar appearances; places that change in appearance over time; places that are perceived from different viewpoints; and the scalability of the proposed approaches in large environments.

II-C1 Appearance Change and Perceptual Aliasing

Appearance-changing environments and perceptual aliasing have been, in particular, the focus of much research. As autonomous vehicles operate over extended periods, their perception systems have to deal with environments that change over time due to, for example, different weather or seasonal conditions, or due to structural changes. While the appearance-change problem originates when the same place changes its appearance over time, perceptual aliasing is caused when different places have a similar appearance. These conditions particularly affect place recognition, since loop decisions are directly affected by appearance.

A variety of works have been addressing these challenges from various perspectives. From the belief generation perspective, sequence-based matching approaches [53, 54, 42, 55, 48] are highlighted as very effective in these conditions. Sequence matching is the task of aligning a pair of template and query sequences, which can be implemented through minimum cost flow [42, 56], or probabilistically using Hidden Markov Models [47] or Conditional Random Fields [48]. Another way is to address this problem from the place modeling perspective: extracting condition-invariant features [57, 58], for example by extracting features from the middle layers of CNNs [22]. On the other hand, the matching quality of descriptors can be improved through descriptor normalization [23, 59] or through unsupervised techniques such as dimensionality reduction [60], change removal [61], K-STD [62], or delta descriptors [43].

II-C2 Viewpoint Changing

Revisiting a place from different viewpoints, at the limit from the opposite direction (180° viewpoint variation) [23], is also challenging for place recognition. That is, in particular, true for approaches that rely on sensors with a restricted field-of-view (FoV) or without geometrical sensing capabilities. When visiting a place, these sensors only capture a fraction of the environment, and when revisiting it from a different angle or position, the appearance of the scene may differ or additional elements may even be sensed, generating a completely different place model.

To overcome these shortcomings, visual-based approaches have resorted to semantic-based features [41, 63]. For example, features extracted from higher-order CNN layers, which carry semantic meaning, have been demonstrated to be more robust to viewpoint variation [22]. Other works propose the use of panoramic cameras [64] or 3D LiDARs [65], making the orientation from which places are perceived in future visits irrelevant. Thus, relying on sensors and methods that do not depend on orientation (also called viewpoint-invariant) makes place recognition more robust.

II-C3 Scalability

Another critical factor in place recognition is scalability [66, 67, 50, 68, 69, 70]. As self-driving vehicles operate in increasingly larger areas, more places are visited and maps become larger and larger, thus increasing the computational demand, which negatively affects inference efficiency. To boost inference efficiency, approaches include: efficient indexing [71, 72], hierarchical searching [73, 74], hashing [50, 68, 75, 22, 70], scalar quantization [70], Hidden Markov Models (HMMs) [67, 69], or learning regularly repeating visual patterns [66]. For example, in [70], a hashing-based approach is used in a visual place recognition task with a large database, both to keep the storage footprint of the descriptor space small and to speed up retrieval.
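
The sketch below illustrates the kind of hashing used to trade descriptor footprint for retrieval speed: real-valued descriptors are projected onto random hyperplanes, the signs are packed into bits, and candidates are ranked by Hamming distance. This is a generic locality-sensitive hashing scheme chosen for illustration, not the specific method of [70].

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_bits = 256, 64
planes = rng.standard_normal((dim, n_bits))            # random projection hyperplanes

def binarize(desc: np.ndarray) -> np.ndarray:
    """Map a float descriptor to a packed binary code (n_bits -> n_bits/8 bytes)."""
    bits = (desc @ planes > 0).astype(np.uint8)
    return np.packbits(bits)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two packed codes."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

database = rng.standard_normal((10000, dim))
codes = np.stack([binarize(d) for d in database])       # 8 bytes per place instead of 1 KB
query_code = binarize(rng.standard_normal(dim))
distances = np.array([hamming(query_code, c) for c in codes])
best_candidate = int(np.argmin(distances))
```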

III Sensors

An important aspect of any perception-based application is the selection of appropriate sensors. To this end, the selection criterion has to consider the specificities of both the application and the environment for the task at hand. In place recognition, the most used sensors are cameras [26, 27, 63, 24, 41, 23, 42, 76, 32], LiDARs [77, 78, 65, 28, 79, 80, 46, 34, 13], and RADARs [14, 81, 82, 83]. Although in a broader AV context these sensors are widely adopted [84, 85], in place recognition, cameras are the most popular in the literature, followed by LiDARs, while RADARs are a very recent technology in this domain. In the remainder of this section, each sensor is detailed, and an outline is presented in Table I.

Fig. 4: Popular sensors in place recognition.

In the place recognition literature, cameras are by far the most used sensor. The vision category includes camera sensors such as monocular [26], stereo [86], RGB-D [87], thermal [88], or event-triggered [89] cameras. Cameras provide dense and rich visual information at a high frame rate (up to 60 Hz) and at a relatively low cost. On the other hand, vision data are very sensitive to visual appearance change and viewpoint variation, which is a tremendous disadvantage compared with the other modalities. Besides vision data, cameras are also able to return depth maps. This is achieved either with RGB-D [90, 87] or stereo cameras [86], or through structure-from-motion (SfM) [91] methods. In outdoor environments, the limited field-of-view (FoV) and noisy depth measurements are a clear disadvantage when compared with the depth measurements of recent 3D LiDARs.

LiDAR sensors gained more attention in place recognition with the emergence of the 3D rotating version. 3D LiDARs capture the spatial (or geometric) structure of the surrounding environment in a single 360° sweep, measuring the time-of-flight (ToF) of reflected laser beams. These sensors have a sensing range of up to 120 m at a frame rate of 10 to 15 Hz. Such features are particularly suitable for outdoor environments, since measuring depth through ToF is not influenced by lighting or visual appearance conditions. This is a major advantage when compared with cameras. On the other hand, the disadvantages are related to the high cost and the large size, which solid-state versions promise to overcome. An additional weak point is the sensitivity of this technology to the reflectance properties of objects: for example, glass, mirrors, smoke, fog, and dust reduce its sensing capabilities.

Radar sensors measure distance through time delay or phase shift of radio signals, which makes them very robust to different weather or lighting conditions. The reasonable cost and long-range capability [92] are features that are popularizing radars in tasks such as environment understanding [93] and place recognition [14]. However, radars continue to face weaknesses in terms of low spatial resolution and interoperability [93], disadvantages when compared with LiDARs or cameras.

Sensor | Advantages | Disadvantages
Camera | Low cost; dense color information; low energy consumption; high precision/resolution; high frame rate | Short range; sensitive to light; sensitive to calibration; limited FoV; difficulty in textureless environments
3D LiDAR | Long range; 360° FoV; robust to appearance-changing conditions; high precision/resolution | High cost; sensitive to reflective and foggy environments; bulky; fragile mechanics
RADAR | Low cost; very long range; precise velocity estimation; insensitive to weather conditions | Narrow FoV; low resolution
TABLE I: Sensors for place recognition: pros and cons.
Fig. 5: Block diagram of pre-trained frameworks: a) holistic-based, b) landmark-based, and c) region-based.

IV Supervised place recognition

This section addresses the place recognition approaches that resort to supervised deep learning. Supervised machine learning techniques learn a function that maps an input representation (e.g., images, point clouds) into an output representation (e.g., categories, scores, bounding boxes, descriptors), utilizing labeled data. In deep learning, this function assumes the form of weights in a network with stacked layers. The weights are learned progressively by computing the error between predictions and ground truth in a first step; in a second step, the error is backpropagated using gradient vectors [20]. This procedure (i.e., error measuring and weight adjusting) is repeated until the network’s predictions achieve adequate performance. The advantage of such a learning process, particularly when using convolutional neural networks (CNNs), is the capability of automatically learning features from the training data, which, in classical approaches, required a considerable amount of engineering skill and domain expertise. On the other hand, the disadvantages are related to the necessity of a massive amount of labeled data for training, which is expensive to obtain [94].

In place recognition, deep supervised learning has enabled breakthroughs. In particular, the capability of CNNs to extract features led to more descriptive place models, improving place matching. Early approaches relied mostly on pre-trained (or off-the-shelf) CNNs that were trained on other vision tasks [21, 22]. More recently, new approaches enabled the training of DL networks directly on place recognition tasks in an end-to-end fashion [24, 65, 95, 96].

IV-A Pre-trained-based Frameworks

Type | Ref | Model | BG/PM | Dataset
Holistic-based | [21] | Feature extraction: OxfordNet [97] and GoogLeNet [98]; Descriptor: VLAD [99] + PCA [100] | L2 distance / Database | Holidays [101]; Oxford [102]; Paris [103]
Holistic-based | [104] | Feature extraction: CNN-VTL (VGG-F [105]); Descriptor: conv5 layer + random selection (LDB [106]) | Hamming distance / Database | Nordland [2]; CMU-CVG Visual Localization [107]; Alderley [51]
Holistic-based | [22] | Feature extraction: AlexNet [108]; Descriptor: conv3 layer | Hamming KNN / Database | Nordland [2]; Gardens Point [22]; The Campus Human vs. Robot; St. Lucia [109]
Landmark-based | [23] | Landmark detection: left and right image regions; Feature extraction: CNN Places365 [110]; Descriptor: fc6 + normalization + concatenation | Sequence match / Database | Oxford RobotCar [102]; University Campus
Landmark-based | [111] | Landmark detection: Edge Boxes; Feature extraction: AlexNet; Descriptor: conv3 layer + Gaussian Random Projection [112] | Cosine KNN / Database | Gardens Point [22]; Mapillary; Library Robot Indoor; Nordland [2]
Landmark-based | [113] | Landmark detection: BING [114]; Feature extraction: AlexNet; Descriptor: pool5 layer + Gaussian Random Projection [115, 112] + normalization | L2 KNN / Database | Gardens Point [22]; Berlin A100, Berlin Halenseestrasse and Berlin Kudamm [111]; Nordland [2]; St. Lucia [109]
Region-based | [59] | Feature extraction: Fast-Net (VGG) [116]; Descriptor: conv3 + L2 normalization + Sparse Random Projection [117] | Cosine distance / Database | Cityscapes [118]; Virtual KITTI [119]; Freiburg
Region-based | [40] | Feature extraction: VGG16 [120]; Descriptor: salient regions from different layers + bag-of-words [121] | Cross matching / Database | Gardens Point [22]; Nordland [2]; Berlin A100, Berlin Halenseestrasse and Berlin Kudamm [111]
Region-based | [122] | Feature extraction: AlexNet365 [123]; Descriptor: (Region-VLAD) salient regions + VLAD | Cosine distance / Database | Mapillary; Gardens Point [22]; Nordland [2]; Berlin A100, Berlin Halenseestrasse and Berlin Kudamm [111]

TABLE II: Summary of recent works on supervised place recognition using pre-trained frameworks. All the works use camera-based data. BG = Belief Generation and PM = Place mapping.

In this work, pre-trained place recognition frameworks refer to approaches that extract features from pre-trained CNN models, which are originally trained on other perception tasks (e.g., object recognition [21], place categorization [22, 23] or segmentation [59]). Works using such models fall into three categories: holistic-based, landmark-based, and region-based. Figure 5 illustrates such approaches applied to an input image.

IV-A1 Holistic-based

Holistic approaches refer to works that feed the whole image to a CNN and use all activations of a layer as a descriptor. The hierarchical nature of CNNs means that the various layers contain features with different semantic meanings. Thus, to assess which layers generate the best features for place recognition, works have conducted ablation studies, which compared the performance of the various layers with respect to appearance and viewpoint robustness, and compared the performance of object-centric, place-centric, and hybrid networks (i.e., networks trained, respectively, for object recognition, place categorization, and both). Moreover, as CNN layers tend to have many activations, the proposed approaches compress the descriptor to a more tractable size for efficiency reasons.

Ng et al. [21] study the performance of each layer, using pre-trained object-centric networks such as OxfordNet [97] and GoogLeNet [98] to extract features from images. The features are encoded into VLAD descriptors and compressed using PCA [100]. Results show that performance increases as features are extracted from deeper layers, but drops again at the latest layers. Matching is achieved by computing the L2 distance between two descriptors.

A similar conclusion is reached by Sünderhauf et al. [22], using holistic image descriptors extracted from AlexNet [124]. The authors argue that the semantic information encoded in the middle layers improves place recognition when faced with severe appearance change, while features from higher layers are more robust to viewpoint change. The work further compares AlexNet (object-centric) with Places205 and Hybrid [125], both trained on a scene categorization task (i.e., place-centric networks), concluding that, for place recognition, place-centric networks outperform object-centric CNNs. The networks are tested using a cosine-based KNN approach for matching, but for efficiency reasons, the cosine similarity was approximated by the Hamming distance [126].
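
As a concrete, hypothetical example of this holistic use of an off-the-shelf network, the PyTorch sketch below grabs the conv3 activations of an ImageNet-pretrained AlexNet with a forward hook and flattens them into a place descriptor; the layer index and pre-processing are assumptions, not the exact setup of [22].

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained AlexNet; conv3 is the 7th module of the `features` block.
alexnet = models.alexnet(weights="DEFAULT").eval()
activations = {}
alexnet.features[6].register_forward_hook(
    lambda module, inp, out: activations.update(conv3=out.detach())
)

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def holistic_descriptor(image_path: str) -> torch.Tensor:
    """Flattened, L2-normalized conv3 activations used as a place descriptor."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        alexnet(image)
    desc = activations["conv3"].flatten()
    return desc / desc.norm()
```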

On the other hand, Arroyo et al. [104] fuse features from multiple convolutional layers at several levels and granularities and show that this approach outperforms approaches that only use features from a single layer. The CNN architecture is based on the VGG-F [105], and the output features are further compressed using a random selection approach for efficient matching.

IV-A2 Landmark-based

Landmark-based approaches, contrary to the aforementioned methods, do not feed the entire image to the network; instead, these approaches use, in a pre-processing stage, object proposal techniques to identify potential landmarks in the images, which are fed to the CNN. Contrary to holistic-based approaches, where all image features are transformed into descriptors, in landmark-based approaches, only the features from the detected landmarks are converted into descriptors. Detection approaches used in these works include Edge Boxes, BING, or simple heuristics.

With the aim of addressing the extreme appearance and viewpoint variation problem in place recognition, Sünderhauf et al. [111] propose such a landmark detection approach. Landmarks are detected using Edge Boxes [127] and are mapped into a feature space using the features from AlexNet’s [124] conv3 layer. The descriptor is also compressed for efficiency reasons, using a Gaussian Random Projection approach [112].

A similar approach is proposed by Kong et al. [113]. However, instead of detecting landmarks using Edge Boxes [127] and extracting features from the conv3 layer of AlexNet, landmarks are detected using BING [114] and features are extracted from a pooling layer.

A slightly different approach is proposed by Garg et al. [23], which, resorting to Places365 [110], also highlights the effectiveness of place-centric semantic information under extreme variations such as front versus rear view. In particular, this work crops the right and left regions of the images, which have been demonstrated to possess useful information for place description [59]. The work highlights the importance of semantics-aware features from higher-order layers for viewpoint and condition invariance. Additionally, to improve robustness against appearance change, a descriptor normalization approach is proposed: normalization of the query and reference descriptors is computed independently, since the image conditions differ (i.e., day-time vs. night-time). Matching is computed using SeqSLAM [42].

Fig. 6: Block diagram of training strategies using a) contrastive-based and margin-based, b) triplet, and c) quadruplet loss functions.

IV-A3 Region-based

Region-based methods, similarly to landmark-based approaches, rely on local features; however, instead of utilizing object proposal methods, the regions of interest are identified in the CNN layers by detecting salient layer activations. Therefore, region-based methods feed the entire image to the DL model and use, for the descriptors, only the salient activations in the CNN layers.

Addressing the problem of viewpoint and appearance change in place recognition, Chen et al. [40] propose such a region-based approach, which extracts salient regions without relying on external landmark proposal techniques. Regions of interest are extracted from various CNN layers of a pre-trained VGG16 [120]. The approach explicitly extracts local features from the early layers and semantic information from the later layers. The extracted regions are encoded into a descriptor using a bag-of-words-based approach [121], which is matched using a cross-matching approach.

Naseer et al. [59], on the other hand, learn the activation regions of interest resorting to segmentation. In this work, regions of interest represent stable image areas, which are learned using Fast-Net [116], an up-convolutional network that provides near real-time image segmentation. Because they are too large for real-time matching, the features resulting from the learned segments are encoded into a lower dimensionality using L2 normalization and Sparse Random Projection [117]. This approach, in particular, learns human-made structures, which tend to be stable over longer periods.

With the aim of reducing the memory and computational cost, Khaliq et al. [122] propose Region-VLAD. This approach leverages a lightweight place-centric CNN architecture (AlexNet365 [123]) to extract regional features. These features are encoded using a VLAD method, which is specially adapted to gain computational efficiency and environment invariance.

Sensor | Ref | Architecture | Loss Function | BG/PM | Dataset
Camera | [24] | NetVLAD: VGG/AlexNet + NetVLAD layer | Triplet loss | KNN / Database | Google Street View Time Machine; Pitts250k [128]
Camera | [25] | 2D CNN visual and 3D CNN structural feature extraction + feature fusion network | Margin-based loss [129] | KNN / Database | Oxford RobotCar [1]
Camera | [96] | SPE-VLAD: (VGG-16 or ResNet18) + spatial pyramid structure + NetVLAD layer | Weighted triplet loss | L2 / Database | Pittsburgh [128]; TokyoTimeMachine [130]; Places365-Standard [110]
Camera | [131] | Siamese-ResNet: ResNet in the siamese network | L2-based loss [132] | L2 / Database | TUM [133]
Camera | [50] | MobileNet [134] | Triplet loss [135, 136] | Hamming KNN / Database | Nordland [2]; Gardens Point [22]
Camera | [137] | HybridNet [138] | Triplet loss | Cosine / Database | Oxford RobotCar [6]; Nordland [2]; Gardens Point [22]
3D LiDAR | [49] | LPD-Net: adaptive feature extraction + graph-based neighborhood aggregation + NetVLAD layer | Lazy quadruplet loss | L2 / Database | Oxford RobotCar [1]
3D LiDAR | [65] | PointNetVLAD: PointNet + NetVLAD layer | Lazy triplet and quadruplet loss | KNN / Database | Oxford RobotCar [1]
3D LiDAR | [34] | OREOS: CNN as in [120, 139] | Triplet loss [140] | KNN / Database | NCLT [141]; KITTI [142]
3D LiDAR | [13] | LocNet: siamese network | Contrastive loss [143] | L2 KNN / Database | KITTI [142]; in-house dataset
3D LiDAR | [144] | Siamese network | Contrastive loss [143] | L2 KNN / Database | KITTI [142]; in-house dataset
RADAR | [81] | VGG-16 + NetVLAD layer | Triplet loss | KNN / Database | Oxford Radar RobotCar [6]

TABLE III: Summary of recent works on supervised end-to-end place recognition. BG = Belief Generation and PM = Place mapping.

IV-B End-to-End Frameworks

Conversely to pre-trained frameworks, end-to-end frameworks resort to machine learning approaches that learn the feature representation and obtain a descriptor directly from the sensor data while training on a place recognition task. A key aspect of end-to-end learning is the definition of the training objective: i.e., what the networks are optimized for, and how they are optimized. In place recognition, networks are mostly optimized to generate unique descriptors that can identify the same physical place regardless of appearance or viewpoint. Achieving such an objective depends on selecting an adequate network for the task at hand and on adequate network training, which in turn depends on the loss function.

IV-B1 Loss functions

The loss function is a major concern in the training phase, since it represents the mathematical interpretation of the training objective, thus determining the successful convergence of the optimization process. In place recognition, loss functions include triplet-based [24, 96, 137, 65, 34, 81, 50], margin-based [25], quadruplet-based [49], and contrastive-based [13] functions. Figure 6 illustrates the various training strategies of the loss functions.

The contrastive loss is used in siamese networks [13, 143], which have two branches with shared parameters. This function computes the similarity distance between the output descriptors of the branches, forcing the networks to decrease the distance between positive pairs (input data from the same place) and increase the distance between negative pairs. The function can be described as follows:

\mathcal{L}_{contrastive} = y \, D(d_a, d_b)^2 + (1 - y) \, \max(0,\; m - D(d_a, d_b))^2    (1)

where D(d_a, d_b) represents the Euclidean distance between the descriptor representation d_a from the anchor branch and the descriptor representation d_b from the other branch, m represents a margin parameter, and y represents the label, where y = 1 refers to a positive pair and y = 0 otherwise.

Similar to the former loss, the triplet loss also relies on more than one branch during training. However, instead of computing the distance between positive or negative pairs at each iteration, the triplet loss function computes the distance between a positive and a negative pair at the same iteration, relying, thus, on three branches. As in the former loss function, the objective is to train a network to keep positive pairs close and negative pairs apart. The Triplet loss function can be formulated as follows:

\mathcal{L}_{triplet} = \max(0,\; d_{pos} - d_{neg} + m)    (2)

where d_{pos} refers to the distance of the positive pair (i.e., between the anchor and the positive sample), d_{neg} refers to the distance of the negative pair, and m is a margin parameter. This function is widely used in place recognition, namely in frameworks that use input data from cameras, 3D LiDARs, and RADARs, which adapt the function to fit their training requirements. Loss functions that derive from the triplet loss include the lazy triplet [65], the weighted triplet loss [96], and the weakly supervised triplet ranking loss [24].

The quadruplet loss is an extension of the triplet loss, introducing an additional constraint to push the negative pairs [145] away from the positive pairs w.r.t. different probe samples, while the triplet loss only pushes the negatives away from the positives w.r.t. the same probe. The additional constraint of the quadruplet loss reduces the intra-class variations and enlarges the inter-class variations. This function is formulated as follows:

\mathcal{L}_{quadruplet} = \max(0,\; d_{pos} - d_{neg} + m_1) + \max(0,\; d_{pos} - d_{neg}^{*} + m_2)    (3)

where d_{pos} and d_{neg} represent the distance of the positive and negative pairs, respectively, m_1 and m_2 represent margin parameters, and d_{neg}^{*} corresponds to the additional constraint, representing the distance between negative pairs from different probes. In [49, 65], the quadruplet loss function is used to train networks for the task of place recognition using 3D LiDAR data.

The margin-based loss function is a simple extension of the contrastive loss [129]. While the contrastive function enforces the positive pairs to be as close as possible, the margin-based function only encourages the positive pairs to be within a distance \beta of each other:

\mathcal{L}_{margin} = \max(0,\; \alpha + y \, (D(d_a, d_b) - \beta))    (4)

where y represents the label, taking the value 1 when the pair is positive and -1 otherwise, \alpha is a margin parameter, and \beta is a variable that determines the boundary between positive and negative pairs. The margin-based loss function was proposed in [129] to demonstrate that state-of-the-art performance could be achieved with a simple loss function, provided an adequate sampling strategy of the input data is used during training. This function is used in [25] to train a multi-modal network, which is jointly trained based on information extracted from images and structural data in the form of voxel grids, which are generated from the images.
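
A compact PyTorch sketch of the loss functions in Eqs. (1)-(4) is given below; the margin values and the squared-distance form of the contrastive term are illustrative assumptions, since individual works adapt these functions to their own training setups.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(d_a, d_b, y, margin=1.0):
    """Eq. (1): pull positive pairs (y=1) together, push negatives (y=0) apart."""
    dist = F.pairwise_distance(d_a, d_b)
    return (y * dist.pow(2) + (1 - y) * F.relu(margin - dist).pow(2)).mean()

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Eq. (2): positive pair closer than negative pair by at least `margin`."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def quadruplet_loss(anchor, positive, negative, other_negative, m1=0.5, m2=0.3):
    """Eq. (3): adds a second term with a negative pair from a different probe."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    d_neg_star = F.pairwise_distance(negative, other_negative)
    return (F.relu(d_pos - d_neg + m1) + F.relu(d_pos - d_neg_star + m2)).mean()

def margin_based_loss(d_a, d_b, y, alpha=0.2, beta=1.2):
    """Eq. (4): y is +1 for positive pairs, -1 otherwise; beta is the boundary."""
    dist = F.pairwise_distance(d_a, d_b)
    return F.relu(alpha + y * (dist - beta)).mean()
```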

IV-B2 Camera-based Networks

A key contribution to supervised end-to-end-based place recognition is the NetVLAD layer [24]. Inspired by the Vector of Locally Aggregated Descriptors (VLAD) [146], Arandjelović et al. [24] propose NetVLAD as a ‘pluggable’ layer into any CNN architecture to output a compact image descriptor. The network’s parameters are learned using a weakly supervised triplet ranking loss function.
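To make the layer concrete, the following PyTorch sketch reproduces the core of a NetVLAD-style pooling layer in simplified form: soft assignment via a 1x1 convolution, residual aggregation against learned centroids, and intra- plus final L2 normalization. Cluster count, feature dimension, and initialization are placeholder assumptions, not the exact configuration of [24].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLADLayer(nn.Module):
    """Minimal NetVLAD-style pooling sketch (after Arandjelovic et al. [24]).

    Aggregates a dense CNN feature map (B, C, H, W) into a (B, K*C) descriptor
    by soft-assigning local features to K learned cluster centres and summing
    the residuals.
    """
    def __init__(self, num_clusters=64, dim=512):
        super().__init__()
        self.num_clusters = num_clusters
        self.dim = dim
        # 1x1 convolution produces per-location soft-assignment scores.
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=1, bias=True)
        # Learnable cluster centres c_k.
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, _, _ = x.shape
        soft_assign = F.softmax(self.conv(x).view(B, self.num_clusters, -1), dim=1)  # (B, K, N)
        x_flat = x.view(B, C, -1)               # (B, C, N) local descriptors
        # Residuals between every local descriptor and every centroid.
        residual = x_flat.unsqueeze(1) - self.centroids.view(1, self.num_clusters, C, 1)  # (B, K, C, N)
        vlad = (residual * soft_assign.unsqueeze(2)).sum(dim=-1)  # (B, K, C)
        vlad = F.normalize(vlad, p=2, dim=2)    # intra-normalization per cluster
        vlad = F.normalize(vlad.view(B, -1), p=2, dim=1)  # final L2 normalization
        return vlad                              # (B, K*C) global descriptor
```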

Yu et al. [96] also exploited VLAD descriptors for images, proposing a spatial pyramid-enhanced VLAD (SPE-VLAD) layer. The proposed layer leverages the spatial pyramid structure of images to enhance place description, using for feature extraction a VGG-16 [120] or a ResNet-18 [147], and as loss function a weighted triplet loss. The network’s parameters are learned under weakly supervised scenarios, using GPS tags and the Euclidean distance between the image representations.

Qiu et al. [131] apply a siamese-based network to loop closure detection. Siamese networks are twin neural networks whose sub-networks share the same parameters and mirror parameter updates; they are particularly useful when limited training data are available. In this work, the sub-networks are replaced by ResNet to improve feature representation, and the network is trained resorting to an L2-based loss function as in [132].

Wu et al. [50] jointly address the place recognition problem from the efficiency and performance perspectives, proposing to this end a deep supervised hashing approach with a similarity hierarchy. Hashing is an encoding technique that maps high-dimensional data into a set of binary codes, having low computational requirements and high storage efficiency. The proposed framework comprises three modules: feature extraction based on MobileNet [134]; hash code learning, obtained using the last fully connected layer of MobileNet; and a loss function, which is based on the likelihood [135, 136]. This work proposes a similarity hierarchy method to distinguish similar images. To this end, the distance between the hash codes of an image pair must increase as the images become more distinct and must remain small between similar images. These two conditions are essential to use deep supervised hashing in place recognition.

Another efficiency-improving technique for deep networks is network pruning. This technique aims to reduce the size of the network by removing unnecessary neurons or setting weights to zero [148]. Hausler et al. [137] propose a feature filtering approach, which removes feature maps at the beginning of the network while using late feature maps for matching, to foster efficiency and performance simultaneously. The feature maps to be removed are determined based on a triplet-loss calibration procedure. As a feature extraction framework, the approach uses HybridNet [138].

Contrary to the former single-modality works, Oertel et al. [25] propose a place description approach that uses both visual and structural information, both originating from camera data. This approach jointly uses vision and depth data from a stereo camera in an end-to-end pipeline. The structural information is first obtained using the Direct Sparse Odometry (DSO) framework [149] and then discretized into regular voxel grids, which serve as inputs along with the corresponding image. The pipeline has two parallel branches, one for the visual and another for the structural data, which use 2D and 3D convolutional layers, respectively, for feature extraction. Both branches are trained jointly through a margin-based loss function. The outputs of the branches are concatenated into a single vector, which is fed to a fusion network that outputs the descriptor.

IV-B3 3D LiDAR-based Network

Although NetVLAD was originally used for images, it has also been applied to 3D LiDAR data [65, 49]. Uy et al. [65] and Liu et al. [49] propose, respectively, PointNetVLAD and LPD-Net, which are NetVLAD-based global descriptor learning approaches for 3D LiDAR data. Both have compatible inputs and outputs, receiving raw point clouds as input and outputting a descriptor. The difference lies in the feature extraction and feature processing methods. PointNetVLAD [65] relies on PointNet [150], a 3D object detection and segmentation approach, for feature extraction. In contrast, LPD-Net relies on an adaptive local feature extraction module and a graph-based neighborhood aggregation module, aggregating both in the feature space and the Cartesian space. Regarding network training, Uy et al. [65] showed that the lazy quadruplet loss function enables higher performance than the lazy triplet loss function, motivating Liu et al. [49] to follow this approach.

A different 3D LiDAR-based place recognition approach is proposed in [34]. Schaupp et al. propose OREOS, which is a triplet DL network-based architecture [140]. The OREOS approach receives 2D range images as input and outputs orientation- and place-dependent descriptors. The 2D range images are the result of projecting the 3D point clouds onto an image representation. The network is trained using an L2 distance-based triplet loss function to compute the similarity between anchor-positive and anchor-negative pairs. Place recognition is validated using a k-nearest neighbor framework for matching.
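
The projection of a 3D point cloud onto a 2D range image, as used by OREOS-style pipelines, can be sketched as follows; the field of view and image resolution are assumed values for a generic spinning LiDAR, not the exact parameters of [34].

```python
import numpy as np

def pointcloud_to_range_image(points: np.ndarray,
                              h: int = 64, w: int = 900,
                              fov_up_deg: float = 15.0,
                              fov_down_deg: float = -15.0) -> np.ndarray:
    """Spherical projection of an (N, 3) point cloud onto an (h, w) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1) + 1e-8

    yaw = np.arctan2(y, x)                      # azimuth in [-pi, pi]
    pitch = np.arcsin(z / depth)                # elevation

    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = ((yaw + np.pi) / (2.0 * np.pi)) * w     # column from azimuth
    v = ((fov_up - pitch) / (fov_up - fov_down)) * h  # row from elevation

    u = np.clip(u.astype(np.int32), 0, w - 1)
    v = np.clip(v.astype(np.int32), 0, h - 1)

    image = np.zeros((h, w), dtype=np.float32)
    image[v, u] = depth                          # keep one range value per pixel
    return image
```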

Yin et al. [13] use 3D point clouds to address the global localization problem, proposing a place recognition and metric pose estimation approach. Place recognition is achieved using the siamese LocNet, a semi-handcrafted representation learning method for LiDAR point clouds. As input, LocNet receives a handcrafted rotation-invariant representation extracted from the point clouds in a pre-processing step, and it outputs a low-dimensional fingerprint. The network follows a siamese architecture and uses for learning a Euclidean distance-based contrastive loss function [143]. For belief generation, an L2-based KNN approach is used. A similar LocNet-based approach is proposed in [144].

IV-B4 RADAR-based

Regarding RADAR-based place recognition, Saftescu et al. [95] also propose a NetVLAD-based approach to map FMCW RADAR scans into a descriptor space. Features are extracted using a specially tailored CNN based on cylindrical convolutions, anti-aliasing blurring, and azimuth-wise max-pooling, to bolster the rotational invariance of polar radar images. Regarding training, the network uses a triplet loss function as proposed in [151].

V Unsupervised Place Recognition

The aforementioned supervised learning approaches achieve excellent results in learning discriminative place models. However, these methods have the inconvenience of requiring a vast amount of labeled data to perform well, as is common in supervised DL-based approaches. Contrary to supervised learning, unsupervised learning does not require labeled data, an advantage when annotated data are unavailable or scarce.

Place recognition works use unsupervised approaches such as Generative Adversarial Networks (GANs) for domain translation [152]. An example of such an approach is proposed by Latif et al. [152], who address the cross-season place recognition problem as a domain translation task. GANs are used to learn the relationship between two domains without requiring cross-domain image correspondences. The proposed architecture consists of two coupled GANs. The generator integrates an encoder-decoder network, while the discriminator integrates an encoder network followed by two fully connected layers. The output of the discriminator is used as a descriptor for place recognition. The authors show that the discriminator’s feature space is more informative than image pixels translated to the target domain.

Yin et al. [153] also propose a GAN-based approach, but for 3D LiDAR data. LiDAR data are first mapped into dynamic octree maps, from which bird-view images are extracted. These images are used in a GAN-based pipeline to learn stable and generalized place features. The network is trained using adversarial and conditional entropy strategies to achieve a higher generalization ability and to capture the unique mapping between the original data space and the compressed latent code space.

Han et al. [88] propose a Multispectral Domain Invariant framework for translation between unpaired RGB and thermal imagery. The proposed approach is based on CycleGAN [154], which relies, for training, on the single-scale structural similarity index (SSIM) [155] loss, a triplet loss, an adversarial loss, and two types of consistency losses (cyclic loss [154] and pixel-wise loss). The proposed framework is further validated on semantic segmentation and domain adaptation tasks.

Contrary to the former works, which were mainly based on GAN approaches, Merril and Huang [94] propose, for visual loop closure, an autoencoder-based approach to handle the feature embedding. Instead of reconstructing the original images, this unsupervised approach is specifically tailored to map images to a HOG descriptor space. The autoencoder network is trained with a pre-processing stage as input, where two classical geometric vision techniques are exploited: the histogram of oriented gradients (HOG) [156] and the projective transformation (homography) [157]. HOG enables the compression of images while preserving salient features, while the projective transformation allows relating images with differing viewpoints. The network has a minimal architecture, enabling fast and reliable loop closure detection in real time with no dimensionality reduction.
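
To illustrate why HOG makes a convenient compression target, the snippet below (using scikit-image; all parameter values are arbitrary assumptions) turns an image into a fixed-length HOG vector, which is the kind of descriptor space the autoencoder in [94] is trained to map images into.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def hog_descriptor(gray_image: np.ndarray) -> np.ndarray:
    """Fixed-length HOG vector for a grayscale image (parameters are illustrative)."""
    image = resize(gray_image, (120, 160))      # fixed resolution -> fixed vector length
    return hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               feature_vector=True)

descriptor = hog_descriptor(np.random.rand(480, 640))
print(descriptor.shape)
```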

Sensor | Ref | Architecture | Loss Function | Task | Dataset
Camera | [152] | Architecture: coupled GANs + encoder-decoder network | Minimization of the cyclic reconstruction loss [158] | Domain translation for cross-domain place recognition | Nordland [2]
Camera | [94] | Pre-processing: HOG [156] and homography [157]; Architecture: small autoencoder | L2 loss function | Unsupervised feature embedding for visual loop closure | Places [110]; KITTI [142]; Alderley [42]; Nordland [2]; Gardens Point [22]
RGB + Thermal | [88] | Multispectral Domain Invariant model; Architecture: CycleGAN [154] | SSIM [155] + triplet + adversarial + cyclic loss [154] + pixel-wise loss | Unsupervised multispectral imagery translation | KAIST [159]
3D LiDAR | [153] | Pre-processing: mapping LiDAR data to dynamic octree maps and then to bird-view images; Architecture: GAN + encoder-decoder network | Adversarial learning and conditional entropy | Unsupervised feature learning for 3D LiDAR-based place recognition | KITTI [142]; NCLT [141]

TABLE IV: Summary of recent works using unsupervised end-to-end learning techniques for place recognition.

Sensor | Ref | Architecture | Loss Function | Task | Dataset
Camera | [160] | Feature extraction: AlexNet cropped (conv5); Supervised: VLAD + attention module; Unsupervised: domain adaptation | Supervised: triplet ranking; Unsupervised: MK-MMD [161] | Single- and cross-domain VPR | Mapillary (https://www.mapillary.com); Beeldbank (https://beeldbank.amsterdam.nl/beeldbank)
Camera | [162] | Supervised: adversarial learning; Unsupervised: autoencoder | Adversarial learning: least squares [163]; Reconstruction: L2 distance | Disentanglement of place and appearance features in cross-domain VPR | Nordland [2]; Alderley [42]
3D LiDAR | [77] | Supervised: latent space + classification network; Unsupervised: autoencoder-like network | Classification: softmax cross-entropy [164]; Reconstruction: binary cross-entropy [165] | Global localization, 3D dense map reconstruction, and semantic information extraction | KITTI odometry [142]

TABLE V: Recent works that combine supervised and unsupervised learning in place recognition systems.

VI Semi-supervised Place Recognition

In this work, semi-supervised approaches refer to works that jointly rely on supervised and unsupervised methods. The combination of these two learning approaches is particularly used for the cross-domain problem. However, rather than translating one domain into another, these learning techniques are used to learn features that are independent of the domain appearance. A summary of recent works is presented in Table V.

To learn domain-invariant features for cross-domain visual place recognition, Wang et al. [160] propose an approach that combines weakly supervised learning with unsupervised learning. The proposed architecture has three primary modules: an attention module, an attention-aware VLAD module, and a domain adaptation module. The supervised branch is trained with a triplet ranking loss function, while the unsupervised branch resorts to a multi-kernel maximum mean discrepancy (MK-MMD) loss function.

On the other hand, Tang et al. [162] propose a self-supervised learning approach to disentangle place-related features from domain-related features. The backbone architecture of the proposed approach is a modified autoencoder for adversarial learning, i.e., two input encoder branches converging into one output decoder. The disentanglement of the two feature domains is solved through adversarial learning, which constrains the learning of domain-specific features (i.e., features depending on the appearance), a task that is not guaranteed by the reconstruction loss of autoencoders. For adversarial learning, the proposed loss function is the least squares adversarial loss [163], while for reconstruction, the loss function is the L2 distance.

Dubé et al. [77] propose SegMap, a data-driven learning approach for the task of localization and mapping. The approach uses as its main framework an autoencoder-like architecture to learn object segments of 3D point clouds. The framework is used for two tasks: (supervised) classification and (unsupervised) reconstruction. The work proposes a customized learning technique to train the network, which comprises, for classification, the softmax cross-entropy loss function in conjunction with the N-way classification learning technique [164], and, for reconstruction, the binary cross-entropy loss function [165]. The latent space, which is jointly learned on the two tasks, is used as a descriptor for segment retrieval. The proposed framework can be used in global localization, 3D dense map reconstruction, and semantic information extraction tasks.

VII Other Frameworks

Fig. 7: Block diagram of a) hierarchical and b), c) parallel place recognition frameworks. The example in b) fuses the descriptors, while in c) the belief scores are fused.

This section is dedicated to frameworks that have more complex and entangled architectures, i.e., containing more than one place recognition approach for the purpose of finding the best loop candidates. Two main frameworks are highlighted: parallel and hierarchical. While parallel frameworks have a well-defined structure, hierarchical frameworks may assume very complex and entangled configurations; both, however, have the end goal of yielding more performant place recognition methods.

VII-A Parallel Frameworks

Parallel frameworks refer to approaches that rely on multiple information streams, which are fused into one branch to generate place recognition decisions. These parallel architectures fuse the various branches using methods such as feature concatenation [44], HMMs [166], or multiplying normalized data across Gaussian-distributed clusters [167]. Approaches such as the one proposed by Oertel et al. [25], where vision and structural data are fused in an end-to-end fashion, are considered to belong to Section IV-B because features are jointly learned in an end-to-end pipeline. An example of a parallel framework is illustrated in Fig. 7, and a summary of recent works is presented in Table VI.

Relying on multiple information streams allows overcoming individual sensory data limitations, which can be due, for instance, to changing environment conditions. Zhang et al. [44] address the loop closure detection problem under strong perceptual aliasing and appearance variations, proposing the Robust Multimodal Sequence-based (ROMS) method. ROMS concatenates LDB features [168], GIST features [169], CNN-based deep features [170], and ORB local features [171] into a single vector. A similar (parallel) architecture is proposed by Hausler et al. [166], where an approach called Multi-Process Fusion fuses four image processing methods: SAD with patch normalization [42, 172]; HOG [173, 174]; multiple spatial regions of CNN features [138, 175]; and spatial coordinates of maximum CNN activations [41]. However, instead of fusing all features to generate one descriptor as proposed in [44], here each feature stream is matched separately using the cosine distance, and only the resulting similarity values are fused using a Hidden Markov Model.

Sensor | Ref | Model | Fusion | BG/PM | Dataset
Camera | [44] | Feature extraction: LDB [168] + GIST [169] + CNN [170] + ORB [171] | Concatenation of all features | Sequence / Database | St Lucia [109]; CMU-VL [107]; Nordland [2]
Camera | [166] | Feature extraction: SAD [42, 172] + HOG [173, 174] + spatial regions of HybridNet (conv5 layer) [138, 175] + spatial coordinates of maximum HybridNet (conv5 layer) activations [41]; Descriptor: features + normalization | Hidden Markov Model over the similarity distances of each feature stream | Dynamic sequence / Database | St Lucia [109]; Nordland [2]; Oxford RobotCar [102]

TABLE VI: Summary of recent works on supervised place recognition using parallel frameworks. BG = Belief Generation and PM = Place mapping.

VII-B Hierarchical Frameworks

In this work, hierarchical frameworks refer to place recognition approaches that, similarly to parallel frameworks, rely on multiple methods; however, instead of having a parallel architecture as the main framework, the architecture is formed by various stacked tiers. Hierarchical architectures find the best loop candidate by filtering candidates progressively in each tier. An example of such a framework is the coarse-to-fine architecture, which has a coarse and a fine tier. The coarse tier is mostly dedicated to retrieving top candidates using methods that are computationally efficient rather than accurate. These top candidates are fed to the fine tier, which can use more computationally demanding methods to find the best loop candidate. The coarse-to-fine architecture, while being the most common, is not the only one; for example, Fig. 7 illustrates the framework proposed in [176], and Table VII presents a summary of recent works.

Hausler and Milford [176] show that parallel fusion strategies have inferior performance compared with hierarchical approaches, and therefore propose Hierarchical Multi-Process Fusion, which has a three-tier hierarchy. In the first tier, top candidates are retrieved from the database based on HybridNet [138] and Gist [177] features. In the second tier, a narrower selection is performed from the top candidates of the previous tier, based on KAZE [178] and Only Look Once (OLO) [123] features. Finally, the best loop candidate is obtained in the third tier using NetVLAD [24] and HOG [18]. An illustration of this framework is presented in Fig. 7.

Garg et al. [41] follow a similar framework, proposing a hierarchical place recognition approach called LoST-X. In the coarse tier, top candidates are found by matching the Local Semantic Tensor (LoST) descriptor, which combines feature maps from RefineNet [179] (a dense segmentation network) with semantic label scores for the road, building, and vegetation classes. The best match is then found in the fine tier by verifying the spatial layout of semantically salient keypoint correspondences.
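The following simplified sketch conveys the flavor of a semantically aggregated descriptor: dense segmentation features are averaged over the pixels of a few semantic classes and concatenated. The actual LoST construction in [41] differs in detail; the class set, tensor layout, and normalization here are assumptions.

```python
import numpy as np

def semantic_descriptor(feature_map, label_map, classes=(0, 1, 2)):
    """LoST-inspired (simplified) descriptor: average the dense feature map
    over the pixels of each semantic class (e.g., road, building, vegetation)
    and concatenate the per-class means.
    feature_map: (H, W, C) activations from a segmentation network;
    label_map:   (H, W) predicted class ids."""
    parts = []
    for c in classes:
        mask = label_map == c
        if mask.any():
            parts.append(feature_map[mask].mean(axis=0))
        else:
            parts.append(np.zeros(feature_map.shape[-1]))
    desc = np.concatenate(parts)
    return desc / (np.linalg.norm(desc) + 1e-12)

# Toy usage: cosine similarity between two such descriptors.
rng = np.random.default_rng(2)
fm1, fm2 = rng.random((30, 40, 16)), rng.random((30, 40, 16))
lm1, lm2 = rng.integers(0, 3, (30, 40)), rng.integers(0, 3, (30, 40))
print(semantic_descriptor(fm1, lm1) @ semantic_descriptor(fm2, lm2))
```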

This semantic- and keypoint-based approach is further exploited in [63, 27]. In [63], top candidates are obtained by fusing the NetVLAD [24] and LoST [41] descriptors in the coarse stage, while in [27], depth maps are computed from the camera data in an intermediate stage to remove keypoints that are out of range, as illustrated in the sketch below.
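A minimal sketch of such depth-based keypoint filtering follows; the range threshold and keypoint layout are illustrative assumptions, not values from [27].

```python
import numpy as np

def filter_keypoints_by_depth(keypoints, depth_map, max_depth=25.0):
    """Keep only keypoints whose estimated depth is within range, discarding
    distant (unreliable) points before matching.
    keypoints: (N, 2) pixel coordinates (row, col);
    depth_map: (H, W) metric depth; max_depth is an illustrative threshold."""
    rows, cols = keypoints[:, 0].astype(int), keypoints[:, 1].astype(int)
    depths = depth_map[rows, cols]
    keep = np.isfinite(depths) & (depths > 0) & (depths <= max_depth)
    return keypoints[keep], depths[keep]
```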

Contrary to the former approaches, where performance is the primary goal, An et al. [180] address the efficiency of place recognition, proposing an approach based on a Hierarchical Navigable Small World (HNSW) graph [36] for efficient map management; the HNSW graph guarantees low map-building and retrieval times. In the coarse stage, top candidates are retrieved from the HNSW graph by matching features extracted with MobileNetV2 [181] using the normalized scalar product [182]. The final loop candidate is obtained by matching hash codes derived from SURF features against the top candidates retrieved in the coarse stage. Liu et al. [52], on the other hand, exploit 3D point clouds instead of camera data, proposing SeqLPD, a lightweight variant of LPD-Net [49]; this approach resorts to super keyframe clusters for the coarse search and to local sequence matching for the fine search.
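The two-stage matching of [180] can be sketched as below, where a brute-force cosine search stands in for the HNSW-graph retrieval and random binary codes stand in for SURF-derived hash codes; all sizes are hypothetical.

```python
import numpy as np

def coarse_candidates(query_feat, db_feats, k=5):
    """Coarse stage: normalized scalar product (cosine similarity) between the
    query's CNN feature and all stored features; brute force replaces the
    HNSW-graph search used in [180] for illustration."""
    db_n = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    q_n = query_feat / np.linalg.norm(query_feat)
    return np.argsort(-(db_n @ q_n))[:k]

def fine_match(query_hash, db_hashes, candidates):
    """Fine stage: Hamming distance between binary hash codes (e.g., derived
    from SURF descriptors) of the query and the shortlisted candidates."""
    dists = [int(np.count_nonzero(query_hash != db_hashes[i])) for i in candidates]
    return candidates[int(np.argmin(dists))], dists

# Toy usage with random features and 64-bit hash codes (hypothetical sizes).
rng = np.random.default_rng(3)
db_feats = rng.standard_normal((1000, 1280))            # e.g., MobileNetV2 pooled features
db_hashes = rng.integers(0, 2, (1000, 64), dtype=np.uint8)
cands = coarse_candidates(db_feats[7], db_feats, k=5)
print(fine_match(db_hashes[7], db_hashes, cands))
```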

Sensor | Ref | Coarse Stage | Fine Stage | PM | Dataset
Camera | [41] | Features: RefineNet [179]; Descriptor: LoST (semantic label scores + conv5-layer feature maps) + normalization; BG: cosine distance | Features: RefineNet [179]; Descriptor: keypoints extracted from CNN layer activations; BG: spatial layout verification (semantic label consistency + weighted Euclidean distance) | Coarse: Database; Fine: Top candidates | Oxford RobotCar [1]; Synthia [183]
Camera | [63] | Descriptor: concatenation of LoST [41] + NetVLAD [24]; BG: cosine distance | Features: pre-trained CNN; Descriptor: keypoints extracted from CNN activations; BG: spatial layout consistency | Coarse: Database; Fine: Top candidates | Oxford RobotCar [1]; MLFR; Parking Lot; Residence Indoor Outdoor
Camera | [27] | Features: RefineNet (ResNet101) [179]; Descriptor: conv5 feature maps; BG: cosine distance | Filtering: out-of-range keypoints based on depth maps; Descriptor: same as coarse stage (filtered); BG: cosine distance | Coarse: Database; Fine: Top candidates | Oxford RobotCar [1]; Synthia [183] (for depth evaluation)
Camera | [180] | Features: MobileNetV2 [181]; Descriptor: final average-pooling layer; BG: nearest neighbors + normalized scalar product [182] | Features: SURF; Descriptor: hash codes; BG: Hamming distance between top candidates and SURF-based descriptor + ratio test [184] + RANSAC | Coarse: HNSW graphs; Fine: Top candidates | KITTI [142]; Malaga 2009 Parking 6L [185]; New College [186]
Camera | [176] | 1st tier - Features: HybridNet (AlexNet) [138] and Gist [177]; BG: difference scores + normalization | 2nd tier - Features: KAZE [178] and Only Look Once [40]; BG: (KAZE) sum of residual distances and difference scores + normalization. 3rd tier - Features: NetVLAD [24] and HOG [18]; BG: max(average of difference scores) | 1st tier: Database; 2nd tier: Top candidates of the 1st tier; 3rd tier: Top candidates of the 2nd tier | Nordland [187]; Berlin Kurfurstendamm [111]
3D LiDAR | [52] | Cluster search - Descriptor: lightweight variant of LPD-Net; Matching: the super keyframe whose cluster center is nearest in L2 distance is selected | Descriptor: same as coarse stage; BG: local sequence matching | Super keyframe clusters | Oxford RobotCar [1]; KITTI [142]

TABLE VII: Summary of recent works using hierarchical frameworks. BG = Belief Generation and PM = Place Mapping.

VIII Conclusion and Discussion

This paper presents a critical survey of place recognition approaches, emphasizing recent developments in deep learning frameworks, namely supervised, unsupervised, semi-supervised, parallel, and hierarchical approaches.

An overview of each of these frameworks is presented. In supervised approaches, pre-trained frameworks tend to resort to semantic information by detecting landmarks or leveraging regional activations from CNN layers. Among the end-to-end frameworks, on the other hand, the NetVLAD layer has inspired various works, which integrate this layer into deep architectures to train the model directly for place recognition using sensory data from cameras, 3D LiDARs, or RADARs. The main application of unsupervised approaches, such as GANs and autoencoders, is to address the domain translation problem. Semi-supervised approaches, which in this work refer to methods that jointly leverage supervised and unsupervised learning, also address the cross-domain problem; however, instead of translating a source domain into a target domain, they seek a descriptor space that is invariant across domains. Beyond these traditional machine learning frameworks, other frameworks have been suggested that combine multiple DL or classical ML approaches into parallel or hierarchical architectures; in particular, hierarchical approaches have generally been shown to improve performance. Until recently, the primary motivation of most published articles was to increase performance; however, recent works seek not only high performance but also efficiency.

Acknowledgments

This work has been supported by the projects MATIS-CENTRO-01-0145-FEDER-000014 and SafeForest CENTRO-01-0247-FEDER-045931, Portugal. It was also partially supported by FCT through grant UID/EEA/00048/2019.

References

  • [1] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, 1000km: The Oxford RobotCar Dataset,” The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017. [Online]. Available: http://dx.doi.org/10.1177/0278364916679498
  • [2] D. Olid, J. M. Fácil, and J. Civera, “Single-view place recognition under seasonal changes,” in PPNIV Workshop at IROS 2018, 2018.
  • [3] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, 2015.
  • [4] E. Garcia-Fidalgo and A. Ortiz, “Vision-based topological mapping and localization methods: A survey,” Robotics and Autonomous Systems, vol. 64, pp. 1–20, 2015.
  • [5] G. Kim, Y. S. Park, Y. Cho, J. Jeong, and A. Kim, “Mulran: Multimodal range dataset for urban place recognition,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 6246–6253.
  • [6] D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Posner, “The oxford radar robotcar dataset: A radar extension to the oxford robotcar dataset,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 6433–6438.
  • [7] F. Warburg, S. Hauberg, M. López-Antequera, P. Gargallo, Y. Kuang, and J. Civera, “Mapillary street-level sequences: A dataset for lifelong place recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2626–2635.
  • [8] Y. Kang, H. Yin, and C. Berger, “Test your self-driving algorithm: An overview of publicly available driving datasets and virtual testing environments,” IEEE Transactions on Intelligent Vehicles, vol. 4, no. 2, pp. 171–185, 2019.
  • [9] I. Kostavelis and A. Gasteratos, “Semantic mapping for mobile robotics tasks: A survey,” Robotics and Autonomous Systems, vol. 66, pp. 86–103, 2015.
  • [10] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
  • [11] G. Bresson, Z. Alsayed, L. Yu, and S. Glaser, “Simultaneous localization and mapping: A survey of current trends in autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 2, no. 3, pp. 194–220, 2017.
  • [12] G. Kim and A. Kim, “Scan context: Egocentric spatial descriptor for place recognition within 3d point cloud map,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 4802–4809.
  • [13] H. Yin, L. Tang, X. Ding, Y. Wang, and R. Xiong, “Locnet: Global localization in 3d point clouds for mobile vehicles,” in 2018 IEEE Intelligent Vehicles Symposium (IV), 2018, pp. 728–733.
  • [14] M. Gadd, D. De Martini, and P. Newman, “Look around you: Sequence-based radar place recognition with learned rotational invariance,” arXiv preprint arXiv:2003.04699, 2020.
  • [15] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1150–1157 vol.2.
  • [16] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in European conference on computer vision.   Springer, 2006, pp. 404–417.
  • [17] P. Neubert and P. Protzel, “Beyond holistic descriptors, keypoints, and fixed patches: Multiscale superpixel grids for place recognition in changing environments,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 484–491, 2016.
  • [18] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1.   IEEE, 2005, pp. 886–893.
  • [19] L. Fei-Fei and P. Perona, “A bayesian hierarchical model for learning natural scene categories,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, 2005, pp. 524–531 vol. 2.
  • [20] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [21] J. Yue-Hei Ng, F. Yang, and L. S. Davis, “Exploiting local features from deep networks for image retrieval,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2015, pp. 53–61.
  • [22] N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford, “On the performance of convnet features for place recognition,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2015, pp. 4297–4304.
  • [23] S. Garg, N. Suenderhauf, and M. Milford, “Don’t look back: Robustifying place categorization for viewpoint- and condition-invariant place recognition,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 3645–3652.
  • [24] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, 2018, pp. 1437–1451.
  • [25] A. Oertel, T. Cieslewski, and D. Scaramuzza, “Augmenting visual place recognition with structural cues,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5534–5541, 2020.
  • [26] J. L. Schönberger, M. Pollefeys, A. Geiger, and T. Sattler, “Semantic visual localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6896–6906.
  • [27] S. Garg, M. Babu, T. Dharmasiri, S. Hausler, N. Suenderhauf, S. Kumar, T. Drummond, and M. Milford, “Look no deeper: Recognizing places from opposing viewpoints under varying scene appearance using single-view depth estimation,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 4916–4923.
  • [28] R. Paul and P. Newman, “Fab-map 3d: Topological mapping with spatial and visual appearance,” in 2010 IEEE International Conference on Robotics and Automation.   IEEE, 2010, pp. 2649–2656.
  • [29] H. Korrapati, J. Courbon, Y. Mezouar, and P. Martinet, “Image sequence partitioning for outdoor mapping,” in 2012 IEEE International Conference on Robotics and Automation.   IEEE, 2012, pp. 1650–1655.
  • [30] Z. Hong, Y. Petillot, D. Lane, Y. Miao, and S. Wang, “Textplace: Visual place recognition and topological localization through reading scene texts,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 2861–2870.
  • [31] F. Dayoub, G. Cielniak, and T. Duckett, “Long-term experiments with an adaptive spherical view representation for navigation in changing environments,” Robotics and Autonomous Systems, vol. 59, no. 5, pp. 285–295, 2011.
  • [32] W. Churchill and P. Newman, “Experience-based navigation for long-term localisation,” The International Journal of Robotics Research, vol. 32, no. 14, pp. 1645–1661, 2013.
  • [33] M. J. Milford, G. F. Wyeth, and D. Prasser, “Ratslam: a hippocampal model for simultaneous localization and mapping,” in IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA ’04. 2004, vol. 1, 2004, pp. 403–408 Vol.1.
  • [34] L. Schaupp, M. Bürki, R. Dubé, R. Siegwart, and C. Cadena, “Oreos: Oriented recognition of 3d point clouds in outdoor scenarios,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 3255–3261.
  • [35] M. Cummins and P. Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
  • [36] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [37] E. Garcia-Fidalgo and A. Ortiz, “Hierarchical place recognition for topological mapping,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1061–1074, 2017.
  • [38] T. Morris, F. Dayoub, P. Corke, G. Wyeth, and B. Upcroft, “Multiple map hypotheses for planning and navigating in non-stationary environments,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 2765–2770.
  • [39] M. Zaffar, S. Ehsan, M. Milford, and K. McDonald-Maier, “Cohog: A light-weight, compute-efficient, and training-free visual place recognition technique for changing environments,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1835–1842, 2020.
  • [40] Z. Chen, F. Maffra, I. Sa, and M. Chli, “Only look once, mining distinctive landmarks from convnet for visual place recognition,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 9–16.
  • [41] S. Garg, N. Suenderhauf, and M. Milford, “Lost? appearance-invariant place recognition for opposite viewpoints using visual semantics,” Proceedings of Robotics: Science and Systems XIV, 2018.
  • [42] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in 2012 IEEE International Conference on Robotics and Automation, 2012, pp. 1643–1649.
  • [43] S. Garg, B. Harwood, G. Anand, and M. Milford, “Delta descriptors: Change-based place representation for robust visual localization,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5120–5127, 2020.
  • [44] H. Zhang, F. Han, and H. Wang, “Robust multimodal sequence-based loop closure detection via structured sparsity.” in Robotics: Science and systems, 2016.
  • [45] P. Gao and H. Zhang, “Long-term place recognition through worst-case graph matching to integrate landmark appearances and spatial relationships,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1070–1076.
  • [46] J. Guo, P. V. K. Borges, C. Park, and A. Gawel, “Local descriptor for robust place recognition using lidar intensity,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1470–1477, 2019.
  • [47] P. Hansen and B. Browning, “Visual place recognition using hmm sequence matching,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2014, pp. 4549–4555.
  • [48] C. Cadena, D. Galvez-López, J. D. Tardos, and J. Neira, “Robust place recognition with stereo sequences,” IEEE Transactions on Robotics, vol. 28, no. 4, pp. 871–885, 2012.
  • [49] Z. Liu, S. Zhou, C. Suo, P. Yin, W. Chen, H. Wang, H. Li, and Y.-H. Liu, “LPD-Net: 3D point cloud learning for large-scale place recognition and environment analysis,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2831–2840.
  • [50] L. Wu and Y. Wu, “Deep supervised hashing with similar hierarchy for place recognition,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 3781–3786.
  • [51] S. M. Siam and H. Zhang, “Fast-seqslam: A fast appearance based place recognition algorithm,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 5702–5708.
  • [52] Z. Liu, C. Suo, S. Zhou, H. Wei, Y. Liu, H. Wang, and Y.-H. Liu, “Seqlpd: Sequence matching enhanced loop-closure detection based on large-scale point cloud description for self-driving vehicles,” arXiv preprint arXiv:1904.13030, 2019.
  • [53] O. Vysotska and C. Stachniss, “Effective visual place recognition using multi-sequence maps,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1730–1736, 2019.
  • [54] O. Vysotska and C. Stachniss, “Lazy data association for image sequences matching under substantial appearance changes,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 213–220, 2015.
  • [55] L. Bampis, A. Amanatiadis, and A. Gasteratos, “Encoding the description of image sequences: A two-layered pipeline for loop closure detection,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 4530–4536.
  • [56] T. Naseer, L. Spinello, W. Burgard, and C. Stachniss, “Robust visual robot localization across seasons using network flows,” in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, ser. AAAI’14.   AAAI Press, 2014, pp. 2564–2570.
  • [57] P. Yin, L. Xu, X. Li, C. Yin, Y. Li, R. A. Srivatsan, L. Li, J. Ji, and Y. He, “A multi-domain feature learning method for visual place recognition,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 319–324.
  • [58] M. Shakeri and H. Zhang, “Illumination invariant representation of natural images for visual place recognition,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 466–472.
  • [59] T. Naseer, G. L. Oliveira, T. Brox, and W. Burgard, “Semantics-aware visual localization under challenging perceptual conditions,” in 2017 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2017, pp. 2614–2620.
  • [60] Yang Liu and Hong Zhang, “Visual loop closure detection with a compact image descriptor,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 1051–1056.
  • [61] S. Lowry and M. J. Milford, “Supervised and unsupervised linear learning techniques for visual place recognition in changing environments,” IEEE Transactions on Robotics, vol. 32, no. 3, pp. 600–613, 2016.
  • [62] S. Schubert, P. Neubert, and P. Protzel, “Unsupervised learning methods for visual place recognition in discretely and continuously changing environments,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 4372–4378.
  • [63] S. Garg, N. Suenderhauf, and M. Milford, “Semantic–geometric visual place recognition: a new perspective for reconciling opposing views,” The International Journal of Robotics Research, p. 0278364919839761, 2019.
  • [64] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Gámez, “Bidirectional loop closure detection on panoramas for visual navigation,” in 2014 IEEE Intelligent Vehicles Symposium Proceedings.   IEEE, 2014, pp. 1378–1383.
  • [65] M. Angelina Uy and G. Hee Lee, “Pointnetvlad: Deep point cloud based retrieval for large-scale place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4470–4479.
  • [66] L. Yu, A. Jacobson, and M. Milford, “Rhythmic representations: Learning periodic patterns for scalable place recognition at a sublinear storage cost,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 811–818, 2018.
  • [67] D. Doan, Y. Latif, T. Chin, Y. Liu, T. Do, and I. Reid, “Scalable place recognition under appearance change for autonomous driving,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 9318–9327.
  • [68] T.-T. Do, T. Hoang, D.-K. Le Tan, A.-D. Doan, and N.-M. Cheung, “Compact hash code learning with binary deep neural network,” IEEE Transactions on Multimedia, vol. 22, no. 4, pp. 992–1004, 2019.
  • [69] Y. Latif, A. D. Doan, T. J. Chin, and I. Reid, “Sprint: Subgraph place recognition for intelligent transportation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 5408–5414.
  • [70] S. Garg and M. Milford, “Fast, compact and highly scalable visual place recognition through sequence-based matching of overloaded representations,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 3341–3348.
  • [71] M. Cummins and P. Newman, “Appearance-only slam at large scale with fab-map 2.0,” The International Journal of Robotics Research, vol. 30, no. 9, pp. 1100–1123, 2011.
  • [72] T. Cieslewski and D. Scaramuzza, “Efficient decentralized visual place recognition using a distributed inverted index,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 640–647, 2017.
  • [73] M. Mohan, D. Gálvez-López, C. Monteleoni, and G. Sibley, “Environment selection and hierarchical place recognition,” in 2015 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2015, pp. 5487–5494.
  • [74] K. MacTavish and T. D. Barfoot, “Towards hierarchical place recognition for long-term autonomy,” in ICRA Workshop on Visual Place Recognition in Changing Environments, 2014, pp. 1–6.
  • [75] L. Han and L. Fang, “Mild: Multi-index hashing for appearance based loop closure detection,” in 2017 IEEE International Conference on Multimedia and Expo (ICME), 2017, pp. 139–144.
  • [76] P. Hansen and B. Browning, “Visual place recognition using hmm sequence matching,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2014, pp. 4549–4555.
  • [77] R. Dubé, A. Cramariuc, D. Dugas, H. Sommer, M. Dymczyk, J. Nieto, R. Siegwart, and C. Cadena, “Segmap: Segment-based mapping and localization using data-driven descriptors,” The International Journal of Robotics Research, vol. 39, no. 2-3, pp. 339–355, 2020.
  • [78] D. L. Rizzini, F. Galasso, and S. Caselli, “Geometric relation distribution for place recognition,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 523–529, 2019.
  • [79] C. Premebida, D. R. Faria, and U. Nunes, “Dynamic bayesian network for semantic place classification in mobile robotics,” Autonomous Robots, vol. 41, no. 5, pp. 1161–1172, 2017.
  • [80] F. Cao, Y. Zhuang, H. Zhang, and W. Wang, “Robust place recognition and loop closing in laser-based slam for ugvs in urban environments,” IEEE Sensors Journal, vol. 18, no. 10, pp. 4242–4252, 2018.
  • [81] Ş. Săftescu, M. Gadd, D. De Martini, D. Barnes, and P. Newman, “Kidnapped radar: Topological radar localisation using rotationally-invariant metric learning,” arXiv preprint arXiv:2001.09438, 2020.
  • [82] T. Y. Tang, D. De Martini, D. Barnes, and P. Newman, “Rsl-net: Localising in satellite images from a radar on the ground,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1087–1094, 2020.
  • [83] T. Cieslewski, E. Stumm, A. Gawel, M. Bosse, S. Lynen, and R. Siegwart, “Point cloud descriptors for place recognition using sparse visual information,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 4830–4836.
  • [84] S. Campbell, N. O’Mahony, L. Krpalcova, D. Riordan, J. Walsh, A. Murphy, and C. Ryan, “Sensor technology in autonomous vehicles : A review,” in 2018 29th Irish Signals and Systems Conference (ISSC), 2018, pp. 1–4.
  • [85] A. Broggi, P. Grisleri, and P. Zani, “Sensors technologies for intelligent vehicles perception systems: A comparison between vision and 3d-lidar,” in 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), 2013, pp. 887–892.
  • [86] C. Cadena, D. Gálvez-López, F. Ramos, J. D. Tardós, and J. Neira, “Robust place recognition with stereo cameras,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2010, pp. 5182–5189.
  • [87] T. Morris, F. Dayoub, P. Corke, G. Wyeth, and B. Upcroft, “Multiple map hypotheses for planning and navigating in non-stationary environments,” in 2014 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2014, pp. 2765–2770.
  • [88] D. Han, Y. Hwang, N. Kim, and Y. Choi, “Multispectral domain invariant image for retrieval-based place recognition,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 9271–9277.
  • [89] T. Fischer and M. J. Milford, “Event-based visual place recognition with ensembles of temporal windows,” IEEE Robotics and Automation Letters, 2020.
  • [90] H. Yu, H.-W. Chae, and J.-B. Song, “Place recognition based on surface graph for a mobile robot,” in 2017 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI).   IEEE, 2017, pp. 342–346.
  • [91] D. Scaramuzza, F. Fraundorfer, and M. Pollefeys, “Closing the loop in appearance-guided omnidirectional visual odometry by using vocabulary trees,” Robotics and Autonomous Systems, vol. 58, no. 6, pp. 820–827, 2010.
  • [92] L. Sless, G. Cohen, B. E. Shlomo, and S. Oron, “Self supervised occupancy grid learning from sparse radar for autonomous driving,” arXiv preprint arXiv:1904.00415, 2019.
  • [93] J. Dickmann, J. Klappstein, M. Hahn, N. Appenrodt, H.-L. Bloecher, K. Werber, and A. Sailer, “Automotive radar the key technology for autonomous driving: From detection and ranging to environmental understanding,” in 2016 IEEE Radar Conference (RadarConf).   IEEE, 2016, pp. 1–6.
  • [94] N. Merrill and G. Huang, “Lightweight unsupervised deep loop closure,” arXiv preprint arXiv:1805.07703, 2018.
  • [95] S. Saftescu, M. Gadd, D. D. Martini, D. Barnes, and P. Newman, “Kidnapped radar: Topological radar localisation using rotationally-invariant metric learning,” in 2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020.   IEEE, 2020, pp. 4358–4364. [Online]. Available: https://doi.org/10.1109/ICRA40945.2020.9196682
  • [96] J. Yu, C. Zhu, J. Zhang, Q. Huang, and D. Tao, “Spatial pyramid-enhanced netvlad with weighted triplet loss for place recognition,” IEEE transactions on Neural Networks and Learning Systems, vol. 31, no. 2, pp. 661–674, 2019.
  • [97] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [98] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
  • [99] H. Jegou, F. Perronnin, M. Douze, J. Sánchez, P. Perez, and C. Schmid, “Aggregating local image descriptors into compact codes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–1716, 2011.
  • [100] H. Jégou and O. Chum, “Negative evidences and co-occurences in image retrieval: The benefit of pca and whitening,” in European conference on computer vision.   Springer, 2012, pp. 774–787.
  • [101] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in European conference on computer vision.   Springer, 2008, pp. 304–317.
  • [102] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in 2007 IEEE conference on computer vision and pattern recognition.   IEEE, 2007, pp. 1–8.
  • [103] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
  • [104] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, and E. Romera, “Fusion and binarization of cnn features for robust topological localization across seasons,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 4656–4663.
  • [105] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in Proceedings of the British Machine Vision Conference.   BMVA Press, 2014.
  • [106] X. Yang and K.-T. T. Cheng, “Local difference binary for ultrafast and distinctive feature description,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 188–194, 2013.
  • [107] H. Badino, D. Huber, and T. Kanade, “Real-time topometric localization,” in 2012 IEEE International Conference on Robotics and Automation.   IEEE, 2012, pp. 1635–1642.
  • [108] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
  • [109] A. J. Glover, W. P. Maddern, M. J. Milford, and G. F. Wyeth, “Fab-map + ratslam: Appearance-based slam for multiple times of day,” in 2010 IEEE International Conference on Robotics and Automation, 2010, pp. 3507–3512.
  • [110] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018.
  • [111] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” Proceedings of Robotics: Science and Systems XII, 2015.
  • [112] S. Dasgupta, “Experiments with random projection,” in Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, ser. UAI ’00.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, p. 143–151.
  • [113] Y. Kong, W. Liu, and Z. Chen, “Robust convnet landmark-based visual place recognition by optimizing landmark matching,” IEEE Access, vol. 7, pp. 30 754–30 767, 2019.
  • [114] M. Cheng, Z. Zhang, W. Lin, and P. Torr, “Bing: Binarized normed gradients for objectness estimation at 300fps,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286–3293.
  • [115] E. Bingham and H. Mannila, “Random projection in dimensionality reduction: applications to image and text data,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001, pp. 245–250.
  • [116] G. L. Oliveira, W. Burgard, and T. Brox, “Efficient deep models for monocular road segmentation,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 4885–4891.
  • [117] D. Achlioptas, “Database-friendly random projections: Johnson-lindenstrauss with binary coins,” Journal of computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
  • [118] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3213–3223.
  • [119] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtualworlds as proxy for multi-object tracking analysis,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4340–4349.
  • [120] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [121] J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in Proceedings of the IEEE International Conference on Computer Vision.   IEEE, 2003, p. 1470.
  • [122] A. Khaliq, S. Ehsan, Z. Chen, M. Milford, and K. McDonald-Maier, “A holistic visual place recognition approach using lightweight cnns for significant viewpoint and appearance changes,” IEEE Transactions on Robotics, vol. 36, no. 2, pp. 561–569, 2020.
  • [123] Z. Chen, F. Maffra, I. Sa, and M. Chli, “Only look once, mining distinctive landmarks from convnet for visual place recognition,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 9–16.
  • [124] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [125] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Advances in neural information processing systems, 2014, pp. 487–495.
  • [126] D. Ravichandran, P. Pantel, and E. Hovy, “Randomized algorithms and nlp: Using locality sensitive hash functions for high speed noun clustering,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 2005, pp. 622–629.
  • [127] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European conference on computer vision.   Springer, 2014, pp. 391–405.
  • [128] A. Torii, J. Sivic, M. Okutomi, and T. Pajdla, “Visual place recognition with repetitive structures,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 11, 2015, pp. 2346–2359.
  • [129] C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl, “Sampling matters in deep embedding learning,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2859–2867.
  • [130] A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla, “24/7 place recognition by view synthesis,” in CVPR, 2015.
  • [131] K. Qiu, Y. Ai, B. Tian, B. Wang, and D. Cao, “Siamese-resnet: implementing loop closure detection based on siamese network,” in 2018 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2018, pp. 716–721.
  • [132] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1.   IEEE, 2005, pp. 539–546.
  • [133] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 573–580.
  • [134] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [135] X. Wang, Y. Shi, and K. M. Kitani, “Deep supervised hashing with triplet labels,” in Asian conference on computer vision.   Springer, 2016, pp. 70–84.
  • [136] W.-J. Li, S. Wang, and W.-C. Kang, “Feature learning based deep supervised hashing with pairwise labels,” arXiv preprint arXiv:1511.03855, 2015.
  • [137] S. Hausler, A. Jacobson, and M. Milford, “Filter early, match late: Improving network-based visual place recognition,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 3268–3275.
  • [138] Z. Chen, A. Jacobson, N. Sünderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford, “Deep learning features at scale for visual place recognition,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3223–3230.
  • [139] S. Appalaraju and V. Chaoji, “Image similarity using deep cnn and curriculum learning,” arXiv preprint arXiv:1709.08761, 2017.
  • [140] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Similarity-Based Pattern Recognition, A. Feragen, M. Pelillo, and M. Loog, Eds.   Cham: Springer International Publishing, 2015, pp. 84–92.
  • [141] N. Carlevaris-Bianco, A. K. Ushani, and R. M. Eustice, “University of Michigan North Campus long-term vision and LiDAR dataset,” International Journal of Robotics Research, vol. 35, no. 9, pp. 1023–1035, 2015.
  • [142] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
  • [143] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2, 2006, pp. 1735–1742.
  • [144] H. Yin, Y. Wang, X. Ding, L. Tang, S. Huang, and R. Xiong, “3d lidar-based global localization using siamese neural network,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 4, pp. 1380–1392, 2020.
  • [145] W. Chen, X. Chen, J. Zhang, and K. Huang, “Beyond triplet loss: a deep quadruplet network for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 403–412.
  • [146] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in 2010 IEEE computer society conference on computer vision and pattern recognition.   IEEE, 2010, pp. 3304–3311.
  • [147] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [148] D. Blalock, J. J. G. Ortiz, J. Frankle, and J. Guttag, “What is the state of neural network pruning?” arXiv preprint arXiv:2003.03033, 2020.
  • [149] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2017.
  • [150] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [151] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • [152] Y. Latif, R. Garg, M. Milford, and I. Reid, “Addressing challenging place recognition tasks using generative adversarial networks,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2349–2355.
  • [153] P. Yin, L. Xu, Z. Liu, L. Li, H. Salman, Y. He, W. Xu, H. Wang, and H. Choset, “Stabilize an unsupervised feature learning for lidar-based place recognition,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1162–1167.
  • [154] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2242–2251.
  • [155] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [156] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, 2005, pp. 886–893 vol. 1.
  • [157] R. Hartley and A. Zisserman, Multiple view geometry in computer vision.   Cambridge university press, 2003.
  • [158] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” arXiv preprint arXiv:1703.05192, 2017.
  • [159] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, “Multispectral pedestrian detection: Benchmark dataset and baseline,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1037–1045.
  • [160] Z. Wang, J. Li, S. Khademi, and J. van Gemert, “Attention-aware age-agnostic visual place recognition,” in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019, pp. 1437–1446.
  • [161] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
  • [162] L. Tang, Y. Wang, Q. Luo, X. Ding, and R. Xiong, “Adversarial feature disentanglement for place recognition across changing appearance,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1301–1307.
  • [163] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2813–2821.
  • [164] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” British Machine Vision Association, 2015.
  • [165] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Generative and discriminative voxel modeling with convolutional neural networks,” In: Workshop on 3D Deep Learning, NIPS, 2016.
  • [166] S. Hausler, A. Jacobson, and M. Milford, “Multi-process fusion: Visual place recognition using multiple image processing methods,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1924–1931, 2019.
  • [167] A. Jacobson, Z. Chen, and M. Milford, “Leveraging variable sensor spatial acuity with a homogeneous, multi-scale place recognition framework,” Biological cybernetics, vol. 112, no. 3, pp. 209–225, 2018.
  • [168] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, and E. Romera, “Towards life-long visual localization using an efficient matching of binary sequences from images,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 6328–6335.
  • [169] Y. Latif, G. Huang, J. Leonard, and J. Neira, “An online sparsity-cognizant loop-closure algorithm for visual navigation,” in Proceedings of Robotics: Science and Systems, Berkeley, USA, July 2014.
  • [170] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [171] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “Orb-slam: A versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [172] E. Pepperell, P. I. Corke, and M. J. Milford, “All-environment visual place recognition with smart,” in 2014 IEEE international conference on robotics and automation (ICRA).   IEEE, 2014, pp. 1612–1618.
  • [173] T. Naseer, W. Burgard, and C. Stachniss, “Robust visual localization across seasons,” IEEE Transactions on Robotics, vol. 34, no. 2, pp. 289–302, 2018.
  • [174] C. McManus, B. Upcroft, and P. Newman, “Scene signatures: Localised and point-less features for localisation,” in Proceedings of Robotics: Science and Systems, Berkeley, USA, July 2014.
  • [175] Z. Chen, L. Liu, I. Sa, Z. Ge, and M. Chli, “Learning context flexible attention model for long-term visual place recognition,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4015–4022, 2018.
  • [176] S. Hausler and M. Milford, “Hierarchical multi-process fusion for visual place recognition,” in 2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020.   IEEE, 2020, pp. 3327–3333. [Online]. Available: https://doi.org/10.1109/ICRA40945.2020.9197360
  • [177] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International journal of computer vision, vol. 42, no. 3, pp. 145–175, 2001.
  • [178] P. F. Alcantarilla, A. Bartoli, and A. J. Davison, “Kaze features,” in European Conference on Computer Vision.   Springer, 2012, pp. 214–227.
  • [179] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
  • [180] S. An, G. Che, F. Zhou, X. Liu, X. Ma, and Y. Chen, “Fast and incremental loop closure detection using proximity graphs,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 378–385.
  • [181] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
  • [182] J. Sivic and A. Zisserman, “Efficient visual search of videos cast as text retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 591–606, 2008.
  • [183] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3234–3243.
  • [184] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [185] J.-L. Blanco, F.-A. Moreno, and J. Gonzalez, “A collection of outdoor robotic datasets with centimeter-accuracy ground truth,” Autonomous Robots, vol. 27, no. 4, p. 327, 2009.
  • [186] M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman, “The new college vision and laser data set,” The International Journal of Robotics Research, vol. 28, no. 5, pp. 595–599, 2009.
  • [187] S. Niko, P. Neubert, and P. Protzel, “Are we there yet? challenging SeqSLAM on a 3000 km journey across all four seasons,” in Proc. of IEEE International Conference on Robotics and Automation Workshops, 2013.