
A Hierarchical Dual Model of Environment- and Place-Specific Utility for Visual Place Recognition

by   Nikhil Varma Keetha, et al.

Visual Place Recognition (VPR) approaches have typically attempted to match places by identifying visual cues, image regions or landmarks that have high “utility” in identifying a specific place. But this concept of utility is not singular; rather, it can take a range of forms. In this paper, we present a novel approach to deduce two key types of utility for VPR: the utility of visual cues ‘specific’ to an environment, and to a particular place. We employ contrastive learning principles to estimate both the environment- and place-specific utility of Vector of Locally Aggregated Descriptors (VLAD) clusters in an unsupervised manner, which is then used to guide local feature matching through keypoint selection. By combining these two utility measures, our approach achieves state-of-the-art performance on three challenging benchmark datasets, while simultaneously reducing the required storage and compute time. We provide further analysis demonstrating that unsupervised cluster selection yields semantically meaningful results and that finer-grained categorization often has higher utility for VPR than high-level semantic categorization (e.g. building, road), and we characterise how these two utility measures vary across different places and environments. Source code is made publicly available at





Code Repositories


Code for the RA-L (IROS) 2021 paper "A Hierarchical Dual Model of Environment- and Place-Specific Utility for Visual Place Recognition"


I Introduction

Mobile robot localization can be challenging due to extreme variations in scene appearance and camera viewpoint, which affect key robot capabilities including semantic scene understanding and Visual Place Recognition (VPR). Several solutions have been proposed in the literature to improve VPR, including contrastive representation learning [1, 2], domain translation [3], sequential matching [4, 5], semantic saliency [6, 7] and hierarchical matching [8, 9]. Many of these methods tend to learn salient visual cues that can improve VPR, but which are typically non-interpretable [1, 10], or in the case of those based on explicit semantics, require human input for interpretation and performance enhancement [6, 11, 7].

Fig. 1: We propose Environment-Specific (ES) and Place-Specific (PS) utility estimation methods that determine the relevance of unique visual cues in a reference map. Our combined ES and PS utility, estimated from the global NetVLAD descriptors, guides SuperPoint’s keypoint selection (cyan mask) to obtain correct feature correspondences, where vanilla SuperPoint fails due to matches found on pedestrians and vehicles.

In this work, we propose novel visual utility estimation techniques that not only lead to state-of-the-art VPR performance, but also offer semantic interpretability. For example, they enable insights into the role of coarse (buildings) vs fine-grained semantics (parts of buildings), as well as the varying relevance of a particular semantic class (road) in different environmental contexts (city vs rail traverse). We present a novel hierarchical VPR pipeline that uses global descriptors to guide local feature matching in a more unified manner. Moving beyond existing hierarchical VPR methods, which only pass candidate hypotheses from the coarse to the fine stage, we also estimate the visual utility of different elements in the scene through a VLAD-based global descriptor (NetVLAD [1]). In particular, we use the cluster-level VLAD representations to estimate a cluster’s utility in an environment-specific and a place-specific manner for a given reference map. Here, environment-specific refers to estimating utility at a global level applicable to all the places, whereas place-specific refers to estimating utility at a local level applicable to that particular place only. Without requiring any special iterative training, the proposed method is tested on different environments, where we show that the combination of global environment-specific and local place-specific utility leads to informed local feature matching and reduces storage and compute requirements (see Fig 1).

We make the following specific contributions:

  • an unsupervised method for estimating the global environment-specific (ES) and local place-specific (PS) utility of visual elements represented as VLAD clusters;

  • a more unified hierarchical global-to-local VPR pipeline where utility estimated from global descriptors guides local feature matching;

  • a combined ES and PS utility-based method which achieves state-of-the-art VPR performance while offering reduced storage and compute time properties; and

  • a ‘bridge’ between human semantics and automated segmentation-based understanding of visual relevance for VPR, achieved through several visualizations and qualitative insights.

II Related Work

II-A Global and Local Descriptors for VPR

VPR is commonly posed as an image retrieval problem, where an image is described by a global descriptor or a set of local descriptors and keypoints to match with other images. Recent surveys [12, 13, 14] have reviewed the many representations used to describe images for VPR, ranging from hand-crafted features such as SIFT [15] to learned global descriptors such as NetVLAD [1], AP-GeM [2], and DeLG [16]; local descriptors such as SuperPoint [10] and DeLF [16]; and local matchers such as SuperGlue [17].

Furthermore, hierarchical approaches have been used in several VPR and SLAM pipelines, where global descriptors are used to retrieve top candidates and local feature matching is used to obtain the best match amongst the top candidates [8, 6, 18, 19]. One such approach, HF-Net [9], proposed a ‘monolithic’ CNN to simultaneously learn global NetVLAD descriptors and local SuperPoint features for 6-DoF localization. However, in such hierarchical approaches, the global descriptors are not used to guide the local feature matching, and are limited to providing top matching candidates. In this context, our proposed hierarchical method uses unsupervised utility estimated from global descriptors to guide local feature matching, and is applicable to any existing hierarchical VPR method.

II-B Visual Feature Selection

The VPR problem has been posed in many ways ranging from a classification task [20, 16] to contrastive learning [1, 2, 21]. Most methods share the core intuition of attempting to represent images of the same place similarly. However, environments are filled with distractors, and so much work has attempted to automatically learn and identify the areas of an image with the most utility for the VPR task.

Semantics Based

As surveyed recently [22], visual semantics is an emerging area of research in the field of robotics with huge potential for VPR and localization. A number of methods have demonstrated the use of semantic information or distinctive and informative visual elements for improving VPR [23, 24, 25, 26, 7, 6, 11]. However, these methods rely largely on human-based semantic categories, where relevant information is retained based on human intuition of the semantic classes, for example, buildings [7], roads [6, 11], lanes [23] and the skyline [25]. Such approaches also tend to require segmentation masks or supervision involving semantic labels to endow the system with higher-level semantic knowledge [7, 11, 6]. More recently, [27] explored using more fine-grained semantic categories beyond those derived from humans, showing promising potential.

Region and Attention Based

Past methods have adopted region-based approaches, including grid-based region selection, region proposal networks, hashing-based landmark detection, and Convolutional Neural Network (CNN) activation-based region extraction [28, 29, 30, 31]. Other approaches involve using attention to weigh image features based on a relevance criterion [32]. However, even though semantics and saliency are strongly intertwined, there are no methods that leverage learned descriptors’ inherent semantic properties for this problem. In this context, our proposed framework estimates the utility of NetVLAD [1] clusters, leveraging their intrinsic semantic properties.

Place-Specific Feature Selection

Place-specific learning has been explored previously [33, 34, 35], but these methods require significant training. [36] proposed an image-specific and spatially-localized detection of confusing features using local term frequency-inverse document frequency (tf-idf) weighting; however, it does not consider a simultaneous global environment-level utility as proposed in this work. Furthermore, we show that our method can be employed under challenging appearance conditions and is not limited to city-like environments.

Fig. 2: Schematic of our proposed approach. In the offline stage, global and local descriptors are extracted from the reference database images, and environment-specific (ES) and place-specific (PS) utility is estimated to further filter the local keypoints. During the online localization stage, for a given query, the top C matching candidates are retrieved from the reference database using global descriptor matching. The final place match is then obtained through local feature matching of query image features with the high utility features of the candidates.

III Proposed Approach

In this section, we first present the proposed unsupervised techniques to estimate environment- and place-specific feature utility for VPR. We then describe our unified approach to hierarchical VPR, where the coarse global descriptor matching stage guides the fine local feature matching stage via keypoint filtering based on utility estimates (see Fig 2).

III-A Feature Utility Estimation

VLAD-based place representations [37, 38, 1, 6] have been demonstrated to achieve high-performance VPR, in particular the recent deep-learning-based adaptations such as NetVLAD [1]. Our approach here is motivated by the observation that the cluster assignment of NetVLAD descriptors has inherent semantic properties, whose level of detail varies with the number of clusters. For example, a NetVLAD descriptor with $K$ clusters results in a cluster assignment that is analogous to human-based semantics while also comprising fine-grained segmentation of typical broad semantic classes like buildings and roads. (The number of clusters was chosen as being the closest to the typical number of semantic classes for road-based datasets such as Cityscapes [39]; please refer to Section V-A2 for an ablation study on vocabulary size.) These observations lead to the hypothesis we pursue here: instead of comparing the full concatenated VLAD descriptor, if we compare the aggregated residuals at cluster level, the distribution of cluster-wise distances within the reference map can be used to estimate that particular cluster’s utility for VPR. The second component of this hypothesis is that clusters with lower cluster-wise distances tend to cause high perceptual aliasing. We formulate this procedure in accordance with the well-established max-margin-based contrastive learning regime, as described in the following subsections.

III-A1 Maximizing cluster-wise margins

Contrastive learning has been demonstrated to achieve state-of-the-art performance for representing places [1, 2, 21]. With the use of triplets, that is, an anchor ($a$), a positive ($p$) and a negative ($n$), descriptors are typically learnt such that the margin, $m$, between the anchor-negative distance and the anchor-positive distance is maximized.


In this work, we maximize this margin in a non-iterative manner since only the reference map traverse is used, unlike the typical use of multiple place views [1, 2]. We further adapt the computation of this margin with two key points: the triplets are represented with cluster-wise aggregated residuals instead of considering a full concatenated VLAD vector and the anchor-positive distance in our case approaches zero, since we only consider a single traverse where nearby images are similar. The margin is thus computed for each cluster independently using only the anchor-negative distances, and is referred to as utility from here. Thus, the higher the utility of a cluster, the lower the perceptual aliasing it causes. This process can be used at both the environment-level (that is, globally across the full reference map) and place-level (that is, specific to individual local places), respectively referred to as environment-specific and place-specific, as discussed further here.
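As a worked illustration of the adaptation described above, the margin can be written per cluster. The notation here is an assumption introduced for illustration: $d(\cdot,\cdot)$ is a descriptor distance, $a$, $p$ and $n$ are the anchor, positive and negative images, and $V_k(\cdot)$ is the aggregated residual of cluster $k$.

```latex
% Standard triplet margin on full descriptors
m = d(a, n) - d(a, p)

% Cluster-wise margin; the anchor-positive term vanishes on a single traverse
m_k = d\big(V_k(a), V_k(n)\big) - d\big(V_k(a), V_k(p)\big)
    \approx d\big(V_k(a), V_k(n)\big)
    \quad \text{as } d\big(V_k(a), V_k(p)\big) \to 0
```

Under this reading, a cluster whose residuals lie far from those of the negatives has a large margin, and hence high utility.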

III-A2 Place-Specific Utility

High-utility clusters tend to indicate salient areas of a particular place. Based on this observation, we formulate a per cluster place-specific utility estimation which, when sorted, gives us a relative saliency ranking for each cluster at the particular place in the reference map.

For a given image $I_i$ (considered as a unique place) from a particular geographical location in the reference map $R$, a positive localization radius $r_p$ exists where places (images) within this radius are considered positives. Also, a non-negative localization radius $r_n$ is considered, beyond which all places (images) are considered negatives. Let us suppose that there are $N$ negatives in $R$ for anchor image $I_i$, such that $I_j$, $j = 1, \dots, N$, represents each negative. Then the place-specific utility $U^{PS}_k(I_i)$ of cluster $k$, given that it exists, at that particular place with anchor image $I_i$ in $R$ can be formulated as:

$$U^{PS}_k(I_i) = \frac{1}{N} \sum_{j=1}^{N} \lVert V_k(I_i) - V_k(I_j) \rVert_2$$

where $V_k(\cdot)$ represents the sum of residuals for the cluster $k$.
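A minimal Python sketch of the per-cluster ranking described above. It assumes the utility is the mean Euclidean distance between the anchor's cluster-level residual and that of each negative; the aggregation, the function names and the toy data are illustrative assumptions, and clusters that are absent from an image are not handled.

```python
import math

def place_specific_utility(anchor, negatives):
    """Per-cluster utility for one reference image.

    anchor:    list of K cluster-level residual vectors of the anchor image.
    negatives: list of images, each a list of K residual vectors.
    Returns K values: the mean anchor-negative distance per cluster
    (the mean is an assumed aggregation for this sketch).
    """
    utilities = []
    for k, anchor_k in enumerate(anchor):
        dists = [math.dist(anchor_k, neg[k]) for neg in negatives]
        utilities.append(sum(dists) / len(dists))
    return utilities

def rank_clusters(utilities):
    """Cluster indices sorted from highest to lowest utility."""
    return sorted(range(len(utilities)), key=lambda k: utilities[k], reverse=True)

# Toy example: 3 clusters with 2-D residuals and two negative images.
anchor = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
negatives = [
    [[0.0, 0.1], [4.0, 4.0], [0.0, 1.0]],
    [[0.1, 0.0], [1.0, 5.0], [1.0, 2.0]],
]
ps_utility = place_specific_utility(anchor, negatives)
ps_ranking = rank_clusters(ps_utility)  # most salient cluster first
```

Cluster 0 barely changes between the anchor and its negatives, so it ranks last, which matches the intuition that low cluster-wise distances indicate perceptual aliasing.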

III-A3 Environment-Specific Utility

At an environment-level, clusters with a low variance in residual values across the reference map contain objects with high perceptual aliasing for that particular environment. We hypothesize that such clusters vary by property of the specific environment and such global utility can guide the place-specific utility to avoid non-relevant clusters for that environment while also preventing transient errors.

Considering all places in the reference map $R$ (unlike SLAM, for global re-localization the map size is known a priori), the environment-specific utility, $U^{ES}_k$, of a cluster $k$ is formulated as:

$$U^{ES}_k = \frac{1}{|R|^2} \sum_{i=1}^{|R|} \sum_{j=1}^{|R|} \lVert V_k(I_i) - V_k(I_j) \rVert_2$$

where $V_k(I_i)$ represents the cluster-level representation of a place, that is, the sum of residuals, for cluster $k$. Once the utility values for all the $K$ clusters are determined, k-means segregation with k=2 is used to divide the clusters into two bins (the terms ‘segregation’ and ‘bins’ are used instead of clustering and clusters for k-means intentionally, to avoid confusion with VLAD clusters). VLAD clusters falling in the bin with high utility are regarded as environment-specific high utility clusters, while others are discarded as dustbin clusters.
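The two steps above (a per-cluster utility over the whole reference map, followed by a two-bin split) can be sketched as follows. This is an illustration only: it assumes the utility is the mean pairwise Euclidean distance between cluster-level residuals of all places, and it uses a small hand-written one-dimensional k-means with two bins. All names and the toy data are hypothetical.

```python
import math

def environment_specific_utility(places):
    """places: list of images, each a list of K cluster-level residual vectors.
    Returns K values: mean distance over all ordered pairs of places
    (an assumed aggregation for this sketch)."""
    n, num_clusters = len(places), len(places[0])
    utilities = []
    for k in range(num_clusters):
        total = sum(math.dist(places[i][k], places[j][k])
                    for i in range(n) for j in range(n))
        utilities.append(total / (n * n))
    return utilities

def high_utility_bin(values, max_iters=50):
    """1-D k-means with two bins; returns indices falling in the higher bin."""
    lo, hi = min(values), max(values)
    for _ in range(max_iters):
        high = [v for v in values if abs(v - hi) < abs(v - lo)]
        low = [v for v in values if abs(v - hi) >= abs(v - lo)]
        if not high or not low:
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centres stopped moving
        lo, hi = new_lo, new_hi
    return {k for k, v in enumerate(values) if abs(v - hi) < abs(v - lo)}

# Toy reference map: 3 places, 3 clusters with 1-D residuals.
places = [
    [[0.0], [0.0], [0.0]],
    [[0.1], [3.0], [5.0]],
    [[0.2], [6.0], [10.0]],
]
es_utility = environment_specific_utility(places)
es_high = high_utility_bin(es_utility)  # remaining clusters act as dustbin clusters
```

In the toy map, cluster 0 looks nearly identical everywhere and therefore lands in the low-utility bin.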

III-B Unified Hierarchical Visual Place Recognition

When local feature matching pipelines compare local descriptors, they can struggle to discard distractors within an image or areas of the image with high perceptual aliasing. We propose to increase their robustness to aliasing by filtering the local descriptors and keypoints based on our place-specific and environment-specific utility estimated from global descriptors, with an additional benefit being the reduced storage and compute requirements. Distinct from existing work, our hierarchical VPR better unifies the global and local feature stages as the keypoint utility is directly estimated through the utility of VLAD clusters of the global descriptors.

III-B1 Global Descriptor Matching

We use the NetVLAD representation as our global descriptor. A query image descriptor is matched with the reference global descriptors using Euclidean distance to retrieve the top $C$ matching candidates.
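A minimal sketch of this retrieval step, assuming the global descriptors are plain lists of floats; the function name and the toy descriptors are hypothetical.

```python
import math

def top_candidates(query, references, c):
    """Indices of the c reference descriptors closest to the query
    in Euclidean distance, nearest first."""
    ranked = sorted((math.dist(query, ref), idx) for idx, ref in enumerate(references))
    return [idx for _, idx in ranked[:c]]

# Toy example with 2-D descriptors.
query = [0.0, 0.0]
references = [[3.0, 4.0], [1.0, 0.0], [0.0, 0.5], [10.0, 10.0]]
candidates = top_candidates(query, references, c=2)
```

Only these candidates are passed on to the more expensive local feature matching stage.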

In order to obtain a cluster-level segmentation mask, we replace NetVLAD’s differentiable soft cluster assignment with its original counterpart of hard cluster assignment, leading to a mask of size $H' \times W'$ corresponding to the spatial dimensions of NetVLAD’s last convolutional layer tensor. This cluster assignment mask is then rescaled to the original image size $H \times W$. These cluster assignments along with the top candidate matches are then passed on to the local feature matching stage.
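The mask construction can be illustrated with a short sketch. It assumes hard assignment means picking the nearest cluster centre for each spatial location of the feature map, and it uses nearest-neighbour upscaling so that cluster labels stay discrete; both choices, and all names, are assumptions made for illustration.

```python
import math

def hard_assign(features, centroids):
    """features: grid (rows x cols) of local descriptors from the last conv layer.
    Returns a grid of the same shape holding the index of the nearest centroid."""
    return [[min(range(len(centroids)), key=lambda k: math.dist(f, centroids[k]))
             for f in row] for row in features]

def upscale_nearest(mask, out_h, out_w):
    """Rescale a label mask to (out_h, out_w) by nearest-neighbour lookup."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

# Toy example: a 2x2 feature map, two centroids, upscaled to a 4x4 image.
centroids = [[0.0, 0.0], [1.0, 1.0]]
features = [[[0.1, 0.0], [0.9, 1.0]],
            [[1.0, 0.8], [0.0, 0.2]]]
small_mask = hard_assign(features, centroids)
full_mask = upscale_nearest(small_mask, 4, 4)
```

Interpolating labels (for example bilinearly) would produce meaningless fractional cluster indices, which is why the sketch uses nearest-neighbour lookup.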

III-B2 Local Feature Matching

For local feature matching, we use SuperPoint [10] (SP) descriptors/keypoints. Given the reference image database $R$, the cluster assignment mask from the global descriptors and the keypoint spatial locations, we employ the place-specific and environment-specific utility for filtering SP descriptors/keypoints in the following manner:

a) Environment-Specific keypoint filtering: Based on the unsupervised environment-specific utility estimation, we select the SP keypoints corresponding to environment-specific high utility clusters.

b) Place-Specific keypoint filtering: Once the place-specific utility of each cluster for an image $I_i$ in the reference map $R$ is estimated, the utility values of all the clusters are sorted to obtain a relative cluster saliency ranking for that specific image $I_i$. Based on this place-specific cluster saliency ranking, we use the Top X clusters to select SP keypoints.

c) Combined keypoint filtering: We propose a combination of the environment-specific and place-specific utility approaches. Initially, the SP keypoints of an image are subsampled using the Top X clusters formulation, and then a further filtering is performed to obtain keypoints belonging only to environment-specific high utility clusters.
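The three filtering modes above can be sketched in a few lines, assuming keypoints are integer `(x, y)` pixel coordinates and the cluster mask has already been rescaled to image size. The function names and toy values are hypothetical.

```python
def filter_keypoints(keypoints, mask, keep_clusters):
    """Keep keypoints whose pixel falls in one of the given clusters.
    mask[y][x] holds the cluster index of pixel (x, y)."""
    return [(x, y) for x, y in keypoints if mask[y][x] in keep_clusters]

def combined_filter(keypoints, mask, ps_ranking, es_high, top_x):
    """Top-X place-specific selection followed by the environment-specific check."""
    top_clusters = set(ps_ranking[:top_x])
    subsampled = filter_keypoints(keypoints, mask, top_clusters)
    return filter_keypoints(subsampled, mask, set(es_high))

# Toy example: a 2x3 image with three clusters.
mask = [[0, 1, 2],
        [2, 1, 0]]
keypoints = [(0, 0), (1, 0), (2, 0), (0, 1), (2, 1)]
ps_ranking = [2, 0, 1]   # place-specific ranking, best cluster first
es_high = {0, 1}         # environment-specific high utility clusters

es_only = filter_keypoints(keypoints, mask, es_high)
ps_only = filter_keypoints(keypoints, mask, set(ps_ranking[:2]))
combined = combined_filter(keypoints, mask, ps_ranking, es_high, top_x=2)
```

Because fewer keypoints survive each filter, the reference map stores fewer local descriptors, which is where the storage and compute savings come from.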

Once the filtered SP keypoints for all the images in the reference map are obtained, we consider two state-of-the-art feature matching pipelines to match the SP descriptors of the query with the filtered SP descriptors of the top reference candidates obtained through global descriptor matching:

i) Based on SuperPoint’s matching pipeline [10], the descriptors of query and reference image are first matched using absolute Euclidean distance-based Nearest Neighbor (NN) search with mutual NN cross-check, which is followed by geometric verification using RANSAC based homography with a pixel threshold of .

ii) Based on SuperGlue’s matching pipeline [17], a graph neural network takes as input SuperPoint keypoints and descriptors to produce inliers between an image pair.
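For pipeline (i), the nearest-neighbour search with mutual cross-check can be sketched as below. The sketch covers only the descriptor matching step; the RANSAC-based geometric verification that follows it is omitted, and the function name and toy descriptors are hypothetical.

```python
import math

def mutual_nn_matches(desc_q, desc_r):
    """Return (query_idx, ref_idx) pairs that are each other's nearest
    neighbour in Euclidean distance."""
    nn_q = [min(range(len(desc_r)), key=lambda j: math.dist(q, desc_r[j]))
            for q in desc_q]
    nn_r = [min(range(len(desc_q)), key=lambda i: math.dist(r, desc_q[i]))
            for r in desc_r]
    return [(i, j) for i, j in enumerate(nn_q) if nn_r[j] == i]

# Toy example: the last query descriptor is close to reference 0,
# but reference 0 prefers query 0, so that pair is rejected.
desc_q = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.3, 0.0]]
desc_r = [[0.1, 0.0], [5.0, 5.2], [0.9, 1.0]]
matches = mutual_nn_matches(desc_q, desc_r)
```

The cross-check removes one-sided matches before geometric verification, which keeps the later homography estimation from being dominated by ambiguous correspondences.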

For a matched image pair, the match score is computed from $N_i$, $N_q$ and $N_r$: the number of inliers, the number of SP keypoints in the query, and the number of filtered SP keypoints in the candidate reference image, respectively. Amongst the top candidates, the candidate with the highest match score is selected as the final match.

IV Experimental Setup

IV-A Datasets

We used three widely used benchmark datasets to evaluate our proposed approach: Berlin Kudamm [28, 31], Oxford Robot Car [40], and Nordland [41]. All the datasets present challenging scenarios for VPR in terms of substantial viewpoint shift and drastic shift in visual appearance due to seasonal cycles or time of day, as described below:

IV-A1 Berlin Kudamm

This dataset is downloaded from the crowd-sourced photo mapping platform Mapillary where two different perspectives of the same route are captured. In this dataset, confusing objects and dynamic distractors such as vehicles and pedestrians with homogeneous scenes lead to perceptual aliasing. The substantial viewpoint shift in particular adds to the complexity. The total traverse span is about Km, where the reference traverse contains frames and the query traverse has frames. In both the traverses, all the frames are geotagged.

IV-A2 Oxford RobotCar

This dataset contains traverses of Oxford city captured during different seasonal cycles and times of the day. We use a subsampled version of the Overcast Summer and Autumn Night traverses (originally 2015-03-17-11-08-44 and 2014-12-16-18-44-24 in [40]). We use GPS data to subsample the original data to obtain a total traverse span of Km, resulting in a total of frames in summer and frames during the night, with frame spacing of approximately - meters. The Overcast Summer and Autumn Night traverses provide a drastic shift in visual appearance due to season and time of day.

IV-A3 Nordland

This dataset captures a km train journey during different seasonal cycles. We use the Summer and Winter traverse for our experiments. We use the first images from both the traverses, which are uniformly subsampled to obtain images for the reference summer traverse and for the query winter traverse. The combination of widespread vegetation and occasional unique objects in this dataset presents a challenging scenario on top of extreme appearance variations.

IV-B Evaluation

We use Recall as the performance metric since the output of a VPR system can be typically employed for precise 6-DoF SLAM/localization [42]. For a given localization radius, Recall is defined as the ratio of correctly retrieved queries within the top K predictions to the total number of queries. We use a ground truth localization radius of meters, meters and frame respectively for Berlin, Oxford and Nordland datasets. For our place-specific utility estimation, we use top clusters and for the combined environment- and place-specific system, we use top clusters, where is the number of useful clusters determined by the environment-specific system. We provide full parameter sweeps in the results section for sensitivity analysis.
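The Recall definition above can be made concrete with a small sketch. For simplicity it assumes each place is identified by a scalar position (a frame index, or metres along the route), so that the localization radius is a simple absolute difference; the function name and the toy values are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k, radius):
    """Fraction of queries with at least one of the top-k predictions
    within `radius` of the ground-truth position.

    predictions:  per query, a ranked list of predicted reference positions.
    ground_truth: per query, the true reference position.
    """
    hits = sum(
        any(abs(p - gt) <= radius for p in preds[:k])
        for preds, gt in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)

# Toy example: three queries, two ranked predictions each, radius of 1 frame.
predictions = [[10, 50], [3, 4], [100, 7]]
ground_truth = [11, 20, 8]
r1 = recall_at_k(predictions, ground_truth, k=1, radius=1)
r2 = recall_at_k(predictions, ground_truth, k=2, radius=1)
```

In the toy example the first query is correct at rank 1, the third only at rank 2, and the second is never localized.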

                                  |         Berlin          |         Oxford          |        Nordland
Methods                           | Recall  Storage  Time   | Recall  Storage  Time   | Recall  Storage  Time
----------------------------------+-------------------------+-------------------------+------------------------
NetVLAD [1]                       | 38.21   -        -      | 46.61   -        -      | 9.21    -        -
Vanilla SuperPoint (SP) [10]      | 46.07   1        1      | 72.11   1        1      | 14.99   1        1
Semantic Consistency              | 44.64   0.66     0.83   | 64.14   0.81     0.90   | 13.91   0.58     0.90
Cluster Consistency               | 43.21   0.49     0.74   | 58.96   0.62     0.80   | 11.99   0.69     0.96
Ours: ES Utility                  | 50.36   0.79     0.88   | 74.10   0.53     0.76   | 16.06   0.96     0.99
Ours: PS Utility                  | 47.14   0.48     0.73   | 69.32   0.37     0.71   | 14.56   0.90     0.97
Ours: ES + PS Utility             | 49.64   0.70     0.84   | 74.10   0.48     0.74   | 16.06   0.94     0.98
----------------------------------+-------------------------+-------------------------+------------------------
Vanilla SP + SuperGlue (SG) [17]  | 59.64   1        1      | 86.45   1        1      | 20.34   1        1
Ours: ES Utility                  | 61.07   0.79     0.88   | 86.06   0.53     0.76   | 20.12   0.96     0.99
Ours: PS Utility                  | 51.43   0.48     0.73   | 82.86   0.37     0.71   | 20.34   0.90     0.97
Ours: ES + PS Utility             | 59.64   0.70     0.84   | 86.06   0.48     0.74   | 20.98   0.94     0.98
----------------------------------+-------------------------+-------------------------+------------------------
HF-Net’s MobileNetVLAD [9]        | 35.36   -        -      | 60.55   -        -      | 16.91   -        -
Vanilla HF-Net                    | 46.78   1        1      | 86.00   1        1      | 27.83   1        1
Ours: ES Utility                  | 48.57   0.79     0.86   | 84.46   0.52     0.77   | 29.34   0.97     0.99
Ours: PS Utility                  | 45.35   0.63     0.79   | 81.67   0.47     0.75   | 26.34   0.90     0.97
Ours: ES + PS Utility             | 47.85   0.70     0.81   | 85.66   0.47     0.75   | 28.69   0.95     0.98
TABLE I: Quantitative Results: Performance comparison on three benchmark datasets. Storage and Time are reported relative to the corresponding vanilla pipeline (value 1).
Fig. 3: Performance Vs Top-X cluster keypoints. Results displayed via single markers at use all the clusters and their keypoints (except Semantic Consistency).

IV-C Baseline Comparisons

We use vanilla NetVLAD (Pitts30K trained [1]), vanilla SuperPoint [10] and vanilla SuperPoint + SuperGlue as baselines in the results, represented as NetVLAD, Vanilla SuperPoint (SP), and Vanilla SP + SuperGlue (SG) respectively. In all our local-feature-based methods, including the other baselines described below, we use the NetVLAD top-20 candidates and select the final match among them using local feature matching.

We also provide two additional baselines: Semantic Segmentation Consistency and Cluster Consistency, which employ human-level semantics in place of our ES & PS utility in the proposed hierarchical VPR framework. These baselines are based on previous work leveraging human-level semantics [7, 6, 11] or cluster-based fine-grained semantics [27], where semantic or cluster label consistency across reference and query is used to improve the matching framework.

Fig. 4: (a) Performance of ES + PS Utility with respect to storage and compute time relative to vanilla SuperPoint for Oxford and Berlin, and (b) performance variations on Berlin with respect to the vocabulary size, using Vanilla NetVLAD, Vanilla SuperPoint and ES Utility.

For the Semantic Consistency baseline, we generate semantic segmentation masks for all reference and query images based on the Cityscapes [39] scheme. We then filter the SP keypoints to only retain points belonging to buildings, vegetation, and roads for Oxford and Berlin (this particular choice of semantic classes is similar to the selection by [11, 6]); and the ones belonging to buildings, vegetation and terrain for Nordland. In [6], implicit keypoint correspondences are first obtained and then semantic label consistency is imposed. For the Semantic Consistency baseline considered in this paper, we first obtain the keypoints from the chosen semantic classes and then use the vanilla SuperPoint’s feature matching pipeline to find keypoint correspondences. Similarly, for the Cluster Consistency baseline, we select the clusters corresponding to the aforementioned semantic classes respectively for all three datasets based on visual inspection.

Finally, we also demonstrate the use of our proposed method on an existing hierarchical 6-DoF localization pipeline HF-Net [9] but in the context of VPR. In Table I, we include results for HF-Net’s MobileNetVLAD as a global descriptor, vanilla HF-Net (using their global and local descriptors in accordance with the vanilla SP pipeline) and proposed utility-based keypoint filtering applied to vanilla HF-Net’s local descriptors.

Fig. 5: Example patches from clusters with low (red) & high (blue) environment-specific utility for Berlin, Oxford and Nordland.

V Results & Discussion

In this section, we first present the key quantitative results from testing the proposed framework on three benchmark datasets. We then provide a qualitative analysis with visualizations and insights from both Environment-Specific and Place-Specific cluster utility.

V-A Performance Characteristics

Table I and Fig 3 show the performance of the proposed pipeline on all three benchmark datasets. We also report the performance of all seven baselines.

Across all datasets and all baseline matching systems (SP, SP+SG, HF-Net), the Environment-Specific (ES) utility improves recall in most cases, with a noticeable reduction in storage and compute time. On the other hand, the Place-Specific (PS) system performs close to its respective vanilla method but significantly reduces storage and compute requirements. Hence, the combined ES + PS method balances the trade-off between recall and efficiency, leading to better efficiency than ES alone and recall that matches or exceeds that of the vanilla methods (the only exception being the Oxford dataset when using SP+SG and HF-Net).

Fig 3 shows the full performance curves for the PS and ES+PS systems for SuperPoint-based matching. Performance saturates at a relatively small number of top-X clusters. When employing a combined filtering approach based on both ES and PS Utility, this peak performance is improved and achieved rather earlier.

Fig. 6: Visualizations of PS utility ranking of different clusters for (a) Berlin, (b) Oxford and (c) Nordland Summer, where utility decreases from blue to green to red.

V-A1 Storage & Compute Time Benefits

Fig 4(a) shows the ES utility-based filtering resulting in a consistent reduction of the reference map size, measured in terms of the total number of local descriptors stored. This leads to a reduction in compute time. In particular, for the Oxford dataset, where perceptual aliasing is high due to day-night matching, the ES utility-based filtering leads to an increase in performance while requiring storage of only roughly half of the original descriptors (0.53 in Table I).

Both the standalone PS and the combined PS+ES systems are able to retain near-peak performance while reducing the number of Top-X clusters selected for PS utility. In particular, for the Berlin dataset, PS+ES continues to outperform the vanilla system even with reduced storage and compute-time requirements, and its performance remains competitive with Vanilla SP under further reductions. Similar trends can be observed for the Oxford and Nordland datasets. Further computational and storage gains are likely achievable with complementary methods including quantization, binarization, hashing and dimension reduction [43, 44, 37].

V-A2 Effect of Vocabulary/Cluster Size

To further understand the effect of the VLAD vocabulary size (number of clusters) on the proposed pipeline’s performance, we present an ablation over a range of cluster counts in the proposed pipeline. Fig 4(b) shows the performance of Vanilla NetVLAD, Vanilla SuperPoint and ES Utility methods on Berlin for varying numbers of clusters. It can be observed that the use of 16 clusters for ES offers a relatively larger jump in performance than other vocabulary sizes, even surpassing the higher baseline performance at vocabulary sizes 96 and 128. Although the distinctiveness of individual clusters increases with the vocabulary size, the results suggest that their relative utility ranking does not remain meaningful enough to noticeably improve performance.

V-B Qualitative Analysis of ES & PS Utility

Our quantitative analyses presented above show how the proposed ES and PS utility methods achieve state-of-the-art performance while also reducing the overall storage and compute time requirements. In this section, we present insights and visualizations of segmentation masks obtained from ES, PS and their combined utility estimation.

V-B1 Environment-Specific Utility

Prior work [7, 6, 11] was based on an assumption that a handful of broad-level semantic classes such as buildings, vegetation, and roads are more important to VPR. Fig 5 depicts patches from low and high ES utility clusters for all three datasets. It can be observed that generic distractors such as cars, pedestrians and sky are discarded by ES utility, which helps in improving performance and is in line with a broad semantics-based utility. However, a cluster representing “a large white planar patch with a window grid” in the Berlin dataset (top left, red) and “large road patches” in the Oxford dataset (top right, red) are deemed to have low utility due to their frequent occurrence, leading to high perceptual aliasing. This demonstrates that determining feature utility for VPR based on a broad-level semantic class is not sufficient, thus fine-grained representations (often a subset of a broad-level semantic class) specific to an environment can effectively determine cluster utility to improve performance.

In the Nordland dataset, there is a small watermark at the top-right corner of the image, which often leads to false local keypoint matches for vanilla SuperPoint. While this specific example could be trivially filtered, it demonstrates a key property of the methods presented here. As shown in Fig 5, in the Nordland cluster patches, this frequently-occurring watermark is assigned to a low environment-specific utility cluster which represents “text based visual elements”.

Fig. 7: Visualization of ES utility (second column), PS utility (third column) and ES+PS utility (last column) for (a) Berlin and (b) Nordland Summer. For ES visualizations, low utility is represented in red and high in gray.

V-B2 Place-Specific Utility

Fig 6 shows the importance ranking visualizations based on PS utility, with blue being the most useful and red the least. While the behavior of PS is similar to that of ES, notable examples can be observed in the open vegetative environment of Nordland, where an extra set of railway tracks (cyan in Fig 6(c) left) and a part of a road (blue and cyan in Fig 6(c) right), each specific to a particular place, are marked as more important than other elements. In Fig 6(c) right in particular, the road visible on the left side is a highly informative place-specific cue and is assigned high utility by the PS system. As a particular characteristic of the Nordland dataset, we observed that pixels belonging to vegetation, clouds and sky sometimes emerged as a single cluster and were thus collectively assigned a middling place-specific utility. This could potentially be mitigated through a dynamic selection of cluster size per place, which can be explored in future work.

V-B3 Combined ES + PS Utility

Fig 7 shows how ES and PS Utility complement each other and their combination leads to better utility ranking. For instance, in Berlin, both PS and ES utility consider buildings as important on both sides of the road but ES discards the window panes as discussed previously, leading to improved ES+PS performance. Similarly, as shown in Fig 7 for Nordland, PS assigned high utility to a small part of the railway track and watermark but ES utility discarded them. Hence, in the combined ES+PS system, the PS utility-based filtering becomes more robust to perceptual aliasing while also avoiding transient errors.
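One way to picture the combination is the following minimal sketch. It assumes that ES utility is available as a set of discarded cluster ids and that PS utility is a per-place score for each cluster; the order of operations (ES filtering first, then a top-k PS ranking), the parameter `top_k`, and the function name `select_clusters` are illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, List, Set


def select_clusters(
    ps_scores: Dict[int, float],
    es_low_utility: Set[int],
    top_k: int,
) -> List[int]:
    """Return up to top_k cluster ids for one place.

    Clusters flagged as low ES utility are removed first; the survivors
    are then ranked by their place-specific (PS) score, highest first.
    """
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    survivors = [c for c in ps_scores if c not in es_low_utility]
    survivors.sort(key=lambda c: ps_scores[c], reverse=True)
    return survivors[:top_k]


# Hypothetical Nordland-like place: PS ranks the watermark (cluster 1) highly,
# but ES has discarded it, so the combined selection skips it.
ps_scores = {0: 0.9, 1: 0.8, 2: 0.4, 3: 0.1}
selected = select_clusters(ps_scores, es_low_utility={1, 3}, top_k=2)
```

With these made-up scores, PS alone would keep clusters 0 and 1, whereas the combined selection keeps clusters 0 and 2, mirroring how ES suppresses an aliased cue that PS on its own rates highly.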

V-B4 Case Study: Frequent Virtual Landmark

To analyze how the uniqueness or frequency of an object in the reference traverse affects utility estimation, we present a controlled experiment on the Oxford Summer day traverse. We introduce a unique virtual landmark (a traffic sign) into the images such that it is well aligned with the road across the traverse. We insert this landmark at four different frequencies, creating four duplicate reference traverses for the Sparse, Moderate, High, and Dense settings respectively. As the frequency of the added landmark increased, the PS and ES utility values of that particular cluster decreased. In Fig 8, it can be observed that for the Sparse setting, the landmark is marked as a salient object since it appears at only a few places across the traverse. Conversely, for the Dense setting, the landmark is marked as the least salient by both PS and ES utility estimation. This study shows the effectiveness of the proposed unsupervised method: the utility is correctly estimated without any prior knowledge of the virtual landmark or any specific training.
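The trend in this case study can be reproduced qualitatively with a toy calculation. The sketch below does not use the contrastive estimator proposed in this paper; it substitutes a simple inverse-document-frequency (IDF) style score as a stand-in, and it assumes the landmark is pasted into every n-th image. The traverse length and the interval values are hypothetical and are not the settings used in the experiment.

```python
import math
from typing import Dict


def images_with_landmark(num_images: int, interval: int) -> int:
    """Count images containing the landmark if it is added to every interval-th image."""
    if num_images <= 0 or interval <= 0:
        raise ValueError("num_images and interval must be positive")
    return len(range(0, num_images, interval))


def idf_utility(num_images_with_cluster: int, num_images: int) -> float:
    """IDF-style proxy: clusters seen in fewer images score higher."""
    return math.log(num_images / num_images_with_cluster)


NUM_IMAGES = 1000  # hypothetical traverse length
# Hypothetical insertion intervals, ordered from Sparse to Dense.
intervals: Dict[str, int] = {"Sparse": 100, "Moderate": 20, "High": 5, "Dense": 2}

utilities = {
    name: idf_utility(images_with_landmark(NUM_IMAGES, n), NUM_IMAGES)
    for name, n in intervals.items()
}
```

Under these assumptions the proxy score falls monotonically from the Sparse to the Dense setting, which matches the direction of the effect reported for both ES and PS utility.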

VI Conclusion

Discriminatively identifying individual places from a set of already-seen images is a critical and challenging problem for VPR. Estimating the uniqueness of visual cues and their relevance to VPR is crucial in the context of this problem. In this research, we proposed a novel approach to deduce the utility of visual cues ‘specific’ to an environment and a particular place, unified through a pipeline that guides keypoint filtering at the local feature matching stage. Our proposed pipeline leads to consistent state-of-the-art performance on three standard benchmark datasets exhibiting challenging appearance change and viewpoint shift, while simultaneously reducing the storage and compute time requirements.

(a) Sparse (b) Dense
Fig. 8: Visualization of ES (left) and PS (right) utility ranking when a virtual landmark is placed (a) sparsely or (b) densely in the traverse. For ES visualizations, low utility is represented in red and high in gray.

A number of areas are of interest for further investigation. We employed a contrastive learning approach here, but other learning schemes may be better suited. Moving beyond the two utility measures investigated here, it may be profitable to learn a much larger number of utility measures, for example a measure of utility for local areas that lies partway between specific places and whole environments. The ideal number of utility measures may depend on their complementarity, which could also be assessed. While we have shown here that finer-grained semantic segmentation below broad classes like road or building may have particular utility for VPR, further research could investigate the relationship between human-defined categories and those which are most useful for VPR. Collectively, future work in this area will help further improve the capabilities of these systems while also bridging the divide between human navigation and autonomous navigation systems, with potential additional benefits in areas like human-robot interaction.


  • [1] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5297–5307.
  • [2] J. Revaud, J. Almazán, R. S. Rezende, and C. R. d. Souza, “Learning with average precision: Training image retrieval with a listwise loss,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5107–5116.
  • [3] Y. Latif, R. Garg, M. Milford, and I. Reid, “Addressing challenging place recognition tasks using generative adversarial networks,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 2349–2355.
  • [4] S. Garg and M. J. Milford, “Seqnet: Learning descriptors for sequence-based hierarchical place recognition,” IEEE Robotics and Automation Letters, 2021.
  • [5] S. Garg, B. Harwood, G. Anand, and M. Milford, “Delta descriptors: Change-based place representation for robust visual localization,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5120–5127, 2020.
  • [6] S. Garg, N. Suenderhauf, and M. Milford, “Lost? appearance-invariant place recognition for opposite viewpoints using visual semantics,” arXiv preprint arXiv:1804.05526, 2018.
  • [7] T. Naseer, G. L. Oliveira, T. Brox, and W. Burgard, “Semantics-aware visual localization under challenging perceptual conditions,” in IEEE International Conference on Robotics and Automation (ICRA), 2017.
  • [8] M. Cummins and P. Newman, “Appearance-only SLAM at large scale with FAB-MAP 2.0,” Int. J. Robot. Res., vol. 30, no. 9, pp. 1100–1123, 2011.
  • [9] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, “From coarse to fine: Robust hierarchical localization at large scale,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 716–12 725.
  • [10] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 224–236.
  • [11] A. Gawel, C. Del Don, R. Siegwart, J. Nieto, and C. Cadena, “X-view: Graph-based semantic multi-view localization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1687–1694, 2018.
  • [12] C. Masone and B. Caputo, “A survey on deep visual place recognition,” IEEE Access, 2021.
  • [13] S. Garg, T. Fischer, and M. Milford, “Where is your place, visual place recognition?” in IJCAI, 2021.
  • [14] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, 2015.
  • [15] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
  • [16] B. Cao, A. Araujo, and J. Sim, “Unifying deep local and global features for image search,” in European Conference on Computer Vision.   Springer, 2020, pp. 726–743.
  • [17] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4938–4947.
  • [18] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE transactions on robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [19] J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European conference on computer vision.   Springer, 2014, pp. 834–849.
  • [20] Z. Chen, A. Jacobson, N. Sünderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford, “Deep learning features at scale for visual place recognition,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on.   IEEE, 2017, pp. 3223–3230.
  • [21] F. Radenović, G. Tolias, and O. Chum, “Fine-tuning cnn image retrieval with no human annotation,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1655–1668, 2018.
  • [22] S. Garg, N. Sünderhauf, F. Dayoub, D. Morrison, A. Cosgun, G. Carneiro, Q. Wu, T.-J. Chin, I. Reid, S. Gould, P. Corke, and M. Milford, “Semantics for robotic mapping, perception and interaction: A survey,” Foundations and Trends® in Robotics, vol. 8, no. 1–2, pp. 1–224, 2020. [Online]. Available:
  • [23] M. Schreiber, C. Knöppel, and U. Franke, “Laneloc: Lane marking based localization using highly accurate maps,” in Intelligent Vehicles Symposium (IV), 2013 IEEE.   IEEE, 2013, pp. 449–454.
  • [24] N. Atanasov, M. Zhu, K. Daniilidis, and G. J. Pappas, “Localization from semantic observations via the matrix permanent,” The International Journal of Robotics Research, vol. 35, no. 1-3, pp. 73–99, 2016.
  • [25] T. Stone, D. Differt, M. Milford, and B. Webb, “Skyline-based localisation for aggressively manoeuvring robots using uv sensors and spherical harmonics,” in 2016 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2016, pp. 5615–5622.
  • [26] L. Weng, B. Soheilian, and V. Gouet-Brunet, “Semantic signatures for urban visual localization,” in 2018 International Conference on Content-Based Multimedia Indexing (CBMI).   IEEE, 2018, pp. 1–6.
  • [27] M. Larsson, E. Stenborg, C. Toft, L. Hammarstrand, T. Sattler, and F. Kahl, “Fine-grained segmentation networks: Self-supervised segmentation for improved long-term visual localization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 31–41.
  • [28] Z. Chen, F. Maffra, I. Sa, and M. Chli, “Only look once, mining distinctive landmarks from convnet for visual place recognition,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 9–16.
  • [29] A. Khaliq, S. Ehsan, Z. Chen, M. Milford, and K. McDonald-Maier, “A holistic visual place recognition approach using lightweight cnns for significant viewpoint and appearance changes,” IEEE Transactions on Robotics, vol. 36, no. 2, pp. 561–569, 2019.
  • [30] Z. Chen, L. Liu, I. Sa, Z. Ge, and M. Chli, “Learning context flexible attention model for long-term visual place recognition,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4015–4022, 2018.
  • [31] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” Robotics: Science and Systems XI:, pp. 1–10, 2015.
  • [32] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” in European conference on computer vision.   Springer, 2016, pp. 241–257.
  • [33] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, “What makes paris look like paris?” ACM Transactions on Graphics (SIGGRAPH), vol. 31, no. 4, 2012.
  • [34] P. Gronat, G. Obozinski, J. Sivic, and T. Pajdla, “Learning and calibrating per-location classifiers for visual place recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 907–914.
  • [35] C. McManus, B. Upcroft, and P. Newman, “Scene signatures: Localised and point-less features for localisation,” in Robotics: Science and Systems, 2014.
  • [36] J. Knopp, J. Sivic, and T. Pajdla, “Avoiding confusing features in place recognition,” in European Conference on Computer Vision.   Springer, 2010, pp. 748–761.
  • [37] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.   IEEE, 2010, pp. 3304–3311.
  • [38] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla, “24/7 place recognition by view synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1808–1817.
  • [39] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
  • [40] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 year, 1000 km: The oxford robotcar dataset,” The International Journal of Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.
  • [41] D. Olid, J. M. Fácil, and J. Civera, “Single-view place recognition under seasonal changes,” in PPNIV Workshop at IROS 2018, 2018.
  • [42] M. Zaffar, S. Garg, M. Milford, J. Kooij, D. Flynn, K. McDonald-Maier, and S. Ehsan, “Vpr-bench: An open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change,” International Journal of Computer Vision, pp. 1–39, 2021.
  • [43] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, and E. Romera, “Towards life-long visual localization using an efficient matching of binary sequences from images,” in IEEE Int. Conf. Robot. Autom., 2015, pp. 6328–6335.
  • [44] O. Vysotska and C. Stachniss, “Relocalization under substantial appearance changes using hashing,” in IEEE/RSJ Int. Conf. Intell. Robot. Syst. Worksh., 2017.