HEAPUtil
Code for the RA-L (IROS) 2021 paper "A Hierarchical Dual Model of Environment- and Place-Specific Utility for Visual Place Recognition"
Visual Place Recognition (VPR) approaches have typically attempted to match places by identifying visual cues, image regions or landmarks that have high “utility” in identifying a specific place. But this concept of utility is not singular; rather, it can take a range of forms. In this paper, we present a novel approach to deduce two key types of utility for VPR: the utility of visual cues 'specific' to an environment, and to a particular place. We employ contrastive learning principles to estimate both the environment- and place-specific utility of Vector of Locally Aggregated Descriptors (VLAD) clusters in an unsupervised manner, which is then used to guide local feature matching through keypoint selection. By combining these two utility measures, our approach achieves state-of-the-art performance on three challenging benchmark datasets, while simultaneously reducing the required storage and compute time. We provide further analysis demonstrating that unsupervised cluster selection yields semantically meaningful results, that finer-grained categorization often has higher utility for VPR than high-level semantic categorization (e.g. building, road), and we characterise how these two utility measures vary across different places and environments. Source code is made publicly available at https://github.com/Nik-V9/HEAPUtil.
Mobile robot localization can be challenging due to extreme variations in scene appearance and camera viewpoint, which affect key robot capabilities including semantic scene understanding and Visual Place Recognition (VPR). Several solutions have been proposed in the literature to improve VPR, including contrastive representation learning
[1, 2], domain translation [3], sequential matching [4, 5], semantic saliency [6, 7] and hierarchical matching [8, 9]. Many of these methods tend to learn salient visual cues that can improve VPR, but which are typically non-interpretable [1, 10], or in the case of those based on explicit semantics, require human input for interpretation and performance enhancement [6, 11, 7].
In this work, we propose novel visual utility estimation techniques that not only lead to state-of-the-art VPR performance, but also offer semantic interpretability. For example, they enable insights into the role of coarse (buildings) vs fine-grained semantics (parts of buildings), as well as the varying relevance of a particular semantic class (road) in different environmental contexts (city vs rail traverse). We present a novel hierarchical VPR pipeline that uses global descriptors to guide local feature matching in a more unified manner. Moving beyond existing hierarchical VPR methods, which only pass candidate hypotheses from the coarse to the fine stage, we also estimate the visual utility of different elements in the scene through a VLAD-based global descriptor (NetVLAD [1]). In particular, we use the cluster-level VLAD representations to estimate a cluster’s utility in an environment-specific and place-specific manner for a given reference map. Here, environment-specific refers to estimating utility at a global level applicable to all the places, whereas place-specific refers to estimating utility at a local level applicable to that particular place only. The proposed method requires no special iterative training; we test it on different environments and show that the combination of global environment-specific and local place-specific utility leads to informed local feature matching and reduces storage and compute requirements (see Fig 1).
We make the following specific contributions:
an unsupervised method for estimating the global environment-specific (ES) and local place-specific (PS) utility of visual elements represented as VLAD clusters;
a more unified hierarchical global-to-local VPR pipeline where utility estimated from global descriptors guides local feature matching;
a combined ES and PS utility-based method which achieves state-of-the-art VPR performance while reducing storage and compute time requirements; and
a ‘bridge’ between human semantics and automated segmentation-based understanding of visual relevance for VPR, achieved through several visualizations and qualitative insights.
VPR is commonly posed as an image retrieval problem, where an image is described by a global descriptor or a set of local descriptors and keypoints to match with other images. Recent surveys
[12, 13, 14] have reviewed the many representations used to describe images for VPR, ranging from hand-crafted features such as SIFT [15] to learned global descriptors such as NetVLAD [1], AP-GeM [2], and DELG [16]; local descriptors such as SuperPoint [10] and DELF [16]; and local matchers such as SuperGlue [17]. Furthermore, hierarchical approaches have been used in several VPR and SLAM pipelines, where global descriptors are used to retrieve top candidates and local feature matching is used to obtain the best match amongst the top candidates [8, 6, 18, 19]. One such approach, HF-Net [9], proposed a ‘monolithic’ CNN to simultaneously learn global NetVLAD descriptors and local SuperPoint features for 6-DoF localization. However, in such hierarchical approaches, the global descriptors are not used to guide the local feature matching, and are limited to providing top matching candidates. In this context, our proposed hierarchical method uses unsupervised utility estimated from global descriptors to guide local feature matching, and is applicable to any existing hierarchical VPR method.
The VPR problem has been posed in many ways, ranging from a classification task [20, 16] to contrastive learning [1, 2, 21]. Most methods share the core intuition of attempting to represent images of the same place similarly. However, environments are filled with distractors, and consequently much work has attempted to automatically learn and identify the areas of an image with the most utility for the VPR task.
As surveyed recently [22], visual semantics is an emerging area of research in the field of robotics with huge potential for VPR and localization. A number of methods have demonstrated the use of semantic information or distinctive and informative visual elements for improving VPR [23, 24, 25, 26, 7, 6, 11]. However, these methods rely largely on human-based semantic categories, where relevant information is retained based on human intuition of the semantic classes, for example, buildings [7], roads [6, 11], lanes [23] and the skyline [25]. Such approaches also tend to require segmentation masks or supervision involving semantic labels to endow the system with higher-level semantic knowledge [7, 11, 6]. More recently, [27] explored using more fine-grained semantic categories beyond those derived from humans, showing promising potential.
Past methods have adopted region-based approaches, including grid-based region selection, region proposal networks, hashing-based landmark detection, and region extraction based on Convolutional Neural Network (CNN) activations
[28, 29, 30, 31]. Other approaches involve using attention to weight image features based on a relevance criterion [32]. However, even though semantics and saliency are strongly intertwined, there are no methods that leverage learned descriptors’ inherent semantic properties for this problem. In this context, our proposed framework estimates the utility of NetVLAD [1] clusters, leveraging their intrinsic semantic properties.

Place-specific learning has been explored previously [33, 34, 35], but these methods require significant training. [36] proposed an image-specific and spatially-localized detection of confusing features using local term frequency-inverse document frequency (tf-idf) weighting; however, it does not consider a simultaneous global environment-level utility as proposed in this work. Furthermore, we show that our method can be employed under challenging appearance conditions and is not limited to city-like environments.
In this section, we first present the proposed unsupervised techniques to estimate environment- and place-specific feature utility for VPR. We then describe our unified approach to hierarchical VPR, where the coarse global descriptor matching stage guides the fine local feature matching stage via keypoint filtering based on utility estimates (see Fig 2).
VLAD-based place representations [37, 38, 1, 6] have been demonstrated to achieve high-performance VPR, in particular the recent deep learning based adaptations such as NetVLAD [1]. Our approach here is motivated by the observation that the cluster assignment of NetVLAD descriptors has inherent semantic properties, whose level of detail varies with the number of clusters. For example, a NetVLAD descriptor with 16 clusters (16 being the closest to the typical number of semantic classes for road-based datasets such as Cityscapes [39]; please refer to Section V-A2 for an ablation study on vocabulary size) results in a cluster assignment that is analogous to human-based semantics, while also comprising fine-grained segmentation of typical broad semantic classes like buildings and roads. These observations lead to the hypothesis we pursue here: instead of comparing the full concatenated VLAD descriptor, we can compare the aggregated residuals at the cluster level, so that the distribution of cluster-wise distances within the reference map can be used to estimate that particular cluster’s utility for VPR. The second component of this hypothesis is that clusters with lower cluster-wise distances tend to cause high perceptual aliasing. We formulate this procedure in accordance with the well-established max-margin based contrastive learning regime, as described in the following subsections.

Contrastive learning has been demonstrated to achieve state-of-the-art performance for representing places [1, 2, 21]. With the use of triplets, that is, an anchor ($a$), a positive ($p$) and a negative ($n$), descriptors are typically learnt such that the margin $m$ between the anchor-negative distance and the anchor-positive distance is maximized:
$$ m = d(a, n) - d(a, p) \qquad (1) $$
In this work, we maximize this margin in a non-iterative manner, since only the reference map traverse is used, unlike the typical use of multiple place views [1, 2]. We further adapt the computation of this margin in two key ways: the triplets are represented with cluster-wise aggregated residuals instead of the full concatenated VLAD vector, and the anchor-positive distance in our case approaches zero, since we only consider a single traverse where nearby images are similar. The margin is thus computed for each cluster independently using only the anchor-negative distances, and is referred to as utility from here on. Thus, the higher the utility of a cluster, the lower the perceptual aliasing it causes. This process can be used at both the environment level (that is, globally across the full reference map) and the place level (that is, specific to individual local places), respectively referred to as environment-specific and place-specific, as discussed further below.
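To make the cluster-level representation concrete, the following minimal NumPy sketch aggregates per-location residuals into one vector per VLAD cluster. It uses a simplified hard assignment of each local feature to its nearest cluster center (NetVLAD itself uses a learned soft assignment), and all names and toy dimensions are ours rather than the repository's:

```python
import numpy as np

def cluster_residuals(features, centers):
    """Aggregate per-cluster VLAD residuals.

    features : (N, D) dense local features (e.g., the flattened spatial
               cells of NetVLAD's last convolutional layer).
    centers  : (K, D) VLAD cluster centers.
    Returns  : (K, D) array V, where V[k] is the sum of residuals
               (feature - centers[k]) over features assigned to cluster k.
    """
    # Hard-assign each feature to its nearest cluster center.
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    assign = d2.argmin(axis=1)
    V = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = assign == k
        if members.any():
            V[k] = (features[members] - centers[k]).sum(axis=0)
    return V

# Toy example: 100 local features of dimension 32, a 16-word vocabulary.
rng = np.random.default_rng(0)
centers = rng.normal(size=(16, 32))
V_a = cluster_residuals(rng.normal(size=(100, 32)), centers)  # anchor image
V_n = cluster_residuals(rng.normal(size=(100, 32)), centers)  # negative image
# Cluster-wise anchor-negative distances: the per-cluster margin of Eq. (1)
# when the anchor-positive term is taken as zero.
margins = np.linalg.norm(V_a - V_n, axis=1)   # one value per cluster
```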
High-utility clusters tend to indicate salient areas of a particular place. Based on this observation, we formulate a per-cluster place-specific utility estimate which, when sorted, gives a relative saliency ranking of the clusters at that particular place in the reference map.
For a given image $I_i$ (considered as a unique place) from a particular geographical location in the reference map $R$, a positive localization radius exists within which places (images) are considered positives. Also, a non-negative localization radius is considered, beyond which all places (images) are treated as negatives. Let us suppose that there are $N$ such negatives in $R$ for the anchor image $I_i$, with $I^{-}_{j}$ denoting the $j$-th negative. Then the place-specific utility $U^{PS}_{i,k}$ of cluster $k$ (given that it exists at that particular place) with anchor image $I_i$ in $R$ can be formulated as:
$$ U^{PS}_{i,k} = \frac{1}{N} \sum_{j=1}^{N} \left\lVert V_k(I_i) - V_k(I^{-}_{j}) \right\rVert_2 \qquad (2) $$
where $V_k(\cdot)$ represents the sum of residuals for cluster $k$.
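As a rough illustration of Eq. (2) under the notation above (a hedged sketch rather than the authors' implementation; the helper names are ours), the place-specific utility of each cluster is the mean cluster-wise distance between the anchor's residuals and those of its negatives, and sorting it gives the per-place saliency ranking:

```python
import numpy as np

def place_specific_utility(V_anchor, V_negatives):
    """Eq. (2) sketch.
    V_anchor   : (K, D) cluster-level residuals of the anchor place.
    V_negatives: (M, K, D) residuals of its M negatives (places beyond the
                 non-negative localization radius).
    Returns    : (K,) mean anchor-negative distance per cluster."""
    diffs = V_negatives - V_anchor[None]                 # (M, K, D)
    return np.linalg.norm(diffs, axis=-1).mean(axis=0)   # (K,)

def top_x_clusters(utility, x):
    """Indices of the X highest-utility clusters for this place."""
    return np.argsort(-utility)[:x]

# Toy usage with the (K=16, D=32) shapes from the previous sketch.
rng = np.random.default_rng(1)
u_ps = place_specific_utility(rng.normal(size=(16, 32)),
                              rng.normal(size=(40, 16, 32)))
print(top_x_clusters(u_ps, 5))   # saliency ranking, best five clusters
```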
At an environment level, clusters with a low variance in residual values across the reference map contain objects with high perceptual aliasing for that particular environment. We hypothesize that such clusters vary as a property of the specific environment, and that this global utility can guide the place-specific utility to avoid non-relevant clusters for that environment while also preventing transient errors.
Considering all places in the reference map $R$ (unlike SLAM, for global re-localization the map size is known a priori), the environment-specific utility $U^{ES}_{k}$ of a cluster $k$ is formulated as:
$$ U^{ES}_{k} = \frac{1}{|R|} \sum_{i \in R} \left\lVert V_k(I_i) - \bar{V}_k \right\rVert_2^{2}, \qquad \bar{V}_k = \frac{1}{|R|} \sum_{i \in R} V_k(I_i) \qquad (3) $$
where $V_k(I_i)$ represents the cluster-level representation of place $I_i$, that is, the sum of residuals for cluster $k$, and $\bar{V}_k$ is its mean over the reference map. Once the utility values for all the clusters are determined, k-means segregation with k = 2 is used to divide the clusters into two bins (the terms ‘segregation’ and ‘bins’ are used instead of clustering and clusters to avoid confusing them with the VLAD clusters). VLAD clusters falling in the bin with high utility are regarded as environment-specific high-utility clusters, while the others are discarded as dustbin clusters (a sketch of this step follows the next paragraph).

When local feature matching pipelines compare local descriptors, they can struggle to discard distractors within an image or areas of the image with high perceptual aliasing. We propose to increase their robustness to aliasing by filtering the local descriptors and keypoints based on our place-specific and environment-specific utility estimated from global descriptors, with an additional benefit being the reduced storage and compute requirements. Distinct from existing work, our hierarchical VPR better unifies the global and local feature stages, as the keypoint utility is directly estimated through the utility of the VLAD clusters of the global descriptors.
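Returning to the environment-specific estimate, the sketch below follows our reading of Eq. (3), taking a cluster's utility as the spread of its residual representation across the reference map, and then applies the two-bin k-means segregation described above. The exact variance form is an assumption on our part; only the two-bin k-means step is stated explicitly:

```python
import numpy as np
from sklearn.cluster import KMeans

def environment_specific_utility(V_map):
    """Eq. (3) sketch.
    V_map: (R, K, D) cluster-level residuals for all R reference images.
    Returns: (K,) mean squared deviation of each cluster's residuals from
    their per-cluster mean over the map (low spread -> high aliasing)."""
    mean_V = V_map.mean(axis=0, keepdims=True)                       # (1, K, D)
    return (np.linalg.norm(V_map - mean_V, axis=-1) ** 2).mean(axis=0)

def high_utility_clusters(u_es):
    """Two-bin k-means segregation; keep the clusters in the higher bin."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(u_es.reshape(-1, 1))
    high_bin = int(km.cluster_centers_.argmax())
    return np.where(km.labels_ == high_bin)[0]

# Toy usage: 200 reference images, 16 VLAD clusters, 32-D residuals.
rng = np.random.default_rng(2)
u_es = environment_specific_utility(rng.normal(size=(200, 16, 32)))
keep = high_utility_clusters(u_es)   # environment-specific high-utility clusters
```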
We use NetVLAD representation as our global descriptor. A query image descriptor is matched with the reference global descriptors using Euclidean distance to retrieve top matching candidates.
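In code, this retrieval stage reduces to a nearest-neighbour search over global descriptors; a minimal sketch (the default of 20 candidates matches the experimental setup described later, everything else is illustrative):

```python
import numpy as np

def retrieve_top_candidates(query_desc, ref_descs, k=20):
    """Rank reference images by Euclidean distance between global descriptors.
    query_desc: (D,) query NetVLAD descriptor; ref_descs: (R, D).
    Returns the indices of the k nearest reference images."""
    dists = np.linalg.norm(ref_descs - query_desc[None], axis=1)
    return np.argsort(dists)[:k]
```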
In order to obtain a cluster-level segmentation mask, we replace NetVLAD’s differentiable soft cluster assignment with its original counterpart of hard cluster assignment, leading to a mask whose size corresponds to the spatial dimensions of NetVLAD’s last convolutional layer tensor. This cluster assignment mask is then rescaled to the original image size. These cluster assignments, along with the top candidate matches, are then passed on to the local feature matching stage.

For local feature matching, we use SuperPoint [10] (SP) descriptors/keypoints. Given the reference image database $R$, the cluster assignment masks from the global descriptors and the keypoint spatial locations, we employ the place-specific and environment-specific utility for SP descriptor/keypoint filtering in the following manner (a condensed code sketch follows this list):
a) Environment-Specific keypoint filtering: Based on the unsupervised environment-specific utility estimation, we select the SP keypoints corresponding to environment-specific high-utility clusters.
b) Place-Specific keypoint filtering: Once the place-specific utility of each cluster for an image $I_i$ in the reference map is estimated, the utility values of all the clusters are sorted to obtain a relative cluster saliency ranking for that specific image. Based on this place-specific cluster saliency ranking, we use the top-X clusters to select SP keypoints.
c) Combined keypoint filtering: We propose a combination of the environment-specific and place-specific utility approaches. Initially, the SP keypoints of an image are subsampled using the top-X clusters formulation, and then a further filtering step retains only the keypoints belonging to environment-specific high-utility clusters.
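The sketch below condenses these three filtering modes, assuming a per-image hard cluster-assignment mask (arg-max over NetVLAD's soft assignment, upscaled to the image size with nearest-neighbour interpolation) and SuperPoint keypoint coordinates; the function names and toy dimensions are illustrative:

```python
import numpy as np
import cv2

def upscale_mask(assign_small, image_hw):
    """Nearest-neighbour upscaling of the (h, w) cluster-assignment mask
    (from the last conv layer's spatial grid) to the full image size."""
    H, W = image_hw
    return cv2.resize(assign_small.astype(np.uint8), (W, H),
                      interpolation=cv2.INTER_NEAREST)

def keypoint_clusters(mask_full, keypoints):
    """Cluster id under each keypoint; keypoints is an (N, 2) array of (x, y)."""
    h, w = mask_full.shape
    xy = np.clip(np.round(keypoints).astype(int), 0, [w - 1, h - 1])
    return mask_full[xy[:, 1], xy[:, 0]]

def filter_keypoints(keypoints, kp_clusters, allowed_clusters):
    """Keep only keypoints whose cluster id lies in `allowed_clusters`."""
    keep = np.isin(kp_clusters, list(allowed_clusters))
    return keypoints[keep], keep

# Toy usage: a 30x40 assignment grid with 16 clusters, a 480x640 image.
rng = np.random.default_rng(3)
mask_full = upscale_mask(rng.integers(0, 16, size=(30, 40)), (480, 640))
kps = rng.uniform([0, 0], [640, 480], size=(200, 2))          # (x, y) keypoints
kp_c = keypoint_clusters(mask_full, kps)
es_filtered, _ = filter_keypoints(kps, kp_c, {0, 3, 5, 7})    # ES mode (example set)
# PS mode uses the place's top-X clusters instead; the combined mode first
# applies the top-X selection and then keeps only the ES high-utility clusters.
```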
Once the filtered SP keypoints for all the images in the reference map are obtained, we consider two state-of-the-art feature matching pipelines to match the SP descriptors of the query with the filtered SP descriptors of the top reference candidates obtained through global descriptor matching:
i) Based on SuperPoint’s matching pipeline [10], the descriptors of the query and reference image are first matched using absolute Euclidean distance-based Nearest Neighbor (NN) search with a mutual NN cross-check, followed by geometric verification using RANSAC-based homography estimation with a fixed pixel threshold.
ii) Based on SuperGlue’s matching pipeline [17], a graph neural network takes as input the SuperPoint keypoints and descriptors to produce inliers between an image pair.
For a matched image pair, inliers are used to compute the match score:
$$ S = \frac{n_{\mathrm{in}}}{\sqrt{n_q \, n_r}} \qquad (4) $$
where $n_{\mathrm{in}}$, $n_q$ and $n_r$ are the number of inliers, the number of SP keypoints in the query, and the number of filtered SP keypoints in the candidate reference image, respectively. Amongst the top candidates, the candidate with the highest match score is selected as the final match.
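A hedged sketch of pipeline (i) plus the scoring step is given below: mutual nearest-neighbour descriptor matching, RANSAC homography verification with OpenCV, and a normalized inlier score. The geometric-mean normalization written here for Eq. (4) and the 5-pixel threshold are assumptions on our part:

```python
import numpy as np
import cv2

def mutual_nn_matches(desc_q, desc_r):
    """Mutual nearest-neighbour matching on L2 descriptor distance.
    desc_q: (Nq, D), desc_r: (Nr, D). Returns an (M, 2) array of index pairs."""
    d = np.linalg.norm(desc_q[:, None] - desc_r[None], axis=-1)   # (Nq, Nr)
    nn_q = d.argmin(axis=1)          # best reference match per query keypoint
    nn_r = d.argmin(axis=0)          # best query match per reference keypoint
    return np.array([(i, j) for i, j in enumerate(nn_q) if nn_r[j] == i])

def inlier_count(kp_q, kp_r, matches, thresh=5.0):
    """RANSAC homography verification; returns the number of inliers."""
    if len(matches) < 4:
        return 0
    src = kp_q[matches[:, 0]].astype(np.float32)
    dst = kp_r[matches[:, 1]].astype(np.float32)
    _, inliers = cv2.findHomography(src, dst, cv2.RANSAC, thresh)
    return 0 if inliers is None else int(inliers.sum())

def match_score(n_in, n_q, n_r):
    """Assumed Eq. (4) normalization: inliers over the geometric mean of counts."""
    return n_in / np.sqrt(max(n_q, 1) * max(n_r, 1))

# For each of the top retrieved candidates, the one with the highest
# match_score(inlier_count(...), len(kp_q), len(kp_r_filtered)) is kept.
```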
We used three widely used benchmark datasets to evaluate our proposed approach: Berlin Kudamm [28, 31], Oxford RobotCar [40], and Nordland [41]. All the datasets present challenging scenarios for VPR in terms of substantial viewpoint shift and drastic changes in visual appearance due to seasonal cycles or time of day, as described below:
This dataset is downloaded from the crowd-sourced photo-mapping platform Mapillary, where two different perspectives of the same route are captured. Confusing objects and dynamic distractors such as vehicles and pedestrians, together with homogeneous scenes, lead to perceptual aliasing, and the substantial viewpoint shift in particular adds to the complexity. All frames in both the reference and query traverses are geotagged.
This dataset contains traverses of Oxford city captured during different seasonal cycles and times of the day. We use a subsampled version of the Overcast Summer and Autumn Night traverses (originally 2015-03-17-11-08-44 and 2014-12-16-18-44-24 in [40]). We use GPS data to subsample the original data to an approximately uniform frame spacing along the route. The Overcast Summer and Autumn Night traverses provide a drastic shift in visual appearance due to season and time of day.
This dataset captures a long train journey recorded across different seasonal cycles. We use the Summer and Winter traverses for our experiments, taking an initial segment of both and uniformly subsampling it to obtain the reference summer traverse and the query winter traverse. The combination of widespread vegetation and occasional unique objects in this dataset presents a challenging scenario on top of the extreme appearance variations.
We use Recall as the performance metric, since the output of a VPR system can typically be employed for precise 6-DoF SLAM/localization [42]. For a given localization radius, recall is defined as the ratio of correctly retrieved queries within the top-K predictions to the total number of queries. We use a ground-truth localization radius defined in meters for the Berlin and Oxford datasets and in frames for Nordland. For our place-specific utility estimation, we use a fixed number of top-ranked clusters, and for the combined environment- and place-specific system, we use the top-ranked clusters among the useful clusters determined by the environment-specific system. We provide full parameter sweeps in the results section for sensitivity analysis.
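For reference, a small sketch of this recall computation (assuming per-frame ground-truth positions, or frame indices for Nordland, and a fixed localization radius; all names are illustrative):

```python
import numpy as np

def recall_at_k(preds, ref_positions, query_positions, k, radius):
    """preds: (Q, >=k) indices of the top retrieved reference frames per query.
    ref_positions: (R, P) reference coordinates (P=2 for GPS, P=1 for frame ids).
    query_positions: (Q, P) query coordinates.
    A query counts as correct if any of its top-k retrievals lies within
    `radius` of the query's ground-truth position."""
    correct = 0
    for q, row in enumerate(preds):
        d = np.linalg.norm(ref_positions[row[:k]] - query_positions[q], axis=-1)
        correct += bool((d <= radius).any())
    return correct / len(preds)
```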
Table I: Recall, with storage and compute time shown relative to the corresponding vanilla baseline (= 1).

| Methods | Berlin Recall | Storage | Time | Oxford Recall | Storage | Time | Nordland Recall | Storage | Time |
|---|---|---|---|---|---|---|---|---|---|
| NetVLAD [1] | 38.21 | - | - | 46.61 | - | - | 9.21 | - | - |
| Vanilla SuperPoint (SP) [10] | 46.07 | 1 | 1 | 72.11 | 1 | 1 | 14.99 | 1 | 1 |
| Semantic Consistency | 44.64 | 0.66 | 0.83 | 64.14 | 0.81 | 0.90 | 13.91 | 0.58 | 0.90 |
| Cluster Consistency | 43.21 | 0.49 | 0.74 | 58.96 | 0.62 | 0.80 | 11.99 | 0.69 | 0.96 |
| Ours: ES Utility | 50.36 | 0.79 | 0.88 | 74.10 | 0.53 | 0.76 | 16.06 | 0.96 | 0.99 |
| Ours: PS Utility | 47.14 | 0.48 | 0.73 | 69.32 | 0.37 | 0.71 | 14.56 | 0.90 | 0.97 |
| Ours: ES + PS Utility | 49.64 | 0.70 | 0.84 | 74.10 | 0.48 | 0.74 | 16.06 | 0.94 | 0.98 |
| Vanilla SP + SuperGlue (SG) [17] | 59.64 | 1 | 1 | 86.45 | 1 | 1 | 20.34 | 1 | 1 |
| Ours: ES Utility | 61.07 | 0.79 | 0.88 | 86.06 | 0.53 | 0.76 | 20.12 | 0.96 | 0.99 |
| Ours: PS Utility | 51.43 | 0.48 | 0.73 | 82.86 | 0.37 | 0.71 | 20.34 | 0.90 | 0.97 |
| Ours: ES + PS Utility | 59.64 | 0.70 | 0.84 | 86.06 | 0.48 | 0.74 | 20.98 | 0.94 | 0.98 |
| HF-Net’s MobileNetVLAD [9] | 35.36 | - | - | 60.55 | - | - | 16.91 | - | - |
| Vanilla HF-Net | 46.78 | 1 | 1 | 86.00 | 1 | 1 | 27.83 | 1 | 1 |
| Ours: ES Utility | 48.57 | 0.79 | 0.86 | 84.46 | 0.52 | 0.77 | 29.34 | 0.97 | 0.99 |
| Ours: PS Utility | 45.35 | 0.63 | 0.79 | 81.67 | 0.47 | 0.75 | 26.34 | 0.90 | 0.97 |
| Ours: ES + PS Utility | 47.85 | 0.70 | 0.81 | 85.66 | 0.47 | 0.75 | 28.69 | 0.95 | 0.98 |
We use vanilla NetVLAD (Pitts30K-trained [1]), vanilla SuperPoint [10] and vanilla SuperPoint + SuperGlue as baselines in the results, represented as NetVLAD, Vanilla SuperPoint (SP), and Vanilla SP + SuperGlue (SG) respectively. In all our local feature-based methods, including the other baselines (described below), we use the NetVLAD top-20 candidates and select the final match using local feature matching.
We also provide two additional baselines: Semantic Segmentation Consistency and Cluster Consistency, which employ human-level semantics in place of our ES & PS utility in the proposed hierarchical VPR framework. These baselines are based on previous work leveraging human-level semantics [7, 6, 11] or cluster-based fine-grained semantics [27], where semantic or cluster label consistency across reference and query is used to improve the matching framework.
For the Semantic Consistency baseline, we generate semantic segmentation masks for all reference and query images based on the Cityscapes [39] scheme. We then filter the SP keypoints to only retain points belonging to buildings, vegetation, and roads for Oxford and Berlin (this particular choice of semantic classes is similar to the selection by [11, 6]); and the ones belonging to buildings, vegetation and terrain for Nordland. In [6], implicit keypoint correspondences are first obtained and then semantic label consistency is imposed. For the Semantic Consistency baseline considered in this paper, we first obtain the keypoints from the chosen semantic classes and then use the vanilla SuperPoint’s feature matching pipeline to find keypoint correspondences. Similarly, for the Cluster Consistency baseline, we select the clusters corresponding to the aforementioned semantic classes respectively for all three datasets based on visual inspection.
Finally, we also demonstrate the use of our proposed method on an existing hierarchical 6-DoF localization pipeline HF-Net [9] but in the context of VPR. In Table I, we include results for HF-Net’s MobileNetVLAD as a global descriptor, vanilla HF-Net (using their global and local descriptors in accordance with the vanilla SP pipeline) and proposed utility-based keypoint filtering applied to vanilla HF-Net’s local descriptors.
In this section, we first present the key quantitative results from testing the proposed framework on three benchmark datasets. We then provide a qualitative analysis with visualizations and insights from both Environment-Specific and Place-Specific cluster utility.
Table I and Fig 3 show the performance of the proposed pipeline on all three benchmark datasets. We also present the performance of all seven baselines.
Across all datasets and all baseline matching systems (SP, SP+SG, HF-Net), it can be observed that the Environment-Specific (ES) utility results in improved recall in most cases, with a noticeable reduction in storage and compute time. On the other hand, the Place-Specific (PS) system performs close to its respective vanilla method but significantly reduces storage and compute requirements. Hence, the combined ES + PS method balances the trade-off between recall and efficiency, leading to better efficiency than ES alone and consistently superior recall performance compared to the vanilla methods (the only exception being the Oxford dataset when using SP+SG and HF-Net).
Fig 3 shows the full performance curves for the PS and ES+PS systems for SuperPoint-based matching. Performance saturates at a relatively small number of top-X clusters. When employing a combined filtering approach based on both ES and PS utility, this peak performance is both improved and reached earlier, that is, with fewer clusters.
[Figure: qualitative panels for (a) Berlin, (b) Oxford, and (c) Nordland Summer]
Fig 4(a) shows that the ES utility-based filtering results in a consistent reduction of the reference map size, measured in terms of the total number of local descriptors stored, which in turn reduces compute time. In particular, for the Oxford dataset, where perceptual aliasing is high due to day-night matching, the ES utility-based filtering leads to an increase in performance while only requiring storage of about half of the original descriptors.
Both the standalone PS and the combined PS+ES systems are able to retain near-peak performance while reducing the number of top-X clusters selected for PS utility. In particular, for the Berlin dataset, PS+ES continues to outperform the vanilla system even with substantially reduced storage and compute-time requirements, and performance remains competitive with vanilla SP at even lower storage and compute time. Similar trends can be observed for the Oxford and Nordland datasets. Further computational and storage gains are likely achievable with complementary methods including quantization, binarization, hashing and dimensionality reduction [43, 44, 37].

To further understand the effect of the VLAD vocabulary size (number of clusters) on the proposed pipeline’s performance, we present an ablation over a range of vocabulary sizes. Fig 4(b) shows the performance of vanilla NetVLAD, vanilla SuperPoint and the ES utility method on Berlin for varying numbers of clusters. It can be observed that the use of 16 clusters for ES offers a relatively larger jump in performance than the other vocabulary sizes, even surpassing the higher baseline performance obtained with vocabulary sizes of 96 and 128. Although the distinctiveness of individual clusters increases with the vocabulary size, the results suggest that their relative utility ranking does not remain meaningful enough to noticeably improve performance.
Our quantitative analyses presented above show how the proposed ES and PS utility methods achieve state-of-the-art performance while also reducing the overall storage and compute time requirements. In this section, we present insights and visualizations of segmentation masks obtained from ES, PS and their combined utility estimation.
Prior work [7, 6, 11] was based on an assumption that a handful of broad-level semantic classes such as buildings, vegetation, and roads are more important to VPR. Fig 5 depicts patches from low and high ES utility clusters for all three datasets. It can be observed that generic distractors such as cars, pedestrians and sky are discarded by ES utility, which helps in improving performance and is in line with a broad semantics-based notion of utility. However, a cluster representing “a large white planar patch with a window grid” in the Berlin dataset (top left, red) and “large road patches” in the Oxford dataset (top right, red) are deemed to have low utility due to their frequent occurrence, which leads to high perceptual aliasing. This demonstrates that determining feature utility for VPR based on a broad-level semantic class is not sufficient; fine-grained representations (often a subset of a broad-level semantic class) specific to an environment can more effectively determine cluster utility and improve performance.
In the Nordland dataset, there is a small watermark at the top-right corner of the image, which often leads to false local keypoint matches for vanilla SuperPoint. While this specific example could be trivially filtered, it demonstrates a key property of the methods presented here. As shown in Fig 5, in the Nordland cluster patches, this frequently-occurring watermark is assigned to a low environment-specific utility cluster which represents “text based visual elements”.
[Figure: qualitative panels for (a) Berlin and (b) Nordland Summer]
Fig 6 shows the importance ranking visualizations based on PS utility, with blue being most useful and red the least. While the behavior of PS is similar to ES, notable examples can be observed in the open vegetative environment of Nordland, where an extra set of railway tracks (cyan colored in Fig 6(c) left) and a part of a road (blue and cyan colored in Fig 6(c) right) specific to a particular place are marked as more important than other elements. Similarly, in Fig 6(c) right, the road visible on the left side is a highly-informative place-specific cue and is considered as high utility by the PS system. As a particular characteristic of the Nordland dataset, we observed that pixels belonging to vegetation, clouds and sky sometimes emerged as a single cluster and were thus cumulatively assigned a middling place-specific utility. This could potentially be mitigated through a dynamic selection of cluster size per place, which can be explored in future work.
Fig 7 shows how ES and PS Utility complement each other and their combination leads to better utility ranking. For instance, in Berlin, both PS and ES utility consider buildings as important on both sides of the road but ES discards the window panes as discussed previously, leading to improved ES+PS performance. Similarly, as shown in Fig 7 for Nordland, PS assigned high utility to a small part of the railway track and watermark but ES utility discarded them. Hence, in the combined ES+PS system, the PS utility-based filtering becomes more robust to perceptual aliasing while also avoiding transient errors.
In order to analyze the effect of the uniqueness or frequency of an object in the reference traverse on utility estimation, we present a controlled experiment on the Oxford Summer day traverse. We introduce a unique virtual landmark (a traffic sign) into the images such that it is well-aligned with the road across the traverse. We insert this landmark at regular image intervals to create four duplicate reference traverses, with the insertion frequency increasing across the Sparse, Moderate, High, and Dense settings respectively. As the frequency of the added landmark increased, the PS and ES utility values of that particular cluster decreased. In Fig 8, it can be observed that for the Sparse setting, the landmark is marked as a salient object since it only appears at very few places across the traverse. Conversely, for the Dense scenario, the landmark is marked as the least salient by both the PS and ES utility estimation. This study shows the effectiveness of the proposed unsupervised method: without any prior knowledge of the virtual landmark or any specific training, the utility is correctly estimated.
Discriminatively identifying individual places from a set of already-seen images is a critical and challenging problem for VPR. Estimating the uniqueness of visual cues and their relevance to VPR is crucial in the context of this problem. In this research, we proposed a novel approach to deduce the utility of visual cues ‘specific’ to an environment and a particular place, unified through a pipeline that guides keypoint filtering at the local feature matching stage. Our proposed pipeline leads to consistent state-of-the-art performance on three standard benchmark datasets exhibiting challenging appearance change and viewpoint shift, while simultaneously reducing the storage and compute time requirements.
[Fig. 8: virtual landmark experiment, (a) Sparse and (b) Dense settings]
A number of areas are of interest for further investigation. We employed a contrastive learning approach here, but there may be other, better-suited learning schemes. Moving beyond the two utility measures investigated here, it may be profitable to learn a much larger number of utility measures, for example a measure of utility for local areas that lies partway between specific places and the whole environment. The ideal number of utility measures may ultimately depend on their complementarity, which could also be assessed. While we have shown here that finer-grained semantic segmentation below broad classes like road or building may have particular utility for VPR, further research could investigate the relationship between human-defined categories and those which are most useful for VPR. Collectively, future work in this area will help further improve the capabilities of these systems while also bridging the divide between human navigation and autonomous navigation systems, with potential additional benefits in areas like human-robot interaction.
R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.
Y. Latif, R. Garg, M. Milford, and I. Reid, “Addressing challenging place recognition tasks using generative adversarial networks,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2349–2355.
Z. Chen, L. Liu, I. Sa, Z. Ge, and M. Chli, “Learning context flexible attention model for long-term visual place recognition,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4015–4022, 2018.
P. Gronat, G. Obozinski, J. Sivic, and T. Pajdla, “Learning and calibrating per-location classifiers for visual place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 907–914.
M. Zaffar, S. Garg, M. Milford, J. Kooij, D. Flynn, K. McDonald-Maier, and S. Ehsan, “VPR-Bench: An open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change,” International Journal of Computer Vision, pp. 1–39, 2021.