
A Holistic Visual Place Recognition Approach using Lightweight CNNs for Severe ViewPoint and Appearance Changes

Recently, deep and complex Convolutional Neural Network (CNN) architectures have achieved encouraging results for Visual Place Recognition under strong viewpoint and appearance changes. However, the significant computation and memory overhead of these CNNs limits their practical deployment on resource-constrained mobile robots, which are usually battery-operated. Achieving state-of-the-art performance/accuracy with lightweight CNN architectures is thus highly desirable, but a challenging problem. In this paper, a holistic approach is presented that combines novel region-based features from a lightweight CNN architecture, pre-trained on a place-/scene-centric image database, with the Vector of Locally Aggregated Descriptors (VLAD) encoding methodology, adapted specifically for the Visual Place Recognition problem. The proposed approach is evaluated on a number of challenging benchmark datasets (under strong viewpoint and appearance variations) and achieves an average performance boost of 10% over state-of-the-art algorithms in terms of Area Under the Curve (AUC) computed from precision-recall curves.





I Introduction

Given a query image, an image retrieval system aims to retrieve all images within a large database that contain objects similar to those in the query image. Visual Place Recognition (VPR) can be interpreted as such a system: it tries to recognize a place by matching it against the places in a stored database. As in a range of other computer vision applications, deep-learned CNN features have shown promising results for the VPR problem and have shifted the focus from traditional hand-crafted feature techniques [2][3] to CNNs.

Using a pre-trained CNN for VPR, there are three standard approaches to producing a compact image representation: (a) the entire image is fed directly into the CNN and its layer responses are extracted [4]; (b) the CNN is applied to user-defined regions of the image, and prominent activations are pooled and aggregated from the layers representing those regions [5][6]; (c) the entire image is fed into the CNN and salient regions are identified by directly extracting distinguishing patterns from the convolutional layer responses [7][8]. Generally, category (a) yields global image representations that are not robust against severe viewpoint variations and partial occlusion. Image representations from category (b) usually handle viewpoint changes better but are computation-intensive. Image representations from category (c), on the other hand, address both appearance and viewpoint variations. In this paper, we focus on category (c).

Fig. 1: For a query image (a), the proposed Region-VLAD approach successfully retrieves the correct image (c) from a stored image database under severe condition- and viewpoint-variation. (b) and (d) represent their CNN based discriminative regions identified by our proposed methodology.

The works in [7] and [8] are considered state of the art in identifying prominent regions by directly extracting unique patterns from convolutional layer responses for the VPR problem. In [7], the authors used the VGG16 network [9], pre-trained on ImageNet [10], and used the activations of late convolutional layers for region identification. For regional feature encoding, bag-of-words (BoW) [11] was employed on a separate training dataset to learn a regional codebook. The system is tested on five severely viewpoint- and condition-variant benchmark place recognition datasets with AUC-PR curves [12] as the evaluation metric. It is claimed to outperform FAB-MAP [13], SeqSLAM [14] and other pooling techniques such as cross-pooling [15], sum/average-pooling [16] and max-pooling.
Despite its good AUC-PR performance, the method proposed in [7] has some shortcomings. A common strategy for improving CNN accuracy is to make the network deeper by adding more layers (provided sufficient data and strong regularization). However, increasing the network size means more computation and more memory at both training and test time (e.g., for storing the outputs of intermediate layers and for storing parameters), which is not ideal for resource-constrained robots that are usually battery-operated. Utilizing the late convolutional layers of the deep VGG16 for feature extraction, along with a BoW regional dictionary, degrades the performance of the method proposed in [7] in real-time applications. Moreover, employing a CNN model pre-trained on an object-centric database in [7] leads the CNN to emphasize objects rather than the place itself. This is reflected in the pooled regional feature representations and leads to failure cases.

To bridge these research gaps, this paper proposes a holistic approach targeted at a CNN architecture comprising a small number of layers (such as AlexNet), pre-trained on a place-/scene-centric image database [18], to reduce the memory and computational cost for resource-constrained mobile robots. The proposed method detects novel CNN-based regional features and combines them with the VLAD [19] feature encoding methodology, adapted specifically for the VPR problem. The motivation behind employing VLAD comes from its better performance in various CNN-based image retrieval tasks while utilizing a smaller visual-word dictionary [19][20] compared to BoW [11]. To the best of our knowledge, this is the first work that combines novel lightweight CNN-based regional features with VLAD encoding adapted for computation-efficient and environment-invariant VPR.

As opposed to [7], which uses the VGG16 architecture pre-trained on an object-centric dataset and utilizes a lower convolutional layer for feature descriptors and a higher convolutional layer for identifying landmarks, the method proposed in this paper extracts and aggregates the descriptors lying under the identified novel regions using a single convolutional layer. The presented approach achieves enhanced accuracy by employing the AlexNet architecture, which comprises a small number of layers, pre-trained on the Places365 dataset. Evaluation on several viewpoint- and condition-variant benchmark place recognition datasets shows an average performance boost of 10% over state-of-the-art VPR algorithms in terms of AUC computed on precision-recall curves. In Figure 1, for a query image (a), our proposed system retrieves image (c) from the stored database; (b) and (d) highlight the top distinguishing regions which our proposed methodology identifies under severe viewpoint and condition variation.

The rest of the paper is organized as follows. Section II provides related work for CNN based VPR. In Section III, the proposed methodology is presented in detail. Section IV illustrates the implementation details and the results achieved on several benchmark place recognition datasets. The conclusion is presented in Section V.

II Literature Review

This section provides an overview of major developments in VPR under simultaneous viewpoint and appearance changes using hand-crafted features and CNN-based features.

FAB-MAP [13] is the first work that used handcrafted features (more specifically, SURF features) combined with BoW encoding methodology for VPR. It demonstrated robustness under viewpoint changes due to the invariance properties of SURF. Another work based on sequence matching of images named SeqSLAM [14] achieved remarkable results under severe appearance changes. However, it is unable to deal with simultaneous condition- and viewpoint-variation.

The first CNN-based VPR system was introduced in [4], followed by [21], [5] and [6]. In [4], the authors used Overfeat [22] trained on ImageNet. The Eynsham [13] and QUT datasets, with multiple traverses of the same route under environmental changes, are used as benchmarks. Using the Euclidean distance on the pooled layer responses, test images are matched against the reference images. On the other hand, [6] and [5] used a landmarks-based approach combined with pre-trained CNN models. In [23], the authors introduced two CNN models for the specific task of VPR (named AmosNet and HybridNet), obtained by training and fine-tuning the original object-centric CaffeNet on the place-recognition-centric SPED dataset comprising millions of images. The SPED dataset consists of thousands of places with severe condition variance across the same places at different times of the year. The results showed that HybridNet outperformed AmosNet, CaffeNet and PlaceNet on four publicly available datasets exhibiting strong appearance and moderate viewpoint changes [23]. The work in [7] presented an approach that identifies pivotal landmarks by directly extracting prominent patterns from the responses of the later convolutional layers of a deep object-centric VGG16 network for VPR; it achieves state-of-the-art performance on five severely viewpoint- and condition-variant datasets. Recently, [8] introduced a context-flexible attention model and combined it with a pre-trained object-centric deep VGG16 model fine-tuned on the SPED dataset [23] to learn more powerful condition-invariant regional features. The system has shown state-of-the-art performance on three severely condition- and moderately viewpoint-variant datasets, which reveals that identifying context-based regions using a fine-tuned deep neural network is effective for condition-invariant VPR. However, the efficiency of that approach may be compromised under simultaneous severe viewpoint and condition variations. Moreover, performance and efficient resource usage are both important aspects in real-life robotic applications. Thus, in this paper, we focus on resource- and computation-efficient VPR under simultaneous severe viewpoint and condition variation by utilizing a pre-trained scene-centric shallow CNN model while maintaining the accuracy needed for real-time robotic VPR applications.

Fig. 2: Workflow of the proposed methodology; the test and reference images are fed into the CNN model. Regions of Interest (ROIs) are identified across all the feature maps of the convolutional layer. Feature descriptors under the identified ROIs are pooled and aggregated for a compact image representation. Vectors of Locally Aggregated Descriptors (VLADs) are retrieved and matched by mapping the aggregated regional features onto a pre-trained regional vocabulary.

III Proposed Technique

In this section, the key steps of the proposed methodology are described in detail. It starts with the idea of stacking feature-map activations and extracting CNN-based regions from the convolutional layers. It then illustrates how to aggregate the stacked feature descriptors lying under those CNN-based regions. Finally, it shows how to adapt and integrate VLAD over the aggregated CNN-based regions to determine the match between two images. The workflow of the proposed methodology is shown in Figure 2.

III-A Stacking of Convolutional Layer Activations for Making Descriptors

For an image at a convolutional layer of the CNN model, the output is a tensor of dimensions W × H × K, where K denotes the number of feature maps. We can also interpret it as K sets of W × H activations/responses, one set per feature map. Each activation value of a feature map can be considered the result of convolving some filter with the input image. For the K feature maps of the convolutional layer, we stack the activations at each spatial location across all the feature maps into K-dimensional feature representations, as shown with different colours in Figure 2 (c). In (1), the set of K-dimensional feature descriptors at a given convolutional layer of the model is represented, indexed by feature map and layer.
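As a minimal sketch of this stacking step, assuming a generic K × H × W activation tensor (framework-agnostic; the shapes and names below are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical conv-layer output for one image: K feature maps of size H x W.
K, H, W = 4, 5, 6
rng = np.random.default_rng(0)
conv_out = rng.random((K, H, W))            # one H x W activation map per filter

# Stack the activations at each spatial location across all K maps into a
# K-dimensional descriptor, giving H*W descriptors per image.
descriptors = conv_out.reshape(K, H * W).T  # shape: (H*W, K)

print(descriptors.shape)                    # (30, 4)
# descriptors[i] holds the K activations at spatial location i across all maps
```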


III-B Identification of Regions of Interest

To make use of region-based CNN features, the most prominent regions are first identified by grouping the non-zero and spatially 8-connected activations across all the feature maps, as shown in Figure 2 (d). The energy of each region is calculated by averaging all the activations lying under it, and the top-N energetic regions with their bounding boxes are picked. Figure 3 shows a sample image with the top-N novel regions for several values of N. With the inclusion of more prominent patterns, objects that vary with time, such as cars, clouds and pedestrians, also get included in our regions of interest due to the scene-centric training of the employed CNN. However, it is worth noting that our novel CNN-based identified regions concentrate strongly on static objects, including buildings, trees and road signs.

Fig. 3: Sample image with the top-N Regions of Interest (ROIs) identified by our proposed approach; the CNN-based identified regions emphasize static objects including buildings, trees and road signs.
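The 8-connected grouping and energy ranking described above can be sketched as follows. For clarity the example operates on a single feature map (the paper groups activations across all maps of the layer), and `top_regions` is a hypothetical helper name:

```python
import numpy as np
from collections import deque

def top_regions(fmap, n):
    """Group non-zero, 8-connected activations into regions and return the
    n most energetic regions (highest mean activation), as (energy, cells)."""
    H, W = fmap.shape
    seen = np.zeros((H, W), dtype=bool)
    regions = []
    for y in range(H):
        for x in range(W):
            if fmap[y, x] != 0 and not seen[y, x]:
                # BFS over the 8-neighbourhood to collect one connected region
                q, cells = deque([(y, x)]), []
                seen[y, x] = True
                while q:
                    cy, cx = q.popleft()
                    cells.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < H and 0 <= nx < W
                                    and fmap[ny, nx] != 0 and not seen[ny, nx]):
                                seen[ny, nx] = True
                                q.append((ny, nx))
                energy = float(np.mean([fmap[c] for c in cells]))  # region energy
                regions.append((energy, cells))
    regions.sort(key=lambda r: -r[0])        # most energetic first
    return regions[:n]

fmap = np.array([[0, 2, 2, 0],
                 [0, 0, 0, 0],
                 [5, 0, 0, 1]], dtype=float)
best = top_regions(fmap, 2)
print([r[0] for r in best])                  # [5.0, 2.0]
```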

In (2), the N novel regions of interest identified across all the feature maps are represented. To pool the novel region-based CNN representations, the descriptors in (1) that fall under the regions in (2) are aggregated using (3). This gives the final region-based CNN features for the N novel regions representing an image at the chosen convolutional layer (see Figure 2 (e)).


In (4), the set of aggregated K-dimensional ROI descriptors representing the N novel regions is given (see Figure 2 (f)).
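A minimal sketch of this region-wise aggregation (eq. (3)-(4) in the text), with hypothetical region cell lists standing in for the identified ROIs:

```python
import numpy as np

# Illustrative conv-layer output: K maps of size H x W.
K, H, W = 3, 4, 4
rng = np.random.default_rng(1)
conv_out = rng.random((K, H, W))
# Cells (y, x) covered by two hypothetical ROIs.
regions = [[(0, 0), (0, 1)], [(2, 2), (3, 2), (3, 3)]]

# For each region, sum the K-dimensional descriptors at the locations it covers,
# yielding one K-dimensional vector per region.
aggregated = np.stack([
    sum(conv_out[:, y, x] for (y, x) in cells)
    for cells in regions
])
print(aggregated.shape)                      # (2, 3): N regions x K dims
```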


III-C Region-based Vocabulary and Extraction of VLADs for Image Matching

The Vector of Locally Aggregated Descriptors (VLAD) adopts K-means-based vector quantization [11]: it accumulates the quantization residues of the features assigned to each dictionary cluster and concatenates the accumulated vectors into a single representation. To employ VLAD on the regional aggregated features, a pre-trained region-based vocabulary is needed. Thus, a separate dataset of images is collected and the afore-described region-based aggregation is applied to it. To learn a diverse regional vocabulary, we employed place-recognition-centric images of places from Query247 [24] (taken at day, evening and night times). Other images come from the benchmark place recognition dataset St. Lucia [23], with frames of two traverses captured in a suburban environment at multiple times of the day. The remaining images consist of multiple viewpoint- and condition-variant traverses of urban and suburban routes collected from Mapillary (please see Figure 4). Mapillary was previously employed in [5] and [7] for capturing viewpoint- and condition-variant place recognition datasets.

The aggregated ROI descriptors are computed for all the vocabulary images and clustered into V regions using K-means, such that each cluster centre represents one region centre in the regional codebook. The regional dictionary in (5) thus consists of the aggregated ROI descriptors clustered into V regions. Using the learned codebook, the regions of the benchmark test and reference traverses are quantized to predict their clusters/labels.


In (6), the cluster labels of all the regions are given, where the quantization function maps the regions onto the learned codebook. Using the original region-based features, the predicted labels and the regional codebook, the VLAD descriptor for each region can be retrieved using (7).

Fig. 4: Sample images of the dataset used for the regional vocabulary; the first and second columns show Query247 images [24]. The same place under different conditions can be seen in the first column. Images in the third column are taken from the suburban datasets collected from Mapillary, while the fourth column shows St. Lucia traverses [23].

In (7), for the regions that fall into a given cluster of the codebook, the sum of the residues between those regions and the codebook's region centre is calculated. Sometimes a few regions/words appear more frequently in an image than statistically expected, a phenomenon known as visual-word burstiness [25]. To counter it, standard power normalization [26] is performed in (8), where each component undergoes a non-linear transformation. In (9), power normalization is followed by L2 normalization. For every image, the resulting V × K components are stored to form the final VLAD representation.
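The quantization, residue accumulation and the two normalizations can be sketched as follows; `vlad` is a hypothetical helper, and the power-normalization exponent is an assumed value (the text does not state one):

```python
import numpy as np

def vlad(features, centroids, alpha=0.5):
    """Sketch of VLAD encoding over aggregated regional features.
    features: (N, K) regional descriptors; centroids: (V, K) regional codebook.
    alpha is an assumed power-normalization exponent."""
    V, K = centroids.shape
    # Quantize: nearest codebook centre for each regional feature (eq. (6)).
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Accumulate residues per cluster (eq. (7)).
    enc = np.zeros((V, K))
    for f, v in zip(features, labels):
        enc[v] += f - centroids[v]
    # Power normalization to damp visual-word burstiness (eq. (8)).
    enc = np.sign(enc) * np.abs(enc) ** alpha
    # L2 normalization of the concatenated vector (eq. (9)).
    flat = enc.ravel()
    norm = np.linalg.norm(flat)
    return flat / norm if norm > 0 else flat

rng = np.random.default_rng(2)
feats = rng.random((6, 4))      # N=6 regional features, K=4 dims (illustrative)
cents = rng.random((3, 4))      # V=3 cluster centres (normally learned by K-means)
v = vlad(feats, cents)
print(v.shape)                  # (12,): V*K components, unit L2 norm
```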


To match a test image "A" against a reference image "B", one-to-one cosine matching of their VLAD descriptors is performed using (11) (please see Figure 5).


Using (12), the cosine dot products of all the regions are summed to reach a single score. For each test image "A", this cosine matching is performed against all the reference images, and at the end, the reference image with the highest similarity score is picked as the matched image.
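A minimal sketch of this matching step, under the simplifying assumption that each image is represented by a single L2-normalised VLAD vector (so the cosine similarity reduces to a dot product); the names are illustrative:

```python
import numpy as np

def similarity(vlad_a, vlad_b):
    """Cosine similarity of two L2-normalised VLAD representations:
    a dot product, in the spirit of eq. (11)-(12)."""
    return float(np.dot(vlad_a, vlad_b))

def best_match(query, references):
    """Score the query against every reference; highest score wins."""
    scores = [similarity(query, r) for r in references]
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(3)
refs = [r / np.linalg.norm(r) for r in rng.random((4, 8))]
query = refs[2] + 0.01 * rng.random(8)     # near-duplicate of reference 2
query /= np.linalg.norm(query)
idx, scores = best_match(query, refs)
print(idx)                                 # index of the best-matching reference
```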

Fig. 5: Pictorial view of the Regional Vocabulary is shown here along with the mapping of the ROIs-Descriptors of the test and reference images for VLAD retrieval.

IV Datasets, Implementation Details, Results and Analysis

This section presents the implementation details of our proposed system and evaluates its runtime performance for real-time robotic VPR applications. The proposed method is compared with state-of-the-art VPR algorithms over several benchmark datasets and the obtained results are reported. The section ends with correctly matched and mismatched scenarios of our proposed Region-VLAD framework, along with a discussion of both.

More specifically, the challenging benchmark VPR datasets Berlin A100, Berlin Halenseestrasse and Berlin Kudamm (see [7] for a more detailed introduction), collected from the crowdsourced geotagged photo-mapping platform Mapillary, were used to evaluate the proposed approach. Each dataset covers two traverses of the same route uploaded by different users. One traverse is used as the reference database and the other as the test database (please see TABLE I). Another dataset, Garden Point, was captured at the QUT campus, with one traverse taken in daytime on the left sidewalk and the other recorded at night on the right sidewalk. The Synthesized Nordland dataset was recorded on a train, with one traverse taken in winter and the other recorded in spring. Viewpoint variance was added by cropping frames to keep 75% resemblance [8]. Sample images of both traverses of the benchmark datasets are shown in Figure 6; severe conditional and viewpoint variations can be seen across the same places. For Berlin A100, Berlin Halenseestrasse and Berlin Kudamm, ground truths were generated using the geotagged information under different conditions and viewpoints, matching images of one traverse with the closely resembling images of the other. For Garden Point and Synthesized Nordland, the ground truths were obtained by parsing the frames and maintaining place-level resemblance.

Dataset                  Test   Reference   Environment   Viewpoint     Condition
Berlin A100              81     85          urban         strong        moderate
Berlin Halenseestrasse   157    67          urban         very strong   moderate
Berlin Kudamm            222    201         urban         very strong   moderate
Garden Point             200    200         campus        very strong   very strong
Synthesized Nordland     1622   1622        train         moderate      strong
TABLE I: Benchmark place recognition datasets (Test/Reference give the number of frames per traverse)

The proposed method is implemented in Python, and the system's average runtime over 5 iterations is recorded over the combined set of test and reference images. AlexNet pre-trained on the Places365 dataset is employed as the CNN model for region-based feature extraction, with the network's standard input image size. AlexNet is a lightweight CNN model that contains five convolutional and three fully connected layers. Convolutional layers contain richer spatial information than the fully connected layers. Middle convolutional layers capture more generic features (edges, colours), whereas later convolutional layers focus on higher-level semantic information including shapes, objects and buildings [23]. For all the baseline experiments, we utilize the middle convolutional layer conv3 only. The motivation behind employing conv3 is its good performance in various VPR approaches [5][6], but other convolutional layers, including conv2, conv4 and conv5, can also be used for region identification.

Fig. 6: Strong viewpoint and conditional variations can be observed across the same places for Berlin A100 (first row), Berlin Halenseestrasse (second row), Berlin Kudamm (third row), Garden Point (fourth row) and Synthesized Nordland (fifth row). Left-column frames are taken from the test traverses and right-column frames from the reference traverses of the datasets.

For a single image, the average forward-pass times using Caffe on an NVIDIA P100 and on an Intel Xeon Gold 6134 @3.2GHz are reported in Table II. We extract and aggregate the ROI descriptors for conv3 with a total time comparable to the state-of-the-art method [7] (see Table II). The VLADs are retrieved and matched using the aggregated ROI descriptors on a clustered dictionary trained on the aggregated ROI descriptors of the vocabulary dataset. For direct comparison with [7], we use the same number of ROIs as their method; results are also reported for other settings of N and V, using AUC-PR [12] as the accuracy criterion. The choice of dictionary size is tied to the value of N: with larger N we use a larger regional dictionary, and with smaller N a dictionary with fewer clustered regions. Table II shows that for both regional settings our average matching times are faster than those of [7], albeit under a different implementation environment. For both regional settings, employing the NVIDIA P100 for the forward pass and the Intel Xeon Gold 6134 @3.2GHz for feature encoding and VLAD matching, the overall matching time for one test image against one reference image is lower than that of Region-BoW [7], which uses a Titan X Pascal GPU for forward pass, feature encoding and matching. It is worth noting that we could further reduce the time cost by employing the NVIDIA P100 for feature encoding and matching as well.

Methodology Our Region-VLAD (Python) Region-BoW (MATLAB) [7]
Model AlexNet365 VGG16
Images 1125 1000
GPU/CPU NVIDIA P100 Intel Xeon Gold 6134 @3.2GHz Titan X Pascal GPU
Forward pass time (ms) 0.305190 15.574639 59
ROIs-Descriptors ”N” 50 100 200 300 400 500 200
Extraction and Aggregation time (s) 0.328 0.361 0.394 0.402 0.443 0.452 0.349
Regions ”V” 64 128 256 64 128 256 64 128 256 64 128 256 64 128 256 64 128 256 10k Visual words
Matching time
VLAD encoding 1.33 2.05 3.58 1.55 2.28 3.79 1.91 2.4 4.03 1.99 2.68 4.28 2.13 2.96 4.54 2.36 3.16 4.75 7
VLAD matching 0.05 0.06 0.12 0.05 0.08 0.13 0.05 0.07 0.12 0.04 0.07 0.12 0.05 0.08 0.12 0.05 0.07 0.12
TABLE II: Comparison of the proposed method with Region-BoW[7]

For Berlin Halenseestrasse and Synthesized Nordland, the proposed method significantly outperforms all other state-of-the-art methods in both regional settings, as shown in Figure 7 and Figure 8. For Berlin Kudamm, our approach with the higher number of regions shows state-of-the-art results (see Figure 9). For Berlin A100, Region-BoW [7] performs slightly better than the proposed method (see Figure 10). The AUC-PR curves of the benchmark datasets for all other approaches are taken from [7].

Fig. 7: AUC PR-curves for Berlin Halenseestrasse dataset are presented here; our proposed Region-VLAD approach outperforms all other VPR techniques.
Fig. 8: For the Synthesized Nordland dataset, our proposed Region-VLAD approach outperforms all other VPR techniques in terms of AUC PR-curve.
Fig. 9: AUC PR-curves for Berlin Kudamm dataset are presented here; our proposed Region-VLAD with higher regional features outperforms all other VPR techniques.
Fig. 10: Region-BoW [7] shows the state-of-the-art AUC-PR curve for the Berlin A100 dataset, but our proposed Region-VLAD approach achieves approximately similar results.

Both Garden Point traverses exhibit strong viewpoint and condition variance with strong temporal coherence between the frames. Taking advantage of the sequential information, SeqSLAM manages to beat all other techniques, while our approach with more regional features shows slightly better performance than Region-BoW [7]. This also highlights the benefit of employing more regions under simultaneous viewpoint and condition variation (see Figure 11). Across all five benchmark datasets, the median AUC-PR performance of all the VPR methods is shown in Figure 12. It is evident that the proposed Region-VLAD framework achieves considerably better results than the state-of-the-art VPR approaches.

Fig. 11: AUC PR-curves for the Garden Point dataset across multiple VPR approaches; our proposed Region-VLAD framework with more regional features slightly leads Region-BoW [7]. SeqSLAM, owing to its strong performance under sequence-based frame matching, shows state-of-the-art results.

The variation in the AUC-PR curves of the proposed Region-VLAD framework across the benchmark datasets is due to several factors. The first is the environment of the dataset on which the CNN model is trained. The Places365 database [18] consists of 365 scene categories, where each category contains different places exhibiting the same scene, such as shopping mall, restaurant, rain-forest and other indoor/outdoor scenes. This strongly influences the CNN layer responses: activations focus on the objects of the scene categories the network was trained on. For example, in Berlin Halenseestrasse, frames contain objects such as traffic signals, buildings, cars, pedestrians and trees, so even from a different viewpoint of the same place, the CNN focuses on the scene by emphasizing viewpoint-variant objects, and the place is correctly recognized. However, we have observed that if the places contain frequently changing objects such as cars and pedestrians, and exhibit reasonable but less severe conditional variations combined with stronger viewpoint changes, as in Berlin A100, then employing a scene-centric CNN sometimes deteriorates the performance. This is probably because, due to the scene-centric training of the CNN model, the activations concentrate strongly on those objects.

Fig. 12: Median AUC-PR curves across all the five benchmark datasets are presented here; our proposed Region-VLAD approach significantly outperforms all other VPR approaches.

Secondly, the diversity and size of the dataset employed to build the regional vocabulary also play a crucial role, with VLAD encoding and cosine matching contributing equally to determining region similarities. We have also seen that picking more regions boosts accuracy. This is apparently because, in the pre-trained regional vocabulary, some clustered regions may suit one dataset better than others. But sometimes the inclusion of more regions also degrades performance: each region contributes to the final matching score, which can result in a wrong match when multiple reference images exhibit similar scenes, and including more but less energetic regions dilutes the overall score for the correct match. For the VLAD retrieval, the dataset collected for the regional vocabulary is considerably smaller than the one employed in Region-BoW [7]; the bigger the dataset, the more diverse the dictionary. However, due to our system's runtime memory limitation on loading images for region and feature aggregation, we confined ourselves to a smaller dataset, while keeping enough variety to learn diverse regional features, which is reflected in our results despite the small vocabulary size. The stability of the K-means clustering used to build the regional vocabulary is also worth noting: we generated the dictionary twice using the same dataset, and the AUC-PR curves across all the benchmark datasets varied only marginally between the two dictionaries. Lastly, employing a CNN architecture with fewer layers was found to be computation- and memory-efficient, and also showed the potential to boost performance with our proposed Region-VLAD approach for environment-invariant VPR.

Fig. 13: Sample correctly retrieved matches using the proposed Region-VLAD methodology are presented here; our proposed system identifies common regions across the queries and the retrieved images under strong environment variations.

Some sample matched (green) and mismatched (red) images using the proposed methodology are shown in Figure 13 and Figure 14. For the correct matches, our proposed methodology successfully identifies the common regions (shown with different coloured boxes in Figure 13) under simultaneous viewpoint and appearance changes. For the queries whose retrieved images are mismatched, as in Figure 14, the top contributing conv3 regions are shown. The coloured boxes on the corresponding regions (trees, lamp posts, cars, doors, clouds and buildings) show the areas where the system confuses and matches the scenes but wrongly recognizes the places. The failure cases again point towards the scene-centric training of the CNN: our approach identifies common regional features of geographically different places (query and retrieved frames) exhibiting similar scenes, which leads to place mismatches. For some unmatched scenarios, we also observed that the retrieved images are quite similar and geographically close to the test images, but due to the ground-truth priorities, we considered those cases as unmatched. Datasets and results are available at [27].

Fig. 14: Sample incorrectly retrieved matches using the proposed approach; the queries and the retrieved images are geographically different but exhibit similar scenes. The identified novel regions mislead the system, which recognizes the scenes but wrongly matches the places.

V Conclusion

For Visual Place Recognition on resource-constrained mobile robots, achieving state-of-the-art accuracy with lightweight CNN architectures is highly desirable but challenging. This paper has taken a step in this direction and presented a holistic approach targeted at a CNN architecture comprising a small number of layers, pre-trained on a place-/scene-centric image database, to reduce the memory and computational cost for resource-constrained mobile robots. The proposed framework detects novel CNN-based regional features and combines them with the VLAD encoding methodology, adapted specifically for the computation-efficient and environment-invariant VPR problem. The proposed method achieved state-of-the-art AUC-PR curves on severely viewpoint- and condition-variant benchmark place recognition datasets.

In the future, it would be useful to analyse the performance of the proposed framework on other shallow/deep CNN models individually trained on place-recognition-centric datasets. Furthermore, instead of employing a fixed number of novel regions, it would be interesting to investigate dynamic regional feature selection at runtime and its performance on multiple regional vocabularies. A fused multi-layer regional approach integrated with fine-tuned place-recognition-centric CNN models [7] is also a worthwhile research direction that can further improve recognition, especially when geographically different places exhibit similar scenes under severe environmental changes.



  • [1] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, 2016.
  • [2] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in ECCV.   Springer, 2006, pp. 404–417.
  • [3] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [4] Z. Chen, O. Lam, A. Jacobson, and M. Milford, “Convolutional neural network-based place recognition,” arXiv preprint arXiv:1411.1509, 2014.
  • [5] N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” Proceedings of Robotics: Science and Systems XII, 2015.
  • [6] P. Panphattarasap and A. Calway, “Visual place recognition using landmark distribution descriptors,” in Asian Conference on Computer Vision.   Springer, 2016, pp. 487–502.
  • [7] Z. Chen, F. Maffra, I. Sa, and M. Chli, “Only look once, mining distinctive landmarks from convnet for visual place recognition,” in IROS.   IEEE, 2017, pp. 9–16.
  • [8] Z. Chen, L. Liu, I. Sa, Z. Ge, and M. Chli, “Learning context flexible attention model for long-term visual place recognition,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4015–4022, 2018.
  • [9] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [11] J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object matching in videos,” in ICCV.   IEEE, 2003, p. 1470.
  • [12] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (roc) curve.” Radiology, vol. 143, no. 1, pp. 29–36, 1982.
  • [13] M. Cummins and P. Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
  • [14] M. J. Milford and G. F. Wyeth, “Seqslam: Visual route-based navigation for sunny summer days and stormy winter nights,” in ICRA.   IEEE, 2012, pp. 1643–1649.
  • [15] L. Liu, C. Shen, and A. van den Hengel, “Cross-convolutional-layer pooling for image recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 11, pp. 2305–2313, 2017.
  • [16] A. Babenko and V. Lempitsky, “Aggregating local deep features for image retrieval,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1269–1277.
  • [17] G. Tolias, R. Sicre, and H. Jégou, “Particular object retrieval with integral max-pooling of cnn activations,” arXiv preprint arXiv:1511.05879, 2015.
  • [18] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE transactions on pattern analysis and machine intelligence, 2017.
  • [19] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in CVPR.   IEEE, 2010, pp. 3304–3311.
  • [20] R. Arandjelovic and A. Zisserman, “All about vlad,” in CVPR, 2013, pp. 1578–1585.
  • [21] N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford, “On the performance of convnet features for place recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on.   IEEE, 2015, pp. 4297–4304.
  • [22] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
  • [23] Z. Chen, A. Jacobson, N. Sünderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford, “Deep learning features at scale for visual place recognition,” in ICRA.   IEEE, 2017, pp. 3223–3230.
  • [24] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla, “24/7 place recognition by view synthesis,” in CVPR, 2015, pp. 1808–1817.
  • [25] H. Jégou, M. Douze, and C. Schmid, “On the burstiness of visual elements,” in CVPR.   IEEE, 2009, pp. 1169–1176.
  • [26] T.-T. Do, T. Hoang, D.-K. L. Tan, and N.-M. Cheung, “From selective deep convolutional features to compact binary representations for image retrieval,” arXiv preprint arXiv:1802.02899, 2018.
  • [27] “Results and datasets,”