I Introduction

Visual place recognition (VPR) [31, 45, 33, 12] is a vital sub-problem in the autonomous navigation of mobile robots and has attracted considerable research attention in recent years. VPR aims to help a robot determine whether it is located in a previously visited place. Specifically, an existing database of the environment stores visual data (such as frames) for various places. Given a query, we expect to obtain its location by finding the database data captured at the same (or a nearby) location as the query. In short, VPR can assist mobile robots or autonomous unmanned systems in localization and loop closure detection (LCD) in GPS-denied environments.
Since standard frame cameras provide rich appearance information of everyday scenes, existing frame-based VPR methods achieve good performance in such scenarios. However, standard cameras suffer from low frame rates, motion blur, and sensitivity to illumination changes, which makes it difficult for traditional VPR methods to handle challenging scenes (e.g., high speed and high dynamic range). Event cameras [8, 10, 5, 43], which are neuromorphic vision sensors, record microsecond-level pixel-wise brightness changes and offer significant advantages (e.g., low latency, rich motion information, and high dynamic range). Nevertheless, event cameras lack appearance (texture) information in some cases (e.g., still or low-speed scenes). The two kinds of vision sensors are therefore complementary. As shown in Fig. 1, we present two representative examples (glare and motionless scenes), both of which can be handled by fusing the two sensors.
Inspired by this complementarity, we combine standard frame cameras and event cameras to lift the limitations faced by single vision sensors in large-scale place recognition, thereby improving the performance of the VPR pipeline in challenging scenarios. Recently, several works have combined frames and events for machine vision tasks and achieved excellent results [13, 21, 17, 14, 35, 28, 41, 23]. However, combining frames and events for VPR is not straightforward. We need to deal with several challenges: (1) How to fuse raw frames with event data? (2) How to extract multi-scale features? (3) How to fuse multiple features into a unified descriptor? To address these challenges, we propose a robust VPR method that fuses frames and events (FE-Fusion-VPR) for large-scale place recognition problems (Supplementary Material: an accompanying video for this work is available at https://youtu.be/g4tl--2nvhM). FE-Fusion-VPR is an attention-based multi-scale deep network architecture for mixed frame-/event-based VPR. It effectively combines the advantages of standard frame cameras and event cameras and suppresses their disadvantages, thus achieving better VPR performance than using frames or events alone. To our knowledge, it is the first end-to-end deep network architecture combining standard cameras and event cameras for VPR, and it surpasses the existing SOTA methods. In summary, the main contributions of this paper are as follows:
-
We propose a novel two-stream network (TSFE-Net) that fuses intensity frames and event volumes for hybrid feature extraction and is compatible with the asynchronous, irregular data produced by event cameras.
-
We propose an attention-based multi-scale fusion network (MSF-Net) and design a descriptor re-weighting network (DRW-Net) that assigns weights to different sub-descriptors to obtain the best descriptor representation, achieving SOTA VPR performance.
-
We comprehensively compare our FE-Fusion-VPR with other SOTA frame-/event-based VPR methods on the Brisbane-Event-VPR and DDD20 datasets to demonstrate the superior performance of our method.
-
We conduct ablation studies on each network component to comprehensively demonstrate the compactness and effectiveness of our FE-Fusion-VPR pipeline.
II Related Work
II-A Frame-based VPR Methods
Conventional VPR algorithms mainly consist of two steps: feature extraction and feature matching. In feature extraction, key low-level features are extracted from high-dimensional visual data for storage. These key features are generally called descriptors or representations and can be local/global or sparse/dense. Earlier frame-based VPR works focused on hand-crafted algorithms, including local feature extractors (SIFT [30], SURF [4], ORB [37]), global feature extractors (HoG [7], GiST [34]), and feature aggregators (BoW [1, 11], FV [36, 38], VLAD [19, 3, 40, 24]). However, hand-crafted methods usually need to be elaborately designed and are not robust to changes in illumination, season, viewpoint, etc. In contrast, learning-based (especially deep learning) feature extraction algorithms can learn general features and achieve better performance than hand-crafted algorithms in large-scale image retrieval tasks [33].
For example, Arandjelovic et al. improved VLAD into a trainable pooling layer (NetVLAD [2]) that can be directly integrated into a CNN-based VPR framework. On this basis, Patch-NetVLAD [15] first computes global NetVLAD descriptors to filter reference candidates, then uses local patch-level NetVLAD descriptors for refined matching, which yields higher performance than NetVLAD. MR-NetVLAD [25] augments NetVLAD with multi-resolution image pyramid encoding, resulting in richer place representations that better handle challenging scenarios (such as illumination and viewpoint changes). Although learning-based VPR methods achieve good performance, they still struggle in glare and high-speed scenes due to the inherent shortcomings of standard cameras.
II-B Event-based VPR Methods
Recently, using event cameras to solve VPR problems has drawn more and more attention from researchers. Fischer et al. proposed an event-based VPR method (Ensemble-Event-VPR) with ensembles of temporal windows [9], which reconstructs intensity frames from event streams over different temporal windows, extracts visual descriptors using NetVLAD [2], and then integrates the distance matrices of multiple descriptors for VPR. Lee et al. proposed EventVLAD [27], which recovers edge images from event streams and then uses NetVLAD for descriptor generation. Whether using reconstructed intensity frames or edge images, both methods transform event streams into frames and are therefore still frame-based VPR methods in essence. To utilize event streams directly, we previously proposed an event-based end-to-end deep network architecture for VPR (Event-VPR) [26]. Event-VPR adopts the EST voxel grid representation, combines a deep residual network with a VLAD layer to extract visual descriptors, and is trained with a weakly supervised loss, achieving excellent performance on multiple challenging driving datasets using events directly. However, since event cameras cannot directly output the appearance of the scene, VPR in weakly textured and motionless scenes remains difficult. The latest work, VEFNet [18], simply uses a cross-modality attention module and a self-attention module to fuse frames and events on top of a VGG feature extractor. However, it does not outperform NetVLAD-based VPR methods in most challenging cases. In comparison, our proposed FE-Fusion-VPR outperforms both frame-based and event-based SOTA methods.
III Methodology

III-A The Overall Architecture
An overview of the proposed FE-Fusion-VPR pipeline is shown in Fig. 2. FE-Fusion-VPR comprises a two-stream feature extraction network (TSFE-Net), a multi-scale fusion network (MSF-Net), and a descriptor re-weighting network (DRW-Net). Its backbone consists of residual blocks from ResNet34 [16]. The intensity frame and the events are first fed into TSFE-Net to obtain a shallow fused feature. MSF-Net then extracts multi-scale features, which are aggregated into three sub-descriptors. Next, DRW-Net learns weights for the sub-descriptors to produce the final refined descriptor, which is used to match the query against the reference data. The whole network is trained end-to-end with weak supervision. In addition, an attention layer, consisting of a channel attention mechanism and a spatial attention mechanism [42], is used throughout the FE-Fusion-VPR architecture. Passing feature maps of different scales through the attention layer yields more refined feature maps, which effectively improves VPR performance.
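To make the attention layer concrete, below is a minimal PyTorch sketch of a CBAM-style block [42] (channel attention followed by spatial attention). The module name, reduction ratio, and spatial kernel size are illustrative assumptions, not the exact configuration used in FE-Fusion-VPR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionLayer(nn.Module):
    """CBAM-style block: channel attention followed by spatial attention [42].

    A minimal sketch; layer names, reduction ratio, and kernel size are
    illustrative, not necessarily the FE-Fusion-VPR configuration.
    """

    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Channel attention: shared MLP applied to avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        # Spatial attention: convolution over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size=spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention weights in (0, 1), broadcast over H x W.
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention weights in (0, 1), broadcast over channels.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.spatial(s))


# Example: refine a fused feature map of shape (B, C, H, W).
feat = torch.randn(2, 128, 56, 56)
refined = AttentionLayer(channels=128)(feat)   # same shape as the input
```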
III-B Two-Stream Feature Extraction Network
Due to the asynchronous nature of events, combining them with intensity frames remains challenging, especially for learning-based methods. Our experiments in Event-VPR [26] showed that the choice of event representation has little effect on VPR performance. Therefore, in this paper, event volumes are converted directly into event frames [32], which are fed into TSFE-Net together with the intensity frames. As shown in Fig. 2, to learn a multi-modal shared feature, our two-stream feature extraction network (TSFE-Net) extracts two kinds of shallow features and fuses them. Inspired by Event-VPR [26], we use the convolutional layer conv1 and the residual blocks conv2_x cut from ResNet34 and attach an attention layer to each stream to improve the quality of the shallow features. For better performance, we concatenate the shallow features along the channel dimension [44]. Finally, we apply a max-pooling operation and an attention layer to obtain rich and effective scene information. Our two-stream feature extractor can be summarized as follows:
$$
\begin{aligned}
\mathbf{F}_F &= \mathcal{E}_F(\mathbf{I}_F), \qquad \mathbf{F}_E = \mathcal{E}_E(\mathbf{V}_E), \\
\mathbf{F}_S &= \mathrm{Attn}\big(\mathrm{MaxPool}(\mathbf{F}_F \oplus \mathbf{F}_E)\big),
\end{aligned}
\tag{1}
$$
where $\mathcal{E}_F$ and $\mathcal{E}_E$ are the encoders processing frames and events respectively. The encoder structure is Conv(7×7, 64, /2)–Attn–MaxPool2d(/2)–ResBlock0(3×3, 64)–ResBlock1(3×3, 64)–ResBlock2(3×3, 64)–Attn–BatchNorm–ReLU. $\mathbf{I}_F$ and $\mathbf{V}_E$ are the intensity frame and the event volume, $\mathbf{F}_F$ and $\mathbf{F}_E$ are their primary features, $\oplus$ is the concatenation operation, and $\mathbf{F}_S$ is the hybrid feature after fusion.
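As a minimal PyTorch sketch of Eq. (1), the snippet below builds the two shallow encoders from the conv1 and conv2_x (layer1) blocks of torchvision's ResNet34 and fuses them by channel-wise concatenation and max-pooling. The attention layers are omitted, and the input channel counts (one grayscale channel for frames, two polarity channels for event frames) are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34


def make_shallow_encoder(in_channels: int) -> nn.Sequential:
    """conv1 + conv2_x cut from ResNet34, with the first convolution adapted to
    the input channel count. The attention layers of Eq. (1) are omitted."""
    net = resnet34(weights=None)
    conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return nn.Sequential(conv1, net.bn1, net.relu, net.maxpool, net.layer1)


class TSFENet(nn.Module):
    """Two-stream shallow feature extraction and fusion (sketch of Eq. (1))."""

    def __init__(self, frame_channels: int = 1, event_channels: int = 2):
        super().__init__()
        self.frame_encoder = make_shallow_encoder(frame_channels)
        self.event_encoder = make_shallow_encoder(event_channels)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, frame: torch.Tensor, event_frame: torch.Tensor) -> torch.Tensor:
        f_frame = self.frame_encoder(frame)           # (B, 64, H/4, W/4)
        f_event = self.event_encoder(event_frame)     # (B, 64, H/4, W/4)
        fused = torch.cat([f_frame, f_event], dim=1)  # channel-wise concatenation
        return self.pool(fused)                       # hybrid shallow feature


frame = torch.randn(2, 1, 256, 256)
events = torch.randn(2, 2, 256, 256)   # e.g. positive/negative polarity counts
hybrid = TSFENet()(frame, events)       # (2, 128, 32, 32)
```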
III-C Multi-Scale Fusion Network
Many works [6, 39] have demonstrated that mid-level visual features are robust to appearance changes, while high-level visual features are robust to viewpoint changes in VPR tasks. Therefore, a multi-scale network architecture can, in principle, improve VPR accuracy. Our idea is inspired by the feature pyramid network (FPN) [29], a typical multi-scale architecture. However, FPN degrades in some cases (such as the detection of large objects [22]). To achieve more efficient communication between different levels, our proposed multi-scale fusion network (MSF-Net) fuses features of different scales as follows (see Fig. 3). First, the backbone performs bottom-up feature extraction over three stages of residual structure, whose output features are $\mathbf{F}_1$, $\mathbf{F}_2$, and $\mathbf{F}_3$ respectively. In particular, the residual structure of the first stage is already contained in TSFE-Net; thus, we directly apply a max-pooling layer and an attention layer to the fused output feature $\mathbf{F}_S$ of TSFE-Net to obtain the first-stage feature $\mathbf{F}_1$. The features $\mathbf{F}_2$ and $\mathbf{F}_3$ are then extracted through the remaining two stages of residual structure (residual blocks conv4_x and conv5_x), which are expressed as follows:
$$
\begin{aligned}
\mathbf{F}_1 &= \mathrm{Attn}\big(\mathrm{MaxPool}(\mathbf{F}_S)\big), \\
\mathbf{F}_2 &= \mathcal{R}_4(\mathbf{F}_1), \qquad \mathbf{F}_3 = \mathcal{R}_5(\mathbf{F}_2),
\end{aligned}
\tag{2}
$$
where $\mathcal{R}_4$ and $\mathcal{R}_5$ denote the conv4_x and conv5_x residual stages, and $\mathbf{F}_1$, $\mathbf{F}_2$, and $\mathbf{F}_3$ have $C_1$, $C_2$, and $C_3$ channels respectively. Next, different from FPN, to make more efficient use of the multi-scale information of each stage, we adopt concatenation to perform stage-wise fusion on the backbone network. Specifically, for each stage of the backbone, we add a branch network as a lateral connection (passway) to fuse the features of two adjacent stages. The branch network includes a convolutional layer (1×1 kernel), an upsampling layer (×2 Up), a batch-normalization layer (BN), and a ReLU activation layer. By adjusting the channels and spatial resolution of the features, we obtain features $\tilde{\mathbf{F}}_1$, $\tilde{\mathbf{F}}_2$, and $\tilde{\mathbf{F}}_3$, each with 256 channels and with spatial resolutions $H_1\times W_1$, $H_2\times W_2$, and $H_3\times W_3$ respectively:
$$
\begin{aligned}
\tilde{\mathbf{F}}_3 &= \phi_3(\mathbf{F}_3), \\
\tilde{\mathbf{F}}_i &= \phi_i\big(\mathbf{F}_i \oplus \mathrm{Up}(\mathbf{F}_{i+1})\big), \quad i \in \{1, 2\},
\end{aligned}
\tag{3}
$$
where $\phi_i$ denotes the branch network of the $i$-th stage and $\mathrm{Up}(\cdot)$ is ×2 upsampling.
After that, we apply the VLAD layer [2, 26] to the three features separately to obtain three corresponding sub-descriptors $\mathbf{d}_1$, $\mathbf{d}_2$, and $\mathbf{d}_3$, all of the same dimension. Next, we concatenate them to obtain the primary multi-scale descriptor $\mathbf{D}_{ms}$, whose dimension is three times that of a single sub-descriptor:
$$
\mathbf{D}_{ms} = \mathbf{d}_1 \oplus \mathbf{d}_2 \oplus \mathbf{d}_3,
\tag{4}
$$
where $\oplus$ represents the concatenation operation. Since the multi-scale fused features contain rich scene details and powerful semantic features, they provide robust feature information for DRW-Net to improve our network's overall performance.
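The sketch below illustrates one possible reading of the stage-wise fusion in Eq. (3): each lateral branch is a 1×1 Conv–BN–ReLU block, and the coarser feature is ×2 upsampled and concatenated with the adjacent finer feature. The stage widths (128/256/512), the omission of the per-scale attention layers, and the separate VLAD aggregation of Eq. (4) are assumptions made only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def branch(in_channels: int, out_channels: int = 256) -> nn.Sequential:
    """Lateral-connection branch: 1x1 Conv -> BN -> ReLU (channel adjustment)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )


class MSFFusion(nn.Module):
    """Stage-wise fusion of three backbone features (one reading of Eq. (3)).

    c1/c2/c3 are assumed stage widths; each fused feature has 256 channels.
    Per-scale attention layers and the per-scale VLAD layers are not shown.
    """

    def __init__(self, c1: int = 128, c2: int = 256, c3: int = 512):
        super().__init__()
        self.phi3 = branch(c3)
        self.phi2 = branch(c2 + c3)   # F2 concatenated with upsampled F3
        self.phi1 = branch(c1 + c2)   # F1 concatenated with upsampled F2

    def forward(self, f1, f2, f3):
        up = lambda x: F.interpolate(x, scale_factor=2, mode="nearest")
        t3 = self.phi3(f3)
        t2 = self.phi2(torch.cat([f2, up(f3)], dim=1))
        t1 = self.phi1(torch.cat([f1, up(f2)], dim=1))
        return t1, t2, t3   # each (B, 256, Hi, Wi); fed to a VLAD layer per scale


f1, f2, f3 = (torch.randn(2, 128, 32, 32),
              torch.randn(2, 256, 16, 16),
              torch.randn(2, 512, 8, 8))
t1, t2, t3 = MSFFusion()(f1, f2, f3)
```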

III-D Descriptor Re-Weighting Network
In MSF-Net, we obtained the primary multi-scale descriptor $\mathbf{D}_{ms}$ aggregated from three features of different scales. To represent scenes better, we need to re-weight this multi-scale descriptor into a compact global descriptor. Therefore, we propose a descriptor re-weighting network (DRW-Net) $\mathcal{W}$, as shown in Fig. 4, to obtain a robust global descriptor:
$$
\hat{\mathbf{D}} = \mathcal{W}(\mathbf{D}_{ms}).
\tag{5}
$$
For the multi-scale descriptor $\mathbf{D}_{ms}$, we calculate the average and the maximum of each sub-descriptor respectively:
$$
\mathbf{g}^{avg}_i = \frac{1}{|\Omega|}\sum_{u \in \Omega} \mathbf{d}_i(u), \qquad
\mathbf{g}^{max}_i = \max_{u \in \Omega} \mathbf{d}_i(u),
\tag{6}
$$
where $\mathbf{g}^{avg}_i$ and $\mathbf{g}^{max}_i$ are the channel-wise global representations of the sub-descriptors, $i$ denotes the index of the sub-descriptor, and $u \in \Omega$ ranges over the spatial coordinates of the sub-descriptors. Then, we append two fully connected (FC) layers to learn two kinds of weights for the sub-descriptors, $w^{avg}_i$ and $w^{max}_i$, and add them to obtain the final sub-descriptor weights through a soft-max layer:
$$
w^{avg}_i = f_{avg}(\mathbf{g}^{avg}_i), \qquad
w^{max}_i = f_{max}(\mathbf{g}^{max}_i), \qquad
w_i = \mathrm{softmax}_i\big(w^{avg}_i + w^{max}_i\big),
\tag{7}
$$
where $f_{avg}$ and $f_{max}$ are the two kinds of fully connected (FC) layers, whose hidden layers transform the global representations to a fixed length. The soft-max operation balances the weights of the three sub-descriptors so that they sum to 1. Finally, we multiply each sub-descriptor by its weight and sum the weighted sub-descriptors channel by channel:
$$
\hat{\mathbf{D}} = \sum_{i=1}^{3} w_i \, \mathbf{d}_i.
\tag{8}
$$
Thus, we obtain the final multi-scale weighted aggregate descriptor $\hat{\mathbf{D}}$, which is a more robust place representation than the primary descriptor $\mathbf{D}_{ms}$.
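The following is a sketch of the re-weighting in Eqs. (6)-(8), under the assumption that each sub-descriptor is kept as a C×L matrix (e.g., a VLAD residual map): channel-wise average and maximum statistics pass through two FC branches, their sum is soft-maxed over the three scales, and the sub-descriptors are combined with the resulting weights. The hidden width and tensor layout are illustrative, not the exact DRW-Net configuration.

```python
import torch
import torch.nn as nn


class DRWNet(nn.Module):
    """Descriptor re-weighting (sketch of Eqs. (6)-(8))."""

    def __init__(self, channels: int = 256, hidden: int = 64):
        super().__init__()
        # Two FC branches: one fed by the average statistic, one by the maximum.
        self.fc_avg = nn.Sequential(nn.Linear(channels, hidden),
                                    nn.ReLU(inplace=True), nn.Linear(hidden, 1))
        self.fc_max = nn.Sequential(nn.Linear(channels, hidden),
                                    nn.ReLU(inplace=True), nn.Linear(hidden, 1))

    def forward(self, subs: torch.Tensor) -> torch.Tensor:
        # subs: (B, 3, C, L) -- three sub-descriptors with C channels each.
        g_avg = subs.mean(dim=3)                           # (B, 3, C), Eq. (6)
        g_max = subs.max(dim=3).values                     # (B, 3, C)
        logits = self.fc_avg(g_avg) + self.fc_max(g_max)   # (B, 3, 1), Eq. (7)
        weights = torch.softmax(logits, dim=1)             # weights over scales sum to 1
        weighted = (weights.unsqueeze(-1) * subs).sum(dim=1)  # (B, C, L), Eq. (8)
        return weighted.flatten(1)                         # final descriptor, (B, C*L)


subs = torch.randn(2, 3, 256, 64)   # e.g. three VLAD residual maps (C=256, K=64)
final_descriptor = DRWNet()(subs)   # (2, 256 * 64)
```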

TABLE I
| Datasets | Scenarios & Sequences | Training Sets (Database / Query) | Testing Sets (Database / Query) |
|---|---|---|---|
| Brisbane-Event-VPR [9] | sunrise (sr) [2020-04-29-06-20-23], morning (mn) [2020-04-28-09-14-11], daytime (dt) [2020-04-24-15-12-03], sunset (ss) [2020-04-21-17-03-03], [2020-04-22-17-24-21] | (dt & mn) [4712] / sr [2620] | ss1 [1768] / ss2 [1768] |
| | | (ss2 & mn) [4246] / sr [2620] | ss1 [2492] / dt [2492] |
| | | (ss2 & dt) [4002] / sr [2620] | ss1 [2478] / mn [2478] |
| | | (ss2 & dt) [4002] / mn [2478] | ss1 [2181] / sr [2181] |
| DDD20 [17] | street [rec1501983083], [rec1502648048], [rec1502325857]; freeway [rec1500924281], [rec1501191354], [rec1501268968] | **83 [4246] / **57 [2620] | **83 [2492] / **48 [2492] |
| | | **81 [4712] / **54 [2620] | **81 [1768] / **68 [1768] |
IV Experiments
IV-A Experimental Setup
IV-A1 Dataset Selection
To evaluate the performance of the proposed method, we conduct experiments on the Brisbane-Event-VPR [9] and DDD20 [17] datasets. Brisbane-Event-VPR [9] consists of data recorded with a DAVIS camera together with GPS. It includes six traverses of the same path at different times of day, including sunrise, morning, daytime, sunset, and night. We discard the night sequence because its intensity frames are too sparse. DDD20 [17] is an end-to-end event camera driving dataset recorded under various lighting conditions. We select six sequences from two scene types (street and freeway), of which two sequences contain glare illumination and three are highway sequences. We use the timestamps of the intensity frames to build the event volumes. The intervals we select for the Brisbane-Event-VPR and DDD20 datasets are approximately 25 ms and 20 ms respectively.
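A simple way to build such event inputs, sketched below, is to accumulate the events falling between consecutive intensity-frame timestamps into a two-channel (positive/negative polarity) event frame. The (t, x, y, p) array layout is an assumption for illustration, not the datasets' native format.

```python
import numpy as np


def events_to_frame(events: np.ndarray, t_start: float, t_end: float,
                    height: int, width: int) -> np.ndarray:
    """Accumulate events in [t_start, t_end) into a 2-channel count image.

    `events` is assumed to be an (N, 4) array of (t, x, y, p) rows, p in {+1, -1}.
    """
    t, x, y, p = (events[:, 0], events[:, 1].astype(int),
                  events[:, 2].astype(int), events[:, 3])
    mask = (t >= t_start) & (t < t_end)
    frame = np.zeros((2, height, width), dtype=np.float32)
    np.add.at(frame[0], (y[mask & (p > 0)], x[mask & (p > 0)]), 1.0)
    np.add.at(frame[1], (y[mask & (p < 0)], x[mask & (p < 0)]), 1.0)
    return frame


def slice_by_frame_timestamps(events, frame_timestamps, height, width):
    """One event frame per intensity-frame interval (~25 ms on Brisbane-Event-VPR,
    ~20 ms on DDD20), aligned with the APS frame timestamps."""
    return [events_to_frame(events, t0, t1, height, width)
            for t0, t1 in zip(frame_timestamps[:-1], frame_timestamps[1:])]
```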
IV-A2 Parameters in Training
We train our FE-Fusion-VPR network with weak supervision using a triplet ranking loss. Except for the optimizer and learning rate, we use the same parameters in all experiments for a fair comparison, including the number of cluster centers (vocabulary size), the triplet margin, the potential-positive distance threshold, the random-negative distance threshold, and the true-positive geographical distance threshold. For more details of the training process, please refer to Event-VPR [26].
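For reference, a NetVLAD-style weakly supervised triplet ranking loss [2, 26] over one query descriptor, its best potential positive, and several random negatives can be sketched as follows; the margin value and the use of plain Euclidean distance are placeholders rather than the exact training configuration.

```python
import torch
import torch.nn.functional as F


def weakly_supervised_triplet_loss(q: torch.Tensor, pos: torch.Tensor,
                                   negs: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Hinge triplet ranking loss.

    q: (D,) query descriptor; pos: (D,) best potential positive;
    negs: (N, D) random negatives. Margin value is a placeholder.
    """
    d_pos = F.pairwise_distance(q.unsqueeze(0), pos.unsqueeze(0)).squeeze(0)  # scalar
    d_neg = torch.cdist(q.unsqueeze(0), negs).squeeze(0)                      # (N,)
    return F.relu(d_pos + margin - d_neg).sum()   # hinge over all negatives


q, pos = torch.randn(4096), torch.randn(4096)
negs = torch.randn(10, 4096)
loss = weakly_supervised_triplet_loss(q, pos, negs)
```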
IV-A3 Evaluation Metrics
We use precision-recall (PR) curves and Recall@N to evaluate the experimental results; please refer to Event-VPR [26] for detailed descriptions of these metrics. For a more comprehensive comparison, we also calculate F1-max for all VPR methods:
$$
F_{1\text{-}max} = \max_i \frac{2 \, P_i R_i}{P_i + R_i},
\tag{9}
$$
where $P_i$ and $R_i$ are the $i$-th precision and recall values on the PR curve respectively. In addition, we present retrieval success-rate maps to show the performance of our algorithm more intuitively.
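For clarity, Recall@N and F1-max can be computed from a query-to-reference distance matrix and a PR curve as in the sketch below; function and variable names are illustrative.

```python
import numpy as np


def recall_at_n(dist_matrix: np.ndarray, gt: list, n: int = 1) -> float:
    """Fraction of queries whose top-n retrieved references contain a true match.

    dist_matrix: (num_queries, num_refs); gt[i]: set of reference indices within
    the true-positive geographic distance of query i.
    """
    hits = 0
    for i, row in enumerate(dist_matrix):
        top_n = np.argsort(row)[:n]
        hits += any(j in gt[i] for j in top_n)
    return hits / len(dist_matrix)


def f1_max(precision, recall) -> float:
    """Maximum F1 score over the points of a precision-recall curve (Eq. (9))."""
    p, r = np.asarray(precision), np.asarray(recall)
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)
    return float(f1.max())
```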
TABLE II
| VPR Methods | Evaluation Metrics | (dt & mr) / sr, ss1 / ss2 | (ss2 & dt) / mr, ss1 / sr | (ss2 & dt) / sr, ss1 / mr | (ss2 & mr) / sr, ss1 / dt | **83 / **57, **83 / **48 | **81 / **54, **81 / **68 |
|---|---|---|---|---|---|---|---|
| NetVLAD [2] | Recall@1/5 (%), F1-max | 94.34, 97.29, 0.9709 | 90.61, 96.35, 0.9546 | 86.97, 94.11, 0.9330 | 77.40, 88.49, 0.8762 | 46.08, 62.91, 0.6383 | 9.30, 17.89, 0.2411 |
| MR-NetVLAD [25] | Recall@1/5 (%), F1-max | 94.23, 97.06, 0.9733 | 91.46, 95.43, 0.9582 | 87.20, 93.06, 0.9343 | 79.68, 88.72, 0.8929 | 64.66, 82.55, 0.7860 | 30.38, 43.56, 0.5244 |
| Ensemble-Event-VPR [9] | Recall@1/5 (%), F1-max | 87.33, 95.70, 0.9345 | 58.82, 82.42, 0.7246 | 58.42, 78.33, 0.7550 | 43.62, 62.71, 0.6319 | 32.02, 50.17, 0.4967 | 7.84, 14.99, 0.1465 |
| Event-VPR (Ours) [26] | Recall@1/5 (%), F1-max | 84.79, 93.83, 0.9236 | 65.65, 86.52, 0.7984 | 66.67, 84.26, 0.7946 | 44.54, 66.10, 0.6705 | 43.89, 66.67, 0.6057 | 8.52, 21.33, 0.1578 |
| FE-Fusion-VPR (Ours) | Recall@1/5 (%), F1-max | 95.64, 98.36, 0.9780 | 93.58, 97.19, 0.9671 | 87.41, 93.87, 0.9310 | 86.15, 93.03, 0.9247 | 72.77, 86.04, 0.8407 | 54.05, 64.34, 0.7465 |
IV-B Comparison against SOTA Methods
In this section, we present the evaluation results against frame-based and event-based SOTA VPR methods and analyze why our FE-Fusion-VPR performs best. Since VEFNet [18] does not outperform the frame-based SOTA method on either dataset (neither in its reported results nor in our own verification), we do not include it in the comparison.
IV-B1 Comparison against Frame-based VPR Methods
We compare our FE-Fusion-VPR against frame-based SOTA algorithms (NetVLAD [2] and MR-NetVLAD [25]); the experimental results are shown in Tab. II, Fig. 6, Fig. 7, and Fig. 8. The results show that our method outperforms both algorithms in most cases. Tab. II shows that the Recall@1 of FE-Fusion-VPR is on average 3.37% and 2.55% higher than NetVLAD and MR-NetVLAD on the Brisbane-Event-VPR dataset. On the DDD20 dataset, the advantage of FE-Fusion-VPR is more obvious: its Recall@1 is on average 35.72% and 15.89% higher than the two algorithms. The reason is that the Brisbane-Event-VPR dataset contains sequences with obvious differences in intensity appearance and the DDD20 dataset contains glare scenarios, both of which limit the performance of frame-based VPR methods. Event cameras, which are hardly affected by illumination changes, can significantly improve VPR performance in these cases. Besides, the DDD20 dataset contains highway scenes with many small objects on both sides of the road, where our MSF-Net combined with DRW-Net improves VPR performance thanks to multi-scale feature fusion.
IV-B2 Comparison against Event-based VPR Methods
As shown in Tab. II, Fig. 6, and Fig. 7, our FE-Fusion-VPR is much more robust than the pure event-based SOTA methods (Ensemble-Event-VPR [9] and Event-VPR [26]). On the Brisbane-Event-VPR dataset, the Recall@1 of FE-Fusion-VPR is on average 25.28% and 28.65% higher than Event-VPR and Ensemble-Event-VPR. On the DDD20 dataset, the Recall@1 of FE-Fusion-VPR increases by an average of 37.21% and 43.48% over Event-VPR and Ensemble-Event-VPR. This large margin illustrates the importance of intensity-frame information for improving VPR performance. Event cameras can hardly capture information at low-speed intersections and on highways with sparse texture, whereas standard cameras capture more background information when the illumination is suitable. Therefore, FE-Fusion-VPR achieves SOTA VPR performance by combining the advantages of both sensors, and our descriptors retain rich and essential information through MSF-Net and DRW-Net.
TABLE III (values are Recall@1, %)
| Ablation Studies | Settings | (dt & mr) / sr, ss1 / ss2 | (ss2 & mr) / sr, ss1 / dt | (ss2 & dt) / sr, ss1 / mr | (ss2 & dt) / mr, ss1 / sr |
|---|---|---|---|---|---|
| TSFE-Net | Only Frame Encoder | 85.63 | 78.38 | 76.23 | 67.98 |
| | Only Event Encoder | 93.16 | 63.74 | 82.16 | 53.90 |
| | Frame Encoder + Event Encoder | 95.64 | 93.58 | 87.41 | 86.15 |
| MSF-Net | Only $\tilde{\mathbf{F}}_1$ (without / with Attention Layer) | 86.82 / 91.35 | 76.33 / 83.59 | 85.31 / 80.06 | 72.84 / 68.03 |
| | Only $\tilde{\mathbf{F}}_2$ (without / with Attention Layer) | 86.03 / 95.53 | 73.25 / 88.13 | 82.81 / 72.72 | 81.33 / 78.67 |
| | Only $\tilde{\mathbf{F}}_3$ (without / with Attention Layer) | 90.05 / 94.85 | 84.16 / 86.08 | 84.95 / 83.82 | 82.02 / 76.01 |
| | Multi-Scale (without / with Attention Layer) | 94.06 / 95.64 | 90.89 / 93.58 | 85.85 / 87.41 | 80.09 / 86.15 |
| DRW-Net | Concatenation in Length | 91.57 | 68.51 | 75.30 | 73.21 |
| | With Re-Weighting Layer | 95.64 | 93.58 | 87.41 | 86.15 |
IV-C Ablation Studies
In this section, we explore the impact of TSFE-Net, MSF-Net, and DRW-Net on the performance of FE-Fusion-VPR.
IV-C1 Impact of TSFE-Net
The experimental results in Tab. III demonstrate that using a single type of sensor data leads to severe performance degradation on the Brisbane-Event-VPR dataset. Moreover, the SOTA performance of our FE-Fusion-VPR is attributed to using the two vision sensors simultaneously rather than a single type of visual sensor.
IV-C2 Impact of MSF-Net
The experimental results in Tab. III show that, in most cases, our multi-scale FE-Fusion-VPR achieves better performance than networks using single-scale features (with or without attention layers). Since features at different scales focus on information in different regions, using mid-level or high-level features alone is unreliable, and multi-scale fusion therefore achieves higher VPR performance. In addition, adding attention layers at appropriate locations throughout the network further improves the performance of FE-Fusion-VPR.
IV-C3 Impact of DRW-Net
In this experiment, we remove DRW-Net and use the original VLAD layer. Before the features are input into NetVLAD, we flatten the three different-scale feature maps $\tilde{\mathbf{F}}_1$, $\tilde{\mathbf{F}}_2$, and $\tilde{\mathbf{F}}_3$ output by MSF-Net (each with 256 channels and spatial resolution $H_i \times W_i$) into three descriptors, and then concatenate them along the length dimension:
$$
\mathbf{F}_{cat} = \mathrm{Flatten}(\tilde{\mathbf{F}}_1) \oplus \mathrm{Flatten}(\tilde{\mathbf{F}}_2) \oplus \mathrm{Flatten}(\tilde{\mathbf{F}}_3),
\tag{10}
$$
where $\mathbf{F}_{cat}$ has 256 channels and length $H_1 W_1 + H_2 W_2 + H_3 W_3$. The results in Tab. III show that our DRW-Net outperforms this direct use of the original VLAD layer for multi-feature fusion in all cases. DRW-Net learns the weight of each sub-descriptor automatically, which makes full use of each descriptor's salient information, so that the final multi-scale descriptor has a robust representation ability.
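For illustration, the flatten-and-concatenate baseline of Eq. (10) can be sketched as follows; the shapes are assumed for the example.

```python
import torch

# Baseline used in the DRW-Net ablation: flatten the three MSF-Net feature maps
# and concatenate them along the length (spatial) dimension before a single
# VLAD layer. Shapes are illustrative.
t1, t2, t3 = (torch.randn(2, 256, 32, 32),
              torch.randn(2, 256, 16, 16),
              torch.randn(2, 256, 8, 8))
flat = torch.cat([t.flatten(2) for t in (t1, t2, t3)], dim=2)  # (B, 256, H1W1+H2W2+H3W3)
# `flat` is then aggregated by the original VLAD layer instead of DRW-Net.
```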
V Conclusions
In this paper, we analyzed the limitations of VPR methods that use a frame camera or an event camera alone. On that basis, we proposed an attention-based multi-scale network architecture combining frames and events for VPR (FE-Fusion-VPR) to achieve robust performance in challenging environments. The two key ideas of FE-Fusion-VPR are as follows: first, we fuse visual data from intensity frames and event volumes; second, we perform feature fusion with a multi-scale fusion network and a descriptor re-weighting network, which our ablation studies validate to be effective. Compared with existing frame-based and event-based SOTA methods, FE-Fusion-VPR achieves higher performance, especially in weakly textured scenes and under difficult sunlight glare. In the future, we will make our algorithm more lightweight and accelerate it for deployment on autonomous vehicles or mini-UAVs. Furthermore, we will also explore a deep spiking VPR network architecture [20] for energy-efficient inference.
References
- [1] (2008) Fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics 24 (5), pp. 1027–1037.
- [2] (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307.
- [3] (2013) All about VLAD. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1578–1585.
- [4] (2006) SURF: speeded up robust features. In European Conference on Computer Vision, pp. 404–417.
- [5] (2020) Event-based neuromorphic vision for autonomous driving: a paradigm shift for bio-inspired visual sensing and perception. IEEE Signal Processing Magazine 37 (4), pp. 34–49.
- [6] (2014) Convolutional neural network-based place recognition. arXiv preprint arXiv:1411.1509.
- [7] (2005) Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 886–893.
- [8] (2010) Activity-driven, event-based vision sensors. In IEEE International Symposium on Circuits and Systems, pp. 2426–2429.
- [9] (2020) Event-based visual place recognition with ensembles of temporal windows. IEEE Robotics and Automation Letters 5 (4), pp. 6924–6931.
- [10] (2020) Event-based vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (1), pp. 154–180.
- [11] (2012) Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics 28 (5), pp. 1188–1197.
- [12] (2021) Where is your place, visual place recognition?. arXiv preprint arXiv:2103.06443.
- [13] (2018) Asynchronous, photometric feature tracking using events and frames. In European Conference on Computer Vision, pp. 750–765.
- [14] (2021) Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction. IEEE Robotics and Automation Letters 6 (2), pp. 2822–2829.
- [15] (2021) Patch-NetVLAD: multi-scale fusion of locally-global descriptors for place recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14141–14152.
- [16] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [17] (2020) DDD20 end-to-end event camera driving dataset: fusing frames and events with deep learning for improved steering prediction. In International Conference on Intelligent Transportation Systems, pp. 1–6.
- [18] (2022) VEFNet: an event-RGB cross modality fusion network for visual place recognition. In IEEE International Conference on Image Processing, pp. 2671–2675.
- [19] (2010) Aggregating local descriptors into a compact image representation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3304–3311.
- [20] (2022) Neuro-Planner: a 3D visual navigation method for MAV with depth camera based on neuromorphic reinforcement learning. arXiv preprint arXiv:2210.02305.
- [21] (2019) Mixed frame-/event-driven fast pedestrian detection. In International Conference on Robotics and Automation, pp. 8332–8338.
- [22] (2022) You should look at all objects. In European Conference on Computer Vision, pp. 332–349.
- [23] (2020) Constrained filtering-based fusion of images, events, and inertial measurements for pose estimation. In IEEE International Conference on Robotics and Automation, pp. 644–650.
- [24] (2019) A holistic visual place recognition approach using lightweight CNNs for significant viewpoint and appearance changes. IEEE Transactions on Robotics 36 (2), pp. 561–569.
- [25] (2022) MultiRes-NetVLAD: augmenting place recognition training with low-resolution imagery. IEEE Robotics and Automation Letters 7 (2), pp. 3882–3889.
- [26] (2022) Event-VPR: end-to-end weakly supervised deep network architecture for visual place recognition using event-based vision sensor. IEEE Transactions on Instrumentation and Measurement 71, pp. 1–18.
- [27] (2021) EventVLAD: visual place recognition with reconstructed edges from event cameras. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2247–2252.
- [28] (2022) Fusion-FlowNet: energy-efficient optical flow estimation using sensor fusion and deep fused spiking-analog network architectures. In International Conference on Robotics and Automation, pp. 6504–6510.
- [29] (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
- [30] (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
- [31] (2015) Visual place recognition: a survey. IEEE Transactions on Robotics 32 (1), pp. 1–19.
- [32] (2018) Event-based vision meets deep learning on steering prediction for self-driving cars. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5419–5427.
- [33] (2021) A survey on deep visual place recognition. IEEE Access 9, pp. 19516–19547.
- [34] (2006) Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research 155, pp. 23–36.
- [35] (2020) Single image optical flow estimation with an event camera. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1669–1678.
- [36] (2010) Large-scale image retrieval with compressed Fisher vectors. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3384–3391.
- [37] (2011) ORB: an efficient alternative to SIFT or SURF. In International Conference on Computer Vision, pp. 2564–2571.
- [38] (2013) Image classification with the Fisher vector: theory and practice. International Journal of Computer Vision 105 (3), pp. 222–245.
- [39] (2015) On the performance of ConvNet features for place recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4297–4304.
- [40] (2015) 24/7 place recognition by view synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1808–1817.
- [41] (2018) Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robotics and Automation Letters 3 (2), pp. 994–1001.
- [42] (2018) CBAM: convolutional block attention module. In European Conference on Computer Vision, pp. 3–19.
- [43] (2021) A novel visual object detection and distance estimation method for HDR scenes based on event camera. In International Conference on Computer and Communications, pp. 636–640.
- [44] (2018) Hierarchical discriminative learning for visible thermal person re-identification. In AAAI Conference on Artificial Intelligence, Vol. 32.
- [45] (2021) Visual place recognition: a survey from deep learning perspective. Pattern Recognition 113, pp. 107760.