
FE-Fusion-VPR: Attention-based Multi-Scale Network Architecture for Visual Place Recognition by Fusing Frames and Events

Traditional visual place recognition (VPR), which usually relies on standard cameras, tends to fail under glare or high-speed motion. By contrast, event cameras offer low latency, high temporal resolution, and high dynamic range, which can deal with the above issues. Nevertheless, event cameras are prone to failure in weakly textured or motionless scenes, while standard cameras can still provide appearance information in this case. Thus, exploiting the complementarity of standard cameras and event cameras can effectively improve the performance of VPR algorithms. In this paper, we propose FE-Fusion-VPR, an attention-based multi-scale network architecture for VPR by fusing frames and events. First, the intensity frame and event volume are fed into the two-stream feature extraction network for shallow feature fusion. Next, the three-scale features are obtained through the multi-scale fusion network and aggregated into three sub-descriptors using the VLAD layer. Finally, the weight of each sub-descriptor is learned through the descriptor re-weighting network to obtain the final refined descriptor. Experimental results show that on the Brisbane-Event-VPR and DDD20 datasets, the Recall@1 of our FE-Fusion-VPR is 29.26% and 33.59% higher than Event-VPR and Ensemble-EventVPR, and is 7.00% and 14.15% higher than MR-NetVLAD and NetVLAD, respectively. To our knowledge, this is the first end-to-end network that goes beyond the existing event-based and frame-based SOTA methods to fuse frames and events directly for VPR.





I Introduction

Fig. 1: Illustration of advantages of the proposed FE-Fusion-VPR. As can be seen, both frames and events have challenging scenarios that are difficult to deal with alone, so combining frames and events can effectively deal with complex scenarios and improve the performance of VPR pipeline.

Visual place recognition (VPR) [31, 45, 33, 12] is a vital sub-problem in the autonomous navigation of mobile robots, which has attracted the attention of many researchers in recent years. VPR aims to help a robot determine whether it is located in a previously visited place. Specifically, there is an existing database of the environment, which stores visual data (such as frames) for various places. Given a query, we expect to obtain its location by finding the database entries captured at the same (or a nearby) location. In short, VPR can assist mobile robots or autonomous unmanned systems in localization and loop closure detection (LCD) in GPS-denied environments.
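The retrieval formulation above can be sketched as a nearest-neighbour search over global descriptors. This is only a minimal illustration, assuming every place has already been encoded as a fixed-length vector; the function names `l2_distance` and `retrieve_top_n` and the toy descriptors are hypothetical and not part of the original work.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_top_n(query, database, n=1):
    """Return the indices of the n database descriptors closest to the query."""
    ranked = sorted(range(len(database)),
                    key=lambda i: l2_distance(query, database[i]))
    return ranked[:n]

# Toy database of three places (hypothetical 3-D descriptors).
database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.1, 0.9, 0.0]
best = retrieve_top_n(query, database, n=2)
print(best)
```

A query is counted as correctly localized when one of the returned database entries was captured close enough to the query's true position.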

Since standard frame cameras provide rich appearance information of everyday scenes, existing frame-based VPR methods achieve good performance in those scenarios. However, standard cameras suffer from low frame rate, motion blur, and sensitivity to illumination changes, which makes it difficult for traditional VPR methods to handle challenging scenes (e.g., high speed and high dynamic range). Event cameras [8, 10, 5, 43], which are neuromorphic vision sensors, record microsecond-level pixel-wise brightness changes and offer significant advantages (e.g., low latency, rich motion information, and high dynamic range). Nevertheless, event cameras lack appearance (texture) information in some cases (e.g., still or low-speed scenes). Therefore, the two kinds of vision sensors are complementary. Fig. 1 presents two representative examples (glare and a motionless scene); these challenging situations can be handled by visual sensor fusion.

Inspired by their complementarity, we consider combining standard frame cameras and event cameras to lift the limitations faced by single vision sensors in large-scale place recognition problems, thereby improving the performance of the VPR pipeline in challenging scenarios. Recently, several works have combined frames and events for machine vision tasks and achieved excellent results [13, 21, 17, 14, 35, 28, 41, 23]. However, combining frames and events for VPR is not straightforward. We need to deal with several challenges: (1) How to fuse raw frames with event data? (2) How to extract multi-scale features? (3) How to fuse multiple features into a unified descriptor? To address these challenges, in this paper we propose a robust VPR method that fuses frames and events (FE-Fusion-VPR) for large-scale place recognition problems (supplementary material: an accompanying video for this work is available). FE-Fusion-VPR is an attention-based multi-scale deep network architecture for mixed frame-/event-based VPR. It can effectively combine the advantages of standard frame cameras and event cameras, and suppress their disadvantages, thus achieving better VPR performance than using frames or events alone. To our knowledge, it is the first end-to-end deep network architecture combining standard cameras and event cameras for VPR problems, and it goes beyond the existing SOTA methods. In summary, the main contributions of this paper are as follows:

  • We propose a novel two-stream network (TSFE-Net) that can fuse intensity frame and event volume for hybrid feature extraction, which is compatible with asynchronous and irregular data from the event camera.

  • We propose an attention-based multi-scale network (MSF-Net) and design a re-weighting network (DRW-Net) that can assign weights to different sub-descriptors to obtain the best descriptor representation, achieving the SOTA VPR performance.

  • We comprehensively compare our FE-Fusion-VPR with other SOTA frame-/event-based VPR methods on the Brisbane-Event-VPR and DDD20 datasets to evaluate the advanced performance of our method.

  • We conduct ablation studies on each network component to comprehensively demonstrate the compactness of our FE-Fusion-VPR pipeline.

II Related Work

II-A Frame-based VPR Methods

Conventional VPR algorithms mainly consist of two steps: feature extraction and feature matching. In feature extraction, key low-level features need to be extracted from high-dimensional visual data for storage. These key features are generally called descriptors or representations and can be local / global or sparse / dense. Earlier frame-based VPR works focused on hand-crafted algorithms, including local feature extractors (SIFT [30], SURF [4], ORB [37]), global feature extractors (HoG [7], GiST [34]), and feature clusterers (BoW [1, 11], FV [36, 38], VLAD [19, 3, 40, 24]). However, hand-crafted methods usually need to be elaborately designed and are not robust to changes in illumination, season, viewpoint, etc. In contrast, learning-based (especially deep learning) feature extraction algorithms can learn general features and achieve better performance than hand-crafted algorithms in large-scale image retrieval tasks [33]. For example, Arandjelovic et al. turned VLAD into a trainable pooling layer (called NetVLAD [2]) for direct integration into the CNN-based VPR framework. On this basis, Patch-NetVLAD [15] first computes global NetVLAD descriptors to filter out some reference candidates, then uses local patch-level NetVLAD descriptors to refine the matching, which yields higher performance than NetVLAD. MR-NetVLAD [25] augments NetVLAD with multi-resolution image pyramid encoding, resulting in rich place representations that better overcome challenging scenarios (such as illumination and viewpoint changes). Although learning-based VPR methods have achieved good performance, they still have difficulties in glare and high-speed scenes due to the inherent shortcomings of standard cameras.

II-B Event-based VPR Methods

Recently, using event cameras to solve VPR problems has drawn increasing attention from researchers. First, Fischer and Milford proposed an event-based VPR method (Ensemble-Event-VPR) with ensembles of temporal windows [9], which reconstructs intensity frames from event streams of different temporal windows, extracts visual descriptors using NetVLAD [2], and then integrates the distance matrices of multiple descriptors for VPR. Lee et al. proposed a VPR method (EventVLAD) that recovers edge images from event streams and then uses NetVLAD for descriptor generation [27]. Whether they use intensity frames or edge images, both methods need to transform event streams into frames. Therefore, they are still frame-based VPR methods in essence. To utilize event streams directly, we previously proposed an event-based end-to-end deep network architecture for VPR (Event-VPR) [26]. Event-VPR uses the EST voxel grid representation, combines a deep residual network with a VLAD layer to extract visual descriptors, and adopts a weakly supervised loss for training, which achieves excellent performance on multiple challenging driving datasets using events directly. However, since event cameras cannot output the appearance information of the scene directly, VPR remains difficult in weakly textured and motionless scenes. The latest work, VEFNet [18], uses a cross-modality attention module and a self-attention module for frame and event fusion on top of a VGG feature extractor. However, it does not perform better than NetVLAD-based VPR methods in most challenging cases. By comparison, our proposed FE-Fusion-VPR outperforms both frame-based and event-based SOTA methods.

III Methodology

Fig. 2: Overview of the proposed FE-Fusion-VPR pipeline. First, the intensity frame and events are input into the two-stream feature extraction network to obtain the fusion feature. Then, we use it as the input of the feature pyramid, with lateral connections, to obtain three-scale features, and use the VLAD layer to get three corresponding sub-descriptors for concatenation. Finally, we use the descriptor re-weighting network to obtain the final refined descriptor for matching. The conv1, conv2_x, conv4_x and conv5_x are the convolutional layer and residual blocks cut from ResNet34.

III-A The Overall Architecture

An overview of the proposed FE-Fusion-VPR pipeline is shown in Fig. 2. Our FE-Fusion-VPR comprises a two-stream feature extraction network (TSFE-Net), a multi-scale fusion network (MSF-Net), and a descriptor re-weighting network (DRW-Net). Its backbone consists of residual blocks of ResNet34 [16]. The intensity frame and events are first fed into TSFE-Net to obtain their shallow fusion feature. Then, MSF-Net extracts the multi-scale features, which are aggregated into three sub-descriptors. Next, DRW-Net learns the weights of the sub-descriptors to produce the final refined descriptor. Finally, the final descriptor is used to match the query data with reference data. The whole network is trained end-to-end with weak supervision. In addition, an attention layer, which consists of a channel attention mechanism and a spatial attention mechanism [42], is used throughout our FE-Fusion-VPR network architecture. We pass the feature maps of different scales through the attention layer to obtain more refined feature maps, which can effectively improve VPR performance.

III-B Two-Stream Feature Extraction Network

Due to the asynchronous characteristics of events, combining them with intensity frames remains challenging, especially for learning-based methods. In Event-VPR [26], our experiments showed that the choice of event representation has little effect on VPR tasks. Therefore, in this paper, event volumes are converted directly into event frames [32], which are fed into our TSFE-Net together with intensity frames. As shown in Fig. 2, in order to learn a multi-modal shared feature, our two-stream feature extraction network (TSFE-Net) extracts two kinds of shallow features and fuses them. Inspired by Event-VPR [26], we use the convolutional layer conv1 and the residual blocks conv2_x cut from ResNet34, and attach an attention layer to each to improve the quality of the shallow features. For better performance, we concatenate the shallow features along the channel dimension [44]. Finally, we use a max-pooling operation and an attention layer to obtain rich and effective scene information. Our two-stream feature extractor can be summarized as follows:

$$F_f = E_f(I), \qquad F_e = E_e(V), \qquad F_h = F_f \oplus F_e$$

where $E_f$ and $E_e$ are the encoders processing frames and events respectively. The encoder structure is Conv(,64,/2)-Attn-MaxPool2d(/2)-ResBlock0(,64)-ResBlock1(,64)-ResBlock2(…). $I$ and $V$ are the intensity frame and event volume, $F_f$ and $F_e$ are the primary features of the intensity frame and event volume respectively, $\oplus$ is the concatenation operation, and $F_h$ is the hybrid feature after fusing.
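The fusion step of the two streams (channel-wise concatenation followed by max-pooling) can be illustrated on toy data. This is a sketch under simplifying assumptions: feature maps are stored as plain lists of H x W channels, the encoders and attention layers are omitted, and the helper names `concat_channels` and `max_pool_2x2` are hypothetical.

```python
def concat_channels(frame_feat, event_feat):
    """Concatenate two feature maps stored as lists of H x W channels."""
    return frame_feat + event_feat

def max_pool_2x2(channel):
    """2x2 max-pooling with stride 2 on one H x W channel (H and W even)."""
    h, w = len(channel), len(channel[0])
    return [[max(channel[i][j], channel[i][j + 1],
                 channel[i + 1][j], channel[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

# Toy shallow features: 1 frame channel and 2 event channels, each 2 x 2.
frame_feat = [[[1, 2], [3, 4]]]
event_feat = [[[0, 5], [1, 0]], [[2, 2], [2, 2]]]

hybrid = concat_channels(frame_feat, event_feat)   # 3 channels after fusion
pooled = [max_pool_2x2(c) for c in hybrid]         # spatial size halves
```

Concatenation keeps both modalities intact and leaves it to the following layers to learn how to mix them, at the cost of doubling the channel count.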

III-C Multi-Scale Fusion Network

Many works [6, 39] have demonstrated that mid-level visual features are robust to appearance changes, while high-level visual features are robust to viewpoint changes in VPR tasks. Therefore, in principle, a multi-scale network architecture can improve the accuracy of VPR. Our idea is inspired by the feature pyramid network (FPN) [29], which is a typical multi-scale network architecture. However, the performance of FPN degrades in some cases (such as the detection of large objects [22]). To achieve more efficient communication between different levels, our proposed multi-scale fusion network (MSF-Net) fuses different-scale features in the following way (as shown in Fig. 3). First, our backbone network performs bottom-up feature extraction, which contains three stages of residual structure, each producing one output feature. In particular, the residual structure of the first stage is already contained in the TSFE-Net encoders. Thus, we directly apply the max-pooling layer and the attention layer to the output fusion feature of the TSFE-Net to obtain the first-stage feature. Then, the second- and third-stage features are extracted through the remaining two stages of residual structure (residual blocks conv4_x and conv5_x).

Next, different from FPN, to make more efficient use of the multi-scale information of each stage, we adopt concatenation to perform stage-wise fusion based on the backbone network. Specifically, for each stage of the backbone network, we add a branch network as a lateral connection (passway) to fuse the features of the two adjacent stages. The branch network generally includes a convolutional layer (1×1 convolution kernel), an upsampling layer (×2 Up), a batch-normalization layer (BN), and a ReLU activation layer. By adjusting the channels and spatial resolution of the features, we obtain three fused features, each with 256 channels, one per spatial resolution.

After that, we apply the VLAD layer [2, 26] to each of the three features to get three corresponding sub-descriptors $V_1$, $V_2$, $V_3$, which all have the same dimension. Next, we concatenate them to obtain the primary multi-scale descriptor $V$:

$$V = V_1 \oplus V_2 \oplus V_3$$

where $\oplus$ represents the concatenation operation. Since the multi-scale fusion features contain rich scene details and powerful semantic features, they can provide robust feature information for the DRW-Net to improve our network's overall performance.
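The aggregation of per-scale features into sub-descriptors can be sketched with a soft-assignment VLAD. This is a simplified illustration, not the trained layer: NetVLAD learns its assignment weights with a convolution, whereas the sketch below derives them from squared distances to fixed cluster centers; sharing one set of centers across scales, the value of `alpha`, and all names are assumptions made for brevity.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def soft_vlad(features, centers, alpha=10.0):
    """Aggregate N local D-dim features into one K*D descriptor."""
    k, d = len(centers), len(centers[0])
    vlad = [[0.0] * d for _ in range(k)]
    for x in features:
        # Soft assignment: softmax over negative scaled squared distances.
        logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                  for c in centers]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        for j, c in enumerate(centers):
            a = exps[j] / total
            for t in range(d):
                vlad[j][t] += a * (x[t] - c[t])   # weighted residual
    # Intra-normalize each cluster, flatten, then L2-normalize the result.
    flat = [v for row in vlad for v in l2_normalize(row)]
    return l2_normalize(flat)

# Toy local features at three scales (2-D features, 2 cluster centers).
scales = [
    [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8]],
    [[0.2, 0.1], [1.1, 0.9]],
    [[0.0, 0.3], [0.8, 1.0]],
]
centers = [[0.0, 0.0], [1.0, 1.0]]
sub_descriptors = [soft_vlad(f, centers) for f in scales]
multi_scale = [v for sub in sub_descriptors for v in sub]   # concatenation
```

Each sub-descriptor has length K*D regardless of how many local features the scale contributes, which is what makes the three scales directly comparable before re-weighting.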

Fig. 3: Detailed illustration of our multi-scale fusion network (MSF-Net). In the MSF-Net, we obtain three-scale features and use the VLAD layer to get three corresponding sub-descriptors. Then, we concatenate them to obtain the primary multi-scale descriptor.

III-D Descriptor Re-Weighting Network

In MSF-Net, we obtained the primary multi-scale descriptor aggregated from three different-scale features. To represent scenes better, we need to refine the multi-scale descriptor into a compact global descriptor. Therefore, we propose a descriptor re-weighting network (DRW-Net), as shown in Fig. 4, to obtain a robust global descriptor:

$$V^{*} = \mathrm{DRW}(V)$$

For the multi-scale descriptor $V$, composed of the sub-descriptors $V_1$, $V_2$, $V_3$, we calculate the average and maximum of each sub-descriptor respectively:

$$a_i = \frac{1}{L} \sum_{j=1}^{L} V_i(j), \qquad m_i = \max_{j} V_i(j)$$

where $a_i$ and $m_i$ are the channel-wise global representations of the sub-descriptors, $i \in \{1, 2, 3\}$ denotes the index of the sub-descriptor, $j$ is the spatial coordinate within a sub-descriptor, and $L$ is the sub-descriptor length. Then, we append two fully connected (FC) layers to each representation to learn two kinds of sub-descriptor weights $w^{a}$ and $w^{m}$, and add these two weights to obtain the final weights $w$ of the sub-descriptors through a soft-max layer:

$$w^{a} = \mathrm{FC}_2(\mathrm{FC}_1(a)), \qquad w^{m} = \mathrm{FC}_2(\mathrm{FC}_1(m)), \qquad w = \mathrm{softmax}(w^{a} + w^{m})$$

where $\mathrm{FC}_1$ and $\mathrm{FC}_2$ are the two kinds of fully connected (FC) layers, $a = (a_1, a_2, a_3)$, $m = (m_1, m_2, m_3)$, and the output length of $\mathrm{FC}_1$ is the transformation length of the global representations in the hidden layer. The soft-max operation makes the weights of the three sub-descriptors mutually balanced, and their sum is 1. Finally, we multiply the sub-descriptors by their weights and sum them channel by channel, which is expressed as follows:

$$V^{*} = \sum_{i=1}^{3} w_i V_i$$

Thus, we obtain the final multi-scale weighted aggregate descriptor $V^{*}$. It is a more robust place representation than the primary descriptor $V$.
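The re-weighting procedure described above (average and maximum per sub-descriptor, two FC layers per branch, soft-max, weighted sum) can be sketched as follows. The sketch makes several assumptions that the text does not settle: a ReLU between the two FC layers, a hidden length of 2, hand-picked toy parameters, and the option of passing either shared or separate MLPs for the two branches; all function names are hypothetical.

```python
import math

def fc(x, weights, bias):
    """Fully connected layer; `weights` has shape out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mlp(x, layer1, layer2):
    """Two FC layers with a ReLU in between (the activation is an assumption)."""
    hidden = [max(0.0, v) for v in fc(x, *layer1)]
    return fc(hidden, *layer2)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reweight(sub_descriptors, mlp_avg, mlp_max):
    """Return (weights, fused descriptor) for equally long sub-descriptors."""
    avg = [sum(d) / len(d) for d in sub_descriptors]   # one mean per sub-descriptor
    mx = [max(d) for d in sub_descriptors]             # one maximum per sub-descriptor
    scores = [a + b for a, b in zip(mlp(avg, *mlp_avg), mlp(mx, *mlp_max))]
    weights = softmax(scores)                          # positive, sums to 1
    dim = len(sub_descriptors[0])
    fused = [sum(w * d[t] for w, d in zip(weights, sub_descriptors))
             for t in range(dim)]
    return weights, fused

# Toy example: three 4-D sub-descriptors, hidden length 2.
subs = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
layer1 = ([[0.5, -0.2, 0.1], [0.3, 0.4, -0.1]], [0.0, 0.1])        # 3 -> 2
layer2 = ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])   # 2 -> 3
weights, fused = reweight(subs, (layer1, layer2), (layer1, layer2))
```

Note that the fused descriptor has the length of a single sub-descriptor rather than three, which is why the re-weighted output is more compact than plain concatenation.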

Fig. 4: Detailed illustration of our descriptor re-weighting network (DRW-Net). We use maximum, average operation, MLP network and soft-max layer to obtain the weights of the primary sub-descriptors. And then, we utilize the weights to get the final refined descriptor.
Dataset | Scenarios & Sequences | Training Sets (Database / Query) | Testing Sets (Database / Query)
Brisbane-Event-VPR [9] | sunrise (sr) [2020-04-29-06-20-23]; morning (mn) [2020-04-28-09-14-11]; daytime (dt) [2020-04-24-15-12-03]; sunset (ss) [2020-04-21-17-03-03] [2020-04-22-17-24-21] | (dt & mn) [4712] / sr [2620] | ss1 [1768] / ss2 [1768]
 | | (ss2 & mn) [4246] / sr [2620] | ss1 [2492] / dt [2492]
 | | (ss2 & dt) [4002] / sr [2620] | ss1 [2478] / mn [2478]
 | | (ss2 & dt) [4002] / mn [2478] | ss1 [2181] / sr [2181]
DDD20 [17] | street [rec1501983083] [rec1502648048] [rec1502325857]; freeway [rec1500924281] [rec1501191354] [rec1501268968] | **83 [4246] / **57 [2620] | **83 [2492] / **48 [2492]
 | | **81 [4712] / **54 [2620] | **81 [1768] / **68 [1768]
TABLE I: Scenarios, sequences, training and testing sets of the Brisbane-Event-VPR [9] and DDD20 [17] datasets used in our experiments.

IV Experiments

IV-A Experimental Setup

IV-A1 Dataset Selection

To evaluate the performance of the proposed method, we conduct experiments on the Brisbane-Event-VPR [9] and DDD20 [17] datasets. Brisbane-Event-VPR [9] consists of data recorded using a DAVIS camera together with GPS. It includes six traverses of the same path at different times of day, covering sunrise, morning, daytime, sunset, and night. We discard the night sequence because its intensity frames are too sparse. DDD20 [17] is an end-to-end event camera driving dataset recorded under various lighting conditions. We selected six sequences of two urban scenes, of which two sequences have glare illumination and three sequences consist of highways. We use the timestamps of the intensity frames to get the event volumes. The intervals we select for the Brisbane-Event-VPR and DDD20 datasets are approximately 25ms and 20ms respectively.
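Cutting the event stream at the intensity-frame timestamps can be sketched as below. This is an illustration only: the event layout `(timestamp, x, y, polarity)`, the two-channel count image used as the event frame, and the function names are assumptions rather than the authors' exact preprocessing.

```python
def slice_events(events, t_start, t_end):
    """Keep the events whose timestamp lies in [t_start, t_end)."""
    return [e for e in events if t_start <= e[0] < t_end]

def events_to_frame(events, height, width):
    """Accumulate events into a 2-channel count image (positive, negative)."""
    frame = [[[0 for _ in range(width)] for _ in range(height)]
             for _ in range(2)]
    for _, x, y, polarity in events:
        channel = 0 if polarity > 0 else 1
        frame[channel][y][x] += 1
    return frame

# Events as (timestamp in seconds, x, y, polarity); frames are 25 ms apart.
events = [(0.001, 0, 0, 1), (0.010, 1, 0, -1), (0.020, 0, 0, 1), (0.030, 1, 1, 1)]
frame_times = [0.000, 0.025, 0.050]

volumes = [slice_events(events, t0, t1)
           for t0, t1 in zip(frame_times, frame_times[1:])]
event_frames = [events_to_frame(v, height=2, width=2) for v in volumes]
```

Each interval between consecutive frame timestamps yields one event volume, so every intensity frame is paired with the events recorded over the same window.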

Fig. 5: Scenarios of Brisbane-Event-VPR [9] and DDD20 [17] datasets in our experiments. (a) Brisbane-Event-VPR [9]. (b) DDD20 [17].

IV-A2 Parameters in Training

We train our FE-Fusion-VPR network with weak supervision using a triplet ranking loss. Except for the optimizer and learning rate, we use the same parameters in all experiments for a fair comparison. The fixed parameters are the number of cluster centers (vocabulary size), the margin, the potential-positive distance threshold, the random-negative distance threshold, and the true-positive geographical distance threshold. For more details of the training process, please refer to Event-VPR [26].
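A weakly supervised triplet ranking loss of the kind used with NetVLAD-style training can be sketched as follows. This is a hedged illustration: the exact loss and hyperparameter values are given in Event-VPR [26], not here, so the formulation (best-matching potential positive against every negative), the margin of 0.1, and the function names are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weak_triplet_loss(query, potential_positives, negatives, margin):
    """Hinge loss that pushes every negative at least `margin` farther
    from the query than the closest potential positive."""
    best_positive = min(squared_distance(query, p) for p in potential_positives)
    return sum(max(0.0, best_positive + margin - squared_distance(query, n))
               for n in negatives)

# Toy 2-D descriptors; the margin value is illustrative only.
query = [0.0, 0.0]
potential_positives = [[1.0, 0.0], [0.1, 0.0]]   # geographically close samples
negatives = [[0.2, 0.0], [2.0, 0.0]]             # geographically far samples
loss = weak_triplet_loss(query, potential_positives, negatives, margin=0.1)
```

Taking the minimum over potential positives is what makes the supervision weak: GPS proximity only says that some nearby sample should match, not which one.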

IV-A3 Evaluation Metrics

We use the PR curve and Recall@N to evaluate the experimental results; see Event-VPR [26] for a detailed description of these metrics. For a more comprehensive comparison, we calculate F1-max for all the VPR methods, which is defined as follows:

$$F1\text{-}max = \max_{i} \frac{2 P_i R_i}{P_i + R_i}$$

where $P_i$ and $R_i$ are the $i$-th precision and recall in the PR curves respectively. In addition, we also present the retrieval success-rate maps to show the performance of our algorithm more intuitively.
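Both metrics are simple to compute once the ranked retrieval results and the PR curve are available. The sketch below assumes that each query comes with a ranked list of database indices and a set of true-match indices; the function names and toy inputs are hypothetical.

```python
def recall_at_n(ranked_lists, positives, n):
    """Fraction of queries with at least one true match in the top-n results."""
    hits = sum(1 for ranked, pos in zip(ranked_lists, positives)
               if any(r in pos for r in ranked[:n]))
    return hits / len(ranked_lists)

def f1_max(precisions, recalls):
    """Maximum F1 score over all operating points of a PR curve."""
    return max((2 * p * r / (p + r)) if (p + r) > 0 else 0.0
               for p, r in zip(precisions, recalls))

# Two toy queries: ranked database indices and their sets of true matches.
ranked_lists = [[3, 1, 2], [0, 2, 1]]
positives = [{1}, {1}]
r1 = recall_at_n(ranked_lists, positives, 1)
r2 = recall_at_n(ranked_lists, positives, 2)
r3 = recall_at_n(ranked_lists, positives, 3)

# Toy PR curve with three operating points.
best_f1 = f1_max([1.0, 0.8, 0.5], [0.2, 0.6, 1.0])
```

Recall@N is non-decreasing in N by construction, while F1-max summarizes the whole PR curve with its single best precision-recall trade-off.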

Columns 1-4: Brisbane-Event-VPR [9] dataset; columns 5-6: DDD20 [17] dataset. The best result in each column is marked with *.

Training set (database / query) | | (dt & mn) / sr | (ss2 & dt) / mn | (ss2 & dt) / sr | (ss2 & mn) / sr | **83 / **57 | **81 / **54
Testing set (database / query) | | ss1 / ss2 | ss1 / sr | ss1 / mn | ss1 / dt | **83 / **48 | **81 / **68
VPR Method | Metric | 1 | 2 | 3 | 4 | 5 | 6
NetVLAD [2] | Recall@1 (%) | 94.34 | 90.61 | 86.97 | 77.40 | 46.08 | 9.30
 | Recall@5 (%) | 97.29 | 96.35 | 94.11* | 88.49 | 62.91 | 17.89
 | F1-max | 0.9709 | 0.9546 | 0.9330 | 0.8762 | 0.6383 | 0.2411
MR-NetVLAD [25] | Recall@1 (%) | 94.23 | 91.46 | 87.20 | 79.68 | 64.66 | 30.38
 | Recall@5 (%) | 97.06 | 95.43 | 93.06 | 88.72 | 82.55 | 43.56
 | F1-max | 0.9733 | 0.9582 | 0.9343* | 0.8929 | 0.7860 | 0.5244
Ensemble-Event-VPR [9] | Recall@1 (%) | 87.33 | 58.82 | 58.42 | 43.62 | 32.02 | 7.84
 | Recall@5 (%) | 95.70 | 82.42 | 78.33 | 62.71 | 50.17 | 14.99
 | F1-max | 0.9345 | 0.7246 | 0.7550 | 0.6319 | 0.4967 | 0.1465
Event-VPR (Ours) [26] | Recall@1 (%) | 84.79 | 65.65 | 66.67 | 44.54 | 43.89 | 8.52
 | Recall@5 (%) | 93.83 | 86.52 | 84.26 | 66.10 | 66.67 | 21.33
 | F1-max | 0.9236 | 0.7984 | 0.7946 | 0.6705 | 0.6057 | 0.1578
FE-Fusion-VPR (Ours) | Recall@1 (%) | 95.64* | 93.58* | 87.41* | 86.15* | 72.77* | 54.05*
 | Recall@5 (%) | 98.36* | 97.19* | 93.87 | 93.03* | 86.04* | 64.34*
 | F1-max | 0.9780* | 0.9671* | 0.9310 | 0.9247* | 0.8407* | 0.7465*
TABLE II: Comparison of our FE-Fusion-VPR against frame-based SOTA VPR methods [2, 25] and event-based SOTA VPR methods [9, 26] on the Brisbane-Event-VPR [9] and DDD20 [17] datasets.
Fig. 6: PR curves of NetVLAD [2], MR-NetVLAD [25], Ensemble-Event-VPR [9], Event-VPR (ours) [26] and FE-Fusion-VPR (ours) on the Brisbane-Event-VPR [9] and DDD20 [17] datasets. Panels: (a) sunset1 / sunset2; (b) sunset1 / sunrise; (c) sunset1 / morning; (d) sunset1 / daytime; (e) **83 / **48; (f) **81 / **68. Panels (a)-(d) are on the Brisbane-Event-VPR [9] dataset, and (e), (f) are on the DDD20 [17] dataset. Our FE-Fusion-VPR (red) performs better than SOTA methods in most scenes.
Fig. 7: Recall@N curves of NetVLAD [2], MR-NetVLAD [25], Ensemble-Event-VPR [9], Event-VPR (ours) [26] and FE-Fusion-VPR (ours) on the Brisbane-Event-VPR [9] and DDD20 [17] datasets. Panels: (a) sunset1 / sunset2; (b) sunset1 / sunrise; (c) sunset1 / morning; (d) sunset1 / daytime; (e) **83 / **48; (f) **81 / **68. Panels (a)-(d) are on the Brisbane-Event-VPR [9] dataset, and (e), (f) are on the DDD20 [17] dataset. Our FE-Fusion-VPR (red) is superior in most scenes.
Fig. 8: Retrieval success-rate maps of MR-NetVLAD [25] and FE-Fusion-VPR (ours) on the Brisbane-Event-VPR [9] and DDD20 [17] datasets. Panels: (a) sunset1 / sunset2; (b) sunset1 / sunrise; (c) sunset1 / morning; (d) sunset1 / daytime; (e) **83 / **48; (f) **81 / **68. Panels (a)-(d) are on the Brisbane-Event-VPR [9] dataset, and (e), (f) are on the DDD20 [17] dataset. Top: MR-NetVLAD [25], bottom: FE-Fusion-VPR (ours).

IV-B Comparison against SOTA Methods

In this section, we present the evaluation details, covering frame-based and event-based SOTA VPR methods, and then analyze why our FE-Fusion-VPR has the best performance. Since the results of VEFNet [18] on the two datasets are not better than the frame-based SOTA method (neither the results in their paper nor the results we verified), we do not compare it with FE-Fusion-VPR.

IV-B1 Comparison against Frame-based VPR Methods

We compare our FE-Fusion-VPR against frame-based SOTA algorithms (NetVLAD [2] and MR-NetVLAD [25]), and the experimental results are shown in Tab. II, Fig. 6, Fig. 7 and Fig. 8. The results show that our method outperforms the above two algorithms in most cases. Tab. II shows that, on average, the Recall@1 of FE-Fusion-VPR is 3.37% and 2.55% higher than that of NetVLAD and MR-NetVLAD on the Brisbane-Event-VPR dataset. On the DDD20 dataset, the advantages of our FE-Fusion-VPR are more obvious: on average, the Recall@1 of FE-Fusion-VPR is 35.72% and 15.89% higher than that of the above two algorithms. The reason is that the Brisbane-Event-VPR dataset contains sequences with obvious differences in intensity appearance, and the DDD20 dataset contains some glare scenarios, both of which limit the performance of frame-based VPR methods. Event cameras, which are hardly affected by illumination changes, can significantly improve the performance of VPR algorithms in these conditions. Besides, the DDD20 dataset contains highway scenes with many small objects on both sides of the road. Therefore, our MSF-Net combined with the DRW-Net can improve VPR performance thanks to multi-scale feature fusion.

IV-B2 Comparison against Event-based VPR Methods

As shown in Tab. II, Fig. 6, and Fig. 7, our FE-Fusion-VPR is much more robust than the pure event-based SOTA methods (Ensemble-Event-VPR [9] and Event-VPR [26]). On the Brisbane-Event-VPR dataset, the Recall@1 of FE-Fusion-VPR is on average 25.28% and 28.65% higher than that of Event-VPR and Ensemble-Event-VPR. On the DDD20 dataset, the Recall@1 of FE-Fusion-VPR increases by an average of 37.21% and 43.48% over Event-VPR and Ensemble-Event-VPR. The accuracy of our algorithm is much better than that of the above two methods, illustrating the importance of information from the intensity frame for improving the performance of VPR networks. As is well known, event cameras can hardly capture information at low-speed intersections and on highways with sparse texture, while standard cameras can capture more background information when the illumination is suitable. Therefore, our FE-Fusion-VPR can achieve higher VPR performance by combining the advantages of both sensors. By using MSF-Net and DRW-Net, our descriptors retain rich and vital information.
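The average Recall@1 gains quoted in this section can be re-derived from the per-sequence values in Tab. II. The short script below does this arithmetic; the dictionary layout and the function names are only an illustrative choice, and the gains are differences of mean Recall@1 in percentage points.

```python
# Recall@1 (%) from Tab. II: four Brisbane-Event-VPR test pairs,
# followed by two DDD20 test pairs.
recall_at_1 = {
    "NetVLAD":            ([94.34, 90.61, 86.97, 77.40], [46.08, 9.30]),
    "MR-NetVLAD":         ([94.23, 91.46, 87.20, 79.68], [64.66, 30.38]),
    "Ensemble-Event-VPR": ([87.33, 58.82, 58.42, 43.62], [32.02, 7.84]),
    "Event-VPR":          ([84.79, 65.65, 66.67, 44.54], [43.89, 8.52]),
    "FE-Fusion-VPR":      ([95.64, 93.58, 87.41, 86.15], [72.77, 54.05]),
}
BRISBANE, DDD20 = 0, 1

def mean(values):
    return sum(values) / len(values)

def average_gain(baseline, dataset, ours="FE-Fusion-VPR"):
    """Mean Recall@1 of `ours` minus mean Recall@1 of `baseline`."""
    return mean(recall_at_1[ours][dataset]) - mean(recall_at_1[baseline][dataset])

def overall_gain(baseline, ours="FE-Fusion-VPR"):
    """Same difference, averaged over all six test pairs of both datasets."""
    all_ours = recall_at_1[ours][BRISBANE] + recall_at_1[ours][DDD20]
    all_base = recall_at_1[baseline][BRISBANE] + recall_at_1[baseline][DDD20]
    return mean(all_ours) - mean(all_base)

for name in ("NetVLAD", "MR-NetVLAD", "Event-VPR", "Ensemble-Event-VPR"):
    print(f"{name}: Brisbane {average_gain(name, BRISBANE):+.2f}, "
          f"DDD20 {average_gain(name, DDD20):+.2f}, "
          f"overall {overall_gain(name):+.2f}")
```

The per-dataset differences match the figures reported in Sections IV-B1 and IV-B2 to within rounding.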

All values are Recall@1 (%). MSF-Net rows list results without / with the attention layer. The best result in each column of each block is marked with *.

Training set (database / query) | | (dt & mn) / sr | (ss2 & mn) / sr | (ss2 & dt) / sr | (ss2 & dt) / mn
Testing set (database / query) | | ss1 / ss2 | ss1 / dt | ss1 / mn | ss1 / sr
Ablation | Setting | 1 | 2 | 3 | 4
TSFE-Net | Only Frame Encoder | 85.63 | 78.38 | 76.23 | 67.98
 | Only Event Encoder | 93.16 | 63.74 | 82.16 | 53.90
 | Frame Encoder + Event Encoder | 95.64* | 93.58* | 87.41* | 86.15*
MSF-Net | Single-scale feature #1 | 86.82 / 91.35 | 76.33 / 83.59 | 85.31 / 80.06 | 72.84 / 68.03
 | Single-scale feature #2 | 86.03 / 95.53 | 73.25 / 88.13 | 82.81 / 72.72 | 81.33 / 78.67
 | Single-scale feature #3 | 90.05 / 94.85 | 84.16 / 86.08 | 84.95 / 83.82 | 82.02 / 76.01
 | Multi-Scale | 94.06 / 95.64* | 90.89 / 93.58* | 85.85 / 87.41* | 80.09 / 86.15*
DRW-Net | Concatenation in Length | 91.57 | 68.51 | 75.30 | 73.21
 | With Re-Weighting Layer | 95.64* | 93.58* | 87.41* | 86.15*
TABLE III: Ablation studies on the impact of TSFE-Net, MSF-Net, and DRW-Net on the performance of FE-Fusion-VPR on Brisbane-Event-VPR [9].

IV-C Ablation Studies

In this section, we explore the impact of TSFE-Net, MSF-Net, and DRW-Net on the performance of FE-Fusion-VPR.

IV-C1 Impact of TSFE-Net

The experimental results in Tab. III demonstrate that using a single type of sensor data leads to severe performance degradation on the Brisbane-Event-VPR dataset. The SOTA performance of our FE-Fusion-VPR is therefore attributable to using the two vision sensors simultaneously rather than a single type of visual sensor.

IV-C2 Impact of MSF-Net

The experimental results in Tab. III show that, in most cases, our multi-scale FE-Fusion-VPR achieves better performance than networks using single-scale features (with or without attention layers). Since features at different scales focus on information in different regions, using mid-level or high-level features alone is unreliable. Therefore, multi-scale fusion can achieve higher VPR performance. In addition, by adding attention layers at appropriate locations throughout the whole network, the performance of FE-Fusion-VPR can be further improved.

IV-C3 Impact of DRW-Net

In this experiment, we remove the DRW-Net and use the original VLAD layer. Before the features are input into NetVLAD, we flatten the three different-scale feature maps output by MSF-Net into three descriptors, and then we concatenate them along the length dimension to form a single descriptor. The results in Tab. III show that our DRW-Net outperforms methods that directly use original VLAD layers for multi-feature fusion in all cases. DRW-Net learns the weight of each sub-descriptor automatically, which makes full use of each descriptor's significant information, so that the final multi-scale descriptor has a robust representation ability.

V Conclusions

In this paper, we analyzed the limitations of VPR methods that use a frame camera or an event camera alone. On that basis, we proposed an attention-based multi-scale network architecture combining frames and events for VPR (named FE-Fusion-VPR) to achieve robust performance in challenging environments. The two key ideas of FE-Fusion-VPR are as follows: First, we achieve visual data fusion based on intensity frames and event volumes. Second, we complete feature fusion based on a multi-scale network and a descriptor re-weighting network, which is validated to be effective in our ablation studies. Compared with existing frame-based and event-based SOTA methods, our FE-Fusion-VPR achieves higher performance, especially in scenes with little texture and under difficult sunlight glare conditions. In the future, we will try to make our algorithm lightweight and faster for deployment on autonomous vehicles or mini-UAVs. Furthermore, we will also try to realize a deep spiking VPR network architecture [20] for energy-efficient inference.


  • [1] A. Angeli, D. Filliat, S. Doncieux, and J. Meyer (2008) Fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics 24 (5), pp. 1027–1037.
  • [2] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307.
  • [3] R. Arandjelovic and A. Zisserman (2013) All about VLAD. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1578–1585.
  • [4] H. Bay, T. Tuytelaars, and L. V. Gool (2006) SURF: speeded up robust features. In European Conference on Computer Vision, pp. 404–417.
  • [5] G. Chen, H. Cao, J. Conradt, H. Tang, F. Rohrbein, and A. Knoll (2020) Event-based neuromorphic vision for autonomous driving: a paradigm shift for bio-inspired visual sensing and perception. IEEE Signal Processing Magazine 37 (4), pp. 34–49.
  • [6] Z. Chen, O. Lam, A. Jacobson, and M. Milford (2014) Convolutional neural network-based place recognition. arXiv preprint arXiv:1411.1509.
  • [7] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 886–893.
  • [8] T. Delbrück, B. Linares-Barranco, E. Culurciello, and C. Posch (2010) Activity-driven, event-based vision sensors. In IEEE International Symposium on Circuits and Systems, pp. 2426–2429.
  • [9] T. Fischer and M. Milford (2020) Event-based visual place recognition with ensembles of temporal windows. IEEE Robotics and Automation Letters 5 (4), pp. 6924–6931.
  • [10] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, et al. (2020) Event-based vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (1), pp. 154–180.
  • [11] D. Gálvez-López and J. D. Tardos (2012) Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics 28 (5), pp. 1188–1197.
  • [12] S. Garg, T. Fischer, and M. Milford (2021) Where is your place, visual place recognition?. arXiv preprint arXiv:2103.06443.
  • [13] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza (2018) Asynchronous, photometric feature tracking using events and frames. In European Conference on Computer Vision, pp. 750–765.
  • [14] D. Gehrig, M. Rüegg, M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza (2021) Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction. IEEE Robotics and Automation Letters 6 (2), pp. 2822–2829.
  • [15] S. Hausler, S. Garg, M. Xu, M. Milford, and T. Fischer (2021) Patch-NetVLAD: multi-scale fusion of locally-global descriptors for place recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14141–14152.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [17] Y. Hu, J. Binas, D. Neil, S. Liu, and T. Delbruck (2020) DDD20 end-to-end event camera driving dataset: fusing frames and events with deep learning for improved steering prediction. In International Conference on Intelligent Transportation Systems, pp. 1–6.
  • [18] Z. Huang, R. Huang, L. Sun, C. Zhao, M. Huang, and S. Su (2022) VEFNet: an event-rgb cross modality fusion network for visual place recognition. In IEEE International Conference on Image Processing, pp. 2671–2675. Cited by: §II-B, §IV-B.
  • [19] H. Jégou, M. Douze, C. Schmid, and P. Pérez (2010) Aggregating local descriptors into a compact image representation. In IEEE computer society conference on computer vision and pattern recognition, pp. 3304–3311. Cited by: §II-A.
  • [20] J. Jiang, D. Kong, K. Hou, X. Huang, H. Zhuang, and F. Zheng (2022) Neuro-planner: a 3d visual navigation method for mav with depth camera based on neuromorphic reinforcement learning. arXiv preprint arXiv:2210.02305. Cited by: §V.
  • [21] Z. Jiang, P. Xia, K. Huang, W. Stechele, G. Chen, Z. Bing, and A. Knoll (2019) Mixed frame-/event-driven fast pedestrian detection. In International Conference on Robotics and Automation, pp. 8332–8338. Cited by: §I.
  • [22] Z. Jin, D. Yu, L. Song, Z. Yuan, and L. Yu (2022) You should look at all objects. In European Conference on Computer Vision, pp. 332–349. Cited by: §III-C.
  • [23] J. H. Jung and C. G. Park (2020) Constrained filtering-based fusion of images, events, and inertial measurements for pose estimation. In IEEE International Conference on Robotics and Automation, pp. 644–650. Cited by: §I.
  • [24] A. Khaliq, S. Ehsan, Z. Chen, M. Milford, and K. McDonald-Maier (2019) A holistic visual place recognition approach using lightweight cnns for significant viewpoint and appearance changes. IEEE transactions on robotics 36 (2), pp. 561–569. Cited by: §II-A.
  • [25] A. Khaliq, M. Milford, and S. Garg (2022) MultiRes-netvlad: augmenting place recognition training with low-resolution imagery. IEEE Robotics and Automation Letters 7 (2), pp. 3882–3889. Cited by: §II-A, Fig. 6, Fig. 7, Fig. 8, §IV-B1, TABLE II.
  • [26] D. Kong, Z. Fang, K. Hou, H. Li, J. Jiang, S. Coleman, and D. Kerr (2022) Event-vpr: end-to-end weakly supervised deep network architecture for visual place recognition using event-based vision sensor. IEEE Transactions on Instrumentation and Measurement 71, pp. 1–18. Cited by: §II-B, §III-B, §III-C, Fig. 6, Fig. 7, §IV-A2, §IV-A3, §IV-B2, TABLE II.
  • [27] A. J. Lee and A. Kim (2021) EventVLAD: visual place recognition with reconstructed edges from event cameras. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2247–2252. Cited by: §II-B.
  • [28] C. Lee, A. K. Kosta, and K. Roy (2022) Fusion-flownet: energy-efficient optical flow estimation using sensor fusion and deep fused spiking-analog network architectures. In International Conference on Robotics and Automation, pp. 6504–6510. Cited by: §I.
  • [29] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §III-C.
  • [30] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §II-A.
  • [31] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford (2015) Visual place recognition: a survey. IEEE transactions on robotics 32 (1), pp. 1–19. Cited by: §I.
  • [32] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza (2018) Event-based vision meets deep learning on steering prediction for self-driving cars. In IEEE conference on computer vision and pattern recognition, pp. 5419–5427. Cited by: §III-B.
  • [33] C. Masone and B. Caputo (2021) A survey on deep visual place recognition. IEEE Access 9, pp. 19516–19547. Cited by: §I, §II-A.
  • [34] A. Oliva and A. Torralba (2006) Building the gist of a scene: the role of global image features in recognition. Progress in brain research 155, pp. 23–36. Cited by: §II-A.
  • [35] L. Pan, M. Liu, and R. Hartley (2020) Single image optical flow estimation with an event camera. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1669–1678. Cited by: §I.
  • [36] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier (2010) Large-scale image retrieval with compressed fisher vectors. In IEEE computer society conference on computer vision and pattern recognition, pp. 3384–3391. Cited by: §II-A.
  • [37] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: an efficient alternative to sift or surf. In International conference on computer vision, pp. 2564–2571. Cited by: §II-A.
  • [38] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek (2013) Image classification with the fisher vector: theory and practice. International journal of computer vision 105 (3), pp. 222–245. Cited by: §II-A.
  • [39] N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford (2015) On the performance of convnet features for place recognition. In IEEE/RSJ international conference on intelligent robots and systems, pp. 4297–4304. Cited by: §III-C.
  • [40] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla (2015) 24/7 place recognition by view synthesis. In IEEE conference on computer vision and pattern recognition, pp. 1808–1817. Cited by: §II-A.
  • [41] A. R. Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza (2018) Ultimate slam? combining events, images, and imu for robust visual slam in hdr and high-speed scenarios. IEEE Robotics and Automation Letters 3 (2), pp. 994–1001. Cited by: §I.
  • [42] S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) Cbam: convolutional block attention module. In European conference on computer vision, pp. 3–19. Cited by: §III-A.
  • [43] T. Wu, C. Gong, D. Kong, S. Xu, and Q. Liu (2021) A novel visual object detection and distance estimation method for hdr scenes based on event camera. In International Conference on Computer and Communications, pp. 636–640. Cited by: §I.
  • [44] M. Ye, X. Lan, J. Li, and P. Yuen (2018) Hierarchical discriminative learning for visible thermal person re-identification. In AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §III-B.
  • [45] X. Zhang, L. Wang, and Y. Su (2021) Visual place recognition: a survey from deep learning perspective. Pattern Recognition 113, pp. 107760. Cited by: §I.