Object Tracking by Jointly Exploiting Frame and Event Domain

Inspired by the complementarity between conventional frame-based and bio-inspired event-based cameras, we propose a multi-modal approach that fuses visual cues from the frame and event domains to enhance single object tracking performance, especially in degraded conditions (e.g., scenes with high dynamic range, low light, and fast-moving objects). The proposed approach can effectively and adaptively combine meaningful information from both domains. Its effectiveness is enforced by a newly designed cross-domain attention scheme, which enhances features based on self- and cross-domain attention; its adaptiveness is guarded by a specially designed weighting scheme, which adaptively balances the contribution of the two domains. To exploit event-based visual cues in single object tracking, we construct a large-scale frame-event-based dataset, which we subsequently employ to train a novel frame-event fusion based model. Extensive experiments show that the proposed approach outperforms state-of-the-art frame-based tracking methods by at least 10.4% and 11.9% in terms of success rate and precision rate, respectively. Besides, the effectiveness of each key component of our approach is evidenced by our thorough ablation study.


1 Introduction

Recently, convolutional neural network (CNN) based approaches have shown promising performance in object tracking tasks [bertinetto2016fully, bhat2020know, Chen_2020_CVPR, dai2019visual, danelljan2017eco, Gao_2020_CVPR, Guo_2020_CVPR, li2019gradnet, nam2016learning, zhang2021multi, zhang2018learning, zhang2019deeper]. These approaches mainly use conventional frame-based cameras as sensing devices, since such cameras effectively measure absolute light intensity and provide a rich representation of a scene. However, conventional frame-based sensors have limited frame rates (e.g., 120 FPS) and dynamic range (e.g., 60 dB). Thus, they do not work robustly in degraded conditions. Figure 1 (a) and (b) show two examples of degraded conditions: high dynamic range and a fast-moving object, respectively. Under both conditions, we can hardly see the moving objects, so obtaining meaningful visual cues of the objects is challenging. By contrast, an event-based camera, a bio-inspired sensor, offers high temporal resolution (up to 1 MHz), high dynamic range (up to 140 dB), and low energy consumption [brandli2014240]. Nevertheless, it cannot measure absolute light intensity and thus provides no texture cues (as shown in Figure 1 (d)). The two sensors are, therefore, complementary. This unique complementarity motivates us to propose a multi-modal sensor fusion based approach that leverages the advantages of both the frame and event domains to improve tracking performance in degraded conditions.

Figure 1: Limitations of conventional frame-based and bio-inspired event-based cameras. (a) and (b) show the limitations of a frame-based camera under HDR conditions and with a fast-moving object, respectively. (d) shows an event-based camera's asynchronous output for the scene in (c): sparse and without texture information.

Yet, event-based cameras measure light intensity changes and output events asynchronously. This differs significantly from conventional frame-based cameras, which represent scenes with synchronous frames. Besides, CNN-based approaches are not designed to digest asynchronous inputs. Therefore, combining asynchronous events and synchronous images remains challenging. To address this challenge, we propose a simple yet effective event aggregation approach that discretizes the time domain of asynchronous events. Each discretized time slice can be accumulated into a conventional frame and thus easily processed by a CNN-based model. Our experimental results show the proposed aggregation method outperforms other commonly used event accumulation approaches [chen2020end, lagorce2016hots, maqueda2018event, rebecq2017real, zhu2019unsupervised]. Another critical challenge, shared with other multi-modal fusion based approaches [an2016online, camplani2015real, lan2018robust, li2018cross, Mei_2021_CVPR_PDNet, song2013tracking], is effectively grasping meaningful cues from both domains regardless of the diversity of scenes. To this end, we introduce a novel cross-domain feature integrator, which leverages self- and cross-domain attention schemes to fuse visual cues from both the event and frame domains effectively and adaptively. The effectiveness is enforced by a newly designed feature enhancement module, which enhances each domain's feature based on both domains' attentions. Our approach's adaptivity is upheld by a specially designed weighting scheme that balances the contributions of the two domains: based on the two domains' reliabilities, it adaptively regulates their contributions.
We extensively validate our multi-modal fusion based method and demonstrate that our model outperforms state-of-the-art frame-based methods by a significant margin: at least 10.4% and 11.9% in terms of representative success rate and precision rate, respectively.

To exploit event-based visual cues in single object tracking and enable more future research on multi-modal learning with asynchronous events, we construct a large-scale single object tracking dataset, FE108, which contains 108 sequences with a total length of 1.5 hours. FE108 provides ground truth annotations in both the frame and event domains. The annotation frequency is up to 40 Hz for the frame domain and 240 Hz for the event domain. To the best of our knowledge, FE108 is the largest frame-event-based dataset for single object tracking, and it also offers the highest annotation frequency in the event domain.

To sum up, our contributions are as follows:

We introduce a novel cross-domain feature integrator, which can effectively and adaptively fuse the visual cues provided by both the frame and event domains.

We construct a large-scale frame-event-based dataset for single object tracking. The dataset covers a wide range of challenging scenes and degraded conditions.

Our extensive experimental results show our approach outperforms other state-of-the-art methods by a significant margin. Our ablation study evidences the effectiveness of the newly designed attention-based schemes.

2 Related Work

Single-Domain Object Tracking. Recently, deep-learning-based methods have dominated the frame-based object tracking field. Most of these methods [bertinetto2016fully, dai2019visual, danelljan2017eco, li2019gradnet, nam2016learning, zhang2018learning, zhang2019deeper] leverage conventional frame-based sensors. Only a few attempts have been made to track objects using event-based cameras. Piatkowska et al. [piatkowska2012spatiotemporal] used an event-based camera for multiple-person tracking under high occlusion, enabled by a Gaussian Mixture Model based event clustering algorithm. Barranco et al. [barranco2018real] proposed a real-time mean-shift clustering algorithm using events for multi-object tracking. Mitrokhin et al. [mitrokhin2018event] proposed a novel event representation, the time-image, to utilize temporal information of the event stream; with it, they achieved an event-only, feature-less motion compensation pipeline. Chen et al. [chen2020end] pushed event representation further and proposed a synchronous Time-Surface with Linear Time Decay representation to effectively encode spatio-temporal information. Although these approaches reported promising performance in object tracking tasks, they did not consider leveraging the frame domain. By contrast, our approach focuses on leveraging the complementarity between the frame and event domains.

Multi-Domain Object Tracking. Multi-modal tracking approaches have been receiving more attention. Most works leverage RGB-D (RGB + Depth) [an2016online, camplani2015real, kart2018make, song2013tracking, xiao2017robust] or RGB-T (RGB + Thermal) [lan2018robust, li2018cross, li2019multi, Wang_2020_CVPR, zhang2019multi, zhu2019dense] inputs to improve tracking performance. Depth is an important cue for solving the occlusion problem in tracking: when a target object is partially hidden by another object with a similar appearance, the difference in their depth levels is distinctive and helps detect the occlusion. Images from thermal infrared sensors are not influenced by illumination variations and shadows, so they can be combined with RGB to improve performance in degraded conditions (e.g., rain and smog). Unlike these multi-domain approaches, fusing the frame and event domains brings a unique challenge caused by the asynchronous outputs of event-based cameras. Our approach aims to solve this problem and effectively leverages events to improve robustness, especially under degraded conditions.

3 Methodology

3.1 Background: Event-based Camera

An event-based camera is a bio-inspired sensor. It asynchronously measures light intensity changes in scene illumination at the pixel level. Hence, it provides a very high measurement rate, up to 1 MHz [brandli2014240]. Since light intensity changes are measured in log-scale, an event-based camera also offers a very high dynamic range, 140 dB vs. the 60 dB of a conventional camera [DBLP:journals/corr/abs-1904-08405]. When the change of a pixel's intensity in log-scale is greater than a threshold, an event is triggered. The polarity of an event reflects the direction of the change. Mathematically, a set of events can be defined as:

$$\mathcal{E} = \{e_k\}_{k=1}^{N}, \qquad e_k = (x_k, y_k, t_k, p_k),$$

where $e_k$ is the $k$-th event; $(x_k, y_k)$ is the pixel location of event $e_k$; $t_k$ is the timestamp; $p_k \in \{-1, +1\}$ is the polarity of the event. In a stable lighting condition, events are triggered by moving edges (e.g., object contours and texture boundaries), making an event-based camera a natural edge extractor.
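To make the triggering rule concrete, the following minimal numpy sketch simulates event generation from two log-intensity images. This is our illustration only: the function, the dense two-image input, and the contrast threshold value C=0.2 are assumptions, not the camera's actual asynchronous circuit.

```python
import numpy as np

def generate_events(log_prev, log_curr, t, C=0.2):
    """Toy per-pixel event-generation model: an event fires wherever the
    log-intensity change since the reference image exceeds the assumed
    contrast threshold C; its polarity p_k is the sign of that change."""
    diff = log_curr - log_prev
    ys, xs = np.nonzero(np.abs(diff) >= C)
    # each event e_k = (x_k, y_k, t_k, p_k)
    return [(int(x), int(y), t, int(np.sign(diff[y, x]))) for y, x in zip(ys, xs)]
```

Pixels whose log-intensity change stays below the threshold emit nothing, which is why a static, well-lit textured region produces no events at all.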

3.2 Event Aggregation

Since the asynchronous event format differs significantly from the frames generated by conventional frame-based cameras, vision algorithms designed for frame-based cameras cannot be applied directly. To deal with this, events are typically first aggregated into a frame- or grid-based representation [gehrig2019end, lagorce2016hots, maqueda2018event, messikommer2020event, rebecq2017real, wang2021event, zhu2019unsupervised].

We propose a simple yet effective pre-processing method to map events into a grid-based representation. Specifically, inspired by [zhu2019unsupervised], we first aggregate the events captured between two adjacent frames into an $n$-bin voxel grid to discretize the time dimension. Then, each 3D discretized slice is accumulated into a 2D frame, where a pixel of the frame records the polarity of the event with the latest timestamp at that pixel's location inside the current slice. Finally, the generated frames are scaled by a constant factor for further processing. Given a set of events, $\mathcal{E}_j = \{e_k\}$, with timestamps in the time range of the $j$-th bin, the pixel located at $(x, y)$ on the $j$-th aggregated frame can be defined as follows:

$$E^{j}(x, y) = \sum_{e_k \in \mathcal{E}_j} p_k\, \delta(x - x_k,\, y - y_k)\, \delta\big(t_k - t^{*}(x, y)\big), \qquad t^{*}(x, y) = \max_{e_k \in \mathcal{E}_j,\ (x_k, y_k) = (x, y)} t_k,$$

where $T_i$ is the timestamp of the $i$-th frame in the frame domain; $\delta$ is the Dirac delta function; the $j$-th bin covers the interval $[T_{i-1} + (j-1)\tau,\ T_{i-1} + j\tau)$ with bin size $\tau$ in the time domain, defined as $\tau = (T_i - T_{i-1})/n$. The proposed method leverages the latest timestamp to capture the latest motion cues inside each time slice. Our experimental results show that our event processing method outperforms other commonly used approaches (see Table 4).
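The aggregation described above can be sketched in a few lines of numpy. This is a minimal illustration under assumed inputs, not the authors' released code; event tuples follow the $(x, y, t, p)$ convention, and the output is left unscaled.

```python
import numpy as np

def aggregate_events(events, t_prev, t_curr, n_bins, H, W):
    """Latest-timestamp aggregation sketch: events between two adjacent
    frames are split into n_bins time slices; each pixel of a slice's 2D
    frame keeps the polarity of the most recent event at that location
    inside the slice."""
    tau = (t_curr - t_prev) / n_bins           # bin size in the time domain
    frames = np.zeros((n_bins, H, W))
    latest = np.full((n_bins, H, W), -np.inf)  # latest timestamp per pixel
    for x, y, t, p in events:
        j = min(int((t - t_prev) / tau), n_bins - 1)
        if t > latest[j, y, x]:                # keep only the newest event
            latest[j, y, x] = t
            frames[j, y, x] = p
    return frames
```

Note how two events at the same pixel within one slice leave only the later polarity, which is exactly the "latest motion cue" behavior motivated above.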

3.3 Network Architecture

The overall architecture of the proposed approach is illustrated in Figure 2. It has two branches: a reference branch (top) and a test branch (bottom), which share weights in a siamese style. The core of our approach is the Cross-Domain Feature Integrator (CDFI), designed to leverage both domains' advantages: the frame domain provides rich texture information, whereas the event domain is robust to challenging scenes and provides edge information. As shown in Figure 2, the inputs of CDFI are a frame and the events captured between that frame and its previous one. We preprocess the events based on Eq. 2. The outputs of CDFI are one low-level and one high-level fused feature. The classifier uses the extracted low-level fused features from both the reference and test branches to estimate a confidence map. Finally, the bbox regressor reports the IoU between the ground truth bounding box and the estimated bounding box to help locate the target on the test frame.

Figure 2: [rgb] 0,0,0Overview of the proposed architecture.
Figure 3: [rgb] 0,0,0Detailed architectures of the proposed components. (a) Overview of Cross-Domain Feature Integrator (CDFI), (b) Edge-Attention Block (EAB), (c) Cross-Domain Modulation and Selection block (CDMS), (d) Bbox Regressor, and (e) Classifier.

3.3.1 Cross-Domain Feature Integrator

The overall structure of the proposed CDFI is shown in Figure 3 (a). It has three components: the Frame-Feature Extractor (FFE), the Event-Feature Extractor (EFE), and the Cross-Domain Modulation and Selection block (CDMS).

FFE extracts features from the frame domain. We adopt ResNet18 [he2016deep] as our frame feature extractor. The 4th and 5th blocks' features are used as the low-level and high-level frame features, respectively.

EFE generates features that represent the encoded information in the event domain. Similar to FFE, EFE extracts low-/high-level features from the event domain. Since each aggregated event frame conveys different temporal information, each of them is processed by a dedicated sub-branch. Like other feature extractors, each sub-branch of EFE leverages stacked convolutional layers to increase the receptive field at higher levels. We also introduce a self-attention scheme to each sub-branch to focus on more critical features. It is achieved by a specially designed Edge Attention Block (EAB), illustrated in Figure 3 (b). As shown in Figure 3 (a), two EABs are added behind the third and fourth convolutional layers. The low-level and high-level features on the $j$-th sub-branch are generated by the first and second EABs, respectively. Finally, all generated sub-branch features are fused in a weighted-sum manner to obtain the low-level and high-level event features. Mathematically, EFE can be written in the following general form (ignoring the low-/high-level distinction):

$$F = \sum_{j=1}^{n} w_j\, Y_j, \qquad Y_j = \mathrm{EAB}(X_j) = X_j \oplus \Big(X_j \otimes \sigma\big(\mathrm{Conv}(\mathrm{AAP}(X_j))\big)\Big),$$

where $w_j$ is a learned weight; $\mathrm{Conv}$ denotes a convolution layer; $\sigma$ is the sigmoid function; $X_j$ and $Y_j$ are the input and output features of the EAB on the $j$-th sub-branch; $\mathrm{AAP}$ is adaptive average pooling; $\oplus$/$\otimes$ indicate element-wise summation/multiplication.
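The pool-attend-reweight pattern of the EAB and the weighted-sum fusion can be illustrated with a small numpy sketch. This is our simplification: `W_c` stands in for the convolution layer, and the exact EAB wiring in Figure 3 (b) may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def edge_attention_block(X, W_c):
    """Illustrative channel-attention sketch (not the paper's exact EAB):
    adaptive average pooling squeezes the input X (C, H, W) to a channel
    descriptor; a learned projection W_c plus a sigmoid yield per-channel
    attention, which re-weights X; an element-wise sum keeps a residual
    path to the original feature."""
    pooled = X.mean(axis=(1, 2))         # adaptive average pooling -> (C,)
    attn = sigmoid(W_c @ pooled)         # attention weights in (0, 1)
    return X + X * attn[:, None, None]   # element-wise mult, then sum

def efe_fuse(sub_branch_feats, weights):
    """Weighted-sum fusion of per-sub-branch features with learned scalar
    weights, as in the EFE description above."""
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w, np.stack(sub_branch_feats), axes=1)
```

Because the attention lies in (0, 1) and is added residually, the block can amplify informative channels without ever fully suppressing the input signal.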

CDMS is designed to fuse the extracted frame and event features, as shown in Figure 3 (c). The key to the proposed CDMS is a cross-domain attention scheme designed based on the following observations: (i) rich textural and semantic cues can easily be captured by a conventional frame-based sensor, whereas an event-based camera easily captures edge information; (ii) the cues provided by a conventional frame-based sensor become less effective in challenging scenarios, from which an event-based camera does not suffer; (iii) when multiple moving objects cross each other, it is hard to separate them based on edges alone, but the problem can be addressed well with texture information.

To address the first observation, we design a Cross-Attention Block (CAB) to fuse features of the two domains based on cross-domain attention. Specifically, given two features from two different domains, $X_a$ and $X_b$, we define the following cross-domain attention scheme to generate an enhanced feature for $X_a$:

$$\hat{X}_a = \phi\Big(\big[\,X_a \otimes \mathcal{A}(X_a),\; X_a \otimes \mathcal{A}(X_b)\,\big]\Big),$$

where $[\cdot, \cdot]$ indicates channel-wise concatenation; $\phi$ is Batch Normalization (BN) followed by a ReLU activation function; $\mathcal{A}(X_a)$ indicates a self-attention based on $X_a$; $\mathcal{A}(X_b)$ is a cross-domain attention based on $X_b$ to enhance the feature of $X_a$. When $a$ and $b$ represent the event and frame domains, the enhanced feature of the event domain, $\hat{X}_E$, is obtained. Inversely, the enhanced feature of the frame domain, $\hat{X}_F$, can be generated.
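A minimal numpy sketch of the self-/cross-attention idea follows. It is an illustration only: the toy `spatial_attention` map (channel-mean plus sigmoid) and the use of plain ReLU in place of the BN + ReLU stage are our assumptions, not the exact CAB.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(X):
    """Toy spatial attention: channel-averaged response squashed to (0, 1)."""
    return sigmoid(X.mean(axis=0, keepdims=True))

def cross_attention_block(Xa, Xb):
    """Cross-domain attention sketch: Xa (C, H, W) is enhanced both by its
    own self-attention and by an attention map derived from the other
    domain Xb; the two enhanced copies are concatenated channel-wise, and
    ReLU stands in for the BN + ReLU stage."""
    self_enh = Xa * spatial_attention(Xa)    # self-attention branch
    cross_enh = Xa * spatial_attention(Xb)   # cross-domain attention branch
    return np.maximum(np.concatenate([self_enh, cross_enh], axis=0), 0.0)
```

Swapping the roles of the two inputs yields the enhanced feature of the other domain, mirroring the symmetric use of the CAB described above.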

To address the second and third observations, we propose an adaptive weighted balance scheme to balance the contributions of the frame and event domains:

$$F^{fused} = w_F \otimes \hat{X}_F \;\oplus\; w_E \otimes \hat{X}_E,$$

where $w_F$ and $w_E$ are adaptive weights estimated from the two domains' features to reflect their reliabilities.
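The adaptive weighting can be sketched as a softmax over per-domain reliability scores. The scalar scores and the softmax normalization are our assumptions about one plausible realization, not the paper's exact weighting network.

```python
import numpy as np

def adaptive_fuse(Xf_hat, Xe_hat, score_f, score_e):
    """Adaptive weighting sketch: per-domain reliability scores (stand-ins
    for whatever the network predicts) are normalized with a softmax into
    weights w_F and w_E, which balance the two enhanced features."""
    s = np.array([score_f, score_e], dtype=float)
    w = np.exp(s - s.max())
    w = w / w.sum()                      # softmax: w_F + w_E = 1
    return w[0] * Xf_hat + w[1] * Xe_hat, w
```

With this normalization, when one domain's reliability drops (say, a blurred frame), its weight falls and the other domain automatically dominates the fused feature.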
3.3.2 Bounding Box (BBox) Regressor and Classifier

For the BBox regressor and classifier, we adopt the target estimation network of ATOM [danelljan2019atom] and the classifier of DiMP [bhat2019learning], respectively. The architecture of the BBox regressor is shown in Figure 3 (d). The IoU modulation maps the low-level and high-level fused features of the reference branch to modulation vectors $c^{L}$ and $c^{H}$, respectively. Mathematically, the mapping is achieved as follows:

$$c^{L} = \mathrm{FC}\big(\mathrm{PrPool}(F^{L}, B_{ref})\big), \qquad c^{H} = \mathrm{FC}\big(\mathrm{PrPool}(F^{H}, B_{ref})\big),$$

where $\mathrm{FC}$ is a fully connected layer; $\mathrm{PrPool}$ denotes precise ROI pooling [jiang2018acquisition]; $B_{ref}$ is the target bounding box from the reference branch. The IoU predictor then predicts the IoU of a candidate box $B$ on the test branch by modulating the pooled test features with these vectors:

$$\mathrm{IoU}(B) = g\Big(\big[\,c^{L} \otimes \mathrm{PrPool}(F^{L}_{test}, B),\; c^{H} \otimes \mathrm{PrPool}(F^{H}_{test}, B)\,\big]\Big),$$

where $g$ is the IoU prediction head.
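The modulation-and-prediction scheme above can be sketched with a crude average-pooling stand-in for PrPool. Everything here (the `roi_pool` helper, the single linear head `W_fc`, the shapes) is an assumption for illustration, not the ATOM-style network itself.

```python
import numpy as np

def roi_pool(F, box):
    """Crude stand-in for PrPool: average-pool the feature map F (C, H, W)
    inside an integer box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return F[:, y0:y1, x0:x1].mean(axis=(1, 2))

def predict_iou(F_ref, box_ref, F_test, box_candidate, W_fc):
    """Modulation sketch: reference features pooled inside the reference
    box yield a modulation vector, which re-weights the pooled test
    features channel-wise before a final linear IoU head W_fc."""
    modulation = roi_pool(F_ref, box_ref)          # target-specific vector
    test_feat = roi_pool(F_test, box_candidate)
    return float(W_fc @ (modulation * test_feat))  # scalar IoU score
```

The channel-wise product is what injects target-specific appearance from the reference branch into the test-branch box evaluation.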
For the classifier, following [bhat2019learning], we use it to predict a target confidence score. As shown in Figure 3 (e), the classifier first maps the reference branch's low-level fused features and the target box to an initial filter, which is then refined by the optimizer. The optimizer uses the steepest descent methodology to obtain the final filter. The final filter is used as the convolutional layer's filter weight and applied to the test branch's features to robustly discriminate between the target object and background distractors.

3.4 Loss Function

We adopt the loss function of [bhat2019learning], which is defined as:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\, \mathcal{L}_{bb}, \qquad \mathcal{L}_{cls} = \frac{1}{N} \sum_{i} \big\| \ell(s_i, z_i) \big\|^{2},$$

where $s_i$ is the $i$-th classification score predicted by the classifier, and $z_i$ is obtained by setting the target confidence to a Gaussian function centered at the target $c$. The loss function has two components: the classification loss $\mathcal{L}_{cls}$ and the bounding box regressor loss $\mathcal{L}_{bb}$. $\mathcal{L}_{cls}$ estimates the mean squared error (MSE) between $s_i$ and $z_i$. The idea behind the hinged residual $\ell$ is to alleviate the impact of unbalanced negative samples (i.e., background): a hinge function clips the scores at zero in the background region so that the model can equally focus on both positive and negative samples. $\mathcal{L}_{bb}$ estimates the MSE between the predicted IoU overlap obtained from the test branch and the ground truth IoU.
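The hinged classification term can be sketched as follows. This is a simplification of the DiMP-style formulation: the background mask via a fixed cutoff on the Gaussian target map is our assumption.

```python
import numpy as np

def hinged_mse_cls_loss(scores, targets, bg_cutoff=0.05):
    """Hinged MSE sketch: in background regions (target value below an
    assumed cutoff) predicted scores are clipped at zero before the MSE,
    so confidently-negative background pixels contribute no extra loss and
    cannot dominate the gradient."""
    background = targets <= bg_cutoff
    clipped = np.where(background, np.maximum(scores, 0.0), scores)
    return float(np.mean((clipped - targets) ** 2))
```

A score of -2 on a background pixel is clipped to 0 and incurs zero loss, while the same error near the target center is penalized in full, which is exactly the rebalancing described above.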

4 Dataset

Hu et al. [hu2016dvs] collected a dataset by placing an event-based camera in front of a monitor displaying large-scale annotated RGB/grayscale videos (e.g., VOT2015 [kristan2015visual]). However, a dataset based on RGB tracking benchmarks cannot faithfully represent events captured in real scenes, since the events between adjacent frames are missing. Mitrokhin et al. [mitrokhin2018event, mitrokhin2019ev] collected two event-based tracking datasets: EED [mitrokhin2018event] and EV-IMO [mitrokhin2019ev]. As shown in Table 1, EED has only 179 frames (7.8 seconds) with two types of objects. EV-IMO offers a better package with motion masks and high-frequency event annotations, up to 200 Hz, but, similar to EED, its limited object types block practical use. To enable further research on multi-modal learning with events, we collect a large-scale dataset termed FE108, which has 108 sequences with a total length of 1.5 hours. The dataset contains 21 different types of objects and covers four challenging scenarios. The annotation frequency is up to 20/40 Hz for the frame domain (20 of the 108 sequences are annotated at 20 Hz) and 240 Hz for the event domain.

Dataset | Classes | Frames | Events | Time | Frame (Hz) | Event (Hz)
EED [mitrokhin2018event] | 2 | 179 | 3.4M | 7.8s | 23 | 23
EV-IMO [mitrokhin2019ev] | 3 | 76,800 | - | 32.0m | 40 | 200
Ours | 21 | 208,672 | 5.6G | 96.9m | 20/40 | 240
Table 1: Analysis of existing event-based datasets. Our FE108 offers the best value for every listed metric.

4.1 Dataset Collection and Annotation

The FE108 dataset is captured by a DAVIS346 event-based camera [brandli2014240], which is equipped with a 346x260-pixel dynamic vision sensor (DVS) and an active pixel sensor (APS). It can simultaneously provide events and aligned grayscale images of a scene. The ground truth bounding boxes of a moving target are provided by the Vicon motion capture system [vicon], which captures motion with a high sampling rate (up to 330 Hz) and sub-millimeter precision. During the capturing process, we fix the APS's frame rate to 20/40 FPS and Vicon's sampling rate to 240 Hz, which are also the annotation frequencies of the captured APS frames and the accumulated events (i.e., accumulated every 1/240 second), respectively.

4.2 Dataset Facts

We introduce critical aspects of the constructed FE108. More details about FE108 are described in the supplementary material.

Categorical Analysis. The FE108 dataset can be categorized from different perspectives. The first is the number of object classes: there are 21 different object classes, which can be divided into three categories, namely animals, vehicles, and daily goods (e.g., bottles, boxes). Second, as shown in Figure 4 (a), FE108 contains four types of challenging scenes: low-light (LL), high dynamic range (HDR), and fast motion with and without motion blur on the APS frame (FWB and FNB). Third, based on camera movement and the number of objects, FE108 has four types of shots: static shots with a single object or multiple objects, and dynamic shots with a single object or multiple objects.

Annotated Bounding Box Statistics. In Figure 4 (b), we plot the distribution of all annotated bounding box locations, which shows that most annotations are close to the frame centers. In Figure 4 (c), we also show the distribution of the bounding box aspect ratios (H/W).

Event Rate. The FE108 dataset is collected under constant lighting conditions. This means all events are triggered by motion (i.e., moving objects and camera motion). Therefore, the distribution of the event rate can represent the motion distribution of FE108. As shown in Figure 4 (d), the distribution of the event rate is diverse, indicating that the captured 108 scenes offer wide motion diversity.

Figure 4: Statistics of the FE108 dataset in terms of (a) attribute distribution, (b) bounding box center position, (c) aspect ratios, and (d) average event rate (Ev/ms).

5 Experiments

We implement the proposed network in PyTorch. In the training phase, random initialization is used for all components except the FFE (a ResNet18 pre-trained on ImageNet). The initial learning rates for the classifier, the bbox regressor, and the CDFI are set to 1e-3, 1e-3, and 1e-4, respectively. The learning rate is adjusted by a decay scheduler, which scales it by 0.2 every 15 epochs. We use the Adam optimizer to train the network for 50 epochs with a batch size of 26. Training takes about 20 hours on a machine with an Intel i9-10900K 3.7 GHz CPU, 64 GB RAM, and an NVIDIA RTX3090 GPU.

SiamRPN [li2018high] 15.3 16.9 6.1 21.6 | 10.1 8.3 1.4 14.5 | 26.2 32.1 6.1 44.1 | 33.2 42.9 11.5 51.9 | 21.8 26.1 7.0 33.5
ATOM [danelljan2019atom] 36.6 41.8 14.4 56.0 | 28.6 29.1 5.8 45.0 | 66.8 89.6 32.6 96.7 | 57.1 71.0 28.0 88.6 | 46.5 56.4 20.1 71.3
DiMP [bhat2019learning] 41.8 50.0 17.9 62.7 | 45.6 52.8 11.2 69.5 | 69.4 94.7 37.1 99.7 | 60.5 75.6 29.3 93.2 | 52.6 65.4 23.4 79.1
SiamFC++ [xu2020siamfc++] 15.3 15.0 1.3 25.2 | 13.4 8.7 0.8 15.3 | 28.6 36.3 6.0 48.2 | 36.8 42.7 7.4 63.1 | 23.8 26.0 3.9 39.1
SiamBAN [chen2020siamese] 16.3 16.4 3.9 26.6 | 15.5 14.8 2.3 26.5 | 25.2 26.3 5.8 46.7 | 32.0 39.6 9.1 51.4 | 22.5 25.0 5.6 37.4
KYS [bhat2020know] 15.7 14.5 5.2 23.0 | 12.0 8.0 1.1 18.0 | 47.0 63.9 14.8 73.3 | 36.9 44.5 15.2 57.9 | 26.6 30.6 9.2 41.0
CLNet [dongclnet] 30.0 33.5 9.6 48.3 | 13.7 6.0 0.9 23.6 | 52.9 71.2 23.3 80.3 | 40.8 46.3 14.2 67.7 | 34.4 39.1 11.8 55.5
PrDiMP [danelljan2020probabilistic] 44.3 52.8 19.6 66.3 | 44.6 48.2 8.9 69.5 | 67.0 89.9 33.6 99.7 | 60.6 75.8 29.7 93.3 | 53.0 65.0 23.3 80.5
ATOM [danelljan2019atom] + Event 49.0 59.2 21.0 68.8 | 50.8 67.8 27.7 72.6 | 68.5 90.4 42.0 97.2 | 57.4 71.1 28.3 90.2 | 55.5 70.0 27.4 81.8
DiMP [bhat2019learning] + Event 50.1 60.2 23.7 74.8 | 57.0 70.4 28.2 82.8 | 70.1 94.2 44.2 99.9 | 60.8 75.9 29.1 93.6 | 57.1 71.2 28.6 85.1
PrDiMP [danelljan2020probabilistic] + Event 53.1 65.3 24.9 79.1 | 60.3 79.6 29.8 90.5 | 70.0 93.8 44.8 99.8 | 61.8 76.3 29.4 93.6 | 59.0 74.4 29.8 87.7
Ours 59.9 74.4 33.0 86.0 | 65.6 86.0 30.8 95.7 | 71.2 94.7 45.9 100.0 | 62.8 80.5 32.0 94.5 | 63.4 81.3 34.4 92.4
Table 2: State-of-the-art comparison on FE108 in terms of representative success rate (RSR), representative precision rate (RPR), and overlap precision (OP). Each group of four columns reports RSR, OP at two thresholds, and RPR for one scenario; the last group is the overall result.

5.1 Comparison with State-of-the-art Trackers

To validate the effectiveness of our method, we compare the proposed approach with the following eight state-of-the-art frame-based trackers: SiamRPN [li2018high], ATOM [danelljan2019atom], DiMP [bhat2019learning], SiamFC++ [xu2020siamfc++], SiamBAN [chen2020siamese], KYS [bhat2020know], CLNet [dongclnet], and PrDiMP [danelljan2020probabilistic]. To quantify each tracker's performance, we utilize three widely used metrics: success rate (SR), precision rate (PR), and overlap precision (OP). Each metric represents the percentage of a particular type of frame. SR counts frames in which the overlap between the ground truth and predicted bounding boxes exceeds a threshold; PR counts frames in which the center distance between the ground truth and predicted bounding boxes is within a given threshold; OP represents SR at a fixed overlap threshold. For SR, we employ the area under the curve (AUC) of the SR plot as the representative SR (RSR). For PR, we use the PR score at a 20-pixel threshold as the representative PR (RPR).
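The two representative metrics can be computed from per-frame IoUs and center errors as follows (a straightforward sketch; the 101-point threshold grid is our choice):

```python
import numpy as np

def rsr_rpr(ious, center_dists, px_thresh=20.0):
    """RSR: area under the success-rate curve as the overlap threshold
    sweeps [0, 1], approximated on a uniform grid. RPR: fraction of frames
    whose center error is within the 20-pixel threshold."""
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, 101)
    success = np.array([(ious > t).mean() for t in thresholds])
    rsr = float(success.mean())                      # AUC over a uniform grid
    rpr = float((np.asarray(center_dists) <= px_thresh).mean())
    return rsr, rpr
```

Because RSR averages over all overlap thresholds, it rewards consistently tight boxes rather than boxes that merely clear a single cutoff.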

Figure 5: Precision (left) and Success (right) plots on FE108. In terms of both metrics, our approach outperforms the state of the art by a large margin.

As shown by the solid curves in Figure 5, on the FE108 dataset our method outperforms the other compared approaches by a large margin in terms of both precision and success rate. Specifically, the proposed approach achieves a 92.4% overall RPR and a 63.4% RSR, outperforming the runner-up by 11.9% and 10.4%, respectively. To gain more insight into the effectiveness of the proposed approach, we also show the performance under the four challenging conditions provided by FE108. As shown in Table 2, our method offers the best results under all four conditions, especially in LL and HDR. Eight visual examples under different degraded conditions are shown in Figure 7, where our approach accurately tracks the target under all conditions.

Figure 6: Precision (left) and Success (right) plots on EED [mitrokhin2018event].

Methods | FD | LV | Occ | WiB | MO | ALL (each cell: RSR / RPR)
SiamRPN [li2018high] | 23 / 43 | 11 / 10 | 38 / 40 | 43 / 63 | 53 / 100 | 33 / 51
ATOM [danelljan2019atom] | 12 / 19 | 7 / 12 | 47 / 60 | 74 / 100 | 47 / 100 | 37 / 58
DiMP [bhat2019learning] | 9 / 19 | 2 / 4 | 48 / 60 | 79 / 100 | 50 / 100 | 37 / 57
SiamFC++ [xu2020siamfc++] | 17 / 52 | 10 / 26 | 45 / 60 | 58 / 63 | 50 / 100 | 36 / 60
SiamBAN [chen2020siamese] | 22 / 43 | 8 / 6 | 36 / 40 | 69 / 100 | 54 / 100 | 38 / 58
KYS [bhat2020know] | 19 / 38 | 6 / 19 | 46 / 60 | 46 / 63 | 54 / 100 | 34 / 56
CLNet [dongclnet] | 10 / 19 | 2 / 6 | 19 / 20 | 13 / 25 | 4 / 13 | 9 / 17
PrDiMP [danelljan2020probabilistic] | 9 / 14 | 4 / 22 | 19 / 20 | 78 / 100 | 31 / 70 | 28 / 45
Ours | 32 / 81 | 35 / 98 | 48 / 60 | 69 / 100 | 55 / 100 | 48 / 88
Table 3: State-of-the-art comparison on EED [mitrokhin2018event] in terms of RSR and RPR.

Even though EED [mitrokhin2018event] has very limited frames and associated events, it provides five challenging sequences: fast drone (FD), light variations (LV), occlusions (Occ), what is background (WiB), and multiple objects (MO). The first two sequences both record a fast-moving drone under low illumination. The third and fourth sequences record a moving ball with another object and with a net as foreground, respectively. The fifth sequence consists of multiple moving objects under normal lighting conditions. We therefore also compare our approach against the other methods on EED [mitrokhin2018event]. The experimental results are shown in Figure 6 and Table 3. Our method significantly outperforms the other approaches in all conditions except WiB. Given the limited number of frames, however, these results are less conclusive than those obtained on FE108.

One remaining question is whether combining frame and event information can make other frame-based approaches outperform our approach. To answer it, we concatenate APS frames and aggregated event frames to train and test the top three frame-based performers (i.e., PrDiMP [danelljan2020probabilistic], DiMP [bhat2019learning], and ATOM [danelljan2019atom]). We report their RSR and RPR in Table 2 and show the corresponding results as the dashed curves in Figure 5. As we can see, our approach still outperforms all of them by a considerable margin, which reflects the effectiveness of our specially designed cross-domain feature integrator. We also observe that the performance of the three chosen approaches improves significantly merely by naively combining the frame and event domains, which means event information plays an important role in dealing with degraded conditions.

Figure 7: Visual outputs of state-of-the-art algorithms on the FE108 dataset. The lower-right dashed boxes show the accumulated event frames corresponding to the dashed boxes inside the frames.

5.2 Ablation Study

Multi-modal Input. We design the following experiments to show the effectiveness of multi-modal input: 1. Frame only: only using frames and FFE; 2. Event only: only using events and EFE; 3. Event to Frame: concatenating frames and events as input to FFE; 4. Frame to Event: the same as 3, but as input to EFE. For each setup, we train a dedicated model and test with it. As shown in rows A-D of Table 4, the models with multi-modal inputs perform better than those with unimodal input, which shows the effectiveness of multi-modal fusion and of our CDFI.

A. Frame Only 45.6 54.6 21.0 73.1
B. Event Only 52.0 63.2 20.3 82.0
C. Event to Frame 55.5 70.0 27.4 82.8
D. Frame to Event 53.6 66.5 25.9 80.4
E. w/o EAB 60.7 77.9 31.7 88.6
F. w/o CDMS 59.8 75.8 31.0 88.1
G. CDMS w/o SA 62.6 79.8 33.8 91.5
H. CDMS w/o CA 61.9 78.8 33.0 90.7
I. CDMS w/o AW 60.9 77.2 32.0 89.9
J. TSLTD [chen2020end] 60.4 77.0 31.2 89.2
K. Time Surfaces [lagorce2016hots] 61.4 78.5 32.9 90.1
L. Event Count [maqueda2018event] 59.6 76.4 27.4 88.6
M. Event Frame [rebecq2017real] 59.0 74.5 29.9 87.7
N. Zhu et al. [zhu2019unsupervised] 61.9 79.2 32.3 91.2
O. All 61.3 78.1 31.6 90.1
P. Ours 63.4 81.3 34.4 92.4
Table 4: Ablation study results (columns as in the overall group of Table 2: RSR, OP at two thresholds, and RPR).

Effectiveness of the proposed key components. There are two key components in our approach: EAB and CDMS. Inside the CDMS, there are three primary schemes: self-attention (Eq. 7), cross-attention (Eq. 8), and adaptive weighting (Eq. 9). To verify their effectiveness, we modify the original model by dropping each component and retrain the modified models. Correspondingly, we obtain five retrained models: (i) without EAB; (ii) without CDMS; and, inside CDMS, (iii) without self-attention (CDMS w/o SA); (iv) without cross-attention (CDMS w/o CA); (v) without adaptive weighting (CDMS w/o AW). The results of the five modified models are shown in rows E-I of Table 4, respectively. Compared to the original model, removing CDMS has the most considerable impact on performance, whereas removing self-attention influences it the least. This confirms that the proposed CDMS is the key to our outstanding performance. Moreover, removing EAB also degrades performance significantly, showing that the EAB indeed enhances the extracted edge features.

Inside the CDMS, removing the adaptive weighting scheme degrades performance the most. To gain more insight, we report the two estimated weights (denoted $w_f$ for the frame domain and $w_e$ for the event domain) for all eight visual examples in Figure 7. Except for the second one, the frame domain cannot provide reliable visual cues; correspondingly, $w_e$ in these seven examples is significantly higher than $w_f$, whereas $w_e$ is much lower than $w_f$ in the second scene. The fourth example provides an interesting observation: the object is clearly visible in the frame domain, yet $w_e$ is still higher than $w_f$. We attribute this to the model being trained to focus on texture cues in the frame domain, while no texture cues can be extracted in this case. It is worth mentioning that only our method successfully tracks the target in all eight examples.
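A minimal sketch of such an adaptive weighting scheme is shown below. It is an assumption-laden simplification of Eq. 9, not the paper's implementation: each domain's feature map is reduced to a scalar score by global average pooling, and the two scores are normalized with a softmax so the resulting weights sum to one. The function name `adaptive_weights` and the pooling choice are hypothetical.

```python
import numpy as np

def adaptive_weights(frame_feat, event_feat):
    """Estimate per-domain fusion weights from globally pooled features.

    Hypothetical simplification: average-pool each domain's feature map
    to a scalar score, then softmax-normalize the two scores into
    weights (w_f, w_e) with w_f + w_e = 1.
    """
    scores = np.array([frame_feat.mean(), event_feat.mean()])
    e = np.exp(scores - scores.max())
    w_f, w_e = e / e.sum()
    return w_f, w_e

rng = np.random.default_rng(0)
frame_feat = rng.standard_normal((8, 8, 32))
event_feat = rng.standard_normal((8, 8, 32)) + 1.0  # stronger event response

w_f, w_e = adaptive_weights(frame_feat, event_feat)
fused = w_f * frame_feat + w_e * event_feat  # weighted fusion of the two domains
```

Under this scheme, a degraded frame (weak pooled response) automatically shifts weight toward the event domain, matching the behavior observed in Figure 7.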

Event Aggregation. For the events captured between two adjacent frames, we slice them into $n$ chunks in the time domain and then aggregate them as the EFE's inputs. Here, we study the impact of the hyperparameter $n$. As shown in Table 5, both the RSR and RPR scores increase with a larger $n$; however, a larger $n$ also slows down inference. We find that $n=3$ offers the best trade-off between accuracy and efficiency. The way events are aggregated is another factor that impacts performance. We conducted experiments with five commonly used event aggregation methods [chen2020end, lagorce2016hots, maqueda2018event, rebecq2017real, zhu2019unsupervised]. The results are shown in rows J-N of Table 4, and our method still delivers the best performance. This suggests that discretizing the time dimension and leveraging the most recent timestamp information are effective for tracking. Another component associated with event aggregation is the set of weights in Eq. 3, which are learned during training. When we instead manually fix these weights to 1, the result (row O of Table 4) is worse than that of the original model.
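The slicing step described above can be sketched as follows. This is a simplified, count-based stand-in for the paper's aggregation scheme (the learned weights of Eq. 3 are omitted): events between two frame timestamps are binned into $n$ equal time chunks, and each chunk is accumulated into a signed per-pixel count image. The `(t, x, y, polarity)` event layout and the helper name `aggregate_events` are assumptions for illustration.

```python
import numpy as np

def aggregate_events(events, t_start, t_end, n_slices, height, width):
    """Slice events between two adjacent frames into n_slices time chunks
    and accumulate each chunk into a signed per-pixel count image."""
    frames = np.zeros((n_slices, height, width), dtype=np.float32)
    dt = (t_end - t_start) / n_slices
    for t, x, y, p in events:
        idx = min(int((t - t_start) / dt), n_slices - 1)  # time-chunk index
        frames[idx, int(y), int(x)] += 1.0 if p > 0 else -1.0
    return frames

# Toy stream: four events between t=0.0 s and t=0.3 s, sliced into 3 chunks.
events = [(0.05, 1, 1, 1), (0.12, 2, 1, -1), (0.15, 2, 1, -1), (0.28, 0, 2, 1)]
stack = aggregate_events(events, 0.0, 0.3, n_slices=3, height=4, width=4)
print(stack.shape)  # (3, 4, 4)
```

Increasing `n_slices` preserves finer temporal structure (raising RSR/RPR in Table 5) but produces more input channels for the EFE, which lowers the achievable FPS.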

Number of slices n    1     2     3     4     5     6
RPR                  89.3  90.1  92.4  92.6  92.6  92.7
RSR                  60.2  61.7  63.4  63.8  63.4  63.9
FPS                  35.1  32.7  30.1  27.9  25.2  22.7
Table 5: Trade-off between accuracy and efficiency introduced by the number of slices n used in event aggregation.

6 Discussion and Conclusion

In this paper, we introduce a frame-event fusion based approach for single object tracking. Our newly designed attention schemes effectively fuse the information obtained from the frame and event domains, and the proposed weighting scheme adaptively balances the contributions of the two domains. To enable further research on multi-modal learning and object tracking with events, we construct a large-scale dataset, FE108, comprising events, frames, and high-frequency annotations. Our approach outperforms frame-based state-of-the-art methods, which indicates that leveraging the complementarity of events and frames boosts the robustness of object tracking in degraded conditions. Our current focus is on developing a cross-domain fusion scheme that enhances visual tracking robustness, especially in degraded conditions. However, we have not yet leveraged the high measurement rate of event-based cameras to achieve low-latency tracking, and the frame rate of the frame domain bounds the tracking frequency of the proposed approach. One limitation of our frame-event-based dataset, FE108, is that no sequence contains the scenario of no events. Our future work will focus on these two aspects: 1) we will investigate the feasibility of increasing the tracking frequency by leveraging the high measurement rate of event-based cameras; 2) we will expand FE108 by collecting more challenging sequences, especially ones with no events and more realistic scenes.

Acknowledgements. This work was supported in part by the National Natural Science Foundation of China under Grants 61632006 and 61972067, and by the Innovation Technology Funding of Dalian (Project Nos. 2018J11CY010 and 2020JJ26GX036).