Multi-domain Collaborative Feature Representation for Robust Visual Object Tracking

Jointly exploiting multiple different yet complementary sources of domain information has been proven to be an effective way to perform robust object tracking. This paper focuses on effectively representing and utilizing complementary features from the frame domain and the event domain to boost object tracking performance in challenging scenarios. Specifically, we propose a Common Features Extractor (CFE) to learn potential common representations from the RGB domain and the event domain. To learn the unique features of the two domains, we utilize a Unique Extractor for Event (UEE) based on Spiking Neural Networks to extract edge cues in the event domain which may be missed in RGB under some challenging conditions, and a Unique Extractor for RGB (UER) based on Deep Convolutional Neural Networks to extract texture and semantic information in the RGB domain. Extensive experiments on a standard RGB benchmark and a real event tracking dataset demonstrate the effectiveness of the proposed approach. We show that our approach outperforms all compared state-of-the-art tracking algorithms and verify that event-based data is a powerful cue for tracking in challenging scenes.


1 Introduction

Figure 1: Visual examples of our tracker compared with three other state-of-the-art trackers, including GradNet Li et al. (2019c), MDNet Nam and Han (2016), and SiamDW-RPN Zhang and Peng (2019), on Ironman from OTB2013 Wu et al. (2015). Ironman is a challenging sequence with low light, motion blur, background clutter, and fast motion. Best viewed zoomed in.

Visual object tracking is an important topic in computer vision, where the target object is identified in the first frame and tracked in all subsequent frames of a video. Owing to their significant learning ability, deep convolutional neural networks (DCNNs) have been widely applied to object detection Mei et al. (2020, 2021); Yang et al. (2019), image matting Qiao et al. (2020b, a); Yang et al. (2018b), super-resolution Zhang et al. (2020, 2021); Yang et al. (2018a), image enhancement Xu et al. (2020); Yang et al. (2018c), and visual object tracking Bertinetto et al. (2016); Fan and Ling (2019); He et al. (2018); Li et al. (2019a, c); Zhang and Peng (2019); Nam and Han (2016); Danelljan et al. (2017); Zhang et al. (2017, 2018); Jung et al. (2018); Danelljan et al. (2019); Dai et al. (2019); Li et al. (2020); Wang et al. (2017); Ren et al. (2020). However, RGB-based trackers degrade under adverse environmental conditions, e.g., low illumination and fast motion. Some works Song and Xiao (2013); Kart et al. (2018b, a); Li et al. (2018b, 2019b); Zhu et al. (2019) introduce additional information (e.g., depth and thermal infrared) to improve tracking performance. However, when the tracking target undergoes high-speed motion or the scene has a wide dynamic range, these sensors usually cannot provide satisfactory results.

Event-based cameras are bio-inspired vision sensors whose working principle is entirely different from that of traditional cameras. While conventional cameras obtain intensity frames at a fixed rate, event-based cameras measure light intensity changes and output events asynchronously. Compared with conventional cameras, event-based cameras have several advantages. First, with a high temporal resolution (around 1 μs), event-based cameras do not suffer from motion blur. Second, event-based cameras have a high dynamic range (i.e., 120-140 dB); thus, they can work effectively even under over-/under-exposure conditions.

We observe that events and RGB data are captured from different types of sensors, but they share some similar information like target boundaries. At the same time, stacked event images and RGB images have their own unique characteristics. In particular, RGB images contain rich low- and high-frequency texture information and provide abundant representations for describing target objects. Events can provide target edge cues that are not influenced by motion blur and bad illumination. Therefore, event-based data and RGB images are complementary, which calls for the development of novel algorithms capable of combining the specific advantages of both domains to perform computer vision in degraded conditions.

To the best of our knowledge, we are the first to jointly exploit RGB and events for object tracking based on their similarities and differences in an end-to-end manner. This work belongs to object tracking with multi-modal data, which includes RGB-D tracking Song and Xiao (2013); Kart et al. (2018b); Xiao et al. (2017); Kart et al. (2018a), RGB-T tracking Li et al. (2018b); Lan et al. (2018); Li et al. (2019b); Zhu et al. (2019); Zhang et al. (2019), and so on. However, since the output of an event-based camera is an asynchronous stream of events, event-based data is fundamentally different from the data of other sensors, which existing multi-modal tracking methods already handle well. With the promise of increased computational ability and low-power computation on neuromorphic hardware, Spiking Neural Networks (SNNs), a processing model aiming to improve the biological realism of artificial neural networks, show their potential as computational engines, especially for processing event-based data from neuromorphic sensors. Therefore, combining SNNs and DCNNs to process multi-domain data is worth exploring.

In this paper, focusing on the above two points, we propose Multi-domain Collaborative Feature Representation (MCFR) that can effectively extract the common features and unique features from both domains for robust visual object tracking in challenging conditions. Specifically, we employ the first three convolutional layers of VGGNet-M Simonyan and Zisserman (2014) as our Common Features Extractor (CFE) to learn similar potential representations from the RGB domain and event domain. To model specific characteristics of each domain, the Unique Extractor for RGB (UER) is designed to extract unique texture and semantic features in the RGB domain. Furthermore, we leverage the Unique Extractor for Events (UEE) based on SNNs to efficiently extract edge cues in the event domain. Extensive experiments on the RGB benchmark and real event dataset suggest that the proposed tracker achieves outstanding performance. A visual example can be seen in Figure 1, which contains multiple challenging attributes. By analyzing quantitative results, we provide basic insights and identify the potentials of events in visual object tracking.

To sum up, our contributions are as follows:

We propose a novel multi-domain feature representation network which can effectively extract and fuse information from the frame and event domains.

We preliminarily explore combining SNNs and DCNNs for visual object tracking.

The extensive experiments verify that our approach outperforms other state-of-the-art methods, and the ablation studies demonstrate the effectiveness of the designed components.

2 Related Work

2.1 Spiking Neural Networks

Spiking Neural Networks (SNNs) are bio-inspired models that use spiking neurons as computational units. The inputs of spiking neurons are temporal events called spikes, and the outputs are also spikes. A spiking neuron has a one-dimensional internal state named the membrane potential, which is governed by first-order dynamics. When an input spike arrives, the potential is excited; if no further spikes arrive in time, it decays again. When the potential reaches a certain threshold, the spiking neuron sends a spike to the connected neurons and resets its own potential. It has been shown that such networks are able to process asynchronous event data directly, without pre-processing Cohen et al. (2016); Gehrig et al. (2019). Since the spike generation mechanism is not differentiable and the spikes may introduce incorrect credit assignment along the time dimension, the traditional gradient backpropagation mechanism cannot be directly used in SNNs. Nonetheless, some research on supervised learning for SNNs Neftci et al. (2019); Zenke and Ganguli (2018); Shrestha and Song (2017); Shrestha and Orchard (2018b); Tavanaei et al. (2019); Gehrig et al. (2020) has taken inspiration from backpropagation to solve this error assignment problem. However, it is still unclear how to train multi-layer SNNs and combine them with DCNNs for the tracking task.
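To make the neuron dynamics described above concrete, the following minimal sketch simulates a single first-order spiking neuron in discrete time. It is only an illustration of the excite/decay/threshold/reset behavior; the time constant, threshold, and input values are assumptions, not parameters from any cited SNN.

```python
import numpy as np

def simulate_spiking_neuron(input_spikes, tau=20.0, threshold=1.0, dt=1.0):
    """Toy first-order spiking neuron: each input spike excites the membrane
    potential, the potential decays between spikes, and crossing the threshold
    emits an output spike and resets the potential."""
    potential = 0.0
    output_spike_times = []
    for t, s in enumerate(input_spikes):
        potential *= np.exp(-dt / tau)   # decay between time steps
        potential += s                   # excitation by the incoming spike value
        if potential >= threshold:
            output_spike_times.append(t) # emit a spike to downstream neurons
            potential = 0.0              # reset after firing
    return output_spike_times

# Example: a short burst of input spikes drives the neuron over threshold.
inp = np.zeros(100)
inp[[10, 12, 14, 60]] = 0.6
print(simulate_spiking_neuron(inp))
```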

2.2 Single-Domain Tracking

RGB-based tracking. Deep-learning-based methods have dominated the visual object tracking field, from the perspective of either one-shot learning Bertinetto et al. (2016); Fan and Ling (2019); He et al. (2018); Li et al. (2019a, c); Zhang and Peng (2019) or online learning Nam and Han (2016); Danelljan et al. (2017); Zhang et al. (2017, 2018); Jung et al. (2018); Danelljan et al. (2019); Dai et al. (2019); Li et al. (2020). Usually, the latter methods are more accurate (with less training data) but slower than the former. Among them, Nam et al. Nam and Han (2016) proposed the Multi-Domain Network (MDNet), which uses a CNN backbone pretrained offline to extract generic target representations and fully connected layers updated online to adapt to temporal variations of the target objects. In MDNet Nam and Han (2016), each domain corresponds to one video sequence. Due to the effectiveness of this design in visual tracking, we follow this idea to ensure tracking accuracy.

Event-based tracking. Compared with frame-based object tracking, there are only a few works on event-based object tracking Piatkowska et al. (2012); Mitrokhin et al. (2018); Zhu et al. (2020); Ramesh et al. (2020); Barranco et al. (2018); Stoffregen et al. (2019); Chen et al. (2020). Piatkowska et al. Piatkowska et al. (2012) presented a Gaussian mixture model to track pedestrian motion. Barranco et al. Barranco et al. (2018) proposed a real-time clustering algorithm and used Kalman filters to smooth the trajectories. Zhu et al. Zhu et al. (2020) monitored the confidence of the velocity estimate and triggered a tracking command once the confidence reached a certain threshold. Ramesh et al. Ramesh et al. (2020) presented a long-term object tracking framework with a moving event camera under general tracking conditions. Mitrokhin et al. Mitrokhin et al. (2018) proposed a motion compensation method that tracks objects by finding regions inconsistent with the camera motion. Stoffregen et al. Stoffregen et al. (2019) first calculated the optical flow from the events and then warped the events' positions according to a contrast principle to obtain sharp-edged event images; in addition, they assigned each event a weight as its probability and fused the weights during warping, so that events can be classified into different objects or background. Chen et al. Chen et al. (2020) proposed an end-to-end retinal motion regression network to regress 5-DoF motion features.

Although the above studies have achieved good performance in the RGB domain or the event domain, they do not exploit the complementary information between the two domains. Consequently, we investigate the similarities and differences between the event and RGB domains, and propose a common feature extractor and unique feature extractors to learn and fuse valuable complementary features.

2.3 Multi-Domain Tracking

The current popular visual object tracking approaches based on multi-domain data mainly include RGB-D (RGB + depth) tracking Song and Xiao (2013); Xiao et al. (2017); Kart et al. (2018a, b) and RGB-T (RGB + thermal) tracking Li et al. (2018b); Lan et al. (2018); Li et al. (2019b); Zhu et al. (2019); Zhang et al. (2019). Depth cues are usually introduced to handle occlusion in visual object tracking. Images from thermal infrared sensors are not influenced by illumination variations and shadows, and can thus be combined with RGB to improve performance in bad environmental conditions. Since the output of an event camera is an asynchronous stream of events, the raw event stream is fundamentally different from the data of other sensors, which the above state-of-the-art multi-modal tracking methods already handle well. Therefore, it is essential to design a tailored algorithm that leverages RGB data and event data simultaneously.

3 Methodology

Figure 2: The overview of our proposed network. Our pipeline mainly consists of three parts: UEE for extracting unique features from the event domain, UER for extracting unique features from the RGB domain, and CFE for extracting common shared features from both domains. The target is a moving truck under underexposure.

3.1 Background: Event-based Camera

An event-based camera is a bio-inspired sensor. It asynchronously measures light intensity changes of the scene illumination at the pixel level. Therefore, it provides a very high temporal resolution (i.e., up to 1 MHz). Because the light intensity changes are measured on a log scale, an event-based camera can offer a very high dynamic range (i.e., up to 140 dB). An event is triggered when the change of a log-scale pixel intensity exceeds a threshold in the positive or negative direction, resulting in an "ON" or an "OFF" event, respectively. Mathematically, a set of events can be defined as:

$\mathcal{E} = \{e_k\}_{k=1}^{N_e}, \quad e_k = (\mathbf{u}_k, t_k, p_k)$    (1)

where $e_k$ is the $k$-th event; $\mathbf{u}_k = (x_k, y_k)$ is the pixel location of event $e_k$; $t_k$ is the timestamp when the event is triggered; and $p_k \in \{-1, +1\}$ is the polarity of the event, where $-1$ and $+1$ represent OFF and ON events, respectively. Under constant lighting conditions, events are normally triggered by moving edges (e.g., object contours, texture, and depth discontinuities), which makes an event-based camera a natural edge extractor. Therefore, with these unique characteristics, event-based cameras have been introduced to various tasks Stoffregen et al. (2019); Tulyakov et al. (2019); Bi et al. (2019); Kepple et al. (2020); Choi et al. (2020); Pan et al. (2020); Cadena et al. (2021); Mostafavi et al. (2021) in challenging scenes (e.g., low light, fast motion).

Even though event-based cameras are sensitive to edges, they cannot provide absolute intensity or texture information. Besides, since the asynchronous event stream differs significantly from the frames produced by conventional frame-based cameras, vision algorithms designed for frame-based cameras cannot be applied directly. To address this, events are typically first aggregated into a grid-based representation.

3.2 Network Overview

Our approach builds on two key observations. First, although events and RGB data are captured by different types of sensors, they share some similar information, such as target object boundaries; such similar features should be extracted with a consistent strategy. Second, rich textural and semantic cues can be easily captured by a conventional frame-based sensor, whereas an event-based camera can easily capture edge information that may be missed in RGB images under challenging conditions. Therefore, fusing the complementary advantages of each domain enhances the feature representation. Figure 2 illustrates the proposed Multi-domain Collaborative Feature Representation (MCFR) for robust visual object tracking. Specifically, for the first observation, we propose the Common Feature Extractor (CFE), which accepts stacked event images and RGB images as inputs to explore shared common features. For the second observation, we design a Unique Extractor for Event (UEE) based on SNNs to extract edge cues in the event domain that may be missed in the RGB domain under challenging conditions, and a Unique Extractor for RGB (UER) based on DCNNs to extract texture and semantic information in the RGB domain. The outputs of UEE, CFE, and UER are then concatenated, and a convolutional layer with a 1×1 kernel adaptively selects valuable combined features. Finally, the combined features are classified by three fully connected layers with a softmax cross-entropy loss. Following Nam and Han (2016), the last fully connected layer has $K$ branches, one per training sequence, where $K$ is the number of training sequences.
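As a rough illustration of the fusion stage described above, the following PyTorch-style sketch concatenates the three feature streams, applies a 1×1 selection convolution, and classifies with fc4/fc5 and $K$ domain-specific fc6 branches. The class name, channel counts, 3×3 spatial size, and number of domains are assumptions for illustration, not the exact architecture.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of the fusion stage: concatenate UEE/CFE/UER features, select them
    with a 1x1 convolution, then classify with fc4/fc5 and K domain-specific
    fc6 branches (one branch per training sequence)."""
    def __init__(self, c_uee=512, c_cfe=512, c_uer=512, num_domains=100, spatial=3):
        super().__init__()
        self.select = nn.Conv2d(c_uee + c_cfe + c_uer, 512, kernel_size=1)
        self.fc4 = nn.Sequential(nn.Linear(512 * spatial * spatial, 512), nn.ReLU(inplace=True))
        self.fc5 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(inplace=True))
        # One binary (target/background) classification branch per domain.
        self.fc6 = nn.ModuleList([nn.Linear(512, 2) for _ in range(num_domains)])

    def forward(self, f_uee, f_cfe, f_uer, domain_idx=0):
        x = torch.cat([f_uee, f_cfe, f_uer], dim=1)   # channel-wise concatenation
        x = self.select(x)                            # 1x1 conv picks useful channels
        x = torch.flatten(x, 1)
        x = self.fc5(self.fc4(x))
        return self.fc6[domain_idx](x)                # scores for target vs. background

# Example with assumed 3x3 feature maps from each branch.
head = FusionHead()
f = torch.randn(8, 512, 3, 3)
scores = head(f, f, f, domain_idx=5)   # shape: (8, 2)
```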

3.3 Common Feature Extractor

To leverage a consistent scheme for extracting the similar features of the event and RGB domains, we first stack the event stream according to the counts and the latest timestamps of the positive and negative polarities, which makes vision algorithms designed for frames applicable to asynchronous event streams. Mathematically,

$C_p(\mathbf{u}) = \sum_{k=1}^{N_e} \delta(\mathbf{u} - \mathbf{u}_k)\,\delta(p - p_k), \qquad T_p(\mathbf{u}) = \max_{k \le N_e} \, t_k\,\delta(\mathbf{u} - \mathbf{u}_k)\,\delta(p - p_k), \quad t_k \in \mathcal{W}$    (2)

where $\delta(\cdot)$ is the Kronecker delta function, $\mathcal{W}$ is the time window (the interval between adjacent RGB frames), and $N_e$ is the number of events that occurred within $\mathcal{W}$. The stacked event count image records the number of events at each pixel, which conveys the frequency and density information of targets. The stacked event timestamp image contains the temporal cues of the motion, which conveys the direction and speed information of targets. An example of count and timestamp images is shown in Figure 3; we find that the stacked event images and the RGB image indeed share some common features, such as the edge cues of targets.
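A minimal sketch of the stacking in Eq. (2), assuming events arrive as (x, y, t, p) tuples with p in {−1, +1}; the timestamp normalization at the end is an added assumption rather than part of the formulation above.

```python
import numpy as np

def stack_events(events, height, width, t_window):
    """Stack an event list [(x, y, t, p), ...] into per-polarity count and
    latest-timestamp images (cf. Eq. 2). Channel 0 is negative polarity,
    channel 1 is positive polarity."""
    counts = np.zeros((2, height, width), dtype=np.float32)
    timestamps = np.zeros((2, height, width), dtype=np.float32)
    for x, y, t, p in events:
        c = 0 if p < 0 else 1
        counts[c, y, x] += 1.0                              # event frequency/density
        timestamps[c, y, x] = max(timestamps[c, y, x], t)   # most recent event time
    timestamps /= max(t_window, 1e-9)                       # normalize time to [0, 1]
    return counts, timestamps

# Example: three events inside a 33 ms window on a 4x4 sensor.
evs = [(1, 2, 0.010, +1), (1, 2, 0.020, +1), (3, 0, 0.015, -1)]
C, T = stack_events(evs, height=4, width=4, t_window=0.033)
```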

We then employ a Common Feature Extractor (CFE) to extract shared object representations across the different domains. To balance effectiveness and efficiency, we apply the first three convolutional layers of VGGNet-M Simonyan and Zisserman (2014) as the main feature extraction structure of our CFE. Specifically, the convolution kernel sizes are 7×7, 5×5, and 3×3, and the output channels are 96, 256, and 512, respectively. As shown in Figure 2, the whole process is formulated as $F_{C} = \Phi_{CFE}([I, \psi(E)])$, where $I$ denotes the RGB image, $E$ the stacked event images, $[\cdot,\cdot]$ is concatenation, $\psi(\cdot)$ indicates the channel transformation, and $F_{C}$ is the output of CFE.
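The following sketch shows one possible reading of this formulation. It assumes a 1×1 convolution as the channel transformation $\psi$, a widened first convolution to accept the concatenated 6-channel input, and a 107×107 input crop; the pooling layout and strides are also assumptions.

```python
import torch
import torch.nn as nn

class CFE(nn.Module):
    """Sketch of the Common Feature Extractor: a 1x1 channel transformation of
    the 4-channel event stack, concatenation with the RGB image, then three
    VGGNet-M-style convolutional layers (7x7/96, 5x5/256, 3x3/512)."""
    def __init__(self, event_channels=4):
        super().__init__()
        self.channel_transform = nn.Conv2d(event_channels, 3, kernel_size=1)  # psi(E)
        self.features = nn.Sequential(
            nn.Conv2d(6, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True),
        )

    def forward(self, rgb, event_stack):
        x = torch.cat([rgb, self.channel_transform(event_stack)], dim=1)  # [I, psi(E)]
        return self.features(x)                                           # F_C

# Example on a 107x107 crop; the output is a 512x3x3 feature map.
cfe = CFE()
out = cfe(torch.randn(1, 3, 107, 107), torch.randn(1, 4, 107, 107))
```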

Figure 3: Example of counts and timestamp images. Left to right: RGB image, positive counts image, negative counts image, positive timestamp image, and negative timestamp image. In timestamp images, each pixel represents the timestamp of the most recent event, and brighter is more recent.

3.4 Unique Extractor for RGB

Since the raw event stream and RGB data differ in storage format and representation, it is necessary to design an exclusive feature extraction method for each domain. For the RGB domain, we propose the Unique Extractor for RGB (UER) to effectively extract unique texture and semantic features. Specifically, as shown in Figure 2, UER consists of three convolutional layers with small kernels. A major difference between UER and CFE is the convolution kernel size: CFE employs large kernels to provide a larger receptive field so that whole object boundaries in the RGB and event domains can be better extracted, while UER focuses on the rich texture information in the RGB domain with small kernels. This process can be formulated as $F_{R} = \Phi_{UER}(I)$, where $F_{R}$ is the output of UER.
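A sketch of what such a small-kernel extractor could look like. Since the exact kernel sizes, strides, and channel widths are not specified above, all values below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class UER(nn.Module):
    """Sketch of the Unique Extractor for RGB: three convolutional layers with
    small (assumed 3x3) kernels that focus on texture and semantic cues."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=3, stride=2), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, kernel_size=3, stride=2), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True),
        )

    def forward(self, rgb):
        return self.features(rgb)  # F_R: unique texture/semantic features of the RGB domain
```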

3.5 Unique Extractor for Event

Compared with RGB images, event-based data is robust to motion blur and to scenes with a high dynamic range. Besides, from Figure 3, we can see that events provide clear cues about where object movement occurs, which helps the tracking process avoid being disturbed by the surrounding environment. Since SNNs can process the raw event stream directly, we introduce them into our Unique Extractor for Event (UEE) (top branch in Figure 2) to effectively extract unique event features. Among the different mathematical models describing the dynamics of a spiking neuron, we use the Spike Response Model (SRM) Gerstner (1995) in this work. In the SRM Gerstner (1995), the net effect that firing has on the emitting and the receiving neuron is described by two response functions, the refractory kernel $\nu(\cdot)$ and the synaptic kernel $\varepsilon(\cdot)$. The refractory kernel describes the response of the firing neuron to its own spike. The synaptic kernel describes the effect of an incoming spike on the membrane potential at the soma of the postsynaptic neuron. Following Gehrig et al. (2020); Shrestha and Orchard (2018a), we define a feedforward SNN with $L$ layers as:

$\varepsilon(t) = \frac{t}{\tau_s} \exp\!\left(1 - \frac{t}{\tau_s}\right) \Theta(t)$    (3)

$\nu(t) = -2\,\vartheta\, \exp\!\left(-\frac{t}{\tau_r}\right) \Theta(t)$    (4)

$u^{(l)}(t) = W^{(l)} \big(\varepsilon * s^{(l-1)}\big)(t) + \big(\nu * s^{(l)}\big)(t)$    (5)

$s^{(l)}(t) = \sum_{f} \delta\!\big(t - t^{(l,f)}\big), \quad t^{(l,f)} : u^{(l)}\big(t^{(l,f)}\big) \geq \vartheta$    (6)

where $\Theta(\cdot)$ is the Heaviside step function; $\tau_s$ and $\tau_r$ are the time constants of the synaptic kernel and the refractory kernel, respectively; $s^{(l-1)}$ and $W^{(l)}$ are the input spikes and synaptic weights of the $l$-th layer, respectively; and $\vartheta$ denotes the neuron threshold, i.e., a spiking neuron responds with a spike whenever its membrane potential $u^{(l)}(t)$ becomes strong enough to exceed $\vartheta$. To combine SNNs with DCNNs in the overall structure, we average the SNN output over the time dimension $T$; the result $F_{E}$ is the output of our UEE.
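To make Eqs. (3)-(6) concrete, the following NumPy sketch simulates one SRM layer in discrete time. The kernel shapes and all hyper-parameters are assumptions, and this plain forward simulation does not model the surrogate-gradient training that frameworks such as SLAYER provide.

```python
import numpy as np

def srm_layer(input_spikes, weights, tau_s=10.0, tau_r=1.0, theta=10.0, dt=1.0):
    """Discrete-time sketch of one SRM layer: the synaptic kernel filters the
    incoming spike trains, the refractory kernel suppresses the potential after
    each output spike, and a spike is emitted when the potential reaches theta."""
    n_in, T = input_spikes.shape
    n_out = weights.shape[0]
    t = np.arange(0, 10 * tau_s, dt)
    eps = (t / tau_s) * np.exp(1.0 - t / tau_s)   # synaptic kernel epsilon(t)
    nu = -2.0 * theta * np.exp(-t / tau_r)        # refractory kernel nu(t)

    # Filter each input spike train with epsilon, then mix with the weights.
    psp = np.stack([np.convolve(s, eps)[:T] for s in input_spikes])  # (n_in, T)
    drive = weights @ psp                                            # (n_out, T)

    out = np.zeros((n_out, T))
    refrac = np.zeros((n_out, T))
    for k in range(T):
        u = drive[:, k] + refrac[:, k]            # membrane potential u(t)
        fired = u >= theta
        out[fired, k] = 1.0
        for i in np.where(fired)[0]:              # add the refractory response
            span = min(len(nu), T - k)
            refrac[i, k:k + span] += nu[:span]
    return out

# Example: 4 input spike trains, 2 output neurons, 100 time steps.
rng = np.random.default_rng(0)
spikes = (rng.random((4, 100)) < 0.1).astype(float)
w = rng.normal(1.0, 0.2, size=(2, 4)) * 4.0
out_spikes = srm_layer(spikes, w)
```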

3.6 Discussion

After extracting the common shared features and the unique features of both domains, we fuse them by concatenation. Considering that different video sequences have different classes, movement styles, and challenging aspects, we further process the fused features with three fully connected layers named fc4, fc5, and fc6, whose output channels are 512, 512, and 2, respectively. fc6 is a domain-specific layer: for $K$ training sequences there are $K$ fc6 branches, and each branch is a binary classification layer with a softmax cross-entropy loss, responsible for distinguishing the target from the background.

It should be noted that we do not use a very deep network or a complex integration strategy, for the following reasons. First, compared with visual recognition problems, visual tracking requires much lower model complexity because it aims to distinguish only two categories, target and background. Second, since the target is usually small, it is desirable to reduce the input size, which naturally reduces the depth of the network. Finally, due to the need for online training and testing, a smaller network is more efficient. Our main design principle is to keep the network simple yet effective. To the best of our knowledge, this work is the first to explore and utilize the correlation between RGB images and event-based data for visual object tracking. We believe that more related work can be done to further improve such a compact network.

3.7 Training Details

For CFE, we initialize the parameters with the pre-trained VGGNet-M model Simonyan and Zisserman (2014). For UEE, we use the public SLAYER framework Shrestha and Orchard (2018a), which computes the gradient of the loss with respect to the SNN parameters so that first-order optimization methods can be applied; we initialize the UEE parameters with the pre-trained model of Gehrig et al. (2020) and then fix them. We use stochastic gradient descent (SGD) to train the network. The batch size is set to 8 frames, which are randomly selected from the training video sequences. We choose 32 positive samples (IoU overlap ratio with the ground truth bounding box larger than 0.7) and 96 negative samples (IoU overlap ratio with the ground truth bounding box less than 0.5) from each frame, which results in 256 positive and 768 negative samples altogether in a mini-batch. For multi-domain learning with $K$ training sequences, we train the network with a softmax cross-entropy loss. The learning rates of all convolutional layers are set to 0.0001, the learning rates of fc4 and fc5 are set to 0.0001, and the learning rate of fc6 is set to 0.001.
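A sketch of how the per-layer learning rates above could be wired up with SGD parameter groups. The module names reuse the earlier CFE, UER, and FusionHead sketches, the momentum value is an assumption, and the fixed UEE parameters are simply left out of the optimizer.

```python
import torch

def build_optimizer(cfe, uer, head, momentum=0.9):
    """Assign 1e-4 to the convolutional layers and fc4/fc5, and 1e-3 to the
    domain-specific fc6 branches, following the rates stated in the text."""
    param_groups = [
        {"params": cfe.parameters(), "lr": 1e-4},
        {"params": uer.parameters(), "lr": 1e-4},
        {"params": head.select.parameters(), "lr": 1e-4},
        {"params": head.fc4.parameters(), "lr": 1e-4},
        {"params": head.fc5.parameters(), "lr": 1e-4},
        {"params": head.fc6.parameters(), "lr": 1e-3},
    ]
    return torch.optim.SGD(param_groups, lr=1e-4, momentum=momentum)

# Usage (assuming the earlier sketches): build_optimizer(CFE(), UER(), FusionHead())
# Per mini-batch: 8 frames, 32 positives (IoU > 0.7) and 96 negatives (IoU < 0.5)
# per frame, i.e. 256 positive and 768 negative samples in total.
```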

3.8 Tracker Details

During the tracking process, for each test video sequence, we replace the $K$ branches of fc6 with a single new branch. To capture the context of the new sequence and learn video-specific information adaptively, we adopt online fine-tuning. Specifically, we fix all convolutional filters of UEE, CFE, and UER, and fine-tune the fully connected layers fc4, fc5, and the single fc6 branch. The reason is that the convolutional layers capture generic information about tracking, while the fully connected layers are able to learn video-specific information. For online updating, we collect 500 positive samples (IoU overlap ratio with the ground truth bounding box greater than 0.7) and 5000 negative samples (IoU overlap ratio with the ground truth bounding box less than 0.5) as training samples in the first frame. For the $t$-th frame, we collect a set of candidate regions $\{\mathbf{x}^{t}_{i}\}_{i=1}^{N}$ around the previous tracking result by Gaussian sampling. We then feed these candidates to our network and obtain their classification scores, where the positive and negative scores of candidate $\mathbf{x}^{t}_{i}$ are denoted $f^{+}(\mathbf{x}^{t}_{i})$ and $f^{-}(\mathbf{x}^{t}_{i})$, respectively. We select the candidate region with the highest positive score as the target location of the current frame:

$\mathbf{x}^{t*} = \operatorname*{arg\,max}_{i = 1, \dots, N} f^{+}\big(\mathbf{x}^{t}_{i}\big)$    (7)

where $N$ is the number of candidate regions. We use bounding-box regression to handle target scale changes during tracking and to improve the localization accuracy.
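A simplified sketch of one online tracking step corresponding to Eq. (7). The Gaussian sampling spread, scale jitter, and candidate count are assumptions rather than exact settings, and the scoring function stands in for the trained network.

```python
import numpy as np

def track_frame(score_fn, prev_box, n_candidates=256, trans_sigma=0.6, scale_sigma=0.5):
    """Draw candidate boxes around the previous result by Gaussian sampling,
    score them, and keep the candidate with the highest positive score."""
    x, y, w, h = prev_box
    candidates = []
    for _ in range(n_candidates):
        dx = np.random.normal(0.0, trans_sigma) * w          # translation jitter
        dy = np.random.normal(0.0, trans_sigma) * h
        ds = 1.05 ** np.random.normal(0.0, scale_sigma)      # scale jitter
        candidates.append((x + dx, y + dy, w * ds, h * ds))

    scores = [score_fn(c) for c in candidates]               # f+(x_i): positive score
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Example with a dummy scoring function that prefers boxes near (100, 100).
dummy = lambda b: -abs(b[0] - 100) - abs(b[1] - 100)
box, score = track_frame(dummy, prev_box=(98.0, 103.0, 40.0, 60.0))
```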

4 Experiments

4.1 Training Dataset Generation

Supervised learning for visual object tracking requires a large quantity of data. In our case, we need a dataset that contains RGB data from a conventional APS camera (an Active Pixel Sensor is a conventional image sensor in which each pixel cell has a photodetector and one or more active transistors) and events from an event-based camera, together with ground truth bounding boxes. The dataset must meet the following requirements. First, the RGB images and the event-based data must be captured from the same scene, and the data from the different domains must be aligned. Second, it must cover a large variety of scenes with ground truth bounding boxes to avoid overfitting to specific visual patterns. To our knowledge, such a dataset does not yet exist. To meet the above requirements, we generate a synthetic dataset by running the event-camera simulator ESIM Rebecq et al. (2018) on the large-scale short-term generic object tracking database GOT-10k Huang et al. (2019). ESIM Rebecq et al. (2018) has proven its effectiveness in previous works Wang et al. (2019); Rebecq et al. (2019); Stoffregen et al. (2019). GOT-10k Huang et al. (2019) is a large, high-diversity, one-shot tracking database with wide coverage of real-world moving objects; it collects over 10,000 videos of 563 object classes and provides 1.5 million manually annotated tight bounding boxes.

Traditional RGB frames suffer from motion blur under fast motion and have a limited dynamic range that causes loss of detail. Therefore, directly using the RGB and event pairs from ESIM Rebecq et al. (2018) is not ideal for training the network, as our goal is to fully exploit the advantages of event cameras.

Figure 4: PR and SR curves of different tracking results on the OTB2013 Wu et al. (2015) dataset, where the representative PR and SR scores are presented in the legend.

Therefore, we randomly select 100 video sequences and, for each RGB frame in a sequence, randomly increase or decrease the exposure. In this way, we simulate the fact that event-based cameras can provide valuable information that conventional cameras cannot capture under extreme exposure conditions. To verify the effectiveness of our proposed approach, we evaluate it on a standard RGB benchmark and a real event dataset, respectively.
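A possible sketch of such an exposure perturbation, implemented as a random gain applied to the RGB frame; the gain range is an assumption and other perturbations (e.g., gamma changes) would serve the same purpose.

```python
import numpy as np

def random_exposure(rgb, low=0.3, high=1.8, rng=None):
    """Randomly brighten or darken an RGB frame (uint8 values in [0, 255])
    by a multiplicative gain to mimic over-/under-exposure."""
    rng = rng or np.random.default_rng()
    gain = rng.uniform(low, high)
    return np.clip(rgb.astype(np.float32) * gain, 0, 255).astype(np.uint8)
```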

Figure 5: Evaluation results on various challenges compared with state-of-the-art methods on OTB2013 Wu et al. (2015). Left to right: fast_motion, motion_blur, illumination_variations, and background_clutter.
Figure 6: Qualitative evaluation of our method and other trackers, including CFNet Valmadre et al. (2017), GradNet Li et al. (2019c), MDNet Nam and Han (2016), SiamDW-RPN Zhang and Peng (2019), SiamDW-FC Zhang and Peng (2019), and SiamFC Bertinetto et al. (2016), on 8 challenging videos from OTB2013 Wu et al. (2015). From left to right and top to bottom: Ironman, CarScale, Matrix, MotorRolling, Skating1, Skiing, Tiger2, and Trellis. Best viewed zoomed in.

4.2 Evaluation on Standard RGB Benchmark

To demonstrate the effectiveness of our MCFR, we first test it on the standard RGB benchmark OTB2013 Wu et al. (2015). The evaluation is based on two metrics: the Precision Rate (PR) and the Success Rate (SR). SR measures the fraction of frames in which the overlap (IoU) between the ground truth and the predicted bounding box is larger than a threshold; PR measures the fraction of frames in which the center distance between the ground truth and the predicted bounding box is within a given threshold. The one-pass evaluation (OPE) is employed to compare our algorithm with eleven state-of-the-art trackers: SiamDW-RPN Zhang and Peng (2019), MDNet Nam and Han (2016), SiamFC Bertinetto et al. (2016), CFNet Valmadre et al. (2017), SiamRPN Li et al. (2018a), SiamDW-FC Zhang and Peng (2019), DaSiamRPN Zhu et al. (2018), SRDCF Danelljan et al. (2015), GradNet Li et al. (2019c), DiMP Bhat et al. (2019), and ATOM Danelljan et al. (2019). We also apply ESIM Rebecq et al. (2018) to generate event-based data for OTB2013 Wu et al. (2015).
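For reference, a minimal sketch of how PR and SR can be computed from predicted and ground-truth boxes at a single threshold; OTB reports curves over a range of thresholds, and the 20-pixel and 0.5 IoU values below are the conventional representative points, assumed here for illustration.

```python
import numpy as np

def center_error(box_a, box_b):
    """Distance between the centers of two (x, y, w, h) boxes."""
    ca = np.array([box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2])
    cb = np.array([box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2])
    return np.linalg.norm(ca - cb)

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(box_a[0] + box_a[2], box_b[0] + box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[1] + box_a[3], box_b[1] + box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def precision_rate(preds, gts, dist_thresh=20.0):
    """Fraction of frames whose center error is within the threshold (PR)."""
    return np.mean([center_error(p, g) <= dist_thresh for p, g in zip(preds, gts)])

def success_rate(preds, gts, iou_thresh=0.5):
    """Fraction of frames whose IoU exceeds the threshold (one point of the SR curve)."""
    return np.mean([iou(p, g) > iou_thresh for p, g in zip(preds, gts)])
```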

The evaluation results are reported in Figure 4. From the results, we can see that our method outperforms the other trackers on OTB2013 Wu et al. (2015). In particular, our MCFR (95.3%/73.9% in PR/SR) outperforms the second-best tracker MDNet Nam and Han (2016) by 3.1% in SR and is superior to the other trackers in PR. This demonstrates the effectiveness of our structure for extracting common features and unique features from the two domains. In addition, the clear gain over state-of-the-art trackers such as ATOM Danelljan et al. (2019) and DiMP Bhat et al. (2019) suggests that our method makes good use of event-domain information to boost tracking performance.

To analyze what reliable information the event-based data provides, we report results on various challenge attributes for a more detailed view of performance. As shown in Figure 5, our tracker can effectively handle challenging situations in which traditional RGB trackers often lose the target. In particular, under fast motion and motion blur, our tracker greatly surpasses the other trackers. This is because the low latency and high temporal resolution of the event-based camera provide more information about the movement between adjacent RGB frames, which effectively boosts the performance of our tracker. From Figure 5, we can also see that our tracker performs best in illumination variation scenes. Moreover, in background_clutter (where the background near the target has a similar color or texture to the target), our tracker improves significantly because event-based data responds to moving objects rather than to the color or texture of objects.

Methods | fast_drone (AP / AR) | light_variations (AP / AR) | what_is_background (AP / AR) | occlusions (AP / AR)
KCF Henriques et al. (2014) | 0.169 / 0.176 | 0.107 / 0.066 | 0.028 / 0.000 | 0.004 / 0.000
TLD Kalal et al. (2011) | 0.315 / 0.118 | 0.045 / 0.066 | 0.269 / 0.333 | 0.092 / 0.167
SiamFC Bertinetto et al. (2016) | 0.559 / 0.667 | 0.599 / 0.675 | 0.307 / 0.308 | 0.148 / 0.000
ECO Danelljan et al. (2017) | 0.637 / 0.833 | 0.586 / 0.688 | 0.616 / 0.692 | 0.108 / 0.143
DaSiamRPN Zhu et al. (2018) | 0.673 / 0.853 | 0.654 / 0.894 | 0.678 / 0.833 | 0.189 / 0.333
E-MS Barranco et al. (2018) | 0.313 / 0.307 | 0.325 / 0.321 | 0.362 / 0.360 | 0.356 / 0.353
ETD Chen et al. (2019) | 0.738 / 0.897 | 0.842 / 0.933 | 0.653 / 0.807 | 0.431 / 0.647
MCFR (Ours) | 0.802 / 0.931 | 0.853 / 0.933 | 0.734 / 0.871 | 0.437 / 0.644
Table 1: Results (AP / AR) obtained by the competitors and our method on the EED Mitrokhin et al. (2018) dataset. The best results are in red.

4.3 Evaluation on Real Event Dataset

To further prove the effectiveness of our method, we also evaluate it on the real event dataset EED Mitrokhin et al. (2018). EED was recorded with a DAVIS Brändli et al. (2014) event camera in real-world environments and contains, for each video, the event sequence and the corresponding RGB sequence, along with ground truth annotations for the targets. EED contains five sequences: fast_drone, light_variations, what_is_background, occlusions, and multiple_objects. Since multiple_objects involves multiple targets, we use the first four sequences here. Specifically, fast_drone shows a fast-moving drone under very low illumination; in light_variations, a strobe light flashing at a fixed frequency is placed in a dark room; what_is_background shows a thrown ball with a dense net as foreground; and occlusions shows a thrown ball undergoing a short occlusion in a dark environment.

Following Chen et al. (2019), we use two metrics for evaluation: the Average Precision (AP) and the Average Robustness (AR), which describe the accuracy and robustness of the tracker, respectively. The AP can be formulated as:

$\mathrm{AP} = \frac{1}{R\,M} \sum_{r=1}^{R} \sum_{m=1}^{M} \frac{|B_{r,m} \cap G_{m}|}{|B_{r,m} \cup G_{m}|}$    (8)

where $R$ is the number of repetitions of the evaluation (here we set $R = 5$) and $M$ is the number of objects in the current sequence. $B_{r,m}$ is the estimated bounding box in the $r$-th round of the evaluation for the $m$-th object, and $G_{m}$ is the corresponding ground truth. The AR can be formulated as:

$\mathrm{AR} = \frac{1}{R\,M} \sum_{r=1}^{R} \sum_{m=1}^{M} s_{r,m}$    (9)

where $s_{r,m} \in \{0, 1\}$ indicates whether the tracking in the $r$-th round for the $m$-th object is successful or not; it is considered a failure if the corresponding AP value is less than 0.5. We compare our algorithm with seven state-of-the-art methods: KCF Henriques et al. (2014), TLD Kalal et al. (2011), SiamFC Bertinetto et al. (2016), ECO Danelljan et al. (2017), DaSiamRPN Zhu et al. (2018), E-MS Barranco et al. (2018), and ETD Chen et al. (2019). The first five are correlation-filter-based or deep-learning-based traditional RGB trackers, and the remaining two are event-based trackers. The quantitative results are shown in Table 1. We can see that the traditional RGB trackers are severely affected by low light and fast motion, while the event-based trackers cannot obtain satisfactory performance when the events contain too much noise, owing to the lack of image texture information. In contrast, our proposed structure simultaneously obtains texture information from RGB and target edge cues from events, so our method can effectively handle high dynamic range and fast motion conditions.
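A minimal sketch of Eqs. (8)-(9), assuming predictions are organized per evaluation round and per object; details of the EED protocol (e.g., how failed runs are restarted) are not modeled here.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def average_precision_and_robustness(pred_boxes, gt_boxes, fail_thresh=0.5):
    """pred_boxes[r][m] is the estimate for object m in round r; gt_boxes[m] is
    its ground truth. AP averages the IoU over rounds and objects; AR averages
    a success flag that is 0 whenever the per-round, per-object value falls
    below the failure threshold."""
    overlaps = np.array([[iou(pred_boxes[r][m], gt_boxes[m])
                          for m in range(len(gt_boxes))]
                         for r in range(len(pred_boxes))])
    ap = overlaps.mean()
    ar = (overlaps >= fail_thresh).mean()
    return ap, ar
```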

Variant | UEE | CFE | UER | RGB | Event | C | T | SR | PR
MCFR-E | | ✓ | | | ✓ | ✓ | ✓ | 0.397 | 0.556
MCFR-RGB | | ✓ | | ✓ | | | | 0.702 | 0.944
MCFR-ER | | ✓ | | ✓ | ✓ | ✓ | ✓ | 0.719 | 0.949
w/o UEE | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0.729 | 0.950
w/o CFE | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | 0.710 | 0.947
w/o UER | ✓ | ✓ | | ✓ | ✓ | ✓ | ✓ | 0.723 | 0.951
MCFR-C | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | 0.720 | 0.951
MCFR-T | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ | 0.728 | 0.950
MCFR (full) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 0.739 | 0.953
Table 2: Ablation analyses of MCFR and its variants: components (UEE, CFE, UER), input domains (RGB, Event), and event stacking types (counts C, timestamps T) used by each variant, with SR and PR on OTB2013.

4.4 Ablation Study

To verify that RGB images and event-based data jointly promote tracker performance, we implement three variants that use only the CFE branch: 1) MCFR-E, which takes only events as input; 2) MCFR-RGB, which takes only RGB images as input; and 3) MCFR-ER, which takes both events and RGB data as input. The comparison results are shown in Table 2. The results illustrate that the collaborative use of multi-domain information is indeed superior to using a single domain.

To validate that our method effectively extracts common and unique features from the RGB and event domains, we implement three variants of the full MCFR: 1) w/o UEE, which removes the Unique Extractor for Event; 2) w/o CFE, which removes the Common Feature Extractor; and 3) w/o UER, which removes the Unique Extractor for RGB. From Table 2, we can see that MCFR is superior to w/o UEE, which suggests that the SNN-based UEE helps exploit the event-based data and thereby improves tracking performance. Besides, the clear margin of MCFR over w/o CFE demonstrates that extracting common features of targets is essential. The superior performance of MCFR over w/o UER suggests that the unique texture features from RGB are important for tracking.

We also explore how different ways of stacking events affect performance. MCFR-C and MCFR-T stack the event stream using only the counts and only the latest timestamps, respectively. From Table 2, we can see that the full MCFR outperforms both MCFR-C and MCFR-T, which verifies that the count images C record all the events that occurred within a period, the timestamp images T encode motion cues, and the two are complementary.

4.5 Failure Cases Analysis

Our method does have limitations; failure examples are shown in Figure 7. When the target is static, the event camera cannot provide its edge cues, so no useful information is available in the event domain. At the same time, when an object similar to the target moves around it, the similar colors and textures interfere with the target-related information provided by the RGB domain. In such cases, the events provide misleading information about the moving object, which causes incorrect localization.

Figure 7: Failure cases. The target is stationary and a moving object similar to the target appears nearby. The red box is the ground truth; the green box is our result.

5 Conclusion

In this paper, we propose the Multi-domain Collaborative Feature Representation (MCFR) to effectively extract and fuse common and unique features from the RGB and event domains for robust visual object tracking in challenging conditions such as fast motion and high dynamic range. Specifically, we apply CFE to extract common features, and design UEE based on SNNs and UER based on DCNNs to represent the unique features of the event and RGB data, respectively. Extensive experiments on an RGB tracking benchmark and a real event dataset suggest that the proposed tracker achieves outstanding performance. In future work, we will explore upgrading our event-based module so that it can be easily plugged into existing RGB trackers to improve their performance in challenging conditions.

Acknowledgements. This work was supported in part by the National Natural Science Foundation of China under Grant 91748104, Grant 61972067, and the Innovation Technology Funding of Dalian (Project No. 2018J11CY010, 2020JJ26GX036).

Conflict of interest. Jiqing Zhang, Kai Zhao, Bo Dong, Yingkai Fu, Yuxin Wang, Xin Yang and Baocai Yin declare that they have no conflict of interest.

References

  • F. Barranco, C. Fermuller, and E. Ros (2018) Real-time clustering and multi-target tracking using event-based sensors. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Cited by: §2.2, §4.3, Table 1.
  • L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, Cited by: §1, §2.2, Figure 6, §4.2, §4.3, Table 1.
  • G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte (2019) Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §4.2, §4.2.
  • Y. Bi, A. Chadha, A. Abbas, E. Bourtsoulatze, and Y. Andreopoulos (2019) Graph-based object classification for neuromorphic vision sensing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §3.1.
  • C. Brändli, R. Berner, M. Yang, S. Liu, and T. Delbrück (2014) A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits. Cited by: §4.3.
  • P. R. G. Cadena, Y. Qian, C. Wang, and M. Yang (2021) SPADE-e2vid: spatially-adaptive denormalization for event-based video reconstruction. IEEE Transactions on Image Processing. Cited by: §3.1.
  • H. Chen, D. Suter, Q. Wu, and H. Wang (2020) End-to-end learning of object motion estimation from retinal events for event-based object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §2.2.
  • H. Chen, Q. Wu, Y. Liang, X. Gao, and H. Wang (2019) Asynchronous tracking-by-detection on adaptive time surfaces for event-based object tracking. In Proceedings of the 27th ACM International Conference on Multimedia, Cited by: §4.3, Table 1.
  • J. Choi, K. Yoon, et al. (2020) Learning to super resolve intensity images from events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • G. K. Cohen, G. Orchard, S. Leng, J. Tapson, R. B. Benosman, and A. Van Schaik (2016) Skimming digits: neuromorphic classification of spike-encoded images. Frontiers in neuroscience. Cited by: §2.1.
  • K. Dai, D. Wang, H. Lu, C. Sun, and J. Li (2019) Visual tracking via adaptive spatially-regularized correlation filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.2.
  • M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2019) Atom: accurate tracking by overlap maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.2, §4.2, §4.2.
  • M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg (2017) Eco: efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.2, §4.3, Table 1.
  • M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg (2015) Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §4.2.
  • H. Fan and H. Ling (2019) Siamese cascaded region proposal networks for real-time visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.2.
  • D. Gehrig, A. Loquercio, K. G. Derpanis, and D. Scaramuzza (2019) End-to-end learning of representations for asynchronous event-based data. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.1.
  • M. Gehrig, S. B. Shrestha, D. Mouritzen, and D. Scaramuzza (2020) Event-based angular velocity regression with spiking networks. In 2020 IEEE International Conference on Robotics and Automation (ICRA), Cited by: §2.1, §3.5, §3.7.
  • W. Gerstner (1995) Time structure of the activity in neural network models. Physical review E. Cited by: §3.5.
  • A. He, C. Luo, X. Tian, and W. Zeng (2018) A twofold siamese network for real-time object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.2.
  • J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2014) High-speed tracking with kernelized correlation filters. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4.3, Table 1.
  • L. Huang, X. Zhao, and K. Huang (2019) Got-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §4.1.
  • I. Jung, J. Son, M. Baek, and B. Han (2018) Real-time mdnet. In Proceedings of the European Conference on Computer Vision, Cited by: §1, §2.2.
  • Z. Kalal, K. Mikolajczyk, and J. Matas (2011) Tracking-learning-detection. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4.3, Table 1.
  • U. Kart, J. Kämäräinen, J. Matas, L. Fan, and F. Cricri (2018a) Depth masked discriminative correlation filter. In 2018 24th International Conference on Pattern Recognition, Cited by: §1, §1, §2.3.
  • U. Kart, J. Kämäräinen, J. Matas, and J. Matas (2018b) How to make an rgbd tracker?. In Proceedings of the European Conference on Computer Vision, Cited by: §1, §1, §2.3.
  • D. R. Kepple, D. Lee, C. Prepsius, V. Isler, I. M. Park, and D. D. Lee (2020) Jointly learning visual motion and confidence from local patches in event cameras. In Proceedings of the European Conference on Computer Vision, Cited by: §3.1.
  • X. Lan, M. Ye, S. Zhang, and P. C. Yuen (2018) Robust collaborative discriminative learning for rgb-infrared tracking. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.3.
  • B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2019a) Siamrpn++: evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.2.
  • B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018a) High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
  • C. Li, A. Lu, A. Zheng, Z. Tu, and J. Tang (2019b) Multi-adapter rgbt tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Cited by: §1, §1, §2.3.
  • C. Li, C. Zhu, Y. Huang, J. Tang, and L. Wang (2018b) Cross-modal ranking with soft consistency and noisy labels for robust rgb-t tracking. In Proceedings of the European Conference on Computer Vision, Cited by: §1, §1, §2.3.
  • P. Li, B. Chen, W. Ouyang, D. Wang, X. Yang, and H. Lu (2019c) Gradnet: gradient-guided network for visual object tracking. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Figure 1, §1, §2.2, Figure 6, §4.2.
  • W. Li, X. Li, O. E. Bourahla, F. Huang, F. Wu, W. Liu, Z. Wang, and H. Liu (2020) Progressive multistage learning for discriminative tracking. IEEE Transactions on Cybernetics. Cited by: §1, §2.2.
  • H. Mei, Y. Liu, Z. Wei, D. Zhou, X. Xiaopeng, Q. Zhang, and X. Yang (2021) Exploring dense context for salient object detection. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §1.
  • H. Mei, X. Yang, Y. Wang, Y. Liu, S. He, Q. Zhang, X. Wei, and R. W. Lau (2020) Don’t hit me! glass detection in real-world scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • A. Mitrokhin, C. Fermuller, C. Parameshwara, and Y. Aloimonos (2018) Event-based moving object detection and tracking. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Cited by: §2.2, §4.3, Table 1.
  • M. Mostafavi, L. Wang, and K. Yoon (2021) Learning to reconstruct hdr images from events, with applications to depth and flow prediction. International Journal of Computer Vision. Cited by: §3.1.
  • H. Nam and B. Han (2016) Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Figure 1, §1, §2.2, §3.2, Figure 6, §4.2, §4.2.
  • E. O. Neftci, H. Mostafa, and F. Zenke (2019) Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine. Cited by: §2.1.
  • L. Pan, M. Liu, and R. Hartley (2020) Single image optical flow estimation with an event camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.
  • E. Piatkowska, A. N. Belbachir, S. Schraml, and M. Gelautz (2012) Spatiotemporal multiple persons tracking using dynamic vision sensor. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §2.2.
  • Y. Qiao, Y. Liu, X. Yang, D. Zhou, M. Xu, Q. Zhang, and X. Wei (2020a) Attention-guided hierarchical structure aggregation for image matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • Y. Qiao, Y. Liu, Q. Zhu, X. Yang, Y. Wang, Q. Zhang, and X. Wei (2020b) Multi-scale information assembly for image matting. In Computer Graphics Forum, Cited by: §1.
  • B. Ramesh, S. Zhang, H. Yang, A. Ussa, M. Ong, G. Orchard, and C. Xiang (2020) E-tld: event-based framework for dynamic object tracking. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.2.
  • H. Rebecq, D. Gehrig, and D. Scaramuzza (2018) ESIM: an open event camera simulator. In Conference on Robot Learning, Cited by: §4.1, §4.1, §4.2.
  • H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza (2019) Events-to-video: bringing modern computer vision to event cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.1.
  • W. Ren, X. Wang, J. Tian, Y. Tang, and A. B. Chan (2020) Tracking-by-counting: using network flows on crowd density maps for tracking multiple targets. IEEE Transactions on Image Processing. Cited by: §1.
  • S. B. Shrestha and G. Orchard (2018a) SLAYER: spike layer error reassignment in time. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Cited by: §3.5, §3.7.
  • S. B. Shrestha and G. Orchard (2018b) Slayer: spike layer error reassignment in time. In Advances in Neural Information Processing Systems, Cited by: §2.1.
  • S. B. Shrestha and Q. Song (2017) Robustness to training disturbances in spikeprop learning. IEEE transactions on neural networks and learning systems. Cited by: §2.1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §3.3, §3.7.
  • S. Song and J. Xiao (2013) Tracking revisited using rgbd camera: unified benchmark and baselines. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §1, §2.3.
  • T. Stoffregen, G. Gallego, T. Drummond, L. Kleeman, and D. Scaramuzza (2019) Event-based motion segmentation by motion compensation. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.2, §3.1, §4.1.
  • A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida (2019) Deep learning in spiking neural networks. Neural Networks. Cited by: §2.1.
  • S. Tulyakov, F. Fleuret, M. Kiefel, P. Gehler, and M. Hirsch (2019) Learning an event sequence embedding for dense event-based deep stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §3.1.
  • J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr (2017) End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Figure 6, §4.2.
  • L. Wang, Y. Ho, K. Yoon, et al. (2019) Event-based high dynamic range image and very high frame rate video generation using conditional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.1.
  • X. Wang, B. Fan, S. Chang, Z. Wang, X. Liu, D. Tao, and T. S. Huang (2017) Greedy batch-based minimum-cost flows for tracking multiple objects. IEEE Transactions on Image Processing. Cited by: §1.
  • Y. Wu, J. Lim, and M. Yang (2015) Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Figure 1, Figure 4, Figure 5, Figure 6, §4.2, §4.2.
  • J. Xiao, R. Stolkin, Y. Gao, and A. Leonardis (2017) Robust fusion of color and depth data for rgb-d target tracking using adaptive range-invariant depth models and spatio-temporal consistency constraints. IEEE transactions on cybernetics. Cited by: §1, §2.3.
  • K. Xu, X. Yang, B. Yin, and R. W. Lau (2020) Learning to restore low-light images via decomposition-and-enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • X. Yang, H. Mei, K. Xu, X. Wei, B. Yin, and R. W. Lau (2019) Where is my mirror?. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §1.
  • X. Yang, H. Mei, J. Zhang, K. Xu, B. Yin, Q. Zhang, and X. Wei (2018a) DRFN: deep recurrent fusion network for single-image super-resolution with large factors. IEEE Transactions on Multimedia. Cited by: §1.
  • X. Yang, K. Xu, S. Chen, S. He, B. Y. Yin, and R. Lau (2018b) Active matting. Advances in Neural Information Processing Systems. Cited by: §1.
  • X. Yang, K. Xu, Y. Song, Q. Zhang, X. Wei, and R. W. Lau (2018c) Image correction via deep reciprocating hdr transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • F. Zenke and S. Ganguli (2018) Superspike: supervised learning in multilayer spiking neural networks. Neural computation. Cited by: §2.1.
  • J. Zhang, C. Long, Y. Wang, H. Piao, H. Mei, X. Yang, and B. Yin (2021) A two-stage attentive network for single image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §1.
  • J. Zhang, C. Long, Y. Wang, X. Yang, H. Mei, and B. Yin (2020) Multi-context and enhanced reconstruction network for single image super resolution. In 2020 IEEE International Conference on Multimedia and Expo, Cited by: §1.
  • L. Zhang, M. Danelljan, A. Gonzalez-Garcia, J. van de Weijer, and F. Shahbaz Khan (2019) Multi-modal fusion for end-to-end rgb-t tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Cited by: §1, §2.3.
  • T. Zhang, S. Liu, C. Xu, B. Liu, and M. Yang (2017) Correlation particle filter for visual tracking. IEEE Transactions on Image Processing. Cited by: §1, §2.2.
  • T. Zhang, C. Xu, and M. Yang (2018) Learning multi-task correlation particle filters for visual tracking. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2.2.
  • Z. Zhang and H. Peng (2019) Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Figure 1, §1, §2.2, Figure 6, §4.2.
  • Q. Zhu, J. Triesch, and B. E. Shi (2020) An event-by-event approach for velocity estimation and object tracking with an active event camera. IEEE Journal on Emerging and Selected Topics in Circuits and Systems. Cited by: §2.2.
  • Y. Zhu, C. Li, B. Luo, J. Tang, and X. Wang (2019) Dense feature aggregation and pruning for rgbt tracking. In Proceedings of the 27th ACM International Conference on Multimedia, Cited by: §1, §1, §2.3.
  • Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu (2018) Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision, Cited by: §4.2, §4.3, Table 1.