A Survey for Deep RGBT Tracking

by Zhangyong Tang, et al.

Visual object tracking with visible (RGB) and thermal infrared (TIR) electromagnetic waves, RGBT tracking for short, has recently drawn increasing attention in the tracking community. Given the rapid development of deep learning, this paper surveys recent RGBT trackers based on deep neural networks. First, we briefly introduce the RGBT trackers that fall into this category. Then, the existing RGBT trackers are compared quantitatively on several challenging benchmarks. MDNet and Siamese architectures are the two mainstream frameworks in the RGBT community, with the former being more widely used. Trackers based on MDNet achieve higher performance, while Siamese-based trackers satisfy the real-time requirement. Since the large-scale dataset LasHeR has been published, end-to-end frameworks such as Siamese networks and Transformers deserve further consideration as a way to obtain both real-time speed and more robust performance. Furthermore, mathematical interpretation should receive more attention when designing the network. This survey can serve as a look-up table for researchers concerned with RGBT tracking.



I Introduction

Given the first frame as the prior, visual object tracking aims to predict a compact bounding box of the object throughout the whole sequence. Because visible-light sensors are inexpensive, RGB tracking dominates the tracking community. However, recent research has noted its shortcomings in extreme environmental conditions, e.g., fog and night. RGB data is imaged from the electromagnetic waves reflected by the object, so any factor that interferes with the reflection directly influences the imaging procedure. In contrast, thermal data is imaged from the thermal radiation emitted by any object with a temperature above absolute zero, and it is therefore considered complementary to the RGB modality. Recently, thanks to the development of all-in-one RGB and TIR equipment, multi-modal cues can be captured simultaneously and applied to downstream tasks, e.g., tracking [zhu2020complementary, ZhuTCSVT, LADCF, GFSDCF], image fusion [li2020nestfuse, luo2017image, li2017multi, luo2016novel] and segmentation [Li2015CVPR-seg, Li2016TIP-seg]. This paper concentrates on multi-modal tracking, i.e., RGBT tracking. Fig. 1 shows some pairs of RGBT data.

The rest of this paper is arranged as follows. We briefly introduce the existing deep RGBT trackers in Section II. Their quantitative performance on several benchmarks is then compared in Section III. Section IV discusses open issues, and the conclusion is given in Section V.

Fig. 1: An illustration of paired RGBT data. The sequences, i.e., soccer2, walking40, womanred, tree5, balancebike and biketwo, are selected from RGBT210 [RGBT210]. The region in gray indicates the TIR images while the other represents images in RGB space.

II RGBT Trackers

In this section, we will introduce the mainstream frameworks employed in the RGBT tracking community.

Fig. 2: Pipeline of MDNet-based RGBT trackers. Feature extraction and multi-modal fusion are completed in the first blue block. The block rendered in yellow is the last fully-connected (FC) layer, which is initialized online before the tracking procedure.

Fig. 3: Pipeline of the published Siamese-based RGBT trackers. For feature-level fusion, the multi-modal fusion module is located between feature extraction and similarity evaluation. For pixel-level and decision-level fusion, the network architecture degrades to a single-modality configuration and is therefore not included in this figure.

II-A MDNet-based RGBT Trackers

MDNet [MDNet], the winner of the VOT-2015 challenge [VOT2015], treats each sequence as a single domain and uses online updating to acquire object-specific cues, obtaining robust tracking performance. Consequently, many RGBT trackers based on MDNet have been proposed; they are listed in TABLE I. Fig. 2 shows the basic extension of MDNet to RGBT tracking. Generally, the feature extractor and fusion module are trained offline while the last fully-connected layer is trained online. For example, at the beginning of tracking 'Sequence 1', the first frame is used to train the parameters of the last fully-connected layer, which makes the model more object-specific.
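The offline/online split described above can be illustrated with a toy example. The following is only a minimal sketch under strong assumptions, not MDNet's actual implementation: the fused features are hypothetical hand-written vectors standing in for the output of the frozen, offline-trained blocks, and the online-trained last layer is reduced to a single logistic unit fitted by plain gradient descent.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train_last_layer(features, labels, epochs=200, lr=0.5):
    """Fit the weights and bias of one logistic output layer on frozen features."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log-loss w.r.t. the pre-activation
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def score(w, b, x):
    """Target-vs-background score of one candidate feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical fused features of boxes sampled in the first frame
# (1 = target, 0 = background); a real tracker would compute them
# with the frozen feature extractor and fusion module.
first_frame_feats = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
first_frame_labels = [1, 1, 0, 0]
w, b = train_last_layer(first_frame_feats, first_frame_labels)

# In a later frame, candidates are scored with the adapted layer
# and the highest-scoring one is kept as the prediction.
candidates = [[0.15, 0.8], [0.85, 0.15]]
best = max(range(len(candidates)), key=lambda i: score(w, b, candidates[i]))
print(best)
```

Only `w` and `b` change during tracking in this sketch, which mirrors the idea that object-specific information is absorbed by the last layer while the shared representation stays fixed.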

Since the tracking process is generally similar, the main difference between algorithms lies in the feature extractor and the fusion module, especially the latter. We therefore focus on the fusion modules of the methods described below. In FANet [FANet], the feature channels from the RGB and TIR modalities are first concatenated for inter-modality interaction and then separated to calculate the fusion weights, which are normalized by a Softmax operator applied across modalities. DAPNet [DAPNet] achieves fusion by recurrently using a sub-network, which consists of a convolutional layer, a Rectified Linear Unit (ReLU) activation [ReLU] and a normalization layer, at different feature levels. Compared with the coarse fusion sub-network used in DAPNet, DAFNet [DAFNet] further designs an adaptive fusion module similar to that in FANet. MANet [MANet] is designed to extract modality-specific, modality-shared and object-specific cues through modality, generality and instance adapters, respectively. [TODAT] investigates the potential of attention mechanisms in RGBT tracking, i.e., local attention for offline training and global attention for online testing. Similarly, CMPP [CMPP] employs an attention mechanism to model inter-modality correlation, which is further extended to the temporal dimension because of the significance of temporal information in video analysis. In DMCNet [DMCNet], a duality-gated mutual conditional module performs multi-modal fusion in a mutually guided way. Building on MANet, MANet++ [MANet++] enhances the features within each modality before the fusion stage. In CBPNet [CBPNet], a quality-aware fusion block is constructed to evaluate the significance of each modality. Aiming to strengthen the stronger modality and suppress the weaker one, M5L [M5L] applies a lightweight attention-based fusion module. All the trackers mentioned above in this sub-section maintain a single fused feature representation after the cross-modal interactions. However, to keep the modality-specific characteristics discriminative, some methods also retain the features of the RGB and TIR modalities. Specifically, MaCNet [MaCNet] learns the fusion weights through an independent modal-aware attention network and competitive learning [competitive-learning], for which the RGB and TIR features are retained, and leverages the results of the single-modality branches as well as the modality-fused branch. TFNet [TFNet] deploys a trident architecture in which the branches are dedicated to the RGB, TIR and fused features, respectively.
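To make the idea of Softmax-normalized fusion weights concrete, the following is a minimal sketch and not the implementation of FANet or any other tracker above. It assumes that each modality has already been assigned a scalar reliability score (in a real tracker this would be predicted by a learned sub-network); the function names `softmax` and `fuse` are hypothetical.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(rgb_feat, tir_feat, rgb_score, tir_score):
    """Weighted sum of two feature vectors.

    The two weights come from a softmax taken across the modalities,
    so they are positive and sum to one.
    """
    w_rgb, w_tir = softmax([rgb_score, tir_score])
    fused = [w_rgb * r + w_tir * t for r, t in zip(rgb_feat, tir_feat)]
    return fused, (w_rgb, w_tir)


# Equal scores give a plain average of the two modalities.
print(fuse([1.0, 2.0], [3.0, 4.0], 0.0, 0.0))

# A higher RGB score shifts the fused feature towards the RGB input.
print(fuse([1.0, 2.0], [3.0, 4.0], 2.0, -1.0))
```

Because the weights compete through the softmax, increasing the score of one modality necessarily decreases the contribution of the other.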
Different from most existing RGBT trackers, CAT [CAT], ADRNet [ADRNet] and APFNet [APFNet] tailor their network construction to both modality-specific challenges (e.g., illumination variation in RGB and thermal crossover in TIR) and modality-shared challenges (e.g., scale variation). In CAT [CAT], all the challenge-specific features are adaptively aggregated and then used to complement the basic learning procedure of both modalities. ADRNet [ADRNet] designs an attribute-driven residual block to model the appearance under different circumstances, and all the information is then ensembled before the residual connection to the basic learned feature representation. Similarly, in APFNet [APFNet], the aggregated multi-challenge features are combined with the features learned solely from the RGB or TIR modality through transformer encoder and decoder blocks.

II-B Siamese-based RGBT Trackers

In visual object tracking, the Siamese network was popularized by the pioneering work SiamFC [SiamFC] owing to the efficiency brought by its end-to-end training scheme. Overall, it aims to learn a general similarity evaluation metric; its pipeline is shown in Fig. 3. As the figure shows, the existing Siamese-based trackers use different feature extractors for the RGB and TIR modalities. In the published Siamese-based RGBT trackers, a multi-modal fusion module then follows to aggregate the features. After that, the same similarity evaluation strategy is applied for both classification and regression before their corresponding heads.
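The similarity evaluation at the heart of a Siamese tracker can be sketched as a cross-correlation between template features and search-region features. The example below is a toy illustration under simplifying assumptions: the features are single-channel 2-D lists written by hand rather than learned multi-channel deep features, and the helper names `cross_correlate` and `argmax2d` are hypothetical.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (both 2-D lists) and return the response map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            score = sum(
                search[i + u][j + v] * template[u][v]
                for u in range(th)
                for v in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def argmax2d(matrix):
    """Return the (row, col) position of the largest value."""
    best = max((v, i, j) for i, r in enumerate(matrix) for j, v in enumerate(r))
    return best[1], best[2]


template = [[1, 2],
            [3, 4]]
# The template pattern is embedded at row 1, column 2 of the search region.
search = [[0, 0, 0, 0],
          [0, 0, 1, 2],
          [0, 0, 3, 4],
          [0, 0, 0, 0]]

response = cross_correlate(search, template)
print(argmax2d(response))
```

The peak of the response map indicates where the search region is most similar to the template; in the trackers discussed here, the classification and regression heads operate on such similarity features.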

Since the core of a multi-modal task lies in the combination of multi-modal cues, the description of the following methods focuses on the fusion mechanism. At the beginning, thermal data was employed by replacing one of the channels of the RGB data in [SiamRGT]. Treating SiamFC [SiamFC] as the baseline, SiamFT [SiamFT] uses simple concatenation for the features of the template inputs, while those of the search inputs are fused using the learned modality reliabilities. Based on SiamFT, DSiamMFT [DSiamMFT] further employs a dynamic online-learned transformation strategy as well as multi-level semantic features. Similarly to [SiamRGT], which fuses at the input stage, DuSiamRT [DuSiamRT] uses a modality-wise channel attention mechanism to fuse the features of the template inputs while leaving the features of the search inputs unchanged. SiamCDA [SiamCDA], whose baseline is an advanced anchor-based tracker, i.e., SiamRPN++ [SiamRPN++], introduces information from one modality into the other through generated weights. Furthermore, to cope with the shortage of annotated RGBT data for large-scale network training at that time, the LSS Dataset (https://github.com/RaymondCover/LSS-Dataset) is synthesized in a statistical way, which contributes to its superior performance.
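Fusion at the input stage by channel replacement can be sketched as follows. This is an illustrative assumption-laden example rather than the procedure of [SiamRGT]: the text does not state which channel is replaced, so the `channel` argument defaults arbitrarily to the last one, and the RGB and TIR images are assumed to be aligned and of equal size.

```python
def replace_channel(rgb_image, tir_image, channel=2):
    """Return a copy of `rgb_image` with one colour channel replaced by TIR values.

    rgb_image: 2-D list of (r, g, b) tuples.
    tir_image: 2-D list of scalar thermal intensities, same height and width.
    channel:   index of the colour channel to overwrite (an assumption here).
    """
    fused = []
    for rgb_row, tir_row in zip(rgb_image, tir_image):
        new_row = []
        for pixel, thermal in zip(rgb_row, tir_row):
            values = list(pixel)
            values[channel] = thermal
            new_row.append(tuple(values))
        fused.append(new_row)
    return fused


rgb = [[(10, 20, 30), (40, 50, 60)]]
tir = [[200, 210]]
print(replace_channel(rgb, tir))
```

The result is still a three-channel image, which is why a pixel-level scheme of this kind can be fed to an unmodified single-modality tracker.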

II-C Other Deep RGBT Trackers

Besides the MDNet-based and Siamese-based frameworks mentioned above, some RGBT trackers are built on other frameworks. In [fusionnet], the multi-modal information is combined straightforwardly by addition. mfDiMP [mfDimp] is based on DiMP [Dimp], a strong RGB tracker that many researchers follow. Specifically, DiMP is extended to the TIR modality, and a TIR dataset generated from GOT10k [GOT-10K] is used to train the network. JMMAC [JMMAC], which ranks first on the public datasets of VOT-RGBT2019 [VOT2019] and VOT-RGBT2020 [VOT2020], learns the fusion weights through two sub-networks for local and global attention, respectively.

| Trackers | Baseline | Year | Published | Reference |
|---|---|---|---|---|
| - | Others | 2018 | NeuroComputing | [fusionnet] |
| mfDiMP | Others | 2019 | ICCVW | [mfDimp] |
| MANet | MDNet-based | 2019 | ICCVW | [MANet] |
| DAPNet | MDNet-based | 2019 | ACM MM | [DAPNet] |
| DAFNet | MDNet-based | 2019 | ICCVW | [DAFNet] |
| - | MDNet-based | 2019 | ICIP | [TODAT] |
| - | Siamese-based | 2019 | FUSION | [SiamRGT] |
| SiamFT | Siamese-based | 2019 | IEEE Access | [SiamFT] |
| MaCNet | MDNet-based | 2020 | Sensors | [MaCNet] |
| DMCNet | MDNet-based | 2020 | arXiv | [DMCNet] |
| CMPP | MDNet-based | 2020 | CVPR | [CMPP] |
| CAT | MDNet-based | 2020 | ECCV | [CAT] |
| DSiamMFT | Siamese-based | 2020 | Signal Processing: Image Communication | [DSiamMFT] |
| MANet++ | MDNet-based | 2021 | IEEE TIP | [MANet++] |
| CBPNet | MDNet-based | 2021 | IEEE TMM | [CBPNet] |
| TFNet | MDNet-based | 2021 | IEEE TCSVT | [TFNet] |
| FANet | MDNet-based | 2021 (2018) | IEEE TIV (arXiv) | [FANet] |
| ADRNet | MDNet-based | 2021 | IJCV | [ADRNet] |
| M5L | MDNet-based | 2021 | IEEE TIP | [M5L] |
| SiamCDA | Siamese-based | 2021 | IEEE TCSVT | [SiamCDA] |
| DuSiamRT | Siamese-based | 2021 | The Visual Computer | [DuSiamRT] |
| APFNet | MDNet-based | 2022 | AAAI | [APFNet] |

TABLE I: A collection of the existing deep RGBT trackers.

III Datasets and Results

In this section, we first introduce the existing RGBT datasets, i.e., GTOT [GTOT], RGBT210 [RGBT210], RGBT234 [RGBT234], VOT-RGBT2019 [VOT2019], VOT-RGBT2020 [VOT2020] and LasHeR [LasHeR]. After that, the existing deep RGBT trackers are compared on multiple benchmarks and the results are analysed.

| Benchmark | Year | Num of Sequences | Aligned | Category | Num of attributes | Reference |
|---|---|---|---|---|---|---|
| GTOT | 2016 | 50 | N | 9 | 7 | [GTOT] |
| RGBT210 | 2017 | 210 | Y | 22 | 12 | [RGBT210] |
| RGBT234 | 2019 | 234 | N | 22 | 12 | [RGBT234] |
| VOT-RGBT2019 | 2019 | 60 | - | 13 | 12 | [VOT2019] |
| VOT-RGBT2020 | 2019 | 60 | - | 13 | 12 | [VOT2020] |
| LasHeR | 2021 | 245 | Y | 32 | 19 | [LasHeR] |

TABLE II: Overview of the RGBT datasets (testing sets only).

III-A Datasets and Evaluation Metrics

TABLE II shows detailed information about these six benchmarks. Here 'Num of Sequences' is the number of paired RGBT videos. 'Aligned' indicates whether the RGB and TIR images are aligned. Specifically, if there is a single groundtruth file ('Y'), the RGB and TIR modalities are considered aligned. Note that the VOT-RGBT2019 dataset is a subset of RGBT234, and the difference between the VOT-RGBT2019 and VOT-RGBT2020 benchmarks lies in the testing protocol. Therefore, their statistics are the same [VOT2020].

The same evaluation metrics, i.e., Precision and Success, are employed on the GTOT [GTOT], RGBT210 [RGBT210], RGBT234 [RGBT234] and LasHeR [LasHeR] datasets. Precision rate is based on the distance between the centers of the groundtruth bounding box and the predicted one: it is the ratio of frames in which this distance is within a given threshold. Success rate is the ratio of frames in which the Intersection over Union (IoU) between the prediction and its corresponding label exceeds a given threshold.
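The two metrics can be computed per sequence as in the following minimal sketch. It is not the code of any benchmark toolkit: boxes are assumed to be given as (x, y, w, h), and the default thresholds of 20 pixels and 0.5 IoU are illustrative assumptions only, since benchmarks typically sweep the thresholds and each defines its own reporting convention.

```python
from math import hypot


def center(box):
    """Centre (cx, cy) of an (x, y, w, h) box."""
    x, y, w, h = box
    return x + w / 2.0, y + h / 2.0


def iou(a, b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def precision_rate(preds, gts, dist_thr=20.0):
    """Fraction of frames whose centre location error is within `dist_thr` pixels."""
    hits = 0
    for p, g in zip(preds, gts):
        (px, py), (gx, gy) = center(p), center(g)
        if hypot(px - gx, py - gy) <= dist_thr:
            hits += 1
    return hits / len(gts)


def success_rate(preds, gts, iou_thr=0.5):
    """Fraction of frames whose IoU with the groundtruth exceeds `iou_thr`."""
    return sum(iou(p, g) > iou_thr for p, g in zip(preds, gts)) / len(gts)


# Four frames: exact hit, small drift, complete miss, wrong scale.
gts = [(10, 10, 40, 40), (12, 10, 40, 40), (14, 10, 40, 40), (16, 10, 40, 40)]
preds = [(10, 10, 40, 40), (14, 12, 40, 40), (60, 60, 40, 40), (16, 10, 20, 20)]

print(precision_rate(preds, gts))  # 0.75
print(success_rate(preds, gts))    # 0.5
```

The last frame shows why both metrics are reported: the predicted centre is close enough to count for Precision, but the box is too small to count for Success.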

Accuracy, Robustness and Expected Average Overlap (EAO) are the measurements used in VOT-RGBT2019 [VOT2019] and VOT-RGBT2020 [VOT2020]. Accuracy represents the overlap between the prediction and the groundtruth. Robustness is designed to measure the ratio of tracking failures over the total number of image frames. EAO is considered the most important measure and comprehensively indicates the superiority of a tracker.

| Trackers | GTOT [GTOT] Precision | GTOT [GTOT] Success | RGBT210 [RGBT210] Precision | RGBT210 [RGBT210] Success | RGBT234 [RGBT234] Precision | RGBT234 [RGBT234] Success | LasHeR [LasHeR] Precision | LasHeR [LasHeR] Success |
|---|---|---|---|---|---|---|---|---|
| [fusionnet] | 0.852 | 0.626 | - | - | - | - | - | - |
| mfDiMP [mfDimp] | - | - | 0.786 | 0.555 | 0.785 | 0.559 | 0.447 | 0.344 |
| MANet [MANet] | 0.894 | 0.724 | - | - | 0.777 | 0.539 | 0.457 | 0.33 |
| DAPNet [DAPNet] | 0.882 | 0.707 | - | - | 0.766 | 0.537 | 0.431 | 0.314 |
| DAFNet [DAFNet] | 0.891 | 0.712 | - | - | 0.796 | 0.544 | 0.449 | 0.311 |
| [TODAT] | 0.843 | 0.677 | - | - | 0.787 | 0.545 | - | - |
| [SiamRGT] | - | - | - | - | 0.610 | 0.428 | - | - |
| SiamFT [SiamFT] | 0.826 | 0.700 | - | - | 0.688 | 0.486 | - | - |
| MaCNet [MaCNet] | 0.880 | 0.714 | - | - | 0.790 | 0.554 | 0.483 | 0.352 |
| DMCNet [DMCNet] | 0.909 | 0.733 | 0.797 | 0.555 | 0.839 | 0.593 | 0.491 | 0.357 |
| CMPP [CMPP] | 0.926 | 0.738 | - | - | 0.823 | 0.575 | - | - |
| CAT [CAT] | 0.889 | 0.717 | 0.792 | 0.533 | 0.804 | 0.561 | 0.451 | 0.317 |
| DSiamMFT [DSiamMFT] | - | - | 0.642 | 0.432 | - | - | - | - |
| JMMAC [JMMAC] | 0.902 | 0.732 | - | - | 0.790 | 0.573 | - | - |
| MANet++ [MANet++] | 0.901 | 0.723 | - | - | 0.800 | 0.554 | 0.467 | 0.317 |
| CBPNet [CBPNet] | 0.885 | 0.716 | - | - | 0.794 | 0.541 | - | - |
| TFNet [TFNet] | 0.886 | 0.729 | 0.777 | 0.529 | 0.806 | 0.560 | - | - |
| FANet [FANet] | 0.891 | 0.728 | - | - | 0.787 | 0.553 | 0.442 | 0.309 |
| ADRNet [ADRNet] | 0.904 | 0.739 | - | - | 0.809 | 0.571 | - | - |
| M5L [M5L] | 0.896 | 0.710 | - | - | 0.795 | 0.542 | - | - |
| SiamCDA [SiamCDA] | 0.877 | 0.732 | - | - | 0.760 | 0.569 | - | - |
| DuSiamRT [DuSiamRT] | 0.766 | 0.628 | - | - | 0.567 | 0.384 | - | - |
| APFNet [APFNet] | 0.905 | 0.739 | - | - | 0.827 | 0.579 | 0.500 | 0.362 |

TABLE III: Quantitative results of the existing deep RGBT trackers on GTOT, RGBT210, RGBT234 and LasHeR datasets.

| Trackers | VOT-RGBT2019 [VOT2019] Accuracy | VOT-RGBT2019 [VOT2019] Robustness | VOT-RGBT2019 [VOT2019] EAO | VOT-RGBT2020 [VOT2020] Accuracy | VOT-RGBT2020 [VOT2020] Robustness | VOT-RGBT2020 [VOT2020] EAO |
|---|---|---|---|---|---|---|
| mfDiMP [mfDimp] | 0.6019 | 0.8036 | 0.3879 | 0.6380 | 0.7930 | 0.3800 |
| MANet [MANet] | 0.5823 | 0.7010 | 0.3463 | - | - | - |
| SiamFT [SiamFT] | 0.6300 | 0.6390 | 0.3100 | - | - | - |
| MaCNet [MaCNet] | 0.5451 | 0.5914 | 0.3052 | - | - | - |
| JMMAC [JMMAC] | 0.6649 | 0.8211 | 0.4826 | 0.6620 | 0.8180 | 0.4200 |
| MANet++ [MANet++] | 0.5092 | 0.5379 | 0.2716 | - | - | - |
| TFNet [TFNet] | 0.4617 | 0.5936 | 0.2878 | - | - | - |
| FANet [FANet] | 0.4724 | 0.5078 | 0.2465 | - | - | - |
| ADRNet [ADRNet] | 0.6218 | 0.7657 | 0.3959 | - | - | - |
| SiamCDA [SiamCDA] | 0.6820 | 0.7570 | 0.4240 | - | - | - |

TABLE IV: Quantitative results on VOT-RGBT2019 and VOT-RGBT2020 datasets.

III-B Results

TABLE III shows the results on GTOT [GTOT], RGBT210 [RGBT210], RGBT234 [RGBT234] and LasHeR [LasHeR], while the results on the VOT benchmarks are presented in TABLE IV. On GTOT, the highest Success rate (0.739) is obtained by ADRNet [ADRNet] and APFNet [APFNet], while the best Precision rate (0.926) is reached by CMPP [CMPP]. On RGBT210, DMCNet [DMCNet] achieves the best Precision (0.797) and Success (0.555) rates at the same time. Consistently, DMCNet also ranks first in Precision (0.839) and Success (0.593) on the RGBT234 dataset. APFNet [APFNet] obtains the best Precision and Success scores on the LasHeR dataset. For the VOT benchmarks, as mentioned before, JMMAC [JMMAC] ranks first on the public dataset twice. However, the VOT community provides one public dataset combined with a sequestered one, and the actual winner is decided on the sequestered dataset. The winner of VOT-RGBT2019 [VOT2019] is mfDiMP [mfDimp], and DFAT wins the VOT-RGBT2020 challenge [VOT2020].

IV Discussion

From the above investigations, we make the following observations: (1) Compared with the computer vision field as a whole, the potential of the Transformer model [transformer] has not yet been explored. (2) Different from image-based tasks [li2011no, chen2018new, zheng2006nearest, li2020nestfuse, feng2017face, wu2004new], video-based tasks such as visual object tracking rely heavily on temporal information, which has not yet been widely studied in RGBT tracking. (3) During the investigation, we find that little mathematical theory is considered in the network construction process. (4) What actually happens behind the fusion step has not yet been discussed. In the future, we will mainly focus on more concrete RGBT tracking based on these discussions.

V Conclusion

In this paper, a statistical analysis of the existing deep RGBT trackers is presented. Specifically, all the trackers are divided into three categories, i.e., MDNet-based, Siamese-based and Others. Furthermore, their quantitative results on the GTOT, RGBT210, RGBT234, LasHeR, VOT-RGBT2019 and VOT-RGBT2020 benchmarks are gathered together so that they can be compared directly. This work can therefore serve as a reference for researchers who are interested in RGBT tracking.