Given the first frame as a prior, visual object tracking aims to predict a compact bounding box for the object throughout the whole sequence. Owing to the low cost of visible-light sensors, RGB tracking occupies the dominant position in the tracking community. However, recent research has revealed its shortcomings in extreme environmental conditions, e.g., fog and night. Fundamentally, RGB data is imaged from the electromagnetic waves reflected by the object, so any factor that interferes with this reflection directly affects the imaging procedure. In contrast, thermal data is imaged from the thermal radiation emitted by objects with a temperature above absolute zero. It is therefore considered complementary to the RGB modality. Recently, thanks to the development of all-in-one RGB and TIR equipment, multi-modal clues can be captured simultaneously and applied to downstream tasks, e.g., tracking [zhu2020complementary, ZhuTCSVT, LADCF, GFSDCF], image fusion [li2020nestfuse, luo2017image, li2017multi, luo2016novel] and segmentation [Li2015CVPR-seg, Li2016TIP-seg]. In this paper, the main focus is on tracking with multi-modal data, i.e., RGBT tracking. Fig. 1 shows some pairs of RGBT data.
II RGBT Trackers
In this section, we will introduce the mainstream frameworks employed in the RGBT tracking community.
II-A MDNet-based RGBT Trackers
MDNet [MDNet], the winner of the VOT-2015 challenge [VOT2015], treats each sequence as a single domain and uses an online updating technique to acquire object-specific clues, obtaining robust tracking performance. Thus, many RGBT trackers based on MDNet, which are presented in TABLE I, have been proposed. Fig. 2 shows the basic extension of MDNet to RGBT tracking. Generally, the feature extractor and the fusion module are trained offline, while the last fully-connected layer is trained online. For example, at the beginning of the tracking procedure on 'Sequence 1', the first frame is utilized to train the parameters of the last fully-connected layer, giving more object-specific consideration.
Since the tracking process is generally similar across methods, the main discrepancy between different algorithms lies in the feature extractor and the fusion module, especially the latter. Thus, we mainly focus on the fusion modules of the methods described below. In FANet [FANet], the feature channels from the RGB and TIR modalities are first concatenated for inter-modality interaction and then separated for fusion-weight calculation, normalized by a Softmax operator applied in a cross-modal way. DAPNet [DAPNet] fuses the multi-modal features, through a coarse fusion sub-network composed of convolution, ReLU and normalization layers, at different feature levels. Compared with the coarse fusion sub-network used in DAPNet, DAFNet [DAFNet] further designs an adaptive fusion module similar to that in FANet. Carefully designed, MANet [MANet] extracts the modality-specific, modality-shared and object-specific clues through modality, generality and instance adapters, respectively. [TODAT] investigates the potential of the attention mechanism in RGBT tracking, i.e., local attention for offline training and global attention for online testing. Similarly, CMPP [CMPP] employs the attention mechanism to accomplish inter-modality correlation, which is further extended to the temporal dimension due to the significance of temporal information in video analysis. In DMCNet [DMCNet], a duality-gated mutual conditional module achieves multi-modal fusion in a mutually guided way. Based on MANet, MANet++ [MANet++] further enhances the features within each modality before the fusion stage. In CBPNet [CBPNet], a quality-aware fusion block is constructed to evaluate the significance of each modality. Aiming to strengthen the stronger modality and suppress the weaker one, a lightweight attention-based fusion module is applied in M5L [M5L]. All the trackers mentioned in this sub-section maintain a fused feature representation after the cross-modal interactions.
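The cross-modal softmax weighting shared by several of these fusion modules (e.g., the adaptive scheme in FANet) can be illustrated with a minimal sketch. This is not any author's implementation: the raw per-channel quality scores are assumed to come from a small learned weight-prediction branch, which is elided here.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(feat_rgb, feat_tir, score_rgb, score_tir):
    """Fuse per-channel features from two modalities.

    feat_rgb, feat_tir: (C, H, W) feature maps.
    score_rgb, score_tir: (C,) raw (pre-softmax) channel quality scores,
    assumed to be produced by a learned weight-prediction branch.
    """
    # Softmax across the modality axis so that, for every channel,
    # the RGB and TIR weights sum to one (the "cross-modal" softmax).
    weights = softmax(np.stack([score_rgb, score_tir]), axis=0)  # (2, C)
    fused = (weights[0][:, None, None] * feat_rgb
             + weights[1][:, None, None] * feat_tir)
    return fused, weights
```

With equal raw scores, both modalities receive weight 0.5 per channel, so the fused map degenerates to a plain average; the learned branch is what breaks this symmetry in favor of the more reliable modality.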
However, to keep the modality-specific characteristics discriminative, the features from the RGB and TIR modalities are also retained in some methods. Specifically, MaCNet [MaCNet] learns the fusion weights through an independent modality-aware attention network; through competitive learning [competitive-learning], for which the features of the RGB and TIR modalities are reserved, it leverages the capacity of the results from each single modality as well as the modality-fused branch. TFNet [TFNet] deploys a trident architecture in which each branch is dedicated to the RGB, TIR or fused features, respectively. Different from most existing RGBT trackers, CAT [CAT], ADRNet [ADRNet] and APFNet [APFNet] make their network constructions more specific to both modality-specific (e.g., illumination variation in RGB and thermal crossover in TIR) and modality-shared (e.g., scale variation) challenges. In CAT [CAT], all the challenge-specific features are adaptively aggregated and then complemented to the basic learning procedure of both modalities. ADRNet [ADRNet] designs an attribute-driven residual block to model the appearance under different circumstances, and all the information is ensembled before the residual connection to the basic learned feature representation. Similarly, in APFNet [APFNet], the aggregated multi-challenge features are combined with the features learned solely from the RGB or TIR modality through transformer encoder and decoder blocks.
II-B Siamese-based RGBT Trackers
In visual object tracking, the Siamese network was popularized by the pioneering work SiamFC [SiamFC] due to its efficiency brought by the end-to-end training scheme. In general, it aims to learn a generic similarity evaluation metric; its pipeline is shown in Fig. 3. As the figure shows, the existing Siamese-based RGBT trackers employ separate feature extractors for the RGB and TIR modalities. Then, a multi-modal fusion module is appended to achieve feature aggregation. After that, the same similarity evaluation strategy is applied for both classification and regression before their corresponding heads.
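The similarity evaluation at the heart of this pipeline is typically a cross-correlation between the (fused) template feature and the search feature. The following is a naive numpy sketch for clarity; real trackers implement this as a batched convolution on GPU.

```python
import numpy as np

def cross_correlation(template, search):
    """Slide the template feature over the search feature.

    template: (C, th, tw) fused template feature.
    search: (C, sh, sw) fused search feature.
    Returns a (sh - th + 1, sw - tw + 1) response map whose peak
    indicates the most likely object location.
    """
    c, th, tw = template.shape
    _, sh, sw = search.shape
    response = np.zeros((sh - th + 1, sw - tw + 1))
    for y in range(response.shape[0]):
        for x in range(response.shape[1]):
            # Inner product of the template with the aligned search window.
            response[y, x] = np.sum(template * search[:, y:y + th, x:x + tw])
    return response
```

The classification head scores this response map, while the regression head refines the box around its peak.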
Since the core of a multi-modal task lies in the combination of multi-modal clues, the fusion mechanism is the main aspect described for the following methods. At the beginning, the thermal data was employed by replacing one of the channels of the RGB data in [SiamRGT]. Taking SiamFC [SiamFC] as the baseline, SiamFT [SiamFT] uses simple concatenation for the features of the template inputs, while those of the search inputs are fused by the learned modality reliabilities. Based on SiamFT, a dynamic online-learned transformation strategy as well as multi-level semantic features are further employed in DSiamMFT [DSiamMFT]. Similar to [SiamRGT], which fuses at the input stage, DuSiamRT [DuSiamRT] utilizes a modality-wise channel attention mechanism to fuse the features of the template inputs while leaving the features of the search inputs unchanged. SiamCDA [SiamCDA], whose baseline tracker is an advanced anchor-based tracker, i.e., SiamRPN++ [SiamRPN++], introduces the information from one modality to the other through generated weights. Furthermore, to cope with the shortage of annotated RGBT data for large-scale network training at that time, the LSS dataset (https://github.com/RaymondCover/LSS-Dataset) is synthesized in a statistical way, which contributes to its superior performance.
II-C Other Deep RGBT Trackers
Apart from the MDNet-based and Siamese-based frameworks mentioned above, some RGBT trackers are built on other frameworks. In [fusionnet], the multi-modal information is straightforwardly combined by addition. mfDiMP [mfDimp] is based on DiMP [Dimp], a superior RGB tracker that many researchers follow. Specifically, DiMP is extended to the TIR modality, and a TIR dataset is generated from GOT10k [GOT-10K] and employed for network training. JMMAC [JMMAC], which ranks first on the public datasets of VOT-RGBT2019 [VOT2019] and VOT-RGBT2020 [VOT2020], learns the fusion weights through two sub-networks for local and global attention, respectively.
|Tracker | Framework | Year | Venue | Reference|
|DSiamMFT | Siamese-based | 2020 | Signal Processing: Image Communication | [DSiamMFT]|
|DuSiamRT | Siamese-based | 2021 | The Visual Computer | [DuSiamRT]|
III Datasets and Results
In this section, we first give an introduction to the existing RGBT datasets, i.e., GTOT [GTOT], RGBT210 [RGBT210], RGBT234 [RGBT234], VOT-RGBT2019 [VOT2019], VOT-RGBT2020 [VOT2020] and LasHeR [LasHeR]. After that, a comparison of the existing deep RGBT trackers on multiple benchmarks is listed and analysed.
|Benchmark | Year | Num of Sequences | Aligned | Category | Num of Attributes | Reference|
III-A Datasets and Evaluation Metrics
TABLE II shows the detailed information about these six benchmarks. Here, 'Num of Sequences' represents the number of paired RGBT videos. 'Aligned' indicates whether the RGB and TIR images are aligned: specifically, if there exists only one groundtruth file ('Y'), the RGB and TIR modalities are regarded as aligned. It should be noted that the VOT-RGBT2019 dataset is a subset of RGBT234, and the difference between the VOT-RGBT2019 and VOT-RGBT2020 benchmarks lies in the testing protocol. Therefore, they are the same in statistical analysis [VOT2020].
The same evaluation metrics are employed in the GTOT [GTOT], RGBT210 [RGBT210], RGBT234 [RGBT234] and LasHeR [LasHeR] datasets, i.e., Precision and Success. The Precision rate measures the ratio of frames in which the distance between the centers of the groundtruth bounding box and the predicted one is below a given threshold. The Success rate represents the ratio of frames whose Intersection over Union (IoU) between the prediction and the corresponding groundtruth exceeds a given threshold.
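A sketch of how these two metrics are typically computed from per-frame boxes is given below. Boxes are assumed to be (x, y, w, h); the 20-pixel and 0.5 thresholds are common defaults in the literature, not necessarily the exact values used by every benchmark (GTOT, for instance, uses a smaller distance threshold due to its small objects).

```python
import numpy as np

def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ca = np.array([box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2])
    cb = np.array([box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2])
    return np.linalg.norm(ca - cb)

def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    return inter / (aw * ah + bw * bh - inter)

def precision_rate(preds, gts, dist_threshold=20.0):
    """Ratio of frames whose center error is within dist_threshold pixels."""
    return float(np.mean([center_distance(p, g) <= dist_threshold
                          for p, g in zip(preds, gts)]))

def success_rate(preds, gts, iou_threshold=0.5):
    """Ratio of frames whose IoU with the groundtruth exceeds iou_threshold."""
    return float(np.mean([iou(p, g) > iou_threshold
                          for p, g in zip(preds, gts)]))
```

In benchmark plots these rates are swept over all thresholds; the reported Success score is usually the area under the resulting curve rather than the value at a single threshold.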
Accuracy, Robustness and Expected Average Overlap (EAO) are the measurements utilized in VOT-RGBT2019 [VOT2019] and VOT-RGBT2020 [VOT2020]. Accuracy represents the average overlap between the prediction and the groundtruth. Robustness measures the ratio of tracking failures over the total number of frames. EAO is considered the most important metric and comprehensively indicates the superiority of a tracker.
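Accuracy and Robustness can be sketched as below. This is a simplification: the actual VOT protocol re-initializes the tracker several frames after a failure and discards burn-in frames from the accuracy average, and EAO further averages expected overlaps over sequence lengths; that machinery is omitted here.

```python
import numpy as np

def vot_accuracy_robustness(ious, failure_flags):
    """Simplified VOT-style Accuracy and Robustness.

    ious: per-frame overlaps between prediction and groundtruth.
    failure_flags: per-frame booleans marking tracking failures
    (frames where the overlap dropped to zero and the tracker
    would be re-initialised under the full protocol).
    """
    # Accuracy: mean overlap over the frames where tracking succeeded.
    accuracy = float(np.mean([o for o, failed in zip(ious, failure_flags)
                              if not failed]))
    # Robustness: ratio of failures over the total number of frames.
    robustness = sum(failure_flags) / len(failure_flags)
    return accuracy, robustness
```

Because a tracker can trade accuracy against robustness (a conservative tracker fails rarely but overlaps loosely), EAO was introduced to merge both aspects into a single ranking score.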
|GTOT [GTOT] | RGBT210 [RGBT210] | RGBT234 [RGBT234] | LasHeR [LasHeR]|
|VOT-RGBT2019 [VOT2019] | VOT-RGBT2020 [VOT2020]|
TABLE III shows the results on GTOT [GTOT], RGBT210 [RGBT210], RGBT234 [RGBT234] and LasHeR [LasHeR], while the results on the VOT benchmarks are exhibited in TABLE IV. On GTOT, the highest Success rate is obtained by ADRNet [ADRNet] and APFNet [APFNet] (0.739), while the best Precision rate, 0.926, is reached by CMPP [CMPP]. On RGBT210, DMCNet [DMCNet] achieves the best Precision (0.797) and Success (0.555) rates at the same time. Consistently, DMCNet also ranks first in Precision (0.839) and Success (0.593) on the RGBT234 dataset. APFNet [APFNet] obtains the best Precision and Success scores on the LasHeR dataset. For the VOT benchmarks, as mentioned before, JMMAC [JMMAC] ranks first on the public dataset twice. However, the VOT community provides a public dataset together with a sequestered one, and the real champion is decided on the sequestered dataset. The champion of VOT-RGBT2019 [VOT2019] is mfDiMP [mfDimp], and DFAT wins the VOT-RGBT2020 challenge [VOT2020].
From the above investigation, we offer several discussions: (1) Unlike the broader computer vision field, the potential of the Transformer model [transformer] has not been fully explored in RGBT tracking. (2) Different from image-based tasks [li2011no, chen2018new, zheng2006nearest, li2020nestfuse, feng2017face, wu2004new], temporal information, which has not been widely studied in RGBT tracking yet, is of great importance in video-based tasks, e.g., visual object tracking. (3) During the investigation, we find that few mathematical theories are considered during the network construction process. (4) What actually happens behind the fusion step has not been discussed yet. In the future, we will mainly focus on more concrete RGBT tracking based on these discussions.
In this paper, a statistical analysis of the existing deep RGBT trackers is presented. Specifically, all the trackers are divided into three categories, i.e., MDNet-based, Siamese-based and others. Furthermore, their quantitative results on the GTOT, RGBT210, RGBT234, LasHeR, VOT-RGBT2019 and VOT-RGBT2020 benchmarks are gathered together and compared intuitively. Therefore, this work can serve as a reference for researchers interested in RGBT tracking.