Hierarchical Siamese Network for Thermal Infrared Object Tracking

by   Qiao Liu, et al.
Harbin Institute of Technology

Most thermal infrared (TIR) tracking methods are discriminative and treat the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class difference of arbitrary objects, while the tracking task mainly deals with the within-class difference of the same objects. In this paper, we cast the TIR tracking problem as a similarity verification task, which is well coupled to the objective of the tracking task. We propose a TIR tracker based on a hierarchical Siamese convolutional neural network (CNN), named HSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN that coalesces multiple hierarchical convolutional layers. Then, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before transferring the network into the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the most similar candidate as the tracked target. Extensive experimental results on the VOT-TIR 2015 and VOT-TIR 2016 benchmarks show that our proposed method achieves favorable performance against state-of-the-art methods.



I Introduction

In recent years, both the price and size of thermal cameras have decreased while the resolution and quality of thermal images have improved, which has opened up new application areas such as surveillance, rescue, and driver assistance at night [1, 2]. TIR object tracking is often used as a subroutine that plays an important role in these vision tasks. It has several advantages over visual object tracking [3, 4]. For example, TIR tracking is not sensitive to illumination variation, whereas visual tracking usually fails in poor visibility. In addition, it can protect privacy in some real-world scenarios such as surveillance of private places.

Despite its many advantages, TIR object tracking faces many challenges. First, TIR objects have several adverse properties, such as the absence of visual color patterns, low resolution, and blurry contours. TIR objects also often lie in a complicated background that has dead pixels, blooming, and distractors [5]. These adverse properties hinder the feature extractor from extracting discriminative features, which severely degrades the quality of the tracking model. Second, TIR tracking faces several other challenges, such as deformation, occlusion, and scale variation. To handle these challenges, several TIR trackers have been proposed over the past years. For instance, Li et al. [6] proposed a TIR tracker based on sparse theory and compressive Haar-like features, which can handle the occlusion problem to some extent. Gundogdu et al. [7] used multiple correlation filters (CFs) with histogram of oriented gradient features to construct an ensemble TIR tracker, which can adapt to appearance changes of the object thanks to the proposed switching mechanism. To alleviate the problem that a single feature is not robust to the various challenges, the authors of [8] presented a sparse representation-based TIR tracking method that fuses multiple features. However, these methods do not solve the challenges of TIR tracking well because it is difficult to obtain discriminative information about TIR objects using these hand-crafted features.

Considering the powerful representation ability of CNNs, some works [9, 10] introduce CNN features into TIR tracking. Unfortunately, these trackers do not make much progress, for several reasons. First, they are not robust to various challenges since they only use the feature of a single CNN layer. Second, the network is trained only on a limited number of TIR images, which is insufficient to obtain a robust feature. Third, the CNN feature is obtained from a classification network, which is not optimal for tracking because the objectives of these two tasks are not explicitly coupled in the network learning.

Most recently, by casting the tracking problem as a similarity verification problem, a visual tracker, Siamese-fc [11], has been proposed. It simply uses a pre-trained Siamese network as a similarity function to verify whether a target candidate is the tracked target during tracking. Compared to the classification network, the pre-trained Siamese network is more closely coupled to the tracking task. Thus, the feature extracted from the pre-trained Siamese network is more discriminative. In this paper, we use the Siamese network to carry out TIR tracking. However, there are several problems that must be solved. First, this Siamese network uses the feature of the last convolutional layer, which is not robust for TIR tracking due to the adverse properties of TIR objects. Second, there is no sufficiently large public TIR image dataset to train the network.

To address these problems, we propose a TIR tracker using a hierarchical Siamese CNN. Specifically, to obtain richer spatial and semantic features for TIR tracking, we first design a Siamese CNN that coalesces the deep and shallow hierarchical convolutional layers, because TIR tracking needs not only deep-level semantic features to distinguish different objects but also shallow-level spatial information to precisely locate the target object. Additionally, we suggest that the deep features learned from visible images can also represent TIR objects. Therefore, to handle the lack of training data, we train the proposed Siamese network on a large visible video dataset to learn the similarity between paired objects before transferring the learned network into the TIR domain. Then, the learned Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the most similar candidate as the tracked target without any adaptation during tracking. The experimental results show that our method achieves satisfactory performance.

The rest of the paper is organized as follows. Section II briefly introduces the most related works. Section III describes the main part of the proposed approach. Section IV presents the experiments and results, and Section V draws a short conclusion.

II Related Work

In this section, we discuss the two classes of most related works: classification-based trackers and verification-based trackers.

Classification-based trackers. These trackers have received much attention in TIR tracking. To deal with various challenges, a variety of classification-based TIR trackers have been presented based on sparse representation [6, 8], multiple instance learning [12], kernel density estimation [13], low-rank sparse learning [14], correlation filters [7, 15, 16], and deep learning [9, 10]. However, the aim of a classifier is to predict the class label of a sample, as is usual in pattern recognition [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], while the aim of a tracker is to estimate the object position accurately. None of these classification-based TIR trackers takes the within-class difference into account, although the tracking task mainly cares about it. Unlike these classification-based trackers, we treat the tracking task as a similarity verification problem, which is more closely coupled to the tracking task.

Verification-based trackers. Most recently, verification-based trackers have been presented in object tracking with competitive results. This kind of method is often based on a Siamese architecture, which consists of two identical or asymmetric sub-networks joined at their outputs. For instance, YCNN [28] learns discriminating features of growing complexity while simultaneously learning the similarity between the template and the search region with corresponding prediction maps using a shallow Siamese network. Tao et al. [29] propose a Siamese invariance network (SINT) to learn a generic matching function for tracking. The learned matching function is used to match the initial target with candidates and returns the most similar one as the tracked target. Bertinetto et al. [11] present a fully-convolutional Siamese network (Siamese-fc) to learn the similarity function of two arbitrary objects for tracking. However, it often fails when the appearance of the target changes drastically because it has no updating strategy. Subsequently, Valmadre et al. [30] combine the correlation filter with the Siamese architecture to construct a deep correlation filter learner (CFNet), which interprets the correlation filter as a differentiable layer in a deep neural network. However, most of these Siamese networks are not well suited to TIR tracking, because most of them represent the object with the feature of a single convolutional layer, which is not robust for TIR tracking. To adapt to TIR tracking, we design a hierarchical Siamese network that coalesces multiple hierarchical convolutional layers to obtain more robust features.

Fig. 1: The architecture of the proposed Siamese network (HSNet). The hierarchical network coalesces multiple convolution layers. MP, BN, Concat, Conv, Crop, and CF denote the max pooling layer, batch normalization layer, concat layer, convolution layer, crop layer, and correlation filter layer, respectively. The green and red pixels in the response map represent the similarity scores between the target template and the candidates. The candidate at the pixel with the highest score is regarded as the final tracked target.

III Hierarchical Siamese Network

In this section, we first give the overall framework in Section III-A and then present the hierarchical Siamese network architecture in Section III-B. Next, we explain how to train this network to learn the similarity function in Section III-C and finally give the tracking interface in Section III-D.

III-A Algorithmic Overview

The proposed method is based on a hierarchical Siamese network that coalesces multiple hierarchical convolutional layers, as shown in Fig. 1. This network consists of two asymmetric branches, which share parameters in the hierarchical network and are joined by a cross-correlation layer. The output of the Siamese network is a response map, which denotes the similarity between the multiple candidates and the target template.

In the tracking stage, our goal is to use the pre-trained Siamese network to locate the tracked target. It can be simply formulated as the following cross-correlation operator:

$$f(z, x) = \omega(\varphi(z)) \star \varphi(x), \tag{1}$$

where $z$ is the target, $x$ is the search region, and $\varphi$ is an embedding function that denotes the learned Siamese network in this paper. The CF block $\omega$ computes a standard CF target template from the feature map $\varphi(z)$ by solving a ridge regression problem in the Fourier domain. As shown in Fig. 1, we just need to input a target and a search region; the Siamese network then returns a response map that measures the similarity between the candidates and the target template. We choose the candidate with the maximal response score as the final tracked target, and its coordinates are mapped into the original frame to locate the position of the target.
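The matching step described above can be illustrated with a minimal NumPy sketch. It assumes the features have already been extracted, it uses a plain sliding inner product as the cross-correlation, and it leaves out the CF block; the function names `response_map` and `locate` and the toy feature shapes are hypothetical, not the authors' implementation.

```python
import numpy as np

def response_map(template, search):
    """Cross-correlate a (C, h, w) template with a (C, H, W) search feature map.

    Each output element is the inner product between the template and one
    candidate window, so the result has shape (H - h + 1, W - w + 1).
    """
    _, h, w = template.shape
    _, H, W = search.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(template * search[:, i:i + h, j:j + w])
    return scores

def locate(template, search):
    """Return the (row, col) offset of the highest-scoring candidate window."""
    scores = response_map(template, search)
    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    return int(row), int(col)

# Toy features: a 4-channel search map that is zero except for a 3x3 target.
search = np.zeros((4, 12, 12))
search[:, 5:8, 2:5] = 1.0
template = np.ones((4, 3, 3))
peak = locate(template, search)  # offset of the planted target
```

In a full tracker the peak offset would still have to be scaled by the network stride and mapped back to frame coordinates, which this sketch does not do.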

III-B The Network Architecture

The Siamese network is composed of two asymmetric branches, as shown in Fig. 1. For each branch, inspired by AlexNet [31], we design a deep network architecture that consists of several types of layers commonly used in CNNs. The details of these layers are shown in Table I. In the following, we mainly introduce the distinctive designs of the proposed network.

Max pooling. Location information of the object is not important for the classification task but is required for the tracking task. After a max pooling layer, the feature map loses some of the location information of the object. Like AlexNet [31], our network has only two max pooling layers, both at the early stage, to preserve more location information. On the other hand, max pooling is robust to local noise because it introduces invariance to local deformation. It is therefore important for the tracking task, since the tracked object changes its appearance over time.



Layer      Stride   Input
Conv1      /2
MP         /2
Conv2      /1
MP         /2
Conv3      /1
Conv4      /1
Conv5      /1
MP3+BN3    /1       Conv3
MP4+BN4    /1       Conv4
BN5                 Conv5
Concat              MP3+BN3, MP4+BN4, BN5
Conv6      /1

Table I: The structure details of the target branch of the proposed Siamese network (HSNet). Each convolution layer is followed by a batch normalization (BN) and rectified linear unit (ReLU) layer except for Conv5 and Conv6.

Batch normalization. To accelerate the training of the Siamese network, we add a batch normalization layer [32] after each convolutional layer. The effectiveness of batch normalization has been shown in many deep networks. Additionally, in the fusion part of the network, we normalize multiple hierarchical convolutional feature maps using the batch normalization operator and then combine them into one single output cube.

Hierarchical convolutional layers fusion. Previous Siamese architectures use only the feature from the last layer to represent the object. However, the last-layer feature lacks spatial information and is therefore not robust for TIR tracking. To obtain more robust features for TIR tracking, our proposed Siamese network coalesces multiple hierarchical convolutional layers, because the tracking task needs not only the discriminative semantic information of the deep layers to distinguish different objects but also the spatial information of the shallow layers to precisely locate the target. These hierarchical convolutional layers have different spatial resolutions, so we use max pooling to downsample the shallow convolutional layers to the same resolution as the deep convolutional layer. Before concatenating the resulting feature maps, we normalize them with a batch normalization layer to balance their influence. Let $F_3$, $F_4$, and $F_5$ denote the feature maps of Conv3, Conv4, and Conv5, respectively; all three have different spatial resolutions. To fuse these feature maps, we define two functions $M(\cdot)$ and $B(\cdot)$ to represent the max pooling and batch normalization operators, respectively. Thus, the fused feature map can be formulated as follows:

$$F = C\big(B(M(F_3)),\ B(M(F_4)),\ B(F_5)\big), \tag{2}$$

where $C(\cdot)$ denotes the concat layer, which concatenates the multiple feature maps in the channel direction. The dimension of the fused feature map is too high to train the network quickly, so it is necessary to reduce it. In our network, we use a convolutional layer to reduce the channel dimension of the fused feature map, as shown for Conv6 in Table I. This convolutional layer not only reduces the dimension of the feature map but also adaptively assigns weights to the different hierarchical convolutional layers. The final fused hierarchical convolutional feature map can be formulated as follows:

$$\tilde{F} = \psi(F), \tag{3}$$

where $\psi$ denotes a convolutional operator and $\tilde{F}$ has a suitable dimension for training.
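The fusion described above (pool the shallow maps to the deepest resolution, normalize each map, concatenate along channels, then reduce channels with a convolution) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the pooling windows (5 and 3), the channel counts, and the use of a 1x1 convolution for the reduction are hypothetical, and `normalize` is a per-channel standardization standing in for a trained batch normalization layer.

```python
import numpy as np

def max_pool_stride1(x, k):
    """Max pooling with a k x k window and stride 1 on a (C, H, W) map."""
    C, H, W = x.shape
    out = np.empty((C, H - k + 1, W - k + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            out[:, i, j] = x[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def normalize(x, eps=1e-5):
    """Per-channel standardization (stand-in for batch normalization)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def fuse(f3, f4, f5, weights):
    """Concatenate normalized maps on the channel axis, then mix channels.

    `weights` has shape (C_out, C3 + C4 + C5) and acts as a 1x1 convolution.
    """
    stacked = np.concatenate([
        normalize(max_pool_stride1(f3, 5)),  # shallow map, pooled to 6x6
        normalize(max_pool_stride1(f4, 3)),  # middle map, pooled to 6x6
        normalize(f5),                       # deepest map, already 6x6
    ], axis=0)
    return np.einsum('oc,chw->ohw', weights, stacked)

rng = np.random.default_rng(0)
f3 = rng.standard_normal((8, 10, 10))
f4 = rng.standard_normal((8, 8, 8))
f5 = rng.standard_normal((4, 6, 6))
weights = rng.standard_normal((6, 20))
fused = fuse(f3, f4, f5, weights)
```

Because each map is normalized before concatenation, no single layer dominates the channel-mixing step purely through its activation scale, which matches the balancing motivation given in the text.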

Correlation filter. As in CFNet [30], the CF is interpreted as a differentiable CNN layer in our Siamese architecture. Errors can therefore be propagated through the CF layer back to the CNN features, and the overall Siamese network can be trained end to end. In addition, the CF layer can be used to update the target template during tracking. The dynamic target template can adapt more flexibly to variations in the appearance of the target.
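To make the ridge regression in the Fourier domain concrete, here is a minimal single-channel sketch in NumPy. It uses the standard closed-form correlation filter solution on one 2-D patch; it is not the differentiable multi-channel layer of CFNet [30] or of HSNet, and the Gaussian target response, the regularization value `lam`, and all function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_response(shape, center, sigma=2.0):
    """Desired response: a Gaussian peak at `center`."""
    rows, cols = np.indices(shape)
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_filter(x, y, lam=1e-3):
    """Closed-form ridge regression in the Fourier domain.

    Returns the conjugate filter H* = (Y . conj(X)) / (X . conj(X) + lam),
    where X and Y are the 2-D DFTs of the patch and the desired response.
    """
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    return Y * np.conj(X) / (X * np.conj(X) + lam)

def respond(filter_conj, z):
    """Apply the filter to a new patch and return the spatial response."""
    return np.real(np.fft.ifft2(np.fft.fft2(z) * filter_conj))

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
y = gaussian_response(x.shape, center=(8, 8))
h = train_filter(x, y)

# On the training patch the response peaks where the desired response peaks.
peak_same = np.unravel_index(np.argmax(respond(h, x)), x.shape)

# A circularly shifted patch moves the peak by the same shift.
shifted = np.roll(x, shift=(2, 3), axis=(0, 1))
peak_shifted = np.unravel_index(np.argmax(respond(h, shifted)), x.shape)
```

Since every step is an element-wise operation on DFT coefficients, the solution is differentiable with respect to the input features, which is what allows a filter of this kind to sit inside a network trained end to end.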

III-C Training the Network

Our goal is to learn a general similarity function that can evaluate the degree of similarity of a pair of objects. The Siamese architecture has been shown to be effective for this task. Due to the shortage of TIR images, we train the proposed network on a visible image dataset before transferring the learned network into the TIR domain.

Network inputs. As shown in Fig. 1, the proposed Siamese network needs a target and a search region as its inputs. Because our network architecture is fully convolutional, it can compare a target with a larger search region in a single evaluation using cross-correlation. The output measures the similarity between a target and multiple candidates. In order to train the network, we prepare the training pairs (a target and a corresponding search region) from the large visible video detection dataset of ImageNet [33] and generate the corresponding labels as in [11]. First, we scale the original image frame with a scale factor $s$. Then, in every frame of the video, we crop a target of fixed size and a search region centered on the target. Finally, we randomly choose a target and a search region within an interval of frames in the same video as a training pair. Assuming that the bounding box size of the target is $(w, h)$ and the cropped size is $A$, the scale factor can be formulated as:

$$s(w + 2p) \times s(h + 2p) = A, \tag{4}$$

where $p$ is the padding context margin. Given the response map $\mathcal{D}$ of the network, we consider an element $u \in \mathcal{D}$ a positive sample if it is within radius $R$ of the center $c$:

$$y[u] = \begin{cases} +1 & \text{if } k\,\lVert u - c \rVert \le R, \\ -1 & \text{otherwise}, \end{cases} \tag{5}$$

where $k$ denotes the total stride of the network. For all training pairs, the corresponding labels are calculated by Eq. 5.
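The two preprocessing rules above, the scale factor that maps the padded target box to the fixed crop size and the radius rule that labels response-map elements, can be sketched in plain Python. The padding margin of (w + h) / 4, the crop area of 127 x 127, the 17 x 17 map size, the total stride of 8, and the radius of 16 are assumptions borrowed from the Siamese-fc convention [11] for illustration; the source text does not state these values.

```python
import math

def scale_factor(w, h, crop_area, padding=None):
    """Solve s(w + 2p) * s(h + 2p) = crop_area for the scale factor s."""
    if padding is None:
        padding = (w + h) / 4.0  # assumed context margin, as in Siamese-fc [11]
    return math.sqrt(crop_area / ((w + 2 * padding) * (h + 2 * padding)))

def label_map(size, total_stride, radius):
    """Label each element +1 if it lies within `radius` of the map center.

    Distances are measured in input pixels, so offsets on the response map
    are multiplied by the total stride of the network.
    """
    center = (size - 1) / 2.0
    labels = []
    for row in range(size):
        labels.append([
            1 if total_stride * math.hypot(row - center, col - center) <= radius
            else -1
            for col in range(size)
        ])
    return labels

s = scale_factor(w=60, h=40, crop_area=127 ** 2)
labels = label_map(size=17, total_stride=8, radius=16)
positives = sum(v == 1 for row in labels for v in row)
```

With these hypothetical values only the elements within two cells of the center are positive, so the labels are heavily imbalanced toward negatives.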

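Loss function. We add a logistic loss layer at the end of the Siamese network to train it:

$$\ell(y, v) = \log\big(1 + \exp(-yv)\big), \tag{6}$$

where $v$ denotes the real-valued score of a single target-candidate pair returned by the model and $y \in \{+1, -1\}$ represents the ground-truth label of this pair. For the response map $\mathcal{D}$, which measures the similarity between a target and multiple candidates, the loss is the mean of the individual losses:

$$L(y, v) = \frac{1}{|\mathcal{D}|} \sum_{u \in \mathcal{D}} \ell\big(y[u], v[u]\big). \tag{7}$$

The parameters $\theta$ of the Siamese network can be obtained by minimizing the loss function:

$$\theta^{*} = \arg\min_{\theta}\ \mathbb{E}_{(z, x, y)}\, L\big(y, f(z, x; \theta)\big). \tag{8}$$

Problem 8 can be solved with Stochastic Gradient Descent (SGD).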
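As a small numerical illustration of the loss described above, the following NumPy sketch evaluates the logistic loss per response-map element and averages it over the map. The 3 x 3 label map and the scores are made-up toy values, and the function names are hypothetical.

```python
import numpy as np

def logistic_loss(y, v):
    """Element-wise logistic loss log(1 + exp(-y * v)), computed stably."""
    return np.logaddexp(0.0, -y * v)

def response_loss(labels, scores):
    """Mean of the individual losses over all elements of a response map."""
    return float(np.mean(logistic_loss(labels, scores)))

# Toy 3x3 label map: only the center element is a positive sample.
labels = np.array([[-1.0, -1.0, -1.0],
                   [-1.0,  1.0, -1.0],
                   [-1.0, -1.0, -1.0]])

# All-zero scores carry no information, so every element costs log(2).
uninformed = response_loss(labels, np.zeros_like(labels))

# Scores that agree with the labels in sign give a much smaller loss.
confident = response_loss(labels, 5.0 * labels)
```

Using `np.logaddexp` avoids overflow for large negative margins, which matters early in training when scores can be far from the labels.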

III-D Tracking Interface

Once the similarity function has been learned, we simply exploit it as a prior in tracking without any adaptation. We use a simple strategy to verify the target candidates. Given the cropped target $z_{t-1}$ at the $(t-1)$-th frame and a search region $x_t$ at the $t$-th frame, the tracked target at the $t$-th frame can be calculated by the following formulation:

$$\hat{x}_t = \arg\max_{x_t^i} f\big(z_{t-1}, x_t^i\big), \tag{9}$$

where $x_t^i$ is the $i$-th candidate in the search region $x_t$.

Scale estimation. To enhance tracking accuracy, we adopt a simple but effective scale estimation method [34]. Search regions at three different scales are fed into the network for comparison with the target, and the response map with the maximum score and its corresponding scale are returned.
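A minimal sketch of this scale search in NumPy is given below, assuming the three response maps have already been computed. The candidate scale factors, the damping value, and the function name `select_scale` are hypothetical; the damped linear interpolation follows the update described in the implementation details rather than any stated formula.

```python
import numpy as np

def select_scale(responses, scales, current_scale, damping):
    """Pick the scale whose response map has the highest peak.

    The new scale is a linear interpolation between the current scale and
    the winning candidate, which damps abrupt scale changes.
    """
    peaks = [float(np.max(r)) for r in responses]
    best = int(np.argmax(peaks))
    target_scale = current_scale * scales[best]
    new_scale = (1.0 - damping) * current_scale + damping * target_scale
    return best, new_scale

# Hypothetical candidate scales and toy response maps with known peaks.
scales = [0.96, 1.0, 1.04]
responses = [np.full((5, 5), 0.2), np.full((5, 5), 0.5), np.full((5, 5), 0.9)]
best, new_scale = select_scale(responses, scales, current_scale=1.0, damping=0.6)
```

Here the largest candidate wins, and the damped update moves the scale only part of the way toward it.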

IV Experiments

To demonstrate the effectiveness of our approach, we conduct the experiments on two TIR tracking benchmarks. First, we give the implementation details in Section IV-A and describe the evaluation criteria in Section IV-B. Then, we carry out an internal comparison experiment in Section IV-C and an external comparison experiment in Section IV-D, respectively.

IV-A Implementation Details

Training. We train the proposed Siamese network on the large video detection dataset (ILSVRC2015) from ImageNet [33] by solving Eq. 8 with straightforward SGD using MatConvNet [35]. The training is performed epochs, each epoch is made up by sampled pairs. We use mini-batches with a size of to calculate gradients for each iteration. The learning rate is annealed geometrically at each epoch from to . After finished the training, the -th iteration results is exploited as our model.

Tracking. The experiments are conducted in MATLAB 2015b with a GTX 1080 GPU card. In Algorithm 1, we give the main steps of the proposed tracking algorithm. In order to deal with the scale variation, we use three fixed scales

to search the object. The scale is updated by linear interpolation with a factor of

to provide damping. The average frame rate of the proposed tracker is frames per second (fps).

Algorithm 1 The proposed tracker (HSNet)
Inputs: initial target state $x_1$, the learned similarity function $f$ using Eq. 8.
Outputs: the estimated target state $\hat{x}_t$.
while frames remain do
   Crop search regions at three different scales based on the target state $\hat{x}_{t-1}$.
   Calculate the optimal target state $\hat{x}_t$ by Eq. 9 and return the best scale factor.
end while

Fig. 2: A-R ranking plots and A-R raw plots on VOT-TIR 2015. The better a tracker performs, the closer it lies to the top-right of the plot.

IV-B Evaluation Criterion

Accuracy (A) and robustness (R) are adopted as the performance measures due to their high interpretability [36]. A is calculated as follows:

$$A = \frac{1}{N} \sum_{t=1}^{N} \frac{S_t^{p} \cap S_t^{g}}{S_t^{p} \cup S_t^{g}}, \tag{10}$$

where $S_t^{p}$ denotes the area of the predicted bounding box and $S_t^{g}$ denotes the area of the ground truth at the $t$-th frame. $N$ is the number of frames in the dataset. Robustness counts the number of failures; a failure occurs when the overlap is lower than a given threshold. The tracking results are often visualized by the A-R raw plot and the A-R ranking plot [37].
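The accuracy and failure count described above can be sketched in plain Python for axis-aligned boxes given as (x, y, width, height). This is a simplified illustration: the function names and the toy boxes are hypothetical, and the sketch does not model the re-initialization protocol that the VOT evaluation applies after a failure.

```python
def overlap(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy(predicted, ground_truth):
    """Mean per-frame overlap between predicted and ground-truth boxes."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(overlaps) / len(overlaps)

def count_failures(predicted, ground_truth, threshold):
    """Number of frames whose overlap falls below `threshold`."""
    return sum(overlap(p, g) < threshold
               for p, g in zip(predicted, ground_truth))

predicted = [(0, 0, 2, 2), (1, 0, 2, 2), (5, 5, 2, 2)]
ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
acc = accuracy(predicted, ground_truth)          # overlaps: 1, 1/3, 0
failures = count_failures(predicted, ground_truth, threshold=0.1)
```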

In addition, to predict the overall performance of the tracker, A and R are integrated into the expected average overlap (EAO) measurement. The EAO curve and EAO score are often used to evaluate the overall performance [38].

IV-C Experiments on VOT-TIR 2015

In this section, we show the effectiveness of each component of our tracker. An internal comparison experiment on the benchmark VOT-TIR 2015 [39] is conducted.

Datasets. VOT-TIR 2015 consists of TIR sequences, each annotated with several local attributes, such as dynamics change, occlusion, camera motion, and object motion. Trackers are often compared by their performance on these attributes.

Compared trackers. First, to demonstrate our hierarchical convolutional layer fusion method is effective, we compare our tracker (HSNet) with HSNet-lastlayer, which just uses the last single convolutional layer. Then, to demonstrate the scale estimation strategy is also effective, we compare HSNet with HSNet-noscale, which is HSNet without the scale estimation strategy.

Evaluation and results. The two experiments mentioned above are conducted, and the results are shown in Fig. 2. Our tracker HSNet achieves better performance than HSNet-lastlayer and HSNet-noscale in terms of accuracy and robustness. Specifically, HSNet improves the accuracy by about five percent compared to HSNet-lastlayer, which shows that the proposed hierarchical convolutional layer fusion CNN obtains more robust features of the TIR object than the same CNN using only the last layer. Additionally, the scale estimation strategy enhances the accuracy by about two percent while the robustness exhibits no change, as shown by the A-R plot for the baseline experiment (weighted_mean) in Fig. 2.

IV-D Experiments on VOT-TIR 2016

In this section, we demonstrate that our tracker achieves favorable results against most state-of-the-art methods on the VOT-TIR 2016 benchmark [40]. Additionally, we show some representative tracking results on several challenging sequences.

Datasets. The benchmark VOT-TIR 2016 is an enhanced version of VOT-TIR 2015. Several more challenging sequences have been added.

Fig. 3: EAO curve and EAO score on VOT-TIR 2016. The right-most tracker gets the best performance according to the EAO values.

Compared trackers. We choose fourteen trackers to compare with our tracker on VOT-TIR 2016. These trackers can be divided into four categories: CNN-based trackers, CF-based trackers, part-based trackers, and fusion-based trackers. For the CNN-based trackers, we choose six state-of-the-art methods: Siamese-fc [11], MDNet [41], CFNet [30], deepMKCF [42], HCF [43], and HDT [44]. These trackers achieve promising results on the object tracking benchmark [45]. For the CF-based trackers, three trackers are selected: DSST [46], NSAMF [47], and SKCF [48]. These trackers achieve favorable results on VOT 2014 [49]; in particular, DSST shows the best performance. Three part-based trackers are chosen: DPT [50], FCT [40], and GGT2 [51]. For the fusion-based trackers, MAD [52] and LOFT-Lite [53] are selected.

Evaluation and results. First, to evaluate the overall performance of a tracker in a concrete tracking application, we adopt the EAO curve and the EAO score; the results are shown in Fig. 3. They confirm that our tracker HSNet achieves the best performance. Specifically, HSNet has the highest EAO score, which is about two and four percent higher than the scores of CFNet [30] and DSST [46], respectively.

Second, we show the accuracy and robustness of all these trackers on VOT-TIR 2016; the results are shown in Fig. 4. Our tracker HSNet achieves the best robustness and the third best accuracy in all plots. Note that almost all the CNN-based trackers perform better than the conventional trackers based on hand-crafted features, which shows that deep features are more suitable for TIR tracking even though they are learned from visible images. Additionally, CF-based trackers such as DSST [46] and HCF [43] usually perform well in visual tracking, while our tracker HSNet outperforms them in TIR tracking.

Furthermore, to evaluate how well our tracker handles various challenges, these trackers are compared with HSNet on the corresponding attribute subsets of VOT-TIR 2016. The results are shown in Fig. 5. For the size change and motion change challenges, our method achieves the best robustness, as shown in the size change and motion change panels of Fig. 5. This demonstrates that our tracker (HSNet) can deal with these challenges effectively. We suggest that these promising results come from the hierarchical convolutional layer fusion strategy, which obtains richer structural and semantic features for tracking. For the occlusion and camera motion challenges, our tracker HSNet achieves the best accuracy, and its robustness is also favorable, as shown in the occlusion and camera motion panels of Fig. 5. This also illustrates that HSNet obtains satisfactory results.

Fig. 4: A-R ranking plots and A-R raw plots on VOT-TIR 2016. The better a tracker performs, the closer it lies to the top-right of the plot.

Finally, to compare the results of the trackers more intuitively, we give several visual tracking results of the evaluated methods in Fig. 6. Overall, our tracker HSNet locates the targets more precisely. In the "Birds" sequence of Fig. 6, DSST and Siamese-fc fail when the target changes its appearance drastically, while our tracker tracks the target precisely. In addition, in the "Hiding" sequence of Fig. 6, HSNet outperforms the other trackers when the target is occluded. These results illustrate that HSNet is robust to appearance variation and occlusion. Furthermore, our tracker has a better scale estimation ability, as shown in the boat and car examples of Fig. 6.

Fig. 5: A-R ranking plots and A-R raw plots for the baseline experiments on each attribute subset of VOT-TIR 2016. The better a tracker performs, the closer it lies to the top-right of the plot.

Fig. 6: Visual comparison of the tracking results of several state-of-the-art trackers on some representative challenging TIR sequences.

V Conclusion

In this paper, we propose a TIR tracking method (HSNet) that uses a hierarchical Siamese CNN and achieves promising results. We treat the tracking problem as a similarity verification task, which is more closely coupled to the tracking task. To adapt to TIR tracking, we design a novel Siamese network that coalesces multiple hierarchical convolutional layers. The experimental results show that our hierarchical convolutional layer fusion method obtains more robust features for TIR tracking. To address the lack of TIR training data, we train the proposed network on a large visible video dataset and then transfer the learned network into the TIR domain. The results show that deep features learned from visible images can also represent TIR objects well.


This research was supported by the National Natural Science Foundation of China (Grant Nos. 61272252, U1509216, 61472099, 61672183), by the Science and Technology Planning Project of Guangdong Province (Grant No. 2016B090918047), by the Natural Science Foundation of Guangdong Province (Grant No. 2015A030313544), and by the Shenzhen Research Council (Grant Nos. JCYJ20170413104556946, JCYJ20160406161948211, JCYJ20160226201453085, JSGG20150331152017052).


  • [1] R. Gade and T. B. Moeslund, “Thermal cameras and applications: a survey,” Machine vision and applications, vol. 25, no. 1, pp. 245–262, 2014.
  • [2] J. A. Sobrino, F. Del Frate, M. Drusch, J. C. Jiménez-Muñoz, P. Manunta, and A. Regan, “Review of thermal infrared applications and requirements for future high-resolution sensors,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 5, pp. 2963–2972, 2016.
  • [3] S. Zhang, X. Lan, Y. Qi, and P. C. Yuen, “Robust visual tracking via basis matching,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pp. 421–430, 2017.
  • [4] S. Zhang, H. Zhou, F. Jiang, and X. Li, “Robust visual tracking using structurally random projection and weighted least squares,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 11, pp. 1749–1760, 2015.
  • [5] A. Berg, J. Ahlberg, and M. Felsberg, “Channel coded distribution field tracking for thermal infrared imagery,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016, pp. 9–17.
  • [6] Y. Li, P. Li, and Q. Shen, “Real-time infrared target tracking based on ℓ1 minimization and compressive features,” Applied optics, vol. 53, no. 28, pp. 6518–6526, 2014.
  • [7] E. Gundogdu, H. Ozkan, H. S. Demir, H. Ergezer, E. Akagunduz, and S. K. Pakin, “Comparison of infrared and visible imagery for object tracking: Toward trackers with superior ir performance,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, 2015, pp. 1–9.
  • [8] S. J. Gao and S. T. Jhang, “Infrared target tracking using multi-feature joint sparse representation,” in International Conference on Research in Adaptive and Convergent Systems, 2016, pp. 40–45.
  • [9] E. Gundogdu, A. Koc, B. Solmaz, R. I. Hammoud, and A. Aydin Alatan, “Evaluation of feature channels for correlation-filter-based visual object tracking in infrared spectrum,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016, pp. 24–32.
  • [10] Q. Liu, X. Lu, Z. He, C. Zhang, and W.-S. Chen, “Deep convolutional neural networks for thermal infrared object tracking,” Knowledge-Based Systems, vol. 134, pp. 189–198, 2017.
  • [11] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in European Conference on Computer Vision (ECCV) Workshops, 2016, pp. 850–865.
  • [12] X. Shi, W. Hu, Y. Cheng, G. Chen, and J. J. H. Ling, “Infrared target tracking using multiple instance learning with adaptive motion prediction and spatially template weighting,” in Proc. of SPIE, vol. 8739, 2013, p. 873912.
  • [13] R. Liu and Y. Lu, “Infrared target tracking in multiple feature pseudo-color image with kernel density estimation,” Infrared Physics & Technology, vol. 55, no. 6, pp. 505–512, 2012.
  • [14] Y. He, M. Li, J. Zhang, and J. Yao, “Infrared target tracking based on robust low-rank sparse learning,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 2, pp. 232–236, 2016.
  • [15] Y.-J. He, M. Li, J. Zhang, and J.-P. Yao, “Infrared target tracking via weighted correlation filter,” Infrared Physics & Technology, vol. 73, pp. 103–114, 2015.
  • [16] C. Asha and A. Narasimhadhan, “Robust infrared target tracking using discriminative and generative approaches,” Infrared Physics & Technology, vol. 85, pp. 114–127, 2017.
  • [17] X. You, Q. Li, D. Tao, W. Ou, and M. Gong, “Local metric learning for exemplar-based object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 8, pp. 1265–1276, 2014.
  • [18] Z. Zhu, X. You, C. P. Chen, D. Tao, W. Ou, X. Jiang, and J. Zou, “An adaptive hybrid pattern for noise-robust texture analysis,” Pattern Recognition, vol. 48, no. 8, pp. 2592–2608, 2015.
  • [19] X.-Y. Jing, X. Zhu, F. Wu, R. Hu, X. You, Y. Wang, H. Feng, and J.-Y. Yang, “Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning,” IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1363–1378, 2017.
  • [20] Q. Ge, X.-Y. Jing, F. Wu, Z.-H. Wei, L. Xiao, W.-Z. Shao, D. Yue, and H.-B. Li, “Structure-based low-rank model with graph nuclear norm regularization for noise removal,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3098–3112, 2017.
  • [21] W.-S. Chen, P. C. Yuen, J. Huang, and B. Fang, “Two-step single parameter regularization fisher discriminant method for face recognition,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 20, no. 02, pp. 189–207, 2006.
  • [22] W. Ou, S. Yu, G. Li, J. Lu, K. Zhang, and G. Xie, “Multi-view non-negative matrix factorization by patch alignment framework with view consistency,” Neurocomputing, vol. 204, pp. 116–124, 2016.
  • [23] W. Ou, X. You, D. Tao, P. Zhang, Y. Tang, and Z. Zhu, “Robust face recognition via occlusion dictionary learning,” Pattern Recognition, vol. 47, no. 4, pp. 1559–1572, 2014.
  • [24] Z. Guo, X. Wang, J. Zhou, and J. You, “Robust texture image representation by scale selective local binary patterns,” IEEE Transactions on Image Processing, vol. 25, no. 2, pp. 687–699, 2016.
  • [25] X. Shi, Z. Guo, F. Nie, L. Yang, J. You, and D. Tao, “Two-dimensional whitening reconstruction for enhancing robustness of principal component analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2130–2136, 2016.
  • [26] Z. Lai, Y. Xu, Q. Chen, J. Yang, and D. Zhang, “Multilinear sparse principal component analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1942–1950, 2014.
  • [27] Z. Lai, W. K. Wong, Y. Xu, J. Yang, and D. Zhang, “Approximate orthogonal sparse embedding for dimensionality reduction,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 723–735, 2016.
  • [28] K. Chen and W. Tao, “Once for all: a two-flow convolutional neural network for visual tracking,” arXiv preprint arXiv:1604.07507, 2016.
  • [29] R. Tao, E. Gavves, and A. W. Smeulders, “Siamese instance search for tracking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1420–1429.
  • [30] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. Torr, “End-to-end representation learning for correlation filter based tracking,” arXiv preprint arXiv:1704.06036, 2017.
  • [31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), 2015, pp. 448–456.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [34] X. Li, Q. Liu, Z. He, H. Wang, C. Zhang, and W.-S. Chen, “A multi-view model for visual tracking via correlation filters,” Knowledge-Based Systems, vol. 113, pp. 88–99, 2016.
  • [35] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in ACM International Conference on Multimedia, 2015, pp. 689–692.
  • [36] L. Čehovin, M. Kristan, and A. Leonardis, “Is my new tracker really better than yours?” in IEEE Winter Conference on Applications of Computer Vision, 2014, pp. 540–547.
  • [37] M. Kristan, J. Matas, A. Leonardis et al., “A novel performance evaluation methodology for single-target trackers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 11, pp. 2137–2155, 2016.
  • [38] M. Kristan, J. Matas, A. Leonardis, M. Felsberg et al., “The visual object tracking vot2015 challenge results,” in IEEE International Conference on Computer Vision (ICCV) Workshops, 2015, pp. 1–23.
  • [39] M. Felsberg, A. Berg, G. Hager, J. Ahlberg, M. Kristan, J. Matas, A. Leonardis, L. Cehovin, G. Fernandez, T. Vojir et al., “The thermal infrared visual object tracking vot-tir2015 challenge results,” in IEEE International Conference on Computer Vision (ICCV) Workshops, 2015, pp. 76–88.
  • [40] M. Felsberg, M. Kristan, J. Matas, A. Leonardis et al., “The thermal infrared visual object tracking vot-tir2016 challenge results,” 2016, pp. 824–849.
  • [41] H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4293–4302.
  • [42] M. Tang and J. Feng, “Multi-kernel correlation filter for visual tracking,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3038–3046.
  • [43] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical convolutional features for visual tracking,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3074–3082.
  • [44] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang, “Hedged deep tracking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4303–4311.
  • [45] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2411–2418.
  • [46] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in British Machine Vision Conference (BMVC), 2014, pp. 65.1–65.11.
  • [47] H. Possegger, T. Mauthner, and H. Bischof, “In defense of color-based model-free tracking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2113–2120.
  • [48] J. Lang, R. Lagani et al., “Scalable kernel correlation filter with sparse feature integration,” in IEEE International Conference on Computer Vision (ICCV) Workshop, 2015, pp. 587–594.
  • [49] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, L. Čehovin et al., “The visual object tracking vot2014 challenge results,” in European Conference on Computer Vision (ECCV) Workshops, 2015, pp. 191–217.
  • [50] A. Lukežič, L. Čehovin, and M. Kristan, “Deformable parts correlation filters for robust visual tracking,” arXiv preprint arXiv:1605.03720, 2016.
  • [51] D. Du, H. Qi, L. Wen, Q. Tian, Q. Huang, and S. Lyu, “Geometric hypergraph learning for visual tracking,” arXiv preprint arXiv:1603.05930, 2016.
  • [52] S. Becker, S. B. Krah, W. Hübner, and M. Arens, “Mad for visual tracker fusion,” in SPIE Security + Defence, 2016, p. 99950K.
  • [53] R. Pelapur, S. Candemir, F. Bunyak, M. Poostchi, G. Seetharaman, and K. Palaniappan, “Persistent target tracking using likelihood fusion in wide-area and full motion video sequences,” in International Conference on Information Fusion, 2012, pp. 2420–2427.