Visual object tracking is an important topic in computer vision, where the target object is identified in the initial video frame and successively tracked in subsequent frames. In recent years, deep networks [37, 3, 19, 44, 23, 39] have significantly improved tracking performance due to their representation power.
There are two groups of deep-learning-based trackers. The first group [36, 28, 32, 4] improves the discriminative ability of deep networks through frequent online updates. These trackers utilize the first frame to initialize the model and update it every few frames. Timely online updating enables trackers to capture target variations but also requires more computational time. Therefore, the speed of these trackers generally cannot meet real-time requirements.
Siamese-based trackers are representative of the second group [3, 44, 22], which is based entirely on offline training. They learn the similarity between objects in different frames through massive offline training. During online testing, the initial target feature is regarded as the template and used to search for the target in the following frames. These methods need no online updating and thus usually run at real-time speeds. However, without online adaptability they cannot adapt to appearance variations of the target, which increases the risk of tracking drift. To solve this problem, many studies [16, 45, 40] present different mechanisms to update template features. However, these methods only focus on combining the previous target features and ignore the discriminative information in background clutter. This results in a large accuracy gap between the siamese-based trackers and those with online update.
Generally, gradients are calculated through the final loss, which considers both positive and negative candidates. Thus, gradients contain discriminative information that reflects the target variations and distinguishes the target from background clutter. As shown in Figure 1, when objects are occluded with noise or similar objects coexist in the neighborhood of the target, the absolute values of the gradients at these locations tend to be higher. The high values in the gradients can force the template to focus on these regions and capture the core discriminative information. Most gradient-based trackers [36, 32] concentrate on hand-designed optimization algorithms, such as momentum, Adagrad, ADAM and so on. These algorithms need hundreds of iterations to converge, which leads to more computation and a lower speed. How to balance the speed and accuracy of the update is still an open problem.
If we expect to reduce the number of training iterations but still keep online update through gradients, the extreme case is to adapt the template through one backward propagation. However, training by one backward propagation is a difficult task. As shown in Table 1, there is no proper learning rate to make the template of SiameseFC converge through one iteration. Generally, even with the optimal step length, moving according to the gradient at only one iteration cannot update the template properly, because the normal gradient-based optimization is a nonlinear process. On the other hand, we can learn a nonlinear function by CNNs, which simulates the non-linear gradient-based optimization by exploring the rich information in gradients. Therefore, we propose a gradient-guided network (GradNet) to perform gradient-guided adaptation in visual tracking. The GradNet integrates the adaptation process that consists of two feed-forward and one backward calculation, simplifying the process of gradient-based optimization.
Training a robust GradNet is difficult for two main reasons. The first reason is that the network is prone to use the appearance of the template instead of the gradient for tracking (details can be found in Section 3.3), because learning to use the gradients is more difficult than learning to use appearance. The second reason is that the network is prone to overfitting. As shown in Figure 2, the model with normal training (Ours-T) can quickly reach a low distance error, but its test accuracy is not promising compared with our model. To handle these issues, we propose a template generalization method to effectively explore gradient information and avoid overfitting.
The major contributions can be summarized as follows:
- A GradNet is proposed to conduct gradient-guided template updating for visual tracking.
- A template generalization method is proposed to ensure strong adaptation ability and avoid overfitting.
- Extensive experiments conducted on four popular benchmarks show that the proposed tracker achieves promising results at a real-time speed of 80fps.
2 Related Work
2.1 Siamese Network based Tracking
SiameseFC is the most representative tracker based on template matching. Bertinetto et al. present a siamese network with two shared branches to extract features of both the target and the search region. During online tracking, the template is fixed as the initial target feature and the tracking performance mainly relies on the discriminative ability of the offline-trained network. Without online updating, the tracker achieves beyond real-time speed. Similarly, SINT also designs a network to match the initial target with candidates in a new frame. Its speed is much lower because hundreds of candidate patches are sent into the network instead of one search image. Another siamese-based tracker is GOTURN, which proposes a siamese network to regress the target bounding box with a speed of 100fps. All these methods lack online updating. The fixed model cannot adapt to appearance variations, which makes the tracker easily disturbed by similar instances or background noise. In this paper, we choose SiameseFC as our basic model and propose a gradient-guided method to update the template.
2.2 Model Updating in Tracking
Timely updating is essential to keep trackers robust. There are three dominant strategies for model updating: template combination, gradient-descent based and correlation-based strategies.
Template combination.
Algorithms [16, 45] based on template combination aim to effectively combine the target features from previous frames. Guo et al. propose a fast transformation learning model to enable effective online learning from previous frames. Zhu et al. utilize the optical flow information to convert templates and integrate them according to their weights. All these methods focus on using the information of templates and ignore the background clutter. Different from these methods, we make full use of the discriminative information in backward gradients instead of just integrating previous templates.
Gradient-descent based approaches.
Deep trackers [36, 32] based on gradient descent explore the discriminative information in backward gradients to update the model through hundreds of iterations. Wang et al. train two separate convolutional layers to regress Gaussian maps with the initial frame and update these layers every few frames. Similarly, Song et al. also utilize a number of gradient descent iterations in the initialization and online update procedures. These trackers need many training iterations to capture the appearance variations of the target, which makes them less efficient and far from real-time requirements. We propose a GradNet that needs only one backward propagation and two forward propagations to update the template effectively. Besides, our template generalization method for handling overfitting has not been investigated in existing works.
Correlation based Tracking.
Correlation-based trackers train a classifier through circular convolution, which can be computed quickly in the Fourier domain. The final classifier is trained and updated by solving the optimization problem in closed form. The classifier training cannot be simulated entirely by deep networks, so most correlation-based trackers just utilize deep networks to extract robust features. In contrast, our method aims to update the template in an end-to-end network.
2.3 Gradient Exploiting
Currently, most deep neural networks adopt gradients in offline training based on hand-designed optimization strategies, such as momentum, Adagrad, ADAM and so on. These methods usually need expensive computation and large-scale data sets. How to accelerate the training of deep networks is a hot topic in computer vision.
Meta learning approaches can be broadly divided into different categories, including optimization-based methods , memory-based methods , variable-based methods [14, 30, 24] and so on. Our algorithm can be seen as an improved version of the optimization-based method  to adapt to the update task in visual tracking. Our approach has three main differences compared with . First, ours only learns to update template, but not the network branch of search region. This is specifically designed for the tracking task. Second, our update process only contains one iteration instead of multiple iterations. Finally, our training of the optimizer includes second-order gradient which is not used in .
Meta Learning for Tracking.
Despite the popularity of meta learning in many fields, few works [40, 29] apply it to visual tracking. Yang et al. design a memory structure to dynamically write and read previous templates for model updating. In contrast, we focus on exploring the discriminative information of gradients. Eunbyung et al. train the initialization parameters of filters with pixel-wise learning rates offline and utilize a matrix multiplication to update the filters. That update is a linear process, whereas our template update is a non-linear process with convolutional layers and ReLU. Besides, we use the target feature as prior information to speed up the update process by providing a good initial value.
3 Proposed Algorithm
The whole pipeline of GradNet is shown in Figure 3, which consists of two branches. One branch extracts features of the search region and the other branch generates the template according to the target information and gradients, as detailed in Section 3.2. The template generation process consists of initial embedding, gradient calculation and template updating. First, the shallow target feature is sent to one sub-net (shown in purple in Figure 3) to obtain an initial template, which is used to calculate the initial loss. Second, the gradient of the shallow target feature is calculated through backward propagation and sent to the other sub-net (shown in orange in Figure 3), where it is non-linearly converted into a better gradient representation. Finally, the converted gradient is added to the shallow target feature to get an updated target feature, which is sent to the first sub-net again to output the optimal template. It should be noted that the two sub-nets in the initial embedding and template update processes share parameters. The optimal template is used to search for targets in search regions through cross correlation.
3.1 Basic Tracker
We adopt SiameseFC  as the basic tracker.
is used to model the feature extraction branch for search region,is used to model the feature extraction branch for target region. We assume that the movement of the target is smooth between two consecutive frames. Thus, we can crop a search region which is larger than the target patch in the current frame, centered at the target’s position in the last frame. The final score map is calculated by:
where is the template to perform an exhaustive search over the search region , means cross correlation convolution, denotes the score map to find the target. In SiameseFC, the template is defined as the deep target feature:
where is the target patch in the first frame. In order to improve the discriminative ability of the template during online tracking, we design the update branch to explore the rich information in gradients:
where is the parameter of the update branch which can not only capture the template information in but also the background information in through gradients.
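As a rough illustration of the template-matching step described above, the following sketch slides a template over a search-region feature map and takes the peak of the resulting score map. It is a minimal single-channel example in plain Python; the function names `cross_correlate` and `argmax_2d` and the toy values are assumptions made for illustration and are not taken from SiameseFC or from the actual implementation, which operates on multi-channel deep features.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (valid mode) and return the score map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    scores = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Inner product between the template and the window at (i, j).
            row.append(sum(template[a][b] * search[i + a][j + b]
                           for a in range(th) for b in range(tw)))
        scores.append(row)
    return scores


def argmax_2d(score_map):
    """Return (row, col) of the highest response."""
    best = max((v, i, j) for i, row in enumerate(score_map)
               for j, v in enumerate(row))
    return best[1], best[2]


# Toy single-channel "features": the target pattern sits at row 1, col 2.
template = [[1.0, 2.0],
            [3.0, 4.0]]
search = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 2.0],
          [0.0, 0.0, 3.0, 4.0],
          [0.0, 0.0, 0.0, 0.0]]

score_map = cross_correlate(search, template)
peak = argmax_2d(score_map)  # location of the strongest response
```

Because the search region is larger than the template, the score map has one entry per valid template position, and the estimated target location is simply the position of its maximum.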
3.2 Template Generation
Given the image pair , we want to get the optimal template which is suitable to distinguish the target from the background in search region . First, we get the target feature (using two convolutional layers) and sent to the sub-net to get the initial template :
where is the parameter of . The initial template only contains template information without background information. Thus, we need to explore the discriminative information in gradient to make it more robust. After getting , the initial score map is calculated through equation (1).
Based on the initial score map and the training label , we can get the initial loss by:
is logistic loss function. We utilize this loss to calculate the gradient ofand added it to . Then, the updated target feature is obtained by:
where is the parameter of . Here, the gradient is related to and used as the input of the sub-net to calculate the final loss, so the second-order guidance is introduced in the parameter training of the sub-net .
Finally, we send the updated target feature to the sub-net again to obtain the optimal template and the final score map by:
The optimal score map is utilized to estimate the target position. Our goal is for it to have the highest value at the target position and lower values at other positions. Thus, we train the update branch with the loss calculated from the optimal score map:
To our knowledge, this work is the first attempt to exploit the discriminative information of gradients to update the template in SiameseFC. To simplify the introduction of the template generation process, we utilize only one image pair here. In the next subsection, we discuss the training method in more general terms and in more detail.
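The two forward passes and one backward pass of template generation can be sketched on a toy 1-D problem. This is only an illustration under strong assumptions: `embed` (an identity map) and `convert_gradient` (a fixed scaled `tanh`) are hypothetical stand-ins for the two learned sub-nets, the gradient is derived by hand for the logistic loss rather than obtained by automatic differentiation, and all values are invented.

```python
import math


def correlate(search, template):
    """1-D valid cross-correlation: one score per template position."""
    n = len(search) - len(template) + 1
    return [sum(t * search[i + k] for k, t in enumerate(template))
            for i in range(n)]


def logistic_loss(scores, labels):
    """Mean logistic loss log(1 + exp(-y * s)) with labels y in {-1, +1}."""
    return sum(math.log1p(math.exp(-y * s))
               for s, y in zip(scores, labels)) / len(scores)


def loss_gradient(search, template, labels):
    """Analytic gradient of the mean logistic loss w.r.t. the template."""
    scores = correlate(search, template)
    grad = [0.0] * len(template)
    for i, (s, y) in enumerate(zip(scores, labels)):
        # d/ds log(1 + exp(-y s)) = -y / (1 + exp(y s))
        coeff = -y / (1.0 + math.exp(y * s))
        for k in range(len(template)):
            grad[k] += coeff * search[i + k] / len(scores)
    return grad


def embed(feature):
    """Hypothetical stand-in for the shared embedding sub-net (identity here)."""
    return list(feature)


def convert_gradient(grad, step=0.5):
    """Hypothetical stand-in for the gradient sub-net: a fixed non-linear map."""
    return [-step * math.tanh(g) for g in grad]


search = [0.1, 0.9, 0.8, 0.2, -0.3, 0.7, 0.1]  # search-region feature
target_feature = [0.9, 0.8, 0.2]                # shallow target feature
labels = [-1, 1, -1, -1, -1]                    # target at position 1

# Forward pass 1: initial template, initial score map, initial loss.
initial_template = embed(target_feature)
initial_loss = logistic_loss(correlate(search, initial_template), labels)

# Backward pass: gradient of the initial loss w.r.t. the target feature
# (identical to the template gradient because `embed` is the identity).
grad = loss_gradient(search, initial_template, labels)

# Forward pass 2: add the converted gradient, re-embed, and score again.
updated_feature = [f + d for f, d in zip(target_feature, convert_gradient(grad))]
optimal_template = embed(updated_feature)
final_loss = logistic_loss(correlate(search, optimal_template), labels)
```

In this toy setting the converted gradient is a descent direction, so the loss after the single update is lower than the initial loss. In GradNet the conversion is learned offline rather than fixed, which is what lets one backward pass stand in for many hand-designed optimization steps.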
3.3 Template Generalization
Problem of Basic Optimization. Image pairs from different videos and their training labels form the training set , is search region which is larger than target patch , is training label and is the number of training samples. It should be noted that and are from different frames of the same video, while and () are from different videos. One simple idea to train our network is to utilize image pairs in the training set to get optimal template and final score maps by equations (4)–(7). The update branch is trained through:
This method has two main problems according to our experiment. The first one is that the update branch of the network is prone to focus on the template appearance instead of the gradient, because learning to use the gradient is harder than modeling the similarity metric. As shown in Figure 4, the network trained without template generalization has lower weight ratio of gradients. This means that the network focuses less on gradients. The second one is that the network cannot avoid overfitting under this training process as shown in Figure 2.
Template Generalization. Our goal is to force the update branch to focus on gradients and to avoid overfitting. Based on these requirements, we propose a template generalization method which adopts search regions from different videos to obtain a versatile template and make it perform well on all search regions in each training batch. We show the training process of our model without template generalization and our model with template generalization in Figure 5 based on four image pairs. The main difference is that we utilize one template (instead of four templates) to search for targets in four images from different videos.
We choose ( in Figure 5) training image pairs from the training set to form a training batch and utilize the target patch in the first image pair to calculate the target feature . The initial template can be obtained by equation (4). Here, means the template which is calculated through . Then, we utilize to find the target on all search regions:
Then, we can obtain the initial loss by equation (5) and update the template through equations (6, 7). After obtaining the updated template , we utilize it to search the target in all search regions and train the update branch through equation (9). In this way, the is required to track the targets in simultaneously. To clarify, we show the details in Algorithm 1.
The template generalization pairs the target feature with multiple search regions and aims to obtain a general template feature which performs well on all search regions. This strategy can force the network to focus on the gradients during offline training, because the initial target features are misaligned and the gradients are aligned. The sub-nets and need to correct the initial misaligned template according to the gradients, and thereby obtain a strong ability to update templates according to gradients. As shown in Figures 2 and 4, the template generalization algorithm can effectively avoid overfitting and focus attention on gradients.
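The batch construction used by template generalization can be sketched as follows: a single target feature, taken from the first pair only, is scored against every search region in the batch, and the loss and its gradient are accumulated over all of them. This is a toy 1-D sketch, not Algorithm 1 itself; the helper `batch_loss_and_grad`, the fixed `tanh` conversion and all numbers are assumptions for illustration.

```python
import math


def correlate(search, template):
    """1-D valid cross-correlation: one score per template position."""
    n = len(search) - len(template) + 1
    return [sum(t * search[i + k] for k, t in enumerate(template))
            for i in range(n)]


def batch_loss_and_grad(template, searches, labels):
    """Mean logistic loss of ONE template over all search regions, plus gradient."""
    total, count = 0.0, 0
    grad = [0.0] * len(template)
    for search, ys in zip(searches, labels):
        for i, (s, y) in enumerate(zip(correlate(search, template), ys)):
            total += math.log1p(math.exp(-y * s))
            coeff = -y / (1.0 + math.exp(y * s))  # derivative of the loss w.r.t. s
            for k in range(len(template)):
                grad[k] += coeff * search[i + k]
            count += 1
    return total / count, [g / count for g in grad]


# A batch of k = 3 toy search regions standing in for different videos.
searches = [[0.1, 0.9, 0.8, 0.2, -0.3],
            [0.5, -0.2, 0.3, 0.7, 0.6],
            [-0.4, 0.6, 0.1, 0.0, 0.2]]
labels = [[-1, 1, -1], [-1, -1, 1], [1, -1, -1]]

# Only the target feature of the FIRST pair is used for every search region.
target_feature = [0.9, 0.8, 0.2]

initial_loss, grad = batch_loss_and_grad(target_feature, searches, labels)
# Hypothetical fixed non-linear conversion; in the paper this map is learned.
updated = [f - 0.5 * math.tanh(g) for f, g in zip(target_feature, grad)]
final_loss, _ = batch_loss_and_grad(updated, searches, labels)
```

Since the initial template comes from one video while the labels describe targets in several, the appearance alone cannot solve the task, and the information needed to improve the template arrives only through the accumulated gradient.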
3.4 Online Tracking
After offline training, the update branch is totally fixed and used for initialization and update during online testing.
Initialization. Given the ground truth in the first frame, we crop a target patch and a search region as inputs of the network. Then, we can obtain the optimal template according to equations (4)–(7). Besides, the updated target feature is calculated through equation (6) and used to update the template in the following frames.
Online Update. We update the template with one reliable training sample through one iteration. We save the reliable sample according to tracking results and use it to update the current template based on equations (4)–(7) ( replacing , , with , , ). Namely, we obtain updated feature through the initial frame. Then, the update branch of the network is used to update according to the reliable sample and produce optimal templates for the regression part.
3.5 Implementation Details
The feature extraction for the search region consists of five convolutional layers with the same structure and parameters as SiameseFC. The shallow target features are from the second convolutional layer of SiameseFC. There are three convolutional layers in which have the same structure as the last three layers of SiameseFC. The kernel size of the convolutional layer in is . The size of template and is and the size of score map is . During tracking, we update the template every 5 frames. The reliable training sample is chosen according to the max value of the score map. We set the max value of the score map in the first frame as a threshold . If the max value of the current score map is larger than , we consider the result accurate and crop the training sample as the reliable training sample. The scale evaluation, learning rate and number of training epochs in the proposed method are the same as those in SiameseFC. To trade off fast adaptation against error accumulation, the final template is obtained by combining the initial template and . We only train the network on the ILSVRC2014 VID dataset and the whole network is fixed during inference.
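The reliable-sample rule and the update schedule described above can be sketched as a small bookkeeping class. The class name `ReliableSampleSelector`, its method `observe` and the choice to trigger an update when the frame index is a multiple of the interval are assumptions for illustration; the text only fixes the first-frame threshold and the 5-frame interval.

```python
def max_response(score_map):
    """Largest value in a 2-D score map."""
    return max(max(row) for row in score_map)


class ReliableSampleSelector:
    """Keeps the latest sample whose peak response exceeds the first-frame peak."""

    def __init__(self, first_frame_score_map, update_interval=5):
        # Threshold: max value of the score map in the first frame.
        self.threshold = max_response(first_frame_score_map)
        self.update_interval = update_interval
        self.reliable_sample = None

    def observe(self, frame_index, score_map, sample):
        """Store `sample` if reliable; return True when an update is due."""
        if max_response(score_map) > self.threshold:
            self.reliable_sample = sample
        return (frame_index % self.update_interval == 0
                and self.reliable_sample is not None)


selector = ReliableSampleSelector([[0.2, 0.6], [0.1, 0.3]])
frames = [(4, [[0.7, 0.1]]), (5, [[0.4, 0.2]]), (6, [[0.9, 0.0]])]
due = [selector.observe(t, m, f"frame-{t}") for t, m in frames]
```

In this toy run the sample from frame 4 is stored because its peak exceeds the first-frame peak, frame 5 is not stored but triggers an update with the stored sample, and frame 6 replaces the stored sample without triggering an update.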
4 Experiments
Our tracker is implemented in Python with the PyTorch framework and runs at 80 fps on an Intel i7 3.2GHz CPU with 32G memory and an NVIDIA 1080Ti GPU with 11G memory. We compare our tracker with many state-of-the-art trackers with real-time performance (i.e., their speeds are faster than fps) on recent benchmarks, including OTB-2015, TC-128, VOT-2017 and LaSOT.
4.1 Evaluation on the OTB-2015 dataset
The OTB-2015  dataset is one of the most popular benchmarks, which consists of challenging video clips annotated with different attributes. We refer the reader to  for more detailed information. Here, we adopt both success and precision plots to evaluate different trackers on OTB-2015. The precision plot reports the percentages that the center location errors are smaller than certain thresholds. Whereas the success plot reports the percentages of frames where the overlap between the predicted and the ground truth bounding boxes is higher than a series of given ratios. We compare our algorithm with twelve state-of-the-art trackers including nine real-time deep trackers (ACT , StructSiam , SiamRPN , ECO-HC , PTAV , CFNet , Dsiam , LCT , SiameFC ) and three traditional trackers (Staple , DSST , KCF ).
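The two evaluation measures can be made concrete with a short sketch. It computes the center location error and the bounding-box overlap (IoU) per frame and then the fraction of frames passing a given threshold, which is one point on the precision or success plot. The `(x, y, w, h)` box format, the function names and the example boxes are assumptions for illustration; this is not the benchmark toolkit.

```python
import math


def center_error(box_a, box_b):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_at(preds, truths, threshold):
    """Fraction of frames whose center error is at most `threshold` pixels."""
    hits = sum(center_error(p, t) <= threshold for p, t in zip(preds, truths))
    return hits / len(preds)


def success_at(preds, truths, overlap):
    """Fraction of frames whose IoU exceeds `overlap`."""
    hits = sum(iou(p, t) > overlap for p, t in zip(preds, truths))
    return hits / len(preds)


truths = [(10, 10, 20, 20), (12, 10, 20, 20), (14, 10, 20, 20), (16, 10, 20, 20)]
preds = [(10, 10, 20, 20), (17, 10, 20, 20), (44, 10, 20, 20), (16, 20, 20, 20)]

precision_20 = precision_at(preds, truths, 20)  # one point of the precision plot
success_05 = success_at(preds, truths, 0.5)     # one point of the success plot
```

Sweeping the threshold over a range of pixel distances or overlap ratios and plotting the resulting fractions yields the full precision and success curves.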
Figure 6 illustrates the precision and success plots of all compared trackers on OTB-2015, which show that the proposed tracker achieves very good performance (only slightly lower than ECO-HC in success). In particular, our tracker performs significantly better than the baseline model (SiameseFC), by almost 8 in precision and 6 in success. To facilitate more detailed analysis, we demonstrate the visual results of some representative methods in Figure 9. From these figures, we can see that our method handles various challenging factors well and consistently achieves good performance.
4.2 Evaluation on the TC-128 dataset
The TC-128 dataset consists of fully-annotated image sequences with various challenging factors, which is larger than OTB-2015 and focuses more on color information. We also adopt both success and precision plots to evaluate different trackers (the same evaluation protocol as OTB-2015). We compare our algorithm with eleven trackers, including ACT, PTAV, Dsiam, SiameFC, HCFT, FCNT, STCT, BACF, SRDCF, KCF and MEEM. Figure 7 shows that our tracker achieves the best results in terms of both the precision and success criteria.
4.3 Evaluation on the VOT2017 dataset
The VOT2017 dataset contains short sequences annotated with different attributes. According to its evaluation protocol, the tested tracker is re-initialized whenever a tracking failure is detected. In this benchmark, accuracy (A), robustness (R) and expected average overlap (EAO) are the three important criteria. Different trackers are ranked based on the EAO criterion. We refer the reader to for more detailed information. In this subsection, we compare our algorithm with the top ten trackers reported in the VOT2017 real-time challenge and another state-of-the-art tracker, SiamRPN. Table 2 shows that our tracker achieves the best performance in terms of EAO while maintaining very competitive accuracy and robustness. The EAO of our tracker is higher than the winner (CSRDCF++) of the VOT2017 real-time challenge by . Our tracker also performs better than SiamRPN, whose training data (over 100,000 videos) is much larger than ours (about 4,000 videos).
4.4 Evaluation on the LaSOT dataset
The LaSOT dataset is a very large-scale dataset consisting of sequences with categories and more than M frames in total. The average sequence length of this dataset is more than frames. Up to now, this dataset is the largest for visual tracking. Following one-pass evaluation, different trackers are compared based on three criteria: precision, normalized precision and success. We also adopt precision and success plots to compare trackers and show the performance of the top trackers in Figure 8 (more compared results are presented in the supplementary material). From Figure 8, we can see that our tracker ranks third on this dataset. Although MDNet and VITAL achieve better accuracy than our tracking algorithm, their speeds are far from the real-time requirement (MDNet, 1fps and VITAL, 1.5fps).
4.5 Ablation Analysis
To verify the contribution of each component in our algorithm, we implement and evaluate several variations of our approach (Ours) on OTB-2015. These versions include: (1) ‘Ours w/o M’: GradNet without the template generalization training process; (2) ‘Ours w/o MG’: GradNet without the template generalization training process and the gradient application, which can be seen as SiameseFC with two unshared branches; (3) ‘Ours w/o U’: the proposed method without template update; (4) ‘Ours w 2U’: the two sub-nets (in purple) in Figure 3 do not share parameters; (5) ‘Ours-baseline’: SiameseFC.
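| Variant | Precision | Success (IoU) | Speed (fps) |
| --- | --- | --- | --- |
| Ours w/o M | 0.823 | 0.615 | 80 |
| Ours w/o MG | 0.717 | 0.524 | 94 |
| Ours w/o U | 0.775 | 0.552 | 85 |
| Ours w 2U | 0.833 | 0.628 | 80 |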
The performance of all variations and our final method is reported in Table 3, from which we can see that all components help improve the tracking accuracy. For example, the comparison of ‘Ours w/o M’ and the final method demonstrates that the template generalization training method effectively learns the expected GradNet. With the same amount of training data, ‘Ours’ improves the precision and IOU score of ‘Ours-baseline’ by about and respectively, which demonstrates the effectiveness of the GradNet.
To further analyze the template generalization, we show the initial score map and the optimal score map of two different training methods in Figure 10. The initial score maps of the model with template generalization (a) are noisy score maps where the approximate area of all objects has high response values. After the template updating based on gradients, the promising score maps (b) only have a high response at the target position. Differently, the model without template generalization is likely to output initial score maps (c) with a high response at the target position directly. Thus, we think the model trained by template generalization learns different tasks in the initial embedding and template update processes. During initial embedding, it learns a general template to detect the target and background clutter. This manner provides the model more discriminative gradients. Then, the model learns to update the template based on these gradients in the template update process. The discriminative gradients enable the fast adaptation of the network.
5 Conclusion
In this work, we propose a GradNet for template update, achieving accurate tracking at high speed. The two sub-nets in GradNet exploit the discriminative information in gradients through feed-forward and backward operations and speed up the hand-designed optimization process. To make full use of gradients and obtain versatile templates, a template generalization method is applied during offline training, which forces the update branch to concentrate on the gradient and avoids overfitting. Experiments on four benchmarks show that our method significantly improves the tracking performance compared with other real-time trackers.
The paper is supported in part by the National Natural Science Foundation of China under Grant Nos. 61725202, 61829102, 61751212 and the Fundamental Research Funds for the Central Universities under Grant Nos. DUT19GJ201, DUT18JC30.
- (2016) Learning to learn by gradient descent by gradient descent. In NIPS.
- (2016) Staple: complementary learners for real-time tracking. In CVPR.
- (2016) Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision Workshops, pp. 850–865.
- (2019) Multi attention module for visual tracking. Pattern Recognition 87, pp. 80–93.
- (2018) Real-time ‘actor-critic’ tracking. In ECCV.
- (2019) Visual tracking via adaptive spatially-regularized correlation filters. In ICCV.
- (2017) ECO: efficient convolution operators for tracking. In CVPR.
- (2014) Accurate scale estimation for robust visual tracking. In BMVC.
- (2015) Learning spatially regularized correlation filters for visual tracking. In ICCV.
- (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In Proceedings of the European Conference on Computer Vision, pp. 472–488.
- (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159.
- (2018) LaSOT: A high-quality benchmark for large-scale single object tracking. CoRR abs/1809.07845.
- (2017) Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking. In ICCV.
- (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
- (2017) Learning background-aware correlation filters for visual tracking. In ICCV.
- (2017) Learning dynamic siamese network for visual object tracking. In ICCV.
- (2016) Learning to track at 100 fps with deep regression networks. In ECCV.
- (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 583–596.
- (2017) Learning policies for adaptive tracking with deep feature cascades. In ICCV.
- (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980.
- (2017) The visual object tracking VOT2017 challenge results. In ICCVW.
- (2018) High performance visual tracking with siamese region proposal network. In CVPR.
- (2018) Deep visual tracking: review and experimental comparison. Pattern Recognition 76, pp. 323–338.
- (2017) Meta-sgd: learning to learn quickly for few shot learning. CoRR abs/1707.09835.
- (2015) Encoding color information for visual tracking: algorithms and benchmark. IEEE Transactions on Image Processing 24 (12), pp. 5630–5644.
- (2015) Hierarchical convolutional features for visual tracking. In ICCV.
- (2015) Long-term correlation tracking. In CVPR.
- (2016) Learning multi-domain convolutional neural networks for visual tracking. In CVPR.
- (2018) Meta-tracker: fast and robust online adaptation for visual object trackers. In ECCV.
- (2018) Meta-learning with latent embedding optimization. CoRR abs/1807.05960.
- (2016) Meta-learning with memory-augmented neural networks. In ICML.
- (2017) CREST: convolutional residual learning for visual tracking. In ICCV.
- (2016) Siamese instance search for tracking. In CVPR.
- (1998) An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization 8 (2), pp. 506–531.
- (2017) End-to-end representation learning for correlation filter based tracking. In CVPR.
- (2015) Visual tracking with fully convolutional networks. In ICCV.
- (2016) Sequentially training convolutional networks for visual tracking. In CVPR.
- (2015) Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1834–1848.
- (2019) ‘Skimming-perusal’ tracking: a framework for real-time and robust long-term tracking. In ICCV.
- (2018) Learning dynamic memory networks for object tracking. In ECCV.
- (2014) MEEM: robust tracking via multiple experts using entropy minimization. In Proceedings of the European Conference on Computer Vision, pp. 188–203.
- (2018) Correlation particle filter for visual tracking. IEEE Transactions on Image Processing 27 (6), pp. 2676–2687.
- (2019) Learning multi-task correlation particle filters for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 365–378.
- (2018) Structured siamese network for real-time visual tracking. In ECCV.
- (2018) End-to-end flow correlation tracking with spatial-temporal attention. In ECCV.