Learning Features with Differentiable Closed-Form Solver for Tracking

06/25/2019
by   Linyu Zheng, et al.

We present a novel and easy-to-implement training framework for visual tracking. Our approach mainly focuses on learning feature embeddings in an end-to-end way that generalize well to trackers based on an online discriminatively trained ridge regression model. This goal is achieved efficiently by exploiting two important properties. 1) The ridge regression problem has a closed-form solution and is implicitly differentiable under the optimality condition; therefore, its solver can be embedded as a layer with efficient forward and backward processes when training deep convolutional neural networks. 2) The Woodbury identity can be utilized to solve the ridge regression problem efficiently when high-dimensional feature embeddings are employed. Moreover, in order to address the extreme foreground-background class imbalance during training, we modify the original shrinkage loss and employ it as the loss function for efficient and effective training. It is worth mentioning that the core parts of the proposed training framework can each be implemented with several lines of code under current popular deep learning frameworks, so our approach is easy to follow. Extensive experiments on six public benchmarks, OTB2015, NFS, TrackingNet, GOT10k, VOT2018, and VOT2019, show that the proposed tracker achieves state-of-the-art performance while running at over 30 FPS. Code will be made available.



Introduction

Visual object tracking is one of the fundamental problems in computer vision with many applications. In the model-free tracking problem, the goal is to estimate the states (e.g., position and size) of the target in a whole image sequence given only the initial frame [Wu, Lim, and Yang2015]. Model-free tracking is very challenging because the tracker has to learn a robust appearance model from very limited training samples to resist extremely challenging interference, such as occlusion, large appearance changes, illumination variation, fast motion, and background clutter. In general, the key problem of model-free tracking is how to construct a tracker that can not only tolerate the appearance variation of the target but also exclude background interference, while keeping the processing speed as fast as possible.

Figure 1: Speed and accuracy plot of state-of-the-art trackers on OTB2015. The proposed DCFST achieves the best accuracy, while running beyond real-time speed.

There has been significant progress on convolutional neural network (CNN) based trackers in recent years. From a technical standpoint, existing state-of-the-art CNN based trackers mainly fall into two categories. One category treats tracking as a similarity learning problem and performs only offline training, typically SINT [Tao, Gavves, and Smeulders2016], SiamFC [Bertinetto et al.2016] and SiamRPN [Li et al.2018]. Although such trackers achieve excellent performance on many challenging benchmarks, the lack of online learning prevents them from integrating background information in an online adaptive way to improve the discriminative power of their models. Therefore, they are seriously affected by the interference of cluttered backgrounds, which hinders further improvement in their localization performance. Another category applies CNN features to online discriminatively trained trackers, typically HCF [Ma et al.2015], ECO [Danelljan et al.2017] and CFNet [Valmadre et al.2017]. Some of these trackers extract features from deep CNNs trained on ImageNet [Deng et al.2009] for object classification; however, these features are not optimal for visual tracking. Others attempt to learn feature embeddings for online discriminatively trained trackers by approximation methods such as circulant samples in correlation filters [Bolme et al.2010, Henriques et al.2014]; however, these approximations inherently degrade the trackers' localization performance.

After investigating previous approaches that aim to learn feature embeddings for online discriminatively trained trackers, we hold the opinion that the main challenge in this task is efficiency. In each iteration of the offline training phase, such methods need to train the discriminative model in the forward process and calculate the gradients of the feature embeddings from the trained model in the backward process. If either process is inefficient, network training, which often involves millions of iterations, takes an unacceptable amount of time. Consequently, in order to be efficient, most current approaches formulate the efficient correlation-filter solver as a layer to learn feature embeddings. However, it is well known that correlation filters inherently suffer from the negative boundary effect [Henriques et al.2014, Danelljan et al.2015], which seriously affects localization performance. Even though CFNet [Valmadre et al.2017] proposes to relax the boundary effect by cropping the trained filters, its experimental results show that this heuristic produces very little effect. Besides, CFCF [Gundogdu and Alatan2018] employs the CCOT tracker [Danelljan et al.2016], which is less affected by the boundary effect, to locate the target in the inference phase; however, its tracking speed is far from real-time due to the low efficiency of CCOT, and the feature embeddings learned in its offline training phase are not optimal for CCOT.

Taking inspiration from the fact that the ridge regression model has been successfully applied by many modern online discriminatively trained trackers [Bolme et al.2010, Henriques et al.2014, Danelljan et al.2017, Sun et al.2018] due to its simplicity, efficiency and effectiveness, in this paper we present a novel and easy-to-implement training framework for visual tracking. Our approach mainly focuses on learning feature embeddings in an end-to-end way that generalize well to trackers based on an online discriminatively trained ridge regression model. In our design, we mainly take advantage of the following two important properties to achieve efficient network training without approximations. 1) The ridge regression problem has a closed-form solution and is implicitly differentiable under the optimality condition. Therefore, its solver can be embedded as a layer with efficient forward and backward processes in training deep networks [Lee et al.2019, Bertinetto et al.2019]. 2) The Woodbury identity [Petersen, Pedersen, and others2008] can be utilized to solve the ridge regression problem efficiently when high-dimensional feature embeddings are employed, because it removes the dependence of the time complexity of solving the ridge regression problem on the dimension of the feature embeddings. Accordingly, we claim that there is no obstacle to efficiently learning feature embeddings for trackers based on an online discriminatively trained ridge regression model without approximations. Moreover, the entire network can be trained in an end-to-end fashion with image pairs.

In addition to the above, we find that the extreme foreground-background class imbalance encountered during training seriously affects the convergence speed and generalization ability of the learned feature embeddings when the general mean square error loss is employed. This is because the vast number of background samples are easy ones, and they make the proposed network difficult to train. However, this problem has received little attention in previous similar approaches. In order to solve it, we modify the original shrinkage loss [Lu et al.2018], which is designed for deep regression learning, and employ it as the loss function for efficient and effective training in our task. Specifically, it down-weights the loss assigned to easy examples and mainly focuses on a sparse set of hard examples, thus preventing the vast number of easy negatives from overwhelming the feature embedding learning during training.

In the inference phase, following the general tracking procedure of most recent online discriminatively trained trackers [Bolme et al.2010, Henriques et al.2014, Danelljan et al.2015], we train a ridge regression model with the feature embeddings learned above and then locate the target with it. Further, the target size and localization are refined by ATOM [Danelljan et al.2019] for more accurate tracking.

It is worth mentioning that the core parts of our proposed training framework can be implemented with several lines of code under current popular deep learning frameworks, so our approach is easy to follow. Extensive experiments are performed on six public benchmarks: OTB2015, NFS, TrackingNet, GOT10k, VOT2018, and VOT2019. The proposed tracker, DCFST, achieves state-of-the-art localization performance while running at over 30 FPS. Fig. 1 provides a glance at our DCFST compared with other state-of-the-art trackers on OTB2015.

Figure 2: Full architecture of the proposed feature learning network. For each input image, n sample RoIs with the target size are obtained by uniform sampling. ResNet-18 Block3 and Block4 backbone feature maps extracted from the input image are first passed through two convolutional layers to obtain the learned feature maps of the input image, respectively. Each sample RoI is then pooled to feature maps of a fixed size using PrPool layers and further mapped to feature vectors by fully-connected layers. X and X' denote the data matrices composed of the learned feature vectors of all sample RoIs in the training image and the test image, respectively. A regressor w is discriminatively trained by solving a ridge regression problem to fit the samples in the training image to their labels. Finally, w is employed to predict the regression values of the samples in the test image, and the loss is calculated.

Related Work

In this section, we briefly introduce recent state-of-the-art trackers, with a special focus on Siamese network based ones and discriminatively trained model based ones. Besides, we also briefly describe recent developments in meta-learning approaches for few-shot learning, since our approach shares similar insights with them.

Siamese Network Based Trackers

Recently, Siamese network based trackers [Tao, Gavves, and Smeulders2016, Bertinetto et al.2016, Wang et al.2018, Li et al.2018, Zhu et al.2018, Li et al.2019, Wang et al.2019a, Zhang and Peng2019, Gao, Zhang, and Xu2019a] have received significant attention for their well-balanced tracking accuracy and efficiency. These trackers model visual object tracking as a similarity learning problem and apply a Siamese network to it. By comparing the target image patch with the candidate patches in a search region, they track the object to the location where the highest similarity score is obtained. A notable characteristic of such trackers is that they need no online learning or update, so they run at high speed in inference. Specifically, SiamFC [Bertinetto et al.2016] employs a fully-convolutional Siamese network to extract features and uses a simple cross-correlation layer to perform dense and efficient sliding-window evaluation in the search region. RASNet [Wang et al.2018] proposes to learn attentions for SiamFC. SiamRPN [Li et al.2018] enhances tracking performance by adding a region proposal subnetwork after the Siamese network. SPM [Wang et al.2019a] presents a coarse-to-fine tracking framework based on SiamRPN. SiamRPN++ [Li et al.2019] enables SiamRPN to benefit from deeper networks. Even though these trackers achieve state-of-the-art performance on multiple challenging benchmarks, a key limitation is their inability to incorporate information from the background region to improve the discriminative power of their models online.

Online Discriminatively Trained Trackers

In contrast to the Siamese network based trackers, another family of trackers [Hare et al.2015, Henriques et al.2014, Danelljan et al.2015, Kiani Galoogahi, Fagg, and Lucey2017, Ma et al.2015, Danelljan et al.2017, Sun et al.2018] trains discriminative models online to distinguish the target object from the background. These approaches can effectively utilize background information, thereby achieving impressive robustness and discriminative power on multiple challenging benchmarks. The most advanced such trackers mainly focus on taking advantage of deep CNN features. Specifically, HCF [Ma et al.2015] and ECO [Danelljan et al.2017] extract features from deep CNNs trained on ImageNet for object classification and then apply them to the online discriminatively trained trackers KCF [Henriques et al.2014] and SRDCF [Danelljan et al.2015], respectively. However, these CNN features are not optimal for visual tracking. In order to learn feature embeddings for visual tracking, CFCF [Gundogdu and Alatan2018] and CFNet [Valmadre et al.2017] integrate the closed-form solutions of correlation filters [Bolme et al.2010, Henriques et al.2014] into deep network training. However, correlation filters inherently suffer from the negative boundary effect [Henriques et al.2014], which seriously affects localization performance. Even though CFNet [Valmadre et al.2017] proposes to relax the boundary effect by cropping the trained filters, its experimental results show that this heuristic produces very little effect. Besides, CFCF [Gundogdu and Alatan2018] employs the CCOT tracker [Danelljan et al.2016], which is less affected by the boundary effect, to locate the target in the inference phase; however, its tracking speed is far from real-time due to the low efficiency of CCOT, and the feature embeddings learned in its offline training phase are not optimal for CCOT.

Meta-Learning Based Few-Shot Learning

Meta-learning studies which aspects of the learner affect generalization across a distribution of tasks. Recently, meta-learning approaches based on differentiable convex optimization have greatly promoted the development of few-shot learning. Instead of nearest-neighbor based learners, MetaOptNet [Lee et al.2019] uses discriminatively trained linear predictors as base learners to learn representations for few-shot learning; it aims at learning feature embeddings that generalize well under a linear classification rule for novel categories. [Bertinetto et al.2019] proposes both closed-form and iterative solvers, based on ridge regression and logistic regression components, to teach a deep network to use standard machine learning tools as part of its internal model, enabling it to adapt to novel data quickly.

To the best of our knowledge, our proposed training framework is the first to apply a differentiable closed-form solver to learning feature embeddings for visual tracking without approximate methods and to achieve state-of-the-art performance on multiple challenging benchmarks.

Proposed Method

We propose a novel training framework for tracking. Our approach, called DCFST, mainly focuses on learning feature embeddings in an end-to-end way that generalize well to trackers based on an online discriminatively trained ridge regression model. As shown in Fig. 2, our training framework receives a pair of RGB images, a training image and a test image, as its inputs for feature embedding learning. It consists of three components: 1) CNNs for feature extraction; 2) a ridge regression solver for discriminative model training; 3) the shrinkage loss for regression learning. In this section, we introduce each of them as well as the inference procedure of the proposed DCFST.

Features Extraction Network

For each input image, the feature extraction procedure consists of the following five steps:

  1. n sample RoIs with the target size are obtained by uniform sampling. In addition, their Gaussian labels are calculated as in KCF.

  2. ResNet-18 [He et al.2016] Block3 and Block4 backbone feature maps extracted from the input image are passed through two convolutional layers to obtain the learned feature maps of the input image, respectively, each at its own stride.

  3. Each sample RoI is pooled to feature maps of a fixed size using PrPool layers [Jiang et al.2018] and further mapped to feature vectors using fully-connected layers. Specifically, the PrPool layers following the two learned feature maps have fixed output sizes, and each of the subsequent fully-connected layers outputs a feature vector of the same dimension.

  4. The two feature vectors from each sample RoI are concatenated as the learned feature vector of this sample; its dimension is denoted as d.

  5. The learned feature vectors of all training sample RoIs form the training data matrix X, and likewise the test data matrix X'.

It is worth noting that, different from CFCF [Gundogdu and Alatan2018] and CFNet [Valmadre et al.2017], whose training data matrices are circulant and whose training samples are mostly virtual, the training data matrix and training samples in our DCFST are non-circulant and really sampled.
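To make steps 3 to 5 concrete, the following is a minimal sketch of pooling sample RoIs from one learned feature map and forming the rows of a data matrix. It is written under stated assumptions: torchvision's roi_align is used as a stand-in for the PrPool layer of [Jiang et al.2018], and the pooled size, stride, and layer names are illustrative placeholders rather than the paper's settings.

```python
import torch
from torchvision.ops import roi_align

def extract_sample_features(feat_map, rois, fc, pooled_size=4, stride=16):
    """Pool each sample RoI from a learned feature map and map it to a feature vector.

    feat_map: (1, C, H, W) learned feature map of the input image.
    rois:     (n, 4) boxes in image coordinates, one per sample RoI.
    fc:       fully-connected layer mapping pooled features to a vector.
    Returns an (n, d) matrix whose rows are learned feature vectors.
    """
    # Prepend the batch index required by roi_align: (n, 5) -> [idx, x1, y1, x2, y2].
    idx = torch.zeros(rois.size(0), 1, device=rois.device)
    boxes = torch.cat([idx, rois], dim=1)
    # roi_align here stands in for the PrPool layer used in the paper.
    pooled = roi_align(feat_map, boxes, output_size=pooled_size,
                       spatial_scale=1.0 / stride)
    return fc(pooled.flatten(start_dim=1))  # (n, d) rows of the data matrix
```

Applying this to both the Block3 and Block4 learned feature maps and concatenating the two resulting vectors per RoI yields the d-dimensional rows of X and X'.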

Efficient Ridge Regression Solver

For discriminatively trained trackers, the main role of the discriminator or regressor is to train a model that not only fits the training samples well but also generalizes well to the test samples online. In addition to the discriminant model solver itself, features are crucial to the generalization ability of the model. Therefore, we advocate learning feature embeddings for a specific discriminant model.

As a classical discriminant model, the ridge regression model has been confirmed to be simple, efficient and effective in the field of visual object tracking by many state-of-the-art trackers [Henriques et al.2014, Danelljan et al.2015, Danelljan et al.2017, Sun et al.2018]. It can not only exploit all positive and negative examples to train a good regressor, but also effectively use high-dimensional feature embeddings, since the model capacity can be controlled by l2-norm regularization. Moreover, it has two important mathematical properties: it has a closed-form solution and is implicitly differentiable under the optimality condition. Therefore, following previous works [Lee et al.2019, Bertinetto et al.2019], the ridge regression solver can be embedded as a layer with efficient forward and backward processes in training deep networks.

The optimization problem of ridge regression in our approach can be formulated as

\min_{w} \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2, \qquad (1)

where X \in \mathbb{R}^{n \times d} and y \in \mathbb{R}^{n} contain the n sample pairs of input d-dimensional feature vectors and output labels, stacked as rows, and \lambda is the regularization parameter. The optimal solution of Problem 1 can be expressed as

w^{*} = (X^{\top}X + \lambda I_{d})^{-1} X^{\top} y. \qquad (2)

Obtaining the optimal w^{*} by directly using Eq. 2, however, is time-consuming, because X^{\top}X + \lambda I_{d} is a d \times d matrix and the time complexity of matrix inversion is O(d^{3}). Therefore, we usually obtain w^{*} by solving the system of linear equations (X^{\top}X + \lambda I_{d})\,w = X^{\top}y with the Gauss elimination method, whose time complexity is still O(d^{3}) but with a smaller constant. Even so, the running time grows cubically with the dimension of the feature embeddings, and high-dimensional feature embeddings are often used in deep networks. To remove the dependence of the time complexity on the dimension of the feature embeddings, we propose to employ the Woodbury formula [Petersen, Pedersen, and others2008], which can be written as

(X^{\top}X + \lambda I_{d})^{-1} X^{\top} = X^{\top} (XX^{\top} + \lambda I_{n})^{-1}, \qquad (3)

where X \in \mathbb{R}^{n \times d}. The right-hand side of Eq. 3 allows us to obtain w^{*} with time complexity O(n^{3}), and visual tracking inherently involves few samples; in other words, n is small. Therefore, when the dimension of the feature embeddings is larger than the number of samples, that is, d > n, we use the right-hand side of Eq. 3 to obtain w^{*}; otherwise, the left-hand side is used.

Last but not least, to integrate the ridge regression solver into deep network training, we need the gradients of w^{*} with respect to the network outputs for the backward process. Fortunately, because the ridge regression problem is implicitly differentiable under the optimality condition and has the closed-form solution of Eq. 2, these gradients exist and are obtained automatically by standard automatic differentiation packages such as PyTorch and TensorFlow. Specifically, given \lambda as well as X and y, which are produced by the feature extraction network, we only need several lines of code to obtain w^{*} in the forward process, and no extra code is needed for the backward process.
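As a concrete illustration, a minimal PyTorch sketch of such a solver layer might look as follows. The function name and the default \lambda are ours, and torch.linalg.solve is assumed to be available (PyTorch 1.9 or later); gradients flow through the solve automatically, so no custom backward pass is required.

```python
import torch

def ridge_solver(X, y, lam=0.1):
    """Differentiable closed-form ridge regression solver (Eqs. 2 and 3).

    X: (n, d) data matrix of learned feature vectors.
    y: (n, 1) regression labels.
    Returns w: (d, 1), the closed-form ridge solution.
    """
    n, d = X.shape
    eye = lambda k: torch.eye(k, dtype=X.dtype, device=X.device)
    if d > n:
        # Woodbury identity (Eq. 3): solve an n x n system instead of d x d.
        w = X.t() @ torch.linalg.solve(X @ X.t() + lam * eye(n), y)
    else:
        # Normal equations (Eq. 2) via a linear solve rather than inversion.
        w = torch.linalg.solve(X.t() @ X + lam * eye(d), X.t() @ y)
    return w
```

Both branches compute the same mathematical solution; the switch on d > n only changes the cost, from O(d^{3}) to O(n^{3}).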

Fast Convergence with Shrinkage Loss

After solving for w^{*} with X and y, we employ it to predict the regression values of the test samples in the test data matrix X' as X'w^{*}. Then, the fitting error on the test samples, that is, the loss, is calculated for the backward process.

We find that there exists an extreme foreground-background class imbalance during training, and this problem seriously affects the convergence speed and the generalization ability of the learned feature embeddings when the general mean square error loss is used, as in CFCF [Gundogdu and Alatan2018] and CFNet [Valmadre et al.2017]. We believe this is because the vast number of background samples are easy ones, and they make the proposed network difficult to train. However, this problem has received little attention in previous similar approaches such as CFCF and CFNet.

In order to solve the problem above, we propose a new shrinkage loss, which can be written as

L = \sum_{i} \frac{l_{i}^{2}}{1 + \exp\big(a\,(c - l_{i})\big)}, \qquad l_{i} = \big| x_{i}'^{\top} w^{*} - y_{i}' \big|, \qquad (4)

where x_{i}' and y_{i}' denote the i-th test sample and its label, and a and c are hyper-parameters controlling the shrinkage speed and the localization, respectively. We employ it as the loss function for efficient and effective training in our task. Specifically, it down-weights the loss assigned to easy examples and mainly focuses on a sparse set of hard examples, thus preventing the vast number of easy negatives from overwhelming the feature embedding learning during training.

In fact, Eq. 4 is a modified version of the original shrinkage loss [Lu et al.2018], which is designed for deep regression learning and can be written with the signed difference as:

L = \sum_{i} \frac{l_{i}^{2}}{1 + \exp\big(a\,(c - l_{i})\big)}, \qquad l_{i} = x_{i}'^{\top} w^{*} - y_{i}'. \qquad (5)

What Eq. 4 and Eq. 5 have in common is that both are used to address the foreground-background class imbalance problem in regression learning, where a large number of negative samples are in general easy ones. The main difference between them is that in Eq. 5 a test sample is regarded as an easy one when its predicted value is larger than its ground truth and their difference is less than c, whereas in Eq. 4 we only consider the absolute difference to determine whether a test sample is easy or not. We find that this change not only accelerates convergence in our network training but also improves the tracking accuracy in our online inference. Moreover, Eq. 4 can also be implemented with several lines of code under current popular deep learning frameworks.
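For instance, a hedged sketch of Eq. 4 in PyTorch could read as follows; the hyper-parameter defaults a=10 and c=0.2 are illustrative assumptions borrowed from common shrinkage-loss settings, not the paper's reported values.

```python
import torch

def shrinkage_loss(pred, target, a=10.0, c=0.2):
    """Modified shrinkage loss of Eq. 4 on test-sample predictions.

    pred:   (n, 1) predicted regression values X'w.
    target: (n, 1) ground-truth Gaussian labels y'.
    """
    l = (pred - target).abs()                      # absolute regression error
    weight = 1.0 / (1.0 + torch.exp(a * (c - l)))  # ~0 for easy samples, ~1 for hard ones
    return (weight * l.pow(2)).sum()
```

The modulating weight vanishes as l falls below c, which is exactly how the loss prevents the many easy negatives from dominating training.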

Online Tracking

Training. In online inference, given the input image as well as the target position and size in frame t, we first obtain the training data matrix X_t, then train the ridge regression model with the same approaches used in offline training, including sampling, feature extraction, and model solving.

Update. To robustly locate the target object, updating the appearance model of a tracker is often necessary. Following the update method used in [Henriques et al.2014, Kiani Galoogahi, Fagg, and Lucey2017, Valmadre et al.2017], we update the model by a linear weighting based approach, which can be expressed as

\bar{X}_{t} = (1 - \gamma)\,\bar{X}_{t-1} + \gamma\,X_{t}, \qquad (6)

where X_{t} is the actual training data matrix of the learning region in frame t, \bar{X}_{t} is the updated training data matrix, and \gamma is the learning rate.

Localization. Given the trained regression model w and the test data matrix X' of the search region, we detect the target by

\hat{y}' = X' w, \qquad (7)

and the sample corresponding to the maximum value in \hat{y}' is regarded as the target object.

Refine. After locating the target in frame t, we refine the target bounding box by ATOM [Danelljan et al.2019].
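Putting the update, training and localization steps together, one online tracking step might be sketched as follows; the function reuses the hypothetical ridge_solver above, and all names and default rates are our own illustrative choices.

```python
def track_step(X_bar, X_t, X_search, y, lam=0.1, gamma=0.01):
    """One online tracking step following Eqs. 6 and 7.

    X_bar:    running training data matrix from previous frames.
    X_t:      data matrix sampled from the learning region of frame t.
    X_search: data matrix of the search-region sample RoIs.
    y:        Gaussian regression labels of the training samples.
    """
    X_bar = (1.0 - gamma) * X_bar + gamma * X_t  # Eq. 6: linear model update
    w = ridge_solver(X_bar, y, lam)              # re-train the ridge regressor
    scores = X_search @ w                        # Eq. 7: regression values
    best_roi = scores.argmax()                   # highest value marks the target
    return X_bar, w, best_roi
```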

Experiments

We evaluate our DCFST on six public benchmarks, OTB2015 [Wu, Lim, and Yang2015], NFS [Kiani Galoogahi et al.2017], TrackingNet [Muller et al.2018], GOT10k [Huang, Zhao, and Huang2018], VOT2018 [Kristan et al.2018] and VOT2019 [Kristan et al.2018], then compare its performance with the state-of-the-art trackers on each benchmark. Code will be made available.

Implementation Details

Platform. We implement the proposed DCFST in Python using the PyTorch toolbox [Paszke et al.2017]. Our experiments are performed on Linux with an Intel E5-2630 CPU @2.20GHz and a single TITAN X (Pascal) GPU. Our DCFST runs at over 30 FPS on average.

Training Data. To increase the generalization capability of the learned feature embeddings, we use the training splits of the recently introduced large-scale tracking datasets TrackingNet [Muller et al.2018], GOT10k [Huang, Zhao, and Huang2018], and LaSOT [Fan et al.2019]. Each pair of training and test images is sampled from a video snippet within the nearest 100 frames. For the training image, we sample a square patch centered at the target, with an area equal to a fixed multiple of the target area. For the test image, we sample a similar patch, with a random translation and scale relative to the target. These cropped regions are then resized to a fixed size. In addition, we use image flipping and color jittering for data augmentation.
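As an illustration, the pair sampling and augmentation described above might be sketched as follows; the jitter magnitudes and the helper names are illustrative assumptions, not the paper's exact settings.

```python
import random
import torchvision.transforms as T

# Flipping and color jittering as described; magnitudes are assumptions.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

def sample_pair(frames, max_gap=100):
    """Pick a (training, test) frame pair at most 100 frames apart."""
    i = random.randrange(len(frames))
    j = min(len(frames) - 1, i + random.randint(1, max_gap))
    return frames[i], frames[j]
```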

Learning Setting. For the backbone network, we freeze all weights during training. For the head network, weights are randomly initialized with zero-mean Gaussian distributions. We train the head network for 40 epochs with 1500 iterations per epoch and 48 image pairs per batch, giving a total training time of less than 30 hours on a single TITAN X (Pascal) GPU. The ADAM [Kingma and Ba2014] optimizer is employed with an initial learning rate of 0.005, decayed by a factor of 0.2 every 15 epochs.
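These settings translate almost directly into PyTorch; in the sketch below, head stands for the trainable head network and the loop bodies are elided.

```python
import torch

# ADAM with the stated initial learning rate, decayed by 0.2 every 15 epochs.
optimizer = torch.optim.Adam(head.parameters(), lr=0.005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.2)

for epoch in range(40):      # 40 epochs of 1500 iterations each
    for it in range(1500):
        pass                 # forward pass, shrinkage loss, backward, optimizer.step()
    scheduler.step()
```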

Parameters. We sample n RoIs for each input image. The regularization parameter \lambda in Eq. 1 is set to 0.1. The two hyper-parameters a and c in Eq. 4, as well as the learning rate \gamma in Eq. 6, are set empirically.

Figure 3: The mean success plots of our DCFST and eight state-of-the-art trackers on OTB2015. The mean AUCs are reported in the legends. DCFST achieves the top AUC score.
Tracker DCFST ATOM PTAV RTMDNet SiamDW ECO MDNet VITAL DaSiamRPN
Mean Overlap 0.864 0.832 0.776 0.822 0.840 0.842 0.852 0.857 0.858
Table 1: The mean overlap of our DCFST and eight state-of-the-art trackers on OTB2015. DCFST achieves the top score.
Tracker DCFST DiMP50 DiMP18 ATOM ECO CCOT MDNet HDT SFC FCNT
Mean AUC 0.628 0.620 0.610 0.584 0.466 0.488 0.429 0.403 0.401 0.397
Table 2: The mean AUC of our DCFST and nine state-of-the-art trackers on NFS. DiMP18 and DiMP50 are the DiMP tracker with ResNet-18 and ResNet-50 backbone networks, respectively. DCFST achieves the top score.
TRACKER DCFST ATOM DiMP18 DiMP50 CFNet SiamFC GOTURN CCOT ECO HCF MDNet
AO 0.592 0.556 0.579 0.611 0.374 0.348 0.347 0.325 0.316 0.315 0.299
SR(0.50) 0.683 0.634 0.672 0.717 0.404 0.353 0.375 0.328 0.309 0.297 0.303
SR(0.75) 0.448 0.402 0.446 0.492 0.144 0.098 0.124 0.107 0.111 0.088 0.099
Table 3: State-of-the-art comparison on the GOT10k test set in terms of average overlap (AO) and success rates (SR) at overlap thresholds 0.5 and 0.75. DiMP18 and DiMP50 are the DiMP tracker with ResNet-18 and ResNet-50 backbone networks, respectively. Our approach, DCFST, employing ResNet-18 as its backbone network, outperforms all previous methods except DiMP50, including DiMP18, by large margins.

Evaluation on OTB2015

OTB2015 [Wu, Lim, and Yang2015] is the most popular benchmark for tracker evaluation, containing 100 videos with various challenges. In the OTB2015 experiments, we compare our DCFST against eight state-of-the-art trackers: DaSiamRPN [Zhu et al.2018], ATOM [Danelljan et al.2019], SiamDW [Zhang and Peng2019], MDNet [Nam and Han2016], PTAV [Fan and Ling2017], RT-MDNet [Jung et al.2018], VITAL [Song et al.2018], and ECO [Danelljan et al.2017]. Success plots, mean overlap precision, and AUC are employed to quantitatively evaluate all trackers; their definitions are introduced in detail in [Wu, Lim, and Yang2015]. Fig. 3 and Table 1 show the results. On both mean overlap and AUC, DCFST achieves the top performance among state-of-the-art trackers.

Additionally, Fig. 1 shows the comparison of our DCFST with CFCF [Gundogdu and Alatan2018], LSART [Sun et al.2018], DiMP [Bhat et al.2019], SiamRPN++ [Li et al.2019], C-SiamRPN [Fan and Ling2019], GCNT [Gao, Zhang, and Xu2019b], SPMT [Wang et al.2019b] along with the above trackers on both AUC and speed. DCFST achieves the best balance between accuracy and speed.

Evaluation on NFS

We evaluate our approach on the 30 FPS version of the NFS benchmark, which contains 100 challenging videos with fast-moving objects. We compare our DCFST against DiMP, ATOM, ECO, and CCOT [Danelljan et al.2016], along with the top-four trackers officially evaluated by NFS. All trackers are quantitatively evaluated by AUC.

Table 2 shows the results. DCFST achieves the top performance among state-of-the-art trackers. It is worth noting that although DiMP50 employs a deeper backbone network than our DCFST (ResNet-50 vs. ResNet-18) for feature extraction, DCFST still outperforms it. Besides, DCFST also outperforms the other state-of-the-art trackers by large margins.

Evaluation on GOT10k

We evaluate our approach on GOT10k, a large-scale tracking benchmark containing over 10,000 training videos and 180 test videos. According to the GOT10k requirement that trackers must not use external datasets for training, to ensure fair evaluation, we re-train our DCFST on the GOT10k training set and evaluate it on the test set. Moreover, following the GOT10k challenge protocol, we quantitatively evaluate trackers by average overlap and by success rates at overlap thresholds 0.5 and 0.75.

Table 3 shows the results, where we compare our DCFST against ATOM and DiMP, along with the top-seven trackers officially evaluated by GOT10k. The localization performance of our DCFST is inferior to that of DiMP50, which employs a deeper backbone network than DCFST for feature extraction (ResNet-50 vs. ResNet-18). However, with the same backbone network, ResNet-18, DCFST clearly outperforms DiMP18. Besides, DCFST also outperforms the other state-of-the-art trackers by large margins.

Figure 4: Expected average overlap on VOT2018. Best trackers are closer to the top-right corner.

Evaluation on VOT2018 and VOT2019

We present evaluation results on the 2018 and 2019 versions of the Visual Object Tracking (VOT) challenge, VOT2018 and VOT2019, each consisting of 60 sequences. We follow the VOT challenge protocol to compare trackers, which mainly reports the expected average overlap (EAO) and ranks trackers based on it.

Fig. 4 shows the EAO ranking plots, where we compare our DCFST against SiamRPN++, DiMP, and ATOM, along with the top-20 trackers on VOT2018. The performance of these trackers comes from the VOT2018 report or their original papers. In terms of EAO, our DCFST outperforms DiMP with the same backbone network, ResNet-18. Moreover, even though SiamRPN++ employs a deeper backbone network than our DCFST (ResNet-50 vs. ResNet-18) for feature extraction, DCFST still outperforms it. Besides, DCFST also outperforms the other state-of-the-art trackers, including the winner of VOT2018, by large margins.

Evaluation on VOT2019. (Work in progress.)

Evaluation on TrackingNet

Work in progress.

Figure 5: Qualitative results for the proposed DCFST, compared with five state-of-the-art trackers on four hard sequences, Matrix, Diving, Jump, and Tiger1. DCFST can track the targets accurately and robustly in these hard cases.

Qualitative Results

Fig. 5 illustrates the tracking results of our DCFST and five representative state-of-the-art trackers on four hard sequences of OTB2015 benchmark.

Conclusion

References

  • [Bertinetto et al.2016] Bertinetto, L.; Valmadre, J.; Henriques, J. F.; Vedaldi, A.; and Torr, P. H. 2016. Fully-convolutional siamese networks for object tracking. In European conference on computer vision, 850–865. Springer.
  • [Bertinetto et al.2019] Bertinetto, L.; Henriques, J. F.; Torr, P.; and Vedaldi, A. 2019. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations.
  • [Bhat et al.2019] Bhat, G.; Danelljan, M.; Van Gool, L.; and Timofte, R. 2019. Learning discriminative model prediction for tracking. arXiv preprint arXiv:1904.07220.
  • [Bolme et al.2010] Bolme, D. S.; Beveridge, J. R.; Draper, B. A.; and Lui, Y. M. 2010. Visual object tracking using adaptive correlation filters. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2544–2550. IEEE.
  • [Danelljan et al.2015] Danelljan, M.; Hager, G.; Shahbaz Khan, F.; and Felsberg, M. 2015. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE international conference on computer vision, 4310–4318.
  • [Danelljan et al.2016] Danelljan, M.; Robinson, A.; Khan, F. S.; and Felsberg, M. 2016. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In European Conference on Computer Vision, 472–488. Springer.
  • [Danelljan et al.2017] Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; and Felsberg, M. 2017. Eco: efficient convolution operators for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6638–6646.
  • [Danelljan et al.2019] Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; and Felsberg, M. 2019. Atom: Accurate tracking by overlap maximization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. Ieee.
  • [Fan and Ling2017] Fan, H., and Ling, H. 2017. Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, 5486–5494.
  • [Fan and Ling2019] Fan, H., and Ling, H. 2019. Siamese cascaded region proposal networks for real-time visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Fan et al.2019] Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; and Ling, H. 2019. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5374–5383.
  • [Gao, Zhang, and Xu2019a] Gao, J.; Zhang, T.; and Xu, C. 2019a. Graph convolutional tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Gao, Zhang, and Xu2019b] Gao, J.; Zhang, T.; and Xu, C. 2019b. Graph convolutional tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Gundogdu and Alatan2018] Gundogdu, E., and Alatan, A. A. 2018. Good features to correlate for visual tracking. IEEE Transactions on Image Processing 27(5):2526–2540.
  • [Hare et al.2015] Hare, S.; Golodetz, S.; Saffari, A.; Vineet, V.; Cheng, M.-M.; Hicks, S. L.; and Torr, P. H. 2015. Struck: Structured output tracking with kernels. IEEE transactions on pattern analysis and machine intelligence 38(10):2096–2109.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • [Henriques et al.2014] Henriques, J. F.; Caseiro, R.; Martins, P.; and Batista, J. 2014. High-speed tracking with kernelized correlation filters. IEEE transactions on pattern analysis and machine intelligence 37(3):583–596.
  • [Huang, Zhao, and Huang2018] Huang, L.; Zhao, X.; and Huang, K. 2018. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. arXiv preprint arXiv:1810.11981.
  • [Jiang et al.2018] Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; and Jiang, Y. 2018. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), 784–799.
  • [Jung et al.2018] Jung, I.; Son, J.; Baek, M.; and Han, B. 2018. Real-time mdnet. In European Conference on Computer Vision (ECCV).
  • [Kiani Galoogahi et al.2017] Kiani Galoogahi, H.; Fagg, A.; Huang, C.; Ramanan, D.; and Lucey, S. 2017. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, 1125–1134.
  • [Kiani Galoogahi, Fagg, and Lucey2017] Kiani Galoogahi, H.; Fagg, A.; and Lucey, S. 2017. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, 1135–1143.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Kristan et al.2018] Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Cehovin Zajc, L.; Vojir, T.; Bhat, G.; Lukezic, A.; Eldesokey, A.; et al. 2018. The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV), 0–0.
  • [Lee et al.2019] Lee, K.; Maji, S.; Ravichandran, A.; and Soatto, S. 2019. Meta-learning with differentiable convex optimization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Li et al.2018] Li, B.; Yan, J.; Wu, W.; Zhu, Z.; and Hu, X. 2018. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8971–8980.
  • [Li et al.2019] Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; and Yan, J. 2019. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Lu et al.2018] Lu, X.; Ma, C.; Ni, B.; Yang, X.; Reid, I.; and Yang, M.-H. 2018. Deep regression tracking with shrinkage loss. In Proceedings of the European Conference on Computer Vision (ECCV), 353–369.
  • [Ma et al.2015] Ma, C.; Huang, J.-B.; Yang, X.; and Yang, M.-H. 2015. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE international conference on computer vision, 3074–3082.
  • [Muller et al.2018] Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; and Ghanem, B. 2018. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), 300–317.
  • [Nam and Han2016] Nam, H., and Han, B. 2016. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4293–4302.
  • [Paszke et al.2017] Paszke, A.; Gross, S.; Chintala, S.; and Chanan, G. 2017. Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration.
  • [Petersen, Pedersen, and others2008] Petersen, K. B.; Pedersen, M. S.; et al. 2008. The matrix cookbook. Technical University of Denmark 7(15):510.
  • [Song et al.2018] Song, Y.; Ma, C.; Wu, X.; Gong, L.; Bao, L.; Zuo, W.; Shen, C.; Lau, R. W.; and Yang, M.-H. 2018. Vital: Visual tracking via adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8990–8999.
  • [Sun et al.2018] Sun, C.; Wang, D.; Lu, H.; and Yang, M.-H. 2018. Learning spatial-aware regressions for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8962–8970.
  • [Tao, Gavves, and Smeulders2016] Tao, R.; Gavves, E.; and Smeulders, A. W. 2016. Siamese instance search for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1420–1429.
  • [Valmadre et al.2017] Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; and Torr, P. H. 2017. End-to-end representation learning for correlation filter based tracking. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 5000–5008. IEEE.
  • [Wang et al.2018] Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; and Maybank, S. 2018. Learning attentions: residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4854–4863.
  • [Wang et al.2019a] Wang, G.; Luo, C.; Xiong, Z.; and Zeng, W. 2019a. Spm-tracker: Series-parallel matching for real-time visual object tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Wang et al.2019b] Wang, G.; Luo, C.; Xiong, Z.; and Zeng, W. 2019b. Spm-tracker: Series-parallel matching for real-time visual object tracking. arXiv preprint arXiv:1904.04452.
  • [Wu, Lim, and Yang2015] Wu, Y.; Lim, J.; and Yang, M.-H. 2015. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9):1834–1848.
  • [Zhang and Peng2019] Zhang, Z., and Peng, H. 2019. Deeper and wider siamese networks for real-time visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Zhu et al.2018] Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; and Hu, W. 2018. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), 101–117.