Visual object tracking, which automatically tracks a specified target through a changing video sequence, is a fundamental problem in many areas such as visual analysis, autonomous driving, pose tracking [64, 65, 35] and robotics [42, 26, 73, 60, 20, 23, 22, 21, 74]. On the one hand, a core problem of tracking is how to detect and locate the object accurately in changing scenarios such as illumination variations, scale variations, occlusions, shape deformation, and camera motion [61, 52, 30, 44]. On the other hand, tracking is a time-critical problem because it must be performed in every frame of a sequence. Therefore, improving accuracy, robustness and efficiency are the main development directions of recent tracking approaches.
As the core component of a tracker, the appearance model can be divided into generative models and discriminative models. In generative models, candidates are searched to minimize reconstruction errors. Representative sparse coding methods [55, 54]
have been exploited for visual tracking. In discriminative models, tracking is regarded as a classification problem that separates foreground from background. Numerous classifiers have been adopted for object tracking, such as structured support vector machines (SVM), boosting and online multiple instance learning. Recently, significant attention has been paid to discriminative correlation filter (DCF) based methods [18, 10, 41, 36] for real-time visual tracking. DCF trackers can efficiently train a regressor by exploiting the properties of circular correlation and performing operations in the Fourier domain. Thus conventional DCF trackers can perform at more than 100 FPS [18, 19], which makes these approaches significant for real-time applications. Many improvements to DCF tracking have also been proposed, such as SAMF and fDSST for scale changes, LCT for long-term tracking, and SRDCF and BACF to mitigate boundary effects. Better performance is obtained, but the high-speed property of DCF is lost. Moreover, all these methods use handcrafted features, which hinders their accuracy and robustness.
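As a concrete illustration of the circular-correlation trick these DCF trackers rely on, the following sketch trains a single-channel correlation filter in closed form in the Fourier domain (a MOSSE/KCF-style ridge regression; the array sizes, `lam` value and function names are illustrative, not taken from any cited tracker):

```python
import numpy as np

def train_dcf(x, y, lam=1e-3):
    """Closed-form DCF training in the Fourier domain.

    x:   (H, W) single-channel training patch
    y:   (H, W) desired Gaussian-shaped response
    lam: ridge-regression regularization weight
    Returns the learned filter in the Fourier domain.
    """
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    # Circular correlation makes the solution element-wise per frequency.
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def detect(Hf, z):
    """Apply the filter to a new patch z and return the spatial response map."""
    return np.real(np.fft.ifft2(Hf * np.fft.fft2(z)))
```

Because every frequency is solved independently, both training and detection cost only a few FFTs, which is the source of the 100+ FPS speed quoted above.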
Recently, the visual tracking community has started to focus on deep trackers that exploit the strength of CNNs. These deep trackers come from two directions. One is the DCF framework with deep features, which replaces the handcrafted features with CNN features in DCF trackers [40, 47, 12, 9]. The other is to design tracking networks and pre-train them to learn target-specific features for each new video [4, 56]. Despite their notable performance, most of these approaches separate the tracking system into several individual components, which may lead to suboptimal performance. Furthermore, most of these trackers are not designed for real-time applications because of their time-consuming feature extraction and complex optimization details. For example, the winners of VOT2015 and VOT2016 run at less than 1 FPS on modern GPUs.
We address these two problems (lack of end-to-end training and low speed) by introducing a unified convolutional tracker (UCT) that learns the features and performs the tracking process simultaneously. UCT is an end-to-end and extensible framework for tracking, which achieves high performance in terms of both accuracy and speed. Specifically, the proposed UCT treats the feature extractor and the tracking process both as convolution operations, resulting in a fully convolutional network architecture. In online tracking, the whole patch can be predicted from the foreground response map by one-pass forward propagation. Additionally, efficient model updating and scale handling strategies are proposed to ensure the tracker's real-time property.
The contributions of this paper are threefold:
1) We propose unified convolutional networks to learn convolutional features and perform the tracking process simultaneously. The feature extractor and the tracking process are both treated as convolution operations that can be trained together. End-to-end training ensures that the learned CNN features are tightly coupled with the tracking process.
2) In online tracking, efficient updating and scale handling strategies are incorporated into the tracking framework. The proposed standard UCT (with VGG-Net) and UCT-lite (with ZF-Net) can track generic objects at 58 FPS and 154 FPS respectively, far beyond real-time requirements.
3) Extensive experiments are carried out on four tracking benchmarks: OTB2013, OTB2015, VOT2015 and VOT2016. Results demonstrate that the proposed tracking algorithm performs favorably against existing state-of-the-art methods in terms of accuracy and speed.
This paper extends our earlier work , which was published in the ICCV2017 VOT Workshop. In this paper, we additionally provide a comprehensive summary of unified convolutional networks for high-performance visual tracking. Furthermore, we change the backbone network from ResNet to VGG-Net, and improve the training strategies and training data. These enhancements allow us to increase performance by fine-tuning more convolutional layers while achieving a higher running speed. The proposed improvements result in superior tracking performance and a speedup. The experiments are extended by comparing our approach with more state-of-the-art trackers. Finally, we also present results on the VOT2016 benchmark.
The rest of this paper is organized as follows: Section II summarizes related work on recent visual tracking. The unified convolutional network for tracking is introduced in Section III, including the overall architecture, formulation and training process. Section IV introduces our online tracking algorithm, which consists of model updating and scale estimation. Experiments on four challenging benchmarks are presented in Section V. Section VI concludes the paper with a summary.
II Related work
Visual tracking is a significant problem in computer vision systems and a series of approaches have been successfully proposed. Since our main contribution is the UCT framework for high-performance visual tracking, we give a brief review of three directions closely related to this work: CNN-based trackers, real-time trackers, and fully convolutional networks (FCN).
II-A On CNN-based trackers
Inspired by the success of CNNs in object recognition [32, 16, 48, 69, 37, 67, 68], researchers in the tracking community have started to focus on deep trackers that exploit the strength of CNNs [57, 34, 2]. Since DCF provides an excellent framework for recent tracking research, the first trend is the combination of the DCF framework with CNN features. In HCF  and HDT , a CNN is employed to extract features instead of handcrafted features, and the final tracking results are obtained by combining hierarchical responses and hedging weak trackers, respectively. DeepSRDCF  exploits shallow CNN features in a spatially regularized DCF framework. Another trend in deep trackers is to design tracking networks and pre-train them to learn target-specific features and handle the challenges of each new video. MDNet  trains a small-scale network by a multi-domain method, thus separating domain-independent information from domain-specific layers. CCOT
employs an implicit interpolation method to solve the learning problem in the continuous spatial domain. These trackers have two major drawbacks: firstly, they can only tune their hyper-parameters heuristically, since feature extraction and the tracking process are separate and not trained end-to-end. Secondly, most of these trackers are not designed for real-time applications.
II-B On real-time trackers
Other than accuracy and robustness, the speed of a visual tracker is a crucial factor in many real-world applications. Therefore, a practical tracking approach should be accurate and robust while operating in real time. Classical real-time trackers, such as NCC  and Mean-shift , perform tracking using matching. Recently, discriminative correlation filter (DCF) based methods, which efficiently train a regressor by exploiting the properties of circular correlation and performing operations in the Fourier domain, have drawn attention for real-time visual tracking. Conventional DCF trackers such as MOSSE, CSK and KCF can perform at more than 100 FPS [5, 19, 18]. Subsequently, a series of trackers that follow the DCF method have been proposed. In the fDSST algorithm , the tracker searches over the scale space of correlation filters to handle the variation in object size. Staple
combines complementary template and color cues in a ridge regression framework. CFLB and BACF  mitigate the boundary effects of DCF in the Fourier domain. Nevertheless, all these DCF-based trackers employ handcrafted features, which hinders their performance.
Recent years have witnessed significant advances in CNN-based real-time tracking approaches [4, 17, 53, 59, 72, 66, 33, 71]. SiamFC  proposes a fully convolutional Siamese network to predict the motion between two frames. The network is trained off-line and evaluated without any fine-tuning. Similarly to SiamFC, the GOTURN tracker  predicts the motion between successive frames using a deep regression network. These two trackers can perform at 86 FPS and 100 FPS respectively on a GPU because no fine-tuning is performed online. Although their simplicity and fixed-model nature lead to high speed, these methods lose the ability to update the appearance model online, which is often critical to account for drastic appearance changes in tracking scenarios. It is worth noting that CFNet  and DCFNet  interpret the correlation filter as a differentiable layer in a Siamese tracking framework, thus achieving end-to-end representation learning. Their main drawback is their unsatisfying performance. Therefore, there is still room for performance improvement among real-time deep trackers.
II-C On fully convolutional trackers
Fully convolutional networks can efficiently learn to make dense predictions for visual tasks like semantic segmentation, detection and tracking. Paper  transforms fully connected layers into convolutional layers to output a heat map for semantic segmentation. The region proposal network (RPN) in Faster R-CNN  is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. DenseBox  is an end-to-end FCN detection framework that directly predicts bounding boxes and object class confidences through the whole image. In the tracking literature, FCNT  proposes a two-stream fully convolutional network to capture both general and specific object information for visual tracking. However, its tracking components are still independent, so the performance may be impaired. In addition, FCNT can only perform at 3 FPS on a GPU because its layer switch mechanism and feature map selection method are time-consuming, which hinders it from real-time applications. Compared with FCNT, our UCT treats the feature extractor and the tracking process as a unified architecture and trains them end-to-end, resulting in a more compact and much faster tracking approach.
III Unified convolutional networks for tracking
In this section, the overall architecture of the proposed UCT is introduced first. Afterwards, we detail the formulation of the convolutional operation in both the training and test stages. Lastly, the training process of UCT is described.
III-A UCT Architecture
The overall framework of our tracking approach is shown in Figure 1; it consists of a feature extractor and convolutions that perform the tracking process. The solid lines indicate the online tracking process, while the dashed box and dashed lines are involved in off-line training and training on the first frame. The search window of the current frame is cropped and sent to the unified convolutional networks. The estimated new target position is obtained by finding the maximum value of the response map. A separate 1-dimensional convolutional branch is used to estimate the target scale, and model updating is performed if necessary, which enables our tracker to perform at a real-time rate. Each feature channel in the extracted sample is always multiplied by a Hann window, as described in . The proposed framework has two advantages. First, we adopt two groups of convolutional filters to perform the tracking process, trained jointly with the feature extractor. Compared to the two-stage approaches that combine the DCF framework with CNN features [40, 47, 9], our end-to-end training pipeline is generally preferable, because the parameters in all components can cooperate to achieve the tracking objective. Second, during online tracking, the whole patch can be predicted from the foreground heat map by one-pass forward propagation, so redundant computation is saved; whereas in  and , the network has to be evaluated many times on samples cropped from the frame, and the overlap between patches leads to a lot of redundant computation. Besides, we adopt efficient scale handling and model updating strategies to ensure real-time speed.
In the UCT formulation, the aim is to learn a series of convolution filters $f$ from training samples $x$. Each sample is extracted using another CNN from an image region. The desired output $y$ is a response map which includes a target score for each location in the sample $x$. The convolutional response of the filter on sample $x$ is given by

$$R(x) = \sum_{l=1}^{d} x^{l} \ast f^{l} \tag{1}$$

where $x^{l}$ and $f^{l}$ denote the $l$-th channel of the extracted CNN features and the desired filters, respectively, and $\ast$ denotes the convolution operation. The filter can be trained by minimizing the loss between the response on sample $x$ and the corresponding Gaussian label $y$:

$$L = \left\lVert R(x) - y \right\rVert^{2} + \lambda \sum_{l=1}^{d} \left\lVert f^{l} \right\rVert^{2} \tag{2}$$

The second term in (2) is a regularization with a weight parameter $\lambda$.
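The gradient-descent view of this loss can be sketched directly. The snippet below minimizes the squared error between a single-channel convolutional response and a Gaussian label by explicit gradient steps (kernel size, learning rate and step count are illustrative assumptions; a real implementation would use a deep-learning framework's autograd):

```python
import numpy as np

def conv2same(x, f):
    """Single-channel 'same'-size correlation with zero padding."""
    k = f.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * f)
    return out

def train_filter(x, y, k=5, lam=1e-3, lr=1e-4, steps=300):
    """Minimize ||conv(x, f) - y||^2 + lam * ||f||^2 by gradient descent."""
    H, W = x.shape
    p = k // 2
    xp = np.pad(x, p)
    f = np.zeros((k, k))
    for _ in range(steps):
        e = conv2same(x, f) - y          # residual of the response map
        g = np.empty_like(f)
        for u in range(k):               # d||e||^2/df: correlate x with e
            for v in range(k):
                g[u, v] = np.sum(xp[u:u + H, v:v + W] * e)
        f -= lr * (2 * g + 2 * lam * f)  # gradient step on the ridge loss
    return f
```

Because both the response and the gradient are themselves convolutions, the whole optimization stays inside a fully convolutional computation graph, which is the property the unified architecture exploits.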
In the test stage, the trained filters are used to evaluate an image patch centered around the predicted target location. The evaluation is applied in a sliding-window manner, and thus can be operated as a convolution:

$$R(z) = \sum_{l=1}^{d} z^{l} \ast f^{l} \tag{3}$$

where $z$ denotes the feature map extracted from the last target position, including context.
Note that the formulation in our framework is similar to DCF, which solves this ridge regression problem in the frequency domain by circularly shifting the sample. Different from DCF, we adopt gradient descent to solve equation (2), resulting in convolution operations. Since the sample is also extracted by a CNN, these convolution operations can be naturally unified in a fully convolutional network. Compared to the DCF framework, our approach has three advantages: firstly, both the feature extraction and tracking convolutions can be pre-trained simultaneously, while DCF based trackers can only tune their hyper-parameters heuristically. Secondly, model updating can be performed by stochastic gradient descent (SGD), which maintains the long-term memory of target appearance. Lastly, our framework is much faster than the DCF framework with CNN features.
Since the objective function defined in equation (2) is convex, it is possible to obtain an approximate global optimum via gradient descent with an appropriate learning rate in a limited number of steps. We divide the training process into two periods: off-line training, which encodes prior tracking knowledge, and training on the first frame to adapt to the specific target.
In off-line training, the goal is to minimize the loss function in equation (2). In tracking, the target position in the last frame is not always centered in the current cropped patch, so for each image the training patch centered at the given object is cropped with jittering. The jittering consists of translation and scale jittering, which approximates the variation in adjacent frames during tracking. The cropped patch also includes background information as context. In training, the final response map is obtained by the last convolution layer with one channel. The label is generated using a Gaussian function with variances proportional to the width and height of the object. Then the loss can be computed and gradient descent can be performed to minimize equation (2). In this stage, the overall network consists of a network pre-trained on ImageNet (VGG-Net in UCT and ZF-Net in UCT-lite) and the following convolutional filters. The last part of VGG-Net or ZF-Net is trained together with the following convolutional filters to encode prior tracking knowledge, making the extracted features more suitable for tracking.
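A Gaussian label map of the kind described above can be generated in a few lines (the proportionality constant `sigma_ratio` is an assumed value for illustration, not taken from the paper):

```python
import numpy as np

def gaussian_label(map_size, obj_size, sigma_ratio=0.1):
    """Gaussian response label centered in the map.

    The standard deviations are proportional to the object's height and
    width, so wider/taller targets get a broader positive region.
    """
    H, W = map_size
    h, w = obj_size
    sy, sx = sigma_ratio * h, sigma_ratio * w
    yy, xx = np.mgrid[0:H, 0:W]
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    return np.exp(-(((yy - cy) ** 2) / (2 * sy ** 2) +
                    ((xx - cx) ** 2) / (2 * sx ** 2)))
```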
The goal of training on the first frame is to adapt to a specific target. The network architecture follows that of off-line training, while the later convolutional filters are randomly initialized from a zero-mean Gaussian distribution. Only these randomly initialized layers are trained using SGD on the first frame. Off-line training encodes prior tracking knowledge and constitutes a tailored feature extractor. We perform online tracking with and without off-line training to illustrate this effect. In Figure 2, we show tracking results and corresponding response maps with and without off-line training. In the left part of Figure 2, the target singer is moving to the right; the response map with off-line training effectively reflects this translation, while the response map without off-line training does not, so the tracker without off-line training misses this critical frame. In the right part of Figure 2, the target player is occluded by another player; the response map without off-line training fluctuates and the tracking result is affected by the distractor, while the response map with off-line training remains discriminative. The results are somewhat unsurprising, since CNN features trained on ImageNet classification data are expected to have greater invariance to position and intra-class variation. In contrast, we can obtain features more suitable for tracking by end-to-end off-line training.
IV Online tracking
After off-line training and training on the first frame, the learned network is used to perform online tracking by equation (3). The estimate of the current target state is obtained by finding the maximum response score. Since we use a fully convolutional network architecture, the whole patch can be predicted from the foreground heat map by one-pass forward propagation, saving redundant computation; whereas in  and , the network has to be evaluated many times on samples cropped from the frame, and the overlap between patches leads to a lot of redundant computation.
IV-A Model update
Most tracking approaches update their models in each frame or at a fixed interval [18, 19, 40, 12]. However, this strategy may introduce false background information when tracking is inaccurate or the target is occluded or out of view. In the proposed method, model update is decided by evaluating the tracking results. Specifically, we consider the maximum value of the response map and the distribution of the other response values simultaneously.
An ideal response map should have only one peak at the actual target position, with all other values small. Otherwise, the response will fluctuate intensely and include more peaks, as shown in Figure 3. We introduce a novel criterion called peak-versus-noise ratio (PNR) to reveal the distribution of the response map. The PNR is defined as

$$\mathrm{PNR} = \frac{R_{\max} - R_{\min}}{\operatorname{mean}\left(R \setminus \{R_{\max}\}\right)} \tag{4}$$

where $R_{\max}$ and $R_{\min}$ are the maximum and minimum values of the response map, respectively. The denominator in equation (4) represents the mean value of the response map excluding its maximum, and it approximately measures the noise. The PNR becomes larger when the response map has less noise and a sharper peak; otherwise, the PNR falls to a smaller value. We save the PNR and $R_{\max}$ and calculate their historical average values as thresholds:

$$\mathrm{PNR}_{\mathrm{th}} = \frac{1}{T}\sum_{t=1}^{T} \mathrm{PNR}_{t} \tag{5}$$

$$R_{\mathrm{th}} = \frac{1}{T}\sum_{t=1}^{T} R_{\max}^{t} \tag{6}$$
Model update is performed only when both PNR and $R_{\max}$ exceed their corresponding thresholds at the same time. The update is one step of SGD with a smaller learning rate than that used on the first frame. Figure 3 illustrates the necessity of the proposed PNR criterion by showing tracking results under occlusions. As shown in Figure 3, updating is still performed if only $R_{\max}$ is adopted; the introduced noise results in inaccurate tracking results or even failures. The PNR value significantly decreases in these unreliable frames and thus avoids unwanted updating.
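The update test can be sketched as below. This is our reading of the PNR definition and its historical-average thresholds; the function names and the running-list bookkeeping are illustrative, not the authors' implementation:

```python
import numpy as np

def pnr(response):
    """Peak-versus-noise ratio of a 2-D response map."""
    r_max, r_min = response.max(), response.min()
    # Mean of the response map excluding its single maximum value.
    noise = (response.sum() - r_max) / (response.size - 1)
    return (r_max - r_min) / (noise + 1e-12)

def should_update(response, pnr_history, rmax_history):
    """Update only when both PNR and R_max exceed their historical means."""
    p, m = pnr(response), float(response.max())
    ok = p > np.mean(pnr_history) and m > np.mean(rmax_history)
    pnr_history.append(p)
    rmax_history.append(m)
    return ok
```

A sharp single-peaked map yields a large PNR, while an occluded frame with a flat, multi-peaked response drops below the running average and suppresses the update.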
IV-B Scale estimation
A conventional approach to incorporating scale estimation is to evaluate the appearance model at multiple resolutions by performing an exhaustive scale search . However, this search strategy is computationally demanding and not suitable for real-time tracking. Inspired by , we introduce a 1-dimensional convolutional filter branch to estimate the target size, as shown in Figure 1. This scale filter is applied at an image location to compute response scores in the scale dimension, whose maximum value can be used to estimate the target scale. Learning separate convolutional filters to explicitly handle scale changes in this way is more efficient for real-time tracking.
In training and updating of the scale convolutional filters, samples are extracted from variable patch sizes centered around the target:

$$a^{n}P \times a^{n}R, \qquad n \in \left\{ -\frac{S-1}{2}, \ldots, \frac{S-1}{2} \right\} \tag{7}$$

where $S$ is the size of the scale convolutional filters, $P$ and $R$ are the current target width and height, and $a$ is the scale factor. In the scale estimation test, the sample is extracted in the same way after the translation filters are applied. Then the scale change relative to the previous frame can be obtained by maximizing the response score. Note that scale estimation is performed only when the model updating conditions are satisfied.
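The set of scale-sample sizes can be enumerated in a few lines. The defaults S = 33 and a = 1.02 follow the implementation details in Section V; the function name is ours:

```python
def scale_sample_sizes(width, height, S=33, a=1.02):
    """Patch sizes a^n * W x a^n * H for n in {-(S-1)/2, ..., (S-1)/2}.

    Returns S (width, height) pairs, geometrically spaced around the
    current target size; the middle entry is the unscaled size.
    """
    ns = range(-(S - 1) // 2, (S - 1) // 2 + 1)
    return [(a ** n * width, a ** n * height) for n in ns]
```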
The overall tracking algorithm is summarized in Algorithm 1.
V Experiments
Experiments are performed on four challenging tracking datasets: OTB2013 , OTB2015 , VOT2015  and VOT2016 . Basic information about the four datasets is summarized in Table I. For all compared trackers we use their reported results to ensure a fair comparison.
(Table I lists, for each dataset, the frame number and main challenges.)
V-A Implementation details
We adopt VGG-Net  in the standard UCT and ZF-Net  in UCT-lite as the feature extractor, respectively. In off-line training, the last six layers of VGG-Net and the last three layers of ZF-Net are fine-tuned. The tracking process is performed by two convolutional layers with Sigmoid activation. Our training data comes from VID, containing the training and validation sets. In each frame, a patch is cropped around the ground truth with a padding of 1.8 and resized to a fixed input size. The translation and scale jittering are 0.05 and 0.02 proportional to the size of the images, respectively. We apply SGD with a momentum of 0.9 to train the network and set the weight decay to 0.005. The model is trained for 30 epochs. In online training on the first frame, SGD is performed for 50 steps, and $\lambda$ is set to 0.01. In online tracking, model updating is performed by one step of SGD with a smaller learning rate. $S$ and $a$ in equation (7) are set to 33 and 1.02, respectively.
V-B Results on OTB2013
OTB2013  contains 50 fully annotated sequences collected from commonly used tracking sequences. The evaluation is based on two metrics: the precision plot and the success plot. The precision plot shows the percentage of frames in which the tracking result is within a certain distance, determined by a given threshold, from the ground truth. The value at a threshold of 20 pixels is conventionally taken as the representative precision score. The success plot shows the ratio of successful frames as the threshold varies from 0 to 1, where a successful frame is one whose overlap exceeds the given threshold. The area under curve (AUC) of each success plot is used to rank the tracking algorithms.
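The two OPE metrics can be computed from per-frame center errors and overlaps as follows. This is a simplified sketch; the official OTB toolkit samples thresholds and handles edge cases slightly differently:

```python
import numpy as np

def precision_at(center_errors, threshold=20.0):
    """Fraction of frames whose center location error is within threshold."""
    e = np.asarray(center_errors, dtype=float)
    return float((e <= threshold).mean())

def success_auc(overlaps, thresholds=np.linspace(0, 1, 101)):
    """Area under the success curve: mean success rate over IoU thresholds."""
    o = np.asarray(overlaps, dtype=float)
    rates = [(o > t).mean() for t in thresholds]
    return float(np.mean(rates))
```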
In this experiment, ablation analyses are first performed to illustrate the effectiveness of each proposed component. Then we compare our method against recent real-time (and near real-time) trackers presented at top conferences and journals, including PTAV , BACF , CFNet , CACF , CSR-DCF , fDSST , SiamFC , Staple , SCT , HDT , HCF , LCT , KCF . The one-pass evaluation (OPE) is employed to compare these trackers.
V-B1 Ablation analyses
To verify the contribution of each component in our algorithm, we implement and evaluate four variations of our approach. Firstly, the effectiveness of off-line training is tested by comparison without this procedure (UCTNoOff-line), where the network is only trained on the first frame of a specific sequence. Secondly, the tracking algorithm that updates the model without the PNR constraint (UCTNoPNR, which depends only on $R_{\max}$) is compared with the proposed efficient updating method. The last two additional versions are UCT with multi-resolution scale search (UCTMulResScale) and without scale handling (UCTNoScale).
As shown in Table II, none of the variations performs as well as our full algorithm (UCT), and each component in our tracking algorithm helps to improve performance. Specifically, off-line training encodes prior tracking knowledge and constitutes a tailored feature extractor, so UCT outperforms UCTNoOff-line by a large margin. The proposed PNR constraint for model update improves performance as well as speed, since it avoids updating in unreliable frames. Although the exhaustive multi-resolution scale search improves the performance of the tracker, it brings a higher computational cost. By contrast, learning separate filters for scale in our approach achieves better performance while being computationally efficient.
V-B2 Comparison with state-of-the-art trackers
We compare our method against the state-of-the-art trackers as listed earlier. Figure 4 illustrates the precision and success plots based on center location error and bounding box overlap ratio, respectively. It clearly illustrates that our algorithm, denoted by UCT, outperforms the state-of-the-art trackers significantly in both measures. In success plot, our approach obtains an AUC score of 0.693, significantly outperforms BACF and SiamFC by 3.6% and 8.5%, respectively. In precision plot, our approach obtains a score of 0.927, outperforms BACF and SiamFC by 6.6% and 11.8%, respectively. We summarise the model update frequency, scale estimation, performance and speed of compared trackers in Table III.
Besides the standard UCT, we also implement a lite version of UCT (UCT-lite), which adopts ZF-Net  and ignores scale changes. As shown in Figure 4, UCT-lite obtains a success score of 0.634 and a precision score of 0.892 while operating at 154 FPS. Our UCT-lite approach is much faster than recent real-time trackers such as CFNet, SiamFC and Staple, while significantly outperforming them in performance.
Furthermore, the results on various challenge attributes in OTB2013 are reported for detailed performance analysis. These challenges include scale variation, fast motion, background clutter, deformation, occlusion, etc. Figure 5 demonstrates that our tracker effectively handles these challenging situations while other trackers obtain lower scores. Qualitative comparisons of our approach with four state-of-the-art trackers in changing scenarios are shown in Figure 13. The top performance can be attributed to the fact that our method encodes prior tracking knowledge by off-line training, so the extracted features are more suitable for the subsequent tracking convolution operations. For example, our tracker ranks top in the out-of-plane rotation and in-plane rotation challenges due to the end-to-end training. By contrast, the CNN features in other trackers are usually pre-trained on different tasks and are independent of the tracking process, so the achieved tracking performance may not be optimal. Furthermore, the efficient updating and scale handling strategies ensure the robustness and speed of the tracker.
V-C Results on OTB2015
OTB2015  is the extension of OTB2013 and contains 100 video sequences, some of which are more difficult to track. In this experiment, we compare our method against recent real-time trackers, including PTAV , BACF , CFNet , fDSST , SiamFC , Staple , HDT , HCF , LCT , KCF . The one-pass evaluation (OPE) is employed to compare these trackers.
Figure 6 illustrates the precision and success plots of the compared trackers. The proposed UCT approach outperforms all the other trackers in terms of both precision score and success score. Specifically, our method achieves a success score of 0.670, which outperforms the PTAV (0.631) and BACF (0.621) methods by a large margin. Since the proposed tracker adopts a unified convolutional architecture and efficient online tracking strategies, it achieves superior tracking performance at real-time speed. It is worth mentioning that our UCT-lite also provides significantly better performance and faster speed compared with SiamFC, CFNet and Staple. The model update frequency, scale estimation, performance and speed of the compared trackers are summarised in Table III.
For detailed performance analysis, we report the results on various challenge attributes in OTB2015, such as illumination variation, scale changes, occlusion, etc. Figure 7 demonstrates that our tracker effectively handles these challenging situations while other trackers obtain lower scores. Comparisons of our approach with four state-of-the-art trackers in the challenging scenarios are shown in Figure 13.
|Approaches||Publication||Model update frequency||Scale estimation||AUC in OTB2013||AUC in OTB2015||Speed (FPS)|
|PTAV||ICCV 2017||in short-term phase||yes||0.654||0.631||27|
|UCT||ours||when update criteria are satisfied||yes||0.693||0.670||58|
|UCT-lite||ours||when update criteria are satisfied||no||0.634||0.608||154|
V-D Results on VOT2015
VOT2015  consists of 60 challenging videos that are automatically selected from a pool of 356 sequences. The trackers in VOT2015 are evaluated by the expected average overlap (EAO) measure, which is the inner product of the empirically estimated average overlap curve and the typical-sequence-length distribution. The EAO measures the expected no-reset overlap of a tracker run on a short-term sequence. Besides, accuracy (mean overlap) and robustness (average number of failures) are also reported.
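Under this definition, EAO reduces to a weighted average. A minimal sketch follows (the overlap curve and length distribution here are toy inputs; the official VOT toolkit estimates both empirically from tracker runs):

```python
import numpy as np

def eao(avg_overlap_per_length, length_pdf):
    """EAO as the inner product of the expected average-overlap curve
    (indexed by sequence length) and the sequence-length distribution."""
    phi = np.asarray(avg_overlap_per_length, dtype=float)
    p = np.asarray(length_pdf, dtype=float)
    p = p / p.sum()              # normalize the length distribution
    return float(np.dot(phi, p))
```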
In the VOT2015 experiment, we present a state-of-the-art comparison with the participants in the challenge according to the latest VOT rules (see http://votchallenge.net). It is worth noting that MDNet  is not compatible with the latest VOT rules because it is trained on OTB data.
Figure 8 illustrates that our UCT ranks 1st among 61 trackers according to the EAO criterion, while performing at 58 FPS. The faster UCT-lite (154 FPS) ranks 4th by the EAO criterion. In Table IV, we list the EAO, accuracy and failures of UCT and the top 10 entries in VOT2015. UCT ranks 1st according to all three criteria. The top performance can be attributed to the unified convolutional architecture and end-to-end training.
Additionally, further attribute-based evaluations on the VOT2015 dataset with 60 videos are presented. In the VOT2015 dataset, each frame is labeled with five different attributes: camera motion, illumination change, occlusion, size change and motion change. The performance is evaluated by the expected average overlap (EAO) measure. As shown in Figure 9, our approach (UCT) achieves the best results on 4 of the 5 attributes.
V-E Results on VOT2016
The datasets in VOT2016  are the same as VOT2015, but the ground truth has been re-annotated. VOT2016 also adopts EAO, accuracy and robustness for evaluations.
In this experiment, we compare our method with the challenge participants. Figure 10 illustrates that our UCT ranks 1st among 70 trackers according to the EAO criterion. It is worth noting that our method operates at 58 FPS, which is nearly 200 times faster than CCOT  (0.3 FPS). UCT-lite ranks 6th with a speed of 154 FPS. Figure 12 shows the performance and speed of the state-of-the-art trackers; it illustrates that our tracker achieves superior performance while operating at high speed.
For detailed performance analysis, we also list the accuracy and robustness of representative trackers in VOT2016. As shown in Table V, the accuracy and robustness of the proposed UCT rank 1st and 2nd, respectively. Finally, we provide a further attribute-based evaluation on the VOT2016 dataset with 60 videos. In the VOT2016 dataset, each frame is labeled with five different attributes: camera motion, illumination change, occlusion, size change and motion change. Figure 11 visualizes the EAO of each attribute individually. Our approach (UCT) ranks 1st in 4 attributes and 2nd in 1 attribute.
V-F Qualitative Results
To visualize the superiority of our framework, we show examples of UCT results compared with recent trackers on challenging videos. As shown in Figure 13, the target in the first sequence undergoes severe fast motion. Only UCT and CCOT track successfully until the end, and the proposed UCT is more precise than CCOT, as shown in #14, #25 and #40. The targets in sequences singer2 and ironman undergo severe deformation. Since end-to-end training tightly couples the learned CNN features with the tracking process, the proposed UCT handles these challenges, whereas CCOT, CFNet, SiamFC and PTAV fail in #366 of singer2 and in #155, #157 and #160 of ironman. Sequences liquor and matrix illustrate background clutter, where UCT tracks successfully. Sequences motorRolling and skating2 contain in-plane and out-of-plane rotation; the trained UCT follows the target while the other trackers tend to drift to the background. Lastly, the proposed UCT handles partial occlusion and scale changes in sequence woman, such as in #568 and #597.
VI Conclusion
In this work, we propose a high-performance unified convolutional tracker (UCT) that learns convolutional features and performs the tracking process simultaneously. In online tracking, efficient updating and scale-handling strategies are incorporated into the network. It is worth emphasizing that our algorithm not only performs superiorly but also runs at a very fast speed, which is significant for real-time applications. On the VOT2016 dataset, the proposed UCT obtains an EAO of 0.342 at 58 FPS, yielding a 3.2% relative gain in performance and a 200-fold gain in speed over the challenge winner, CCOT .
In future work, we will apply the proposed UCT framework to mobile robots with active vision systems. It would also be interesting to explore more efficient network architectures such as MobileNet and ShuffleNet.
This work is supported in part by the National Natural Science Foundation of China under Grants No. 61403378 and 51405484, and in part by the National High Technology Research and Development Program of China under Grant No. 2015AA042307.
-  (2011) Robust object tracking with online multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (8), pp. 1619–1632. Cited by: §I.
-  (2018) Multi-hierarchical independent correlation filters for visual tracking. arXiv preprint arXiv:1811.10302. Cited by: §II-A.
-  (2016) Staple: complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §II-B, §V-B, §V-C.
-  (2016) Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision Workshop, pp. 850–865. Cited by: §I, §II-B, Fig. 13, §V-B, §V-C.
-  (2010) Visual object tracking using adaptive correlation filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550. Cited by: §II-B.
-  (2001) Template matching using fast normalized cross correlation. In Optical Pattern Recognition XII, Vol. 4387, pp. 95–103. Cited by: §II-B.
-  (2016) Visual tracking using attention-modulated disintegration and integration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4321–4330. Cited by: §V-B.
-  (2000) Real-time tracking of non-rigid objects using mean shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 142–149. Cited by: §II-B.
-  (2015) Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshop, pp. 621–629. Cited by: §I, §II-A, §III-A.
-  (2015) Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4310–4318. Cited by: §I.
-  (2017) Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (8), pp. 1561–1575. Cited by: §I, §II-B, §IV-B, §V-B, §V-C.
-  (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In Proceedings of the European Conference on Computer Vision, pp. 472–488. Cited by: §I, §II-A, §IV-A, Fig. 13, §V-E, §VI.
-  (2017) Parallel tracking and verifying: a framework for real-time and high accuracy visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Fig. 13, §V-B, §V-C.
-  (2006) Real-time tracking via on-line boosting. In Proceedings of the British Machine Vision Conference, Cited by: §I.
-  (2016) Struck: structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 2096–2109. Cited by: §I.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §I, §II-A.
-  (2016) Learning to track at 100 fps with deep regression networks. In Proceedings of the European Conference on Computer Vision, pp. 749–765. Cited by: §II-B.
-  (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 583. Cited by: §I, §II-B, §III-A, §IV-A, §V-B, §V-C.
-  (2012) Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European Conference on Computer Vision, pp. 702–715. Cited by: §I, §II-B, §IV-A.
-  (2018) Optical flow based real-time moving object detection in unconstrained scenes. arXiv preprint arXiv:1807.04890. Cited by: §I.
-  (2019) Motion cue based instance-level moving object detection. Cited by: §I.
-  (2018) An efficient optical flow based motion detection method for non-stationary scenes. arXiv preprint arXiv:1811.08290. Cited by: §I.
-  (2018) Optical flow based online moving foreground analysis. arXiv preprint arXiv:1811.07256. Cited by: §I.
-  (2015) Densebox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874. Cited by: §II-C.
-  (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. Cited by: §V-A.
-  Adaptive trajectory tracking of wheeled mobile robots based on a fish-eye camera. International Journal of Control, Automation and Systems, pp. 1–13. Cited by: §I.
-  (2017-10) Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §I, §II-B, §V-B, §V-C.
-  (2015) Correlation filters with limited boundaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4630–4638. Cited by: §II-B.
-  (2016) The visual object tracking VOT2016 challenge results. In Proceedings of the European Conference on Computer Vision Workshop, pp. 191–217. Cited by: §I, Fig. 12, §V-E, §V.
-  (2018) The sixth visual object tracking VOT2018 challenge results. In European Conference on Computer Vision, pp. 3–53. Cited by: §I.
-  (2015) The visual object tracking VOT2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshop, pp. 564–586. Cited by: §V-D, §V.
-  (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §I, §II-A.
-  (2018) High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8971–8980. Cited by: §II-B.
-  (2016) Deeptrack: learning discriminative feature representations online for robust visual tracking. IEEE Transactions on Image Processing 25 (4), pp. 1834–1848. Cited by: §II-A.
-  (2019) State-aware re-identification feature for multi-target multi-camera tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Cited by: §I.
-  (2014) A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the European Conference on Computer Vision Workshop, pp. 254–265. Cited by: §I, §IV-B.
-  (2019) Attention-guided unified network for panoptic segmentation. In CVPR, Cited by: §II-A.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §I, §II-C.
-  (2017) Discriminative correlation filter with channel and spatial reliability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §V-B.
-  (2015-12) Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §I, §II-A, §III-A, §IV-A, §V-B, §V-C.
-  (2015) Long-term correlation tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5388–5396. Cited by: §I, §V-B, §V-C.
-  Selection of observation position and orientation in visual servoing with eye-in-vehicle configuration for manipulator. International Journal of Automation and Computing, pp. 1–14. Cited by: §I.
-  Research on automatic parking systems based on parking scene recognition. IEEE Access 5, pp. 21901–21917. Cited by: §I.
-  (2015) The visual object tracking VOT2015 challenge results. In Workshop on the Visual Object Tracking Challenge (VOT, in conjunction with ICCV). Cited by: §I.
-  (2017) Context-aware correlation filter tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1396–1404. Cited by: §V-B.
-  (2016-06) Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-A, §III-A, §IV, §V-D.
-  (2016-06) Hedged deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I, §II-A, §III-A, §V-B, §V-C.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, pp. 91–99. Cited by: §I, §II-A, §II-C.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §V-A.
-  (2017) Sparse representation for crowd attributes recognition. IEEE Access 5, pp. 10422–10433. Cited by: §I.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §V-A.
-  (2014) Visual tracking: an experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1442–1468. Cited by: §I.
-  (2017) End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-B, Fig. 13, §V-B, §V-C.
-  (2015) Inverse sparse tracker with a locally weighted distance metric. IEEE Transactions on Image Processing 24 (9), pp. 2646–2657. Cited by: §I.
-  (2013) Online object tracking with sparse prototypes. IEEE Transactions on Image Processing 22 (1), pp. 314–325. Cited by: §I.
-  (2015) Visual tracking with fully convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119–3127. Cited by: §I, §II-C.
-  (2015) Transferring rich feature hierarchies for robust visual tracking. arXiv preprint arXiv:1501.04587. Cited by: §II-A.
-  (2013) Learning a deep compact image representation for visual tracking. In Proceedings of the Advances in Neural Information Processing Systems, pp. 809–817. Cited by: §III-A, §IV.
-  (2017) DCFNet: discriminant correlation filters network for visual tracking. arXiv preprint arXiv:1704.04057. Cited by: §II-B.
-  (2017) Motion control in saccade and smooth pursuit for bionic eye based on three-dimensional coordinates. Journal of Bionic Engineering 14 (2), pp. 336–347. Cited by: §I.
-  (2015) Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1834–1848. Cited by: §I, §V-C, §V.
-  (2013) Online object tracking: a benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2411–2418. Cited by: §V-B, §V.
-  (2014) Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, pp. 818–833. Cited by: §V-A, §V-B2.
-  (2019) FastPose: towards real-time pose estimation and tracking via scale-normalized multi-task networks. arXiv preprint arXiv:1908.05593. Cited by: §I.
-  (2019) Exploiting offset-guided network for pose estimation and tracking. arXiv preprint arXiv:1906.01344. Cited by: §I.
-  (2018) End-to-end video-level representation learning for action recognition. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 645–650. Cited by: §II-B.
-  (2018) Action machine: rethinking action recognition in trimmed videos. arXiv preprint arXiv:1812.05770. Cited by: §II-A.
-  (2017) Learning gating convnet for two-stream based methods in action recognition. arXiv preprint arXiv:1709.03655 1 (2), pp. 6. Cited by: §II-A.
-  (2018) Two-stream gated fusion convnets for action recognition. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 597–602. Cited by: §II-A.
-  (2017-10) UCT: learning unified convolutional networks for real-time visual tracking. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 1973–1982. Cited by: §I-A.
-  (2018) Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 101–117. Cited by: §II-B.
-  (2018) End-to-end flow correlation tracking with spatial-temporal attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-B.
-  (2018) A velocity compensation visual servo method for oculomotor control of bionic eyes. Cited by: §I.
-  (2016) STD: a stereo tracking dataset for evaluating binocular tracking algorithms. In 2016 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 2215–2220. Cited by: §I.