High Performance Visual Object Tracking with Unified Convolutional Networks

08/26/2019, by Zheng Zhu, et al.

Convolutional neural network (CNN) based tracking approaches have shown favorable performance on recent benchmarks. Nonetheless, the chosen CNN features are usually pre-trained on different tasks and the individual components of tracking systems are learned separately, so the achieved tracking performance may be suboptimal. Besides, most of these trackers are not designed for real-time applications because of their time-consuming feature extraction and complex optimization details. In this paper, we propose an end-to-end framework to learn the convolutional features and perform the tracking process simultaneously, namely a unified convolutional tracker (UCT). Specifically, the UCT treats both the feature extractor and the tracking process as convolution operations and trains them jointly, which ensures that the learned CNN features are tightly coupled with the tracking process. During online tracking, an efficient model updating method is proposed by introducing a peak-versus-noise ratio (PNR) criterion, and scale changes are handled efficiently by incorporating a scale branch into the network. Experiments are performed on four challenging tracking datasets: OTB2013, OTB2015, VOT2015 and VOT2016. Our method achieves leading performance on these benchmarks while maintaining beyond real-time speed.

I Introduction

Visual object tracking, which automatically tracks a specified target in a changing video sequence, is a fundamental problem in many areas such as visual analysis [50], automatic driving [43], pose tracking [64, 65, 35] and robotics [42, 26, 73, 60, 20, 23, 22, 21, 74]. On the one hand, a core problem of tracking is how to detect and locate the object accurately in changing scenarios such as illumination variations, scale variations, occlusions, shape deformation, and camera motion [61, 52, 30, 44]. On the other hand, tracking is a time-critical problem because it is performed on every frame of a sequence. Therefore, improving accuracy, robustness and efficiency are the main directions of recent tracking research.

As a core component of trackers, the appearance model can be divided into generative and discriminative models. In generative models, candidates are searched to minimize reconstruction errors; representative sparse coding methods [55, 54] have been exploited for visual tracking. In discriminative models, tracking is regarded as a classification problem that separates foreground from background. Numerous classifiers have been adopted for object tracking, such as the structured support vector machine (SVM) [15], boosting [14] and online multiple instance learning [1]. Recently, significant attention has been paid to discriminative correlation filter (DCF) based methods [18, 10, 41, 36] for real-time visual tracking. DCF trackers efficiently train a regressor by exploiting the properties of circular correlation and performing operations in the Fourier domain. Conventional DCF trackers can thus run at more than 100 FPS [18, 19], which is significant for real-time applications. Many improvements of DCF tracking have also been proposed, such as SAMF [36] and fDSST [11] for scale changes, LCT [41] for long-term tracking, and SRDCF [10] and BACF [27] to mitigate boundary effects. Better performance is obtained, but the high-speed property of DCF is lost. Moreover, all these methods use handcrafted features, which hinders their accuracy and robustness.

Inspired by the success of CNNs in object classification [32, 16], detection [48] and segmentation [38], the visual tracking community has started to focus on deep trackers that exploit the strength of CNNs. These deep trackers come from two directions. One is the DCF framework with deep features, which replaces the handcrafted features in DCF trackers with CNN features [40, 47, 12, 9]. The other designs tracking networks and pre-trains them to learn target-specific features for each new video [4, 56]. Despite their notable performance, most of these approaches separate the tracking system into individual components, which may lead to suboptimal performance. Furthermore, most of these trackers are not designed for real-time applications because of their time-consuming feature extraction and complex optimization details. For example, the speeds of the winners of VOT2015 [44] and VOT2016 [29] are less than 1 FPS on modern GPUs.

We address these two problems (the lack of end-to-end training and low speed) by introducing a unified convolutional tracker (UCT) that learns the features and performs the tracking process simultaneously. UCT is an end-to-end and extensible framework for tracking, which achieves high performance in terms of both accuracy and speed. Specifically, the proposed UCT treats both the feature extractor and the tracking process as convolution operations, resulting in a fully convolutional network architecture. In online tracking, the whole patch can be predicted from the foreground response map by one-pass forward propagation. Additionally, efficient model updating and scale handling strategies are proposed to ensure the tracker's real-time performance.

I-A Contributions

The contributions of this paper can be summarized in three aspects as follows:

1) We propose unified convolutional networks to learn the convolutional features and perform the tracking process simultaneously. The feature extractor and the tracking process are both treated as convolution operations and are trained jointly. End-to-end training ensures that the learned CNN features are tightly coupled with the tracking process.

2) In online tracking, efficient updating and scale handling strategies are incorporated into the tracking framework. The proposed standard UCT (with VGG-Net) and UCT-lite (with ZF-Net) can track generic objects at 58 FPS and 154 FPS respectively, which is far beyond real time.

3) Extensive experiments are carried out on four tracking benchmarks: OTB2013, OTB2015, VOT2015 and VOT2016. The results demonstrate that the proposed tracking algorithm performs favorably against existing state-of-the-art methods in terms of accuracy and speed.

This paper extends our work [70], which was published in the ICCV2017 VOT Workshop. In this paper, we additionally present a comprehensive summary of unified convolutional networks for high performance visual tracking. Furthermore, we change the backbone network from ResNet to VGG-Net and improve the training strategies and training data. These enhancements allow us to fine-tune more convolutional layers, resulting in superior tracking performance at a higher running speed. The experiments are extended by comparing our approach with more state-of-the-art trackers. Finally, we also present results on the VOT2016 benchmark.

The rest of this paper is organized as follows: Section II summarizes related work on recent visual tracking. The unified convolutional network for tracking is introduced in Section III, including the overall architecture, formulation and training process. Section IV introduces our online tracking algorithm, which consists of model updating and scale estimation. Experiments on four challenging benchmarks are reported in Section V. Section VI concludes the paper with a summary.

II Related Work

Visual tracking is a significant problem in computer vision, and a series of approaches have been successfully proposed. Since our main contribution is the UCT framework for high performance visual tracking, we give a brief review of three directions closely related to this work: CNN-based trackers, real-time trackers, and fully convolutional networks (FCN).

II-A On CNN-based trackers

Inspired by the success of CNNs in object recognition [32, 16, 48, 69, 37, 67, 68], researchers in the tracking community have started to focus on deep trackers that exploit the strength of CNNs [57, 34, 2]. Since DCF provides an excellent framework for recent tracking research, the first trend is the combination of the DCF framework with CNN features. In HCF [40] and HDT [47], CNNs are employed to extract features instead of handcrafted features, and the final tracking results are obtained by combining hierarchical responses and hedging weak trackers, respectively. DeepSRDCF [9] exploits shallow CNN features in a spatially regularized DCF framework. Another trend in deep trackers is to design tracking networks and pre-train them to learn target-specific features and handle the challenges of each new video. MDNet [46] trains a small-scale network by a multi-domain method, thus separating domain-independent information from domain-specific layers. CCOT [12] employs an implicit interpolation method to solve the learning problem in the continuous spatial domain. These trackers have two major drawbacks: first, since feature extraction and the tracking process are separated rather than trained end-to-end, the hyper-parameters can only be tuned heuristically. Second, most of these trackers are not designed for real-time applications.

II-B On real-time trackers

Besides accuracy and robustness, the speed of a visual tracker is a crucial factor in many real-world applications. Therefore, a practical tracking approach should be accurate and robust while operating in real time. Classical real-time trackers, such as NCC [6] and Mean-shift [8], perform tracking by matching. Recently, discriminative correlation filter (DCF) based methods, which efficiently train a regressor by exploiting the properties of circular correlation and performing the operations in the Fourier domain, have drawn attention for real-time visual tracking. Conventional DCF trackers such as MOSSE, CSK and KCF can perform at more than 100 FPS [5, 19, 18]. Subsequently, a series of trackers following the DCF method have been proposed. In the fDSST algorithm [11], the tracker searches over the scale space for correlation filters to handle variations in object size. Staple [3] combines complementary template and color cues in a ridge regression framework. CFLB [28] and BACF [27] mitigate the boundary effects of DCF in the Fourier domain. Nevertheless, all these DCF-based trackers employ handcrafted features, which hinders their performance.

Recent years have witnessed significant advances in CNN-based real-time tracking approaches [4, 17, 53, 59, 72, 66, 33, 71]. SiamFC [4] proposes a fully convolutional Siamese network to predict the motion between two frames. The network is trained off-line and evaluated without any fine-tuning. Similar to SiamFC, the GOTURN tracker [17] predicts the motion between successive frames using a deep regression network. These two trackers can perform at 86 FPS and 100 FPS respectively on a GPU because no fine-tuning is performed online. Although their simplicity and fixed-model nature lead to high speed, these methods lose the ability to update the appearance model online, which is often critical to account for drastic appearance changes in tracking scenarios. It is worth noting that CFNet [53] and DCFNet [59] interpret the correlation filters as a differentiable layer in a Siamese tracking framework, thus achieving end-to-end representation learning. Their main drawback is unsatisfactory performance. Therefore, there is still room for improvement among real-time deep trackers.

Fig. 1: The overall UCT architecture. In each frame, a patch is cropped around the ground truth and resized to a fixed size with a padding of 1.8. The solid lines indicate the online tracking process, while the dashed box and dashed lines indicate off-line training and training on the first frame.

II-C On fully convolutional trackers

Fully convolutional networks can efficiently learn to make dense predictions for visual tasks such as semantic segmentation, detection and tracking. The work in [38] transforms fully connected layers into convolutional layers to output a heat map for semantic segmentation. The region proposal network (RPN) in Faster R-CNN [48] is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. DenseBox [24] is an end-to-end FCN detection framework that directly predicts bounding boxes and object class confidences over the whole image. In the tracking literature, FCNT [56] proposes a two-stream fully convolutional network to capture both general and specific object information for visual tracking. However, its tracking components are still independent, so the performance may be impaired. In addition, FCNT can only perform at 3 FPS on a GPU because its layer-switching mechanism and feature map selection method are time-consuming, which hinders it from real-time applications. Compared with FCNT, our UCT treats the feature extractor and the tracking process as a unified architecture and trains them end-to-end, resulting in a more compact and much faster tracking approach.

III Unified Convolutional Networks for Tracking

In this section, the overall architecture of the proposed UCT is introduced first. Afterwards, we detail the formulation of the convolution operation in both the training and test stages. Lastly, the training process of UCT is described.

III-A UCT Architecture

The overall framework of our tracking approach is shown in Figure 1, which consists of a feature extractor and convolution layers that perform the tracking process. The solid lines indicate the online tracking process, while the dashed box and dashed lines are included in off-line training and training on the first frame. The search window of the current frame is cropped and sent to the unified convolutional network. The estimated new target position is obtained by finding the maximum value of the response map. A separate 1-dimensional convolutional branch is used to estimate the target scale, and model updating is performed only when necessary, which enables our tracker to perform at a real-time rate. Each feature channel of the extracted sample is always multiplied by a Hann window, as described in [18]. The proposed framework has two advantages. First, we adopt two groups of convolutional filters to perform the tracking process, which are trained jointly with the feature extractor. Compared with the two-stage approaches adopted in DCF frameworks with CNN features [40, 47, 9], our end-to-end training pipeline is generally preferable, because the parameters of all components can cooperate to achieve the tracking objective. Second, during online tracking, the whole patch can be predicted from the foreground heat map by one-pass forward propagation, so redundant computation is saved. In contrast, in [46] and [58], the network has to be evaluated many times on samples cropped from the frame, and the overlap between patches leads to a lot of redundant computation. Besides, we adopt efficient scale handling and model updating strategies to ensure real-time speed.
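As an illustration of this unified design, the following is a minimal PyTorch-style sketch: a truncated ImageNet-style backbone followed by two convolutional layers that output a single-channel response map in one forward pass. The backbone truncation point, channel widths and kernel sizes are illustrative assumptions, not the exact configuration used in our implementation.

```python
import torch
import torch.nn as nn
import torchvision


class UCTSketch(nn.Module):
    """Minimal sketch of a unified convolutional tracker:
    feature extractor + tracking convolutions, trained end-to-end."""

    def __init__(self):
        super().__init__()
        # Truncated VGG backbone as the feature extractor
        # (assumption: keep only the first three convolutional blocks).
        vgg = torchvision.models.vgg16(weights=None).features
        self.backbone = nn.Sequential(*list(vgg.children())[:17])
        # Two convolutional layers playing the role of the learned
        # "tracking filters"; the single-channel output is the response map.
        self.track = nn.Sequential(
            nn.Conv2d(256, 64, kernel_size=3, padding=1),
            nn.Sigmoid(),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, patch):
        # patch: (B, 3, H, W) search region -> (B, 1, h, w) response map
        feat = self.backbone(patch)
        return self.track(feat)


if __name__ == "__main__":
    net = UCTSketch()
    response = net(torch.randn(1, 3, 224, 224))
    print(response.shape)  # one forward pass yields a dense response map
```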

III-B Formulation

In the UCT formulation, the aim is to learn a series of convolution filters f from training samples x. Each sample x is extracted with a CNN from an image region. Assuming the sample has spatial size M × N, the output has spatial size m × n. The desired output is a response map that includes a target score for each location in the sample x. The convolutional response of the filters on the sample x is given by

R(x) = \sum_{l=1}^{d} x^{l} \ast f^{l},    (1)

where x^l and f^l denote the l-th channel of the extracted CNN features and of the desired filters, respectively, and \ast denotes the convolution operation. The filters can be trained by minimizing the L2 loss between the response on the sample x and the corresponding Gaussian label g:

L = \| R(x) - g \|^{2} + \lambda \sum_{l=1}^{d} \| f^{l} \|^{2}.    (2)

The second term in (2) is a regularization term with weight parameter λ.

In the test stage, the trained filters are used to evaluate an image patch centered at the previously estimated target location. The evaluation is applied in a sliding-window manner and can therefore be operated as a convolution:

R(z) = \sum_{l=1}^{d} z^{l} \ast f^{l},    (3)

where z denotes the feature map extracted around the last target position, including context.

Note that the formulation in our framework is similar to DCF, which solves this ridge regression problem in the frequency domain by circularly shifting the sample. Different from DCF, we adopt gradient descent to solve equation (2), which results in convolution operations. Since the sample is also extracted by a CNN, these convolution operations can be naturally unified in a fully convolutional network. Compared to the DCF framework, our approach has three advantages: first, the feature extraction and tracking convolutions can be pre-trained simultaneously, while DCF-based trackers can only tune the hyper-parameters heuristically. Second, model updating can be performed by stochastic gradient descent (SGD), which maintains a long-term memory of the target appearance. Lastly, our framework is much faster than the DCF framework with CNN features.
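For concreteness, the snippet below sketches how equations (1) and (2) can be evaluated and minimized by gradient descent in this convolutional form: the response map is computed as a multi-channel convolution with a single output channel (which sums the per-channel responses), and one SGD step is taken on the regularized L2 loss. Tensor shapes, the kernel size and the regularization weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def response_map(features, filters):
    # features: (1, d, m, n) CNN features of the sample x
    # filters:  (1, d, kh, kw) tracking filters f
    # A single-output-channel convolution sums the per-channel
    # convolutions, i.e. R(x) = sum_l x^l * f^l as in eq. (1).
    return F.conv2d(features, filters, padding="same")


def uct_loss(features, filters, label, reg_weight=1e-4):
    # Regularized L2 loss of eq. (2): ||R(x) - g||^2 + lambda * sum_l ||f^l||^2
    r = response_map(features, filters)
    return F.mse_loss(r, label, reduction="sum") + reg_weight * filters.pow(2).sum()


# One gradient-descent step on the filters, as done during training/updating.
d, m, n = 256, 31, 31
x = torch.randn(1, d, m, n)
g = torch.zeros(1, 1, m, n)
g[0, 0, m // 2, n // 2] = 1.0          # toy label; a Gaussian label is used in practice
f = torch.zeros(1, d, 3, 3, requires_grad=True)
opt = torch.optim.SGD([f], lr=1e-3)
loss = uct_loss(x, f, g)
loss.backward()
opt.step()
```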

III-C Training

Since the objective function defined in equation (2) is convex, it is possible to obtain an approximate global optimum via gradient descent with an appropriate learning rate in a limited number of steps. We divide the training process into two stages: off-line training, which encodes prior tracking knowledge, and training on the first frame, which adapts to the specific target.

In off-line training, the goal is to minimize the loss function in equation (2). During tracking, the target position of the last frame is not always centered in the currently cropped patch. Therefore, for each image, the training patch centered at the given object is cropped with jittering. The jittering consists of translation and scale jittering, which approximates the variation between adjacent frames during tracking. The cropped patch also includes background information as context. In training, the final response map is obtained by the last convolution layer with a single channel. The label is generated using a Gaussian function with variances proportional to the width and height of the object. The L2 loss is then computed and gradient descent is performed to minimize equation (2). In this stage, the overall network consists of a network pre-trained on ImageNet (VGG-Net in UCT and ZF-Net in UCT-lite) followed by the tracking convolutional filters. The last part of VGG-Net or ZF-Net is trained together with the following convolutional filters to encode prior tracking knowledge, making the extracted features more suitable for tracking.
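A small sketch of the Gaussian regression label described above is given below; the proportionality constant sigma_factor is an assumed value for illustration, not the one used in our implementation.

```python
import numpy as np


def gaussian_label(map_h, map_w, target_h, target_w, sigma_factor=0.1):
    """Gaussian response label centered in a (map_h, map_w) map,
    with standard deviations proportional to the target height and width."""
    sigma_y = sigma_factor * target_h
    sigma_x = sigma_factor * target_w
    ys = np.arange(map_h) - (map_h - 1) / 2.0
    xs = np.arange(map_w) - (map_w - 1) / 2.0
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-0.5 * ((yy / sigma_y) ** 2 + (xx / sigma_x) ** 2))


# Example: 31x31 label for a target occupying roughly half the patch.
label = gaussian_label(31, 31, target_h=15, target_w=15)
```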

The goal of training on the first frame is to adapt to a specific target. The network architecture follows that of off-line training, while the subsequent convolutional filters are randomly initialized from a zero-mean Gaussian distribution. Only these randomly initialized layers are trained using SGD on the first frame. Off-line training encodes prior tracking knowledge and constitutes a tailored feature extractor. We perform online tracking with and without off-line training to illustrate this effect. In Figure 2, we show tracking results and the corresponding response maps with and without off-line training. In the left part of Figure 2, the target singer is moving to the right; the response map with off-line training effectively reflects this translation, while the response map without off-line training does not, so the tracker without off-line training misses this critical frame. In the right part of Figure 2, the target player is occluded by another player; the response map without off-line training fluctuates and the tracking result is affected by the distractor, while the response map with off-line training remains discriminative. These results are not surprising, since CNN features trained on ImageNet classification data are expected to be invariant to position and to instances of the same class. In contrast, end-to-end off-line training yields features more suitable for tracking.

Fig. 2: From left to right: images, response maps without off-line training, and response maps with off-line training. Green and red boxes in the images indicate tracking results without and with off-line training, respectively.

IV Online Tracking

After off-line training and training on the first frame, the learned network is used to perform online tracking according to equation (3). The estimate of the current target state is obtained by finding the maximum response score. Since we use a fully convolutional network architecture to perform tracking, the whole patch can be predicted from the foreground heat map by one-pass forward propagation, which saves redundant computation; in contrast, in [46] and [58], the network has to be evaluated many times on samples cropped from the frame, and the overlap between patches leads to a lot of redundant computation.
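The translation update from the response map can be sketched as follows; mapping the response-map peak back to image coordinates through an assumed total network stride is shown only for illustration.

```python
import numpy as np


def locate_target(response, prev_center, stride=8):
    """Map the peak of the response map back to image coordinates.
    response: (h, w) array; prev_center: (x, y) of the previous target center."""
    h, w = response.shape
    peak_y, peak_x = np.unravel_index(np.argmax(response), response.shape)
    # Displacement of the peak from the map center, scaled by the total stride.
    dx = (peak_x - (w - 1) / 2.0) * stride
    dy = (peak_y - (h - 1) / 2.0) * stride
    return prev_center[0] + dx, prev_center[1] + dy
```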

IV-A Model update

Most tracking approaches update their models in each frame or at a fixed interval [18, 19, 40, 12]. However, this strategy may introduce false background information when tracking is inaccurate or the target is occluded or out of view. In the proposed method, model update is decided by evaluating the tracking results. Specifically, we consider both the maximum value of the response map and the distribution of the other response values.

Fig. 3: Updating results of UCT and UCTNoPNR (UCT without the PNR criterion). The first row shows frames in which the target is occluded by a distractor. The second row shows the corresponding response maps. The maximum response R_max remains large during the occlusion while the PNR decreases significantly, so the unwanted update is avoided by considering the PNR constraint as well. The red and blue boxes in the last image are the tracking results of UCT and UCTNoPNR, respectively.

An ideal response map should have a single peak at the actual target position, while the other values should be small. Otherwise, the response map fluctuates strongly and contains multiple peaks, as shown in Figure 3. We introduce a criterion called the peak-versus-noise ratio (PNR) to reveal the distribution of the response map. The PNR is defined as

\mathrm{PNR} = \frac{R_{\max} - R_{\min}}{\operatorname{mean}_{(m,n) \neq \arg\max R}\big(R(m,n) - R_{\min}\big)},    (4)

where

R_{\max} = \max_{m,n} R(m,n)    (5)

and R_min is the corresponding minimum value of the response map. The denominator in equation (4) is the mean value of the response map excluding the maximum, and it approximately measures the noise. The PNR becomes larger when the response map has less noise and a sharper peak; otherwise, the PNR falls to a smaller value. We save the PNR and R_max over time and use their historical average values as thresholds:

\mathrm{PNR}_{\mathrm{threshold}} = \frac{1}{T}\sum_{t=1}^{T} \mathrm{PNR}_{t}, \qquad R_{\mathrm{threshold}} = \frac{1}{T}\sum_{t=1}^{T} R_{\max}^{t}.    (6)

A model update is performed only when the PNR and R_max are both larger than their corresponding thresholds. The update is one step of SGD with a smaller learning rate than that used in the first frame. Figure 3 illustrates the necessity of the proposed PNR criterion by showing tracking results under occlusion. As shown in Figure 3, updating would still be performed if only R_max were adopted; the introduced noise would result in inaccurate tracking results or even failure. The PNR value decreases significantly in these unreliable frames and thus avoids unwanted updates.
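A minimal sketch of the PNR criterion of equations (4)-(6) and the resulting update decision is given below; the running-average bookkeeping for the thresholds follows the description above, while the exact implementation details are assumptions.

```python
import numpy as np


def pnr(response):
    """Peak-versus-noise ratio of a response map, eq. (4)-(5)."""
    r_max = response.max()
    r_min = response.min()
    shifted = response - r_min
    # Mean over all positions except the maximum approximates the noise level.
    noise_mean = (shifted.sum() - (r_max - r_min)) / (response.size - 1)
    return (r_max - r_min) / (noise_mean + 1e-12), r_max


class UpdateGate:
    """Update only when both PNR and R_max exceed their historical averages."""

    def __init__(self):
        self.pnr_hist, self.rmax_hist = [], []

    def should_update(self, response):
        p, r = pnr(response)
        ok = (not self.pnr_hist) or (p > np.mean(self.pnr_hist)
                                     and r > np.mean(self.rmax_hist))
        self.pnr_hist.append(p)
        self.rmax_hist.append(r)
        return ok
```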

IV-B Scale estimation

A conventional approach to incorporating scale estimation is to evaluate the appearance model at multiple resolutions by performing an exhaustive scale search [36]. However, this search strategy is computationally demanding and not suitable for real-time tracking. Inspired by [11], we introduce a 1-dimensional convolutional filter branch to estimate the target size, as shown in Figure 1. This scale filter is applied at the estimated image location to compute response scores along the scale dimension, whose maximum can be used to estimate the target scale. Learning separate convolutional filters to explicitly handle scale changes in this way is more efficient for real-time tracking.

In training and updating the scale convolutional filters, samples are extracted from patches of variable size centered around the target:

\big\{\, a^{n} W \times a^{n} H \;\big|\; n = \lfloor -\tfrac{S-1}{2} \rfloor, \dots, \lfloor \tfrac{S-1}{2} \rfloor \,\big\},    (7)

where S is the size of the scale convolutional filters, W and H are the current target width and height, and a is the scale factor. At test time, the scale samples are extracted in the same way after the translation filters have been applied; the scale change relative to the previous frame is then obtained by maximizing the response score. Note that scale estimation is performed only when the model updating conditions are satisfied.
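A sketch of the scale-sample construction in equation (7) is shown below, assuming S scale levels symmetrically distributed around the current target size; patch cropping and resizing are omitted.

```python
import numpy as np


def scale_sample_sizes(width, height, num_scales=33, scale_factor=1.02):
    """Patch sizes a^n * (W, H) for n in {-(S-1)/2, ..., (S-1)/2}, eq. (7)."""
    exponents = np.arange(num_scales) - (num_scales - 1) // 2
    factors = scale_factor ** exponents
    return [(width * s, height * s) for s in factors]


# Example: 33 candidate sizes around a 100x50 target with a = 1.02.
sizes = scale_sample_sizes(100, 50)
```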

The overall tracking algorithm is summarized in Algorithm 1.

0:  Input: initial target position; off-line learned filters
0:  Output: target position in each frame of the sequence
1:  repeat
2:     if first frame then
3:        fine-tune the network using the ground truth by (2)
4:     else
5:        crop out a new patch centered at the previous result
6:        calculate the response map using (3)
7:        calculate the scale change relative to the previous frame by (7)
8:        obtain the bounding box in the current frame from the maximum response position and the scale change
9:        calculate the PNR and R_max using (4) and (5), and the thresholds using (6)
10:        if both criteria are satisfied then
11:           update the model using (2)
12:        end if
13:     end if
14:  until the end of the video sequence
Algorithm 1 Unified Convolutional Networks for Tracking

V Experiments

Experiments are performed on four challenging tracking datasets: OTB2013 [62], OTB2015 [61], VOT2015 [31] and VOT2016 [29]. Basic information about the four datasets is summarized in Table I. For all compared trackers, we use the reported results to ensure a fair comparison.

Dataset Frame number Main challenges
OTB2013 29491 IV,OPR,SV,OCC,DEF MB,FM,IPR,OV,BC,LR
OTB2015 59040 IV,OPR,SV,OCC,DEF MB,FM,IPR,OV,BC,LR
VOT2015 21871 CM,IV,OCC,SV,MC
VOT2016 21871 CM,IV,OCC,SV,MC
TABLE I: Basic information about the four datasets used in the experiments. IV: Illumination Variation. OPR: Out-of-Plane Rotation. SV: Scale Variation. OCC: Occlusion. DEF: Deformation. MB: Motion Blur. FM: Fast Motion. IPR: In-Plane Rotation. OV: Out-of-View. BC: Background Clutters. LR: Low Resolution. CM: Camera Motion. MC: Motion Change.

V-A Implementation details

We adopt VGG-Net [51] in the standard UCT and ZF-Net [63] in UCT-lite as feature extractors, respectively. In off-line training, the last six layers of VGG-Net and the last three layers of ZF-Net are fine-tuned. The tracking process is performed by two convolutional layers, and the activation function is Sigmoid. Our training data comes from VID [49], containing the training and validation sets. In each frame, a patch is cropped around the ground truth and resized to a fixed size with a padding of 1.8. The translation and scale jittering are 0.05 and 0.02 proportional to the image size, respectively. We apply SGD with a momentum of 0.9 to train the network and set the weight decay to 0.005. The model is trained for 30 epochs. In online training on the first frame, SGD is performed for 50 steps with a larger learning rate. In online tracking, model updating is performed by one-step SGD with a smaller learning rate. S and a in equation (7) are set to 33 and 1.02, respectively.

The proposed UCT is implemented using Caffe [25] with a Matlab wrapper on a PC with an Intel i7 6700 CPU, 48 GB RAM and an Nvidia GTX TITAN X GPU. The code and results will be made publicly available.

V-B Results on OTB2013

OTB2013 [62] contains 50 fully annotated sequences collected from commonly used tracking sequences. The evaluation is based on two metrics: the precision plot and the success plot. The precision plot shows the percentage of frames whose tracking results are within a certain distance, determined by a given threshold, from the ground truth; the value at a threshold of 20 pixels is usually taken as the representative precision score. The success plot shows the ratio of successful frames as the overlap threshold varies from 0 to 1, where a successful frame is one whose overlap exceeds the given threshold. The area under the curve (AUC) of each success plot is used to rank the tracking algorithms.
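These two metrics can be computed as in the sketch below from per-frame center location errors and overlap ratios; this is an illustrative implementation, not the official OTB toolkit code.

```python
import numpy as np


def precision_at(center_errors, threshold=20.0):
    """Fraction of frames whose center location error is within `threshold` pixels."""
    e = np.asarray(center_errors, dtype=float)
    return float((e <= threshold).mean())


def success_auc(overlaps, num_thresholds=101):
    """Area under the success curve: mean success rate over overlap
    thresholds sampled uniformly in [0, 1]."""
    o = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success = [(o > t).mean() for t in thresholds]
    return float(np.mean(success))


# Example with toy per-frame errors (pixels) and IoU overlaps.
print(precision_at([5.0, 12.0, 30.0]), success_auc([0.8, 0.6, 0.2]))
```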

In this experiment, ablation analyses are performed to illustrate the effectiveness of proposed component at first. Then we compare our method against the recent real-time (and near real-time) trackers presented at top conferences and journals, including PTAV [13], BACF [27], CFNet [53], CACF [45], CSR-DCF [39], fDSST [11], SiamFC [4], Staple [3], SCT [7], HDT [47], HCF [40], LCT [41], KCF [18]. The one-pass evaluation (OPE) is employed to compare these trackers.

Fig. 4: Precision and success plots on OTB2013. The numbers in the legend indicate the representative precisions at 20 pixels for precision plots, and the area-under-curve scores for success plots. Best viewed on color display.
Fig. 5: Attributes performance on OTB2013. Best viewed on color display.

V-B1 Ablation analyses

To verify the contribution of each component in our algorithm, we implement and evaluate four variations of our approach. First, the effectiveness of off-line training is tested by comparison with a version without this procedure (UCTNoOff-line), where the network is only trained on the first frame of a specific sequence. Second, a version that updates the model without the PNR constraint (UCTNoPNR, depending only on R_max) is compared with the proposed efficient updating method. The last two variations are UCT with multi-resolution scale search (UCTMulResScale) and UCT without scale handling (UCTNoScale).

As shown in Table II, none of the variations perform as well as our full algorithm (UCT), and each component of our tracking algorithm helps to improve performance. Specifically, off-line training encodes prior tracking knowledge and constitutes a tailored feature extractor, so UCT outperforms UCTNoOff-line by a large margin. The proposed PNR constraint for model update improves both performance and speed, since it avoids updating in unreliable frames. Although the exhaustive multi-resolution scale search improves the performance of the tracker, it brings a higher computational cost. In contrast, learning separate filters for scale in our approach obtains better performance while being computationally efficient.

Approaches AUC Precision20 Speed (FPS)
UCTNoOff-line 0.629 0.879 58
UCTNoPNR 0.676 0.912 37
UCTNoScale 0.654 0.901 69
UCTMulResScale 0.664 0.909 27
UCT 0.693 0.927 58
UCT-lite 0.634 0.892 154
TABLE II: Performance on OTB2013 of UCT and its variations

V-B2 Comparison with state-of-the-art trackers

We compare our method against the state-of-the-art trackers listed earlier. Figure 4 shows the precision and success plots based on center location error and bounding box overlap ratio, respectively. Our algorithm, denoted UCT, clearly outperforms the state-of-the-art trackers in both measures. In the success plot, our approach obtains an AUC score of 0.693, outperforming BACF and SiamFC by 3.6% and 8.5%, respectively. In the precision plot, our approach obtains a score of 0.927, outperforming BACF and SiamFC by 6.6% and 11.8%, respectively. We summarize the model update frequency, scale estimation, performance and speed of the compared trackers in Table III.

Besides the standard UCT, we also implement a lite version (UCT-lite), which adopts ZF-Net [63] and ignores scale changes. As shown in Figure 4, UCT-lite obtains a success score of 0.634 and a precision score of 0.892 while operating at 154 FPS. UCT-lite is much faster than recent real-time trackers such as CFNet, SiamFC and Staple, while significantly outperforming them.

Furthermore, the results on the various challenge attributes of OTB2013 are reported for detailed performance analysis. These challenges include scale variation, fast motion, background clutter, deformation, occlusion, etc. Figure 5 shows that our tracker effectively handles these challenging situations while other trackers obtain lower scores. Qualitative comparisons of our approach with four state-of-the-art trackers in changing scenarios are shown in Figure 13. The top performance can be attributed to the fact that our method encodes prior tracking knowledge by off-line training, so the extracted features are more suitable for the subsequent tracking convolution operations. For example, our tracker ranks first in the out-of-plane rotation and in-plane rotation challenges thanks to the end-to-end training. By contrast, the CNN features in other trackers are always pre-trained on different tasks and are independent of the tracking process, so the achieved tracking performance may not be optimal. Furthermore, the efficient updating and scale handling strategies ensure the robustness and speed of the tracker.

V-C Results on OTB2015

OTB2015 [61] is the extension of OTB2013 and contains 100 video sequences. Some new sequences are more difficult to track. In this experiment, we compare our method against recent real-time trackers, including PTAV [13], BACF [27], CFNet [53], fDSST [11], SiamFC [4], Staple [3], HDT [47], HCF [40], LCT [41], KCF [18]. The one-pass evaluation (OPE) is employed to compare these trackers.

Figure 6 shows the precision and success plots of the compared trackers. The proposed UCT outperforms all the other trackers in terms of both precision score and success score. Specifically, our method achieves a success score of 0.670, which outperforms PTAV (0.631) and BACF (0.621) by a large margin. Since the proposed tracker adopts a unified convolutional architecture and efficient online tracking strategies, it achieves superior tracking performance at real-time speed. It is worth mentioning that our UCT-lite also provides significantly better performance and faster speed than SiamFC, CFNet and Staple. The model update frequency, scale estimation, performance and speed of the compared trackers are summarized in Table III.

For detailed performance analysis, we report the results on various challenge attributes in OTB2015, such as illumination variation, scale changes, occlusion, etc. Figure 7 demonstrates that our tracker effectively handles these challenging situations while other trackers obtain lower scores. Comparisons of our approach with four state-of-the-art trackers in the challenging scenarios are shown in Figure 13.

Fig. 6: Precision and success plots on OTB2015. The numbers in the legend indicate the representative precisions at 20 pixels for precision plots, and the area-under-curve scores for success plots. Best viewed on color display.
Fig. 7: Attributes performance on OTB2015. Best viewed on color display.
Approaches Publication Model update frequency Scale estimation AUC in OTB2013 AUC in OTB2015 Speed (FPS)
PTAV ICCV 2017 in short-term phase yes 0.654 0.631 27
BACF ICCV 2017 always yes 0.657 0.621 35
SiamFC ECCVw 2016 no yes 0.608 0.582 86
Staple CVPR 2016 always yes 0.600 0.581 80
CFNet CVPR 2017 always yes 0.611 0.568 75
HDT CVPR 2016 always yes 0.603 0.564 10
HCF ICCV 2015 always yes 0.605 0.562 11
LCT CVPR 2015 always yes 0.628 0.562 27
fDSST TPAMI 2017 always yes 0.595 0.517 54
KCF TPAMI 2015 always no 0.514 0.477 172
UCT ours when criteria are satisfied yes 0.693 0.670 58
UCT-lite ours when criteria are satisfied no 0.634 0.608 154
TABLE III: Performance on OTB of UCT and compared trackers

V-D Results on VOT2015

Trackers EAO Accuracy Failures
UCT 0.3576 0.58 0.97
UCT-lite 0.2957 0.56 1.26
DeepSRDCF 0.3181 0.56 1.05
EBT 0.3130 0.47 1.02
srdcf 0.2877 0.56 1.24
LDP 0.2785 0.51 1.84
sPST 0.2767 0.55 1.48
scebt 0.2548 0.55 1.86
nsamf 0.2536 0.53 1.29
struck 0.2458 0.47 1.61
rajssc 0.2458 0.57 1.63
s3tracker 0.2420 0.52 1.77
TABLE IV: Comparisons with top trackers in VOT2015. Red, green and blue fonts indicate 1st, 2nd, 3rd performance, respectively. Best viewed on color display.

VOT2015 [31] consists of 60 challenging videos automatically selected from a pool of 356 sequences. The trackers in VOT2015 are evaluated by the expected average overlap (EAO) measure, which is the inner product of the empirically estimated average overlap curve and the typical-sequence-length distribution. The EAO measures the expected no-reset overlap of a tracker run on a short-term sequence. In addition, accuracy (mean overlap) and robustness (average number of failures) are reported.

Fig. 8: EAO ranking with trackers in VOT2015. The better trackers are located at the right. Best viewed on color display.

In the VOT2015 experiment, we present a state-of-the-art comparison with the participants in the challenge according to the latest VOT rules (see http://votchallenge.net). It is worth noting that MDNet [46] is not compatible with the latest VOT rules because it uses OTB training data.

Figure 8 shows that our UCT ranks 1st among 61 trackers according to the EAO criterion, while running at 58 FPS. The faster UCT-lite (154 FPS) ranks 4th in EAO. In Table IV, we list the EAO, accuracy and failures of UCT and the top 10 entries in VOT2015. UCT ranks 1st according to all three criteria. The top performance can be attributed to the unified convolutional architecture and end-to-end training.

Additionally, we present per-attribute evaluations on the VOT2015 dataset with 60 videos. In the VOT2015 dataset, each frame is labeled with 5 different attributes: camera motion, illumination change, occlusion, size change and motion change. The performance is evaluated by the expected average overlap (EAO) measure. As shown in Figure 9, our approach (UCT) achieves the best results on 4 of the 5 attributes.

Fig. 9: EAO scores for each attribute on the VOT2015 dataset. empty denotes frames with no labeled attribute. Best viewed on color display.

V-E Results on VOT2016

The datasets in VOT2016 [29] are the same as VOT2015, but the ground truth has been re-annotated. VOT2016 also adopts EAO, accuracy and robustness for evaluations.

In this experiment, we compare our method with the participants in the challenge. Figure 10 shows that our UCT ranks 1st among 70 trackers according to the EAO criterion. It is worth noting that our method operates at 58 FPS, which is nearly 200 times faster than CCOT [12] (0.3 FPS). UCT-lite ranks 6th with a speed of 154 FPS. Figure 12 shows the performance and speed of the state-of-the-art trackers, illustrating that our tracker achieves superior performance while operating at high speed.

For detailed performance analysis, we also list the accuracy and robustness of representative trackers in VOT2016. As shown in Table V, the accuracy and robustness of the proposed UCT rank 1st and 2nd, respectively. Finally, we provide a per-attribute evaluation on the VOT2016 dataset with 60 videos. In the VOT2016 dataset, each frame is labeled with five different attributes: camera motion, illumination change, occlusion, size change and motion change. Figure 11 visualizes the EAO of each attribute individually. Our approach (UCT) ranks 1st in 4 attributes and 2nd in 1 attribute.

Fig. 10: EAO ranking with trackers in VOT2016. The better trackers are located at the right. Best viewed on color display.
Trackers EAO Accuracy Robustness
UCT 0.342 0.569 0.239
UCT-lite 0.304 0.551 0.362
CCOT 0.331 0.539 0.238
TCNN 0.325 0.554 0.268
Staple 0.295 0.544 0.378
EBT 0.291 0.465 0.252
DNT 0.278 0.515 0.329
SiamFC 0.277 0.549 0.382
MDNet 0.257 0.541 0.337
TABLE V: Comparisons with top trackers in VOT2016. Red, green and blue fonts indicate 1st, 2nd, 3rd performance, respectively. Best viewed on color display.
Fig. 11: EAO scores for each attribute on the VOT2016 dataset. empty denotes frames with no labeled attribute. Best viewed on color display.
Fig. 12: Performance and speed of our tracker and some state-of-the-art trackers in VOT2016. Speed is evaluated by normalized equivalent filter operations (EFO) [29]. Closer to the top means better performance, and closer to the right means faster. UCT ranks 1st in EAO while operating at 27 EFO (58 FPS). Best viewed on color display.

V-F Qualitative results

To visualize the superiority of our framework, we show examples of UCT results compared with recent trackers on challenging sample videos. As shown in Figure 13, the target in the sequence skiing undergoes severe fast motion. Only UCT and CCOT successfully track until the end, and the proposed UCT is more precise than CCOT, as shown in #14, #25 and #40. The targets in the sequences singer2 and ironman undergo severe deformation. Since end-to-end training ensures that the learned CNN features are tightly coupled with the tracking process, the proposed UCT can handle these challenges, whereas CCOT, CFNet, SiamFC and PTAV fail in #366 of singer2 and in #155, #157 and #160 of ironman. The sequences liquor and matrix illustrate background clutter, where UCT tracks successfully. The sequences motorRolling and skating2 contain in-plane and out-of-plane rotation; the trained UCT tracks the target while the other trackers tend to drift to the background. Lastly, the proposed UCT handles partial occlusions and scale changes in the sequence woman, such as in #568 and #597.

Fig. 13: Comparisons of our approach with four state-of-the-art trackers in the changing scenario skiing, singer2, ironman, motorRolling, liquor, matrix, woman, skating2. The compared trackers are three recent real-time trackers: SiamFC [4], CFNet [53], PTAV [13], and the winner of VOT2016: CCOT [12].

VI Conclusions

In this work, we propose a high performance unified convolutional tracker (UCT) that learns the convolutional features and performs the tracking process simultaneously. In online tracking, efficient updating and scale handling strategies are incorporated into the network. It is worth emphasizing that the proposed algorithm not only performs superiorly but also runs at a very fast speed, which is significant for real-time applications. On the VOT2016 dataset, the proposed UCT obtains an EAO of 0.342 at 58 FPS, yielding a 3.2% relative gain in performance and a roughly 200-fold gain in speed compared with the challenge winner, CCOT [12].

In future work, we will apply the proposed UCT framework to mobile robots with active vision systems, and it would be interesting to explore more efficient network architectures such as MobileNet and ShuffleNet.

Acknowledgment

This work is supported in part by the National Natural Science Foundation of China under Grant No. 61403378 and 51405484, and in part by the National High Technology Research and Development Program of China under Grant No.2015AA042307.

References

  • [1] B. Babenko, M. Yang, and S. Belongie (2011) Robust object tracking with online multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (8), pp. 1619–1632. Cited by: §I.
  • [2] S. Bai, Z. He, T. Xu, Z. Zhu, Y. Dong, and H. Bai (2018) Multi-hierarchical independent correlation filters for visual tracking. arXiv preprint arXiv:1811.10302. Cited by: §II-A.
  • [3] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr (2016-06) Staple: complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-B, §V-B, §V-C.
  • [4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr (2016) Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision Workshop, pp. 850–865. Cited by: §I, §II-B, Fig. 13, §V-B, §V-C.
  • [5] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui (2010) Visual object tracking using adaptive correlation filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550. Cited by: §II-B.
  • [6] K. Briechle and U. D. Hanebeck (2001) Template matching using fast normalized cross correlation. In Optical Pattern Recognition XII, Vol. 4387, pp. 95–103. Cited by: §II-B.
  • [7] J. Choi, H. Jin Chang, J. Jeong, Y. Demiris, and J. Young Choi (2016) Visual tracking using attention-modulated disintegration and integration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4321–4330. Cited by: §V-B.
  • [8] D. Comaniciu, V. Ramesh, and P. Meer (2000) Real-time tracking of non-rigid objects using mean shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 142–149. Cited by: §II-B.
  • [9] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg (2015) Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshop, pp. 621–629. Cited by: §I, §II-A, §III-A.
  • [10] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg (2015) Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4310–4318. Cited by: §I.
  • [11] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg (2017) Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (8), pp. 1561–1575. Cited by: §I, §II-B, §IV-B, §V-B, §V-C.
  • [12] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In Proceedings of the European Conference on Computer Vision, pp. 472–488. Cited by: §I, §II-A, §IV-A, Fig. 13, §V-E, §VI.
  • [13] H. Fan and H. Ling (2017) Parallel tracking and verifying: a framework for real-time and high accuracy visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Fig. 13, §V-B, §V-C.
  • [14] H. Grabner, M. Grabner, and H. Bischof (2006) Real-time tracking via on-line boosting. In Proceedings of the British Machine Vision Conference, Cited by: §I.
  • [15] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M. Cheng, S. L. Hicks, and P. H. Torr (2016) Struck: structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 2096–2109. Cited by: §I.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §I, §II-A.
  • [17] D. Held, S. Thrun, and S. Savarese (2016) Learning to track at 100 fps with deep regression networks. In Proceedings of the European Conference on Computer Vision, pp. 749–765. Cited by: §II-B.
  • [18] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 583. Cited by: §I, §II-B, §III-A, §IV-A, §V-B, §V-C.
  • [19] J. F. Henriques, C. Rui, P. Martins, and J. Batista (2012) Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European Conference on Computer Vision, pp. 702–715. Cited by: §I, §II-B, §IV-A.
  • [20] J. Huang, W. Zou, J. Zhu, and Z. Zhu (2018) Optical flow based real-time moving object detection in unconstrained scenes. arXiv preprint arXiv:1807.04890. Cited by: §I.
  • [21] J. Huang, W. Zou, Z. Zhu, J. Zhu, et al. (2019) Motion cue based instance-level moving object detection. Cited by: §I.
  • [22] J. Huang, W. Zou, Z. Zhu, and J. Zhu (2018) An efficient optical flow based motion detection method for non-stationary scenes. arXiv preprint arXiv:1811.08290. Cited by: §I.
  • [23] J. Huang, W. Zou, and Z. Zhu (2018) Optical flow based online moving foreground analysis. arXiv preprint arXiv:1811.07256. Cited by: §I.
  • [24] L. Huang, Y. Yang, Y. Deng, and Y. Yu (2015) Densebox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874. Cited by: §II-C.
  • [25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. Cited by: §V-A.
  • [26] Z. Kang, W. Zou, H. Ma, and Z. Zhu Adaptive trajectory tracking of wheeled mobile robots based on a fish-eye camera. International Journal of Control, Automation and Systems, pp. 1–13. Cited by: §I.
  • [27] H. Kiani Galoogahi, A. Fagg, and S. Lucey (2017-10) Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §I, §II-B, §V-B, §V-C.
  • [28] H. Kiani Galoogahi, T. Sim, and S. Lucey (2015) Correlation filters with limited boundaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4630–4638. Cited by: §II-B.
  • [29] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojír̃, G. Häger, A. Lukežič, and G. Fernández (2016) The visual object tracking vot2016 challenge results. In Proceedings of the European Conference on Computer Vision Workshop, pp. 191–217. Cited by: §I, Fig. 12, §V-E, §V.
  • [30] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Č. Zajc, T. Vojír̃, G. Bhat, A. Lukežič, A. Eldesokey, et al. (2018) The sixth visual object tracking vot2018 challenge results. In European Conference on Computer Vision, pp. 3–53. Cited by: §I.
  • [31] M. Kristan, J. Matas, A. Leonardis, and M. Felsberg (2015) The visual object tracking vot2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshop, pp. 564–586. Cited by: §V-D, §V.
  • [32] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §I, §II-A.
  • [33] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8971–8980. Cited by: §II-B.
  • [34] H. Li, Y. Li, and F. Porikli (2016) Deeptrack: learning discriminative feature representations online for robust visual tracking. IEEE Transactions on Image Processing 25 (4), pp. 1834–1848. Cited by: §II-A.
  • [35] P. Li, J. Zhang, Z. Zhu, Y. Li, L. Jiang, and G. Huang (2019) State-aware re-identification feature for multi-target multi-camera tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §I.
  • [36] Y. Li and J. Zhu (2014) A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the European Conference on Computer Vision Workshop, pp. 254–265. Cited by: §I, §IV-B.
  • [37] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang (2019) Attention-guided unified network for panoptic segmentation. In CVPR, Cited by: §II-A.
  • [38] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §I, §II-C.
  • [39] A. Lukežič, T. Vojíř, L. Čehovin, J. Matas, and M. Kristan (2017) Discriminative correlation filter with channel and spatial reliability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §V-B.
  • [40] C. Ma, J. Huang, X. Yang, and M. Yang (2015-12) Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §I, §II-A, §III-A, §IV-A, §V-B, §V-C.
  • [41] C. Ma, X. Yang, C. Zhang, and M. Yang (2015) Long-term correlation tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5388–5396. Cited by: §I, §V-B, §V-C.
  • [42] H. Ma, W. Zou, Z. Zhu, C. Zhang, and Z. Kang Selection of observation position and orientation in visual servoing with eye-in-vehicle configuration for manipulator. International Journal of Automation and Computing, pp. 1–14. Cited by: §I.
  • [43] S. Ma, H. Jiang, M. Han, J. Xie, and C. Li (2017) Research on automatic parking systems based on parking scene recognition. IEEE Access 5, pp. 21901–21917. Cited by: §I.
  • [44] K. Matej, J. Matas, A. Leonardis, M. Felsberg, L. Čehovin, G. Fernández, T. Vojíř, G. Häger, G. Nebehay, R. Pflugfelder, et al. (2015) The visual object tracking vot2015 challenge results. In Workshop on the Visual Object Tracking Challenge (VOT, in conjunction with ICCV), Cited by: §I, §I.
  • [45] M. Mueller, N. Smith, and B. Ghanem (2017) Context-aware correlation filter tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1396–1404. Cited by: §V-B.
  • [46] H. Nam and B. Han (2016-06) Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-A, §III-A, §IV, §V-D.
  • [47] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M. Yang (2016-06) Hedged deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I, §II-A, §III-A, §V-B, §V-C.
  • [48] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, pp. 91–99. Cited by: §I, §II-A, §II-C.
  • [49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252. External Links: Document Cited by: §V-A.
  • [50] A. N. Shuaibu, I. Faye, Y. S. Ali, N. Kamel, M. N. Saad, and A. S. Malik (2017) Sparse representation for crowd attributes recognition. IEEE Access 5, pp. 10422–10433. Cited by: §I.
  • [51] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §V-A.
  • [52] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah (2014) Visual tracking: an experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1442–1468. Cited by: §I.
  • [53] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr (2017) End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-B, Fig. 13, §V-B, §V-C.
  • [54] D. Wang, H. Lu, Z. Xiao, and M. Yang (2015) Inverse sparse tracker with a locally weighted distance metric. IEEE Transactions on Image Processing 24 (9), pp. 2646–2657. Cited by: §I.
  • [55] D. Wang, H. Lu, and M. Yang (2013) Online object tracking with sparse prototypes. IEEE Transactions on Image Processing 22 (1), pp. 314–325. Cited by: §I.
  • [56] L. Wang, W. Ouyang, X. Wang, and H. Lu (2015) Visual tracking with fully convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119–3127. Cited by: §I, §II-C.
  • [57] N. Wang, S. Li, A. Gupta, and D. Yeung (2015) Transferring rich feature hierarchies for robust visual tracking. arXiv preprint arXiv:1501.04587. Cited by: §II-A.
  • [58] N. Wang and D. Yeung (2013) Learning a deep compact image representation for visual tracking. In Proceedings of the Advances in Neural Information Processing Systems, pp. 809–817. Cited by: §III-A, §IV.
  • [59] Q. Wang, J. Gao, J. Xing, M. Zhang, and W. Hu (2017) DCFNet: discriminant correlation filters network for visual tracking. arXiv preprint arXiv:1704.04057. Cited by: §II-B.
  • [60] Q. Wang, W. Zou, D. Xu, and Z. Zhu (2017) Motion control in saccade and smooth pursuit for bionic eye based on three-dimensional coordinates. Journal of Bionic Engineering 14 (2), pp. 336–347. Cited by: §I.
  • [61] Y. Wu, J. Lim, and M. H. Yang (2015) Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1834–1848. Cited by: §I, §V-C, §V.
  • [62] Y. Wu, J. Lim, and M. H. Yang (2013) Online object tracking: a benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2411–2418. Cited by: §V-B, §V.
  • [63] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, pp. 818–833. Cited by: §V-A, §V-B2.
  • [64] J. Zhang, Z. Zhu, W. Zou, P. Li, Y. Li, H. Su, and G. Huang (2019) FastPose: towards real-time pose estimation and tracking via scale-normalized multi-task networks. arXiv preprint arXiv:1908.05593. Cited by: §I.
  • [65] R. Zhang, Z. Zhu, P. Li, R. Wu, C. Guo, G. Huang, and H. Xia (2019) Exploiting offset-guided network for pose estimation and tracking. arXiv preprint arXiv:1906.01344. Cited by: §I.
  • [66] J. Zhu, Z. Zhu, and W. Zou (2018) End-to-end video-level representation learning for action recognition. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 645–650. Cited by: §II-B.
  • [67] J. Zhu, W. Zou, L. Xu, Y. Hu, Z. Zhu, M. Chang, J. Huang, G. Huang, and D. Du (2018) Action machine: rethinking action recognition in trimmed videos. arXiv preprint arXiv:1812.05770. Cited by: §II-A.
  • [68] J. Zhu, W. Zou, and Z. Zhu (2017) Learning gating convnet for two-stream based methods in action recognition. arXiv preprint arXiv:1709.03655 1 (2), pp. 6. Cited by: §II-A.
  • [69] J. Zhu, W. Zou, and Z. Zhu (2018) Two-stream gated fusion convnets for action recognition. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 597–602. Cited by: §II-A.
  • [70] Z. Zhu, G. Huang, W. Zou, D. Du, and C. Huang (2017-10) UCT: learning unified convolutional networks for real-time visual tracking. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Vol. , pp. 1973–1982. Cited by: §I-A.
  • [71] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu (2018) Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 101–117. Cited by: §II-B.
  • [72] Z. Zhu, W. Wu, W. Zou, and J. Yan (2018) End-to-end flow correlation tracking with spatial-temporal attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-B.
  • [73] Z. Zhu, W. Zou, Q. Wang, F. Zhang, et al. (2018) A velocity compensation visual servo method for oculomotor control of bionic eyes. Cited by: §I.
  • [74] Z. Zhu, W. Zou, Q. Wang, and F. Zhang (2016) STD: a stereo tracking dataset for evaluating binocular tracking algorithms. In 2016 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 2215–2220. Cited by: §I.