I Introduction
Visual tracking is one of the most important components of a computer vision system. It has been widely used in visual surveillance, human-computer interaction, and robotics [1, 2]. Given an annotation of the object (bounding box) in the first frame, the task of visual tracking is to estimate the target locations and scales in subsequent video frames. Though much progress has been made in recent years, robust visual tracking, which can cope with widely varying circumstances, remains a challenging problem [1]. On the one hand, trackers are expected to compensate for large appearance changes caused by illumination, occlusion, etc. On the other hand, the real-time requirement of practical applications precludes the use of overcomplicated models.

Briefly speaking, there are two major categories of modern trackers: generative trackers and discriminative trackers [3, 1]. Generative trackers typically assume a generative process of the target appearance and search for the regions most similar to the target model, while discriminative trackers usually train a classifier to distinguish the target from the background. Among discriminative trackers, correlation filter-based trackers (CFTs) have drawn increasing attention since the development of the Kernel Correlation Filter (KCF) tracker [4]. As has been demonstrated, KCF can achieve impressive accuracy, robustness and speed on both the Online Tracking Benchmark (OTB) [5] and the Visual Object Tracking (VOT) challenges [6].

Despite the overwhelming evidence of success achieved by CFTs, two observations motivate our tracking approach. First, almost all CFTs ignore the temporal consistency or invariance among consecutive frames, which has been demonstrated to be effective in augmenting tracking performance [7, 8]. Second, there is still a lack of a theoretically sound yet computationally efficient model to integrate multiple feature cues. Admittedly, integrating different channel features is not new under the CFT umbrella. However, previous work either straightforwardly concatenates various feature vectors [9, 10] (i.e., assumes mutual independence of feature channels) or incurs a high computational burden that severely compromises the tracking speed [11, 12].

In this paper, to circumvent these two drawbacks simultaneously, we build on the basic KCF tracker and present a multi-frame multi-feature joint modeling tracker (MF2JMT) to strike a good trade-off between robustness and speed. In MF2JMT, the interdependencies between different feature cues are modeled using a multi-view learning (MVL) approach to enhance the discriminative power of the target appearance model under various changes. Specifically, we use the view consistency principle to regularize the objective functions learned from each view so that they agree on the labels of most training samples [13]. In addition, temporal consistency is exploited under a multi-task learning (MTL) framework [14], i.e., we model several consecutive frames simultaneously by constraining the objectives learned from each frame to be close to their mean. We extend MF2JMT to its kernelized version (i.e., KMF2JMT) and develop a fast blockwise diagonal matrix inversion algorithm to accelerate model training. Finally, we adopt a two-stage filtering pipeline [9, 15, 16] to cope with the problem of scale variations.
To summarize, the main contributions of our work are twofold. First, a novel tracker that integrates multiple feature cues and temporal consistency in a unified model is developed for robust visual tracking. Specifically, instead of simply concatenating multiple features into a single vector, we demonstrate how to reorganize these features by taking their intercorrelations into consideration. Moreover, we present an advanced way to reconcile temporal relatedness among multiple consecutive frames, rather than naively using a forgetting factor [10, 17]. Second, a fast blockwise diagonal matrix inversion algorithm is developed to speed up training and detection. Experiments against state-of-the-art trackers reveal the underlying importance of “intelligent” feature integration and temporal modeling, and also illustrate future directions for the design of modern discriminative trackers.
The rest of this paper is organized as follows. In Section II, we introduce the background knowledge. Then, in Section III, we discuss the detailed methodology of MF2JMT and extend it to the kernel setting. A fast algorithm for blockwise diagonal matrix inversion is also developed to speed up training and detection. Following this, an adaptive scale estimation mechanism is incorporated into our tracker in Section IV. The performance of our proposed approach against other state-of-the-art trackers is evaluated in Section V. This paper is concluded in Section VI.
Notation: Scalars are denoted by lowercase letters, vectors by lowercase boldface letters, and matrices by uppercase letters. Individual elements of a matrix are referenced by their row and column indices, and separate superscripts denote the transpose and the Hermitian transpose. The inverse is defined for square matrices. The identity matrix has compatible dimensions wherever it appears, and the diagonal operator builds a square diagonal matrix with the elements of a vector on its main diagonal. A row of a matrix is treated as a row vector and a column as a column vector. Vector norms are defined through the absolute value operator applied to the elements of the vector.
II Related work
In this section, we briefly review previous work of most relevance to our approach, including popular conventional trackers, the basic KCF tracker and its extensions.
II-A Popular conventional trackers
Success in visual tracking relies heavily on how discriminative and robust the representation of target appearance is against varying circumstances [18, 19]. This is especially important for discriminative trackers, in which a binary classifier is required to distinguish the target from the background. Numerous types of visual features have been successfully applied in discriminative trackers over the last decades, including color histograms [20, 21], texture features [22], Haar-like features [23], etc. Unfortunately, none of them can handle all kinds of varying circumstances individually, and the discriminative capability of a single type of feature is not stable across the video sequence [20]. As a result, it has become prevalent to take advantage of multiple features (i.e., multi-view representations) to enable more robust tracking. For example, [24, 25] used AdaBoost to combine an ensemble of weak classifiers into a powerful classifier, where each weak classifier is trained online on a different training set using pixel colors and a local orientation histogram. Note that several generative trackers have also improved performance by incorporating multi-view representations. [26] employed a group sparsity technique to integrate color histograms, intensity, histograms of oriented gradients (HOGs) [27] and local binary patterns (LBPs) [28] by requiring these features to share the same subset of the templates, whereas [29] proposed a probabilistic approach to integrate HOGs, intensity and Haar-like features for robust tracking. These trackers perform well under normal conditions, but they are far from satisfactory when tested on challenging videos.
II-B The basic KCF tracker and its extensions
Much like other discriminative trackers, the KCF tracker needs a set of training examples to learn a classifier. The key idea of the KCF tracker is to augment the negative samples to enhance the discriminative ability of the tracking-by-detection scheme, while exploiting the structure of the circulant matrix to speed up training and detection.
Following the basic KCF tracker, numerous extensions have been proposed to boost its performance, which generally fall into two categories: application of improved features and conceptual improvements in filter learning [30]. The first category lies in designing more discriminative features to deal with challenging environments [31, 32] or straightforwardly concatenating multiple feature cues into a single feature vector to boost representation power [9, 33]. A recent trend is to integrate features extracted from convolutional neural networks (CNNs) trained on large image datasets to replace the traditional hand-crafted features [34, 35]. The design or integration of more powerful features often suffers from a high computational burden, whereas blindly concatenating multiple features assumes the mutual independence of these features and neglects their interrelationships. Therefore, it is desirable to have a more “intelligent” manner to reorganize these features such that their interrelationships are fully exploited.

Conceptually, theoretical extensions of filter learning have also drawn much attention. Early work focused on accurate and efficient scale estimation, as the basic KCF assumes a fixed target size [15, 11]. More recent work concentrated on developing more advanced convolutional operators. For example, [36] employed an interpolation model to train convolution filters in the continuous spatial domain. The so-called Continuous Convolution Operator Tracker (CCOT) is further improved in [37], in which the authors introduce factorized convolution operators to drastically reduce the number of parameters of CCOT as well as the number of filters.

Apart from these two extensions, efforts have been made to embed conventional tracking strategies (like part-based tracking [38]) in the KCF framework. For example, [39] developed a probabilistic tracking reliability metric to measure how reliably a patch can be tracked. On the other hand, [40] employed an online random fern classifier as a re-detection component for long-term tracking, whereas [16] presented a biology-inspired framework in which short-term processing and long-term processing cooperate with each other under a correlation filter framework. Finally, it is worth noting that, with the rapid development of correlation filters for visual tracking, several correlation filter-based thermal infrared trackers (e.g., [41, 42, 43]) have been developed in recent years. This work only focuses on visual tracking on color video sequences. We leave the extension to thermal infrared video sequences as future work.
III The Multi-Frame Multi-Feature Joint Modeling Tracker (MF2JMT)
MF2JMT makes two strategic extensions to the basic KCF tracker. Our idea is to integrate the temporal information and multiple feature cues in a unified model, thus providing a theoretically sound and computationally efficient solution for robust visual tracking. To this end, instead of simply concatenating different features, MF2JMT integrates multiple feature cues using an MVL approach to better exploit their interrelationships, thus forming a more informative representation of the target appearance. Moreover, temporal consistency is taken into account under an MTL framework. Specifically, unlike prevalent CFTs that learn filter taps using a ridge regression function which only makes use of the template (i.e., circulant matrix) from the current frame, we show that it is possible to use examples from several consecutive frames to learn the filter taps very efficiently by exploiting the structure of the blockwise diagonal matrix.

Before our work, there were two ways of incorporating temporal consistency into the KCF framework. The Spatio-Temporal Context (STC) tracker [10] and its extensions (e.g., [17]) formulate the spatial relationships between the target and its surrounding dense contexts in a Bayesian framework and then use a temporal filtering procedure together with a forgetting factor to update the spatio-temporal model. Despite its simplicity, the tracking accuracy of the STC tracker is poor compared with other state-of-the-art KCF trackers (see Section V-F). On the other hand, another trend is to learn a temporally invariant feature representation trained on a natural video repository for visual tracking [31]. Although the learned feature can accommodate partial occlusion or slight illumination variation, it cannot handle scenarios with large appearance changes. Moreover, the effectiveness of the trained features depends largely on the selected repository, which limits their generalization capability. Different from these two kinds of methods, we explicitly model multiple consecutive frames in a joint cost function to circumvent abrupt drift in a single frame and to reconcile temporal relatedness among frames in a short period of time. Experiments demonstrate the superiority of our method.
III-A Formulation of MF2JMT
We start from the formulation of the basic MF2JMT. Consider a set of consecutive training frames, each with its own set of candidate patches. For each frame, the features of the first type are stacked into one matrix and the features of the second type into another, with each row representing one sample; the two feature types may have different dimensions. (Footnote 1: This paper only considers two types of features, but the developed objective function (1) and the associated solution can be straightforwardly extended to three or more feature cues.) A label vector holds the sample labels of each frame. The objective of MF2JMT can then be formulated as:
(1)  
where the nonnegative regularization parameters control model complexity. For the first type of feature, one vector of regression coefficients is shared by all frames, and a per-frame deviation term captures how the coefficients of each frame depart from the shared vector. The same definition holds for the second type of feature.
Problem (1) contains three different items with distinct objectives, namely the MTL item, the MVL item and the regularization item. The MTL item is analogous to the formulation of regularized multi-task learning (RMTL) [14], as it encourages the coefficients of each frame to stay close to their mean value with a small deviation. The MVL item employs the view consistency principle that widely exists in MVL approaches (e.g., [44]) to constrain the objectives learned from the two views to agree on most training samples. Finally, the regularization item serves to prevent an ill-posed solution and to enhance the robustness of the selected features to noise and outliers.
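As a reading aid, the structure described above can be illustrated with a generic multi-frame, two-view ridge objective. This is only a sketch under assumptions: the symbols ($\mathbf{X}_t$, $\mathbf{Z}_t$ for the two feature matrices of frame $t$, $\mathbf{y}_t$ for the labels, $\mathbf{w}_0$, $\mathbf{v}_0$ for the shared coefficients, $\mathbf{p}_t$, $\mathbf{q}_t$ for the deviations, and the weights $\mu$, $\lambda_i$, $\gamma_i$) and the exact weighting are chosen here for illustration and are not taken from the paper's equation (1).

```latex
\begin{equation*}
\min_{\mathbf{w}_0,\mathbf{p}_t,\mathbf{v}_0,\mathbf{q}_t}\;
\sum_{t=1}^{T}\Big(
  % data-fit terms, one per view
  \|\mathbf{y}_t-\mathbf{X}_t(\mathbf{w}_0+\mathbf{p}_t)\|_2^2
 +\|\mathbf{y}_t-\mathbf{Z}_t(\mathbf{v}_0+\mathbf{q}_t)\|_2^2
  % view-consistency (MVL-style) term: both views should agree on the samples
 +\mu\,\|\mathbf{X}_t(\mathbf{w}_0+\mathbf{p}_t)-\mathbf{Z}_t(\mathbf{v}_0+\mathbf{q}_t)\|_2^2
  % RMTL-style term: per-frame deviations from the shared coefficients stay small
 +\lambda_1\|\mathbf{p}_t\|_2^2+\lambda_2\|\mathbf{q}_t\|_2^2
\Big)
% ridge regularization on the shared coefficients
+\gamma_1\|\mathbf{w}_0\|_2^2+\gamma_2\|\mathbf{v}_0\|_2^2
\end{equation*}
```

Every term in this sketch is quadratic in the unknowns, which is why an objective of this shape admits a closed-form minimizer.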
III-B Solution to MF2JMT
Minimizing (1) has a closed-form solution, obtained by equating its gradients with respect to the model parameters to zero. Despite its simplicity, reorganizing the resulting array of linear equations into standard form is intractable and computationally expensive. As an alternative, we present an equivalent yet simpler form of (1), which can be solved with a single matrix inversion.
Denote
(2)  
(3)  
(4)  
(5) 
(6) 
where the zero matrices have the same sizes as the corresponding feature matrices, we have:
(7)  
(8)  
(9)  
(10)  
Denote
(12) 
then the reformulated objective becomes:
(13)  
Equating the gradients of (13) with respect to its unknowns to zero, a straightforward derivation gives:
(14) 
Having computed the model coefficients, the responses of candidate samples in the next frame for the trained model can be computed as:
where
(18) 
in which the two test matrices are constructed from features in the new frame. Specifically, for the first type of feature, the block that corresponds to the new frame is built from the features of that frame as defined in (4), while the remaining blocks are set to zero matrices so that the overall size coincides with that in (12). In this sense, only the coefficients associated with the non-zero blocks contribute to the response. The same goes for the second type of feature.
III-C Kernel extension and fast implementation
Although (15) gives a tractable solution to (1), it requires a matrix inversion whose computational complexity is cubic in the matrix dimension when using the well-known Gaussian elimination algorithm [46]. In this section, for a more powerful regression function and a fast implementation, we demonstrate how to incorporate the dense sampling strategy into the basic MF2JMT under the kernel setting to speed up its training and detection. A computationally efficient optimization method is also presented by exploiting the structure of the blockwise diagonal matrix.
Dense sampling was previously considered to be a drawback for discriminative trackers because of the large number of redundant samples that are required [3]. However, when these samples are collected and organized properly, they form a circulant matrix that can be diagonalized efficiently using the DFT matrix, so that the dual ridge regression problem can be solved entirely in the frequency domain [4]. Building on this attractive property, we first show that it is easy to apply the “kernel trick” to MF2JMT. We then show that, with dense sampling, it is possible to obtain nonlinear filters that are as fast as linear correlation filters, both to train and to evaluate. We term this improvement the kernel multi-frame multi-feature joint modeling tracker (KMF2JMT).
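The frequency-domain shortcut for circulant systems can be illustrated with a small, self-contained sketch. It is not the tracker's implementation: it assumes a circulant matrix defined by its first column (the function names, the naive O(n^2) DFT and the toy numbers are all illustrative), and it solves a ridge-type system (C + lam * I) a = y by elementwise division of DFT coefficients.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), kept simple for clarity."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        out.append(s / n if inverse else s)
    return out


def circulant(c):
    """Dense circulant matrix whose first column is c (illustrative convention)."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]


def solve_circulant_ridge(c, y, lam):
    """Solve (C + lam * I) a = y, where C = circulant(c), in the frequency domain.

    Multiplying by C is a circular convolution with c, so the DFT turns the
    linear system into n independent scalar divisions.
    """
    c_hat, y_hat = dft(c), dft(y)
    a_hat = [yh / (ch + lam) for yh, ch in zip(y_hat, c_hat)]
    return [v.real for v in dft(a_hat, inverse=True)]


if __name__ == "__main__":
    c = [4.0, 1.0, 0.0, 1.0]      # toy first column (symmetric, positive spectrum)
    y = [1.0, 2.0, 3.0, 4.0]      # toy regression targets
    print(solve_circulant_ridge(c, y, lam=0.5))
```

With a fast Fourier transform in place of the naive DFT, the same division costs on the order of n log n operations instead of the cubic cost of a dense solve.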
Denote
(19) 
we have
(20) 
and
where .
According to (12), consists of block matrices and the th () block matrix can be represented as:
(22) 
where with denotes the indicator function.
If we project the feature matrices of the two views onto the Reproducing Kernel Hilbert Space (RKHS), i.e., apply a nonlinear transform to both of them, we obtain the kernelized version of (1), i.e., KMF2JMT. According to [4], (22) can be represented as:
(23) 
where the right-hand side is a circulant matrix generated by the corresponding kernel correlation vector (see Appendix A for more details). Similarly,
(24)  
(25)  
(26) 
Note that the four matrices in (23)-(26) are circulant, and thus can be made diagonal as expressed below [47]:
(27) 
where the diagonalizing matrix is the DFT matrix. Combining (20)-(27) and following Appendix A, the response of candidate samples introduced in Section III-B can be computed under the kernel setting as:
(28)  
where and are given in () and is a block diagonal matrix with blocks on its main diagonal.
Denote
(30) 
then
(31) 
According to the Equation () of [45], we have:
where
(33)  
It is obvious that and are blockwise diagonal matrix that can be partitioned into diagonal matrices of size . According to Theorem 1, the computational cost for or is . Besides, it takes product operations to compute , and . In this sense, the computational cost for is still . On the other hand, the DFT bounds the cost at nearly by exploiting the circulant structure [4]. Therefore, the overall cost for computing is given that there are inverse DFTs in (28).
Theorem 1
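Given an invertible blockwise diagonal matrix that can be partitioned into a square grid of diagonal blocks of equal size, the computational complexity of computing its inverse is linear in the size of each diagonal block and cubic in the number of blocks per side.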
Proof:
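Consider the blocks of the matrix, each of which is a diagonal matrix. Through a series of elementary matrix operations, i.e., premultiplying or postmultiplying the matrix by elementary matrices, we can interchange its rows and columns arbitrarily. Therefore, there exists an invertible permutation matrix, whose inverse is its transpose, such that each block on the main diagonal of the permuted matrix gathers the entries that occupy the same diagonal position in all the original blocks. Obviously, all off-diagonal blocks of the permuted matrix are zero. Thus, the permuted matrix can be represented as:
(34) 
The inverse of the original matrix is then obtained by inverting the permuted matrix and allocating the elements of each inverted block back to the corresponding row and column indices. The main computational cost comes from inverting the small dense blocks on the main diagonal of the permuted matrix. Each of these blocks has as many rows as the original matrix has blocks per side, so inverting one of them is cubic in that number, and there are as many of them as the size of one original diagonal block. As a result, the overall computational complexity is linear in the size of each diagonal block and cubic in the number of blocks per side.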
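The regrouping argument in the proof can be sketched in a few lines of Python. The code below is illustrative only: the data layout (`diags[i][j]` holding the diagonal of block (i, j)) and all function names are assumptions made for this example. For every diagonal position it collects one small dense matrix, inverts it with Gauss-Jordan elimination, and scatters the result back, which is equivalent to the permutation described above.

```python
def invert_small(a):
    """Gauss-Jordan inversion with partial pivoting for a small dense matrix."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def invert_blockwise_diagonal(diags):
    """Invert a matrix made of a K x K grid of diagonal blocks of size m.

    diags[i][j] is the length-m diagonal of block (i, j). The inverse has the
    same structure and is returned in the same layout.
    """
    k_blocks = len(diags)
    m = len(diags[0][0])
    inv = [[[0.0] * m for _ in range(k_blocks)] for _ in range(k_blocks)]
    for d in range(m):  # one independent K x K problem per diagonal position
        small = [[diags[i][j][d] for j in range(k_blocks)] for i in range(k_blocks)]
        small_inv = invert_small(small)
        for i in range(k_blocks):
            for j in range(k_blocks):
                inv[i][j][d] = small_inv[i][j]
    return inv


def to_dense(diags):
    """Expand the compact layout into a full dense matrix (for checking only)."""
    k_blocks, m = len(diags), len(diags[0][0])
    dense = [[0.0] * (k_blocks * m) for _ in range(k_blocks * m)]
    for i in range(k_blocks):
        for j in range(k_blocks):
            for d in range(m):
                dense[i * m + d][j * m + d] = diags[i][j][d]
    return dense
```

The loop solves m independent K x K problems, so the cost grows linearly with the block size m and cubically with the number of blocks per side K, instead of cubically with the full dimension K * m. Because no entry is coerced to a real type, the same routine also accepts complex entries such as DFT coefficients.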
IV Scale Adaptive KMF2JMT
To further improve the overall performance of KMF2JMT, we follow the Integrated Correlation Filters (ICFs) framework in [16] to cope with scale variations. The ICF is a cascaded filtering process that performs translation estimation and scale estimation in separate stages (the same pipeline as adopted in [15] and [9]). Unless otherwise specified, KMF2JMT in the following experimental parts refers to the scale-adaptive version.
Specifically, in scale-adaptive KMF2JMT, the training of the basic KMF2JMT is accompanied by the training of a separate one-dimensional Discriminative Scale Space Correlation Filter (DSSCF) [15], and this newly trained filter is used for scale estimation. To evaluate the trained DSSCF, a set of image patches centered at the location found by KMF2JMT is cropped from the image; their sizes are obtained by multiplying the target size in the current frame by successive powers of a scale factor. All image patches are then resized to the template size for feature extraction. The output of the scale estimation is the image patch with the highest filtering response. Similar to KMF2JMT, the model parameters are also updated in an interpolating manner with a fixed learning rate. We refer readers to [15] for more details and the implementation of DSSCF.
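The scale search and the interpolating update described above can be sketched as follows. This is a minimal illustration, not the DSSCF implementation of [15]: the function names are hypothetical, and the scale factor, number of scales and learning rate used in the example call are placeholder values chosen for illustration only.

```python
def scale_candidates(target_w, target_h, scale_factor, num_scales):
    """Candidate patch sizes: the target size times successive powers of the scale factor.

    num_scales is assumed to be odd so that the unscaled size sits in the middle.
    """
    half = (num_scales - 1) // 2
    return [(target_w * scale_factor ** n, target_h * scale_factor ** n)
            for n in range(-half, half + 1)]


def best_scale(sizes, responses):
    """Pick the candidate size whose filtering response is highest."""
    idx = max(range(len(responses)), key=responses.__getitem__)
    return sizes[idx]


def interpolate_model(old, new, learning_rate):
    """Linear interpolation between the previous and the newly trained model."""
    return [(1 - learning_rate) * o + learning_rate * n for o, n in zip(old, new)]


if __name__ == "__main__":
    # Placeholder settings, for illustration only.
    sizes = scale_candidates(100, 50, scale_factor=1.02, num_scales=5)
    responses = [0.1, 0.2, 0.9, 0.3, 0.1]   # toy filter responses, one per candidate
    print(best_scale(sizes, responses))
    print(interpolate_model([1.0, 2.0], [3.0, 2.0], learning_rate=0.5))
```

A small learning rate keeps the model close to its history, while a large one lets the most recent frame dominate.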
V Experiments
We conduct four groups of experiments to demonstrate the effectiveness and superiority of our proposed KMF2JMT. First, we implement KMF2JMT and several of its baseline variants, including the multi-feature-only tracker (MFT), the multi-frame-only tracker (MFT2), the scale-adaptive-only tracker (SAT), the scale-adaptive multi-feature tracker (SAMFT) and the scale-adaptive multi-frame tracker (SAMFT2), to analyze and evaluate the component-wise contributions to the performance gain. We then evaluate our tracker against state-of-the-art trackers on the Object Tracking Benchmark (OTB) [5]. Following this, we present results on the Visual Object Tracking (VOT) challenge [48]. Finally, we compare KMF2JMT with several other representative CFTs, including MOSSE [49], SAMF [9], MUSTer [16] and the recently published CCOT [36], to reveal the properties of our proposed tracker within the CFT family and to illustrate future research directions for modern discriminative tracker design.
V-A Experimental setup
To make a comprehensive evaluation, we use all the color video sequences in the OTB 2015 dataset. These sequences are captured in various conditions, including occlusion, deformation, illumination variation, scale variation, out-of-plane rotation, fast motion, background clutter, etc. We use the success plot to evaluate all trackers on the OTB dataset. The success rate is the percentage of successfully tracked frames, determined by measuring the overlap score of a tracker on each frame. The average overlap measure is the most appropriate for tracker comparison, as it accounts for both position and size. Given the tracking bounding box and the ground truth bounding box, the overlap score is defined as the number of pixels in the intersection of the two regions divided by the number of pixels in their union. In the success plot, the overlap threshold is varied from 0 to 1, and the ranking of trackers is based on the Area Under Curve (AUC) score. We also report the speed of trackers in terms of average frames per second (FPS) over all testing sequences.
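The overlap score and the success plot can be computed with a few lines of Python. The sketch below is illustrative: boxes are assumed to be given as (x, y, width, height), the function names are hypothetical, and the evenly spaced threshold grid in the example is an assumption rather than a setting taken from the benchmark toolkit.

```python
def overlap_score(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_curve(overlaps, thresholds):
    """Fraction of frames whose overlap exceeds each threshold."""
    return [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]


def auc(curve):
    """Area under the success curve, approximated by the mean success rate."""
    return sum(curve) / len(curve)


if __name__ == "__main__":
    tracked = [(0, 0, 10, 10), (5, 5, 10, 10), (20, 20, 10, 10)]
    truth = [(0, 0, 10, 10), (5, 0, 10, 10), (0, 0, 10, 10)]
    overlaps = [overlap_score(a, b) for a, b in zip(tracked, truth)]
    thresholds = [i / 20 for i in range(21)]   # assumed grid over [0, 1]
    print(auc(success_curve(overlaps, thresholds)))
```

Ranking by AUC rewards trackers that keep a high overlap across the whole range of thresholds, rather than at a single operating point.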
We also test tracker performance on the VOT 2015 dataset. The VOT challenge is a popular competition for single object tracking. Unlike in the OTB 2015 dataset, a tracker is restarted in the case of a failure (i.e., when there is no overlap between the detected bounding box and the ground truth) in the VOT 2015 dataset. For the VOT 2015 dataset, tracking performance is evaluated in terms of accuracy (overlap with the ground truth), robustness (failure rate) and the expected average overlap (EAO). EAO is obtained by combining the raw values of accuracy and failures. Its score represents the average overlap a tracker would obtain on a typical short-term sequence without reset. For a full treatment of these metrics, interested readers can refer to [48].
V-B Implementation details of the proposed tracker
The regularization parameters are tuned with a coarse-to-fine procedure. In the coarse module, we roughly determine a satisfactory range for each parameter. Here, “satisfactory” means that values in the range achieve a higher mean success rate at a fixed overlap threshold on the OTB 2015 dataset. Then, in the fine-tuning module, we divide these parameters into three groups based on their correlations. When we test the values of one group of parameters, the other groups are set to their default values, i.e., the mean value of the optimal range given by the coarse module. Within each group, the parameters are tuned with grid search over the range at a fixed interval. We finally select the specific value that achieves the highest mean success rate. Besides, we set the learning rate following previous work [4, 9, 15] and model three consecutive frames in the MTL module. We select HOGs [27] and color names [50] for image representation in the MVL module. The HOGs and color names are complementary to each other, as the former emphasizes the image gradient, which is robust to illumination and deformation, while the latter focuses on the color information, which is robust to motion blur. For HOGs, we employ the variant in [51], with the implementation provided by [52]. More specifically, the cell size and the number of orientations are fixed across all experiments. For color names, we map the RGB values to a probabilistic color representation whose entries sum to 1. All the experiments in this work are conducted using MATLAB on an Intel quad-core PC. MATLAB code is available from our project homepage http://bmal.hust.edu.cn/project/KMF2JMTtracking.html.
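The fine-tuning stage of the procedure above, in which one group of parameters is searched on a grid while the other groups stay at their default values, can be sketched as follows. The parameter names, grids and scoring function are hypothetical stand-ins; in practice the score would be the mean success rate obtained by running the tracker with the trial parameters.

```python
from itertools import product


def tune_groupwise(groups, grids, defaults, score):
    """Grid-search each parameter group in turn, holding other groups at defaults.

    groups   : list of tuples of parameter names tuned jointly
    grids    : dict mapping a parameter name to its candidate values
    defaults : dict of default values (e.g. the middle of the coarse range)
    score    : callable taking a parameter dict and returning a number to maximize
    """
    tuned = dict(defaults)
    for group in groups:
        best_values, best_score = None, float("-inf")
        for values in product(*(grids[name] for name in group)):
            trial = dict(defaults)          # other groups stay at their defaults
            trial.update(zip(group, values))
            current = score(trial)
            if current > best_score:
                best_values, best_score = values, current
        tuned.update(zip(group, best_values))
    return tuned


if __name__ == "__main__":
    # Toy score with a known optimum at a=2, b=3, c=-1 (stand-in for a tracker run).
    def toy_score(p):
        return -((p["a"] - 2) ** 2 + (p["b"] - 3) ** 2 + (p["c"] + 1) ** 2)

    result = tune_groupwise(
        groups=[("a", "b"), ("c",)],
        grids={"a": [1, 2, 3], "b": [2, 3, 4], "c": [-2, -1, 0]},
        defaults={"a": 1, "b": 2, "c": 0},
        score=toy_score,
    )
    print(result)
```

Searching group by group keeps the number of tracker runs additive in the group grid sizes rather than multiplicative across all parameters, at the price of ignoring interactions between groups.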
V-C Evaluation on component-wise contributions
Before systematically evaluating the performance of our proposed tracker, we first compare KMF2JMT with its baseline variants to demonstrate the component-wise contributions to the performance gain. To this end, we implement seven trackers with various degraded settings, including the multi-feature-only tracker (MFT), which just uses multiple feature cues; the multi-frame-only tracker (MFT2), which just uses the temporal relatedness across consecutive frames; the scale-adaptive-only tracker (SAT), which just handles scale variations; the scale-adaptive multi-feature tracker (SAMFT); and the scale-adaptive multi-frame tracker (SAMFT2). Besides, to validate the benefit of modeling multiple feature cues under an MVL framework rather than simply concatenating them, we also implement a multiple-feature tracker with feature concatenation (MFTC). Note that KCF [4], which only uses the HOG feature without temporal modeling or scale searching, serves as the baseline in this section.
Table I summarizes the differences between these trackers, including the basic kernel multi-frame multi-feature joint modeling tracker without scale estimation. In this table, we report the mean success rate at the overlap threshold of 0.5, which corresponds to the PASCAL evaluation criterion [53]. Although these trackers share one or more common components, their tracking performances differ significantly. This indicates that the visual features, temporal relatedness and scale searching strategy are all essential to visual tracking. KCF ranks the lowest among the compared trackers, as expected. MFT (or MFTC), MFT2 and SAT extend KCF by augmenting the feature space with color information, taking advantage of temporal information through frame relatedness, and introducing a scale-adaptive searching strategy, respectively, and each achieves some improvement. MFT outperforms MFTC thanks to a more “intelligent” feature fusion strategy. Besides, scale-adaptive searching can effectively handle scale variations, thus obtaining a large improvement in success rate (compare SAT, SAMFT and SAMFT2 against KCF, MFT and MFT2 in Table I). Moreover, by comparing MFT2 with MFT and SAMFT2 with SAMFT, one can see that the integration of multiple frames contributes more to tracking performance than the integration of multiple features. Finally, comparing the individual success rate gains of MFT, MFT2 and SAT over KCF with the gain obtained by KMF2JMT (see Table I) indicates that KMF2JMT is not just a simple combination of MFT, MFT2 and SAT.
Tracker  MF  TM  SA  mean success rate
KCF  no  no  no  
MFT  yes  no  no  
MFTC  yes  no  no  
MFT2  no  yes  no  
SAT  no  no  yes  
SAMFT  yes  no  yes  
SAMFT2  no  yes  yes  
KMF2JMT (without scale estimation)  yes  yes  no  
KMF2JMT  yes  yes  yes  
V-D Comparison with state-of-the-art trackers
In this section, we provide a comprehensive comparison of our proposed KMF2JMT with state-of-the-art trackers on the OTB 2015 dataset: 1) state-of-the-art trackers that do not follow the CFT framework yet achieve remarkable performance on OTB 2015, namely MEEM [54], TGPR [55], LSST [56] and MTMVTLAD [26]; 2) the popular trackers provided in [5], such as Struck [57], ASLA [58], SCM [59], TLD [60], MTT [61], VTD [62], VTS [63] and MIL [64]. Note that, as a generative tracker, MTMVTLAD has similar motivations to our approach, as it also attempts to integrate multi-task multi-view learning to improve tracking performance. However, unlike KMF2JMT, MTMVTLAD casts tracking as a sparse representation problem in a particle filter framework. Moreover, MTMVTLAD does not explicitly reconcile the temporal coherence in consecutive frames.
The success plot of all the competing trackers using One Pass Evaluation (OPE) is shown in Fig. 1(a). For clarity, we only show the top-ranked trackers in this comparison. As can be seen, our proposed KMF2JMT achieves the best overall performance and consistently outperforms the second and third best trackers, i.e., MEEM [54] and TGPR [55]. Looking deeper (see the detailed summary in Table II), the MEEM tracker uses an entropy-minimization-based ensemble learning strategy to avert bad model updates, and the TGPR tracker is based on transfer learning. By contrast, our approach follows a CFT framework while using an explicit temporal consistency regularization to enhance robustness. Our tracker also outperforms MTMVTLAD in mean success rate. Finally, it is worth noting that our approach achieves this performance while operating in real time, whereas MEEM, TGPR and MTMVTLAD run at lower mean FPS.




We also report the tracking results using Temporal Robustness Evaluation (TRE) and Spatial Robustness Evaluation (SRE). TRE runs trackers on subsequences of different lengths segmented from the original sequence, and SRE evaluates trackers by initializing them with slightly shifted or scaled ground truth bounding boxes. With TRE and SRE, the robustness of each evaluated tracker can be comprehensively assessed. The SRE and TRE evaluations are shown in Fig. 1(b) and Fig. 1(c), respectively. In both evaluations, our approach provides a consistent performance gain over the majority of existing methods. MEEM achieves better robustness than our approach. A possible reason is that MEEM combines the estimates of an ensemble of experts to mitigate inaccurate predictions or sudden drifts, so that the weaknesses of individual experts are mutually compensated.

We also perform an attribute-based analysis of our tracker. In OTB, each sequence is annotated with attributes drawn from the following list: fast motion, background clutter, motion blur, deformation, illumination variation, in-plane rotation, low resolution, occlusion, out-of-plane rotation, out-of-view and scale variation. It is interesting to find that KMF2JMT ranks first on many of these attributes (see Fig. 2), especially on illumination variation, occlusion, out-of-plane rotation and scale variation. This verifies the superiority of KMF2JMT in target appearance representation and its capability to discover reliable coherence from short-term memory. However, the overwhelming advantage no longer exists for fast motion and out-of-view. This is because the temporal consistency among consecutive frames becomes weaker in these two scenarios. The qualitative comparison shown in Fig. 3 corroborates the quantitative evaluation results. It is worth noting that we strictly follow the protocol provided in [5] and use the same parameters for all sequences.
V-E VOT 2015 Challenge
In this section, we present results on the VOT challenge. We compare our proposed KMF2JMT with the participating trackers in this challenge. For a fair comparison, DSST [15] is substituted with its fast version (i.e., fDSST [65]) proposed by the same authors, as fDSST demonstrates superior performance to DSST, as shown in [65].
In the VOT benchmark, each video is annotated with five different attributes: camera motion, illumination change, occlusion, size change and motion change. Unlike in the OTB 2015 dataset, the attributes in VOT 2015 are annotated per frame in a video. Fig. 4 shows the accuracy and robustness (AR) rank plots generated by sequence pooling and attribute normalization. The pooled AR plots are generated by concatenating the experimental results on all sequences to directly obtain a rank list, while the attribute-normalized AR rank plots are obtained from the average of the ranks achieved on the individual attributes. Fig. 5 shows the EAO ranks. Only results for the top trackers in the VOT challenge are reported for clarity.
It is easy to summarize some key observations from these figures:
1) The CFTs (including our tracker, DeepSRDCF [35], SRDCF [66], RAJSSC [67], NSAMF [48], SAMF [9]) account for the majority of the top-performing trackers. By fully exploiting the representation power of CNNs or rotating candidate samples to augment the candidate set, MDNet [68] and sPST [69] also demonstrate superior (or even the best) performance. This result indicates that a well-designed discriminative tracker with a conventional tracking-by-detection scheme and random sampling in the detection phase can achieve almost the same tracking accuracy as state-of-the-art CFTs coupled with dense sampling, at the cost of a high computational burden (as officially reported in [48]).
2) Among the top-performing CFTs, our tracker ties for first place with DeepSRDCF, SRDCF, RAJSSC and NSAMF in terms of tracking accuracy. However, the robustness of our tracker is inferior to that of DeepSRDCF and SRDCF. Both DeepSRDCF and SRDCF introduce a spatial regularization penalty term to circumvent the boundary effects caused by conventional circular correlation, thus significantly mitigating the problems of inaccurate training samples and restricted search regions. However, one should note that these two trackers can hardly run in real time. A thorough comparison between our tracker and DeepSRDCF is given in Section V-F.
3) Our tracker outperforms RAJSSC, NSAMF and SAMF in terms of robustness and EAO values. All these trackers use HOGs and color information to describe target appearance under a scale-adaptive framework. The performance difference indicates that our tracker benefits from an advanced feature fusion scheme and that the integration of multiple frames contributes to the performance gain.
Apart from these three observations, there are other interesting points. For example, NSAMF achieves a large performance gain compared with SAMF. The main difference is that NSAMF substitutes the color name with color probability. On the other hand, EBT [70] reaches a fairly high robustness and EAO value. However, its tracking accuracy is poor. One possible reason is that the adopted contour information does not adapt well to target scale and aspect ratio changes.
V-F Comparison among correlation filter-based trackers (CFTs)
In this last section, we investigate the performance of representative CFTs. Our goal is to demonstrate the effectiveness of our tracker and reveal its properties when compared with other CFT counterparts. We also attempt to illustrate future research directions for CFTs through a comprehensive evaluation. Table III summarizes the basic information of the selected CFTs as well as their corresponding FPS values (some descriptions are adapted from [71]).
Tracker  Published year  Major contribution  FPS
MOSSE [49]  Pioneering work of introducing correlation filters for visual tracking.  
CSK [73]  Introduced the ridge regression problem with circulant matrices to apply kernel methods.  
STC [10]  Introduced spatio-temporal context information.  
CN [74]  Introduced color attributes as effective features.  
DSST [15]  Relieved the scaling issue using a feature pyramid and a 3-dimensional correlation filter.  
SAMF [9]  Integrated both color features and HOG features; applied a scaling pool to handle scale variations.  
KCF [4]  Formulated the work of CSK and introduced multi-channel HOG features.  
LCT [40]  Introduced an online random fern classifier as a re-detection component for long-term tracking.  
MUSTer [16]  Proposed a biology-inspired framework to integrate short-term processing and long-term processing.  
RPT [39]  Introduced reliable local patches to facilitate tracking.  
CF2 [34]  Introduced features extracted from convolutional neural networks (CNN) for visual tracking.  
DeepSRDCF [35]  Introduced CNN features for visual tracking and a spatial regularization term to handle boundary effects.  
MvCFT [12]  Introduced Kullback-Leibler (KL) divergence to fuse multiple feature cues.  
CCOT [36]  Employed an implicit interpolation model to train convolutional filters in the continuous spatial domain.  
Ours  Integrated multiple feature cues and temporal consistency in a unified model.  
Fig. 6 shows the AUC scores of success plots vs. FPS for all competing CFTs in OPE, TRE and SRE, respectively. The dashed vertical line (the real-time FPS threshold from [72]) separates the trackers into real-time trackers and those that cannot run in real time. Meanwhile, the solid horizontal line (the mean of the AUC values over all competing CFTs) separates the trackers into those that perform well and those that perform poorly in terms of tracking accuracy. As can be seen, our method performs the best in terms of accuracy among all real-time trackers, thus achieving a good trade-off between speed and accuracy. This suggests that our modifications to the basic KCF tracker are effective and efficient. By contrast, although introducing a long-term tracking strategy (e.g., MUSTer, LCT) or imposing a spatial regularization penalty term (e.g., DeepSRDCF) can augment tracking performance as well, most of these modifications cannot be directly applied in real applications where the real-time condition is a prerequisite. This unfortunate fact also applies to CCOT, in which the hand-crafted features are substituted with powerful CNN features. Therefore, a promising research direction is to investigate computationally efficient long-term tracking strategies or spatial regularization with little or no sacrifice in speed.
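The partition used in Fig. 6 can be sketched as follows. This is only an illustration: the function names, tracker names, FPS values, AUC values and the threshold passed in the example are hypothetical placeholders, not numbers from this paper or from [72].

```python
def partition_trackers(trackers, fps_threshold):
    """Split trackers by speed and accuracy.

    trackers maps name -> (fps, auc). Returns name -> (is_real_time,
    is_above_mean_auc), where the accuracy split is the mean AUC over
    all competing trackers.
    """
    mean_auc = sum(auc for _, auc in trackers.values()) / len(trackers)
    return {name: (fps >= fps_threshold, auc >= mean_auc)
            for name, (fps, auc) in trackers.items()}


def best_real_time(trackers, fps_threshold):
    """Name of the most accurate tracker among the real-time ones, or None."""
    real_time = {n: auc for n, (fps, auc) in trackers.items()
                 if fps >= fps_threshold}
    return max(real_time, key=real_time.get) if real_time else None


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    toy = {
        "tracker_a": (100.0, 0.50),
        "tracker_b": (5.0, 0.70),
        "tracker_c": (40.0, 0.65),
    }
    threshold = 20.0  # placeholder, not the value used in the paper
    print(partition_trackers(toy, threshold))
    print(best_real_time(toy, threshold))
```

A tracker in the (real-time, above-mean) quadrant offers the kind of speed and accuracy trade-off discussed above.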
Finally, it is worth mentioning that our approach provides a consistent gain in performance compared to MvCFT, although both methods employ the same features under an MVL framework. MvCFT only fuses tracking results from different views to provide a more precise prediction. By contrast, our approach benefits from a more reliable and computationally efficient training model. On the other hand, it is surprising to find that SAMF achieves desirable performance on both the TRE and SRE experiments. One possible reason is that the scale estimation method in SAMF, i.e., exhaustively searching a scaling pool, is more robust to scale variations (although time-consuming).
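The idea of exhaustively searching a scaling pool can be sketched as below. This is a generic illustration rather than SAMF's actual implementation: `search_scale_pool`, the pool values and the toy `peak_response` callable are hypothetical, and in a real tracker the response would come from evaluating the correlation filter on a patch resized to each candidate scale.

```python
def search_scale_pool(base_size, scale_pool, peak_response):
    """Try every scale in the pool and keep the best-responding one.

    base_size is (width, height) of the current target estimate.
    peak_response is a callable mapping a candidate (width, height) to
    the peak filter response at that size (hypothetical interface).
    Returns (scale, score, size) for the best candidate.
    """
    best = None
    for s in scale_pool:
        size = (base_size[0] * s, base_size[1] * s)
        score = peak_response(size)  # one filter evaluation per scale
        if best is None or score > best[1]:
            best = (s, score, size)
    return best


if __name__ == "__main__":
    # Toy response that peaks when the candidate width is close to 55.
    def toy_response(size):
        return -abs(size[0] - 55.0)

    scale, score, size = search_scale_pool((50.0, 100.0), (0.9, 1.0, 1.1),
                                           toy_response)
    print(scale, size)
```

The cost grows linearly with the pool size, since every candidate scale needs its own filter evaluation, which is consistent with the remark that this strategy is time-consuming.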
VI Conclusions and future work
In this paper, we proposed a kernel multi-frame multi-feature joint modeling tracker to improve on the original correlation filter-based trackers (CFTs) by exploiting multiple feature cues and temporal consistency in a unified framework. A fast block-wise diagonal matrix inversion algorithm has been developed to speed up learning and detection. An adaptive scale estimation mechanism was incorporated to handle scale variations. Experiments on OTB 2015 and VOT 2015 datasets show that