Visual tracking is one of the most fundamental problems in computer vision owing to its numerous applications, such as video surveillance, motion analysis, vehicle navigation and human-computer interaction. Although great progress has been made in developing algorithms [19, 39, 40, 46] and benchmark evaluations for visual tracking, it remains a challenging problem under heavy illumination changes, pose deformations, partial and full occlusions, large scale variations, background clutter and fast motion.
Correlation filters have recently attracted great attention owing to their computational speed and robust tracking performance [24, 27, 28, 29]. Bolme et al. proposed an adaptive correlation filter, called MOSSE, which produces ASEF-like filters from fewer training images. Henriques et al. extended correlation-filter-based trackers to kernel-based training with the CSK method, which exploits the circulant structure of a single image patch to perform dense sampling, and later improved it into the KCF tracker by using multi-channel inputs and HOG descriptors. Danelljan et al. developed the DSST method to handle scale changes of a target, and Choi et al. proposed a spatially attentional weight map to weight various correlation filters. Ma et al. used a correlation filter as a short-term tracker and an online random fern classifier for re-detection as a long-term memory system.
Although they achieve appealing precision and success rates, these correlation-filter-based trackers cannot deal well with fast motion and scale variations. For example, although two correlation-filter-based trackers, namely SCT and KCF, have achieved state-of-the-art results and have outperformed all other competing trackers in terms of accuracy on the OTB-2013 dataset, they fail to track a target object when partial occlusions or illumination changes occur.
To deal with the above issues, we propose a novel ensemble tracker, which first builds weak trackers by applying structural correlation filters and then integrates all weak trackers into one stronger tracker using a reliability evaluation and a multi-task Gaussian particle filter. Each weak tracker is treated as an expert, and the weights of all experts are computed from their confidence maps. Through the reliability evaluation, the weak trackers provide weak decisions from which the Gaussian particle filter makes the final decision. The tracking result in the current frame is inferred from the weights of the Gaussian particle filter. Each particle is evaluated with both HOG and gray-normalization features, so the Gaussian particle filter performs multi-task tracking to reach a strong decision.
The contributions of this paper are summarized as follows.
- A novel ensemble algorithm is proposed to combine structural correlation filters and a Gaussian particle filter into a single stronger tracker.
- A large-scale search scope algorithm for multiple structural correlation filters is proposed, which considers the spatial geometric relations between target locations in consecutive frames and thereby makes the search reliable.
- Extensive experiments on the OTB-2013 benchmark dataset, with 50 challenging sequences and 11 attributes, demonstrate that the proposed method outperforms 16 state-of-the-art trackers.
2 Related Work
Comprehensive reviews of visual tracking can be found in [43, 44, 36]. In this section, we discuss the methods most closely related to this work, namely structural correlation filters, Gaussian particle filters, ensemble trackers and the average peak-to-correlation energy (APCE).
2.1 Structural Correlation Filters
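Let $X^k$ denote the feature map extracted from the $k$-th convolutional layer, and let $Y$ be the corresponding Gaussian-shaped label. Let $\mathcal{X}^k = \mathcal{F}(X^k)$ and $\mathcal{Y} = \mathcal{F}(Y)$, where $\mathcal{F}(\cdot)$ represents the discrete Fourier transform (DFT). The objective function of the correlation filter method can be extended so that its $k$-th filter is modeled as

$$W^k = \arg\min_{W} \left\| \mathcal{Y} - \mathcal{X}^k \cdot W \right\|_F^2 + \lambda \left\| W \right\|_F^2, \qquad \mathcal{X}^k \cdot W = \sum_{d=1}^{D} \mathcal{X}^k_{*,*,d} \odot W_{*,*,d}. \tag{1}$$

Here, $D$ is the number of feature channels, $\lambda$ is the regularization weight, and the symbol $\odot$ is the element-wise product.
The optimization problem in Eq. 1 has a simple closed-form solution, which can be efficiently computed in the Fourier domain by

$$W^k_{*,*,d} = \frac{\mathcal{Y}}{\sum_{d'=1}^{D} \mathcal{X}^k_{*,*,d'} \odot \overline{\mathcal{X}^k_{*,*,d'}} + \lambda} \odot \overline{\mathcal{X}^k_{*,*,d}}, \tag{2}$$

where the bar denotes complex conjugation and the division is element-wise.
Given the testing data $T^k$ from the output of the $k$-th layer, we first transform it to the Fourier domain, $\mathcal{T}^k = \mathcal{F}(T^k)$, and then the responses can be computed by

$$S^k = \mathcal{F}^{-1}\left( \mathcal{T}^k \cdot W^k \right), \tag{3}$$

where $\mathcal{F}^{-1}$ denotes the inverse DFT.
The $k$-th weak tracker outputs the target position with the largest response:

$$(x^k, y^k) = \arg\max_{x', y'} S^k(x', y'). \tag{4}$$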
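The training and detection steps described above can be sketched in a few lines of NumPy. This is a minimal illustration of a multi-channel ridge-regression correlation filter solved in the Fourier domain, not the authors' implementation; the function names, the regularization value `lam` and the label width `sigma` are all assumptions made for the example.

```python
import numpy as np

def gaussian_label(height, width, sigma):
    """2-D Gaussian label whose peak is wrapped to index (0, 0)."""
    ys = np.arange(height) - height // 2
    xs = np.arange(width) - width // 2
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    label = np.exp(-(yy ** 2 + xx ** 2) / (2.0 * sigma ** 2))
    # Move the peak from the patch center to the origin, so that a zero
    # displacement corresponds to a response peak at (0, 0).
    return np.roll(label, (-(height // 2), -(width // 2)), axis=(0, 1))

def train_filter(features, label, lam=1e-4):
    """Closed-form filter for an (H, W, D) feature map, one solve per frequency."""
    X = np.fft.fft2(features, axes=(0, 1))
    Y = np.fft.fft2(label)
    energy = np.sum(X * np.conj(X), axis=2).real + lam  # summed over channels
    return np.conj(X) * (Y / energy)[:, :, None]

def detect(filt, features):
    """Return the response map and the (row, col) shift with the largest response."""
    T = np.fft.fft2(features, axes=(0, 1))
    response = np.real(np.fft.ifft2(np.sum(T * filt, axis=2)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return response, (int(dy), int(dx))
```

Because the DFT treats the patch as periodic, a cyclic shift of the training features moves the response peak by the same amount; shifts larger than half the patch size wrap around to negative displacements.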
2.2 Gaussian Particle Filters (GPF)
The dynamic state-space (DSS) model represents the time-varying dynamics of an unobserved state variable. GPF is based on particle filtering and Gaussian filtering concepts. Gaussian filters provide Gaussian approximations to the filtering and predictive distributions; they include the extended Kalman filter (EKF) and its variations [20, 31, 37, 42]. Unlike the EKF, which assumes that the predictive distributions are Gaussian and linearizes the functions in the process and observation equations, GPF updates the Gaussian approximations using particles. GPF propagates only the posterior mean and covariance of the unobserved state variable in a DSS model, and importance sampling keeps the procedure simple.
Particle filters [16, 32, 33] use sequential importance sampling (SIS) to update the posterior distributions. GPF resembles SIS filters in that importance sampling is used to obtain particles. SIS filters, however, suffer from sample degeneracy, in which only a few of the particles representing the distribution have significant weights. A procedure called resampling has been introduced to mitigate this problem, but it may give limited results and may be computationally expensive. Since GPF approximates posterior distributions as Gaussians, it does not require particle resampling, unlike SIS filters. This reduces the complexity of GPF compared with SIS with resampling and is a major advantage. Furthermore, Berzuini et al. reported that particle filters with resampling are biased because of the resampling step, and resampling in SIS filters is a nonparallel operation. Because GPF involves no resampling, it is more amenable than SIS to fully parallel implementation in very large scale integration (VLSI).
Reported simulation results demonstrate the versatility and improved performance of GPF over conventional Gaussian filters, as well as its lower complexity compared with known particle filters. Moreover, the parallelizability of GPF and the absence of resampling make it convenient for VLSI implementation and, hence, feasible for practical real-time applications.
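To make the "no resampling" property concrete, the following is a minimal sketch of one GPF measurement update. It assumes that the importance function is the predictive Gaussian itself, so that the weights reduce to the likelihood; `log_likelihood` is a hypothetical, vectorized user-supplied function and is not part of any library.

```python
import numpy as np

def gpf_measurement_update(mean, cov, log_likelihood, n_particles=500, rng=None):
    """One Gaussian particle filter measurement update (sketch).

    Particles are drawn from the predictive Gaussian N(mean, cov), weighted by
    the likelihood, and collapsed back into a Gaussian. No resampling is done.
    """
    rng = np.random.default_rng() if rng is None else rng
    particles = rng.multivariate_normal(mean, cov, size=n_particles)
    log_w = log_likelihood(particles)        # shape (n_particles,)
    weights = np.exp(log_w - log_w.max())    # subtract max for numerical stability
    weights /= weights.sum()
    new_mean = weights @ particles
    centered = particles - new_mean
    new_cov = (centered * weights[:, None]).T @ centered
    return new_mean, new_cov
```

Each particle is processed independently until the final weighted sums, which is why the update parallelizes easily. Only the mean and covariance are carried to the next time step.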
2.3 Ensemble Trackers
Multiple component trackers have been combined with hand-crafted features to develop ensemble tracking methods [1, 2, 41]. For example, several ensemble methods [1, 2] use a boosting framework to train each component weak tracker to separate foreground objects from the background. Wang and Yeung used a conditional particle filter to infer a target's position and the reliability of each component tracker. Qi et al. treated tracking as a decision-theoretic online learning task, in which the tracked target is inferred from the decisions of multiple expert trackers. Similar to Qi et al., we consider visual tracking as a decision-theoretic online learning task, and apply it to a structure of multiple correlation filters combined with a Gaussian particle filter. That is, in every round, each correlation filter makes a decision and the final decision is determined by a Gaussian particle filter.
2.4 Average Peak-to-Correlation Energy (APCE)
For correlation filters, the peak value denotes the maximum response score of a response map. To measure the fluctuation of a response map and the reliability of a detected object, Wang et al. proposed the average peak-to-correlation energy (APCE), which is defined as

$$\mathrm{APCE} = \frac{\left| F_{\max} - F_{\min} \right|^2}{\operatorname{mean}\left( \sum_{w,h} \left( F_{w,h} - F_{\min} \right)^2 \right)}, \tag{6}$$

where $F_{\max}$, $F_{\min}$ and $F_{w,h}$ denote the maximum, the minimum and the $w$-th row and $h$-th column element of the response map $F$. APCE indicates the degree of fluctuation of response maps and the confidence of the tracked results. When the target appears fully in the tracking region, the response map is smooth except for a single sharp peak, with little noise, and APCE is large. Conversely, APCE is small if the object is occluded or missing.
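As a small worked example, APCE can be computed directly from a response map. The sketch below follows the standard definition (squared peak-to-minimum range divided by the mean squared deviation from the minimum); returning 0 for a perfectly flat map is a convention assumed here and is not stated in the text.

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy of a 2-D response map."""
    response = np.asarray(response, dtype=float)
    f_max = response.max()
    f_min = response.min()
    energy = np.mean((response - f_min) ** 2)
    if energy == 0.0:
        return 0.0  # flat map: no peak at all (assumed convention)
    return (f_max - f_min) ** 2 / energy

# A single sharp peak scores higher than a map with two competing peaks.
single_peak = np.zeros((5, 5))
single_peak[2, 2] = 1.0
two_peaks = single_peak.copy()
two_peaks[0, 0] = 1.0
```

For these two toy maps the single-peak response yields twice the APCE of the two-peak response, which matches the intuition that a second, competing peak lowers confidence.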
3 Proposed Algorithm
In this section, we present the combination of structural correlation filters with a multi-task Gaussian particle filter for ensemble tracking, namely KCF-GPF. Unlike the KCF method [19, 35], which learns a single correlation filter in a fixed-size area, KCF-GPF constructs multiple weak correlation filters in a more reliable search scope to deal with fast motion and the boundary effects of conventional correlation filters. The Gaussian particle filter jointly learns particle weights from different features in a multi-task manner to form a stronger tracker. Furthermore, our tracker can effectively handle scale variations via the sampling strategy of the Gaussian particle filter. Overall, the proposed ensemble method achieves two goals: 1) weak expert trackers are tuned to separate a foreground object from the background, and 2) the ensemble as a whole ensures the temporal coherence of each part of the tracker.
3.1 Weak Structural Correlation Filter
A conventional correlation filter suffers from boundary effects during tracking, which interfere with detecting targets that undergo fast motion. To address this, we extend the conventional search window of a single tracker to a large-scale search scope for multiple trackers, and exploit spatial-geometric relations between target locations in consecutive frames to make the search scope reliable.
The feature map in Eq. 1 is extracted from an image patch which is sampled by a search sliding window (see Figure 2), where denotes the maximum historical-moving-distance of a tracked target in Eq. 17 at time and the object location in the previous frame is the center of the search scope. In addition, we use and to represent the horizontal and vertical moving distances at time , respectively.
Given the initial confidence weights of all weak experts, in the current round, a further decision is made based on the expert tracker with the greatest weight. In the visual tracking scenario, it is natural to treat each KCF tracker as an expert and then predict the target position at time by
where is the weight of expert and , denotes the absolute value and represents the signum function.
Kernel Selection: We choose the Gaussian kernel in the existing correlation filter tracker .
Feature Representation: Similar to , we use HOG features with 31 bins. However, our tracker is quite generic and any dense feature representation with arbitrary dimensions can be incorporated.
HDT uses CNN features, while SCF and KCF-GPF are based on HOG features.
The features of HDT are extracted from one layer to build a weak tracker, and the part-based correlation filter SCF samples several parts of a target object to construct features, while KCF-GPF samples via sliding window in a search scope which is subject to the maximum historic-moving-displacement of the tracked target over time .
3.2 Reliability Degree of Correlation Filter
The peak value and the fluctuation of the response map reveal, to some extent, the confidence of the tracking results. The ideal response map has only one sharp peak and is smooth in all other areas when the detected target closely matches the correct target. The sharper the correlation peak, the better the localization accuracy. Otherwise, the whole response map fluctuates intensely, and its pattern differs significantly from that of normal response maps. If we continued to use such uncertain samples to update the tracking model, the model would likely be corrupted.
Inspired by , we evaluate the stability of the -th expert with two criteria. The first criterion is the maximum response score of the response map defined as
The second criterion is the average peak-to-correlation energy (APCE) measure, which is defined in Eq. 6.
3.3 Multi-task Gaussian Particle Filter
The proposed multi-task Gaussian particle filter at time approximates the posterior mean and covariance of the unknown state variable using Bayesian importance sampling.
We draw samples from the importance function at time using
The respective weights are computed by
where the distribution represents the observation equation conditioned on the unknown state variable at time .
Then, we set , where denotes the absolute value and is the function of features, where represents the features of the template at time . Hence, each Gaussian particle weight can be calculated with
Normalize the weights as
In this paper, we adopt HOG for tackling deformation variations and gray normalization of raw pixel values from sample images for handling illumination changes, and the -th features are represented by and at time , respectively . Hence, we get the corresponding weights and .
For the multi-task Gaussian particle filter, we jointly define the similarities of respective samples as
where learning rate parameter is a constant value in this paper.
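One simple way to picture the joint similarity is as a convex combination of the two normalized weight vectors. The sketch below is an assumption about the general form rather than the exact rule used by KCF-GPF: which feature receives the factor `eta` is a guess, and the default 0.65 is the value Table 1 lists for the Eq. 14 learning rate.

```python
import numpy as np

def fuse_weights(w_hog, w_gray, eta=0.65):
    """Combine per-particle weights from two feature types (assumed convex form)."""
    w_hog = np.asarray(w_hog, dtype=float)
    w_gray = np.asarray(w_gray, dtype=float)
    w_hog = w_hog / w_hog.sum()      # normalize each task separately
    w_gray = w_gray / w_gray.sum()
    fused = eta * w_hog + (1.0 - eta) * w_gray
    return fused / fused.sum()       # keep the result a valid distribution
```

With this form, a particle must score well under both features to keep a large weight, so a failure of one feature (for example, gray values under an illumination change) is damped by the other.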
The mean and covariance are estimated by
where represents the Hermitian Matrix.
The maximum historical-moving-distance of the tracked target at time is defined as
Here, we set , and denotes the absolute-value function.
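As a rough illustration of how such a quantity can drive the search scope, the sketch below keeps a running maximum of the absolute horizontal and vertical displacements and pads the target box by that amount. Both helper functions and the exact update rule are hypothetical; the paper's own definition may differ.

```python
def update_max_displacement(prev_max, prev_pos, curr_pos):
    """Running maximum of the per-axis absolute displacement, in pixels."""
    dx = abs(curr_pos[0] - prev_pos[0])  # horizontal moving distance
    dy = abs(curr_pos[1] - prev_pos[1])  # vertical moving distance
    return max(prev_max, dx, dy)

def search_scope(center, max_disp, target_size):
    """Axis-aligned scope (x0, y0, x1, y1) centered on the previous location."""
    cx, cy = center
    w, h = target_size
    half_w = w / 2 + max_disp
    half_h = h / 2 + max_disp
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# Example track: the largest single-frame jump so far is 7 pixels.
track = [(10, 10), (13, 11), (12, 18)]
d_max = 0
for prev, curr in zip(track, track[1:]):
    d_max = update_max_displacement(d_max, prev, curr)
```

A scope built this way grows only as far as the target has actually moved in the past, which keeps the search area small for slow targets and wide for fast ones.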
3.4 KCF-GPF Tracker
Figure 1 illustrates the flowchart of the proposed algorithm. Based on the structural correlation filter and a Gaussian particle filter, we propose the KCF-GPF tracker. The first step generates weak correlation filters. The second step makes weak decisions via the weak structural correlation filters. The third step evaluates the reliability degrees of the weak decisions. Finally, the optimal decision is made using a multi-task Gaussian particle filter.
To update the KCF-GPF for visual tracking, we adopt an incremental strategy in the current frame to update the template in Eq. 12 by
where learning rate parameter is a constant value in this paper.
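An incremental template update of this kind is commonly written as a linear interpolation between the old template and the features observed at the new target location. The sketch below assumes that form; which term is scaled by the learning rate is an assumption, and the default 0.8 is the value Table 1 lists for the Eq. 18 learning rate.

```python
import numpy as np

def update_template(template, observation, eta=0.8):
    """Blend the previous template with the newly observed features (assumed form)."""
    template = np.asarray(template, dtype=float)
    observation = np.asarray(observation, dtype=float)
    return (1.0 - eta) * template + eta * observation
```

Under this convention a larger `eta` makes the template follow appearance changes faster, at the cost of absorbing errors more quickly when the tracker drifts.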
An overview of the proposed method is summarized in Algorithm 1.
4.1 Experimental Setups
Implementation Details. The features used for KCF-GPF are HOG features and gray-normalization features. Our tracker is implemented in MATLAB on a PC with a 2.80 GHz CPU and runs faster than 21 FPS (see Table 2). Our tracker requires few parameter settings, which are reported in Table 1, where 'padding' (as in KCF) denotes the magnification of the sampled image region relative to the target bounding box.
|Part of Track|Parameters|Values|
||Learning rate (Eq. 14)|0.65|
||Learning rate (Eq. 18)|0.8|
Datasets. Our method is evaluated on the OTB-2013 dataset, which consists of 50 sequences. The images are annotated with ground-truth bounding boxes and 11 visual attributes: scale variation, out of view, out-of-plane rotation, low resolution, in-plane rotation, illumination, motion blur, background clutter, occlusion, deformation, and fast motion.
We compare the proposed method with 16 state-of-the-art tracking methods using the evaluation metrics and code provided by the benchmark dataset. For testing on OTB-2013, we employ the one-pass evaluation (OPE) and use two metrics: precision and success plots. The precision metric computes the fraction of frames whose predicted center location is within a given distance of the ground-truth location. The success metric computes the overlap ratio between the tracked and ground-truth bounding boxes. In the legend, we report the area under the curve (AUC) of the success plot and the precision score at a 20-pixel threshold (PS), both under the one-pass evaluation, for each tracking method.
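The two metrics can be sketched as follows for boxes given as `(x, y, width, height)` rows. This is an illustrative re-implementation, not the benchmark's official code; the 21 evenly spaced overlap thresholds and the AUC computed as the mean success rate over them are assumptions made for the example.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers, one value per frame."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    pred_c = pred[:, :2] + pred[:, 2:] / 2
    gt_c = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(pred_c - gt_c, axis=1)

def overlap_ratio(pred, gt):
    """Intersection over union of predicted and ground-truth boxes."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_score(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    return float(np.mean(center_error(pred, gt) <= threshold))

def success_auc(pred, gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """Mean success rate over the overlap thresholds (approximate AUC)."""
    iou = overlap_ratio(pred, gt)
    return float(np.mean([np.mean(iou > t) for t in thresholds]))

# Two frames: one perfect prediction and one complete miss.
gt_boxes = [[0, 0, 10, 10], [0, 0, 10, 10]]
pred_boxes = [[0, 0, 10, 10], [100, 100, 10, 10]]
```

In this toy case half of the frames are tracked correctly, so the precision score is 0.5 and the success rate is 0.5 at every overlap threshold below 1.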
4.2 Comparison with State-of-the-Art
We evaluate KCF-GPF on the OTB-2013 dataset and compare it with 16 state-of-the-art trackers including LMCF , CFNet , CFN , CFN , CNT , BIT , SINT , SCT , Staple , SiamFC , SRDCF , DSST , MEEM , KCF , TLD  and Struck . Among them, LMCF, CFN, CFN , Staple, SRDCF, KCF, DSST, CFNet and SCT are CF-based algorithms; SiamFC, SINT, CFNet, CNT and BIT are convolutional-network-based algorithms; MEEM is based on regression and multiple trackers; TLD is based on an ensemble classifier; and Struck is a structured-SVM-based method.
The characteristics and tracking results are summarized in Table 2. The mean speed of KCF-GPF, estimated over all sequences in OTB-2013, is 21.3 FPS, which satisfies the requirement of real-time capability. LMCF achieves the second-best performance in terms of the success metric, and SINT shows the second-best performance in terms of the precision metric. Figure 5 illustrates the precision and success plots of all trackers under all challenging attributes in OTB-2013. KCF-GPF is also superior to the other recent trackers in terms of the precision and success metrics on the OTB-2013 benchmark.
For a detailed analysis, we also compare KCF-GPF with these state-of-the-art trackers on the individual challenging attributes of the OTB-2013 benchmark dataset; the results are shown in Figure 3 and Figure 4. The results demonstrate that KCF-GPF ranks in the top three for each attribute and achieves the best performance in the overall success plots. In addition, the proposed method outperforms the other trackers on the deformation and out-of-plane rotation attributes.
4.3 Qualitative Comparison
To demonstrate the effect of the proposed KCF-GPF tracking algorithm, we make a qualitative comparison with the above state-of-the-art trackers on the OTB-2013 benchmark dataset with 11 different attributes. As shown in Figure 6, all trackers perform well overall, but the existing trackers have the following drawbacks. SCT does not perform well under scale variations (Liquor, Woman and Dog1). CFNet cannot handle occlusion (Lemming, Skating1, Subway, Singer2, Suv, Liquor, Woman and Soccer), deformation (Skating1, Subway, Singer2, Suv and Woman), out-of-plane rotation (Lemming, Skating1, Singer2, Liquor, Woman and Soccer) or background clutter (Skating1, Subway, Singer2, Suv, Liquor and Soccer). KCF drifts when there are illumination variations (Shaking, Lemming and Woman), scale variations (Shaking, Lemming, Woman and Dog1), out-of-plane rotations (Shaking, Lemming, Woman and Soccer) and fast motions (Woman and Soccer). The TLD and Struck methods drift when target objects undergo illumination changes (Shaking, Skating1, Singer2 and Soccer), heavy occlusion (Lemming, Subway, Singer2, Suv, Liquor, Woman and Soccer) and scale variations (Lemming and Dog1). Overall, the proposed KCF-GPF tracker performs best among the compared trackers on these challenging sequences.
In this paper, we have proposed a novel tracker, namely KCF-GPF, which combines multiple structural correlation filters with a multi-task Gaussian particle filter to construct a strong tracker for ensemble tracking. The proposed method takes multiple correlation filters as weak expert trackers, and exploits spatial-geometric relations between target locations in consecutive frames to provide weak decisions in a reliable search scope. The reliability degrees of the weak decisions are introduced so that the GPF can make a strong decision. As a result, the tracker not only retains the advantages of existing correlation filter trackers, such as computational efficiency and robustness, but can also deal with scale variations through the sampling strategy of the GPF. The proposed KCF-GPF tracking algorithm outperforms 16 state-of-the-art methods over all 50 sequences in the OTB-2013 benchmark in terms of qualitative and quantitative evaluations.
- [1] S. Avidan. Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):261–271, 2007.
- [2] Q. Bai, Z. Wu, S. Sclaroff, M. Betke, and C. Monnier. Randomized ensemble tracking. In IEEE International Conference on Computer Vision, pages 2040–2047, 2013.
- [3] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr. Staple: Complementary learners for real-time tracking. 38(2):1401–1409, 2015.
- [4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-convolutional siamese networks for object tracking. pages 850–865, 2016.
- [5] C. Berzuini, N. G. Best, W. R. Gilks, and C. Larizza. Dynamic conditional independence models and Markov chain Monte Carlo methods. Journal of the American Statistical Association, 92(440):1403–1412, 1997.
- [6] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In Computer Vision and Pattern Recognition, pages 2544–2550, 2010.
- [7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
- [8] B. Cai, X. Xu, X. Xing, K. Jia, J. Miao, and D. Tao. BIT: Biologically inspired tracker. IEEE Transactions on Image Processing, 25(3):1327–1339, 2016.
- [9] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. Computer Science, pages 297–305, 2009.
- [10] J. Choi, H. J. Chang, J. Jeong, Y. Demiris, and Y. C. Jin. Visual tracking using attention-modulated disintegration and integration. In Computer Vision and Pattern Recognition, pages 4321–4330, 2016.
- [11] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and Y. C. Jin. Attentional correlation filter network for adaptive visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [12] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, pages 65.1–65.11, 2014.
- [13] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In IEEE International Conference on Computer Vision, pages 4310–4318, 2015.
- [14] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8):1561–1575, 2016.
- [15] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37, 1995.
- [16] F. Gustafsson, F. Gunnarsson, N. Bergman, U. Forssell, J. Jansson, R. Karlsson, and P. J. Nordlund. Particle filters for positioning, navigation, and tracking. IEEE Transactions on Signal Processing, 50(2):425–437, 2002.
- [17] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured output tracking with kernels. In IEEE International Conference on Computer Vision, pages 263–270, 2012.
- [18] P. J. Harrison and C. F. Stevens. Bayesian forecasting. discussion. Journal of the Royal Statistical Society, 38, 1976.
- [19] J. F. Henriques, C. Rui, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
- [20] A. H. Jazwinski. Stochastic processes and filtering theory. Academic Press, 1970.
- [21] S. J. Julier and J. K. Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004.
- [22] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409, 2012.
- [23] J. H. Kotecha and P. M. Djuric. Gaussian particle filtering. IEEE Transactions on Signal Processing, 51(10):2592–2601, 2003.
- [24] Y. Li, J. Zhu, and S. C. H. Hoi. Reliable patch trackers: Robust visual tracking by exploiting reliable patches. In Computer Vision and Pattern Recognition, pages 353–361, 2015.
- [25] J. S. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93(443):1032–1044, 1998.
- [26] J. S. Liu, R. Chen, and T. Logvinenko. A Theoretical Framework for Sequential Importance Sampling with Resampling. Springer New York, 2001.
- [27] S. Liu, T. Zhang, X. Cao, and C. Xu. Structural correlation filter for robust visual tracking. In Computer Vision and Pattern Recognition, pages 4312–4320, 2016.
- [28] T. Liu, G. Wang, and Q. Yang. Real-time part-based visual tracking via adaptive correlation filters. In Computer Vision and Pattern Recognition, pages 4902–4912, 2015.
- [29] C. Ma, J. B. Huang, X. Yang, and M. H. Yang. Hierarchical convolutional features for visual tracking. In IEEE International Conference on Computer Vision, pages 3074–3082, 2015.
- [30] C. Ma, X. Yang, C. Zhang, and M. H. Yang. Long-term correlation tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5388–5396, 2015.
- [31] J. Mendel. Optimal filtering. IEEE Transactions on Automatic Control, 25(3):615–616, 1980.
- [32] R. V. D. Merwe, A. Doucet, N. D. Freitas, and E. Wan. The unscented particle filter. Advances in Neural Information Processing Systems, 13:584–590, 2001.
- [33] K. Nummiaro, E. Koller-Meier, and L. V. Gool. An adaptive color-based particle filter. Image and Vision Computing, 21(1):99–110, 2003.
- [34] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M. H. Yang. Hedged deep tracking. In Computer Vision and Pattern Recognition, pages 4303–4311, 2016.
- [35] C. Rui, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In European Conference on Computer Vision, pages 702–715, 2012.
- [36] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1442–1468, 2014.
- [37] H. W. Sorenson. Recursive estimation for nonlinear dynamic systems. Bayesian Analysis of Time, 1988.
- [38] R. Tao, E. Gavves, and A. W. M. Smeulders. Siamese instance search for tracking. In Computer Vision and Pattern Recognition, pages 1420–1429, 2016.
- [39] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for correlation filter based tracking. 2017.
- [40] M. Wang, Y. Liu, and Z. Huang. Large margin object tracking with circulant feature maps. 2017.
- [41] N. Wang and D. Y. Yeung. Ensemble-based tracking: aggregating crowdsourced structured time series data. In International Conference on Machine Learning, pages II–1107, 2014.
- [42] G. Welch and G. Bishop. An introduction to the Kalman filter. University of North Carolina at Chapel Hill, 8(7):127–132, 2010.
- [43] Y. Wu, J. Lim, and M. H. Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2411–2418, 2013.
- [44] A. Yilmaz. Object tracking: A survey. ACM Computing Surveys, 38(4):13, 2006.
- [45] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In European Conference on Computer Vision, pages 188–203, 2014.
- [46] K. Zhang, Q. Liu, Y. Wu, and M. H. Yang. Robust visual tracking via convolutional networks without training. IEEE Transactions on Image Processing, 25(4):1779, 2016.