Visual tracking plays an important role in computer vision and has many applications such as video surveillance, robotics, motion analysis and human-computer interaction. Even though various algorithms have been proposed, it remains a challenging problem due to complex object motion, heavy occlusion, illumination change and background clutter.
Visual tracking algorithms can be roughly categorized into two major categories: discriminative methods and generative methods. Discriminative methods (e.g., Liu et al. (2009); Babenko et al. (2009); Hare et al. (2011)) view object tracking as a binary classification problem in which the goal is to separate the target object from the background. Generative methods (e.g., Jepson et al. (2003); Ross et al. (2008); Liu et al. (2014b); Zhang et al. (2014); Liu et al. (2014a)) employ a generative appearance model to represent the target’s appearance.
We focus on the generative methods and briefly review the relevant work below. Recently, sparse representation has been successfully applied to visual tracking (e.g., Mei and Ling (2009); Liu et al. (2010); Zhang et al. (2013); Jin et al. (2014)). Trackers based on sparse representation assume that the appearance of a tracked object can be sparsely represented by an over-complete dictionary, which can be dynamically updated to maintain holistic appearance information. Traditionally, the over-complete dictionary is a series of redundant object templates. However, a set of basis vectors from the target subspace is also used as the dictionary, because an orthogonal dictionary performs as efficiently as the redundant one. In visual tracking, we call the $\ell_1$ regularized object representation "sparse coding" (e.g., Mei and Ling (2009)), and the $\ell_0$ regularized object representation "sparse counting" (e.g., Pan et al. (2013)). The tracker of Mei and Ling (2009) has been shown to be robust against partial occlusions, which improves the tracking performance. However, because it uses a redundant dictionary, the heavy computational overhead of the $\ell_1$ minimization hampers the tracking speed. Recent efforts have improved this method in terms of both speed and accuracy by using the accelerated proximal gradient (APG) algorithm Bao et al. (2012) or by modeling the similarity between different candidates Zhang et al. (2013). Different from Mei and Ling (2009), IVT Ross et al. (2008) incrementally learns a low-dimensional PCA subspace representation, which adapts online to the appearance changes of the target. To suppress image noise, Wang et al. (2013b) introduce $\ell_1$ noise regularization into the PCA reconstruction, which is able to handle partial occlusion and other challenging factors. Pan et al. (2013) employ the $\ell_0$ norm to regularize the linear coefficients of an incrementally updated linear basis (sparse counting) to remove the redundant features of the basis vectors. However, sparse counting causes unstable solutions because of its nonconvexity and discontinuity. Although sparse coding performs well, it may cause biased estimation since it penalizes truly large coefficients more, and thus produces over-penalization. Consequently, it is necessary to find a way to overcome the disadvantages of sparse coding and sparse counting.
From the viewpoint of statistics, sparse representation is similar to variable selection when the dictionary is fixed. Moreover, the Bayesian framework has been successfully applied to variable selection by enforcing appropriate priors. Laplace priors were used to avoid overfitting and enforce sparsity in the sparse linear model, which leads to the sparse coding problem. To further enforce sparsity and reduce the over-penalization of sparse coding, each coefficient is assigned a Bernoulli variable. Therefore, a novel model is proposed and interpreted from a Bayesian perspective via maximum a posteriori (MAP) estimation, which turns out to be a combination of the sparse coding and sparse counting models. Lu et al. (2013) also consider the $\ell_0$ and $\ell_1$ norms from a Bayesian perspective. However, considering that there will be occlusion, illumination change and background clutter in tracking, we restrain the noise with the $\ell_1$ norm. Besides, we use an orthogonal dictionary to replace the redundant object templates, as similar atoms in redundant templates may cause errors in the coefficients and a large computational cost. Lastly, we propose a closed-form solution for the regularization that combines the $\ell_0$ norm and the $\ell_1$ norm, whereas Lu et al. obtain an approximate solution by using greedy coordinate descent.
Tracking results obtained with unconstrained regularization, sparse counting, sparse coding and our model under the same dictionary are shown in Fig. 1. As shown in Fig. 1, the coefficients of unconstrained regularization and sparse coding are not actually sparse, and the target object is not tracked well. Similarly, sparse counting with sparse coefficients sometimes cannot obtain an appropriate linear combination of the orthogonal basis vectors, which interferes with the tracking accuracy. In contrast, our method is able to reconstruct the object well and find a good candidate, which improves the tracking results. We also compare our model with unconstrained regularization, sparse counting and sparse coding over all 50 sequences in the benchmark; the precision and success plots are shown in Fig. 2. The parameter settings are given in the Experimental Results section.
The contributions of this work are threefold.
(1) We propose a sparse coding and counting model from a novel Bayesian perspective for visual tracking. Compared to the state-of-the-art algorithms, the proposed method achieves more reliable tracking results.
(2) We propose a closed-form solution for the regularization that combines the $\ell_0$ norm and the $\ell_1$ norm in a single model.
(3) Although the minimization related to sparse coding and counting is an NP-hard problem, we show that the proposed model can be efficiently estimated by the proposed APG method. This makes our tracking method computationally attractive in general and comparable in speed with the SP method Wang et al. (2013b) and the accelerated tracker Bao et al. (2012).
Visual Tracking based on the Particle Filter
In this paper, we employ a particle filter to track the target object. The particle filter provides an estimate of the posterior distribution of random variables related to a Markov chain. Given a set of observed image vectors $Y_t = \{y_1, \dots, y_t\}$ up to the $t$-th frame and the target state variable $x_t$ that describes the six affine motion parameters, the posterior distribution based on the Bayesian theorem is estimated by:
$$p(x_t \mid Y_t) \propto p(y_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Y_{t-1})\, dx_{t-1},$$
where $p(y_t \mid x_t)$ is the observation model that estimates the likelihood of an observed image patch $y_t$ belonging to the object class, and $p(x_t \mid x_{t-1})$ is the motion model that describes the state transition between consecutive frames.
The Motion Model: The motion model $p(x_t \mid x_{t-1}) = \mathcal{N}(x_t; x_{t-1}, \Sigma)$ models the parameters by independent Gaussian distributions around their counterparts in $x_{t-1}$, where $\Sigma$ is a diagonal covariance matrix whose elements are the variances of the affine parameters. In the tracking framework, the optimal target state $\hat{x}_t$ is obtained by the maximum a posteriori (MAP) probability: $\hat{x}_t = \arg\max_{x_t^i} p(y_t^i \mid x_t^i)\, p(x_t^i \mid x_{t-1})$, where $x_t^i$ indicates the $i$-th sample of the state $x_t$.
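The sampling and selection steps of the motion model can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example state vector, the standard deviations and the toy likelihood are all hypothetical, and the sketch assumes that, because particles are drawn from the motion model itself, the best state can be chosen by the observation likelihood alone.

```python
import numpy as np

def propagate_particles(prev_state, sigma, n_particles, rng):
    """Sample candidate states x_t^i ~ N(prev_state, diag(sigma**2))."""
    prev_state = np.asarray(prev_state, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    noise = rng.standard_normal((n_particles, prev_state.size)) * sigma
    return prev_state + noise

def select_state(particles, likelihoods):
    """Pick the particle with the highest observation likelihood."""
    return particles[int(np.argmax(likelihoods))]

rng = np.random.default_rng(0)
prev = np.array([100.0, 80.0, 1.0, 0.0, 1.0, 0.0])   # hypothetical six affine parameters
sigma = np.array([4.0, 4.0, 0.01, 0.0, 0.005, 0.0])  # hypothetical standard deviations
particles = propagate_particles(prev, sigma, 600, rng)

# Toy likelihood standing in for the observation model p(y | x).
toy_likelihood = np.exp(
    -0.01 * np.sum((particles[:, :2] - [103.0, 78.0]) ** 2, axis=1)
)
best = select_state(particles, toy_likelihood)
```

In a real tracker the toy likelihood would be replaced by the observation model evaluated on the image patch cropped at each candidate state.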
The observation model: In this paper, we assume that the tracked target object is generated by a subspace (spanned by $D$ and centered at $\mu$) with corruption (i.i.d. Gaussian-Laplacian noise),
$$y = D\alpha + e + \varepsilon,$$
where $y$ denotes an observation vector centered at $\mu$, the columns of $D$ are orthogonal basis vectors of the subspace, $\alpha$ indicates the coefficients of the basis vectors, and $\varepsilon$ and $e$ stand for the Gaussian noise and Laplacian noise vectors, respectively. The Gaussian component models small dense noise and the Laplacian one aims to handle outliers. As proposed by Wang et al. (2013a), under the i.i.d. Gaussian-Laplacian noise assumption, the distance between the vector $y$ and the subspace is the least soft-threshold squares distance:
$$d(y; D) = \min_{\alpha, e}\ \tfrac{1}{2}\|y - D\alpha - e\|_2^2 + \beta\|e\|_1.$$
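Thus, for each observation $y$ corresponding to a predicted state $x$, the observation model $p(y \mid x)$ is set to be
$$p(y \mid x) = \exp\!\left(-\tau\left(\tfrac{1}{2}\|y - D\alpha^* - e^*\|_2^2 + \beta\|e^*\|_1\right)\right),$$
where $\alpha^*$ and $e^*$ are the optimal solution of Eq. (5), which will be introduced in detail in the next section, and $\tau$ is a constant controlling the shape of the Gaussian kernel.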
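As a rough illustration of how such a distance and likelihood can be evaluated, the sketch below assumes the least soft-threshold squares objective has the form 0.5*||y - D a - e||^2 + beta*||e||_1 with an orthonormal dictionary, and minimizes it by simple alternating updates. The names `lss_distance`, `beta` and `tau` are assumptions made for this example, and the alternating scheme is a stand-in, not the solver used in the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise soft thresholding: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lss_distance(y, D, beta, n_iter=50):
    """Alternately minimize 0.5*||y - D a - e||^2 + beta*||e||_1.

    D is assumed to have orthonormal columns, so the least-squares
    coefficients for fixed e are simply D.T @ (y - e).
    """
    e = np.zeros_like(y)
    for _ in range(n_iter):
        a = D.T @ (y - e)                    # coefficient update
        e = soft_threshold(y - D @ a, beta)  # outlier update
    r = y - D @ a - e
    return 0.5 * r @ r + beta * np.abs(e).sum(), a, e

def likelihood(distance, tau):
    """Gaussian-kernel style likelihood exp(-tau * distance)."""
    return float(np.exp(-tau * distance))

D = np.eye(5)[:, :2]                       # toy orthonormal basis in R^5
clean = np.array([1.0, 2.0, 0.0, 0.0, 0.0])
occluded = np.array([1.0, 2.0, 0.0, 5.0, 0.0])
d_clean, _, _ = lss_distance(clean, D, beta=0.1)
d_occ, a_occ, e_occ = lss_distance(occluded, D, beta=0.1)
```

In this toy case the outlier in the fourth pixel is absorbed by `e_occ`, so the coefficients `a_occ` are the same as for the clean observation.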
Model Update: It is essential to update the observation model to handle appearance changes of the target in visual tracking. Since the error term $e$ can be used to identify some outliers (e.g., Laplacian noise, illumination), we adopt the strategy proposed by Wang et al. (2013a) to update the appearance model using the incremental PCA with mean update Ross et al. (2008) as follows,
$$y_i^r = \begin{cases} y_i, & e_i = 0, \\ \mu_i, & \text{otherwise}, \end{cases}$$
where $y_i^r$, $y_i$, and $e_i$ are the $i$-th elements of $y^r$, $y$, and $e$, respectively, and $\mu$ is the mean vector computed in the same way as Ross et al. (2008).
Object Representation under Bayesian Framework
Based on the discussion in the previous section, if $y$ is viewed as the vectorized target region, it can be represented by an image subspace with corruption,
$$y = D\alpha + e + \varepsilon. \tag{4}$$
Pan et al. (2013) show that sparse counting can remove redundant features (e.g., background portions) while selecting useful parts in the subspace. However, sparse counting causes unstable solutions because of its nonconvexity and discontinuity. Sparse coding may produce over-penalization, despite its good stability. Considering that the Bayesian framework has the capacity to encode prior knowledge and to make valid estimates of uncertainty, a novel model combining sparse coding and sparse counting is proposed for visual tracking. The model is
$$\min_{\alpha, e}\ \tfrac{1}{2}\|y - D\alpha - e\|_2^2 + \lambda\|\alpha\|_0 + \gamma\|\alpha\|_1 + \beta\|e\|_1, \quad \text{s.t. } D^{\top}D = I, \tag{5}$$
where $\|\cdot\|_0$ denotes the $\ell_0$ norm which counts the number of non-zero elements, $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the $\ell_1$ and $\ell_2$ norms, respectively, $\lambda$, $\gamma$ and $\beta$ are regularization parameters, and $I$ is an identity matrix. The term $\|e\|_1$ is used to reject outliers (e.g., occlusions), while $\|\alpha\|_0$ and $\|\alpha\|_1$ are used to select the useful subspace features.
Next we introduce the aforementioned model under the Bayesian framework in detail. The joint posterior distribution of $\alpha$, $s$, $e$ and $\sigma^2$ based on the Bayesian theorem can be written as
$$p(\alpha, s, e, \sigma^2 \mid y) \propto p(y \mid \alpha, s, e, \sigma^2)\, p(\alpha)\, p(s)\, p(e)\, p(\sigma^2), \tag{6}$$
where the five factors denote the priors on the noisy vectorized target region, the coefficient vector $\alpha$, the index vector $s$ ($s_i \in \{0, 1\}$), the Laplacian noise $e$, and the noise level $\sigma^2$, respectively. Each prior in Eq. (6) is governed by its own constant parameters.
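With the definition of the index variable $s$, Eq. (4) can be rewritten as
$$y = D(s \odot \alpha) + e + \varepsilon,$$
where $\odot$ denotes the element-wise product. We generally assume that the noise $\varepsilon$ follows the Gaussian distribution, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. We treat the Laplacian noise term $e$ as missing values with the same Laplacian prior. Therefore, the prior $p(y \mid \alpha, s, e, \sigma^2)$ has the following distribution:
$$p(y \mid \alpha, s, e, \sigma^2) = \mathcal{N}\!\left(y;\ D(s \odot \alpha) + e,\ \sigma^2 I\right).$$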
To enforce sparsity, the coefficients $\alpha$ are assumed to follow a Laplace distribution.
Our goal is to remove redundant features while preserving the useful parts in the dictionary. As the sparse coding that results from Laplace priors may lead to over-penalization of the large coefficients, we assume the index variable $s_i$ of each coefficient to be a Bernoulli variable to enforce sparsity and reduce over-penalization:
$$p(s) = \prod_i \kappa^{s_i}(1 - \kappa)^{1 - s_i},$$
where $0 \le \kappa \le 1$. Here, the Bernoulli prior on $s_i$ means that $s_i$ has probability $\kappa$ to be 1 and $1 - \kappa$ to be 0, if the prior information is known.
The noise $e$ aims at handling outliers, so it follows a Laplace distribution.
The variances of the noises are assigned an Inverse Gamma prior as follows:
$$p(\sigma^2) = \frac{b^{a}}{\Gamma(a)}\,(\sigma^2)^{-a-1}\exp\!\left(-\frac{b}{\sigma^2}\right),$$
where $a$ and $b$ are the shape and scale parameters and $\Gamma(\cdot)$ denotes the gamma function.
Then, the optimal are obtained by the MAP probability. After taking the negative logarithm, the formula is
With fixing , Eq. (14) can be rewritten as
where . With and , Eq. (15) can be rewritten as
By observing the objective function in Eq. (16), it can be seen that the essential regularization in Eq. (16) is a combination of sparse coding and sparse counting. With a fixed appropriate orthogonal dictionary D, Eq. (16) can be written as the optimization problem in Eq. (5).
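To make the link between the priors and the combined regularizer concrete, the following is a hedged worked sketch rather than the paper's exact derivation. It assumes an i.i.d. Laplace prior $p(\alpha_i) \propto \exp(-|\alpha_i|/b)$, a Bernoulli prior with success probability $\kappa < 1/2$ on each index $s_i$, a fixed noise level $\sigma^2$, and a fixed outlier term $e$. The symbols $b$ and $\kappa$ are illustrative names.

```latex
\begin{aligned}
-\log p(\alpha, s \mid y)
  &= \frac{1}{2\sigma^2}\,\| y - D(s \odot \alpha) - e \|_2^2
     + \frac{1}{b}\,\|\alpha\|_1
     - \sum_i \bigl[ s_i \log\kappa + (1 - s_i)\log(1-\kappa) \bigr]
     + \text{const} \\
  &= \frac{1}{2\sigma^2}\,\| y - D(s \odot \alpha) - e \|_2^2
     + \frac{1}{b}\,\|\alpha\|_1
     + \log\frac{1-\kappa}{\kappa}\,\sum_i s_i
     + \text{const}.
\end{aligned}
```

Since $s_i \in \{0, 1\}$, the sum $\sum_i s_i$ equals $\|s\|_0$. If $s_i = 1$ exactly when $\alpha_i \neq 0$, multiplying by $\sigma^2$ gives a squared-error term plus an $\ell_1$ penalty with weight $\sigma^2/b$ and an $\ell_0$ penalty with weight $\sigma^2 \log\frac{1-\kappa}{\kappa}$, which is positive when $\kappa < 1/2$. This is the sparse coding plus sparse counting structure described above.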
Theory of Fast Numerical Algorithm
APG is an effective algorithm for convex programming Lin et al. (2009); Tseng (2008) and has been used in visual tracking. In this section, we propose a fast numerical algorithm for solving the proposed nonconvex and nonsmooth model by using the APG approach. The experimental results show that it converges to a solution quickly and achieves attractive performance. Besides, the closed-form solution of the combined $\ell_0$ and $\ell_1$ based regularization is provided.
APG Algorithm for Solving Eq. (17)
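Eq. (5) contains two subproblems: one is solving $\alpha$ given fixed $e$, and the other is solving $e$ given fixed $\alpha$. The formulas are as follows:
$$\begin{aligned} \alpha^{k+1} &= \arg\min_{\alpha}\ \tfrac{1}{2}\|y - D\alpha - e^{k}\|_2^2 + \lambda\|\alpha\|_0 + \gamma\|\alpha\|_1, \\ e^{k+1} &= \arg\min_{e}\ \tfrac{1}{2}\|y - D\alpha^{k+1} - e\|_2^2 + \beta\|e\|_1. \end{aligned} \tag{17}$$
Solving Eq. (17) is an NP-hard problem because it involves a discrete counting metric. We adopt a special optimization strategy based on the APG approach Lin et al. (2009), which ensures that each step can be solved easily. In the APG algorithm, we need to solve
$$\begin{aligned} \alpha^{k+1} &= \arg\min_{\alpha}\ \tfrac{L}{2}\|\alpha - g_{\alpha}^{k}\|_2^2 + \lambda\|\alpha\|_0 + \gamma\|\alpha\|_1, \\ e^{k+1} &= \arg\min_{e}\ \tfrac{L}{2}\|e - g_{e}^{k}\|_2^2 + \beta\|e\|_1, \end{aligned} \tag{18}$$
where $g_{\alpha}^{k} = \tilde{\alpha}^{k} + \tfrac{1}{L} D^{\top}(y - D\tilde{\alpha}^{k} - \tilde{e}^{k})$, $g_{e}^{k} = \tilde{e}^{k} + \tfrac{1}{L}(y - D\tilde{\alpha}^{k} - \tilde{e}^{k})$, $(\tilde{\alpha}^{k}, \tilde{e}^{k})$ is the extrapolated point of the APG iteration, and $L$ is a Lipschitz constant.
The solutions of Eq. (18) can be obtained by
$$\alpha^{k+1} = T_{\lambda/L,\,\gamma/L}(g_{\alpha}^{k}), \qquad e^{k+1} = S_{\beta/L}(g_{e}^{k}), \tag{19}$$
where $T$ applies, element-wise, the closed-form solution of the combined $\ell_0$ and $\ell_1$ regularization given by the Lemma in the next subsection, and the soft-thresholding operator $S_{\tau}$ is defined as
$$S_{\tau}(x) = \operatorname{sign}(x)\max(|x| - \tau, 0). \tag{20}$$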
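The iteration described in this subsection can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the objective 0.5*||y - D a - e||^2 + lam*||a||_0 + gam*||a||_1 + beta*||e||_1 with orthonormal D, uses the standard FISTA-style momentum sequence for the extrapolation, and takes L = 2, the value reported in the experiments. The names `apg_solve`, `lam`, `gam` and `beta` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l0_l1_threshold(v, lam, gam):
    """Closed-form minimizer of 0.5*(x - v)^2 + lam*|x|_0 + gam*|x|."""
    keep = np.abs(v) > gam + np.sqrt(2.0 * lam)
    return np.where(keep, soft_threshold(v, gam), 0.0)

def apg_solve(y, D, lam, gam, beta, L=2.0, n_iter=200):
    """APG-style iteration for the combined coding and counting model."""
    a = a_prev = np.zeros(D.shape[1])
    e = e_prev = np.zeros_like(y)
    t_prev, t = 1.0, 1.0
    for _ in range(n_iter):
        w = (t_prev - 1.0) / t                 # momentum weight
        za = a + w * (a - a_prev)              # extrapolated coefficients
        ze = e + w * (e - e_prev)              # extrapolated outliers
        r = y - D @ za - ze                    # residual at extrapolated point
        ga = za + D.T @ r / L                  # gradient step in a
        ge = ze + r / L                        # gradient step in e
        a_prev, e_prev = a, e
        a = l0_l1_threshold(ga, lam / L, gam / L)
        e = soft_threshold(ge, beta / L)
        t_prev, t = t, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    return a, e

D = np.eye(5)[:, :2]                           # toy orthonormal dictionary
y = np.array([3.0, 0.05, 0.0, 4.0, 0.0])       # small coefficient + one outlier
a_hat, e_hat = apg_solve(y, D, lam=0.01, gam=0.01, beta=0.5, n_iter=500)
```

With these toy values the small second coefficient is removed by the counting term, the first coefficient is only mildly shrunk, and the large entry outside the subspace is captured by the outlier vector.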
Closed Solution of combining $\ell_0$ and $\ell_1$ regularization
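This subsection mainly focuses on a sparse combinatory model which combines the $\ell_0$ and $\ell_1$ norms together as the regularizer term
$$\min_{x}\ \tfrac{1}{2}(x - y)^2 + \lambda|x|_0 + \gamma|x|, \tag{21}$$
where $x, y \in \mathbb{R}$, and $|x|_0$ denotes the $\ell_0$ norm: if $x \neq 0$, then $|x|_0 = 1$, and $|x|_0 = 0$ otherwise.
Lemma. The optimal solution of Eq. (21) is defined as
$$x^* = \begin{cases} y - \gamma, & y > \gamma + \sqrt{2\lambda}, \\ y + \gamma, & y < -\gamma - \sqrt{2\lambda}, \\ 0, & \text{otherwise}. \end{cases} \tag{22}$$
The proof can be found in Supporting Information. If $x, y \in \mathbb{R}^n$, Eq. (21) changes into
$$\min_{x}\ \tfrac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_0 + \gamma\|x\|_1, \tag{23}$$
where $x = (x_1, \dots, x_n)^{\top}$ and $y = (y_1, \dots, y_n)^{\top}$. It is obvious that Eq. (23) can be turned into
$$\sum_{i=1}^{n} \min_{x_i}\ \tfrac{1}{2}(x_i - y_i)^2 + \lambda|x_i|_0 + \gamma|x_i|. \tag{24}$$
So it can be seen as a sequence of optimizations over $x_i$, and each can be solved by the Lemma. More analysis of the combination of $\ell_0$ and $\ell_1$ regularization can be found in Supporting Information.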
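A quick numerical check of such a closed-form rule can be sketched as follows. It assumes the scalar objective 0.5*(x - y)^2 + lam*|x|_0 + gam*|x|, for which soft thresholding by gam followed by a cut-off at |y| = gam + sqrt(2*lam) is the minimizer. The parameter names and the brute-force comparison are illustrative additions, not part of the paper.

```python
import numpy as np

def l0_l1_prox(y, lam, gam):
    """Element-wise minimizer of 0.5*(x - y)^2 + lam*|x|_0 + gam*|x|."""
    y = np.asarray(y, dtype=float)
    shrunk = np.sign(y) * np.maximum(np.abs(y) - gam, 0.0)
    keep = np.abs(y) > gam + np.sqrt(2.0 * lam)
    return np.where(keep, shrunk, 0.0)

def brute_force_scalar(y, lam, gam, grid):
    """Grid search over non-zero x, compared against the value at x = 0."""
    nz = grid[grid != 0.0]
    vals = 0.5 * (nz - y) ** 2 + lam + gam * np.abs(nz)
    i = int(np.argmin(vals))
    return float(nz[i]) if vals[i] < 0.5 * y ** 2 else 0.0

grid = np.linspace(-5.0, 5.0, 100001)
lam, gam = 0.5, 0.3                      # cut-off is 0.3 + sqrt(1.0) = 1.3
for y in (2.0, -2.0, 1.0, 0.2):
    closed = float(l0_l1_prox(y, lam, gam))
    brute = brute_force_scalar(y, lam, gam, grid)
    print(y, closed, brute)
```

Inputs whose magnitude is below the cut-off are set exactly to zero, while larger inputs are shifted toward zero by gam only, which is the reduced over-penalization discussed in the text.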
Analysis of the combinatory model Eq. (23)
S2 Fig. 3: (a) shows the closed-form solutions of linear regression and of $\ell_0$, $\ell_1$, and $\ell_0$+$\ell_1$ regularized regression, respectively. (b) shows the sparsity threshold changes of $\ell_0$, $\ell_1$, and $\ell_0$+$\ell_1$ regularized regression, respectively.
In Eq. (23), if we set $\lambda = 0$ and $\gamma = 0$, the model degenerates to linear regression. If we set $\lambda = 0$, Eq. (23) reduces to $\ell_1$ regularized regression, while it becomes $\ell_0$ regularized regression when $\gamma = 0$. S2 Fig. 3 (a) shows the closed-form solutions of these four cases. The regularization parameters are set separately for Eq. (23) ($\ell_0$+$\ell_1$ regularized regression), for $\ell_0$ regularized regression, and for $\ell_1$ regularized regression. We note that $\ell_0$+$\ell_1$ regularized regression can match the sparsity of a single-norm regularized regression while causing less over-penalization than $\ell_1$ regularized regression. In S2 Fig. 3 (b), the sparsity threshold changes of $\ell_0$, $\ell_1$, and $\ell_0$+$\ell_1$ regularized regression are shown, respectively. When the parameter changes from 0 to 1, the sparsity threshold of $\ell_0$+$\ell_1$ varies from that of $\ell_1$ to that of $\ell_0$. Besides, the threshold of $\ell_0$+$\ell_1$ is larger than those of $\ell_0$ and $\ell_1$ in the interval $(0, 1)$.
Orthogonal Dictionary learning for Visual Tracking
In this section, we describe dictionary learning in detail in three parts: dictionary initialization, orthogonal dictionary update and dictionary reinitialization.
Dictionary Initialization: There are two schemes to initialize the orthogonal dictionary: one applies PCA to the set of initial frames $Y$, and the other applies RPCA to $Y$. When the initial frames do not undergo corruption (e.g., occlusion or illumination change), we apply PCA to $Y$ instead of RPCA. PCA amounts to computing the skinny SVD of $Y$ and taking the basis vectors of its column space as the initial dictionary. However, when the initial frames have large sparse noise, RPCA is selected to get the intrinsic low-rank features $A$, which can be obtained by solving Zhang et al. (2014):
$$\min_{A, E}\ \|A\|_* + \eta\|E\|_1, \quad \text{s.t. } Y = A + E, \tag{25}$$
where $\|\cdot\|_*$ denotes the nuclear norm, $E$ is the sparse noise, and $\eta$ is a weighting parameter.
When solving Eq. (25), the skinny SVD of $A$ is readily available: $A = U S V^{\top}$, and $U$ is the initial orthogonal dictionary. Fig. 4 (a) shows that PCA initialization and RPCA initialization both perform well when the initial frames have little noise. The initial frames are generally clean; therefore, we choose PCA initialization as the default.
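The PCA initialization via a skinny SVD can be sketched as below. This is a hedged example: the function name, the choice to subtract the mean of the initial frames before the SVD, and the toy data sizes are assumptions made for illustration.

```python
import numpy as np

def init_dictionary_pca(Y, k):
    """Return the mean and the k leading left singular vectors of centered Y.

    Y holds one vectorized frame per column. The returned dictionary has
    orthonormal columns spanning the dominant column space of the data.
    """
    mu = Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Y - mu, full_matrices=False)  # skinny SVD
    return mu.ravel(), U[:, :k]

rng = np.random.default_rng(0)
Y = rng.standard_normal((1024, 10))     # toy stand-in: ten 32x32 patches
mu, D = init_dictionary_pca(Y, k=5)
```

Because the columns of `D` come from an SVD, `D.T @ D` is the identity up to rounding, which is the orthogonality property that the closed-form coefficient updates rely on.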
Orthogonal Dictionary Update: As the appearance of a target may change drastically, it is necessary to update the orthogonal dictionary . Here we adopt an incremental PCA algorithm Levey and Lindenbaum (2000) to update the dictionary.
Dictionary reinitialization: When the tracker is prone to drift, it is necessary to dynamically reinitialize the dictionary to recover the intrinsic subspace features. We adopt the strategy proposed by Zhang et al. (2014). The reinitialization is performed at the $t$-th frame if the noise term $e_t$ at the $t$-th frame, normalized by the length of the vector, exceeds a threshold parameter (generally 0.5). In that case, we reinitialize the dictionary in the same way as the dictionary initialization by applying RPCA, but with a different $Y$ in Eq. (25). Here, $Y$ consists of the optimal candidate observations from the initial frames (generally 10) and from the latest frames. Fig. 4 (b) compares the tracking performance with and without RPCA reinitialization when the object undergoes variable illumination. After reinitializing the dictionary, our tracker recovers the object, so reinitialization is effective in improving the reconstruction ability. In Algorithm 2, we summarize the overall tracking process for frame $t$.
In this section, we compare the performance of our proposed tracker with several state-of-the-art tracking algorithms, such as TLD Kalal et al. (2012), IVT Ross et al. (2008), ASLA Jia et al. (2012), APG Bao et al. (2012), MTT Zhang et al. (2013), SP Wang et al. (2013b), SPOT Zhang and Maaten (2013), FOT Vojíř and Matas (2014), SST Zhang et al. (2015), SCM Zhong et al. (2012), MIL Babenko et al. (2009), and Struck Hare et al. (2011), on a benchmark Wu et al. (2013) with 50 challenging video sequences. Our tracker is implemented in MATLAB (R2013b) and runs at 4.2 fps on an Intel 2.53 GHz Dual-Core CPU with 8GB memory running Windows 7. We empirically set the regularization parameters and use the Lipschitz constant L = 2. Before solving Eq. (5), all the candidates are centralized. For efficiency, the updated orthogonal dictionary keeps only the columns corresponding to the largest eigenvalues of PCA or RPCA, 600 particles are adopted, and the model is incrementally updated at a fixed frame interval. In the following, we present both qualitative and quantitative comparisons of the above-mentioned methods.
Fig. 5 shows frames taken from the 50 videos as qualitative results for our method, compared with the top-performing SP and SST. We choose examples from some of the 50 sequences to illustrate the effectiveness of our method. Fig. 6 shows the visualization results.
Heavy Occlusion: Fig. 6 (a) and (b) show four challenging sequences with heavy occlusion. In Faceocc1 and Faceocc2, the targets undergo heavy occlusion and in-plane rotation, and our method outperforms the other tracking algorithms. Freeman4 and David3 demonstrate that the proposed method can capture the accurate location of objects in terms of position and scale when the target undergoes severe occlusion (e.g., Freeman4 #0144 and David3 #0085). However, the IVT, APG, MIL, SP, SCM, ASLA, TLD, SPOT, FOT, SST, MTT, and Struck methods drift away from the target object when occlusion occurs. For these four sequences, the IVT method performs poorly since conventional PCA is not robust to occlusions. Although APG and SP utilize sparsity to model outliers, their occlusion detection is not stable when drastic appearance changes happen. In contrast, our method is robust to heavy occlusion. This is because our appearance model, regularized by the combination of $\ell_0$ and $\ell_1$ norms, can reconstruct the object accurately.
Fast Motion: Fig. 6 (c) shows the sequences Boy and Jumping with fast motion. It is difficult to predict the locations of the tracked objects when they undergo abrupt motion. In Boy, the captured images are seriously blurred, but Struck and our method track the target faithfully throughout the sequence. The IVT, MTT, ASLA, SCM and SST methods drift away seriously. We note that most of the other trackers have drift problems due to the abrupt motion in the sequence Jumping. In contrast, SST and our method successfully track the target for the whole video.
Drastic Pose, Scale and Illumination Changes: In Fig. 6 (d) and (e), we test five challenging sequences with drastic pose, scale and illumination change. The Fish and Tiger1 clips contain significant illumination variation. We can see that the APG, MTT, and MIL methods are less effective in these cases (e.g., Fish #0305 and Tiger1 #0240). In Singer2 and Jogging-2, other trackers drift away when the objects undergo variable illumination and pose variation (e.g., Singer2 #0110 and Jogging-2 #0100); however, our method still performs well. Our method also achieves good performance in CarScale with scale variation (e.g., CarScale #0204). Subspace-based approaches may fail to update the appearance model, as the coefficients computed in their models may include redundant background features. Our method can successfully adapt to drastic changes since the combination of sparse coding and sparse counting is not only stable but also able to capture the intrinsic features of the subspace.
Background Clutters: Fig. 6 (f) demonstrates the tracking results in Deer, Basketball, and Football with background clutter. Basketball is a difficult sequence because it contains cluttered background, illumination change, heavy occlusion and non-rigid pose variation. Except for our tracker, none of the compared algorithms works well on it (e.g., Basketball #0486 and #0614). As shown in Deer and Football, our tracker performs relatively well (e.g., Deer #0031 and Football #304) as it has excluded background clutter in the sparse errors, but TLD, FOT, and MIL fail.
We use two metrics to compare the proposed algorithm with other state-of-the-art methods. The first metric is the center location error measured with manually labeled ground truth data. The second one is the overlap rate, i.e., $\text{score} = \frac{\text{area}(R_T \cap R_G)}{\text{area}(R_T \cup R_G)}$, where $R_T$ is the tracking bounding box and $R_G$ is the ground truth bounding box. Larger average scores mean more accurate results.
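Both metrics are straightforward to compute from bounding boxes. The sketch below assumes axis-aligned boxes given as (x, y, width, height) tuples; that box format and the function names are assumptions for this example and are not taken from the benchmark toolkit.

```python
import math

def center_error(box_t, box_g):
    """Euclidean distance between box centers, in pixels."""
    tx, ty = box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0
    gx, gy = box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0
    return math.hypot(tx - gx, ty - gy)

def overlap_rate(box_t, box_g):
    """Intersection area divided by union area of two boxes."""
    x1 = max(box_t[0], box_g[0])
    y1 = max(box_t[1], box_g[1])
    x2 = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    y2 = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

tracked = (0.0, 0.0, 2.0, 2.0)
truth = (1.0, 1.0, 2.0, 2.0)
err = center_error(tracked, truth)
score = overlap_rate(tracked, truth)
```

Averaging these per-frame values over a sequence gives the kind of numbers reported in the tables: lower is better for the center error and higher is better for the overlap rate.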
Table 1 shows the average overlap rates. Table 2 reports the average center location errors (in pixels), where a smaller average error means a more accurate result. As can be seen from the tables, our method achieves lower average errors and higher overlap rates on most sequences. We provide the precision and success plots in Fig. 7 to evaluate our performance over all 50 sequences. The evaluation parameters are set to the defaults in Wu et al. (2013). We note that our algorithm performs well for the videos with occlusion, deformation, in-plane rotation, and out-of-plane rotation based on the precision metric and the success rate metric, as shown in Fig. 8 and Fig. 9, respectively. Both the tables and the figures show that our method achieves favorable performance against other state-of-the-art methods.
To further compare the running time of four subspace-based tracking algorithms (i.e., IVT, APG, SP and our method), we calculated the average frames per second (FPS) for the image patches (see the last row of Table 1). For APG, we report the FPS with its APG acceleration. It can be seen that IVT is considerably faster than the other trackers, as its computation only involves matrix-vector multiplication. Both SP and our method are faster than APG. It is also observed that our method is much faster than SP. This is due to the different choices of optimization scheme: SP adopts a naive alternating minimization strategy, whereas our method is efficiently solved by APG.
In this paper, we propose a sparse coding and counting method under a Bayesian framework for robust visual tracking. The proposed method combines $\ell_0$ regularization and $\ell_1$ regularized sparse representation in a single model; therefore, it has a better ability to sparsely represent an object, and the reconstruction results are also better. Besides, to solve the proposed model, we develop a fast and efficient APG algorithm. Moreover, the closed-form solution of the combination of $\ell_0$ norm and $\ell_1$ norm regularization is provided. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods, both qualitatively and quantitatively.
This work is partially supported by the National Natural Science Foundation of China (Nos. 61300086, 61432003, 61301270, 61173103, 91230103), the Fundamental Research Funds for the Central Universities (DUT15QY15), the Open Project Program of the State Key Laboratory of CAD&CG, Zhejiang University, Zhejiang, China (No. A1404), and National Science and Technology Major Project (Nos. 2013ZX04005-021, 2014ZX001011).
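Appendix: Proof of Lemma (Closed Solution of combining $\ell_0$ and $\ell_1$ regularization)
First, we denote $f(x) = \tfrac{1}{2}(x - y)^2 + \lambda|x|_0 + \gamma|x|$. It is obvious that if $x = 0$, then $f(0) = \tfrac{1}{2}y^2$. Then we need to discuss the case that $x \neq 0$:
if $x > 0$, then $f(x) = \tfrac{1}{2}(x - y)^2 + \lambda + \gamma x$. Writing its K.K.T. condition, we get $x = y - \gamma$, and the objective value is $f(y - \gamma) = \lambda + \gamma y - \tfrac{1}{2}\gamma^2$.
if $x < 0$, then $f(x) = \tfrac{1}{2}(x - y)^2 + \lambda - \gamma x$. It is easy to get $x = y + \gamma$, and the objective value is $f(y + \gamma) = \lambda - \gamma y - \tfrac{1}{2}\gamma^2$.
Then, we need to compare these three cases. If $f(y - \gamma) < f(0)$, we have $(y - \gamma)^2 > 2\lambda$. Combining with $x = y - \gamma > 0$, we have $y > \gamma + \sqrt{2\lambda}$. Similarly, if $f(y + \gamma) < f(0)$, then we have $y < -\gamma - \sqrt{2\lambda}$. And $x^* = 0$, otherwise.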
- Babenko et al. (2009) B. Babenko, M.H. Yang, S.J. Belongie, Visual tracking with online multiple instance learning, in: CVPR, pp. 983–990.
- Bao et al. (2012) C. Bao, Y. Wu, H. Ling, H. Ji, Real time robust tracker using accelerated proximal gradient approach, in: CVPR, pp. 1830–1837.
- Hare et al. (2011) S. Hare, A. Saffari, P.H.S. Torr, Struck: Structured output tracking with kernels, in: ICCV, pp. 263–270.
- Jepson et al. (2003) A.D. Jepson, D.J. Fleet, T.F. El-Maraghi, Robust online appearance models for visual tracking, IEEE TPAMI 25 (2003) 1296–1311.
- Jia et al. (2012) X. Jia, H. Lu, M.H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: CVPR, pp. 1822–1829.
- Jin et al. (2014) W. Jin, R. Liu, Z. Su, C. Zhang, S. Bai, Robust visual tracking using latent subspace projection pursuit, in: ICME, pp. 1–6.
- Kalal et al. (2012) Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE TPAMI 34 (2012) 1409–1422.
- Levey and Lindenbaum (2000) A. Levey, M. Lindenbaum, Sequential karhunen-loeve basis extraction and its application to images, IEEE Trans. on IP 9 (2000) 1371–1374.
- Lin et al. (2009) Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, Y. Ma, Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix, Technical Report, UIUC, 2009.
- Liu et al. (2010) B. Liu, L. Yang, J. Huang, P. Meer, L. Gong, C. Kulikowski, Robust and fast collaborative tracking with two stage sparse optimization, in: ECCV, 2010, pp. 624–637.
- Liu et al. (2009) R. Liu, J. Cheng, H. Lu, A robust boosting tracker with minimum error bound in a co-training framework, in: ICCV, pp. 1459–1466.
- Liu et al. (2014a) R. Liu, W. Jin, Z. Su, C. Zhang, Latent subspace projection pursuit with online optimization for robust visual tracking, IEEE MultiMedia 21 (2014a) 47–55.
- Liu et al. (2014b) R. Liu, Z. Lin, Z. Su, J. Gao, Linear time principal component pursuit and its extensions using filtering, Neurocomputing 142 (2014b) 529–541.
- Lu et al. (2013) X. Lu, Y. Wang, Y. Yuan, Sparse coding from a bayesian perspective, IEEE Transactions on Neural Networks and Learning Systems 24 (2013) 929–939.
- Mei and Ling (2009) X. Mei, H. Ling, Robust visual tracking using $\ell_1$ minimization, in: ICCV, pp. 1436–1443.
- Pan et al. (2013) J. Pan, J. Lim, Z. Su, M.H. Yang, $\ell_0$-regularized object representation for visual tracking, BMVC (2013).
- Ross et al. (2008) D.A. Ross, J. Lim, R.S. Lin, M.H. Yang, Incremental learning for robust visual tracking, IJCV 77 (2008) 125–141.
- Tseng (2008) P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, submitted to SIAM J. Optimiz., 2008.
- Vojíř and Matas (2014) T. Vojíř, J. Matas, The enhanced flock of trackers, in: Registration and Recognition in Images and Videos, Springer, 2014, pp. 113–136.
- Wang et al. (2013a) D. Wang, H. Lu, M.H. Yang, Least soft-threshold squares tracking, in: CVPR, pp. 2371–2378.
- Wang et al. (2013b) D. Wang, H. Lu, M.H. Yang, Online object tracking with sparse prototypes, IEEE TIP 22 (2013b) 314–325.
- Wu et al. (2013) Y. Wu, J. Lim, M.H. Yang, Online object tracking: A benchmark, in: CVPR, pp. 2411–2418.
- Zhang et al. (2014) C. Zhang, R. Liu, T. Qiu, Z. Su, Robust visual tracking via incremental low-rank features learning, Neurocomputing 131 (2014) 237–247.
- Zhang and Maaten (2013) L. Zhang, L. Maaten, Structure preserving object tracking, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1838–1845.
- Zhang et al. (2013) T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via structured multi-task sparse learning, IJCV 101 (2013) 367–383.
- Zhang et al. (2015) T. Zhang, S. Liu, C. Xu, S. Yan, B. Ghanem, N. Ahuja, M.H. Yang, Structural sparse tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 150–158.
- Zhong et al. (2012) W. Zhong, H. Lu, M.H. Yang, Robust object tracking via sparsity-based collaborative model, in: CVPR, pp. 1838–1845.