Robust visual tracking is an important topic in computer vision, with applications in a wide variety of fields, including video surveillance, motion analysis, and object recognition. Given the initial state (e.g., bounding box) of a target in a video sequence, a tracker aims to infer the state of the target in the succeeding frames. Despite significant recent progress [1, 2, 3, 4, 5, 6, 7, 8], challenges remain, ranging from appearance changes of the tracked object to diverse background disturbances. The benchmark work of [9] groups the factors of a test sequence that influence tracking performance into 11 categories, including illumination variation, occlusion, deformation, motion blur, and background clutter, to name a few.
To address the issue of appearance and background variations, many sophisticated appearance models have been proposed, which may roughly be categorized into generative and discriminative models. Trackers based on generative models try to build a robust appearance model of the tracked object and search for the best-matching candidate regions. Examples that fall into this category are incremental subspace learning [10], sparse representation based tracking [1, 11, 12, 13, 8], distribution fields representation based tracking [14], etc. In contrast, tracking methods based upon discriminative learning typically model the tracked object as well as the background, followed by a classification decision to distinguish the target from its surroundings. Representatives are support vector machines (SVM), boosting ensemble tracking [15], online multiple instance learning [16], the bootstrapping binary classifier tracker [17], structured output tracking [2], etc. These methods usually treat tracking as a detection problem (tracking-by-detection). Our proposed tracker applies unsupervised feature learning in an online fashion to model both the target appearance and the background, followed by a linear SVM for classification. Hence it belongs to this category.
In recent years, unsupervised feature learning methods have been successfully applied to many vision tasks such as image classification [18, 19, 20], object recognition [21], and scene categorization [22]. The classical feature learning pipeline mainly consists of three steps: (a) learning an over-complete dictionary; (b) encoding the features with the learned dictionary; (c) spatially pooling the encoded features over a pyramid of regular spatial grids. The dictionary learning process is typically unsupervised. Methods such as K-means, K-SVD [23], sparse coding, sparse/denoising autoencoders, or even random sampling can be employed. As for the encoding method, soft threshold, soft assignment, sparse coding, and locality-constrained linear coding [24] are commonly applied. It has been shown in [19] that the choice of dictionary learning method, even random sampling, has little influence on the classification performance when the dictionary size is sufficiently large, and that the pivotal procedure lies in the encoding step. The authors showed empirically that state-of-the-art performance in image classification can be achieved with a simple soft threshold encoding method.
The success of this line of work has inspired us to adapt the image classification pipeline to object tracking. We highlight the main contributions of this work as follows:
- We propose a feature learning based tracker using the online dictionary learning method [25]. Online dictionary learning can adapt to the foreground and background appearance and effectively update the dictionary words, which is important for online problems like tracking. Despite its simplicity, the proposed tracker outperforms most of the state-of-the-art trackers it is compared with.
- We evaluate the performance of a few widely-used dictionary learning and feature encoding methods in the proposed tracking framework. Due to the nature of tracking problems (such as the efficiency requirement and a relatively simpler classification task compared with generic image classification), we draw some useful conclusions, which deviate from the case of image classification.
- To further demonstrate the advantage of the learned features over traditional hand-crafted features in visual tracking, we incorporate the feature learning part into the Struck tracker [2] and obtain improved tracking accuracy.
2 Related work
As a crucial component of the tracking system, the appearance model has been extensively studied. Besides traditional hand-crafted features, such as texture, HOG, and Haar-like features [16, 17, 2], sparse representation has been widely used in tracking, and it is closely related to the feature learning based tracker proposed here.
Sparse-representation trackers such as the $\ell_1$ tracker [27] solve an $\ell_1$-regularized least-squares ($\ell_1$ minimization) problem to sparsely represent the tracked object using a set of target templates and trivial templates. Note that in these methods the representations are holistic, and the dictionary is usually constructed using simple methods like sampling or principal component analysis. In contrast, our method is based on local patches. These methods also apply no pooling, a step that can often significantly improve accuracy, as shown in our experiments. Moreover, the $\ell_1$ minimization problem needs to be solved many times, although [1, 8] applied faster solvers to speed up the computation.
Later, the work of [28] learns a dictionary on SIFT features extracted from general images (e.g., the VOC2010 and Caltech101 datasets) by solving the sparse coding problem, encodes the features using $\ell_1$ sparse coding, and then applies max-pooling and trains a logistic regression classifier. In addition to the aforementioned issue of very expensive computation, this method yields a final representation of high dimension (14336 in their case), which can severely limit its practical value in tracking. The work of [29] proposes to use histograms of sparse coefficients based on a local sparse dictionary learned from image patches sampled from the first frame of the sequence, and then applies mean shift for tracking. Similar work can be found in [11, 12], although the work of [11] adopts a different alignment pooling strategy and [12] directly concatenates the learned sparse coefficients instead of pooling. Compared with the methods reviewed above, we show that by using online dictionary learning with a simple but very efficient encoding method, rather than solving the much more expensive $\ell_1$ minimization problem, we can outperform most state-of-the-art trackers.
3 Unsupervised feature learning for tracking
We follow the well-known tracking-by-detection framework, which attempts to learn a classifier to discriminate the target object from its background. First, we learn a dictionary $D \in \mathbb{R}^{n \times K}$ of size $K$ (each column denotes a basis vector; if $K > n$, then $D$ is over-complete) based on the image patches extracted from the current frame, and update it online during tracking when necessary. We call each element of the dictionary a basis, although the elements are not necessarily orthogonal. Other local descriptors can also be used in place of image patches; we simply use the raw pixels of image patches in this work, as we found that feature learning on raw pixels usually works better than feature learning on low-level image descriptors like local binary patterns.
Because it is efficient and easy to implement, we apply the soft threshold (ST) coding strategy, which reads

$$z_j = \max\{0,\; d_j^\top x - \tau\}, \qquad j = 1, \dots, K.$$

Here $z = [z_1, \dots, z_K]^\top$ are the encoded features, $d_j$ is the $j$-th basis of the dictionary, and $\tau$ is a predefined threshold. We mainly use the soft threshold to encode the original features ($x$ denotes the vector obtained by stacking all pixel values of an image patch). Then we perform the max-pooling operation to produce the final feature vectors, which are used to train a linear SVM for detection. According to the theoretical and empirical evaluation of [30], max-pooling generally yields more discriminative features for classification than sum or average pooling. The framework of our feature learning based tracking is illustrated in Figure 1 and the algorithm is summarized in Algorithm 1.
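The encoding and pooling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the patch grid coordinates `rows`/`cols`, and the choice of a 1×1, 2×2, 3×3 pooling pyramid are assumptions made here (the pyramid is chosen because it gives 14 cells, which matches a 1400-dimensional feature for 100 bases).

```python
import numpy as np

def soft_threshold_encode(patches, dictionary, tau):
    """Encode patches (P, n) with a dictionary (n, K): z_j = max(0, d_j^T x - tau)."""
    return np.maximum(0.0, patches @ dictionary - tau)

def pyramid_max_pool(codes, rows, cols, grid_h, grid_w, levels=(1, 2, 3)):
    """Max-pool codes (P, K) over g x g cells for each g in `levels` (assumed pyramid).

    rows/cols give each patch's position on a grid_h x grid_w grid of patches.
    Returns a vector of length K * sum(g * g for g in levels).
    """
    K = codes.shape[1]
    pooled = []
    for g in levels:
        # Assign every patch to one of the g x g spatial cells.
        cell_r = np.minimum(rows * g // grid_h, g - 1)
        cell_c = np.minimum(cols * g // grid_w, g - 1)
        for r in range(g):
            for c in range(g):
                mask = (cell_r == r) & (cell_c == c)
                # Empty cells contribute zeros so the output length is fixed.
                pooled.append(codes[mask].max(axis=0) if mask.any() else np.zeros(K))
    return np.concatenate(pooled)
```

With a dictionary of 100 bases and the three assumed levels, the pooled vector has 100 × (1 + 4 + 9) = 1400 entries, and its first 100 entries are the global maximum response of each basis.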
3.1 Online dictionary learning
Various dictionary learning techniques exist in the literature, including K-means, K-SVD [23], sparse coding, etc. Recent studies have shown that relatively simple dictionary learning methods, such as K-means or even random sampling, offer surprisingly promising results in image classification [18, 19]. However, this holds only when the dictionary size is sufficiently large (typically a few thousand), which leads to high-dimensional features, because the dimension of the feature vector after encoding is linearly proportional to the dictionary size. Real-time tracking requires that the feature dimension not be too high, for computational efficiency. On the other hand, due to temporal changes in the video, a fixed dictionary is generally not sufficient to cope with the appearance changes of the tracked object and the background. Taking both computational efficiency and online update into consideration, we employ the online dictionary learning of [25] to build a relatively small dictionary.
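Given a training set of image patches $X = [x_1, \dots, x_N] \in \mathbb{R}^{n \times N}$, many classical dictionary learning methods learn an optimized dictionary $D \in \mathbb{R}^{n \times K}$ by (either exactly or approximately) solving the following objective function:

$$\min_{D \in \mathcal{C},\, \{\alpha_i\}} \; \frac{1}{N} \sum_{i=1}^{N} \Big( \frac{1}{2} \| x_i - D\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1 \Big), \qquad (1)$$

where $\alpha_i \in \mathbb{R}^K$ are the sparse codes; $\lambda$ is a regularization parameter; $\mathcal{C} = \{D : \|d_j\|_2 \le 1,\ j = 1, \dots, K\}$ is the constraint set bounding the column norms of $D$; and $\|\cdot\|_2$ and $\|\cdot\|_1$ are the $\ell_2$ and $\ell_1$ norm respectively. The latter enforces sparsity. Problem (1) is not jointly convex with respect to $D$ and $\alpha$, so it is commonly solved by alternating between the two variables. The online dictionary learning method follows this vein, assuming that the training set is composed of i.i.d. samples. At each round $t$, the algorithm draws one or more samples ($x_t$) and alternates between the classical sparse coding step, which computes the sparse code $\alpha_t$ of $x_t$ over the dictionary $D_{t-1}$, and the dictionary update step, which obtains $D_t$.

The sparse code is solved by LARS-Lasso [31] with $D_{t-1}$ fixed:

$$\alpha_t = \arg\min_{\alpha} \; \frac{1}{2} \| x_t - D_{t-1}\alpha \|_2^2 + \lambda \| \alpha \|_1, \qquad (2)$$

while the dictionary is updated by optimizing

$$D_t = \arg\min_{D \in \mathcal{C}} \; \frac{1}{t} \Big( \frac{1}{2} \mathrm{Tr}(D^\top D A_t) - \mathrm{Tr}(D^\top B_t) \Big), \qquad (3)$$

where $A_t = \sum_{i=1}^{t} \alpha_i \alpha_i^\top$ and $B_t = \sum_{i=1}^{t} x_i \alpha_i^\top$, both of which are also updated online. Here $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, and $A_t \in \mathbb{R}^{K \times K}$, $B_t \in \mathbb{R}^{n \times K}$. The optimization problem (3) is solved by sequentially updating the $j$-th column of $D$ through an orthogonal projection onto the constraint set:

$$d_j \leftarrow \frac{u_j}{\max(\| u_j \|_2,\, 1)},$$

where $u_j = \frac{1}{A_{jj}} (b_j - D a_j) + d_j$, with $A_{jj}$ denoting the element in the $j$-th row and $j$-th column of $A_t$, and $a_j$, $b_j$ the $j$-th columns of $A_t$ and $B_t$ respectively.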
The algorithm is summarized in Algorithm 2. It is worth noting that the method can also be used in an off-line fashion to train on fixed-size data by cycling over a randomly permuted training set to draw . In the tracking task, the dictionary can be off-line learned from natural images or the first frame of the sequence. We provide a comparison of these three cases in the experiment section.
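As a rough illustration of the alternation between sparse coding and dictionary update, the following NumPy sketch processes one sample per round. It is not the toolbox used in the experiments. Two assumptions are made here and should be read as such: the sparse coding step uses plain ISTA iterations as a simple stand-in for LARS-Lasso, and the dictionary is initialized from randomly chosen, normalized training samples. All function names are hypothetical.

```python
import numpy as np

def ista_lasso(x, D, lam, n_iter=100):
    """Approximate argmin_a 0.5*||x - D a||_2^2 + lam*||a||_1 with ISTA.

    ISTA is used here only as a simple substitute for LARS-Lasso.
    """
    L = np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = a - D.T @ (D @ a - x) / L        # gradient step
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrinkage
    return a

def update_dictionary(D, A, B):
    """One pass of column-wise updates followed by projection onto the unit ball."""
    for j in range(D.shape[1]):
        if A[j, j] < 1e-10:                  # basis not used yet; keep it unchanged
            continue
        u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
        D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D

def online_dictionary_learning(X, K, lam=0.1, seed=0):
    """X: (n, N) with samples as columns. Returns a dictionary of shape (n, K)."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    # Assumed initialization: K random samples scaled to unit norm.
    D = X[:, rng.choice(N, K, replace=False)].astype(float)
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    A = np.zeros((K, K))
    B = np.zeros((n, K))
    for t in rng.permutation(N):             # one sample per round
        x = X[:, t]
        a = ista_lasso(x, D, lam)            # sparse coding with D fixed
        A += np.outer(a, a)                  # accumulate sufficient statistics
        B += np.outer(x, a)
        D = update_dictionary(D, A, B)       # dictionary update
    return D
```

Because the statistics `A` and `B` are accumulated incrementally, the same loop can be resumed with new patches during tracking, which is what makes the method suitable for online updates.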
To avoid the unstable performance caused by too frequent updates, and to ensure efficiency, we apply some heuristic strategies. To capture the appearance change of the object, we introduce a weighting scheme for each basis in the dictionary, defined as the normalized norm of the encoded features: the weight of the $j$-th basis is the norm of its encoded responses over the region, normalized across all bases. It indicates the relative importance of the bases in the encoding process, and essentially characterizes the appearance of the region. According to this weighting scheme, we can sort the bases from the most important to the least important. During tracking, if the overlap between the top-half bases of the detected target regions in two consecutive frames falls below a threshold (fixed in our experiments), an appearance change is likely and the dictionary is updated. We illustrate this by visualizing the ordered learned bases (100 in total) with their corresponding encoded responses in Figure 2. As can be seen, the ranked bases provide some intuitive insights into the feature learning approach.
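The update trigger can be sketched as follows. This is only an illustration under explicit assumptions: the text does not pin down which norm is used or how the overlap is measured, so the sketch assumes the $\ell_1$ norm of each basis's responses and measures overlap as the fraction of shared indices among the top-half bases. The function names and the `threshold` argument are hypothetical.

```python
import numpy as np

def basis_weights(codes):
    """codes: (P, K) encoded responses of a region. Returns K weights summing to 1.

    Assumption: the per-basis weight is the l1 norm of its responses.
    """
    w = np.abs(codes).sum(axis=0)
    total = w.sum()
    K = codes.shape[1]
    return w / total if total > 0 else np.full(K, 1.0 / K)

def top_half(weights):
    """Indices of the most important half of the bases."""
    k = len(weights) // 2
    return set(np.argsort(-weights)[:k].tolist())

def needs_update(codes_prev, codes_curr, threshold):
    """True if the top-half bases of two consecutive regions overlap too little."""
    a = top_half(basis_weights(codes_prev))
    b = top_half(basis_weights(codes_curr))
    overlap = len(a & b) / max(len(a), 1)   # assumed overlap measure
    return overlap < threshold
```

When the same bases dominate in both frames the overlap is 1 and no update is triggered; when the dominant bases change completely the overlap is 0 and the dictionary is refreshed.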
3.2 Re-training of linear classification
To build a discriminative appearance model, we train a linear least-squares SVM (LS-SVM) classifier on the learned features, mainly because of its fast closed-form solution. Of course, many other classifiers can be used here.
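Given a set of training examples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$, the LS-SVM learns a classifier $f(x) = w^\top x + b$ by optimizing the following objective function [32]:

$$\min_{w,\, b} \; \frac{1}{2} \| w \|_2^2 + \frac{C}{2} \sum_{i=1}^{n} \big( w^\top x_i + b - y_i \big)^2, \qquad (6)$$

where $\|\cdot\|_2$ is the $\ell_2$ norm and $C$ is the trade-off parameter. To simplify notation, we define $e$ as an $n \times 1$ vector of all ones, $X = [x_1, \dots, x_n]$ to be the data matrix, $n_+$ and $n_-$ to be the positive and negative sample number respectively, $c_+$ and $c_-$ to be the positive and negative sample mean, and $c$ to be the mean of all training samples. Obviously we have $n = n_+ + n_-$ and $c = (n_+ c_+ + n_- c_-)/n$. Then the closed-form solution of (6) can be formulated as:

$$w = \frac{2 n_+ n_-}{n^2} \Big( S + \frac{1}{nC} I \Big)^{-1} (c_+ - c_-), \qquad b = \frac{n_+ - n_-}{n} - c^\top w, \qquad (7)$$

where $I$ is an identity matrix and $S$ is the covariance matrix formulated as $S = \frac{1}{n} (X - c e^\top)(X - c e^\top)^\top$. During tracking, we use an online reservoir of boxes from a maximum number of frames (30 in our experiment) for training. Generally, the earliest tracking results are more accurate, while the latest ones capture the recent appearance of the target. Based on these two considerations, we maintain the reservoir by selecting the boxes from the first 10 frames together with those from the most recent 20 frames.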
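To make the closed-form training step concrete, the sketch below solves a least-squares SVM of the form $\min_{w,b} \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\sum_i (w^\top x_i + b - y_i)^2$ with labels in $\{+1, -1\}$. The exact scaling of the objective is an assumption made here; under it, setting the gradient to zero gives the expressions in terms of the class means and the covariance matrix used in the code. The function name is hypothetical.

```python
import numpy as np

def lssvm_closed_form(X, y, C):
    """Closed-form LS-SVM for the objective stated in the lead-in.

    X: (d, n) data matrix with samples as columns; y: (n,) labels in {+1, -1}.
    Returns the weight vector w (d,) and the bias b.
    """
    d, n = X.shape
    pos, neg = X[:, y > 0], X[:, y < 0]
    n_pos, n_neg = pos.shape[1], neg.shape[1]
    c_pos, c_neg = pos.mean(axis=1), neg.mean(axis=1)
    c = X.mean(axis=1)                       # mean of all samples
    Xc = X - c[:, None]
    S = Xc @ Xc.T / n                        # covariance matrix
    rhs = c_pos - c_neg
    w = (2.0 * n_pos * n_neg / n**2) * np.linalg.solve(S + np.eye(d) / (n * C), rhs)
    b = (n_pos - n_neg) / n - c @ w
    return w, b
```

Only one linear system of size $d \times d$ is solved, which is why re-training every few frames stays cheap when the feature dimension is moderate.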
In this section, we offer a comprehensive evaluation of the proposed tracker on twenty sequences, most of which can be found at the website of the first author of [9]. These sequences contain various challenging situations in object tracking, such as illumination variation, occlusion, deformation, background clutter, and fast motion. For a detailed attribute description, please refer to [9]. Two widely-used evaluation criteria are utilized here, namely, the center location error (CLE) and the PASCAL VOC overlap ratio (VOR), with the latter defined as $\mathrm{VOR} = \frac{\mathrm{area}(B_T \cap B_G)}{\mathrm{area}(B_T \cup B_G)}$, where $B_T$ is the tracking result box and $B_G$ the ground truth bounding box.
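The two criteria are simple to compute per frame. The sketch below assumes boxes are given as `(x, y, width, height)` tuples with `(x, y)` the top-left corner; this box convention and the function names are choices made here for illustration.

```python
import math

def vor(box_t, box_g):
    """PASCAL VOC overlap ratio: intersection area divided by union area."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    inter_w = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    inter_h = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = inter_w * inter_h
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0

def cle(box_t, box_g):
    """Center location error: Euclidean distance between the box centers."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    return math.hypot((xt + wt / 2) - (xg + wg / 2), (yt + ht / 2) - (yg + hg / 2))
```

For example, two 10×10 boxes offset horizontally by 5 pixels share an area of 50 out of a union of 150, so the VOR is 1/3 and the CLE is 5 pixels.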
We use a search radius of 30 for tracking and 60 for training the classifier, as done in Struck [2]. The dictionary is initially learned from image patches of the first frame and then updated online. We extract patches with a stride of 4 for large targets and a stride of 2 for small targets. The patches are then contrast-normalized by subtracting the mean and dividing by the standard deviation. Note that we do not apply unit-length normalization here, as it degrades performance. We use a dictionary size of 100 ($K = 100$) and soft threshold (ST) coding with three-level max pooling, which yields a feature dimension of 1400. As we do not apply unit-length normalization, we set the ST threshold empirically and keep it fixed across all the sequences. We use an optimization toolbox for updating the dictionary online and solving the sparse coding problem.
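The patch extraction and contrast normalization steps can be sketched as below. The patch size is not stated in this section, so `patch_size` is a hypothetical parameter, as are the function names; the small `eps` in the denominator is added here only to avoid division by zero on flat patches.

```python
import numpy as np

def extract_patches(image, patch_size, stride):
    """Densely sample square patches from a 2-D image.

    Returns the flattened patches (P, patch_size**2) together with the row and
    column index of each patch on the sampling grid.
    """
    H, W = image.shape
    patches, rows, cols = [], [], []
    for i, y in enumerate(range(0, H - patch_size + 1, stride)):
        for j, x in enumerate(range(0, W - patch_size + 1, stride)):
            patches.append(image[y:y + patch_size, x:x + patch_size].ravel())
            rows.append(i)
            cols.append(j)
    return np.asarray(patches, dtype=float), np.array(rows), np.array(cols)

def contrast_normalize(patches, eps=1e-8):
    """Subtract each patch's mean and divide by its standard deviation."""
    mean = patches.mean(axis=1, keepdims=True)
    std = patches.std(axis=1, keepdims=True)
    return (patches - mean) / (std + eps)
```

For a 32×32 region with 8×8 patches and a stride of 4, this yields a 7×7 grid of 49 patches, each a 64-dimensional vector with zero mean and unit standard deviation after normalization.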
During the tracking, we maintain a reservoir of 30 frames (the first 10 and the most recent 20; fixed for all the sequences) for re-training the LS-SVM. The classifier is initially trained with the first two labelled frames and updated every four frames. Our unoptimized Matlab implementation runs around 4 frames per second with no dictionary updating and around 2.5 frames per second with dictionary update, on a standard PC machine using a single core.
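The reservoir policy amounts to simple bookkeeping over frame indices. The following sketch is an illustration with hypothetical function names; it assumes frames are identified by consecutive integers and that re-training happens on every fourth frame index.

```python
def reservoir_frames(frame_ids, n_first=10, n_recent=20):
    """Frames kept for re-training: the earliest n_first plus the latest n_recent."""
    frame_ids = list(frame_ids)
    if len(frame_ids) <= n_first + n_recent:
        return frame_ids                      # reservoir not yet full
    return frame_ids[:n_first] + frame_ids[-n_recent:]

def should_retrain(frame_index, interval=4):
    """Re-train the classifier every `interval` frames."""
    return frame_index % interval == 0
```

After 100 frames, for instance, the reservoir holds frames 0 to 9 and 80 to 99, so the classifier always sees both the reliable early appearance and the most recent one.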
4.1 Comparison with state-of-the-art trackers
We first compare our tracker with eight state-of-the-art trackers: Struck (structured output tracker [2]), SCM (sparsity-based collaborative model [12]), ASLA (adaptive structural local appearance model [11]), L1APG ($\ell_1$ tracker using accelerated proximal gradient approach [1]), DFT (distribution field tracker [14]), MTT (multi-task sparse learning tracker [13]), TLD (bootstrapping binary classifier tracker [17]), and IVT (incremental subspace tracker [10]). The publicly available benchmark code of [9] with its initial settings is used to evaluate their results. We report the average VORs and CLEs in Table 1 and Table 2 respectively. For our method, due to the randomness introduced by the dictionary learning process, we run it 5 times and report the median results. The results of our tracker both with and without the dictionary update process are included in the tables. From the results, we can see that our tracker with online dictionary update achieves the best overall performance across the twenty sequences, especially on david3, box, iceball and bolt, where the other trackers lose the target at different frames. Another notable observation is that even without dictionary update, our tracker performs surprisingly well, which may result from the fact that most tracking scenes consist of relatively simple image patterns. We discuss the dictionary update further in Section 4.3.
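Table 1: Average VORs.
| Sequence | Ours | Ours_U | Struck [2] | SCM [12] | ASLA [11] | L1APG [1] | DFT [14] | MTT [13] | TLD [17] | IVT [10] |
Table 2: Average CLEs.
| Sequence | Ours | Ours_U | Struck [2] | SCM [12] | ASLA [11] | L1APG [1] | DFT [14] | MTT [13] | TLD [17] | IVT [10] |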
4.2 Analysis of feature learning
In this section, we examine several factors that have impact on the performance of the proposed tracker.
Evaluation of different dictionary learning methods
We compare the online dictionary learning (ODL) algorithm [25] used in this paper with two other typical dictionary learning methods, namely, K-means and K-SVD [23]. We also include results using randomly sampled (RS) patches as the dictionary. All these methods use the image patches extracted from the first frame. One may suspect that patches obtained from natural images would yield better performance, as they may provide more general patterns. To test this, we also run the ODL method on 100,000 image patches randomly selected from a segmentation database and use the resulting dictionary throughout all the sequences. The dictionary size is fixed at 100 for all the methods. Table 3 shows the average VORs and CLEs on eight sequences. The results indicate that random sampling performs poorly when the dictionary is small, and that the choice among the other dictionary learning methods has little influence on the tracking performance, which is in accordance with the conclusion of [19] in image classification. We use ODL rather than the other two because K-SVD is more time consuming and K-means suffers from unstable performance under online update. Another conclusion that can be drawn from Table 3 is that image patches taken directly from the sequence better capture the patterns of the tracked object and the background, especially when the dictionary size is not large enough.
Evaluation of different encoding schemes
Besides soft threshold (ST) and sparse coding (SC), several other encoding schemes exist in the literature, including soft assignment (SA), localized soft assignment (LSA) [33] and triangle K-means (TK) [18]. Given a learned dictionary $D \in \mathbb{R}^{n \times K}$, the encoding process provides a feature mapping from $\mathbb{R}^n$ to $\mathbb{R}^K$. We summarize the formulations of the five encoding methods in Table 4. After obtaining the dictionary using online dictionary learning, we compare the tracking results with the five different encoding methods. The threshold in ST is chosen empirically. The smoothing factor in SA and LSA is set to 10 as suggested in [33], and the neighborhood size in LSA is tuned over a set of candidate values. The trade-off parameter in SC is chosen as the best of a set of candidate values. Table 5 reports the average VORs and CLEs on eight sequences. As can be observed, the simple soft threshold encoding performs on par with sparse coding, and better than the other three. While sparse coding needs to solve an $\ell_1$-regularized linear least-squares problem every time (with a fixed dictionary), soft threshold coding only requires a max operation.
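For orientation, four of the five encoders can be written compactly in NumPy. The formulations below follow the way these schemes are commonly stated in the literature and are not copied from Table 4, so they should be read as assumptions; the function names and the default neighborhood size `k` are hypothetical. Sparse coding is omitted because it requires an iterative $\ell_1$ solver.

```python
import numpy as np

def _sq_dists(x, D):
    """Squared Euclidean distances from patch x (n,) to each column of D (n, K)."""
    return ((D - x[:, None]) ** 2).sum(axis=0)

def soft_threshold(x, D, tau):
    """ST: z_j = max(0, d_j^T x - tau)."""
    return np.maximum(0.0, D.T @ x - tau)

def soft_assignment(x, D, beta=10.0):
    """SA: softmax over negative scaled squared distances to all bases."""
    d2 = _sq_dists(x, D)
    e = np.exp(-beta * (d2 - d2.min()))      # shift for numerical stability
    return e / e.sum()

def localized_soft_assignment(x, D, beta=10.0, k=5):
    """LSA: soft assignment restricted to the k nearest bases, zero elsewhere."""
    d2 = _sq_dists(x, D)
    nearest = np.argsort(d2)[:k]
    z = np.zeros(D.shape[1])
    e = np.exp(-beta * (d2[nearest] - d2[nearest].min()))
    z[nearest] = e / e.sum()
    return z

def triangle_kmeans(x, D):
    """TK: z_j = max(0, mean distance - distance to basis j)."""
    dist = np.sqrt(_sq_dists(x, D))
    return np.maximum(0.0, dist.mean() - dist)
```

ST needs only one matrix-vector product and a max, whereas SA, LSA and TK all require the distances from the patch to every basis, which illustrates why ST is attractive under a real-time budget.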
Evaluation of different dictionary sizes and pooling levels
Generally, using a larger dictionary and more pooling levels improves classification accuracy, but inevitably leads to higher-dimensional features. Thousands of dictionary bases are typically used in image classification. In visual tracking, due to the real-time constraint, the features cannot be too high dimensional. Fortunately, the image patterns appearing in a tracking scene are relatively simple, which means that hundreds of dictionary words are enough to yield good results. We thus evaluate four dictionary sizes (64, 100, 144, 196) as well as four pooling levels in terms of VOR scores in Figure 3. It can be seen that a dictionary of size 100 greatly improves the performance over size 64 on most of the sequences, with the exception of faceocc2. Further enlarging the dictionary brings no significant improvement and can even degrade the performance. As for the pooling levels, using more levels improves the tracking accuracy on most of the sequences. However, the difference becomes less notable as the number of pooling levels increases, except on the board sequence, where the tracker with two-level pooling features loses the target. When the number of pooling levels increases to 4, the performance even gets worse, which we attribute to overfitting. Based on these observations, we choose a dictionary size of 100 and 3-level pooling as a compromise between accuracy and efficiency to report the tracking results in Tables 1 and 2.
Comparing with other features in Struck
To further demonstrate the strength of the learned features, we incorporate the feature learning into the Struck framework and compare it with the three types of features originally used in [2], namely raw pixel, Haar and histogram features. A linear kernel is used here for evaluation. All the other settings are the same as in [2] for all the sequences. Table 6 reports the average VORs and CLEs on eight sequences. As can be observed, different hand-crafted features perform well in particular scenarios, as they capture different information about the tracking scene, whereas the learned features achieve the best overall performance and outperform their counterparts significantly. In conclusion, features learned in a principled fashion are superior to the traditional hand-crafted features in these tracking tasks.
4.3 Discussions on dictionary update
From the reported results, we can see that a dictionary learned only from the patches extracted in the first frame yields surprisingly satisfactory results, almost as good as its counterpart with the updating scheme. We conjecture that this is because most of the tracking sequences consist of relatively simple patterns: even with various changes, the scenes remain similar. To demonstrate the effectiveness of the proposed dictionary update scheme, we select a sequence with drastic scene changes as well as in-plane rotation of the object, namely motorRolling. Without update, our tracker loses the target at frame 38 and yields a final VOR of 0.11 with a CLE of 160.3. When equipped with the dictionary updating scheme, it tracks the target throughout the whole sequence, giving an average VOR of 0.49 with a CLE of 24.9, although the result is still not accurate enough due to the severe variations of the object appearance. Figure 4 shows the center location error plots of our method, both with and without dictionary update, compared to Struck on four sequences.
We have presented an online feature learning based tracker in this work. The proposed tracker follows the classical feature learning pipeline, which consists of dictionary learning, feature encoding and spatial pooling. The online dictionary learning method is applied to account for the appearance variations of the target. We also evaluate the roles of several commonly used dictionary learning and encoding approaches in the proposed tracking framework, and reach conclusions similar to those of previous studies on image classification. When combined with Struck, the learned features improve the tracking accuracy compared to traditional hand-crafted features. Experimental results on various challenging videos demonstrate that the proposed tracker outperforms the compared state-of-the-art trackers. Future work may consider incorporating motion models of the target and tracking multiple objects.
- [1] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust l1 tracker using accelerated proximal gradient approach,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 1830–1837.
- [2] S. Hare, A. Saffari, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” in Proc. IEEE Int. Conf. Comp. Vis., 2011.
- [3] L. Zhang and L. van der Maaten, “Structure preserving object tracking,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013.
- [4] X. Li, C. Shen, A. Dick, and A. van den Hengel, “Learning compact binary codes for visual tracking,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013.
- [5] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel, “Part-based visual tracking with online latent structural learning,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013.
- [6] S. Avidan, “Support vector tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1064–1072, 2004.
- [7] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel, “Robust tracking with weighted online structured learning,” in European Conf. Comp. Vis., 2012, vol. 7574, pp. 158–172.
- [8] H. Li, C. Shen, and Q. Shi, “Real-time visual tracking using compressive sensing,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2011, pp. 1305–1312.
- [9] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013.
- [10] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” Int. J. Comp. Vis., vol. 77, no. 1-3, pp. 125–141, 2008.
- [11] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 1822–1829.
- [12] W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparsity-based collaborative model,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 1838–1845.
- [13] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via multi-task sparse learning,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 2042–2049.
- [14] L. Sevilla-Lara and E. G. Learned-Miller, “Distribution fields for tracking,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 1910–1917.
- [15] S. Avidan, “Ensemble tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 261–271, Feb. 2007.
- [16] B. Babenko, M.-H. Yang, and S. J. Belongie, “Visual tracking with online multiple instance learning,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009, pp. 983–990.
- [17] Z. Kalal, J. Matas, and K. Mikolajczyk, “P-N learning: Bootstrapping binary classifiers by structural constraints,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2010, pp. 49–56.
- [18] A. Coates, A. Y. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” Proc. Int. Conf. Artificial Intell. & Stat., vol. 15, pp. 215–223, 2011.
- [19] A. Coates and A. Y. Ng, “The importance of encoding versus training with sparse coding and vector quantization,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 921–928.
- [20] Y. Jia, C. Huang, and T. Darrell, “Beyond spatial pyramids: Receptive field learning for pooled image features,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 3370–3377.
- [21] K. Sohn, D. Y. Jung, H. Lee, and A. O. Hero, “Efficient learning of sparse, distributed, convolutional feature representations for object recognition,” in Proc. IEEE Int. Conf. Comp. Vis., 2011, pp. 2643–2650.
- [22] A. Shabou and H. L. Borgne, “Locality-constrained and spatially regularized coding for scene categorization,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 3618–3625.
- [23] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
- [24] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2010.
- [25] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.
- [26] F. Tang, S. Brennan, Q. Zhao, and H. Tao, “Co-tracking using semi-supervised support vector machines,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2007, pp. 1–8.
- [27] X. Mei and H. Ling, “Robust visual tracking and vehicle classification via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2259–2272, Nov. 2011.
- [28] Q. Wang, F. Chen, J. Yang, W. Xu, and M.-H. Yang, “Transferring visual prior for online object tracking,” IEEE Trans. Image Proc., vol. 21, no. 7, pp. 3296–3305, 2012.
- [29] B. Liu, J. Huang, L. Yang, and C. Kulikowski, “Robust tracking using local sparse appearance model and k-selection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2011, pp. 1313–1320.
- [30] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning mid-level features for recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2010, pp. 2559–2566.
- [31] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Annals of Statistics, vol. 32, pp. 407–499, 2004.
- [32] J. Ye and T. Xiong, “SVM versus least squares SVM,” Proc. Int. Conf. Artificial Intell. & Stat., vol. 2, 2007.
- [33] L. Liu, L. Wang, and X. Liu, “In defense of soft-assignment coding,” in Proc. IEEE Int. Conf. Comp. Vis., 2011, pp. 2486–2493.