1 Introduction
Owing to its effectiveness and efficiency in object motion analysis, keypoint-based object tracking [Lucas and Kanade 1981, Santner et al. 2010, Maresca and Petrosino 2013, Nebehay and Pflugfelder 2014] is a popular and powerful tool in video processing, with a wide range of applications such as augmented reality (AR), object retrieval, and video compression. By encoding the local structural information of object appearance [Li et al. 2013], it is generally robust to appearance changes caused by factors such as shape deformation, illumination variation, and partial occlusion [Mikolajczyk and Schmid 2005, Bouachir and Bilodeau 2014]. Motivated by this observation, in this paper we focus on constructing effective and robust keypoint models that capture the intrinsic spatiotemporal structural properties of object appearance.
Typically, keypoint model construction consists of keypoint representation and statistical modeling. For keypoint representation, a variety of keypoint descriptors have been proposed to encode the local invariance information of object appearance, for example, SIFT [Lowe 2004] and SURF [Bay et al. 2008]. To further speed up feature extraction, a number of binary local descriptors have emerged, including BRIEF [Calonder et al. 2010], ORB [Rublee et al. 2011], BRISK [Leutenegger, Chli, and Siegwart 2011], and FREAK [Alahi, Ortiz, and Vandergheynst 2012]. Because their feature extraction is handcrafted and fixed, these keypoint descriptors usually cannot adapt effectively and flexibly to complex time-varying appearance variations as tracking proceeds.
In general, statistical modeling is cast as a tracking-by-detection problem, which seeks to build an object locator based on discriminative learning such as randomized decision trees [Lepetit and Fua 2006, Özuysal et al. 2010] and boosting [Grabner, Grabner, and Bischof 2007, Guo and Liu 2013]. However, these approaches usually generate a binary classification output for object tracking, and thus ignore the intrinsic structural or geometrical information (e.g., the geometric transform across frames) on object localization and matching during model learning. To address this issue, Hare et al. [Hare, Saffari, and Torr 2012] propose a structured SVM-based keypoint tracking approach that incorporates RANSAC-based geometric matching information into the optimization process of learning keypoint-specific SVM models. As a result, their approach is able to simultaneously find correct keypoint correspondences and estimate the underlying object geometric transforms across frames. However, its model learning process is carried out independently frame by frame, and hence ignores the cross-frame interaction information on temporal model coherence, leading to unstable tracking results in complicated scenarios.
In this work, we propose a joint learning approach that balances three important parts: temporal model coherence across frames, spatial model consistency within frames, and discriminative feature construction. As illustrated in Figure 1, the joint learning approach ensures temporal model coherence by building a multitask structured model learning scheme, which encodes the cross-frame interaction information by simultaneously optimizing a set of mutually correlated learning subtasks (i.e., a common model plus different biases) over several successive frames. As a result, the interaction information induced by multitask learning can guide the tracker to produce stable tracking results. Moreover, the proposed approach exploits the keypoint-specific structural information on spatial model consistency by performing geometric verification based structured output learning, which aims to estimate a geometric transformation while associating cross-frame keypoints. To make the keypoint descriptors adapt to time-varying tracking situations, the proposed approach embeds metric learning into the structured SVM learning process, which enhances the discriminative power of inter-class separability.
In summary, we propose a keypoint tracking approach that learns an effective and robust keypoint model through metric learning-driven multitask structured output optimization. The main contributions of this work are as follows:

We propose a multitask joint learning scheme to learn structured keypoint models by simultaneously considering spatial model consistency, temporal model coherence, and discriminative feature learning. An online optimization algorithm is further presented to solve the proposed scheme efficiently. To our knowledge, this is the first time that such a joint learning scheme has been proposed for learning-based keypoint tracking.

We create and release a new benchmark video dataset containing four challenging video sequences (covering several complicated scenarios) for experimental evaluation. In these video sequences, the keypoint tracking results are manually annotated as ground truth. Quantitative results on these sequences are provided in the experimental section.
2 Approach
Our tracking approach is mainly composed of two parts: a learning part and a prediction part. Namely, an object model is first learned by a multitask structured learning scheme in a discriminative feature space (induced by metric learning). Based on the learned object model, our approach subsequently produces the tracking results through structured prediction. Using the tracking results, a set of training samples is further collected for structured learning. The above process is repeated as tracking proceeds.
2.1 Preliminary
Let the template image be represented as a set of keypoints, where each keypoint is defined by a location and an associated descriptor. Similarly, let the input frame be represented by its own set of keypoints. Typically, traditional approaches construct correspondences between the template keypoints and the input frame keypoints, and score them by calculating the distances between the template descriptors and the input frame descriptors. Following the model learning approaches [Hare, Saffari, and Torr 2012, Stückler and Behnke 2012], we learn a model parameterized by a weight vector for each template keypoint to score each correspondence. Each element of the set of hypothetical correspondences carries a correspondence score, computed as the inner product between the weight vector of the template keypoint and the descriptor of the input frame keypoint. Similar to [Hare, Saffari, and Torr 2012, Pernici and Del Bimbo 2013], we estimate the homography transformation for planar object tracking as the tracking result based on the hypothetical correspondences.
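As an illustration of the scoring step described above, the following minimal Python sketch scores every candidate correspondence by the inner product between the weight vector of a template keypoint and a frame descriptor. The function names, and the choice to keep only the best-scoring frame keypoint for each template keypoint, are assumptions made for this example; the text does not state how the candidate set is pruned.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
# (template location, frame location, correspondence score)
Correspondence = Tuple[Point, Point, float]


def inner(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def hypothetical_correspondences(
    template_locs: List[Point],
    weights: List[Sequence[float]],      # one weight vector per template keypoint
    frame_locs: List[Point],
    frame_descs: List[Sequence[float]],  # one descriptor per frame keypoint
) -> List[Correspondence]:
    """Assumed pruning rule: keep the highest-scoring frame keypoint per template keypoint."""
    result = []
    for u, w in zip(template_locs, weights):
        scores = [inner(w, d) for d in frame_descs]
        j = max(range(len(scores)), key=scores.__getitem__)
        result.append((u, frame_locs[j], scores[j]))
    return result
```

Binary descriptors such as BRIEF can be passed as lists of 0/1 values, so the same inner product applies without modification.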
2.2 Multitask Structured Learning
During the tracking process, the keypoints in successive frames that correspond to the same keypoint in the template image form a tracklet. Based on the observation that adjacent keypoints in a tracklet are similar to each other, the models learned for these frames should be mutually correlated. We therefore construct a set of learning tasks over several adjacent frames. Each task learns a model, which is the column concatenation of the per-keypoint model parameter vectors, over the training samples collected from a range of frames. We model each task model as a linear combination of a common model and a unique part [Zheng and Ni 2013]:
(1) 
where all the unique parts are “small” when the tasks are similar to each other.
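The decomposition into a common model and task-specific unique parts can be sketched as follows. This is a minimal illustration that assumes the common model is taken as the mean of the task models (Section 2.4 refers to an average vector of this kind); the names `decompose` and `offset_penalty` are hypothetical, and the regularization weights of the actual objective are not reproduced here.

```python
from typing import List, Tuple

Vector = List[float]


def decompose(task_models: List[Vector]) -> Tuple[Vector, List[Vector]]:
    """Split task models into a shared part and per-task offsets.

    Assumption: the shared part is the mean of the task models, so every
    task model equals common + offset by construction.
    """
    k = len(task_models)
    dim = len(task_models[0])
    common = [sum(w[i] for w in task_models) / k for i in range(dim)]
    offsets = [[w[i] - common[i] for i in range(dim)] for w in task_models]
    return common, offsets


def offset_penalty(offsets: List[Vector]) -> float:
    """Sum of squared offset entries: small when the tasks are similar."""
    return sum(x * x for v in offsets for x in v)
```

When the models of adjacent frames are nearly identical, `offset_penalty` is close to zero, which is the sense in which the unique parts are "small" for similar tasks.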
To account for spatial model consistency in the model learning process, the transformation that maps the template to its location in the input frame is regarded as a structure, which can be learned in a geometric verification based structured learning framework. In our approach, the expected transformation is the one that maximizes a compatibility function, which scores all possible transformations generated by the RANSAC [Fischler and Bolles 1981] method. Before introducing the compatibility function, we define the inlier set for a specific transformation:
(2) 
In this definition, each template keypoint location is mapped by the transformation to its transformed location in the input frame, and a spatial distance threshold on the Euclidean norm determines which correspondences count as inliers.
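A minimal sketch of the inlier test may make this concrete. It assumes that a correspondence is an inlier when the transformed template location lies strictly within the threshold of the matched frame location; the strict inequality, the 3x3 nested-list homography layout, and the names `project` and `inlier_set` are choices made for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Correspondence = Tuple[Point, Point, float]  # (template loc, frame loc, score)


def project(H: Sequence[Sequence[float]], p: Point) -> Point:
    """Apply a 3x3 homography to a 2D point (homogeneous division included)."""
    x, y = p
    d = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / d,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / d)


def inlier_set(correspondences: List[Correspondence],
               H: Sequence[Sequence[float]],
               tau: float) -> List[Correspondence]:
    """Keep correspondences whose projected template point lands near the match."""
    inliers = []
    for u, v, s in correspondences:
        px, py = project(H, u)
        if math.hypot(px - v[0], py - v[1]) < tau:  # Euclidean distance test
            inliers.append((u, v, s))
    return inliers
```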
The compatibility function with respect to task is then defined as the total score of the inliers:
(3) 
where the joint feature mapping vector is the concatenation of per-keypoint feature vectors, each of which is defined as:
(4) 
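To illustrate the compatibility function as "the total score of the inliers", the sketch below sums the correspondence scores of the inliers of a candidate transformation and selects the best candidate from a list of hypotheses. It is a simplified stand-in under stated assumptions: the hypotheses are supplied by the caller (in the paper they come from RANSAC), and the names `compatibility` and `best_transformation` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Correspondence = Tuple[Point, Point, float]  # (template loc, frame loc, score)
Homography = Sequence[Sequence[float]]       # 3x3 nested sequence


def project(H: Homography, p: Point) -> Point:
    """Apply a 3x3 homography to a 2D point."""
    x, y = p
    d = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / d,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / d)


def compatibility(correspondences: List[Correspondence],
                  H: Homography, tau: float) -> float:
    """Total score of the correspondences that are inliers under H."""
    total = 0.0
    for u, v, s in correspondences:
        px, py = project(H, u)
        if math.hypot(px - v[0], py - v[1]) < tau:
            total += s
    return total


def best_transformation(correspondences: List[Correspondence],
                        hypotheses: List[Homography],
                        tau: float) -> Homography:
    """Pick the hypothesis with the highest compatibility score."""
    return max(hypotheses, key=lambda H: compatibility(correspondences, H, tau))
```

Because the score of each inlier is an inner product with a learned weight vector, the total is linear in the model, which is what allows it to be written with a joint feature mapping.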
Given a set of training samples, each consisting of the hypothetical correspondences of a frame and the transformation predicted for that frame, a structured output maximum margin framework [Taskar, Guestrin, and Koller 2003, Tsochantaridis et al. 2005] is used to learn all the multitask models, which can be expressed as the following optimization problem:
(5)  
s.t.  
where the loss function measures the difference between two transformations (in our case, it is the difference between the sizes of the two inlier sets). One nonnegative parameter weights the multiple tasks, and another weighting parameter determines the tradeoff between accuracy and regularization.
To better illustrate the contribution of multitask learning, example tracking results of the trackers with and without multitask learning are shown in Figure 2. From Figure 2(b), we observe that the independent model fails to match the keypoints under drastic rotations, while the multitask model exploits temporal model coherence to capture the rotational changes, and thus produces a stable tracking result.
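The loss on transformations described in this section (the difference in the number of inliers of two inlier sets) can be combined with compatibility scores in a structured hinge loss. The sketch below uses the common margin-rescaling form from the structured SVM literature; whether the constraints of Eq. (5) take exactly this form cannot be recovered from the text here, so it should be read as an assumption, and the function names are hypothetical.

```python
from typing import List, Tuple


def inlier_count_loss(n_true: int, n_other: int) -> int:
    """Loss between two transformations: difference in inlier-set sizes."""
    return abs(n_true - n_other)


def structured_hinge(score_true: float, inliers_true: int,
                     candidates: List[Tuple[float, int]]) -> float:
    """Margin-rescaled structured hinge loss for one training sample.

    candidates holds (compatibility score, inlier count) pairs for the
    alternative transformations. The loss is zero only when the true
    transformation beats every candidate by at least that candidate's loss.
    """
    worst = max(inlier_count_loss(inliers_true, n) + s for s, n in candidates)
    return max(0.0, worst - score_true)
```

A candidate that is very different from the true transformation (large loss) must be outscored by a larger margin than a candidate that is nearly the same.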
2.3 Discriminative Feature Space
To make the keypoint descriptors adapt to time-varying tracking situations, we wish to learn a mapping function that maps the original feature space to another, more discriminative feature space, in which semantically similar keypoints are close to each other while dissimilar keypoints are far away from each other. This can be formulated as a metric learning process [Weinberger and Saul 2009, Park et al. 2011]. We then use the mapped feature in place of the original feature in the structured learning process, to enhance the discriminative power of inter-class separability.
Figure 3 shows an example of such a feature space transformation. Before the mapping, the object keypoints and the background keypoints cannot be discriminated in the original feature space. After the transformation, the keypoints in different frames that correspond to the same keypoint in the template, and are therefore semantically similar, move close to each other in the mapped feature space, while the features of the other keypoints are distributed on the other side, separated by a large margin.
The following describes how to learn the mapping function. For a particular task, given the learned model, the distance between the two features of a doublet is defined as follows:
(6) 
We assume that a binary matrix indicates whether or not two features are semantically similar (the corresponding entry is 1 if they are similar). The hinge loss function on a doublet is then defined as:
(7) 
where .
To learn effective features consistently in our mapping process, we seek group sparsity of the features. We therefore utilize the ℓ2,1-norm [Cai et al. 2011, Li et al. 2012] to learn the discriminative information and feature correlation consistently. Since we use a linear transformation as our mapping function, the ℓ2,1-norm of the mapping matrix is the sum of the ℓ2-norms of its rows.
Given all the keypoint features from the video frames, we collect all possible combinations of the features as the training set. We obtain the binary matrix from the tracking results: the entry for two features from different frames is set to 1 if they correspond to the same keypoint in the template, and to 0 otherwise. We wish to minimize the following cost function consisting of the empirical loss term and the ℓ2,1-norm regularization term:
(8) 
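For reference, the row-wise norm used as the regularization term in this cost function can be computed as below. The sketch assumes the standard definition of the ℓ2,1-norm (the sum of the Euclidean norms of the matrix rows); the function name `l21_norm` is hypothetical.

```python
import math
from typing import Sequence


def l21_norm(M: Sequence[Sequence[float]]) -> float:
    """Sum over rows of the Euclidean norm of each row."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in M)


# A row that is entirely zero contributes nothing, so minimizing this norm
# pushes whole rows of the mapping matrix to zero (group sparsity).
example = [[3.0, 4.0], [0.0, 0.0], [0.0, 5.0]]
norm_value = l21_norm(example)
```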
The cost function is incorporated into our multitask structured learning framework, and then a unified joint learning scheme for object tracking is obtained. The final optimization problem of our approach is expressed in the following form:
(9)  
After all the models are learned, we use the last model to predict the result for the new frame. We use the RANSAC method to generate hypothetical transformations and, based on the last model, select the expected transformation among them by maximizing Eq. (3). The hypothetical correspondence set of the new frame and the predicted transformation are then added to our training set. We use the training samples collected from the results of a range of previous frames to update our model, and the above process is repeated as tracking proceeds.
2.4 Online Optimization
The optimization problem presented in Eq. (9) can be solved online effectively. We adopt an alternating optimization algorithm to solve the optimization problem.
Unconstrained form
Let and . Therefore, Eq. (9) can be rewritten to an unconstrained form:
(10)  
For descriptive convenience, let denote the term of .
Fix the other variables, solve for the transformation matrix
First, we fix all the model variables and learn the transformation matrix by solving the following problem:
(11) 
Mathematically, Eq. (11) can be converted to the following form, written in terms of the rows of the transformation matrix and the trace operator:
(12) 
where a diagonal matrix is constructed from the transformation matrix. We use an alternating algorithm to calculate the transformation matrix and the diagonal matrix in turn: we compute the transformation matrix with the current diagonal matrix by gradient descent, and then update the diagonal matrix according to the current transformation matrix. The details of solving Eq. (12) are given in the supplementary file.
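The alternating scheme above relies on a diagonal reweighting matrix built from the rows of the transformation matrix. The following sketch shows the construction commonly used for ℓ2,1-norm minimization, in which each diagonal element is the reciprocal of twice the Euclidean norm of the corresponding row. The exact element used in Eq. (12) is not recoverable from this text, so this choice, the `eps` guard against zero rows, and the function names are all assumptions.

```python
import math
from typing import List, Sequence


def reweighting_diagonal(M: Sequence[Sequence[float]],
                         eps: float = 1e-12) -> List[float]:
    """Diagonal entries 1 / (2 * ||row||_2), with eps guarding zero rows."""
    return [1.0 / (2.0 * max(math.sqrt(sum(x * x for x in row)), eps))
            for row in M]


def trace_mt_d_m(M: Sequence[Sequence[float]], d: Sequence[float]) -> float:
    """Tr(M^T D M) for diagonal D, i.e. the sum of d_i * ||row_i||^2."""
    return sum(di * sum(x * x for x in row) for di, row in zip(d, M))
```

With the diagonal computed from the current matrix, the trace term equals half of the sum of the row norms, which is why holding the diagonal fixed gives a smooth quadratic surrogate that gradient descent can handle.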
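Fix the other variables, solve for the common model
Second, after the transformation matrix is learned, suppose that the optimal solution of Eq. (10) is given. Then the common model can be obtained as a combination of the unique parts according to [Evgeniou and Pontil 2004]: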
(13) 
The proof can be found in our supplementary material.
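Fix the other variables, solve for the task-specific models
Finally, the task-specific parts can be learned one by one using gradient descent. In practice, we learn the full task models instead of the unique parts for convenience. Using the average vector of all the task models, the optimization problem for each task model can be rewritten as: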
(14) 
where and (the derivation proof is given in the supplementary material).
Given training samples at time , the subgradient of Eq. (14) with respect to is calculated, and we perform a gradient descent step according to:
(15) 
where is the step size (the details of the term is described in the supplementary material). We repeat the procedure to obtain an optimal solution until the algorithm converges (on average converges after iterations).
All the above is summarized in Algorithm 1, and the details are described in the supplementary material.
3 Experiments and Results
3.1 Experimental Settings
Dataset
The video dataset used in our experiments consists of nine video sequences. Specifically, the first five sequences are from [Hare, Saffari, and Torr 2012], and the last four sequences (i.e., “chart”, “keyboard”, “food”, “book”) were recorded by us. Together, these sequences cover several complicated scenarios such as background clutter, object zooming, object rotation, illumination variation, motion blurring, and partial occlusion (example frames can be found in the supplementary material).
Implementation Details
For keypoint feature extraction, we use the FAST keypoint detector [Rosten and Drummond 2006] with the 256-bit BRIEF descriptor [Calonder et al. 2010]. For metric learning, the linear transformation matrix is initialized to the identity matrix. For multitask learning, the number of tasks is set to a fixed value, and we update all the multitask models frame by frame. All weighting parameters are set once and remain fixed throughout all the experiments. Similar to [Hare, Saffari, and Torr 2012], we treat the process of estimating the homography transformation of the planar object as a tracking-by-detection task.
Evaluation Criteria
We use the same criteria as [Hare, Saffari, and Torr 2012], with a scoring function between the predicted homography and the ground-truth homography:
(16) 
where the score is computed over a normalized square. A frame is regarded as successfully detected if its score satisfies the success threshold, and as falsely detected otherwise. The average success rate, defined as the number of successfully detected frames divided by the length of the sequence, is used to evaluate the performance of the tracker. To report the tracking result frame by frame, we also use the accumulated false detection number, defined as the accumulated number of falsely detected frames as tracking proceeds.
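The evaluation criteria can be sketched in a few lines. The example below assumes that the homography score is the mean Euclidean distance between the corners of a normalized square projected by the predicted and by the ground-truth homography, and that a frame succeeds when its score is below a threshold. The corner coordinates, the exact form of the score, and the threshold value are assumptions, since Eq. (16) and the threshold are not recoverable from this text; the threshold is therefore left as a parameter.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Homography = Sequence[Sequence[float]]

# Assumed corners of the normalized square.
SQUARE: List[Point] = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]


def project(H: Homography, p: Point) -> Point:
    """Apply a 3x3 homography to a 2D point."""
    x, y = p
    d = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / d,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / d)


def homography_score(H_pred: Homography, H_true: Homography) -> float:
    """Mean distance between corresponding projected corners."""
    total = 0.0
    for c in SQUARE:
        a, b = project(H_pred, c), project(H_true, c)
        total += math.hypot(a[0] - b[0], a[1] - b[1])
    return total / len(SQUARE)


def success_rate(scores: List[float], threshold: float) -> float:
    """Fraction of frames whose score is below the threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)


def accumulated_false_detections(scores: List[float],
                                 threshold: float) -> List[int]:
    """Running count of falsely detected frames, one entry per frame."""
    counts, n = [], 0
    for s in scores:
        if not s < threshold:
            n += 1
        counts.append(n)
    return counts
```

A flat stretch in the accumulated count corresponds to a run of successfully detected frames, which is how a nearly horizontal curve should be read.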
3.2 Experimental Results
Comparison with State-of-the-art Methods
We compare our approach with several state-of-the-art approaches, including the boosting based approach [Grabner, Grabner, and Bischof 2007], the structured SVM (SSVM) approach [Hare, Saffari, and Torr 2012], and a baseline static tracking approach (without model updating). For all these approaches, we use their publicly available code. We implement our approach in C++ with OpenCV. On average, our algorithm takes 0.0746 seconds to process one frame on a quad-core 2.4GHz Intel Xeon E5-2609 CPU with 16GB memory. Table 1 reports the average success rate of all four approaches. As shown in this table, our approach achieves the highest success rate on every sequence (tied with SSVM on “map”).
Table 1: Average Success Rate (%)
Sequence    Static    Boosting  SSVM      Ours
barbapapa   19.7138   89.0302   94.1176   94.4356
comic       42.5000   57.6042   98.1250   98.8542
map         81.1295   82.0937   98.7603   98.7603
paper        5.0267    3.8502   82.7807   88.2353
phone       88.1491   84.9534   96.6711   98.4021
chart       13.1461    1.9101   53.0337   77.5281
keyboard    27.8607   57.7114   62.3549   94.5274
food        32.8173   67.4923   85.7585   99.6904
book         8.5616    8.9041   55.8219   81.6781
To provide an intuitive illustration, we report the detection result on each frame in Figure 4. We observe that both the “Boosting” and “SSVM” approaches obtain a number of incorrect detection results on some frames of the test sequences, while our approach achieves stable tracking results in most situations (the curve corresponding to our approach grows slowly and is almost horizontal).
Figure 5 shows the tracking results on some sample frames (more experimental results can be found in our supplementary materials). These sequences, which contain background clutter, are challenging for keypoint-based tracking. Owing to metric learning and multitask learning, our approach still performs well in complicated scenarios with drastic object appearance changes.
Evaluation of Our Individual Components
To explore the contribution of each component in our approach, we compare the performance of variants with individual parts, including SSVM (structured SVM), SML (SSVM + metric learning), SMT (SSVM + multitask learning), and SMM (SSVM + ML + MT, which is exactly our approach). The average success rates of all these variants are reported in Table 2.
Table 2: Average Success Rate (%)
Sequence    SSVM      SML       SMT       SMM
barbapapa   94.1176   94.4356   94.2766   94.4356
comic       98.1250   98.5417   98.6458   98.8542
map         98.7603   98.6226   98.7603   98.7603
paper       82.7807   86.2032   87.3797   88.2353
phone       96.6711   97.2037   97.6032   98.4021
chart       53.0337   62.0225   61.1236   77.5281
keyboard    62.3549   73.6318   76.6169   94.5274
food        85.7585   88.0805   99.3808   99.6904
book        55.8219   71.5753   74.8288   81.6781
From Table 2, we find that the geometric verification based structured learning approach achieves good tracking results in most situations. Furthermore, we observe from Figure 6 that multitask structured learning guides the tracker to produce a stable tracking result in the complicated scenarios, and metric learning enhances the capability of the tracker to separate keypoints from background clutter. Our approach consisting of all these components then generates a robust tracker.
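As a quick way to summarize the component study, the per-variant mean over the nine sequences can be computed directly from the values in Table 2. The snippet below is only a convenience for the reader; the dictionary layout and the name `column_means` are choices made for this example, and the mean over sequences is not a metric reported in the paper.

```python
from typing import Dict, Tuple

VARIANTS = ("SSVM", "SML", "SMT", "SMM")

# Average success rates (%) copied from Table 2.
TABLE2: Dict[str, Tuple[float, float, float, float]] = {
    "barbapapa": (94.1176, 94.4356, 94.2766, 94.4356),
    "comic":     (98.1250, 98.5417, 98.6458, 98.8542),
    "map":       (98.7603, 98.6226, 98.7603, 98.7603),
    "paper":     (82.7807, 86.2032, 87.3797, 88.2353),
    "phone":     (96.6711, 97.2037, 97.6032, 98.4021),
    "chart":     (53.0337, 62.0225, 61.1236, 77.5281),
    "keyboard":  (62.3549, 73.6318, 76.6169, 94.5274),
    "food":      (85.7585, 88.0805, 99.3808, 99.6904),
    "book":      (55.8219, 71.5753, 74.8288, 81.6781),
}


def column_means(table: Dict[str, Tuple[float, ...]]) -> Dict[str, float]:
    """Mean success rate of each variant across all sequences."""
    n = len(table)
    return {name: sum(row[i] for row in table.values()) / n
            for i, name in enumerate(VARIANTS)}


means = column_means(TABLE2)
```

Averaged this way, each added component raises the mean, and the full model (SMM) has the highest mean of the four variants.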
4 Conclusion
In this paper, we have presented a novel and robust keypoint tracker by solving a multitask structured output optimization problem driven by metric learning. Our joint learning approach simultaneously considers spatial model consistency, temporal model coherence, and discriminative feature construction during the tracking process.
We have shown in extensive experiments that geometric verification based structured learning models the spatial model consistency to generate a robust tracker in most scenarios, that multitask structured learning characterizes the temporal model coherence to produce stable tracking results even in complicated scenarios with drastic changes, and that metric learning enables discriminative feature construction to enhance the discriminative power of the tracker. We have created a new benchmark video dataset consisting of challenging video sequences, and experimental results on the dataset show that our tracker outperforms the other state-of-the-art trackers.
5 Acknowledgments
All correspondence should be addressed to Prof. Xi Li. This work is supported in part by the National Natural Science Foundation of China (Grant No. 61472353), the National Basic Research Program of China (2012CB316400), the 863 program (2012AA012505), the China Knowledge Centre for Engineering Sciences and Technology, and the Fundamental Research Funds for the Central Universities.
References

[Alahi, Ortiz, and Vandergheynst 2012] Alahi, A.; Ortiz, R.; and Vandergheynst, P. 2012. FREAK: Fast retina keypoint. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Bay et al. 2008] Bay, H.; Ess, A.; Tuytelaars, T.; and Gool, L. V. 2008. Speeded-up robust features (SURF). Computer Vision and Image Understanding (CVIU) 110(3):346–359.
[Bouachir and Bilodeau 2014] Bouachir, W., and Bilodeau, G.-A. 2014. Structure-aware keypoint tracking for partial occlusion handling. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV).
[Cai et al. 2011] Cai, X.; Nie, F.; Huang, H.; and Ding, C. 2011. Multi-class l2,1-norm support vector machine. In Proceedings of the IEEE Conference on Data Mining (ICDM).
[Calonder et al. 2010] Calonder, M.; Lepetit, V.; Strecha, C.; and Fua, P. 2010. BRIEF: Binary robust independent elementary features. In Proceedings of the European Conference on Computer Vision (ECCV).
[Evgeniou and Pontil 2004] Evgeniou, T., and Pontil, M. 2004. Regularized multi-task learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
[Fischler and Bolles 1981] Fischler, M. A., and Bolles, R. C. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM (CACM) 24(6):381–395.
[Grabner, Grabner, and Bischof 2007] Grabner, M.; Grabner, H.; and Bischof, H. 2007. Learning features for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Guo and Liu 2013] Guo, B., and Liu, J. 2013. Real-time keypoint-based object tracking via online learning. In Proceedings of the International Conference on Information Science and Technology (ICIST).
[Hare, Saffari, and Torr 2012] Hare, S.; Saffari, A.; and Torr, P. H. S. 2012. Efficient online structured output learning for keypoint-based object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Lepetit and Fua 2006] Lepetit, V., and Fua, P. 2006. Keypoint recognition using randomized trees. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 28(9):1465–1479.
[Leutenegger, Chli, and Siegwart 2011] Leutenegger, S.; Chli, M.; and Siegwart, R. 2011. BRISK: Binary robust invariant scalable keypoints. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[Li et al. 2012] Li, Z.; Yang, Y.; Liu, J.; Zhou, X.; and Lu, H. 2012. Unsupervised feature selection using nonnegative spectral analysis. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
[Li et al. 2013] Li, X.; Hu, W.; Shen, C.; Zhang, Z.; Dick, A.; and Hengel, A. V. D. 2013. A survey of appearance models in visual object tracking. ACM Transactions on Intelligent Systems and Technology (TIST) 4(4):58:1–58:48.
[Lowe 2004] Lowe, D. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV) 60(2):91–110.
[Lucas and Kanade 1981] Lucas, B. D., and Kanade, T. 1981. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
[Maresca and Petrosino 2013] Maresca, M., and Petrosino, A. 2013. Matrioska: A multi-level approach to fast tracking by learning. In Proceedings of the International Conference on Image Analysis and Processing (ICIAP).
[Mikolajczyk and Schmid 2005] Mikolajczyk, K., and Schmid, C. 2005. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 27(10):1615–1630.
[Nebehay and Pflugfelder 2014] Nebehay, G., and Pflugfelder, R. 2014. Consensus-based matching and tracking of keypoints for object tracking. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV).
[Özuysal et al. 2010] Özuysal, M.; Calonder, M.; Lepetit, V.; and Fua, P. 2010. Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 32(3):448–461.
[Park et al. 2011] Park, K.; Shen, C.; Hao, Z.; and Kim, J. 2011. Efficiently learning a distance metric for large margin nearest neighbor classification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
[Pernici and Del Bimbo 2013] Pernici, F., and Del Bimbo, A. 2013. Object tracking by oversampling local features. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) PP(99):1–1.
[Rosten and Drummond 2006] Rosten, E., and Drummond, T. 2006. Machine learning for high-speed corner detection. In Proceedings of the European Conference on Computer Vision (ECCV).
[Rublee et al. 2011] Rublee, E.; Rabaud, V.; Konolige, K.; and Bradski, G. 2011. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[Santner et al. 2010] Santner, J.; Leistner, C.; Saffari, A.; Pock, T.; and Bischof, H. 2010. PROST: Parallel robust online simple tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Stückler and Behnke 2012] Stückler, J., and Behnke, S. 2012. Model learning and real-time tracking using multi-resolution surfel maps. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
[Taskar, Guestrin, and Koller 2003] Taskar, B.; Guestrin, C.; and Koller, D. 2003. Max-margin Markov networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS).
[Tsochantaridis et al. 2005] Tsochantaridis, I.; Joachims, T.; Hofmann, T.; and Altun, Y. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR) 6:1453–1484.
[Weinberger and Saul 2009] Weinberger, K. Q., and Saul, L. K. 2009. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research (JMLR) 10:207–244.
[Zheng and Ni 2013] Zheng, J., and Ni, L. M. 2013. Time-dependent trajectory regression on road networks via multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).