Visual tracking plays an essential role in many higher-level video understanding tasks such as traffic surveillance, analysis of human behaviour, and interactions between targets of interest. During the past decade, several quite efficient object tracking methods [1, 2] have been widely deployed in various applications. However, designing a robust tracking algorithm for realistic tasks is still a major challenge. The problems arise not only from intra-class appearance variation caused by deformation and viewpoint change, but also from partial occlusion, cluttered backgrounds, etc.
To handle deformation and viewpoint change, researchers have recently tended to address the problem with online learning methods that update the target model [3, 4]. Such methods provide an effective way to handle universal tracking problems by achieving a synergy between tracking and recognition. However, for the vast majority of common objects in daily life, e.g., pedestrians and vehicles, object tracking by human eyes actually follows the recognition of the target at first glimpse. Leveraging massive experience in this recognition process brings high-level auxiliary knowledge to bear on various problems in tracking. Motivated by this intuition, in this letter we propose to track objects via high-performance object detection models, Deformable Part-based Models (DPMs). Similar inspiration underlies other state-of-the-art work in the community, although here we track the whole target as well as its deformable parts simultaneously.
The tracking framework proposed in this letter consists of several components, each corresponding to a specific view of the object. As shown in Figure 1, each component is a Dynamic Conditional Random Field (DCRF) over consecutive frames that describes the details of the object at different resolutions. Each vertex in the graph is connected with its spatial and temporal neighbours by pre-defined pairwise potentials which formulate the deformation of the object. Underneath, a pyramid-based image representation effectively handles illumination and scale changes of the target over frames.
For partial occlusion in cluttered backgrounds, part-based models have yielded attractive results in recent progress on object tracking [8, 9, 10]. A series of solutions attempt a sparse representation of objects [9, 11] to track parts of the target and thus handle partial occlusion. Differing from these decomposition-based methods, our method can natively describe the state in which some parts are absent from sight while an object hypothesis remains credible thanks to the other observed parts, and can thus handle occlusion more directly and flexibly.
The main contributions of this work are threefold: (1) we integrate a high-performance object detection method with a dynamic graph-based model, implementing efficient object tracking with structured outputs; (2) we propose novel temporal pairwise potentials to model the transition between parts over frames; and (3) we implement an efficient unparameterized logistic regression based mechanism, combining it with prior knowledge from the previous frame to handle partial occlusion. Experiments on challenging video sequences demonstrate the efficacy of the proposed method.
II Deformable Part Based Model
DPM has proven to be a quite effective model for formulating the significant intra-class variation of objects in challenging object detection problems. A DPM representation of an object can be considered a mixture of star-shaped Conditional Random Fields (CRFs) as components. Each component consists of one root part and several deformable parts as the vertices of the graph. The unary potential of a vertex, which models the part appearance, is the output of Histogram of Oriented Gradients (HOG) features filtered by a learned template over the HOG feature pyramid. The pairwise potential between root and part, which models the deformation, penalizes the displacement of the part from the anchor position of the trained model with a Gaussian kernel parameterized by a four-tuple.
By considering the HOG features and part displacements as inputs, and the templates and deformation parameters as weights, we can interpret the CRF output from the perspective of a linear perceptron:
The additive constant here is often termed the bias. The correspondence between the CRF and a linear perceptron leads to a Support Vector Machine (SVM) based training framework. The efficacy of DPM arises from three building blocks: (1) the HOG pyramid handles illumination and scale changes; (2) the mixture model takes multiple views of the object into account simultaneously; (3) the deformation penalty, formulated by pairwise potentials in the CRF, directly tackles non-rigid deformation and intra-class variation in shape.
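As a rough illustration, the score of one star-shaped component can be sketched as the sum of appearance responses minus quadratic deformation penalties plus the bias. The function and argument names below are our own, and the (a, b, c, d) penalty follows the standard DPM quadratic form over (dx, dy, dx², dy²) rather than this letter's exact notation:

```python
import numpy as np

def dpm_score(unary, displacements, deform_params, bias):
    """Score one DPM component hypothesis as a linear perceptron.

    unary         : filter responses of root + parts
    displacements : (dx, dy) of each part from its anchor
    deform_params : per-part four-tuples (a, b, c, d) weighting
                    (dx, dy, dx^2, dy^2) in the deformation penalty
    bias          : component bias constant
    """
    score = float(np.sum(unary)) + bias
    for (dx, dy), (a, b, c, d) in zip(displacements, deform_params):
        # Gaussian-style quadratic penalty for drifting from the anchor
        score -= a * dx + b * dy + c * dx ** 2 + d * dy ** 2
    return score
```

Viewed this way, the concatenated filters and deformation weights form the weight vector of a perceptron over the concatenated features, which is what makes SVM training applicable.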
III Occlusion Handling
Despite the great success DPM has witnessed, it has been reported that detecting partially occluded objects with DPM remains a challenging problem. In this letter, we propose a similar but more efficient occlusion handling strategy. From Eq. 1, one can see that in the CRF of DPM, the score related to each vertex can be computed separately as
By using logistic regression over the SVM output at every vertex, we can model the probability that the part appears at its current site.
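The unparameterized mapping referred to here can be sketched as a plain logistic squashing of the raw SVM margin, with no learned scale or offset (the function name is ours):

```python
import math

def part_probability(svm_score):
    """Unparameterized logistic map: squash a raw SVM margin from the
    real line into (0, 1) with no trained scale or offset."""
    return 1.0 / (1.0 + math.exp(-svm_score))
```

Since no parameters are fitted, the mapping can be applied to any trained DPM without an extra calibration stage.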
If an object is partially occluded, aggregating the scores of all parts as in conventional DPM is clearly unsuitable. Therefore we select only a subset of the parts, finding the optimal subset that maximizes the mean of the normalized scores of its vertices as follows:
For common pedestrian tracking, we take only four possible subsets of parts into consideration, as in Figure 2. It has been shown that such limited choices are representative enough in most practical scenarios. For more universal object tracking problems, a simple greedy search algorithm can be employed to add parts to the subset sequentially with trivial computational overhead. Differing from parameterized logistic regression, our method directly projects the SVM output to a probability without requiring any training stage. This simpler formulation is more flexible in various realistic tracking applications. As the empirical analysis in Figure 2 shows, the proposed occlusion handling strategy does introduce some noise into the final detection results of DPM. It is nevertheless very promising, since it compresses the distribution of DPM scores and allows a few parts of an object to contribute to the result as much as the whole star-shaped model.
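A greedy search of the kind mentioned above can be sketched as follows. Everything here is an assumption-laden illustration: with scores sorted in descending order, candidate subsets reduce to prefixes of the sorted order, and we assume a minimum subset size to keep a hypothesis from collapsing onto its single best part:

```python
def select_parts(probs, min_parts=2):
    """Greedy subset selection over normalised part scores.

    Returns (chosen part indices, their mean score). `min_parts` is an
    assumed floor on the subset size, not part of the original method.
    """
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    best, best_mean = None, float("-inf")
    for k in range(min_parts, len(order) + 1):
        # every mean-maximising subset of size k is a prefix of `order`
        prefix = order[:k]
        mean = sum(probs[i] for i in prefix) / k
        if mean > best_mean:
            best, best_mean = prefix, mean
    return sorted(best), best_mean
```

For pedestrians the four hand-picked subsets of Figure 2 replace this search, so the overhead only matters for more general object classes.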
IV Tracking via Dynamic Conditional Random Fields
IV-A Dynamic Conditional Random Field
DCRF was originally proposed to implement accurate foreground segmentation in video sequences. Here we introduce it into object tracking by integrating it with the pre-defined potential functions from DPM. DCRF models the states of two random fields over consecutive frames via Bayes’ rule:
where the normalizing term is the partition function. Since each random field here contains multiple vertices, enumerating all possible states in Eq. 5 is intractable. Inspired by the original derivation, we restrict the problem to every single vertex of the field.
According to the Markov property and the Hammersley-Clifford theorem, the state transition probability in Eq. 5 can be given by a Gibbs distribution as follows:
where the vertices belong to the graph of the random field. The temporal neighbourhood denotes the vertices at the previous frame which can impact a given vertex at the current frame, and the spatial neighbourhood refers to the spatially related vertices at the same frame. The clique potentials related to a vertex are defined over the states of its neighbouring vertices.
Due to the star shape of DPM, the graph model adopted in our DCRF framework retains a simple structure. The posterior distribution for a site at the current frame can be directly factorized as
Under a similar conditional independence assumption, the observation model can also be evaluated as a product of per-vertex likelihoods:
Here we only consider the corresponding vertex at the previous frame, as in Figure 1, so the temporal neighbourhood reduces to a single vertex. As shown above, the sum of unary and pairwise potentials at every vertex corresponds to the output of DPM. The factorization therefore has a very clear interpretation: the local energy at a vertex of the DCRF consists of the DPM score as the observation, the temporal potential as the transition function, and the result from the previous frame as the posterior distribution. Since each vertex has only two possible states, indicating whether it is visible, the update can be computed very efficiently, especially in logarithmic form.
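The per-vertex update just described can be sketched as a two-state multiply-and-normalise step carried out in log space; the function and argument names are ours, and the inputs stand in for the DPM observation score, the temporal potential, and the previous frame's posterior:

```python
import math

def vertex_posterior(log_obs, log_trans, prior):
    """One DCRF update for a single vertex with states
    {0: occluded, 1: visible}.

    log_obs   : per-state log observation likelihood (DPM-based score)
    log_trans : per-state log temporal potential
    prior     : per-state posterior carried over from the previous frame
    """
    log_post = [log_obs[s] + log_trans[s] + math.log(prior[s]) for s in (0, 1)]
    # normalise in log space for numerical stability
    m = max(log_post)
    w = [math.exp(v - m) for v in log_post]
    z = sum(w)
    return [wi / z for wi in w]
```

With only two states per vertex, each frame costs a handful of additions and exponentials per part, which is why the update is cheap in logarithmic form.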
IV-B Temporal Potential Function
To model the case in which the object is partially occluded, we propose a novel transition function that imposes temporal connectivity between the same parts across frames.
The proposed temporal potential ensures consistency between neighbouring vertices. If the part is assumed to have been observed in the last frame, a normalised Gaussian kernel is adopted to measure the motion of the object; the kernel is parameterized by a three-dimensional covariance matrix constraining the object to a relevant range on the HOG feature pyramid. Otherwise, if the part is assumed to be occluded, which means direct prior knowledge about the current part from the last frame is absent, we keep the temporal connectivity through the difference in part deformation instead. This implies that if the pose of the object changes drastically over frames, the final tracking result is biased more towards the observation model than the prior knowledge.
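The two branches of the potential can be sketched as below. The names are ours, and the occluded branch uses an assumed single-width Gaussian on the deformation difference where the letter leaves the exact form to its equations:

```python
import math
import numpy as np

def temporal_potential(was_visible, delta_pos, cov, deform_diff, sigma_d=1.0):
    """Temporal potential linking the same part across two frames.

    was_visible : True if the part was observed in the last frame
    delta_pos   : 3-vector displacement (x, y, pyramid level)
    cov         : 3x3 covariance of the motion Gaussian
    deform_diff : change of the part's deformation cost across frames
    sigma_d     : assumed width of the occluded-branch penalty
    """
    if was_visible:
        # normalised Gaussian over the part's motion on the HOG pyramid
        d = np.asarray(delta_pos, dtype=float)
        norm = 1.0 / math.sqrt((2.0 * math.pi) ** 3 * np.linalg.det(cov))
        return norm * math.exp(-0.5 * d @ np.linalg.inv(cov) @ d)
    # no direct prior: fall back on the deformation difference, so a
    # drastic pose change shifts weight towards the observation model
    return math.exp(-(deform_diff ** 2) / (2.0 * sigma_d ** 2))
```

The covariance effectively restricts the search to a small window of positions and pyramid levels around the previous estimate.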
V Empirical Results
We empirically evaluated the proposed graph model based tracking framework in three experiments. In all experiments we adopted the DPMs trained for PASCAL VOC 2009, which contain six components, each consisting of nine deformable parts. The algorithm is initialized by detecting the optimal window which overlaps the ground truth by at least 70% in the first frame. Only the related HOG features at neighbouring pyramid levels are extracted for tracking. This configuration implies that the efficiency of our method is determined by the apparent object size rather than by the image size: zooming in on the video frames does not affect the speed of tracking.
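The first-frame initialization can be sketched as below, assuming intersection-over-union as the overlap measure (the letter only states "overlaps by at least 70%") and detections given as (box, score) pairs; all names are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def init_window(detections, ground_truth, thresh=0.7):
    """First-frame initialization: keep the best-scoring detection
    whose overlap with the ground truth reaches `thresh`."""
    hits = [d for d in detections if iou(d[0], ground_truth) >= thresh]
    return max(hits, key=lambda d: d[1]) if hits else None
```

From the selected window onward, only the pyramid levels adjacent to the initial scale need to be filtered, which is what decouples tracking speed from image size.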
V-A From Detection to Tracking
Since tackling long-term partial occlusion is a main concern of this letter, we carefully evaluate the influence of the proposed occlusion handling mechanism in this section. A challenging video sequence, the “Woman” sequence, is used to evaluate the performance of four different configurations: detection by DPM directly (DPM), detection by DPM with occlusion handling (DPM+OH), tracking by DCRF with only the Gaussian kernel of the temporal potential (DCRF), and tracking with the complete temporal potential function (DCRF+OH). Since there is no tracking-failure problem for detection methods, we follow a previously proposed evaluation protocol in Figure 3.
It is easy to observe in Figure 3(a) that the proposed tracking method brings a significant improvement over DPM-based detection, even though the unparameterized occlusion handling by itself actually degrades the raw detection results. Note that tracking with DCRF but without occlusion handling achieved a desirable result at the beginning of the sequence. However, this method failed to follow the target around frame #125, where a long-term partial occlusion occurs, and finally led to a degraded result. A MATLAB implementation on a Pentium 3.3 GHz CPU can process one frame of this sequence in 0.7 seconds, which is considerably faster than direct detection (2.5 seconds per frame).
V-B Comparison on “Woman” Sequence
We also compared the performance of our proposed method with other leading tracking methods on the “Woman” sequence. We took several representative state-of-the-art methods into account, i.e., the Frag tracker, SRPCA tracker, IVT tracker and MIL tracker. It can be observed from Figure 4(a) that only our method successfully followed the pedestrian through the whole “Woman” sequence, while the other methods drifted away due to various problems. It has been reported that the fragment-based tracker can follow the target when initialised at frame #69, since it is specifically designed for handling long-term occlusion. However, from Figure 5 one can see that this method failed to follow the target from the beginning of the sequence due to the serious scale change between frame #1 and frame #69.
V-C Comparison on “Car 4” Sequence
In the last experiment, we evaluate our method on the “Car 4” sequence, which contains serious illumination and scale variation. The algorithm can process one frame of this sequence in around 0.4 seconds. We illustrate the centre errors of the different methods in Figure 4(b). The Frag and MIL methods failed to follow the car since they lack an effective mechanism for handling scale change. Our proposed method tracks the target without difficulty, although the SRPCA and IVT methods show more accurate results than ours. As shown in the last instance of Figure 5, our method occasionally fails to identify the correct component of the target, which leads to small drifts in the tracking results. We would like to introduce prior knowledge of the component from previous frames to address this weakness in future work.
VI Conclusion
In this letter, we propose a novel model-based tracking method which exploits the high-performance DPM in a DCRF framework. By utilising suitable temporal potential functions, the method can simultaneously handle challenging problems in tracking tasks such as variation in illumination, scale and perspective, drastic shape deformation, and partial occlusion. In future work, we plan to improve the efficiency of the method with a C++ implementation. We would also like to extend the current system to multiple-target tracking by integrating other visual cues to discriminate targets from each other.
-  D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in
-  M. Isard and A. Blake, “Condensation—conditional density propagation for visual tracking,” International journal of computer vision, vol. 29, no. 1, pp. 5–28, 1998.
-  B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 8, pp. 1619–1632, 2011.
-  D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.
-  P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
-  J. Fan, X. Shen, and Y. Wu, “What are we tracking: A unified approach of tracking and recognition,” Image Processing, IEEE Transactions on, vol. 22, no. 2, pp. 549–560, 2013.
-  Y. Wang, K.-F. Loe, and J.-K. Wu, “A dynamic conditional random field model for foreground and shadow segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 2, pp. 279–289, Feb. 2006.
-  A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 1, 2006, pp. 798–805.
-  D. Wang, H. Lu, and M.-H. Yang, “Online object tracking with sparse prototypes,” Image Processing, IEEE Transactions on, vol. 22, no. 1, pp. 314–325, 2013.
-  D. Wang and H. Lu, “Object tracking via 2DPCA and ℓ2-regularization,” Signal Processing Letters, IEEE, vol. 19, no. 11, pp. 711–714, 2012.
-  X. Mei and H. Ling, “Robust visual tracking and vehicle classification via sparse representation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 11, pp. 2259–2272, 2011.
-  J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001.
-  G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, “Part-based multiple-person tracking with partial occlusion handling,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 1815–1821.
-  R. Salakhutdinov, A. Torralba, and J. Tenenbaum, “Learning to share visual appearance for multiclass object detection,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, Jun. 2011, pp. 1481–1488.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results,” http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html.