Online Adaptive Hidden Markov Model for Multi-Tracker Fusion

In this paper, we propose a novel method for visual object tracking called HMMTxD. The method fuses observations from complementary out-of-the-box trackers and a detector by utilizing a hidden Markov model whose latent states correspond to a binary vector expressing the failure of individual trackers. The Markov model is trained in an unsupervised way, relying on an online learned detector to provide a source of tracker-independent information for a modified Baum-Welch algorithm that updates the model w.r.t. the partially annotated data. We show the effectiveness of the proposed method on combinations of two and three tracking algorithms. The performance of HMMTxD is evaluated on two standard benchmarks (CVPR2013 and VOT) and on a rich collection of 77 publicly available sequences. HMMTxD outperforms the state of the art, often significantly, on all datasets in almost all criteria.




1 Introduction

In the last thirty years, a large number of diverse visual tracking methods have been proposed Yilmaz2006; smeulder2013. The methods differ in the formulation of the problem, the assumptions made about the observed motion, the optimization techniques, the features used, the processing speed, and the application domain. Some methods focus on specific challenges like tracking of articulated or deformable objects Kwon2009; godec2011; Cehovin2013, occlusion handling grabner2010, abrupt motion Zhou2010 or long-term tracking Pernici2013; Kalal2012.

Three observations motivate the presented research. First, most trackers perform poorly if run outside the scenario they were designed for. Second, some trackers make different and complementary assumptions and their failures are not highly correlated (called complementary trackers in the paper). And finally, even fairly complex well performing trackers run at frame rate or faster on standard hardware, opening the possibility for multiple trackers to run concurrently and yet in or near real-time.

We propose a novel methodology that exploits a hidden Markov model (HMM) for fusion of non-uniform observables and pose prediction of multiple complementary trackers using an online learned high-precision detector. Non-uniform observables, in this sense, means that each tracker can produce its own type of "confidence estimate", and these estimates may not be directly comparable with each other.

The HMM, trained in an unsupervised manner, estimates the state of the trackers (failed or operating correctly) and outputs the pose of the tracked object, taking into account the past performance and observations of the trackers and the detector. The HMM treats the detector output as correct if it is not in contradiction with its current most probable state in which the majority of trackers are correct. This limits the cases where the HMM would be wrongly updated by a false detection. For the potentially many frames where reliable detector output is not available, it combines the trackers. The detector is trained on the first image and interacts with the learning of the HMM by partially annotating the sequence of HMM states at the times of verified detections. The recall of the detector is not critical, but it affects the learning rate of the HMM and the long-term properties of the HMMTxD method, i.e. its ability to reinitialize trackers after occlusions or object disappearance.

Related work. The most closely related approaches include Santner et al. Santner2010, where three tracking methods with different rates of appearance adaptation are combined to prevent drift due to incorrect model updates. The approach uses simple, hard-coded rules for tracker selection. Kalal et al. Kalal2012 combine a tracking-by-detection method with a short-term tracker that generates so-called P-N events to learn new object appearance. The output is defined either by the detector or the tracker based on visual similarity to the learned object model. Both these methods employ pre-defined rules to make decisions about object pose and use one type of measurement, a certain form of similarity between the object and the estimated location. In contrast, HMMTxD learns continuously and causally the performance statistics of individual parts of the system and fuses multiple "confidence" measurements in the form of probability densities of observables in the HMM. Zhang et al. use a pool of multiple classifiers learned from different time spans and choose the one that maximizes an entropy-based cost function. This method addresses the problem of model drift due to wrong model updates, but the failure modes inherent to the classifier itself remain the same. This is unlike the proposed method, which allows combining diverse tracking methods with different inherent failure modes and different learning strategies to balance their weaknesses.

Similarly to the proposed method, Wang et al. wangg14 and Bailer et al. Bailer2014 fuse different out-of-the-box tracking methods. Bailer et al. combine the outputs of multiple tracking algorithms offline. There is no interaction between trackers, which for instance implies that the method avoids failure only if one method correctly tracks the whole sequence. Wang et al. use a factorial hidden Markov model and a Bayesian approach. The state space of their factorial HMM is the set of potential object positions and is therefore very large. The model contains a probabilistic description of the object motion based on a particle filter. Trackers interact by reinitializing those with low reliability to the pose of the most confident one. Yuan et al. Yuan2015 use an HMM in the same setup, but rather than merging multiple tracking methods, they focus on modeling the temporal change of the target appearance in the HMM framework by introducing observational dependencies. In contrast, the HMMTxD method is online, with tracker interaction via a high-precision object detector that supervises tracker reinitializations, which happen on the fly. The appearance modeling is performed inside each tracker, and the HMMTxD captures the relation between the confidence provided by a tracker and its performance, validated by the object detector, through the observable distributions. Moreover, the HMMTxD confidence estimation is motion-model free, which prevents biases towards trackers with a particular motion model.

Yoon et al. Yoon2012 combine multiple trackers in a particle filter framework. This approach models the observables and transition behavior of individual trackers, but the trackers are self-adapting, which makes the approach prone to wrong model updates. The adaptation of the HMMTxD model is supervised by a detector set to a high-precision mode of operation, which alleviates the incorrect update problem.

The contributions of the paper are: a novel method for fusion of multiple trackers based on HMMs using non-uniform observables; a simple, and so far unused, unsupervised method for HMM training in the context of tracking; a tunable feature-based detector with a very low false positive rate; and the creation of a tracking system that shows state-of-the-art performance.

2 Fusing Multiple Trackers

HMMTxD uses a hidden Markov model (HMM) to integrate pose and observational confidence of different trackers and a detector, and updates its own confidence estimates that in turn define the pose that it outputs. In the HMM, each tracker is modeled as working correctly (1) or incorrectly (0). The HMM poses no constraints on the definition of tracker correctness; we adopted target overlap above a threshold. Having $N$ trackers at our disposal, the set of all possible states is $\{0,1\}^N$ and the initial state is $(1,\dots,1)$, i.e. all trackers correct. Note that the trackers are not assumed to be independent, because independence of tracker correctness is not a realistic assumption. For example, if the tracking problem is relatively easy, all trackers tend to be correct, and in the case of occlusion all tend to be incorrect (see the analysis in votpamiarxiv2015). The number of states grows exponentially with the number of trackers. However, we do not consider this a significant issue: due to the "real-time" requirements of tracking, the need to combine more than a small number of trackers is unlikely.
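As a small illustration of the state space described above, the following Python sketch enumerates the binary correctness vectors for a given number of trackers. The function name `enumerate_states` and the chosen ordering (all-correct state first) are assumptions made for illustration, not the authors' implementation.

```python
from itertools import product


def enumerate_states(num_trackers):
    """Return all binary correctness vectors, all-correct state first.

    Each state is a tuple with one entry per tracker:
    1 = tracker works correctly, 0 = tracker has failed.
    """
    # Iterating over (1, 0) puts (1, ..., 1) first and (0, ..., 0) last.
    return list(product((1, 0), repeat=num_trackers))


states = enumerate_states(3)
print(len(states))  # 8 states for three trackers (2 ** 3)
print(states[0])    # (1, 1, 1): the all-correct state
```

With two or three trackers, as used in the paper's experiments, the model has only four or eight states, so the exponential growth stays manageable.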

The HMMTxD method overview is illustrated in Fig. 1. Each tracker provides an estimate of the object pose and a vector of observables, which may contain a similarity measure to some model (such as normalized cross-correlation to the initial image patch, distance between the template and current histograms at a given position, etc.) or any other estimates of the tracker performance. These values serve as observables that relate the tracker's current confidence to the HMM. Each individual observable depends only on one particular tracker and its correctness; hence, the observables are assumed to be conditionally independent given the state of the HMM (which encodes the tracker correctness).

Figure 1: The structure of the HMMTxD. For each frame, the detector and trackers are run. Each tracker outputs a new object pose and observables, and the detector outputs either the verified object pose or nothing. If the detector fires, the HMM is updated, the trackers are reinitialized and the final output is the detected pose; otherwise, the HMM estimates the most probable state and outputs an average bounding box of the trackers that are correct in the estimated state.

In general, there are no constraints on observable values; however, in the proposed HMM the observable values are required to be normalized to the $[0, 1]$ interval. The observables are modeled as beta-distributed random variables (Eq. 1) and their parameters are estimated online. The beta distribution was chosen for its versatility: practically any kind of unimodal random variable on $[0, 1]$ can be modeled by the beta distribution, i.e. for any choice of lower and upper quantiles, a beta distribution exists satisfying the given quantile constraint gupta2004.
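For reference, the standard density of the beta distribution is given below. The symbols $p$ and $q$ for the shape parameters and $x$ for a normalized observable are notation assumed here for illustration; the paper's own symbols are not recoverable from this text.

```latex
f(x \mid p, q) = \frac{x^{p-1}\,(1-x)^{q-1}}{B(p, q)}, \qquad x \in (0, 1),\; p > 0,\; q > 0,
\qquad B(p, q) = \frac{\Gamma(p)\,\Gamma(q)}{\Gamma(p+q)} .
```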

Learning the parameters of the beta distributions online is crucial for adaptability to particular tracking scenes, where the observable values from different trackers may be biased due to scene properties, and for adapting to different types of tracker observables and their correlations with the "real" tracker performance. For example, taking correlation with the initial target patch as an observable for one tracker and color histogram distance to the initial target for a second tracker, the correlation between their values and the performance of the tracker may differ depending on object rigidity and on the color distribution of the object and background.

The HMM is parameterized by a pair consisting of the state transition probabilities and the beta distributions of the observables, each of which is given by two shape parameters and the density of Eq. 1.


Since the goal is real-time tracking without any specific pre-processing, learning of the HMM parameters has to be done online. Towards this goal, the object detector, which is set to an operating mode with a low false positive rate, is utilized to partially annotate the sequence of hidden states. In contrast to a classical HMM, where only a sequence of observations is available, we are in a semi-supervised setting: we additionally have a time sequence of observed states of the Markov chain at the detection times, and the Markov chain starts again in the initial state (all trackers correct) after each detection, since the trackers are reinitialized to a common object pose. This information is provided by the detector. The HMM parameters are learned by a modified Baum-Welch algorithm run on the observations and the annotated sequence of states. The partial annotation and the HMM parameter update are done strictly online.

The output of the HMMTxD is an average bounding box of the correct trackers of the current most probable state. The forward-backward procedure Rabiner1989 for HMMs is used to calculate the probability of each state at a given time (see Eq. 15-21), and the most probable state is the one for which the posterior probability given the observations (Eq. 2) is maximal. This equation is computed using Eq. 19 and maximized over the states. For the current frame, where no future observations are available, Eq. 2 is evaluated from the observations up to that frame. This ensures that the algorithm outputs a pose for each frame, which is required by most benchmark protocols. An illustration of the tracking process and HMM insight is shown in Fig. 2. Theoretically, the parameters of the HMM could be updated after each frame. However, in our implementation, learning takes place only at frames where the detector positively detects the object, i.e. on the sequence of states starting and ending with an observed state inferred by the detector (footnote 1: if pure online fusion is not required, future observations can also be used to determine the probability of each state). The detector output is used only if the detected pose is not in contradiction with the pose of the current most probable state in which the majority of trackers are correct. This ensures that even when the detector makes a mistake, the HMM is not wrongly updated. When the current state is one in which one or none of the trackers are correct, the detector gets precedence.
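The following Python sketch illustrates how a most probable state and a fused bounding box could be obtained for the current frame. It is a minimal sketch under stated assumptions, not the authors' implementation: it uses a single normalized forward (filtering) step with one observable per tracker, and all names (`beta_pdf`, `forward_step`, `fused_box`), the example shape parameters, the transition values, and the choice to return `None` when no tracker is correct are hypothetical.

```python
import math


def beta_pdf(x, p, q):
    """Density of a beta distribution with shape parameters p, q at x in (0, 1)."""
    log_b = math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    return math.exp((p - 1) * math.log(x) + (q - 1) * math.log(1 - x) - log_b)


def forward_step(prev, trans, states, obs, shapes):
    """One normalized forward (filtering) step.

    prev   : probability of each state at the previous frame
    trans  : trans[i][j] = P(next state j | current state i)
    states : binary correctness tuples, one entry per tracker
    obs    : one observable per tracker, strictly inside (0, 1)
    shapes : shapes[k][c] = (p, q) for tracker k when failed (c=0) or correct (c=1)
    """
    new = []
    for j, state in enumerate(states):
        predicted = sum(prev[i] * trans[i][j] for i in range(len(states)))
        # Observables are treated as conditionally independent given the state.
        likelihood = 1.0
        for k, correct in enumerate(state):
            likelihood *= beta_pdf(obs[k], *shapes[k][correct])
        new.append(predicted * likelihood)
    total = sum(new)
    return [v / total for v in new]


def fused_box(state, boxes):
    """Average the (x, y, w, h) boxes of the trackers marked correct in `state`."""
    chosen = [b for b, ok in zip(boxes, state) if ok]
    if not chosen:
        return None  # assumption: no output when no tracker is considered correct
    return tuple(sum(c) / len(chosen) for c in zip(*chosen))


# Hypothetical two-tracker example: tracker 0 reports high confidence, tracker 1 low.
states = [(1, 1), (1, 0), (0, 1), (0, 0)]
trans = [[0.7, 0.1, 0.1, 0.1] for _ in states]
shapes = [[(2, 5), (5, 2)], [(2, 5), (5, 2)]]
posterior = forward_step([1.0, 0.0, 0.0, 0.0], trans, states, [0.9, 0.2], shapes)
best = states[posterior.index(max(posterior))]
print(best, fused_box(best, [(10, 10, 20, 20), (50, 50, 20, 20)]))
```

In this toy setting the low observable of the second tracker outweighs the prior preference for the all-correct state, so only the first tracker's box contributes to the output.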

Figure 2: Illustration of HMM state and trackers probability estimation during tracking. The bottom graph shows the marginal probabilities for each tracker being correct and the detection times (green spikes). Above the graph the inferred states with color encoded correct trackers (1) are displayed. The final output is defined by the state and the bounding box is highlighted by white color. Best viewed zoomed in color.

3 Learning the Hidden Markov Model

For learning the parameters of the HMM, maximum likelihood estimation (MLE) is employed; however, maximizing the likelihood function is a complicated task that cannot be solved analytically. In the proposed method, the Baum-Welch algorithm baum1970 is adapted. The Baum-Welch algorithm is a widespread iterative procedure for estimating the parameters of an HMM, where each iteration increases the likelihood function but, in general, convergence to the global maximum is not guaranteed. The Baum-Welch algorithm is in fact an application of the EM (Expectation-Maximization) algorithm Dempster77.

3.1 Classical Baum-Welch Algorithm

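Let us assume an HMM with a finite set of possible states, a matrix of state transition probabilities, a vector of initial state probabilities, a fixed initial state, a sequence of observations, and a system of conditional probability densities of the observations conditioned on the state at the corresponding time. The states at the individual times are random variables, and the transition matrix together with the observation densities forms the parameter set of the model.

Let us denote by $Q$ the auxiliary function that sums, over the set of all possible T-tuples of states, the joint probability of one sequence of states and the observations under the current parameter set, multiplied by the logarithm of the same joint probability under a candidate parameter set. According to Theorem 2.1 in baum1970, a candidate parameter set that does not decrease $Q$ does not decrease the likelihood of the observations either, and equality holds if and only if the two parameter sets define the same joint probability. The classical Baum-Welch algorithm repeats the following steps until convergence:

  1. Compute $Q$ for the current parameter set.

  2. Set the parameter set to the maximizer of $Q$.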
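In standard textbook notation, the quantities used by the classical Baum-Welch algorithm can be written as follows. This is a sketch in assumed notation ($\lambda$ for the current parameter set, $\lambda'$ for a candidate, $s$ for a state sequence, $x$ for the observation sequence, $\mathcal{S}$ for the set of all state sequences); the paper's own symbols may differ.

```latex
Q(\lambda, \lambda') = \sum_{s \in \mathcal{S}} P(s, x \mid \lambda)\, \log P(s, x \mid \lambda'),
\qquad
Q(\lambda, \lambda') \ge Q(\lambda, \lambda) \;\Longrightarrow\; P(x \mid \lambda') \ge P(x \mid \lambda),
\qquad
\lambda \leftarrow \arg\max_{\lambda'} Q(\lambda, \lambda') .
```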

3.2 Modified Baum-Welch Algorithm

We propose a modified Baum-Welch algorithm that exploits the partially annotated sequence of states, where the known states are inferred from the detector output. The detector provides a sequence of detection times together with the states of the Markov chain observed at those times, and the chain restarts in the initial state after each detection. The sequence of observations of the HMM is thus divided into independent subsequences, each with a fixed initial state; all subsequences except the last have a known terminal state defined by the detector, and the last subsequence has an unknown terminal state.
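The division into subsequences can be sketched as follows. The function name `split_at_detections` and the convention that a detection frame closes a subsequence (with the next one starting at the following frame) are assumptions made for illustration.

```python
def split_at_detections(observations, detection_times):
    """Split a frame-indexed observation list into subsequences.

    A subsequence ends at each detection time (inclusive); the next one starts
    at the following frame, after the trackers have been reinitialized.
    Returns (segment, terminal_known) pairs. If frames remain after the final
    detection, the last segment has an unknown terminal state.
    """
    segments = []
    start = 0
    for t in sorted(detection_times):
        segments.append((observations[start:t + 1], True))
        start = t + 1
    if start < len(observations):
        segments.append((observations[start:], False))
    return segments


# Ten frames with detections at frames 3 and 6.
for segment, known in split_at_detections(list(range(10)), [3, 6]):
    print(segment, "known terminal state" if known else "unknown terminal state")
```

Because every subsequence starts from the same fixed initial state, the forward and backward variables can be computed per subsequence and their statistics accumulated.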

Employing this modification in the Baum-Welch algorithm yields a modified auxiliary function (Eq. 7) consisting of two terms. Its maximization can be separated into maximization w.r.t. the transition probability matrix, by maximizing the first term, and w.r.t. the observable densities, by maximizing the second term.

The maximization of Eq. 7 w.r.t. the transition probabilities, constrained so that each row of the transition matrix sums to one, is obtained by the re-estimation formula of Eq. 8. This equation is computed using modified forward and backward variables of the Baum-Welch algorithm that reflect the partially annotated states. For the exact derivation of the formulas, see Appendix A.

3.2.1 Learning Observable Distributions

The maximization of Eq. 7 w.r.t. the observable densities depends on the assumptions made about this system of probability densities. It is usually assumed (e.g. in Rabiner1989; baum1970) that it is a system of probability distributions of the same type that differ only in their parameters.

In the HMMTxD, the components of the multi-dimensional observed random variables are assumed to be conditionally independent and beta-distributed, so the observation densities are products of one-dimensional beta distributions, each with its own shape parameters. In this case, maximization of the second term of Eq. 7 is an iterative procedure using the inverse digamma function, which is computationally very expensive gupta2004.

We propose to estimate the shape parameters of the beta distributions with a generalized method of moments. The classical method of moments is based on the fact that sample moments of independent observations converge to their theoretical counterparts due to the law of large numbers for independent random variables. In the HMMTxD, the observations are not independent. The generalized method of moments is based on the fact that a corresponding sequence derived from the observations forms a sequence of martingale differences, for which the law of large numbers also holds. Using the generalized method of moments gives estimates of the shape parameters (Eq. 9 and Eq. 10).

The re-estimated parameters define a new system of probability densities. The generalized method of moments is described in detail in Appendix B.
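To make the moment-matching idea concrete, the sketch below estimates beta shape parameters from weighted sample moments. It uses the standard moment-matching formulas for the beta distribution; the use of per-frame weights (for example, posterior state probabilities) and the function name `beta_shapes_from_moments` are assumptions made here, and the paper's generalized estimator in Eq. 9 and Eq. 10 may differ in detail.

```python
def beta_shapes_from_moments(values, weights):
    """Estimate beta shape parameters (p, q) by matching weighted moments.

    values  : observable values in (0, 1)
    weights : non-negative weights, e.g. the posterior probability that the
              tracker was correct (or failed) at each frame (an assumption)
    """
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    # A beta distribution requires 0 < var < mean * (1 - mean).
    if var <= 0 or var >= mean * (1 - mean):
        raise ValueError("moments are not compatible with a beta distribution")
    common = mean * (1 - mean) / var - 1  # equals p + q for a beta distribution
    return mean * common, (1 - mean) * common


print(beta_shapes_from_moments([0.25, 0.75], [1.0, 1.0]))  # (1.5, 1.5)
```

Unlike the maximum likelihood solution, this estimate needs no iterative digamma inversion, which is why a separate likelihood check is needed when it is used inside an EM-style loop.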

3.2.2 Algorithm Overview

The complete modified Baum-Welch algorithm is summarized in Alg. 1, where the model parameters are replaced by their re-estimates after each iteration and the steps are repeated until convergence. Note that the re-estimated transition matrix is a maximum likelihood estimate and therefore always increases the likelihood (shown in Rabiner1989), but the observable densities are estimated by the method of moments, so a test on likelihood increase is required (the "if" statement in Alg. 1). In fact, this algorithm structure matches the generalized EM algorithm (GEM) introduced in Dempster77.

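  repeat
    Compute the likelihood of the current model
    Estimate the transition matrix by Eq. 8 and the observable densities by Eq. 9 and Eq. 10
    if the re-estimated observable densities increase the likelihood then
      accept the re-estimated observable densities
  until convergence or the maximum number of iterations is reached
Algorithm 1 Algorithm for HMM parameters learning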

4 Feature-Based Detector

The requirements for the detector are: adjustable operation mode (e.g. set for high precision but possibly low recall), (near) real-time performance and the ability to model pose transformations up to at least similarity (translation, rotation, isotropic scaling). Basically, any detector-like approach can be used and it may vary based on application. We choose to adapt a feature-based detector which has been shown to perform well in image retrieval, object detection and object tracking Pernici2013 tasks.

There are many possible combinations of features and their descriptors with different advantages and drawbacks. We exploit multiple feature types: specifically, Hessian keypoints with the SIFT Lowe2004 descriptor, ORB Rublee2011 with BRISK and ORB with FREAK Ortiz2012 . Each feature type is handled separately, up to the point where point correspondences are established. A weight is assigned to each feature type and is set to be inversely proportional to the number of features on the reference template, to balance the disparity in individual feature numbers.

The detector works as follows. In the initialization step, features are extracted from the inside and the outside of the region specifying the tracked object. Descriptors of the features outside of the region are stored as the background model.

Usually, the input region is not fully occupied by the target; therefore, fast color segmentation Kristan2014a attempts to delineate the object more precisely than the axis-aligned bounding box, to remove the features that are most likely not on the target. This step is not critical for the function of the detector, since the bounding box is a fall-back option. We assume that a minimum fraction of the bounding box is filled with pixels that belong to the target; if the segmentation fails (returns a region that is too small relative to the area of the bounding box), all features in the initial bounding box are used.

Additionally, for each target feature, we use a normal distribution to model the similarity of the feature to other features. The mean and variance of this distribution are estimated in the first frame by randomly sampling 100 features, other than the target feature itself, and computing their distances to the target feature. This allows defining the quality of correspondence matches in a probabilistic manner for each feature, thus getting rid of a global static threshold for the acceptable correspondence distance.
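The per-feature distance model can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the names `fit_distance_model` and `outlier_cdf`, the generic `distance` callback, and the fixed random seed are assumptions.

```python
import math
import random


def fit_distance_model(feature, others, distance, sample_size=100, seed=0):
    """Fit a normal model to distances from `feature` to randomly sampled other features.

    Returns (mean, std) of the sampled distances.
    """
    rng = random.Random(seed)
    sample = rng.sample(others, min(sample_size, len(others)))
    dists = [distance(feature, o) for o in sample]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return mean, math.sqrt(var)


def outlier_cdf(d, mean, std):
    """P(distance <= d) under the normal model of non-corresponding features.

    Assumes std > 0, i.e. the sampled distances are not all identical.
    """
    return 0.5 * (1.0 + math.erf((d - mean) / (std * math.sqrt(2.0))))


# Toy example with scalar "descriptors" and absolute difference as the distance.
mean, std = fit_distance_model(0.0, [1.0, 3.0], lambda a, b: abs(a - b))
print(mean, std)                      # 2.0 1.0
print(outlier_cdf(-1.0, mean, std))   # small value: unlikely to be a random match
```

A small c.d.f. value means the match distance is far below what unrelated features typically produce, which replaces a single global distance threshold with a per-feature test.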

In the detection phase, features are detected and described in the whole image. For each feature from the image, the nearest neighbour feature from the background model and the nearest neighbour feature from the foreground model are computed (in Euclidean space or in Hamming distance metric space, depending on the feature type). A tentative correspondence is formed if the feature match passes the second nearest neighbour test and the probability that the correspondence distance belongs to the outlier distribution is lower than a predefined significance level. This probability is given by the c.d.f. of the normal distribution fitted to the distances of features not corresponding to the matched target feature. Finally, RANSAC estimates the current target pose using a sum of weighted inliers as the cost function for model support, which takes into account the different numbers of features per feature type on the target.
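A weighted inlier score of this kind can be sketched as follows. The proportionality constant of 1 in the weights, the feature-type labels, and the function names are assumptions for illustration only.

```python
def feature_type_weights(template_counts):
    """Weight each feature type inversely to its feature count on the template."""
    return {name: 1.0 / n for name, n in template_counts.items() if n > 0}


def weighted_inlier_score(inlier_types, weights):
    """Sum the weights of the correspondences that are inliers to a pose hypothesis.

    inlier_types : the feature type of each inlier correspondence
    """
    return sum(weights[t] for t in inlier_types)


# Hypothetical counts: few SIFT features, many binary features on the template.
weights = feature_type_weights({"sift": 50, "orb_brisk": 200})
score = weighted_inlier_score(["sift"] * 10 + ["orb_brisk"] * 40, weights)
print(round(score, 6))  # 0.4: both feature types contribute equally here
```

With such weights, a feature type that produces many keypoints cannot dominate the RANSAC support merely by its numbers.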

The decision whether the detected pose is considered correct depends on the number of weighted inliers that support the RANSAC-selected transformation, and it controls the trade-off between precision and recall of the method. This threshold is computed automatically in the first frame of the sequence. The threshold interval (5, 10) and the feature multiplier (0.03) were set experimentally to keep the false positive rate close to zero for most of the testing sequences. Furthermore, majority voting is used to verify that the detection is not in contradiction with the estimated HMM state, i.e. if the current state is one where two or more (a majority of) trackers are correct and the detector is not consistent with them, the detection is not used. This mitigates false positive detections, and therefore erroneous HMM updates, when the trackers work correctly.
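The majority check can be illustrated with the sketch below. The consistency criterion used here (intersection over union of at least 0.5 with every tracker marked correct) and the names `iou` and `accept_detection` are assumptions; the text does not specify how consistency between the detection and the trackers is measured.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def accept_detection(det_box, state, tracker_boxes, min_overlap=0.5):
    """Decide whether a detection may be used to update the HMM.

    state : binary tuple, 1 where the HMM currently believes a tracker is correct
    """
    correct = [b for b, ok in zip(tracker_boxes, state) if ok]
    if len(correct) < 2:
        return True  # one or no tracker is correct: the detector takes precedence
    # Otherwise the detection has to agree with the trackers believed correct.
    return all(iou(det_box, b) >= min_overlap for b in correct)


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (100, 100, 10, 10)]
print(accept_detection((0, 0, 10, 10), (1, 1, 0), boxes))    # True
print(accept_detection((50, 50, 10, 10), (1, 1, 0), boxes))  # False
```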

The true and false positives for the 77 sequences are shown in Fig. 3; the detector works on almost all sequences with a zero false positive rate. The failure cases of this feature-based detector are mostly caused by an imprecise initial bounding box that contains a large portion of structured background (i.e. background where the detector finds features) and by the presence of similar objects in the scene, e.g. in sequences hand2, basketball, singer2.

Figure 3: Frames with detections for the 77-sequence dataset. Green marks show true positive detections and red marks show false positives. The blue line shows the recall of the detector and the blue dashed line shows the average recall over all sequences. The length of each sequence is normalized to a common range.

5 HMMTxD Implementation

To demonstrate the performance of the proposed framework, a pair and a triplet of published short-term trackers were plugged into the framework to show the performance gain from combining different numbers of trackers. As Bailer et al. Bailer2014 pointed out, not every combination of trackers improves the overall performance (e.g. adding a tracking method with a similar failure mode brings no benefit).

We therefore chose methods that have different designs and work with different assumptions (e.g. rigid global motion vs. color mean-shift estimation vs. maximum correlation response). These trackers are the Flock of Trackers (FoT) Vojir2014, the scale adaptive mean-shift tracker (ASMS) Vojir2013 and kernelized correlation filters (KCF) henriques2015. This choice shows that superior performance can be achieved by using simple, fast trackers (above 100 fps) that may not represent the state of the art. The trackers can be arbitrarily replaced depending on the user application or requirements.


The Flock of Trackers (FoT) Vojir2014 evenly covers the object with patches and establishes frame-to-frame correspondence by the Lucas-Kanade method Lucas1981 . The global motion of the target is estimated by RANSAC.

The second tracker is a scale adaptive mean-shift tracker (ASMS) Vojir2013, where the object pose is estimated by minimizing the distance between RGB histograms of the reference and the candidate bounding box. The KCF henriques2015 tracker learns a correlation filter by ridge regression to have a high response to the target object and a low response on the background. The correlation is done in the Fourier domain, which is very efficient.

These three trackers have been selected since they are complementary by design. FoT enforces a global motion constraint and works best for rigid objects with texture. On the other hand, ASMS does not enforce object rigidity and is well suited for articulated or deformable objects, assuming their color distribution is discriminative w.r.t. the background. KCF can be viewed as a tracking-by-detection approach using sliding-window-like scanning.

For each tracker position, two global observable measurements are computed, namely the Hellinger distance between the target template histogram and the histogram of the current position, and the normalized cross-correlation score of the current patch and the target model patch. These target models are initialized in the first frame and then updated exponentially during each positive detection of the detector part. Additionally, each tracker produces its own estimate of performance. For FoT, it is the number of predicted correspondences that support the global model (for details, see Vojir2014). For ASMS, it is the Hellinger distance between its histogram model and the current neighbourhood background (i.e. color similarity of the object and background), and for KCF it is the correlation response of the tracking procedure.
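The two global observables can be computed as in the following sketch. It assumes histograms given as lists of non-negative bin counts and patches given as flat lists of intensities, and it uses the zero-mean variant of normalized cross-correlation; whether the paper uses exactly these variants is not stated, so both function names and these choices are assumptions.

```python
import math


def hellinger_distance(hist_a, hist_b):
    """Hellinger distance between two histograms; the result lies in [0, 1]."""
    sa, sb = sum(hist_a), sum(hist_b)
    # Bhattacharyya coefficient of the normalized histograms.
    bc = sum(math.sqrt((a / sa) * (b / sb)) for a, b in zip(hist_a, hist_b))
    return math.sqrt(max(0.0, 1.0 - bc))


def normalized_cross_correlation(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    n = len(patch_a)
    ma, mb = sum(patch_a) / n, sum(patch_b) / n
    num = sum((a - ma) * (b - mb) for a, b in zip(patch_a, patch_b))
    den = math.sqrt(sum((a - ma) ** 2 for a in patch_a)
                    * sum((b - mb) ** 2 for b in patch_b))
    return num / den if den > 0 else 0.0


print(hellinger_distance([1, 0], [0, 1]))  # 1.0: no overlap between histograms
print(hellinger_distance([2, 2], [1, 1]))  # 0.0: identical after normalization
```

The Hellinger distance already lies in the unit interval, whereas a correlation score in [-1, 1] would need rescaling, for example `(ncc + 1) / 2`, before being modeled by a beta distribution; that rescaling is likewise an assumption.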

Figure 4: CVPR2013 OPE benchmark comparison of individual trackers and their combination in the proposed HMMTxD. 2-HMMTxD denotes the combination of the FoT and ASMS trackers and 3-HMMTxD is a combination of the FoT, ASMS and KCF trackers. Det stands for the proposed detector. The right plot shows a simple combination of individual trackers with the proposed detector. The suffix "-D" refers to the combination with the detector.

6 Experiments

The HMMTxD was compared with state-of-the-art methods on two standard benchmarks and on a dataset, TV77, containing 77 public video sequences collected from tracking-related publications. The dataset exhibits wider diversity of content and variability of conditions than the benchmarks.

Parameters of the method were fixed for all the experiments. In the HMM, the initial beta distribution shape parameters were set to fixed values for the correct state (1) and for the fail state (0), identical for all observations, and the transition matrix was initialized to prefer staying in the current state, with dedicated values on the diagonal, in the first column, in the last column and in the last row. The matrix is normalized so that rows sum to one. States in the matrix are binary encoded starting from the left column. The number of iterations of the Baum-Welch algorithm was fixed.

The processing speed on the VOT2015 dataset is (in frames per second) minimum 1.03, maximum 33.72 and average 10.83 measured on a standard notebook with Intel Core-i7 processor. This speed is mostly affected by the number of features detected in the images which correlates to the resolution of the image (in the dataset the range is from 320x180 to 1280x720).

First, we compare the performance of the individual parts of the HMMTxD framework (i.e. the KCF, ASMS and FoT trackers) and their combination via the HMM as proposed in this paper. Two variants of HMMTxD are evaluated: 2-HMMTxD refers to the combination of the FoT and ASMS trackers, and 3-HMMTxD to the combination of all mentioned trackers. We also show the benefit of the proposed detector when simply combined with the individual trackers in such a way that the tracker is reinitialized whenever the detector fires. Figure 4 shows the benefit gained from the detector and the further consistent improvement achieved by the combination of the trackers. A more detailed per-sequence analysis on the TV77 dataset (Fig. 5 and Fig. 6) shows more clearly the efficiency of learning tracker performance online. In almost all sequences, the HMMTxD is able to identify and learn which trackers work correctly and achieves the performance of at least the best tracker or higher (e.g. motocross1, skating1(low), Volkswagen, singer1, pedestrian3, surfer). The most notable failure cases are caused by detector failures, e.g. in sequences singer2, woman, skating1, basketball, girl_mov.

In all other experiments, the abbreviation HMMTxD refers to the combination of all 3 trackers.

Figure 5: Per sequence analysis of the single trackers (i.e. KCF, ASMS, FoT) and the proposed HMMTxD. The average recall is shown by the dashed lines (precise number is in the legend). Black circles mark grayscale sequences. The sequences are ordered by HMMTxD performance.
Figure 6: Per sequence analysis of the single trackers combined with the detector (i.e. KCF-D, ASMS-D, FoT-D) and the proposed HMMTxD. The average recall is shown by the dashed lines (precise number is in the legend). Black circles mark grayscale sequences. The sequences are ordered by HMMTxD performance.

Evaluation on the CVPR2013 Benchmark. The CVPR2013 Benchmark Wu2013 contains 50 video sequences. Results on the benchmark have been published for about 30 trackers. The benchmark defines three types of experiments: (i) one-pass evaluation (OPE), where a tracker initialized in the first frame is run to the end of the sequence, (ii) temporal robustness evaluation (TRE), where the tracker is initialized and starts at a random frame, and (iii) spatial robustness evaluation (SRE), where the initialization is perturbed spatially. Performance is measured by precision (spatial accuracy, i.e. center distance between the ground truth and the reported bounding box) and success rate (the number of frames where the overlap with the ground truth was higher than a threshold). The results are visualized in Fig. 7, where only the results of the 10 top performing trackers are plotted. Together with the trackers from this benchmark, we also added the MEEM zhang2014 tracker, which is a recent state-of-the-art tracker. The proposed HMMTxD outperforms all trackers in success rate in all three experiments. Its precision is comparable to that of MEEM zhang2014, the top performing tracker in terms of precision. HMMTxD significantly outperforms the OPE results reported in Wang et al. wangg14, where the 5 top performing trackers from this particular benchmark were used for combination (other experiments were not reported in that paper).
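The two performance measures can be sketched as follows, assuming per-frame overlaps with the ground truth and per-frame center location errors have already been computed. The function names, the use of fractions rather than raw frame counts, and the 21 evenly spaced thresholds for the area under the success curve are assumptions, not details taken from the benchmark toolkit.

```python
def success_rate(overlaps, threshold):
    """Fraction of frames whose overlap with the ground truth exceeds `threshold`."""
    return sum(o > threshold for o in overlaps) / len(overlaps)


def precision(center_errors, threshold):
    """Fraction of frames whose center location error is within `threshold` pixels."""
    return sum(e <= threshold for e in center_errors) / len(center_errors)


def success_auc(overlaps, steps=21):
    """Average success rate over evenly spaced overlap thresholds in [0, 1]."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    return sum(success_rate(overlaps, t) for t in thresholds) / steps


overlaps = [0.0, 0.4, 0.6, 0.9]
print(success_rate(overlaps, 0.5))        # 0.5
print(precision([5, 15, 25, 100], 20))    # 0.5
```

Sweeping the threshold yields the success and precision plots, and averaging the success rate over thresholds gives a single summary number per tracker.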

Figure 7: Evaluation of HMMTxD on the CVPR2013 Benchmark Wu2013 . The top row shows the success rate as a function of the overlap threshold. The bottom row shows the precision as a function of the localization error threshold. The number in the legend is AUC, the area under ROC-curve, which summarizes the overall performance of the tracker for each experiment.

The VOT2013 benchmark Kristan2013 evaluates trackers on a collection of 16 sequences carefully selected from a large pool by a semi-automatic clustering method. For comparison, results of 27 tracking methods are available; the added MEEM tracker was evaluated by us using the default settings of the publicly available source code. Performance is measured by accuracy (the average overlap with the ground truth) and robustness (the number of re-initializations the tracker needs in order to track the whole sequence). The average rank of a tracker is used as the overall performance indicator.
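As a rough illustration of these two measures, the sketch below computes them from a list of per-frame overlaps. It is a simplification and not the VOT toolkit: the function name is hypothetical, every frame with zero overlap is assumed to trigger exactly one re-initialization, and failure frames are simply excluded from the accuracy average.

```python
import numpy as np

def vot_style_scores(overlaps):
    """Return (accuracy, robustness) for one sequence.

    accuracy   - mean overlap over the frames that were tracked successfully
    robustness - number of failures, i.e. frames with zero overlap
                 (assumed to cause one re-initialization each)
    """
    ov = np.asarray(overlaps, dtype=float)
    failed = ov <= 0.0
    robustness = int(failed.sum())
    accuracy = float(ov[~failed].mean()) if (~failed).any() else 0.0
    return accuracy, robustness

# Hypothetical per-frame overlaps with two failures.
accuracy, robustness = vot_style_scores([0.8, 0.6, 0.0, 0.7, 0.0, 0.9])
```

Under this simplification a single frame with zero overlap already counts as a failure, which is why a tracker that reacts to a problem with a short delay can score well on accuracy and still lose robustness.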

(a) baseline experiment
(b) region-noise experiment
Figure 8: Evaluation of HMMTxD on the VOT 2013 Benchmark Kristan2013 . HMMTxD result is shown as the red circle. The left plot shows the ranking in accuracy (vertical axis) and robustness (horizontal axis) and the right plot shows the raw average values of accuracy and robustness (normalized to the interval). For both plots the top right corner is the best performance.

In this benchmark, the proposed HMMTxD clearly achieves the best accuracy (Fig. 8). With less than one re-initialization per sequence, it performs slightly worse in terms of robustness, for two reasons.

First, the HMM recognizes a tracker problem with a delay before switching to another tracker, and here even a single frame with zero overlap with the ground truth is penalized. Second, the VOT evaluation protocol requires re-initialization after a failure and forgetting all previously learned models (VOT2013 refers to this as causal tracking); the learned performance of the trackers is therefore lost and has to be learned from scratch.

The results for the baseline and region-noise experiments are shown in Fig. 8. Note that the ranking of the methods differs from the original publication: two new methods (HMMTxD and MEEM) were added, which changed the relative ranking. The top three performing trackers and their average ranks are HMMTxD (), PLT (), LGTpp Xiao2013 (). The MEEM tracker ends up in fifth place with average rank . The rankings were obtained with the toolkit provided by VOT, using default settings for the baseline and region-noise experiments.

The second best performing method on VOT2013 is the unpublished PLT, for which only a short description is available in Kristan2013. PLT is a variation of structural SVM that uses multiple features (color, gradients). STRUCK Hare2011 and MEEM zhang2014 are methods similar to PLT, based on SVM classification. We compared these methods with HMMTxD on the diverse set of 77 videos, along with TLD Kalal2012, which has a similar design to HMMTxD. HMMTxD outperforms all these methods by a large margin on average recall, measured as the number of frames where the tracker's overlap with the ground truth is higher than a fixed threshold, averaged over all sequences. Results are shown in Fig. 9. A qualitative comparison of these state-of-the-art methods is shown in Fig. 10. Even for sequences with lower recall (e.g. bird_1, skating2), HMMTxD is able to follow the object of interest.

Figure 9: Evaluation of state-of-the-art trackers on the TV77 dataset in terms of recall, i.e. number of correctly tracked frames. The average recall is shown by the dashed lines (precise number is in the legend). Black circles mark grayscale sequences. The sequences are ordered by HMMTxD performance.
Figure 10: Qualitative comparison of the state-of-the-art trackers on challenging sequences from the TV77 dataset (from top bird_1, drunk2, singer1, skating2, surfer, Vid_J).

7 Conclusions

A novel method called HMMTxD for the fusion of multiple trackers has been proposed. The method utilizes an on-line trained HMM to estimate the states of the individual trackers and to fuse different types of observables provided by the trackers. HMMTxD outperforms its constituent parts (FoT, ASMS, KCF, the detector and their combinations) by a large margin and demonstrates the efficiency of the HMM with a combination of three trackers.

HMMTxD outperforms all methods included in the CVPR2013 benchmark and performs favorably against the most recent state-of-the-art tracker. HMMTxD also outperforms all methods of the VOT2013 benchmark in accuracy while maintaining very good robustness, and ranks first overall. Experiments conducted on the diverse TV77 dataset show that HMMTxD outperforms the state-of-the-art MEEM, STRUCK and TLD methods, which are similar in design, by a large margin. The processing speed of the HMMTxD is frames per second on average, which is comparable with other complex tracking methods.


The research was supported by the Czech Science Foundation Project GACR P103/12/G084 and by the Technology Agency of the Czech Republic project TE01020415 (V3C – Visual Computing Competence Center).

Appendix A Forward-Backward Procedure for Modified Baum-Welch Algorithm

Let us assume the HMM with possible states , the matrix of state transition probabilities , the vector of initial state probabilities , the initial state , a sequence of observations and the system of conditional probability densities of observations conditioned on .

Let be a sequence of detection times, be observed states of Markov chain, marked by the detector, and for .

The forward variable for the Baum-Welch algorithm is defined as follows. Let and


and for


For the forward variable is in principle the same as above with . So


The backward variable for is


where and for and


For the backward variable is in principle the same as above where for .

Given the forward and backward variables, we get the following probabilities, that are used to update parameters of HMM. For


and for


The final equation for the update of transition probabilities of HMM is as follows.

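The procedure in this appendix can be illustrated with a minimal sketch of a scaled forward-backward pass in which the hidden state is known at some time steps. This is not a transcription of the equations above: the function name, the argument layout (`B[t, i]` holds the observation density of state `i` evaluated at the observation of frame `t`) and the masking trick used to clamp the states marked by the detector are assumptions made for illustration. The sketch also assumes that every marked state is reachable under the current model, so that no scaling factor is zero.

```python
import numpy as np

def forward_backward_partial(pi, A, B, known):
    """Scaled forward-backward pass with some hidden states observed.

    pi    : (N,)   initial state probabilities
    A     : (N, N) transition probabilities, A[i, j] = P(j at t+1 | i at t)
    B     : (T, N) observation likelihoods per frame and state
    known : dict {t: state index} for frames where the state is marked
    Returns state posteriors gamma (T, N), pairwise posteriors xi
    (T-1, N, N) and the re-estimated transition matrix.
    """
    T, N = B.shape
    # Clamp marked frames: only the marked state keeps a non-zero likelihood.
    mask = np.ones((T, N))
    for t, s in known.items():
        mask[t] = 0.0
        mask[t, s] = 1.0
    Bm = B * mask

    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    scale = np.zeros(T)

    # Forward recursion, normalized at every step for numerical stability.
    alpha[0] = pi * Bm[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * Bm[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward recursion, reusing the forward scaling factors.
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (Bm[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (Bm[1:] * beta[1:])[:, None, :]) / scale[1:, None, None]

    # Transition update: expected transitions i -> j over expected visits of i.
    denom = xi.sum(axis=(0, 2))
    A_new = np.where(denom[:, None] > 0,
                     xi.sum(axis=0) / np.maximum(denom, 1e-300)[:, None],
                     A)
    return gamma, xi, A_new

# Toy example (hypothetical numbers): two states, five frames,
# the state at frame 2 is marked as state 1.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.1], [0.7, 0.2], [0.6, 0.3], [0.2, 0.9], [0.1, 0.7]])
gamma, xi, A_new = forward_backward_partial(pi, A, B, known={2: 1})
```

Clamping a state at a marked frame is equivalent to conditioning the whole sequence on that state, so the posteriors before and after the marked frame are coupled only through it. Rows of `A` belonging to states that are never visited are left unchanged.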

Appendix B Generalized Method of Moments

For a simplification let us assume HMM with one-dimensional observed random variables . The sequence is a martingale difference series where


Under the assumption that are uniformly bounded random variables, i.e. for all , the strong law of large numbers for a sum of martingale differences can be used (see Theorem 2.19 in hall1980). So


Let us denote for and the estimate of based on the modified method of moments. The estimate is a solution of the following equation w.r.t.


Having one equation for unknown variables, it is necessary to add some constraints to get a unique solution. We propose to minimize


w.r.t. giving


which satisfy the moment equation (32). The same way of reasoning can be used for higher moments of . For example using we get estimates for for ,


In the HMMTxD, -dimensional observed random variables are assumed, each of them having a beta distribution and being conditionally independent. There are well-known relations for the mean value

and a variance

of a random variable having beta distribution and its shape parameters




Using the modified method of moments gives








If we assume in our model that for some then









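The moment matching for the beta distribution can be sketched as follows. The sketch relies only on the standard relations between the mean, the variance and the two shape parameters of a beta distribution, written here as `p` and `q` (the notation is an assumption). The weighted moment estimates, with weights standing for the posterior probabilities of a given state, are a common form assumed for illustration and are not necessarily identical to the estimator derived above.

```python
import numpy as np

def weighted_moments(x, w):
    """Weighted mean and variance of observations x with non-negative weights w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    mean = np.sum(w * x) / np.sum(w)
    var = np.sum(w * (x - mean) ** 2) / np.sum(w)
    return float(mean), float(var)

def beta_shape_from_moments(mean, var):
    """Shape parameters (p, q) of a beta distribution with the given moments.

    Inverts the standard relations
        mean = p / (p + q)
        var  = p * q / ((p + q)**2 * (p + q + 1))
    which requires 0 < mean < 1 and 0 < var < mean * (1 - mean).
    """
    if not (0.0 < mean < 1.0) or not (0.0 < var < mean * (1.0 - mean)):
        raise ValueError("moments are not attainable by a beta distribution")
    common = mean * (1.0 - mean) / var - 1.0   # equals p + q
    return mean * common, (1.0 - mean) * common

# Toy example (hypothetical numbers): three observations in (0, 1)
# with weights playing the role of state posteriors.
mean, var = weighted_moments([0.2, 0.4, 0.6], [1.0, 2.0, 1.0])
p, q = beta_shape_from_moments(mean, var)
```

A closed-form update of this kind avoids the iterative optimization that a maximum likelihood fit of the beta shape parameters would need, which matters when the model is re-estimated on-line.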

  • (1) A. Yilmaz, O. Javed, M. Shah, Object tracking: A survey, ACM Computing Surveys, 2006.
  • (2) A. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: an experimental survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2013).
  • (3) J. Kwon, K. M. Lee, Tracking of a non-rigid object via patch-based dynamic appearance modeling and adaptive basin hopping Monte Carlo sampling, in: Computer Vision and Pattern Recognition, 2009, pp. 1208–1215.
  • (4) M. Godec, P. M. Roth, H. Bischof, Hough-based tracking of non-rigid objects, in: International Conference on Computer Vision, 2011.
  • (5) L. Cehovin, M. Kristan, A. Leonardis, Robust visual tracking using an adaptive coupled-layer visual model, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (4) (2013) 941–953.
  • (6) H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: Learning where the object might be, in: Computer Vision and Pattern Recognition, 2010, pp. 1285–1292.
  • (7) X. Zhou, Y. Lu, Abrupt motion tracking via adaptive stochastic approximation Monte Carlo sampling, in: Computer Vision and Pattern Recognition, 2010, pp. 1847–1854.
  • (8) F. Pernici, A. D. Bimbo, Object tracking by oversampling local features, IEEE Transactions on Pattern Analysis and Machine Intelligence 99 (PrePrints) (2013) 1.
  • (9) Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7) (2012) 1409–1422.
  • (10) J. Santner, C. Leistner, A. Saffari, T. Pock, H. Bischof, PROST Parallel Robust Online Simple Tracking, in: Computer Vision and Pattern Recognition, San Francisco, CA, USA, 2010.
  • (11) J. Zhang, S. Ma, S. Sclaroff, MEEM: robust tracking via multiple experts using entropy minimization, in: Proc. of the European Conference on Computer Vision, 2014.
  • (12) N. Wang, D.-Y. Yeung, Ensemble-based tracking: Aggregating crowdsourced structured time series data, in: T. Jebara, E. P. Xing (Eds.), Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 1107–1115.
  • (13) C. Bailer, A. Pagani, D. Stricker, A superior tracking approach: Building a strong tracker through fusion, in: European Conference on Computer Vision, Lecture Notes in Computer Science, 2014.
  • (14) Y. Yuan, H. Yang, Y. Fang, W. Lin, Visual object tracking by structure complexity coefficients, Multimedia, IEEE Transactions on 17 (8) (2015) 1125–1136.
  • (15) J. H. Yoon, D. Y. Kim, K.-J. Yoon, Visual tracking via adaptive tracker selection with multiple features, in: Proceedings of the 12th European Conference on Computer Vision - Volume Part IV, ECCV’12, 2012, pp. 28–41.
  • (16) M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. P. Pflugfelder, G. Fernández, G. Nebehay, F. Porikli, L. Cehovin, A novel performance evaluation methodology for single-target trackers, arXiv 2015 abs/1503.01313.
  • (17) A. Gupta, S. Nadarajah, Handbook of Beta Distribution and Its Applications, Statistics: A Series of Textbooks and Monographs, 2004.
  • (18) L. Rabiner, A tutorial on hidden markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257–286.
  • (19) L. E. Baum, T. Petrie, G. Soules, N. Weiss, A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, The Annals of Mathematical Statistics 41 (1) (1970) 164–171.
  • (20) A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, series B 39 (1) (1977) 1–38.
  • (21) D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
  • (22) E. Rublee, V. Rabaud, K. Konolige, G. Bradski, Orb: An efficient alternative to sift or surf, in: International Conference on Computer Vision, ICCV ’11, Washington, DC, USA, 2011, pp. 2564–2571.
  • (23) R. Ortiz, Freak: Fast retina keypoint, in: Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 2012, pp. 510–517.
  • (24) M. Kristan, J. Perš, V. Sulic, S. Kovacic, A graphical model for rapid obstacle image-map estimation from unmanned surface vehicles, in: Asian Conference on Computer Vision. Accepted, to be published., 2014.
  • (25) T. Vojir, J. Matas, The enhanced flock of trackers, in: Registration and Recognition in Images and Videos, Vol. 532 of Studies in Computational Intelligence, 2014, pp. 113–136.
  • (26) T. Vojir, J. Noskova, J. Matas, Robust scale-adaptive mean-shift for tracking, in: Image Analysis, Vol. 7944 of Lecture Notes in Computer Science, 2013, pp. 652–663.
  • (27) J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3) (2015) 583–596.
  • (28) B. D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: International Joint Conference on Artificial Intelligence, 1981.
  • (29) Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in: Computer Vision and Pattern Recognition, 2013, pp. 2411–2418.
  • (30) M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, F. Porikli, L. Cehovin, G. Nebehay, G. Fernandez, T. Vojir, et al., The visual object tracking vot2013 challenge results, in: The IEEE International Conference on Computer Vision (ICCV) Workshops, 2013.
  • (31) J. Xiao, R. Stolkin, A. Leonardis, An enhanced adaptive coupled-layer lgtracker++, in: Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, 2013, pp. 137–144.
  • (32) S. Hare, A. Saffari, P. H. S. Torr, Struck: Structured output tracking with kernels, in: International Conference on Computer Vision, 2011, pp. 263–270.
  • (33) P. Hall, C. Heyde, Martingale limit theory and its application, Probability and mathematical statistics, Academic Press, 1980.