1 Introduction
In the last thirty years, a large number of diverse visual tracking methods have been proposed Yilmaz2006 ; smeulder2013 . The methods differ in the formulation of the problem, in the assumptions made about the observed motion, in the optimization techniques, the features used, the processing speed, and the application domain. Some methods focus on specific challenges such as tracking of articulated or deformable objects Kwon2009 ; godec2011 ; Cehovin2013 , occlusion handling grabner2010 , abrupt motion Zhou2010 or long-term tracking Pernici2013 ; Kalal2012 .
Three observations motivate the presented research. First, most trackers perform poorly when run outside the scenario they were designed for. Second, some trackers make different and complementary assumptions, and their failures are not highly correlated (such trackers are called complementary in this paper). Finally, even fairly complex well-performing trackers run at frame rate or faster on standard hardware, opening the possibility of running multiple trackers concurrently and still in or near real time.
We propose a novel methodology that exploits a hidden Markov model (HMM) for the fusion of non-uniform observables and pose predictions of multiple complementary trackers, using an online-learned high-precision detector. Non-uniform observables, in this sense, means that each tracker can produce its own type of "confidence estimate", which may not be directly comparable across trackers.
The HMM, trained in an unsupervised manner, estimates the state of the trackers – failed or operating correctly – and outputs the pose of the tracked object, taking into account the past performance and observations of the trackers and the detector. The HMM treats the detector output as correct if it does not contradict its current most probable state in which the majority of trackers are correct. This limits the cases where the HMM would be wrongly updated by a false detection. For the potentially many frames where reliable detector output is not available, it combines the trackers. The detector is trained on the first image and interacts with the learning of the HMM by partially annotating the sequence of HMM states at the times of verified detections. The recall of the detector is not critical, but it affects the learning rate of the HMM and the long-term properties of the HMMTxD method, i.e. its ability to re-initialize trackers after occlusions or object disappearance.
Related work. The most closely related approaches include Santner et al. Santner2010 , where three tracking methods with different rates of appearance adaptation are combined to prevent drift due to incorrect model updates. The approach uses simple, hard-coded rules for tracker selection. Kalal et al. Kalal2012 combine a tracking-by-detection method with a short-term tracker that generates so-called P-N events to learn new object appearance. The output is defined either by the detector or by the tracker, based on visual similarity to the learned object model. Both these methods employ predefined rules to make decisions about the object pose and use one type of measurement, a certain form of similarity between the object and the estimated location. In contrast, HMMTxD continuously and causally learns the performance statistics of the individual parts of the system and fuses multiple "confidence" measurements in the form of probability densities of observables in the HMM. Zhang et al.
zhang2014 use a pool of multiple classifiers learned over different time spans and choose the one that maximizes an entropy-based cost function. This method addresses the problem of model drift due to wrong model updates, but the failure modes inherent to the classifier itself remain the same. In contrast, the proposed method allows combining diverse tracking methods with different inherent failure modes and different learning strategies, balancing their weaknesses.
Similarly to the proposed method, Wang et al. wangg14 and Bailer et al. Bailer2014 fuse different out-of-the-box tracking methods. Bailer et al. combine the outputs of multiple tracking algorithms offline. There is no interaction between the trackers, which for instance implies that the method avoids failure only if at least one method correctly tracks the whole sequence. Wang et al. use a factorial hidden Markov model and a Bayesian approach. The state space of their factorial HMM is the set of potential object positions and is therefore very large. The model contains a probabilistic description of the object motion based on a particle filter. Trackers interact by re-initializing those with low reliability to the pose of the most confident one. Yuan et al. Yuan2015 use an HMM in the same setup, but rather than merging multiple tracking methods, they focus on modeling the temporal change of the target appearance in the HMM framework by introducing observational dependencies. In contrast, the HMMTxD method is online, with tracker interaction mediated by a high-precision object detector that supervises tracker re-initializations, which happen on the fly. Appearance modeling is performed inside each tracker, and HMMTxD captures the relation between the confidence reported by a tracker and its actual performance, validated by the object detector, through the observable distributions. Moreover, the HMMTxD confidence estimation is motion-model free, which prevents biases towards trackers with a particular motion model.
Yoon et al. Yoon2012 combine multiple trackers in a particle filter framework. This approach models observables and the transition behavior of individual trackers, but the trackers are self-adapting, which makes the approach prone to wrong model updates. The adaptation of the HMMTxD model is supervised by a detector set to a specific mode of operation – high precision – alleviating the incorrect-update problem.
The contributions of the paper are: a novel method for the fusion of multiple trackers based on HMMs using non-uniform observables; a simple, and so far unused, unsupervised method for HMM training in the context of tracking; a tunable feature-based detector with a very low false positive rate; and a tracking system that shows state-of-the-art performance.
2 Fusing Multiple Trackers
HMMTxD uses a hidden Markov model (HMM) to integrate the pose and observational confidence of different trackers and a detector, and updates its own confidence estimates that in turn define the pose that it outputs. In the HMM, each tracker is modeled as working correctly (1) or incorrectly (0). The HMM poses no constraints on the definition of tracker correctness; we adopted target overlap above a threshold. Having n trackers at our disposal, the set of all possible states is {0, 1}^n and the initial state is the all-correct state (1, …, 1). Note that the trackers are not assumed to be independent, because independence of tracker correctness is not a realistic assumption. For example, if the tracking problem is relatively easy, all trackers tend to be correct, and in the case of occlusion all tend to be incorrect (see the analysis in votpamiarxiv2015 ). The number of states grows exponentially with the number of trackers. However, we do not consider this a significant issue – due to the "real-time" requirements of tracking, the need to combine more than a small number of trackers is unlikely.
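To make the joint state space concrete, a minimal sketch (function and variable names are ours, not the paper's):

```python
from itertools import product

def tracker_states(n_trackers):
    """Enumerate all joint correctness states for n trackers.

    Each state is a tuple of 0/1 flags (1 = tracker works correctly).
    The state count grows as 2**n, and the initial state assumes all
    trackers are correct.
    """
    states = list(product((0, 1), repeat=n_trackers))
    initial = (1,) * n_trackers
    return states, initial

states, s0 = tracker_states(3)
```

For three trackers this yields 8 states, which keeps the HMM small enough for real-time use.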
The HMMTxD method overview is illustrated in Fig. 1. Each tracker provides an estimate of the object pose and a vector of observables, which may contain a similarity measure to some model (such as normalized cross-correlation to the initial image patch, distance of the template and current histograms at a given position, etc.) or any other estimate of the tracker performance. These observables relate the current confidence of each tracker to the HMM. Each individual observable depends only on one particular tracker and its correctness; hence, the observables are assumed to be conditionally independent given the state of the HMM (which encodes tracker correctness).
In general, there are no constraints on observable values; however, in the proposed HMM the observable values are required to be normalized to the [0, 1] interval. The observables are modeled as beta-distributed random variables (Eq. 1) and their parameters are estimated online. The beta distribution was chosen for its versatility: practically any kind of unimodal random variable can be modeled by a beta distribution, i.e. for any choice of lower and upper quantiles, a beta distribution exists satisfying the given quantile constraint gupta2004 . Learning the parameters of the beta distributions online is crucial for adaptability to particular tracking scenes, where the observable values from different trackers may be biased by scene properties, and for adapting to different types of tracker observables and their correlation with the "real" tracker performance. For example, taking correlation with the initial target patch as an observable for one tracker and color histogram distance to the initial target for a second tracker, the correlation between their values and the performance of the tracker may differ depending on object rigidity and the color distribution of the object and background.
The HMM is parameterized by the pair (A, B), where A are the probabilities of state transition and B are the beta distributions of observables with shape parameters α, β > 0 and density defined for x ∈ (0, 1)

(1) f(x; α, β) = x^{α−1} (1 − x)^{β−1} / B(α, β),

where B(α, β) is the beta function.
Since the goal is real-time tracking without any specific preprocessing, learning of the HMM parameters has to be done online. Towards this goal, the object detector, which is set to an operating mode with a low false positive rate, is utilized to partially annotate the sequence of hidden states. In contrast to the classical HMM setting, where only a sequence of observations is available, we are in a semi-supervised setting and have a time sequence of states of the Markov chain observed at the detection times t₁ < t₂ < … < t_D, with the Markov chain restarting in the all-correct state (1, …, 1) at each detection time, since the trackers are then re-initialized to a common object pose. This information is provided by the detector. The HMM parameters are learned by a modified Baum-Welch algorithm run on the observations and the annotated sequence of states. The partial annotation and the HMM parameter estimation update are done strictly online. The output of the HMMTxD is the average bounding box of the correct trackers of the current most probable state. The forward-backward procedure Rabiner1989 for HMMs is used to calculate the probability of each state at time t (see Eqs. 15-21), and the output state is the one for which
(2) P(Sₜ = s | o₁, …, oₜ)

is maximal. This probability is computed using Eq. 19 and maximized w.r.t. s. At detection times, Eq. 2 is evaluated with the state observed by the detector. This ensures that the algorithm outputs a pose for each frame, which is required by most benchmark protocols. An illustration of the tracking process and of the HMM internals is shown in Fig. 2. Theoretically, the parameters of the HMM could be updated after each frame. However, in our implementation, learning takes place only at frames where the detector positively detects the object, i.e. on the sequence of states starting and ending with an observed state inferred by the detector (if pure online fusion is not required, future observations can also be used to determine the probability of each state). The detector is used only if the detection pose does not contradict the pose of the current most probable state in which the majority of trackers are correct. This ensures that even when the detector makes a mistake, the HMM is not wrongly updated. In states where one or none of the trackers are correct, the detector gets precedence.
3 Learning the Hidden Markov Model
For learning the parameters of the HMM, maximum likelihood estimation is employed; however, maximizing the likelihood function is a complicated task that cannot be solved analytically. In the proposed method, the Baum-Welch algorithm baum1970 is adapted. The Baum-Welch algorithm is a widespread iterative procedure for estimating the parameters of an HMM, where each iteration increases the likelihood function but, in general, convergence to the global maximum is not guaranteed. The Baum-Welch algorithm is in fact an application of the EM (Expectation-Maximization) algorithm Dempster77 .

3.1 Classical Baum-Welch Algorithm
Let us assume an HMM with possible states S, the matrix of state transition probabilities A = (a_{ij}), the vector of initial state probabilities, the initial state s₀, a sequence of observations o₁, …, o_T and the system of conditional probability densities of observations conditioned on the state

(3) b_s(oₜ) = p(oₜ | Sₜ = s; λ),

where Sₜ are random variables representing the state at time t, and λ denotes the parameter set of the model.
Let us denote

(4) Q(λ, λ′) = Σ_{q ∈ S^T} P(q, o | λ) log P(q, o | λ′),

where S^T is the set of all possible T-tuples of states and q is one sequence of states.
According to Theorem 2.1 in baum1970 ,

(5) Q(λ, λ′) ≥ Q(λ, λ) ⇒ P(o | λ′) ≥ P(o | λ),

and the equality holds if and only if λ′ = λ. The classical Baum-Welch algorithm repeats the following steps until convergence:

1. Compute λ′ = argmax_{λ*} Q(λ, λ*).

2. Set λ = λ′.
3.2 Modified BaumWelch Algorithm
We propose a modified Baum-Welch algorithm that exploits the partially annotated sequence of states, where the known states are inferred from the detector output. Let t₁ < t₂ < … < t_D be the sequence of detection times and let the states observed at these times, marked by the detector, be the all-correct state (1, …, 1) for d = 1, …, D. The sequence of observations of the HMM is thus divided into D + 1 independent subsequences, each with a fixed initial state: the first D subsequences with a known terminal state defined by the detector and the last subsequence with an unknown terminal state.
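The splitting of the observation sequence at detection times can be sketched as follows (a minimal illustration; the exact index convention is our assumption):

```python
def split_at_detections(T, det_times):
    """Split frames 0..T-1 into D+1 subsequences delimited by the D
    detection times; each subsequence starts in the all-correct state,
    and the first D end in a state observed by the detector (boundary
    frames are shared between neighbouring subsequences).

    Assumes every detection time lies strictly inside (0, T-1).
    """
    bounds = [0] + sorted(det_times) + [T - 1]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

segs = split_at_detections(10, [3, 7])
```

With detections at frames 3 and 7 of a 10-frame sequence, this yields three subsequences.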
The following equations are obtained by employing the modification to the Baum-Welch algorithm,

(6) Q(λ, λ′) = Σ_{q} P(q, o | λ) log P(q, o | λ′)

(7) Q(λ, λ′) = Σ_{q} P(q, o | λ) Σ_{t} log a′_{q_{t−1} q_t} + Σ_{q} P(q, o | λ) Σ_{t} log b′_{q_t}(oₜ),

where the sums over q run separately over each annotated subsequence. The maximization of Q can be separated into maximization w.r.t. the transition probability matrix, by maximizing the first term, and w.r.t. the observable densities, by maximizing the second term.
The maximization of Eq. 7 w.r.t. the transition probabilities, constrained by Σ_j a_{ij} = 1, is obtained by re-estimating the parameters as follows:

(8) â_{ij} = Σₜ P(S_{t−1} = i, Sₜ = j | o, λ) / Σₜ P(S_{t−1} = i | o, λ).

This equation is computed using the modified forward and backward variables of the Baum-Welch algorithm, which reflect the partially annotated states. For the exact derivation of the formulas, see Appendix A.
3.2.1 Learning Observable Distributions
The maximization of Eq. 7 w.r.t. the observable densities depends on the assumptions about the system of probability densities b_s. It is usually assumed (e.g. in Rabiner1989 ; baum1970 ) that b_s is a system of probability distributions of the same type, differing only in their parameters. In the HMMTxD, the multi-dimensional observed random variables are assumed conditionally independent and beta-distributed, so the densities b_s are products of one-dimensional beta distributions with shape parameters α, β. In this case, maximization of the second term of Eq. 7 is an iterative procedure using the inverse digamma function, which is computationally expensive gupta2004 .
We propose to estimate the shape parameters of the beta distributions with a generalized method of moments. The classical method of moments is based on the fact that sample moments of independent observations converge to their theoretical counterparts due to the law of large numbers for independent random variables. In the HMMTxD, the observations are not independent. The generalized method of moments is based on the fact that the centered observations form a sequence of martingale differences, for which the law of large numbers also holds. Using the generalized method of moments gives the estimates of the shape parameters

(9) α̂ = Ê (Ê(1 − Ê)/V̂ − 1)

and

(10) β̂ = (1 − Ê)(Ê(1 − Ê)/V̂ − 1),

where

(11) Ê = (1/T) Σₜ oₜ

and

(12) V̂ = (1/T) Σₜ (oₜ − Ê)².
Let us denote the system of probability densities with re-estimated parameters as B̂. The generalized method of moments is described in detail in Appendix B.
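The moment relations can be illustrated with the classical (unweighted) case: given a sample mean and variance, the beta shape parameters follow directly (a sketch assuming the standard beta moment formulas; the generalized variant applies the same relations to dependent observations):

```python
def beta_from_moments(mean, var):
    """Beta shape parameters matching a given mean and variance via the
    standard moment relations E = a/(a+b), V = ab/((a+b)^2 (a+b+1)).

    Requires 0 < mean < 1 and 0 < var < mean * (1 - mean).
    """
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

# Beta(2, 5) has mean 2/7 and variance 10/392; the inverse map recovers it.
a_hat, b_hat = beta_from_moments(2.0 / 7.0, 10.0 / 392.0)
```

In practice, high-valued observables of a correctly working tracker concentrate the fitted density near 1, while failures shift it towards 0.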
3.2.2 Algorithm Overview
The complete modified Baum-Welch algorithm is summarized in Alg. 1; after each iteration the current parameter estimate is replaced by the new one, and these steps are repeated until convergence. Note that the transition matrix update is a maximum likelihood estimate, so the likelihood always increases (shown in Rabiner1989 ), but the observable-distribution parameters are estimated by the method of moments, so a test on likelihood increase is required (the "if statement" in Alg. 1). In fact, this algorithm structure matches the generalized EM algorithm (GEM) introduced in Dempster77 .
4 FeatureBased Detector
The requirements for the detector are: an adjustable operation mode (e.g. set for high precision but possibly low recall), (near) real-time performance and the ability to model pose transformations up to at least a similarity (translation, rotation, isotropic scaling). Basically, any detector-like approach can be used, and the choice may vary by application. We chose to adapt a feature-based detector, an approach that has been shown to perform well in image retrieval, object detection and object tracking Pernici2013 .
There are many possible combinations of features and descriptors, with different advantages and drawbacks. We exploit multiple feature types: specifically, Hessian keypoints with the SIFT Lowe2004 descriptor, ORB Rublee2011 with the BRISK descriptor and ORB with the FREAK Ortiz2012 descriptor. Each feature type is handled separately, up to the point where point correspondences are established. A weight is assigned to each feature type, set to be inversely proportional to the number of features of that type on the reference template, to balance the disparity in individual feature counts.
The detector works as follows. In the initialization step, features are extracted from the inside and the outside of the region specifying the tracked object. Descriptors of the features outside of the region are stored as the background model.
Usually, the input region is not fully occupied by the target; therefore, a fast color segmentation Kristan2014a attempts to delineate the object more precisely than the axis-aligned bounding box, to remove the features that are most likely not on the target. The step is not critical for the function of the detector, since the bounding box is a fallback option: we assume that at least a substantial part of the bounding box is filled with pixels that belong to the target, and if the segmentation fails (returns a region covering less than a set fraction of the bounding box area), all features in the initial bounding box are used.
Additionally, for each target feature f, we use a normal distribution N(μ_f, σ_f²) to model the distances of f to non-corresponding features. The parameters μ_f and σ_f are estimated in the first frame by randomly sampling 100 features other than f and computing their distances to f, from which the mean and variance are computed. This allows judging the quality of correspondence matches in a probabilistic manner for each feature, removing the need for a global static threshold on the acceptable correspondence distance. In the detection phase, features are detected and described in the whole image. For each feature from the image, the nearest neighbour (in Euclidean or Hamming distance, depending on the feature type) from the background model and the nearest neighbour f from the foreground model are computed. A tentative correspondence is formed if the feature match passes the second-nearest-neighbour test and the probability that the correspondence distance d belongs to the outlier distribution is lower than a predefined significance level θ:

(13) F(d; μ_f, σ_f) < θ,

where F is the c.d.f. of the normal distribution with parameters μ_f and σ_f of the distances of features not corresponding to f. Finally, RANSAC estimates the current target pose, using a sum of weighted inliers as the cost function for model support
(14) c = Σ_{i ∈ I} w_{τ(i)},

where I is the set of inliers and w_{τ(i)} is the weight of the feature type of inlier i; this takes into account the different numbers of features per feature type on the target.
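A sketch of the matching test, assuming a standard second-nearest-neighbour ratio test combined with the per-feature normal outlier model described above (the ratio and significance values here are illustrative, not the paper's):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """C.d.f. of the normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def tentative_match(d_fg1, d_fg2, d_bg, mu, sigma, snn_ratio=0.8, sig=0.05):
    """Accept a foreground correspondence if it clearly beats both the
    second-best foreground match and the best background match, and its
    distance is improbable under the per-feature outlier distribution
    N(mu, sigma^2) of non-corresponding features.
    """
    if d_fg1 >= snn_ratio * min(d_fg2, d_bg):
        return False  # fails the second-nearest-neighbour test
    return normal_cdf(d_fg1, mu, sigma) < sig
```

A small match distance that falls in the far left tail of the outlier distribution is accepted; a distance typical for non-matching features is rejected regardless of the ratio test.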
The decision whether the detected pose is considered correct depends on the number of weighted inliers supporting the RANSAC-selected transformation, and it controls the trade-off between precision and recall of the method. This threshold is automatically computed in the first frame of the sequence by multiplying the number of detected target features by a feature multiplier and clamping the result to a threshold interval. The threshold interval (5, 10) and the feature multiplier (0.03) were set experimentally so that the false positive rate is close to zero for most of the testing sequences. Furthermore, majority voting is used to verify that the detection does not contradict the estimated HMM state, i.e. if we are in a state where two or more (a majority of) trackers are correct and the detector is not consistent with them, the detection is not used. This mitigates false positive detections, and therefore wrong HMM updates, when the trackers work correctly. The true and false positives for the 77 sequences are shown in Fig. 3; the detector works on almost all sequences with a zero false positive rate (the average false positive and recall rates on the dataset are given in Fig. 3). The failure cases of this feature-based detector are mostly caused by an imprecise initial bounding box that contains a large portion of structured background (i.e. background where the detector finds features) and by the presence of a similar object in the scene, e.g. in the sequences hand2, basketball, singer2.
5 HMMTxD Implementation
To demonstrate the performance of the proposed framework, a pair and a triplet of published short-term trackers were plugged into the framework to show the performance gain obtained by combining different numbers of trackers. As Bailer et al. Bailer2014 pointed out, not all trackers improve the overall performance when combined (i.e. adding a tracking method with a similar failure mode brings no benefit).
We therefore chose methods that have different designs and work under different assumptions (e.g. rigid global motion vs. color mean-shift estimation vs. maximum correlation response). These trackers are the Flock of Trackers (FoT) Vojir2014 , the scale-adaptive mean-shift tracker (ASMS) Vojir2013 and kernelized correlation filters (KCF) henriques2015 . This choice shows that superior performance can be achieved by using simple, fast trackers (above 100 fps) that may not individually represent the state of the art. The trackers can be arbitrarily replaced depending on the user application or requirements.
Trackers
The Flock of Trackers (FoT) Vojir2014 evenly covers the object with patches and establishes frame-to-frame correspondences by the Lucas-Kanade method Lucas1981 . The global motion of the target is estimated by RANSAC.
The second tracker is the scale-adaptive mean-shift tracker (ASMS) Vojir2013 , where the object pose is estimated by minimizing the distance between the RGB histograms of the reference and the candidate bounding box. The KCF henriques2015 tracker learns a correlation filter by ridge regression to have a high response on the target object and a low response on the background. The correlation is computed in the Fourier domain, which is very efficient.
These three trackers have been selected since they are complementary by design. FoT enforces a global motion constraint and works best for rigid, textured objects. ASMS, on the other hand, does not enforce object rigidity and is well suited for articulated or deformable objects, assuming their color distribution is discriminative w.r.t. the background. KCF can be viewed as a tracking-by-detection approach using sliding-window-like scanning.
For each tracker position, two global observable measurements are computed: the Hellinger distance between the target template histogram and the histogram at the current position, and the normalized cross-correlation score of the current patch and the target model patch. These target models are initialized in the first frame and then updated exponentially, with a fixed factor, during each positive detection by the detector part. Additionally, each tracker produces its own estimate of performance. For FoT it is the number of predicted correspondences that support the global model (for details please see Vojir2014 ), for ASMS it is the Hellinger distance between its histogram model and the current neighbourhood background (i.e. the color similarity of the object and background), and for KCF it is the correlation response of the tracking procedure.
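The two global observables can be sketched as follows (a minimal version; the exact histogram construction and normalization used in the paper may differ):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two histograms, normalized to sum to 1;
    the result lies in [0, 1] (0 = identical, 1 = disjoint support)."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(max(0.0, 1.0 - float(np.sum(np.sqrt(p * q))))))

def ncc01(a, b):
    """Normalized cross-correlation of two equally sized patches, mapped
    from [-1, 1] to [0, 1] so it can serve directly as an observable
    modeled by a beta distribution."""
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return 0.5 * (float(np.mean(a * b)) + 1.0)
```

Both measures are already in [0, 1], which matches the normalization the HMM requires of its observables.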
6 Experiments
The HMMTxD was compared with state-of-the-art methods on two standard benchmarks and on the dataset TV77 (http://cmp.felk.cvut.cz/~vojirtom/dataset/index.html), containing 77 public video sequences collected from tracking-related publications. The dataset exhibits a wider diversity of content and variability of conditions than the benchmarks.
Parameters of the method were fixed for all the experiments. In the HMM, the initial beta distribution shape parameters were set, for all observations, to favor high observable values in the correct state (1) and low values in the fail state (0), and the transition matrix was set to prefer staying in the current state: it has its largest entries on the diagonal, fixed values in the first column, the last column and the last row, and a common small value elsewhere, and is normalized so that rows sum to one. States in the matrix are binary encoded, starting from the left column. The number of iterations for the Baum-Welch algorithm was fixed.
The processing speed on the VOT2015 dataset is (in frames per second) 1.03 minimum, 33.72 maximum and 10.83 on average, measured on a standard notebook with an Intel Core i7 processor. The speed is mostly affected by the number of features detected in the images, which correlates with the image resolution (in the dataset, the range is from 320x180 to 1280x720).
First, we compare the performance of the individual parts of the HMMTxD framework (i.e. the KCF, ASMS and FoT trackers) and their combination via the HMM proposed in this paper. Two variants of HMMTxD are evaluated – 2-HMMTxD refers to the combination of the FoT and ASMS trackers and 3-HMMTxD to the combination of all three trackers. We also show the benefit of the proposed detector when simply combined with the individual trackers, such that the tracker is re-initialized whenever the detector fires. Figure 4 shows the benefit gained from the detector and the further consistent improvement achieved by the combination of the trackers. A more detailed per-sequence analysis on the TV77 dataset (Fig. 5 and Fig. 6) shows more clearly the efficiency of learning tracker performance online. In almost all sequences, the HMMTxD is able to identify and learn which trackers work correctly and achieves at least the performance of the best tracker, or better (e.g. motocross1, skating1(low), Volkswagen, singer1, pedestrian3, surfer). The most notable failure cases are caused by detector failure, e.g. in the sequences singer2, woman, skating1, basketball, girl_mov.
In all other experiments, the abbreviation HMMTxD refers to the combination of all 3 trackers.
Evaluation on the CVPR2013 Benchmark Wu2013 , which contains 50 video sequences. Results on the benchmark have been published for about 30 trackers. The benchmark defines three types of experiments: (i) one-pass evaluation (OPE) – a tracker initialized in the first frame is run to the end of the sequence; (ii) temporal robustness evaluation (TRE) – the tracker is initialized and started at a random frame; and (iii) spatial robustness evaluation (SRE) – the initialization is perturbed spatially. Performance is measured by precision (spatial accuracy, i.e. the center distance of the ground truth and the reported bounding box) and success rate (the number of frames where the overlap with the ground truth was higher than a threshold). The results are visualized in Fig. 7, where only the 10 top-performing trackers are plotted. Together with the trackers from this benchmark, we also added MEEM zhang2014 , a recent state-of-the-art tracker. The proposed HMMTxD outperforms all trackers in the success rate in all three experiments. Its precision is comparable to MEEM zhang2014 , the top-performing tracker in terms of precision. HMMTxD significantly outperforms the OPE results reported by Wang et al. wangg14 , where the 5 top-performing trackers from this particular benchmark were combined (other experiments were not reported in the paper).
The VOT2013 benchmark Kristan2013 evaluates trackers on a collection of 16 sequences carefully selected from a large pool by a semi-automatic clustering method. For comparison, results of 27 tracking methods are available; the added MEEM tracker was evaluated by us using the default settings of the publicly available source code. The performance is measured by accuracy, the average overlap with the ground truth, and robustness, the number of re-initializations needed for the tracker to get through the whole sequence. The average rank of the trackers is used as an overall performance indicator.
In this benchmark, the proposed HMMTxD achieves clearly the best accuracy (Fig. 8). With less than one re-initialization per sequence, it performs slightly worse in terms of robustness, for two reasons.
First, the HMM recognizes a tracker problem and switches to another tracker with a delay (here even one frame with zero overlap with the ground truth leads to a penalization). Second, the VOT evaluation protocol requires re-initialization after failure and forgetting all previously learned models (VOT2013 refers to this as causal tracking), so the learned performance of the trackers is forgotten and has to be learned from scratch.
The results for the baseline and region-noise experiments are shown in Fig. 8. Note that the ranking of the methods differs from the original publication, since two new methods (HMMTxD and MEEM) were added and the relative ranking of the methods changed. The top three performing trackers by average rank are HMMTxD, PLT and LGTpp Xiao2013 ; the MEEM tracker ends up in fifth place. The rankings were obtained by the toolkit provided by VOT in the default settings for the baseline and region-noise experiments.
The second best performing method on VOT2013 is the unpublished PLT, for which only a short description is available in Kristan2013 . PLT is a variation of a structural SVM that uses multiple features (color, gradients). STRUCK Hare2011 and MEEM zhang2014 are methods similar to PLT, based on SVM classification. We compared these methods with HMMTxD on the diverse 77 videos, along with TLD Kalal2012 , which has a design similar to HMMTxD. HMMTxD outperforms all these methods by a large margin in average recall – measured as the number of frames where the tracker overlap with the ground truth is higher than a threshold, averaged over all sequences. Results are shown in Fig. 9. A qualitative comparison of these state-of-the-art methods is shown in Fig. 10. Even on sequences with lower recall (e.g. bird_1, skating2), the HMMTxD is able to follow the object of interest.
7 Conclusions
A novel method called HMMTxD for the fusion of multiple trackers has been proposed. The method utilizes an online-trained HMM to estimate the states of the individual trackers and to fuse the different types of observables provided by the trackers. The HMMTxD outperforms its constituent parts (FoT, ASMS, KCF, the detector and their combinations) by a large margin, demonstrating the efficiency of the HMM-based combination of three trackers.
HMMTxD outperforms all methods included in the CVPR2013 benchmark and performs favorably against the most recent state-of-the-art trackers. The HMMTxD also outperforms all methods of the VOT2013 benchmark in accuracy, while maintaining very good robustness, and ranks first in the overall ranking. Experiments conducted on the diverse dataset TV77 show that the HMMTxD outperforms the state-of-the-art MEEM, STRUCK and TLD methods, which are similar in design, by a large margin. The processing speed of the HMMTxD is about 10 frames per second on average, which is comparable with other complex tracking methods.
Acknowledgements
The research was supported by the Czech Science Foundation Project GACR P103/12/G084 and by the Technology Agency of the Czech Republic project TE01020415 (V3C – Visual Computing Competence Center).
Appendix A ForwardBackward Procedure for Modified BaumWelch Algorithm
Let us assume an HMM with possible states S, the matrix of state transition probabilities A = (a_{ij}), the vector of initial state probabilities, the initial state s₀, a sequence of observations o₁, …, o_T and the system of conditional probability densities b_s(oₜ) of observations conditioned on the state.
Let t₁ < t₂ < … < t_D be the sequence of detection times and let the states observed at these times, marked by the detector, be the all-correct state for d = 1, …, D.
The forward variable for the Baum-Welch algorithm is defined as follows.
(15) 
(16) 
(17) 
and for
(18) 
(19) 
For the forward variable is in principle the same as above with . So
(20) 
(21) 
The backward variable for is
(22) 
where and for and
(23) 
For the backward variable is in principle the same as above where for .
Given the forward and backward variables, we get the following probabilities, that are used to update parameters of HMM. For
(24) 
(25) 
and for
(26) 
The final equation for the update of transition probabilities of HMM is as follows.
(27) 
(28) 
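The constrained forward-backward pass and the transition update can be sketched in code. This is an illustrative NumPy implementation under the notation of this appendix (the function and variable names are ours, not from the paper); it assumes the observation likelihoods have been precomputed into a matrix.

```python
import numpy as np

def _constrain(v, s):
    """Zero out all entries except the detector-annotated state s."""
    out = np.zeros_like(v)
    out[s] = v[s]
    return out

def forward_backward_constrained(A, b, x0, detections):
    """One forward-backward pass of the modified Baum-Welch algorithm
    where the state at the detection times is known.

    A          : (N, N) transition matrix a_ij
    b          : (T, N) likelihoods b[t, j] = p(o_t | x_t = j)
    x0         : index of the known initial state
    detections : dict {t_d: s_d} of detector-annotated frames/states
    Returns the state posteriors gamma (T, N) and the updated matrix A.
    """
    T, N = b.shape
    alpha = np.zeros((T, N))
    alpha[0] = A[x0] * b[0]                      # Eq. (15) with alpha_0 = delta(x0)
    if 0 in detections:
        alpha[0] = _constrain(alpha[0], detections[0])
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]     # Eqs. (15), (18), (20)
        if t in detections:                      # Eqs. (16)-(17), (19)
            alpha[t] = _constrain(alpha[t], detections[t])

    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        nxt = b[t + 1] * beta[t + 1]
        if t + 1 in detections:                  # Eq. (23): annotated successor
            nxt = _constrain(nxt, detections[t + 1])
        beta[t] = A @ nxt                        # Eq. (22)

    gamma = alpha * beta                         # Eqs. (24), (26)
    gamma /= gamma.sum(axis=1, keepdims=True)

    xi_sum = np.zeros((N, N))
    for t in range(T - 1):
        nxt = b[t + 1] * beta[t + 1]
        if t + 1 in detections:
            nxt = _constrain(nxt, detections[t + 1])
        xi = alpha[t][:, None] * A * nxt[None, :]  # Eq. (25), unnormalized
        xi_sum += xi / xi.sum()
    A_new = xi_sum / gamma[:-1].sum(axis=0)[:, None]  # Eq. (27)
    A_new /= A_new.sum(axis=1, keepdims=True)         # Eq. (28)
    return gamma, A_new
```

Note that at an annotated frame the forward variable is zero everywhere except at the annotated state, so the posterior at that frame is automatically the indicator of Eq. (26).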
Appendix B Generalized Method of Moments
For simplification, let us assume an HMM with one-dimensional observed random variables $o_1,\dots,o_T$. The sequence $Y_1,\dots,Y_T$,
(29) $Y_t = o_t - \mathrm{E}[\,o_t \mid x_t\,],$
is a martingale difference series, i.e.
(30) $\mathrm{E}[\,Y_t \mid Y_1,\dots,Y_{t-1}\,] = 0.$
Under the assumption that $Y_t$ are uniformly bounded random variables, i.e. $|Y_t| \leq C$ for all $t$, the strong law of large numbers for a sum of martingale differences can be used (see Theorem 2.19 in hall1980 ). So
(31) $\frac{1}{T} \sum_{t=1}^{T} Y_t \xrightarrow{\text{a.s.}} 0 \quad \text{as } T \to \infty.$
Let us denote $\mu_i = \mathrm{E}[\,o_t \mid x_t = i\,]$ for $i = 1,\dots,N$ and $\hat{\mu}_i$ the estimate of $\mu_i$ based on the modified method of moments. The estimate is a solution of the following equation w.r.t. $\hat{\mu}_1,\dots,\hat{\mu}_N$:
(32) $\frac{1}{T} \sum_{t=1}^{T} o_t = \sum_{i=1}^{N} \hat{\mu}_i\, \frac{1}{T} \sum_{t=1}^{T} \gamma_t(i).$
Having one equation for $N$ unknown variables, it is necessary to add some constraints to get a unique solution. We propose to minimize
(33) $\sum_{t=1}^{T} \sum_{i=1}^{N} \gamma_t(i)\, \big(o_t - \hat{\mu}_i\big)^2$
w.r.t. $\hat{\mu}_1,\dots,\hat{\mu}_N$, giving
(34) $\hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, o_t}{\sum_{t=1}^{T} \gamma_t(i)},$
which satisfies the moment equation (32), since $\sum_{i=1}^{N} \gamma_t(i) = 1$ for every $t$. The same way of reasoning can be used for higher moments of $o_t$. For example, using the second moments we get the estimates of the variances $\sigma_i^2 = \mathrm{var}[\,o_t \mid x_t = i\,]$ for $i = 1,\dots,N$,
(35) $\hat{\sigma}_i^2 = \frac{\sum_{t=1}^{T} \gamma_t(i)\, \big(o_t - \hat{\mu}_i\big)^2}{\sum_{t=1}^{T} \gamma_t(i)}.$
In the HMMTxD, vector-valued observed random variables are assumed, each component having a beta distribution and being conditionally independent. There are well-known relations for the mean value
(36) $\mu = \frac{\alpha}{\alpha + \beta}$
and the variance
(37) $\sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$
of a random variable having a beta distribution with shape parameters $\alpha$ and $\beta$. Using the modified method of moments, inverting these relations gives
(38) $\alpha = \mu \left( \frac{\mu (1 - \mu)}{\sigma^2} - 1 \right)$
and
(39) $\beta = (1 - \mu) \left( \frac{\mu (1 - \mu)}{\sigma^2} - 1 \right).$
Then
(40) $\hat{\alpha}_i = \hat{\mu}_i \left( \frac{\hat{\mu}_i (1 - \hat{\mu}_i)}{\hat{\sigma}_i^2} - 1 \right)$
and
(41) $\hat{\beta}_i = (1 - \hat{\mu}_i) \left( \frac{\hat{\mu}_i (1 - \hat{\mu}_i)}{\hat{\sigma}_i^2} - 1 \right).$
If we assume in our model that the observation density is shared by a subset of states $S \subseteq \{1,\dots,N\}$, i.e. $b_i = b_j$ for all $i, j \in S$, then
(42) $\hat{\alpha}_S = \hat{\mu}_S \left( \frac{\hat{\mu}_S (1 - \hat{\mu}_S)}{\hat{\sigma}_S^2} - 1 \right)$
and
(43) $\hat{\beta}_S = (1 - \hat{\mu}_S) \left( \frac{\hat{\mu}_S (1 - \hat{\mu}_S)}{\hat{\sigma}_S^2} - 1 \right),$
where
(44) $\hat{\mu}_S = \frac{\sum_{t=1}^{T} \sum_{i \in S} \gamma_t(i)\, o_t}{\sum_{t=1}^{T} \sum_{i \in S} \gamma_t(i)}$
and
(45) $\hat{\sigma}_S^2 = \frac{\sum_{t=1}^{T} \sum_{i \in S} \gamma_t(i)\, \big(o_t - \hat{\mu}_S\big)^2}{\sum_{t=1}^{T} \sum_{i \in S} \gamma_t(i)}.$
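The weighted moment estimates and the plug-in beta shape parameters are straightforward to compute. Below is a small NumPy sketch (the function name is ours); by construction the returned shape parameters reproduce the weighted mean and variance of the data exactly.

```python
import numpy as np

def beta_moment_estimates(o, gamma):
    """Beta shape parameters per HMM state from gamma-weighted first and
    second moments (modified method of moments).

    o     : (T,) observations in the open interval (0, 1)
    gamma : (T, N) state posteriors gamma[t, i]
    """
    w = gamma.sum(axis=0)                                   # total weight per state
    mu = (gamma * o[:, None]).sum(axis=0) / w               # weighted means
    var = (gamma * (o[:, None] - mu) ** 2).sum(axis=0) / w  # weighted variances
    nu = mu * (1.0 - mu) / var - 1.0                        # common factor alpha + beta
    return mu * nu, (1.0 - mu) * nu                         # alpha_hat, beta_hat
```

Since the weighted variance of data in the interval (0, 1) is strictly smaller than mu * (1 - mu), the factor nu is positive and both shape parameters are valid.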
References
 (1) A. Yilmaz, O. Javed, M. Shah, Object tracking: A survey, ACM Computing Surveys, 2006.
 (2) A. W. M. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: an experimental survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2013).

 (3) J. Kwon, K. M. Lee, Tracking of a non-rigid object via patch-based dynamic appearance modeling and adaptive basin hopping Monte Carlo sampling, in: Computer Vision and Pattern Recognition, 2009, pp. 1208–1215.
 (4) M. Godec, P. M. Roth, H. Bischof, Hough-based tracking of non-rigid objects, in: International Conference on Computer Vision, 2011.
 (5) L. Cehovin, M. Kristan, A. Leonardis, Robust visual tracking using an adaptive coupled-layer visual model, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (4) (2013) 941–953.
 (6) H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: Learning where the object might be, in: Computer Vision and Pattern Recognition, 2010, pp. 1285–1292.
 (7) X. Zhou, Y. Lu, Abrupt motion tracking via adaptive stochastic approximation Monte Carlo sampling, in: Computer Vision and Pattern Recognition, 2010, pp. 1847–1854.
 (8) F. Pernici, A. D. Bimbo, Object tracking by oversampling local features, IEEE Transactions on Pattern Analysis and Machine Intelligence 99 (PrePrints) (2013) 1.
 (9) Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7) (2012) 1409–1422.
 (10) J. Santner, C. Leistner, A. Saffari, T. Pock, H. Bischof, PROST: Parallel Robust Online Simple Tracking, in: Computer Vision and Pattern Recognition, San Francisco, CA, USA, 2010.
 (11) J. Zhang, S. Ma, S. Sclaroff, MEEM: robust tracking via multiple experts using entropy minimization, in: Proc. of the European Conference on Computer Vision, 2014.

 (12) N. Wang, D.-Y. Yeung, Ensemble-based tracking: Aggregating crowdsourced structured time series data, in: T. Jebara, E. P. Xing (Eds.), Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 1107–1115.
 (13) C. Bailer, A. Pagani, D. Stricker, A superior tracking approach: Building a strong tracker through fusion, in: European Conference on Computer Vision, Lecture Notes in Computer Science, 2014.
 (14) Y. Yuan, H. Yang, Y. Fang, W. Lin, Visual object tracking by structure complexity coefficients, IEEE Transactions on Multimedia 17 (8) (2015) 1125–1136.
 (15) J. H. Yoon, D. Y. Kim, K.J. Yoon, Visual tracking via adaptive tracker selection with multiple features, in: Proceedings of the 12th European Conference on Computer Vision  Volume Part IV, ECCV’12, 2012, pp. 28–41.

 (16) M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. P. Pflugfelder, G. Fernández, G. Nebehay, F. Porikli, L. Cehovin, A novel performance evaluation methodology for single-target trackers, arXiv abs/1503.01313 (2015). URL http://arxiv.org/abs/1503.01313
 (17) A. Gupta, S. Nadarajah, Handbook of Beta Distribution and Its Applications, Statistics: A Series of Textbooks and Monographs, 2004.
 (18) L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257–286.
 (19) L. E. Baum, T. Petrie, G. Soules, N. Weiss, A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, The Annals of Mathematical Statistics 41 (1) (1970) 164–171.
 (20) A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, series B 39 (1) (1977) 1–38.
 (21) D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
 (22) E. Rublee, V. Rabaud, K. Konolige, G. Bradski, ORB: An efficient alternative to SIFT or SURF, in: International Conference on Computer Vision, ICCV '11, Washington, DC, USA, 2011, pp. 2564–2571.
 (23) R. Ortiz, FREAK: Fast retina keypoint, in: Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 2012, pp. 510–517.
 (24) M. Kristan, J. Perš, V. Sulic, S. Kovacic, A graphical model for rapid obstacle image-map estimation from unmanned surface vehicles, in: Asian Conference on Computer Vision, 2014.
 (25) T. Vojir, J. Matas, The enhanced flock of trackers, in: Registration and Recognition in Images and Videos, Vol. 532 of Studies in Computational Intelligence, 2014, pp. 113–136.
 (26) T. Vojir, J. Noskova, J. Matas, Robust scale-adaptive mean-shift for tracking, in: Image Analysis, Vol. 7944 of Lecture Notes in Computer Science, 2013, pp. 652–663.
 (27) J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3) (2015) 583–596.

 (28) B. D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: International Joint Conference on Artificial Intelligence, 1981.
 (29) Y. Wu, J. Lim, M.H. Yang, Online object tracking: A benchmark, in: Computer Vision and Pattern Recognition, 2013, pp. 2411–2418.
 (30) M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, F. Porikli, L. Cehovin, G. Nebehay, G. Fernandez, T. Vojir, et al., The Visual Object Tracking VOT2013 challenge results, in: The IEEE International Conference on Computer Vision (ICCV) Workshops, 2013.
 (31) J. Xiao, R. Stolkin, A. Leonardis, An enhanced adaptive coupled-layer LGTracker++, in: Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, 2013, pp. 137–144.
 (32) S. Hare, A. Saffari, P. H. S. Torr, Struck: Structured output tracking with kernels, in: International Conference on Computer Vision, 2011, pp. 263–270.
 (33) P. Hall, C. Heyde, Martingale limit theory and its application, Probability and mathematical statistics, Academic Press, 1980.