1 Introduction
We study the tracking problem, which has numerous applications in AI, control and finance. In tracking, we are given noisy measurements over time, and the problem is to estimate the hidden state of an object. The challenge is to do this reliably, by combining measurements from multiple time steps and prior knowledge about the state dynamics, and the goal of tracking is to produce estimates that are as close to the true states as possible.
The most popular solutions to the tracking problem are the Kalman filter [1], the particle filter [2], and their numerous extensions and variations (e.g. [3, 4]), which are based on a generative framework for the tracking problem. Suppose we want to track the state of an object at time, given only measurement vectors for times . In the generative approach, we think of the state and measurements as random variables. We represent our knowledge regarding the dynamics of the states using the transition process and our knowledge regarding the (noisy) relationship between the states and the observations by the measurement process . Then, given only the observations, the goal of tracking is to estimate the hidden state sequence. This is done by calculating the likelihood of each state sequence and then using as the estimate either the sequence with the highest posterior probability (maximum a posteriori, or MAP) or the expected value of the state with respect to the posterior distribution (the Bayesian algorithm). In practice, one uses particle filters, which are an approximation to the Bayesian algorithm.
The problem with the generative framework is that in practice, it is very difficult to precisely determine the distributions of the measurements. Moreover, the Bayesian algorithm is very sensitive to model mismatches, so using a model which is slightly different from the model generating the measurements can lead to a large divergence between the estimated states and the true states.
To address this, we introduce an online-learning-based framework for tracking. In our framework, called the explanatory framework, we are given a set of state sequences or paths in the state space; but instead of assuming that the observations are generated by a measurement model from a path in this set, we think of each path as a mechanism for explaining the observations. We emphasize that this is done regardless of how the observations are generated. Suppose a path is proposed as an explanation of the observations . We measure the quality of this explanatory path using a predefined loss function , which depends only on the measurements (and not on the hidden true state). The tracking algorithm selects its own explanatory path by taking a weighted average of the best explanatory paths according to the past observations. The theoretical guarantee we provide is that the loss of the explanatory path generated in this online way by the tracking algorithm is close to that of the explanatory path with the minimum such loss; here, the loss is measured according to the loss function supplied to the algorithm. Such guarantees are analogous to competitive analysis used in online learning [5, 6, 7], and it is important to note that such guarantees hold uniformly for any sequence of observations, regardless of any probabilistic assumptions.
Our next contribution is to provide an online-learning-based algorithm for tracking in the explanatory framework. Our algorithm is based on NormalHedge [8], which is a general online learning algorithm. NormalHedge can be instantiated with any loss function. When supplied with a bounded loss function, it is guaranteed to produce a path with loss close to that of the path with the minimum loss, from a set of candidate paths. As it is inefficient to directly apply NormalHedge to tracking, we derive a Sequential Monte Carlo approximation to NormalHedge, and we show that this approximation is efficient.
To demonstrate the robustness of our tracking algorithm, we perform simulations on a simple one-dimensional tracking problem. We evaluate tracking performance by measuring the average distance between the states estimated by the algorithms and the true hidden states. We instantiate our algorithm with a simple clipping loss function. Our simulations show that our algorithm consistently outperforms the Bayesian algorithm under high measurement noise and a wide range of levels of model mismatch.
We note that the Bayesian algorithm can also be interpreted in the explanatory framework. In particular, if the loss of a path is the negative log-likelihood (the log-loss) under some measurement model, then the Bayesian algorithm can be shown to produce a path with log-loss close to that of the path with the minimum log-loss. One may be tempted to think that our tracking solution follows the same approach; however, the point of our paper is that one can use loss functions that are different from log-loss, and in particular, we show a scenario in which using other loss functions produces better tracking performance than the Bayesian algorithm (or its approximations).
The rest of the paper is organized as follows. In Section 2, we explain in detail our explanatory model for tracking. In Section 3, we present NormalHedge, on which our tracking algorithm is based. In Section 4, we provide our tracking algorithm. Section 5 presents the experimental comparison of our algorithm with the Bayesian algorithm. Finally, we discuss related work in Section 6.
The detailed bounds and proofs for NormalHedge are provided in the supplementary material. We feel that the algorithm NormalHedge may be of more general interest, and hence these details for NormalHedge have been submitted to NIPS in a companion paper.
2 The explanatory framework for tracking
In this section, we describe in more detail the setup of the tracking problem, and the explanatory framework for tracking. In tracking, at each time , we are given as input, measurements (or observations) , and the goal is to estimate the hidden state of an object using these measurements, and our prior knowledge about the state dynamics.
In the explanatory framework, we are given a set of paths (sequences) over the state space . At each time , we assign to each path in a loss function . The loss function has two parts: a dynamics loss and an observation loss .
The dynamics loss captures our knowledge about the state dynamics. For simplicity, we use a dynamics loss that can be written as
for a path . In other words, the dynamics loss at time depends only on the states at time and . A common way to express our knowledge about the dynamics is in terms of a dynamics function , defined so that paths with will have small dynamics loss.
For example, consider an object moving with a constant velocity. Here, if the state , where is the position and is the velocity, then we would be interested in paths in which . In these cases, the dynamics loss is typically a growing function of the distance from to .
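As an illustration of the constant-velocity example, the following Python sketch pairs a dynamics function with a dynamics loss. The squared Euclidean distance is our assumption here (the text only requires a growing function of the distance), and the names `constant_velocity`, `dynamics_loss` and `path_dynamics_loss` are hypothetical.

```python
from typing import List, Tuple

State = Tuple[float, float]  # (position, velocity)


def constant_velocity(state: State) -> State:
    """Dynamics function: the position advances by the velocity, the velocity is kept."""
    position, velocity = state
    return (position + velocity, velocity)


def dynamics_loss(prev: State, curr: State) -> float:
    """Assumed dynamics loss: squared distance between the predicted and actual state."""
    pred = constant_velocity(prev)
    return (curr[0] - pred[0]) ** 2 + (curr[1] - pred[1]) ** 2


def path_dynamics_loss(path: List[State]) -> float:
    """Sum of the per-step dynamics losses along a path."""
    return sum(dynamics_loss(a, b) for a, b in zip(path, path[1:]))
```

A path that follows the dynamics function exactly, such as `[(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]`, incurs zero dynamics loss under this choice, and any deviation is penalized quadratically.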
The second component of the loss function is an observation loss . Given a path , and measurements , the observation loss function quantifies how well the path explains the measurements. Again, for simplicity, we restrict ourselves to loss functions that can be written as:
In other words, the observation loss of a path at time depends only on its state at time and the measurements at time . The total loss of a path is the sum of its dynamics and observation losses. We note that the loss of a path depends only on that particular path and the measurements, and not on the true hidden state. As a result, the loss of a path can always be evaluated by the algorithm at any given time.
The algorithmic framework we consider in this model is analogous to, and motivated by, the decision-theoretic framework for online learning [6, 5]. At time , the algorithm assigns a weight to each path in . The estimated state at time is the weighted mean of the states, where the weight of a state is the total weight of all paths in this state. The loss of the algorithm at time is the weighted loss of all paths in . The theoretical guarantee we look for is that the loss of the algorithm is close to the loss of the best path in in hindsight (or, close to the loss of the top quantile path in in hindsight). Thus, if has a small fraction of paths with low loss, and if the loss functions successfully capture the tracking performance, then the sequence of states estimated by the algorithm will have good tracking performance.
3 NormalHedge
In this section, we describe the NormalHedge algorithm. To present NormalHedge in full generality, we first need to describe the decision-theoretic framework for online learning.
The problem of decision-theoretic online learning is as follows. At each round, a learner has access to a set of actions; for our purposes, an action is any method that provides a prediction in each round. The learner maintains a distribution over the actions at time . At each time period , each action suffers a loss which lies in a bounded range, and the loss of the learner is . We note that this framework is very general: no assumption is made about the nature of the actions and the distribution of the losses. The goal of the learner is to maintain a distribution over the actions, such that its cumulative loss over time is low, compared to the cumulative loss of the action with the lowest cumulative loss. In some cases, particularly when the number of actions is very large, we are interested in achieving a low cumulative loss compared to the top quantile of actions. Here, for any , the top quantile of actions are the fraction of actions which have the lowest cumulative loss.
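The quantities in this protocol can be made concrete with a short Python sketch. We assume that the learner's loss in a round is the expectation of the action losses under its distribution, and that the regret to the top quantile for a fraction `eps` is measured against the action ranked `ceil(eps * N)` by cumulative loss; the rounding convention and the function names are our assumptions.

```python
import math
from typing import Sequence


def learner_loss(dist: Sequence[float], losses: Sequence[float]) -> float:
    """Expected loss of the learner under its distribution over actions."""
    return sum(p * l for p, l in zip(dist, losses))


def regret_to_quantile(learner_cum: float,
                       action_cums: Sequence[float],
                       eps: float) -> float:
    """Learner's cumulative loss minus that of the action ranked ceil(eps * N).

    With eps = 1 / N this is the regret to the single best action.
    """
    n = len(action_cums)
    k = max(1, math.ceil(eps * n))          # rank of the comparison action
    return learner_cum - sorted(action_cums)[k - 1]
```

For example, with cumulative action losses `[3.0, 1.0, 2.0, 4.0]` and a learner loss of `2.5`, the regret to the best action is `1.5`, while the regret to the top half of the actions is `0.5`.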
Starting with the seminal work of Littlestone and Warmuth [7], the problem of decision-theoretic online learning has been well-studied in the literature [6, 9, 5]. The most common algorithm for this problem is Hedge or Exponential Weights [6], which assigns to each action a weight exponentially small in its total loss. In this paper, however, we consider a different algorithm, NormalHedge, for this problem [8], and it is this algorithm that forms the basis of our tracking algorithm. While the Bayesian averaging algorithm can be shown to be a variant of Hedge when the loss function is the log-loss, such is not the case for NormalHedge, which is a very different algorithm. A significant advantage of using NormalHedge is that it has no parameters to tune, yet achieves performance comparable to the best performance of previous online learning algorithms with optimally tuned parameters.
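For comparison, a minimal Python sketch of the Hedge (exponential weights) rule is given below. The learning rate `eta` is the parameter that has to be tuned in Hedge; the function name and the subtraction of the minimum loss (for numerical stability only) are our choices.

```python
import math
from typing import List, Sequence


def hedge_weights(cum_losses: Sequence[float], eta: float) -> List[float]:
    """Weights proportional to exp(-eta * cumulative loss), normalized to sum to one."""
    best = min(cum_losses)                   # shift so that the largest exponent is 0
    raw = [math.exp(-eta * (loss - best)) for loss in cum_losses]
    total = sum(raw)
    return [r / total for r in raw]
```

With `eta = ln 2`, an action whose cumulative loss is larger by one unit receives half the weight of its competitor, so `hedge_weights([0.0, 1.0], math.log(2))` is approximately `[2/3, 1/3]`.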
In the NormalHedge algorithm, for each action and time , we use to denote the NormalHedge weight assigned to action at time . At any time , we define the regret of our algorithm to an action as the difference between the cumulative loss of our algorithm and the cumulative loss of this action. Also, for any real number , we use the notation to denote . The NormalHedge algorithm is presented below.
The performance guarantees for the NormalHedge algorithm, as shown in [8], can be stated as follows.
Theorem 1.
If NormalHedge has access to actions, then for all loss sequences, for all , for all , the regret of the algorithm to the top quantile of the actions is .
Note that the actions which have total loss greater than the total loss of the algorithm are assigned zero weight. Since the algorithm performs almost as well as the best action, in a scenario where a few actions are significantly better than the rest, the algorithm will assign zero weight to most actions. In other words, the support of the NormalHedge weights may be a very small set, which can significantly reduce its computational cost.
4 Tracking using NormalHedge
To apply NormalHedge directly to tracking, we set each action to be a path in the state space, and the loss of each action at time to be the loss of the corresponding path at time . To make NormalHedge more robust in a practical setting, we make a small change to the algorithm: instead of using cumulative loss, we use a discounted cumulative loss. For a discount parameter , the discounted cumulative loss of an action at time is
. Using discounted losses is common in reinforcement learning [10]; intuitively, it makes the tracking algorithm more flexible, and allows it to more easily recover from past mistakes.
However, a direct application of NormalHedge is prohibitively expensive in terms of computation cost. Therefore, in the sequel, we show how to derive a Sequential Monte Carlo based approximation to NormalHedge, and we use this approximation in our experiments.
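To make the discounted cumulative loss introduced above concrete, the following Python sketch uses a geometric discount in which each past loss is multiplied by `(1 - alpha)` once per elapsed round. This particular form is an assumption on our part, and the function name is hypothetical.

```python
from typing import Sequence


def discounted_cumulative_loss(losses: Sequence[float], alpha: float) -> float:
    """Assumed recursion: L_t = (1 - alpha) * L_{t-1} + loss_t, with L_0 = 0."""
    total = 0.0
    for loss in losses:
        total = (1.0 - alpha) * total + loss
    return total
```

Under this assumed form, `alpha = 0` recovers the ordinary cumulative loss, while larger values of `alpha` let old mistakes fade more quickly.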
The key observation behind our approximation is that the weights on actions generated by the NormalHedge algorithm induce a distribution over the states at each time . We therefore use a random sample of states in each round to approximate this distribution. Thus, just as particle filters approximate the posterior density on the states induced by the Bayesian algorithm, our tracking algorithm approximates the density induced on the states by NormalHedge for tracking.
The main difference between NormalHedge and our tracking algorithm is that while NormalHedge always maintains the weights for all the actions, we delete an action from our action list when its weight falls to zero. We then replace this action by our resampling procedure, which chooses another action which is currently in a region of the state space where the actions have low losses. Thus, we do not spend resources maintaining and updating weights for actions which do not perform well. Another difference between NormalHedge and our tracking algorithm is that in our approximation, we do not explicitly impose a dynamics loss on the actions. Instead, we use a resampling procedure that only considers actions with low dynamics loss. This also avoids spending resources on actions which have high dynamics loss anyway.
Our tracking algorithm is specified in Algorithm 2. Each action in our algorithm is a path in the state space . However, we do not maintain this entire path explicitly for each action; rather, Step 8 of the algorithm computes from using the dynamics function , so we only need to maintain the current state of each action. Recall, applying the dynamics function should ensure that the path incurs no or little dynamics loss (see Section 2).
We start with a set of actions initially positioned at states uniformly distributed over the , and a uniform weighting over these actions. In each round, like NormalHedge, each action incurs a loss determined by its current state, and the tracker incurs the expected loss determined by the current weighting over actions. Using these losses, we update the cumulative (discounted) regrets to each action. However, unlike NormalHedge, we then delete all actions with zero or negative regret, and replace them using a resampling procedure. This procedure replaces poorly performing actions with actions currently at high density regions of , thereby providing a better approximation to the intended weights.
Algorithm 2 Tracking algorithm
Input: (number of actions), (discount factor), (resampling parameter), (dynamics function)
1: with randomly drawn from ; ;
2: for do
3: Obtain losses for each action and update regrets: where .
4: Delete poor actions: let , set .
5: Resample actions: .
6: Compute weight of each action : where is the solution to the equation .
7: Estimate: .
8: Update states: .
9: end for
Algorithm 3 Resampling procedure
Input: (actions to be resampled), (resampling parameter), (current time)
1: for do
2: Set .
3: If : set . Else: set and .
4: Draw .
5: Draw , and set .
6: end for
The resampling procedure is explained in detail in Algorithm 3. The main idea is to sample from the regions of the state space with high weight. This is done by sampling an action proportional to its weight in the previous round. We then choose a state randomly (roughly) from an ellipsoid around the current state of the selected action; the new action inherits the history of the selected action, but has a current state which is different from (but close to) the selected action. This latter step makes the new state distribution smoother than the one in the previous round, which may be supported on just a few states if only a few actions have low losses. We note that can be set so that the resampling procedure only samples actions with low dynamics loss (and Step 8 of the algorithm ensures that the remaining actions in the set do not incur any dynamics loss); thus, our algorithm does not explicitly compute a dynamics loss for each action.
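The core of the resampling idea can be sketched in Python for a one-dimensional state space. Here a parent action is drawn with probability proportional to its previous weight, and the new state is drawn uniformly from an interval around the parent's state (the one-dimensional analogue of an ellipsoid). The `radius` parameter stands in for the resampling parameter, and the function name and return values are hypothetical; this is not the paper's Algorithm 3 step by step.

```python
import random
from typing import List, Sequence, Tuple


def resample(states: Sequence[float],
             weights: Sequence[float],
             n_new: int,
             radius: float,
             rng: random.Random) -> Tuple[List[float], List[int]]:
    """Draw n_new replacement states near highly weighted actions.

    Returns the new states together with the indices of their parents, so
    that the caller can copy each parent's history to the new action.
    """
    parents = rng.choices(range(len(states)), weights=weights, k=n_new)
    new_states = [rng.uniform(states[i] - radius, states[i] + radius)
                  for i in parents]
    return new_states, parents
```

For instance, `resample([0.0, 10.0], [0.0, 1.0], 5, 0.5, random.Random(0))` only produces states within `0.5` of `10.0`, since the first action has zero weight and is never selected as a parent.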
5 Simulations
For our simulations, we consider the task of tracking an object in a simple, one-dimensional state space. To evaluate our algorithm, we measure the distance between the estimated states and the true states of the object.
Our experimental setup is inspired by the application of tracking faces in videos, using a standard face detector [11]. In this case, the state is the location of a face, and each measurement corresponds to a score output by the face detector for a region in the current video frame. The goal is to predict the location of the face across several video frames, using these scores produced by the detector. The detector typically returns high scores for several regions around the true location of a face, but it may also erroneously produce high scores elsewhere. And though in some cases the detection score may have a probabilistic interpretation, it is often difficult to accurately characterize the distribution of the noise.
The precise setup of our simulations is as follows. The object to be tracked remains stationary or moves with velocity at most in the interval . At time , the true state is the position ; the measurements correspond to a dimensional vector for locations in a grid , generated by an additive noise process
Here, is the square pulse function of width around the true state : if and otherwise (see Figure 2, left). The additive noise is randomly generated independently for each and each , using the mixture distribution
(see Figure 2, right). The parameter represents how noisy the measurements are relative to the signal, and the parameter represents the fraction of outliers. In our experiments, we fix and vary and . The total number of time steps we track for is .
In the generative framework, the dynamics of the object is represented by the transition model , and the observations are represented by the measurement process . Thus, when , the observations are generated according to the measurement process supplied to the generative framework; for , a fraction of the observations are outliers.
For the explanatory framework, the expected state dynamics function is the identity function, and the observation loss of a path at time is given by
where clips the measurements to the range . That is, the observation loss for with respect to is the negative sum of thresholded measurement values for in an interval of width around .
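The clipped observation loss described above can be sketched in Python as follows. The clipping bounds `lo` and `hi` and the interval width `width` are left as parameters, since their values are not fixed in this sketch, and the function names are hypothetical.

```python
from typing import Sequence


def clip(value: float, lo: float, hi: float) -> float:
    """Clip a measurement to the range [lo, hi]."""
    return max(lo, min(hi, value))


def observation_loss(x: float,
                     grid: Sequence[float],
                     y: Sequence[float],
                     width: float,
                     lo: float,
                     hi: float) -> float:
    """Negative sum of clipped measurements at grid points within width / 2 of x."""
    return -sum(clip(value, lo, hi)
                for z, value in zip(grid, y)
                if abs(z - x) <= width / 2.0)
```

Because every measurement is clipped before it is summed, a single outlier can change the loss of a state by at most `hi - lo`, which is the source of the robustness that the clipping is meant to provide.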
Given only the observation vectors , we use three different methods to estimate the true underlying state sequence . The first is the Bayesian algorithm, which recursively applies Bayes’ rule to update a posterior distribution using the transition and observation model. The posterior distribution is maintained at each location in the discretization . For the Bayesian algorithm, we set to the actual value of used to generate the observations, and we set . The value of was obtained by tuning on measurement vectors generated with the same true state sequence, but with independently generated noise values. The prior distribution over states assigns probability one to the true value of (which is in our setup) and zero elsewhere. The second algorithm is our algorithm (NH) described in Section 4. For our algorithm, we use the parameters and . These parameters were also obtained by tuning over a range of values for and . We also compare our algorithm with the particle filter (PF), which uses the same parameters as with the Bayesian algorithm, and predicts using the expected state under the (approximate) posterior distribution. For our algorithm, we use actions, and for the particle filter, we use particles. For our experiments, we use an implementation of the particle filter due to [12].
Figure 3 shows the true state and the states predicted by our algorithm (Blue) and the Bayesian algorithm (Red) for two different values of for independent simulations. Table 1 summarizes the performance of these algorithms for different values of the parameter , for two different values of the noise parameter . We report the average and standard deviation of the RMSE (root-mean-squared error) between the true state and the predicted state. The RMSE is computed over the state predictions for a single simulation, and these RMSE values are averaged over independent simulations.
Our experiments show that the Bayesian algorithm performs well when , that is, when it is supplied with the correct noise model; however, its performance degrades rapidly as increases, and becomes very poor even at . On the other hand, the performance of our algorithm does not suffer appreciably when increases. The degradation in performance of the Bayesian algorithm is even more pronounced when the noise is high with respect to the signal (). The particle filter suffers an even higher degradation in performance, and has poor performance even when (that is, when of the observations are generated from the correct likelihood distribution supplied to the particle filter). Our results indicate that the Bayesian algorithm is very sensitive to model mismatches. On the other hand, our algorithm, when equipped with a clipped-loss function, is extremely robust to model mismatches. In particular, our algorithm provides an RMSE value of even under high noise (), when is as high as .
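The evaluation metric described above, the RMSE per simulation followed by an average and standard deviation across simulations, can be sketched in Python as follows. The use of the sample standard deviation is our assumption, and the function names are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def rmse(true_states: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-squared error over the state predictions of one simulation."""
    squared = [(a - b) ** 2 for a, b in zip(true_states, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def summarize(rmse_values: Sequence[float]) -> Tuple[float, float]:
    """Mean and (sample) standard deviation of RMSE over independent simulations."""
    return statistics.mean(rmse_values), statistics.stdev(rmse_values)
```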
Some additional experiments with our algorithm are included in the supplementary appendix; they illustrate how the performance of our algorithm varies with the parameters and , and tabulate the performance of our algorithm for higher values of .
[Table 1: average and standard deviation of the RMSE of each algorithm, for low noise and high noise.]
6 Related work
The generative approach to tracking has roots in control and estimation theory, starting with the seminal work of Kalman [1]. The most popular generative method used in tracking is the particle filter [2], and its numerous variants. The literature here is vast, and there have been many exciting developments in recent years (e.g. [4, 13]); we refer the reader to [14] for a detailed survey of the results.
The suboptimality of the Bayesian algorithm under model mismatch has been investigated in other contexts such as classification [15, 16]. The view of the Bayesian algorithm as an online learning algorithm for log-loss is well-known in various communities, including information theory / MDL [17, 18] and computational learning theory [19, 20]. In our work, we look beyond the Bayesian algorithm and log-loss to consider other loss functions and algorithms that are more appropriate for our task.
7 Conclusions
In this paper, we introduce an explanatory framework for tracking based on online learning, which broadens the space for designing algorithms that need not conform to the standard Bayesian approach to tracking. We propose a new algorithm for tracking in this framework that deviates significantly from the Bayesian approach. Experimental results show that our algorithm significantly outperforms the Bayesian algorithm, even when the observations are generated by a distribution deviating just slightly from the model supplied to the Bayesian algorithm. Our work reveals an interesting connection between decision-theoretic online learning and Bayesian filtering.
References
 [1] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME—Journal of Basic Engineering, 82(D):35–45, 1960.
 [2] A. Doucet, N. de Freitas, and N. J. Gordon. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.
 [3] M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. International Journal on Computer Vision, 28(1):5–28, 1998.
 [4] R. van der Merwe, A. Doucet, N. de Freitas, and E. Wan. The unscented particle filter. In Advances in Neural Information Processing Systems, 2000.
 [5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press, 2006.
 [6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
 [7] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
 [8] A. Anonymous. Anonymous submission, 2009.
 [9] N. Cesa-Bianchi, Y. Freund, D. P. Helmbold, D. Haussler, R. E. Schapire, and M. K. Warmuth. How to use expert advice. In STOC, pages 382–391, 1993.
 [10] L. P. Kaelbling, M. L. Littman, and A. P. Moore. Reinforcement learning: A survey. J. Artif. Intell. Res. (JAIR), 4:237–285, 1996.
 [11] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Conference on Computer Vision and Pattern Recognition, 2001.
 [12] N. de Freitas. Matlab codes for particle filtering, 2000. www.cs.ubc.ca/~nando/software/upf_demos.tar.gz.
 [13] M. Klaas, N. de Freitas, and A. Doucet. Toward practical Monte Carlo: The marginal particle filter. In UAI, 2005.
 [14] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Technical report, 2008. www.cs.ubc.ca/~arnaud/doucet_johansen_tutorialPF.pdf.
 [15] P. Domingos. Bayesian averaging of classifiers and the overfitting problem. In ICML, 2000.
 [16] P. Grünwald and J. Langford. Suboptimal behavior of Bayes and MDL in classification under misspecification. Machine Learning, 66(2–3):119–149, 2007.
 [17] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 39(4):1280–1292, 1993.
 [18] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
 [19] Y. Freund. Predicting a binary sequence almost as well as the optimal biased coin. In COLT, pages 89–98, 1996.
 [20] S. M. Kakade and A. Y. Ng. Online bounds for Bayesian algorithms. In NIPS, 2004.
 [21] M. Herbster and M. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.
 [22] W. Koolen and S. de Rooij. Combining expert advice efficiently. In COLT, 2008.