1 Introduction
Visual tracking is a fundamental task in many computer vision applications, including event analysis, visual surveillance, human behaviour analysis, and video retrieval [18]. It is a challenging problem, mainly because the appearance of tracked objects changes over time. Designing an appearance model that is robust against intrinsic object variations (shape deformation and pose changes) and extrinsic variations (camera motion, occlusion, illumination changes) has attracted a large body of work [4, 24].
Rather than relying on object models based on a single training image, more robust models can be obtained through the use of several images, as evidenced by the recent surge of interest in object recognition techniques based on image-set matching. Among the many approaches to image-set matching, superior discrimination accuracy, as well as increased robustness to practical issues (such as pose and illumination variations), can be achieved by modelling image sets as linear subspaces [10, 11, 12, 20, 21, 22].
In spite of the above observations, we believe modelling via linear subspaces is not completely adequate for object tracking. We note that all linear subspaces of a given order share a common origin. As such, linear subspaces are theoretically robust against translation: a linear subspace extracted from a set of points does not change if all the points are shifted equally. While the resulting robustness against small shifts is attractive for object recognition, the task of tracking is generally to maintain precise locations of objects.
To address the above problem, in this paper we first propose to model objects, as well as candidate areas that the objects may occupy, through the use of generalised linear subspaces, namely affine subspaces, where the origin of each subspace can vary. As a result, the tracking problem can be seen as finding the affine subspace in a given frame that is most similar to the object's affine subspace. We furthermore propose a novel approach to measuring distances between affine subspaces, via the non-Euclidean geometry of Grassmann manifolds combined with the Mahalanobis distance between the origins of the subspaces. See Fig. 1 for a conceptual illustration of the proposed distance measure.
To the best of our knowledge, this is the first time that appearance is modelled by affine subspaces for object tracking. The proposed approach is somewhat related to adaptive subspace tracking [13, 19]. Ho et al. [13] represent an object as a point in a linear subspace, which is constantly updated. As the subspace is computed using only recent tracking results, the tracker may drift if large appearance changes occur. In addition, the location of the tracked object is inferred by measuring a point-to-subspace distance, in contrast to the proposed method, where a more robust subspace-to-subspace distance is used.
Ross et al. [19] improved tracking robustness against large appearance changes by modelling objects in a low-dimensional subspace, updated incrementally using all preceding frames. Their method also relies on a point-to-subspace distance measurement to localise the object.
The proposed method should not be confused with the subspace learning on Grassmann manifolds proposed by Wang et al. [25]. More specifically, in [25] an online subspace learning scheme using Grassmann manifold geometry is devised to learn/update the subspace of object appearances. In contrast to the proposed method, they also use a point-to-subspace distance to localise objects.
2 Proposed Affine Subspace Tracker (AST)
The proposed Affine Subspace Tracker (AST) is comprised of four components, overviewed below. A block diagram of the proposed tracker is shown in Fig. 2.

Motion Estimation. This component takes into account the history of object motion in previous frames and creates a set of candidates as to where the object might be found in the new frame. To this end, it parameterises the motion of the object between consecutive frames as a distribution via a particle filter framework [2]. Particle filters are sequential Monte Carlo methods that use a set of points to represent the distribution. As a result, instead of scanning the whole new frame to find the object, only highly probable locations are examined.

Candidate Subspaces. This module encodes the appearance of a candidate (associated with a particle) by an affine subspace. This is achieved by taking into account the history of tracked images and learning the origin and basis of the subspace for each particle.

Decision Making. This module measures the likelihood of each candidate subspace against the object models stored in the bag. Since object models are also encoded by affine subspaces, this module determines the similarity between affine subspaces. The candidate subspace most similar to the bag is selected as the result of tracking.

Bag of Models. This module keeps a history of previously seen objects in a bag. This is primarily driven by the fact that a more robust and flexible tracker can be attained if a history of variations in the object's appearance is kept [15]. To understand the benefit of the bag of models, assume that tracking of a person is desired, where the appearance of the whole body is encoded as an object model. Moreover, assume that at some point in time only the upper body of the person is visible (due to partial occlusion) and the tracker has successfully learned the new appearance. If the tracking system is only aware of the most recently seen appearance (the upper body in our example), then upon termination of the occlusion, the tracker is likely to lose the object. Keeping a set of models (in our example, both the upper body and the whole body) helps the tracking system cope with drastic changes.
Each of the components is elucidated in the following subsections.
2.1 Motion Estimation
In the proposed framework, we aim to obtain the location (x_t, y_t) and the scale s_t of an object in frame t based on prior knowledge about previous frames. A blind search in the space of (x, y, s) is obviously inefficient, since not all possible combinations of x, y and s are plausible. To efficiently search this space, we use a sequential Monte Carlo method known as the Condensation algorithm [14] to determine which combinations are most probable at time t. The key idea is to represent the space by a density function and estimate it through a set of random samples (also known as particles). As the number of particles becomes large, the condensation method approaches the optimal Bayesian estimate of the density function. Below, we briefly describe how the condensation algorithm is used within the proposed tracking approach.
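One iteration of the condensation update (weighted resampling followed by Brownian diffusion, as detailed below) can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation: the function name, the (x, y, s) state layout and the diagonal covariance vector are our assumptions.

```python
import numpy as np

def condensation_step(particles, weights, motion_cov, rng=None):
    """One resample-and-diffuse step of the condensation algorithm.

    particles  : (N, 3) array of (x, y, s) states from the previous frame
    weights    : (N,) normalised particle weights (summing to 1)
    motion_cov : (3,) diagonal of the Brownian-motion covariance (constant over time)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)
    # Sample N indices with replacement, proportional to the weights;
    # heavy particles may be duplicated, light ones may vanish.
    idx = rng.choice(n, size=n, replace=True, p=weights)
    chosen = particles[idx]
    # Independent Brownian motion: zero-mean Gaussian noise with a
    # diagonal covariance that is a constant parameter of the tracker.
    noise = rng.normal(0.0, np.sqrt(motion_cov), size=chosen.shape)
    return chosen + noise
```

With a degenerate weight vector and zero motion covariance, every output particle is an exact copy of the dominant particle, which makes the resampling behaviour easy to verify.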
Let p_t^{(i)} = (x_t^{(i)}, y_t^{(i)}, s_t^{(i)}) denote a particle at time t. By virtue of the principle of importance sampling [2], the density of the space (or most probable candidates) at time t is estimated as a set of N particles S_t = {p_t^{(i)}}, i = 1, ..., N, using the previous particles S_{t-1} and their associated weights w_{t-1}^{(i)}, with Σ_i w_{t-1}^{(i)} = 1. For now we assume the associated weights of the particles are known, and later discuss how they can be determined.
In the condensation algorithm, to generate S_t, the set S_{t-1} is first sampled (with replacement) N times. The probability of choosing a given element p_{t-1}^{(i)} is equal to its associated weight w_{t-1}^{(i)}. Therefore, particles with high weights might be selected several times, leading to identical copies of elements in the new set, while others with relatively low weights may not be chosen at all. Next, each chosen element undergoes an independent Brownian motion step. Here, the Brownian motion of a particle is modelled by a Gaussian distribution with a diagonal covariance matrix Σ. As a result, for a particle p chosen in the first step of the condensation algorithm, a new particle is obtained as a random sample of N(p, Σ), where N(μ, Σ) denotes a Gaussian distribution with mean μ and covariance Σ. The covariance Σ governs the speed of motion, and is a constant parameter over time in our framework.

2.2 Candidate Templates
To accommodate variations in object appearance, this module models the appearance of particles (we loosely use "particle appearance" to mean the appearance of the candidate template described by a particle) by affine subspaces; see Fig. 3 for a conceptual example. An affine subspace A is a subset of Euclidean space [23], formally described by a 2-tuple {u, U} as:

A = { u + Ux : x ∈ R^n }   (1)

where u ∈ R^D and U ∈ R^{D×n} are the origin and basis of the subspace, respectively.
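Given a set of vectorised image patches, the origin and basis of such a subspace can be estimated via the mean and a thin SVD, as described below. The following is an illustrative NumPy sketch; the function name and array shapes are our assumptions, not notation from the paper.

```python
import numpy as np

def learn_affine_subspace(patches, n_basis=3):
    """Fit an affine subspace {u, U} to a set of vectorised image patches.

    patches : (P, D) array, one row per vectorised patch
    n_basis : number of dominant left-singular vectors to keep (the subspace order n)
    Returns the origin u, shape (D,), and an orthonormal basis U, shape (D, n_basis).
    """
    X = np.asarray(patches, dtype=float)
    u = X.mean(axis=0)  # origin = mean of the image set
    # Thin SVD of the mean-centred data (columns = centred patches);
    # the left-singular vectors span the best-fitting linear subspace.
    U, _, _ = np.linalg.svd((X - u).T, full_matrices=False)
    return u, U[:, :n_basis]
```

The returned basis is orthonormal by construction, which is what the Grassmann-based distance in Sec. 2.4 requires.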
Let v(p_t^{(i)}) ∈ R^D denote the vector representation of a patch extracted from frame t according to the values of particle p_t^{(i)}. That is, frame t is first scaled based on the value s_t^{(i)}, and then a patch with its top-left corner located at (x_t^{(i)}, y_t^{(i)}) is extracted. The appearance model for p_t^{(i)} is generated from a set of images by considering the previous results of tracking. More specifically, let b_t denote the result of tracking at time t, i.e., the particle most similar to the bag of models at time t. Then the set consisting of the patches of the P most recent tracking results together with v(p_t^{(i)}) is used to obtain the appearance model for particle p_t^{(i)}. More specifically, the origin u of the affine subspace associated with p_t^{(i)} is the mean of this set. The basis U is obtained by computing the Singular Value Decomposition (SVD) of the mean-centred set and choosing the n dominant left-singular vectors.

2.3 Bag of Models
Although an affine subspace accommodates the appearance changes within one set of images, to produce a robust tracker the object model should reflect appearance changes throughout the tracking process. Accordingly, we propose to keep a set of object models to cope with deformations, pose variations, occlusions, and other variations of the object during tracking.
Fig. 4 shows two frames with a tracked object, the bag models used to localise the object, and the recent images of the image set used to generate each bag model.
A bag B = {T^{(1)}, ..., T^{(k)}} is defined as a set of k object models, each an affine subspace learned during the tracking process. The bag is updated periodically (see Fig. 5) by replacing the oldest model with the latest learned model (i.e., the latest result of tracking). The size of the bag determines the memory of the tracking system; a large bag with several models might be required to track an object in a challenging scenario. A fixed bag size and update rate are used in all our experiments.
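A minimal sketch of such a bag as a fixed-size buffer with a periodic update follows; the class and parameter names are illustrative, since the paper does not prescribe an implementation.

```python
from collections import deque

class ModelBag:
    """Fixed-size bag of affine-subspace models, updated every `rate` frames."""

    def __init__(self, size, rate):
        # deque with maxlen drops the oldest model automatically when full
        self.models = deque(maxlen=size)
        self.rate = rate
        self._frames = 0

    def maybe_update(self, new_model):
        """Call once per frame; inserts the latest learned model every `rate` frames."""
        self._frames += 1
        if self._frames % self.rate == 0:
            self.models.append(new_model)
```

The deque's `maxlen` implements the "replace the oldest model" policy: appending to a full bag silently evicts the model that has been in the bag the longest.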
Having a set of models at our disposal, we will next address how the similarity between a particle’s appearance and the bag can be determined.
2.4 Decision Making
Given the previously learned affine subspaces as input to this module, the aim is to find the affine subspace nearest to the bag templates. The simplest distance measure between two affine subspaces is the minimal Euclidean distance (the minimum distance over all pairs of points from the two subspaces); however, this measure does not form a metric [5], and it does not consider the angular distance between the subspaces, which can be a useful discriminator [16]. Conversely, the angular distance ignores the origins of affine subspaces and reduces the problem to the linear subspace case, which we wish to avoid.
To address the above limitations, we propose a distance measure of the following form:

dist(A_1, A_2) = α d_G(A_1, A_2) + (1 − α) d_M(u_1, u_2)   (2)

where d_G is the geodesic distance between two points on a Grassmann manifold [7], d_M is the Mahalanobis distance between the origins u_1 and u_2 of A_1 and A_2, and α ∈ [0, 1] is a mixing weight. The components of the proposed distance are described below.
A Grassmann manifold G(D, n) (a special type of Riemannian manifold) is defined as the space of all n-dimensional linear subspaces of R^D, for 0 < n < D. A point on a Grassmann manifold is represented by an orthonormal basis, i.e., a D×n matrix. The length of the shortest smooth curve connecting two points on a manifold is known as the geodesic distance. For Grassmann manifolds, the geodesic distance between the subspaces spanned by X and Y is given by:

d_G(X, Y) = ||Θ||_2   (3)
where Θ = [θ_1, θ_2, ..., θ_n] is the principal angle vector, i.e.:

cos(θ_i) = max_{x_i ∈ span(X), y_i ∈ span(Y)} x_i^T y_i   (4)

subject to ||x_i|| = ||y_i|| = 1, x_i^T x_j = 0 and y_i^T y_j = 0 for j < i. The principal angles have the property θ_i ∈ [0, π/2] and can be computed through the SVD of X^T Y [7].
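The geodesic distance of Eqns. (3)-(4) can be computed directly from the singular values of X^T Y, as in the following sketch (illustrative NumPy code, with function and variable names of our choosing):

```python
import numpy as np

def grassmann_geodesic_distance(U1, U2):
    """Geodesic distance between the spans of two orthonormal bases.

    U1, U2 : (D, n) matrices with orthonormal columns.
    The cosines of the principal angles are the singular values of U1^T U2.
    """
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    # Clip guards against round-off pushing singular values slightly above 1
    thetas = np.arccos(np.clip(s, -1.0, 1.0))  # principal angle vector Theta
    return np.linalg.norm(thetas)              # ||Theta||_2
```

Identical subspaces give all principal angles equal to zero (distance 0), while mutually orthogonal n-dimensional subspaces give all angles equal to π/2, so the distance is √n · π/2.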
We note that a linear combination of a Grassmann distance (between linear subspaces) and a Mahalanobis distance (between origins) of two affine subspaces has roots in probabilistic subspace distances [9]. More specifically, consider two normal distributions p_1 = N(u_1, Σ_1) and p_2 = N(u_2, Σ_2), with Σ_i = U_i U_i^T + σ² I as the covariance matrix and u_i as the mean vector. The symmetric Kullback-Leibler (KL) distance between p_1 and p_2 under the orthonormality condition U_i^T U_i = I results in:

J_KL(p_1, p_2) = (1 / (2σ²(σ²+1))) [ 2n − 2||U_1^T U_2||_F² + (u_1 − u_2)^T (2(σ²+1)I − U_1 U_1^T − U_2 U_2^T)(u_1 − u_2) ]   (5)

The term 2n − 2||U_1^T U_2||_F² in (5) is twice the squared projection distance on the Grassmann manifold (defined as d_P(U_1, U_2) = (n − ||U_1^T U_2||_F²)^{1/2}) [9], while the second term is a Mahalanobis distance between the origins with M = 2(σ²+1)I − U_1 U_1^T − U_2 U_2^T.
Since the geodesic distance is a more natural choice for measuring lengths on Grassmann manifolds (compared to the projection distance), we have elected to combine it with the Mahalanobis distance from (5), resulting in the following instantiation of the general form given in Eqn. (2):

dist(A_1, A_2) = α ||Θ||_2 + (1 − α)(u_1 − u_2)^T M (u_1 − u_2)

where M = 2(σ²+1)I − U_1 U_1^T − U_2 U_2^T.
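A sketch of this combined distance in NumPy follows. We assume M = 2(σ²+1)I − U_1 U_1^T − U_2 U_2^T, with σ² and α treated as tracker parameters; the function and parameter names are illustrative.

```python
import numpy as np

def affine_subspace_distance(u1, U1, u2, U2, alpha=0.5, sigma_sq=1.0):
    """Mixture of a Grassmann geodesic term and a Mahalanobis term.

    u1, u2   : (D,) subspace origins
    U1, U2   : (D, n) orthonormal bases
    alpha    : mixing weight between the two terms
    sigma_sq : the sigma^2 parameter entering the matrix M
    """
    # Geodesic term: principal angles from the SVD of U1^T U2
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    d_geo = np.linalg.norm(np.arccos(np.clip(s, -1.0, 1.0)))
    # Mahalanobis term between origins, M = 2(sigma^2+1)I - U1 U1^T - U2 U2^T
    D = len(u1)
    M = 2.0 * (sigma_sq + 1.0) * np.eye(D) - U1 @ U1.T - U2 @ U2.T
    diff = u1 - u2
    d_mah = float(diff @ M @ diff)
    return alpha * d_geo + (1.0 - alpha) * d_mah
```

Identical affine subspaces yield distance zero, while shifting one origin in a direction orthogonal to a shared basis contributes only through the Mahalanobis term.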
We measure the likelihood of a candidate subspace A^{(i)}, given a bag template T^{(j)}, as follows:

p(A^{(i)} | T^{(j)}) ∝ exp( − dist(A^{(i)}, T^{(j)})² / (2Γ²) )   (6)

where Γ indicates the standard deviation of the likelihood function and is a parameter of the tracking framework. The likelihoods are normalised such that Σ_i p(A^{(i)} | T^{(j)}) = 1
. To measure the likelihood between a candidate affine subspace A^{(i)} and the bag B, the individual likelihoods between A^{(i)} and the bag templates must be integrated. Based on [17], we opt for the sum rule:

p(A^{(i)} | B) = (1/k) Σ_{j=1}^{k} p(A^{(i)} | T^{(j)})   (7)
The object state is then estimated as:

A* = argmax_{A^{(i)}} p(A^{(i)} | B)   (8)
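The likelihood computation, sum rule, and candidate selection of Eqns. (6)-(8) can be combined as in the sketch below. Here the pairwise distances are assumed to be precomputed, Γ is written `gamma`, and the function name is our choice.

```python
import numpy as np

def select_best_candidate(distances, gamma=1.0):
    """Pick the candidate closest to the bag via the sum rule.

    distances : (N, k) matrix; distances[i][j] = dist(candidate i, template j)
    gamma     : standard deviation of the Gaussian likelihood (a tracker parameter)
    Returns the index of the winning candidate.
    """
    d = np.asarray(distances, dtype=float)
    # Per-template Gaussian likelihoods, Eqn. (6)
    lik = np.exp(-d**2 / (2.0 * gamma**2))
    # Normalise over candidates so each template's likelihoods sum to 1
    lik = lik / lik.sum(axis=0, keepdims=True)
    # Sum rule over the k bag templates, Eqn. (7)
    bag_lik = lik.mean(axis=1)
    # Object state = most likely candidate, Eqn. (8)
    return int(np.argmax(bag_lik))
```

The candidate with the smallest distances to the bag templates receives the largest integrated likelihood and is returned as the tracking result.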
2.5 Computational Complexity
The computational complexity of the proposed tracking framework is dominated by generating a new model and comparing a target candidate with a model. The model generation step (computing the mean and the thin SVD of a D×P matrix of vectorised patches) requires O(DP²) operations. Computing the geodesic distance between two points on G(D, n) requires O(Dn² + n³) operations (forming U_1^T U_2 followed by its SVD). Therefore, comparing an affine subspace candidate against each bag template needs O(Dn²) operations for n ≪ D.
3 Experiments
In this section we evaluate and analyse the performance of the proposed AST method using eight publicly available videos (the videos and the corresponding ground truth are available at http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml), covering two main tracking tasks: face and object tracking. The sequences are: Occluded Face [1], Occluded Face 2 [4], Girl [6], Tiger 1 [4], Tiger 2 [4], Coke Can [4], Surfer [4], and Coupon Book [4]. Example frames from several videos are shown in Fig. 6.
Each video is composed of 8-bit grayscale images, resized to a common resolution. We used the raw pixel values as image features. For the sake of computational efficiency in the affine subspace representation, we resized each candidate image region to a fixed size, and the number of eigenvectors (n) used in all experiments is set to three. Furthermore, we only consider 2D translation and scaling in the motion modelling component. The batch size (P) for the template update is set to five, as a trade-off between computational efficiency and the ability to model appearance change during fast motion.

We evaluated the proposed tracker based on (i) average center location error, and (ii) precision [4]. Precision is the percentage of frames for which the estimated object location is within a threshold distance of the ground truth. Following [4], we use a fixed threshold of 20 pixels.
To contrast the effect of affine subspace modelling against linear subspaces, we assessed the performance of the AST tracker against a tracker that only exploits linear subspaces, i.e., an AST where the origins are ignored for all models. The results, in terms of center location errors, are shown in Table 1. The proposed AST method significantly outperforms the linear subspace approach, supporting the use of affine subspace modelling.
Video  proposed AST  linear subspace 

Surfer  
Coke Can  
Girl  
Tiger 1  
Tiger 2  
Coupon Book  
Occluded Face  
Occluded Face 2  
average error 
3.1 Quantitative Comparison
To assess and contrast the performance of the AST tracker against state-of-the-art methods, we consider six competitors: the fragment-based tracker (FragTrack) [1], the multiple instance boosting-based tracker (MILTrack) [4, 3], online AdaBoost (OAB) [8], tracking-learning-detection (TLD) [15], incremental visual tracking (IVT) [19], and the Sparsity-based Collaborative Model tracker (SCM) [26]. We use the publicly available source code for FragTrack (http://www.cs.technion.ac.il/~amita/fragtrack/fragtrack.htm), MILTrack and OAB (http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml), TLD (http://info.ee.surrey.ac.uk/Personal/Z.Kalal/), IVT (http://www.cs.toronto.edu/~dross/ivt/) and SCM (http://ice.dlut.edu.cn/lu/Project/cvpr12_scm/cvpr12_scm.htm).
Tables 2 and 3 show the performance in terms of location error and precision, respectively, for the proposed AST method as well as the competing trackers. Fig. 6 shows the resulting bounding boxes for several frames from the Surfer, Coupon Book, Occluded Face 2 and Girl sequences. On average, the proposed AST method obtains notably better performance than the competing trackers, with TLD being the second best tracker.
Video  AST  TLD  MILTrack  SCM  OAB  IVT  FragTrack 

(proposed)  [15]  [4]  [26]  [8]  [19]  [1]  
Surfer  
Coke Can  
Girl  
Tiger 1  
Tiger 2  
Coupon Book  
Occluded Face  
Occluded Face 2  
average error 
Video  AST  TLD  MILTrack  SCM  OAB  IVT  FragTrack 

(proposed)  [15]  [4]  [26]  [8]  [19]  [1]  
Surfer  
Coke Can  
Girl  
Tiger 1  
Tiger 2  
Coupon Book  
Occluded Face  
Occluded Face 2  
average precision 
3.2 Qualitative Comparison
Heavy occlusions. Occlusion is one of the major issues in object tracking. Trackers such as SCM, FragTrack and IVT are designed to address this problem. Other trackers, including TLD, MILTrack and OAB, are less successful in handling occlusions, especially at frames 271, 529 and 741 of the Occluded Face sequence, and frames 176, 432 and 607 of Occluded Face 2. SCM obtains good performance mainly because it handles partial occlusions via a patch-based model. The proposed AST approach can tolerate occlusions to some extent, thanks to the properties of its appearance model. One prime example is Occluded Face 2, where AST accurately localised the severely occluded object at frame 730.
Pose Variations. On the Tiger 2 sequence, most trackers, including SCM, IVT and FragTrack, fail to track the object from the early frames onwards; the proposed AST approach can accurately follow the object at frames 207 and 271, where all the other trackers have failed. In addition, compared to the other trackers, the proposed approach partly handles motion blur (frame 344), a side-effect of rapid pose variations. On Tiger 1, although TLD obtains the best performance, AST (in contrast to the other trackers) can successfully locate the object at frames 204 and 249, which are subject to occlusion and severe illumination changes.
Rotations. The Girl and Surfer sequences include drastic out-of-plane and in-plane rotations. On Surfer, FragTrack and SCM fail to track from the start. The proposed AST approach consistently tracks the surfer and outperforms the other trackers. On Girl, the IVT, OAB, and FragTrack methods fail to track in many frames. While IVT is able to track at the beginning, it fails after frame 230. The AST approach manages to track the correct person throughout the whole sequence, especially towards the end, where the other trackers fail due to heavy occlusion.
Illumination changes. The Coke Can sequence contains dramatic illumination changes. FragTrack fails from frame 20, where the first signs of illumination changes appear. IVT and OAB fail from frame 40, where the frames include both severe illumination changes and slight motion blur. MILTrack fails after frame 179, where part of the object is almost washed out by the light. Since affine subspaces provide a degree of robustness to illumination changes, the proposed AST approach can accurately locate the object throughout the whole sequence.
Imposters/Distractors. The Coupon Book sequence contains a severe appearance change, as well as an imposter book intended to distract the tracker. FragTrack and TLD fail mainly when the imposter book appears. AST successfully tracks the correct book with notably better accuracy than the other methods.
4 Main Findings and Future Directions
In this paper we investigated the problem of object tracking in a video stream where object appearance can drastically change due to factors such as occlusions and/or variations in illumination and pose. The selection of subspaces for target representation, together with a regular subspace update, is mainly driven by the need for an adaptive object template that reflects appearance changes. We argued that modelling the appearance by affine subspaces, and applying this notion to both the object templates and the query data, leads to more robustness. Furthermore, we maintain a record of previously observed templates for a more robust tracker.
We also presented a novel subspacetosubspace measurement approach by reformulating the problem over Grassmann manifolds, which provides the target representation with more robustness against intrinsic and extrinsic variations. Finally, the tracking problem was considered as an inference task in a Markov Chain Monte Carlo framework using particle filters to propagate sample distributions over time.
Comparative evaluation on challenging video sequences against several state-of-the-art trackers shows that the proposed AST approach obtains superior accuracy, effectiveness and consistency with respect to illumination changes, partial occlusions, and various appearance changes. Unlike the other methods, AST involves no training phase.
There are several remaining challenges, such as drift and motion blur, that need to be addressed. A solution to drift could be to formulate the update process in a semi-supervised fashion, in addition to including a training stage for the detector. Future research directions also include enhancing the updating scheme by measuring the effectiveness of a newly learned model before adding it to the bag of models. To resolve motion blur issues, the framework could be enhanced by introducing blur-driven models and particle filter distributions. Furthermore, an interesting extension would be multi-object tracking and how to combine multiple object models.
Acknowledgements
NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence program.
References

[1] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragments-based tracking using the integral histogram. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 798–805, 2006.
[2] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
[3] B. Babenko, M. Yang, and S. Belongie. Visual tracking with online multiple instance learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 983–990, 2009.
[4] B. Babenko, M. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1619–1632, 2011.
[5] R. Basri, T. Hassner, and L. Zelnik-Manor. Approximate nearest subspace search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):266–278, 2011.
[6] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 232–237, 1998.
[7] A. Edelman, T. Arias, and S. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[8] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In British Machine Vision Conference, volume 1, pages 47–56, 2006.
[9] J. Hamm and D. Lee. Extended Grassmann kernels for subspace-based learning. In Advances in Neural Information Processing Systems (NIPS), pages 601–608, 2009.
[10] M. Harandi, C. Sanderson, C. Shen, and B. C. Lovell. Dictionary learning and sparse coding on Grassmann manifolds: An extrinsic solution. In Int. Conference on Computer Vision (ICCV), 2013.
[11] M. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2705–2712, 2011.
[12] M. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Kernel analysis on Grassmann manifolds for action recognition. Pattern Recognition Letters, 34(15):1906–1915, 2013.
[13] J. Ho, K. Lee, M. Yang, and D. Kriegman. Visual tracking using learned linear subspaces. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 782–789, 2004.
[14] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In European Conference on Computer Vision (ECCV), pages 343–356, 1996.
[15] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.
[16] T. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1005–1018, 2007.
[17] J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.
[18] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang. Incremental learning of 3D-DCT compact representations for robust visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):863–881, 2013.
[19] D. Ross, J. Lim, R. Lin, and M. Yang. Incremental learning for robust visual tracking. Int. Journal of Computer Vision (IJCV), 77(1):125–141, 2008.
[20] C. Sanderson, M. Harandi, Y. Wong, and B. C. Lovell. Combined learning of salient local descriptors and distance metrics for image set face verification. In IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pages 294–299, 2012.
[21] S. Shirazi, M. Harandi, C. Sanderson, A. Alavi, and B. C. Lovell. Clustering on Grassmann manifolds via kernel embedding with application to action analysis. In Int. Conference on Image Processing (ICIP), pages 781–784, 2012.
[22] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2273–2286, 2011.
[23] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[24] S. Wang, H. Lu, F. Yang, and M.-H. Yang. Superpixel tracking. In Int. Conference on Computer Vision (ICCV), pages 1323–1330, 2011.
[25] T. Wang, A. Backhouse, and I. Gu. Online subspace learning on Grassmann manifold for moving object tracking in video. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 969–972, 2008.
[26] W. Zhong, H. Lu, and M.-H. Yang. Robust object tracking via sparsity-based collaborative model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1838–1845, 2012.