Visual Tracking via Reliable Memories

02/04/2016 ∙ by Shu Wang, et al. ∙ UNC Charlotte, Rutgers University, Columbia University

In this paper, we propose a novel visual tracking framework that intelligently discovers reliable patterns across a long span of video to resist drift error in long-term tracking tasks. First, we design a Discrete Fourier Transform (DFT) based tracker that is able to exploit a large number of tracked samples while still ensuring real-time performance. Second, we propose a clustering method with temporal constraints to explore and memorize consistent patterns from previous frames, which we name reliable memories. By virtue of this method, our tracker can utilize uncontaminated information to alleviate drifting issues. Experimental results show that our tracker performs favorably against other state-of-the-art methods on benchmark datasets. Furthermore, it is particularly effective at handling drift and is able to robustly track challenging long videos of over 4,000 frames, while most of the others lose track in early frames.




1 Introduction

Visual tracking is one of the fundamental and challenging problems in computer vision and artificial intelligence. Though much progress has been achieved in recent years [Yilmaz et al.2006, Wu et al.2013], unsolved issues remain due to complicating factors such as illumination and viewpoint changes, background clutter, shape deformation, and occlusion. Extensive studies on visual tracking employ a tracking-by-detection framework and achieve promising results by extending existing machine learning methods (usually discriminative) with online learning [Avidan2004, Avidan2007, Grabner et al.2008, Saffari et al.2009]. To adaptively model various appearance changes, they deal with a large number of samples (here, "samples" refers to positive (and negative) target patches for trackers based on generative (respectively, discriminative) models) at both the detection and updating stages. However, all of them face the same dilemma: while more samples grant better accuracy and adaptiveness, they also come with higher computational cost and a greater risk of drifting. In addition to discriminative methods, [Ross et al.2008, Mei and Ling2011, Wang and Lu2014] utilize generative models with a fixed learning rate to account for target appearance changes. The learning rate is essentially a trade-off between adaptiveness and stability. However, even with a very small rate, former samples' influence on these models still drops exponentially over frames, and drift error may still accumulate. To alleviate drift error, [Babenko et al.2011, Hare et al.2011, Zhang et al.2014b] are designed to exploit hidden structured information around the target region. Other methods [Collins and Liu2003, Avidan2007, Kwon and Lee2010] try to avoid drifting by making the current model a combination of the labeled samples in the first frame and the samples learned during tracking. However, only a limited number of samples (e.g., those from the first frame) can be regarded as "very confident," which in turn restricts their robustness in long-term challenging tasks. Recently, several methods [Bolme et al.2010, Danelljan et al.2014b, Henriques et al.2015] employ the Discrete Fourier Transform (DFT) to perform extremely fast detection, achieving high accuracy at very low computational cost. However, like other generative methods, the memory length of their models is limited by a fixed forgetting rate, and they therefore still suffer from accumulated drift error in long-term tasks.

A key observation is that when the tracked target moves smoothly, e.g., without severe occlusion or out-of-plane rotation, its appearances across frames share high similarity in the feature space (e.g., edge features). Conversely, when it undergoes drastic movements such as in-plane/out-of-plane rotations or occlusions, its appearances may be much less similar to previous ones. Therefore, if we impose a temporal constraint when clustering these samples, such that only temporally adjacent ones can be grouped together, large clusters with high intra-cluster correlation indicate the periods when the target experiences small appearance changes. We take human memory as an analogy for these clusters, using reliable memories to denote large clusters that have been consistently perceived for a long time. In this context, earlier memories supported by more samples are more likely to be reliable than recent ones with fewer supporting samples, especially as drift error accumulates across frames. Thus, a tracker may recover from drift error by preferring candidates that correlate highly with earlier memories.

Based on these motivations, we propose a novel tracking framework, which efficiently explores self-correlated appearance clusters across frames and then preserves reliable memories for long-term robust visual tracking. First, we design a DFT-based visual tracker that is capable of retrieving good memories from a vast number of tracked samples for accurate detection while still ensuring real-time performance. Second, we propose a novel clustering method with temporal constraints to discover distinct and reliable memories from previous frames, which help our tracker resist drift error. This method harvests the inherent correlation of the streaming data and is guaranteed to converge quickly (by careful design upon the Integral Image, it costs less than 30 ms even for thousands of frames). To the best of our knowledge, our temporally constrained clustering method is novel in visual streaming-data analysis, and its fast convergence and promising performance show great potential for online streaming problems. In particular, it is very competent at discovering clusters (i.e., reliable memories) consisting of uncontaminated, sequentially tracked samples, and it grants our tracker a remarkable ability to resist drift error. Experimental results show that our tracker is considerably competent at handling drift error and performs favorably against other state-of-the-art methods on benchmark datasets. Further, it can robustly track challenging long videos of over 4,000 frames, while most of the others lose track in early frames.

2 Circulant Structure based Visual Tracking

Recent works [Bolme et al.2010, Henriques et al.2012, Danelljan et al.2014b, Henriques et al.2015] achieve state-of-the-art tracking accuracy at very low computational cost by exploiting the inherent relationship between the DFT and the circulant structure of dense sampling on the target region. In this section, we briefly introduce these methods, which are closely related to our work.


Let $\mathbf{x}$ be a vector of an image patch of size $M \times N$, centered at the target, and let $\mathbf{x}_i = P_i \mathbf{x}$ denote a 2D circular shift of $\mathbf{x}$ by $i$ (where $i$ indexes all possible shifts, $i \in \{0, \dots, MN-1\}$). $\mathbf{y}$ is a vector of a designed response map of size $M \times N$, with a Gaussian pulse centered at the target, too. $\kappa$ is a positive definite kernel function defined by a mapping $\varphi$, i.e., $\kappa(\mathbf{x}, \mathbf{x}') = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x}') \rangle$. We aim to find a linear classifier $f(\mathbf{x}) = \langle \mathbf{w}, \varphi(\mathbf{x}) \rangle$ that minimizes the Regularized Least Squares (RLS) cost function:

$$\varepsilon(\mathbf{w}) = \sum_i \big( f(\mathbf{x}_i) - y_i \big)^2 + \lambda\, \| \mathbf{w} \|_{\mathcal{H}}^2 \tag{1}$$

The first term is an empirical risk that penalizes the difference between the designed Gaussian response $y_i$ and the output $f(\mathbf{x}_i)$. The second term is a regularizer; it is denoted by $\|\mathbf{w}\|_{\mathcal{H}}^2$ since $\mathbf{w}$ lies in the Reproducing Kernel Hilbert Space induced by $\kappa$.

By the Representer Theorem [Schölkopf et al.2001], the cost can be minimized by a linear combination of the inputs: $\mathbf{w} = \sum_i \alpha_i \varphi(\mathbf{x}_i)$. By defining the kernel matrix $K$ with $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, a much simpler form of Eq. 1 can be derived:

$$\varepsilon(\boldsymbol{\alpha}) = \| K \boldsymbol{\alpha} - \mathbf{y} \|^2 + \lambda\, \boldsymbol{\alpha}^{\top} K \boldsymbol{\alpha} \tag{2}$$

This function is convex and differentiable, and has the closed-form minimizer $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$. As proved in [Henriques et al.2012], if the kernel $\kappa$ is unitarily invariant, its kernel matrix is circulant: $K = C(\mathbf{k})$, where the vector $\mathbf{k}$ has entries $k_i = \kappa(\mathbf{x}, P_i \mathbf{x})$. Here $P_i$ is a permutation matrix that cyclically shifts vectors by $i$ element(s), and $C(\mathbf{k})$ is the circulant matrix obtained by concatenating all possible cyclic shifts of $\mathbf{k}$. Then $\boldsymbol{\alpha}$ can be obtained without inverting $K$ by:

$$\mathcal{F}(\boldsymbol{\alpha}) = \frac{\mathcal{F}(\mathbf{y})}{\mathcal{F}(\mathbf{k}) + \lambda} \tag{3}$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ are the DFT and its inverse. The division in Eq. 3 is in the Fourier domain and is thus performed element-wise. In practice, there is no need to compute $\mathbf{w}$ from $\boldsymbol{\alpha}$, since fast detection can be performed on a given image patch $\mathbf{z}$ by $\hat{\mathbf{y}} = \mathcal{F}^{-1}\big( \mathcal{F}(\bar{\mathbf{k}}) \odot \mathcal{F}(\boldsymbol{\alpha}) \big)$, where $\bar{k}_i = \kappa(\mathbf{z}_i, \hat{\mathbf{x}})$ and $\hat{\mathbf{x}}$ is the learned target appearance. The pulse peak in $\hat{\mathbf{y}}$ gives the target translation in the input image $\mathbf{z}$. Detailed derivations are in [Gray2005, Rifkin et al.2003, Henriques et al.2012].
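To make the circulant-structure machinery concrete, here is a minimal NumPy sketch of the training and detection steps around Eq. 1-3. A Gaussian (RBF) kernel is assumed, as in CSK/KCF; the function names and the bandwidth `sigma` are ours for illustration only:

```python
import numpy as np

def kernel_correlation(a, b, sigma=0.5):
    """Gaussian kernel evaluated against all cyclic shifts of b, via the FFT.

    Returns the vector k with k[i] = kappa(a, shift_i(b)); sigma is an
    assumed bandwidth for illustration.
    """
    # Circular cross-correlation of a and b, computed in the Fourier domain.
    c = np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))).real
    # d[i] is the (normalized) squared distance between a and the i-shifted b.
    d = (np.sum(a ** 2) + np.sum(b ** 2) - 2.0 * c) / a.size
    return np.exp(-np.maximum(d, 0.0) / sigma ** 2)

def train(x, y, lam=1e-4):
    """Classifier coefficients in the Fourier domain, as in Eq. 3."""
    k = kernel_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_f, x_model, z):
    """Fast detection on patch z; the response peak gives the target translation."""
    k = kernel_correlation(z, x_model)
    resp = np.fft.ifft2(np.fft.fft2(k) * alpha_f).real
    return np.unravel_index(np.argmax(resp), resp.shape)
```

Training and detection each cost only a few FFTs, which is what makes dense sampling over all cyclic shifts affordable.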

Though the recent methods MOSSE [Bolme et al.2010], CSK [Henriques et al.2012] and ACT [Danelljan et al.2014b] have different configurations of kernel functions and features (e.g., a dot-product kernel leads to MOSSE, and an RBF kernel leads to the latter two), all of them employ a simple linear combination to learn the target appearance model at the current frame $t$:

$$\hat{\mathbf{x}}_t = (1 - \gamma)\, \hat{\mathbf{x}}_{t-1} + \gamma\, \mathbf{x}_t \tag{4}$$

While CSK updates its classifier coefficients by Eq. 4 directly, MOSSE and ACT update the numerator and denominator of the coefficients separately for stability. The learning rate $\gamma$ is a trade-off parameter between long memory and model adaptiveness. Expanding Eq. 4, we obtain:

$$\hat{\mathbf{x}}_t = \gamma \sum_{s=1}^{t} (1 - \gamma)^{\,t-s}\, \mathbf{x}_s + (1 - \gamma)^{t}\, \hat{\mathbf{x}}_0 \tag{5}$$

This shows that all three methods have an exponentially decaying memory: though the learning rate $\gamma$ is usually small, the impact of the sample from frame $s$ shrinks as $(1-\gamma)^{t-s}$ and becomes negligible after a few hundred frames. In other words, these learning-rate based trackers are unable to fall back on samples accurately tracked long before to help resist accumulated drift error.
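Expanding Eq. 4 shows that the weight of the sample from $s$ frames ago is $\gamma(1-\gamma)^{t-s}$. A small Python check makes the fading rate concrete (the learning-rate value 0.025 is an assumed, typical setting, not taken from any specific tracker):

```python
gamma = 0.025  # assumed, typical learning-rate value; not from the paper

def sample_weight(age):
    """Weight of a sample added `age` frames ago under the running-average
    update x_hat_t = (1 - gamma) * x_hat_{t-1} + gamma * x_t."""
    return gamma * (1.0 - gamma) ** age

for age in (0, 50, 100, 200, 400):
    print(age, sample_weight(age))
```

With gamma = 0.025, a sample's weight drops below 1% of the newest sample's weight after roughly 180 frames, so evidence gathered even a few hundred frames earlier has essentially no influence on the model.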


Figure 2: Left: the distance matrix described in Alg. 1. Right: six representative clusters with corresponding colored bounding boxes, shown for intuitive understanding. The image patch in each big bounding box is the average appearance of a cluster (memory), while the small patches are samples chosen evenly over the temporal domain from each cluster.

3 Proposed Method

Aside from the convolution-based visual trackers mentioned above, many other trackers [Jepson et al.2003, Nummiaro et al.2003, Ross et al.2008, Babenko et al.2011] also update their models in a similar form with a learning-rate parameter, and they suffer from the same drifting problem.

We observe that smooth movements usually offer consistent appearance cues, which can be modeled as reliable memories to recover the tracker from drifting caused by drastic appearance changes (illustrated in Fig. 1). In this section, we introduce our method, which explores, preserves and makes use of reliable memories for long-term robust visual tracking. First, we introduce our framework, which is capable of handling a vast number of samples while still ensuring fast detection. Then, we detail how past samples are intelligently arranged into distinct and reliable clusters that grant our tracker resistance to drift error.

3.1 The Circulant Tracker over Vast Samples

Given a new positive sample $\mathbf{x}_t$ at frame $t$, we aim to build an adaptive model for fast detection in the coming frame with sample image $\mathbf{z}$ by

$$\hat{\mathbf{y}} = \mathcal{F}^{-1}\big( \mathcal{F}(\bar{\mathbf{k}}) \odot \mathcal{F}(\boldsymbol{\alpha}_t) \big) \tag{6}$$

$\hat{\mathbf{y}}$ is the response map, whose peak shows the estimated translation of the target position; the vector $\bar{\mathbf{k}}$ has $i$-th entry $\bar{k}_i = \kappa(\mathbf{z}_i, \hat{\mathbf{x}}_t)$. As we advocated, this model should be built upon vast samples for robustness and adaptiveness. Thus, $\hat{\mathbf{x}}_t$ should have the form:

$$\hat{\mathbf{x}}_t = \eta\, \mathbf{x}_t + (1 - \eta) \sum_{j=1}^{L} \beta_j\, \mathbf{x}_j \tag{7}$$

As shown, the adaptively learned appearance $\hat{\mathbf{x}}_t$ is a combination of past samples $\{\mathbf{x}_j\}$, with a certain proportion $\eta$ concentrated on the current sample $\mathbf{x}_t$. The coefficients $\beta_j$ represent the correlation between the current estimated appearance and the past appearances $\mathbf{x}_j$. A proper choice of $\boldsymbol{\beta}$ should make the model: 1) adaptive to new appearance changes, and 2) consistent with past appearances to avoid the risk of drifting. In this paper, we argue that setting $\boldsymbol{\beta}$ with preference to previous reliable memories provides our tracker with considerable robustness against drift error. We discuss how to find these reliable memories in Sec. 3.2, and their connection with $\boldsymbol{\beta}$ is introduced in Sec. 3.3.

Now, we focus on finding a set of classifier coefficients $\boldsymbol{\alpha}_t$ that fit both the learned appearance $\hat{\mathbf{x}}_t$ for consistency and the current appearance $\mathbf{x}_t$ for adaptiveness. Based on Eq. 1 and Eq. 2, we derive the following cost function to minimize:

$$\varepsilon(\boldsymbol{\alpha}_t) = \big\| \hat{K} \boldsymbol{\alpha}_t - \mathbf{y} \big\|^{2} + \mu\, \big\| K \boldsymbol{\alpha}_t - \mathbf{y} \big\|^{2} + \lambda\, \boldsymbol{\alpha}_t^{\top} \hat{K} \boldsymbol{\alpha}_t \tag{8}$$

where the kernel matrices are $\hat{K}_{ij} = \kappa(\hat{\mathbf{x}}_i, \hat{\mathbf{x}}_j)$ and $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, with kernel vector entries $\hat{k}_i = \kappa(\hat{\mathbf{x}}_t, P_i \hat{\mathbf{x}}_t)$ (similar for $k_i$). $\mu$ is a balance factor between memory consistency and model adaptiveness. By setting the derivative $\partial \varepsilon / \partial \boldsymbol{\alpha}_t = 0$, the accurate solution satisfies a complicated condition as follows:

$$\big( \hat{K}^{2} + \mu K^{2} + \lambda \hat{K} \big)\, \boldsymbol{\alpha}_t = \big( \hat{K} + \mu K \big)\, \mathbf{y} \tag{9}$$

We observe that the adaptively learned appearance $\hat{\mathbf{x}}_t$ should be very close to the current one $\mathbf{x}_t$, since it is a linear combination of close appearances in the past and the current appearance $\mathbf{x}_t$, as shown in Eq. 7. Notice that both kernel matrices $\hat{K}$ and $K$ (and their linear combination with $\mu$) are positive semidefinite. By relaxing Eq. 9 with $\hat{K} \approx K$, we obtain an approximate minimizer in a very simple form:

$$\mathcal{F}(\boldsymbol{\alpha}_t) = \frac{\mathcal{F}(\mathbf{y})}{\mathcal{F}(\hat{\mathbf{k}}) + \lambda / (1 + \mu)} \tag{10}$$

$\boldsymbol{\beta}$ is an $L$-dimensional vector of the form $(\beta_1, \dots, \beta_L)^{\top}$, with the properties $\boldsymbol{\beta} \succeq 0$ and $\boldsymbol{\beta}^{\top} \mathbf{1}_L = 1$ ($\mathbf{1}_L$ is an $L$-dimensional vector of ones). Note that the division in Eq. 10 is performed element-wise.

As long as we find a proper set of coefficients $\boldsymbol{\beta}$, we can build up our detection model by Eq. 7 and Eq. 10. In the next frame, fast detection can be performed by Eq. 6 with this learned model.
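The appearance combination described around Eq. 7 can be sketched as follows. This is a simplified NumPy illustration: the blending proportion `eta`, the normalization of the `betas` weights, and all names are choices of ours, not the paper's exact implementation:

```python
import numpy as np

def blend_appearance(x_new, memories, betas, eta=0.15):
    """Blend the current sample with beta-weighted past appearances (Eq. 7's form).

    `eta` (the proportion on the current sample) and the normalization of
    `betas` are illustrative choices, not the paper's exact settings.
    """
    betas = np.asarray(betas, dtype=float)
    betas = betas / betas.sum()                          # enforce sum(beta) = 1
    past = sum(b * m for b, m in zip(betas, memories))   # weighted past appearance
    return eta * x_new + (1.0 - eta) * past
```

Concentrating the `betas` mass on a single reliable memory, rather than spreading it evenly, is what lets the model snap back to an uncontaminated appearance after drift.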

  Input: Integral image $II$ of the distance matrix $D$, with $D_{ij} = d(\phi(\mathbf{x}_i), \phi(\mathbf{x}_j))$; initial segmentation $\mathcal{C} = \{c_1, \dots, c_L\}$, each $c_k$ holding a single sample; stopping factor $\epsilon$.
  Output: Temporally constrained clusters $\mathcal{C}$.
  while $\mathcal{C}$ changed in the last pass do
     for each pair of temporally adjacent clusters $(c_k, c_{k+1})$ do
        Evaluate the average intra-cluster distance $g(c_k \cup c_{k+1})$ using $II$.
        if $g(c_k \cup c_{k+1}) \le \epsilon$ then
           $c_k \leftarrow c_k \cup c_{k+1}$, remove $c_{k+1}$ from $\mathcal{C}$;
        end if
     end for
  end while
Algorithm 1 Temporally Constrained Clustering Algorithm

3.2 Clustering with Temporal Constraints

In this subsection, we introduce our temporally constrained clustering, which learns distinct and reliable memories from the incoming samples very quickly. Together with the ranked memories (Sec. 3.3), our tracker is robust to inaccurate tracking results and is able to recover from drift error.

Suppose a set of positive samples $X = \{\mathbf{x}_1, \dots, \mathbf{x}_L\}$ is given at frame $t$, and we would like to divide it into subsets $c_k$ with an indexing vector set $\mathcal{C}$, such that $\bigcup_k c_k = X$. Our objectives are as follows: 1) samples in each subset are highly correlated; 2) samples from different subsets have relatively large appearance differences, so that a linear combination of them would be vague or even ambiguous as a description of the tracked target (e.g., samples from different viewpoints of the target). This can be modeled as a very general clustering problem:

$$\min_{\mathcal{C}}\ \sum_{c_k \in \mathcal{C}} g(c_k) + \rho\, r(|\mathcal{C}|) \tag{11}$$

The function $g(c_k)$ measures the average sample distance in the feature space $\phi$ within subset $c_k$: $g(c_k) = \frac{1}{|c_k|^2} \sum_{i, j \in c_k} d(\phi(\mathbf{x}_i), \phi(\mathbf{x}_j))$. The regularizer $r(\cdot)$ is a function of the number of subsets $|\mathcal{C}|$, and $\rho$ is a balance factor. This is a discrete optimization problem and is known to be NP-hard. By fixing the number of subsets to a certain constant $k$, $k$-means clustering can converge to a local optimum.

However, during the process of visual tracking, we do not know the appropriate number of clusters in advance. While too many clusters cause over-fitting, too few clusters may lead to ambiguity. More importantly, as long as we allow arbitrary combinations of samples during clustering, any cluster risks absorbing samples contaminated by drift error, or even wrongly labeled samples, which in turn degrades the performance of models built upon them.

One important observation is that target appearances close to each other in the temporal domain may form a very distinct and consistent pattern, i.e., a reliable memory. For example, if a well-tracked target moves around without large rotations or viewpoint changes for a period of time, its edge-based features will be far more similar to each other than to features observed under different viewpoints. In order to discover these memories, we add temporal constraints to Eq. 11:

$$c_k = \{ \mathbf{x}_i \mid t_k \le i < t_{k+1} \} \tag{12}$$

Eq. 11 with Eq. 12 then becomes the problem of segmenting $X$ into subsets $\{c_k\}$, where each subset only contains temporally continuous samples ($t_k$ are certain frame numbers).

Still, the constraint of this new problem is discrete, and the global optimum can hardly be reached. We carefully design a greedy algorithm, shown in Alg. 1, which starts from the trivial segmentation in which every sample is its own subset. It reduces the regularizer in the objective of Eq. 11 by combining temporally adjacent subsets $c_k$ and $c_{k+1}$, while penalizing the resulting increase in the average sample distance $g$.

With an intelligent use of the Integral Image [Viola and Jones2001], the evaluation in each combining step of Alg. 1 takes only $O(1)$ time given the integral image $II$, and each iteration takes a linear number of operations. The whole algorithm proceeds in a bottom-up binary-tree structure, runs in $O(L \log L)$ time in the worst case, and takes less than 30 ms on a desktop even for thousands of samples. Designed experiments will show that the proposed algorithm is very competent at finding distinct appearance clusters (reliable memories) for our tracker to learn.
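A compact Python sketch of this procedure follows. The merge test below, which thresholds the merged segment's average intra-distance against epsilon, is our reading of Alg. 1 rather than its exact criterion; the integral-image trick for O(1) block sums is as described above:

```python
import numpy as np

def integral_image(D):
    """2D prefix sums with a zero border: II[i, j] = sum(D[:i, :j])."""
    II = np.zeros((D.shape[0] + 1, D.shape[1] + 1))
    II[1:, 1:] = D.cumsum(0).cumsum(1)
    return II

def block_mean(II, a, b):
    """Mean of D over the block [a, b) x [a, b), in O(1) via the integral image."""
    s = II[b, b] - II[a, b] - II[b, a] + II[a, a]
    return s / float((b - a) ** 2)

def temporal_clusters(D, eps):
    """Greedily merge temporally adjacent segments while the merged segment's
    average intra-distance stays below eps (a sketch of Alg. 1)."""
    II = integral_image(D)
    bounds = list(range(D.shape[0] + 1))  # segment k spans [bounds[k], bounds[k+1])
    changed = True
    while changed:
        changed = False
        k = 0
        while k + 2 < len(bounds):
            a, c = bounds[k], bounds[k + 2]
            if block_mean(II, a, c) <= eps:  # merging keeps the cluster tight
                del bounds[k + 1]
                changed = True
            else:
                k += 1
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]
```

Because only temporally adjacent segments may merge, a contaminated sample can never be absorbed into a distant, already-reliable cluster.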

3.3 The Workflow of Our Tracking Framework

Two feature pools are employed in our framework: one for the incoming positive samples across frames, and the second (denoted by $\mathcal{M}$) for the learned memories. Every memory $m_k$ contains a certain number of samples and a confidence $c_k$:

$$c_k = n_k\, (t - t_k) \tag{13}$$

where $n_k$ is the number of samples in memory $m_k$ and $t_k$ is the beginning frame number of memory $m_k$. This memory confidence is consistent with our hypothesis that earlier memories with more samples are more stable and less likely to be affected by accumulated drift error. For each frame, we first detect the object using Eq. 6 to estimate the translation of the target, and then utilize this new sample to update our appearance model by Eq. 7 and Eq. 10.

The correlation coefficients $\beta_j$ are then calculated by:

$$\beta_j = \frac{1}{Z}\, \big\langle \phi(\mathbf{x}_j),\, \phi(\hat{\mathbf{x}}_t) \big\rangle, \quad \mathbf{x}_j \in m^{*} \tag{14}$$

where the scalar $Z$ is a normalization term that ensures $\sum_j \beta_j = 1$, and $m^{*}$ is the memory most similar to the current learned appearance in the feature space $\phi$.

To update the memories, we use Alg. 1 to cluster the positive samples in the first feature pool into 'memories' and import all except the last one into $\mathcal{M}$. Note that when $|\mathcal{M}|$ reaches its threshold, the memory with the lowest confidence is abandoned immediately.
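The fixed-capacity memory pool can be sketched as follows. The confidence formula here (sample count times memory age) is an assumption consistent with the text's hypothesis that earlier memories with more samples rank higher; the paper's exact formula may differ, and all names are illustrative:

```python
import heapq

class MemoryPool:
    """Fixed-capacity pool of memories ranked by confidence.

    The confidence formula (n_samples * age) is an assumption consistent
    with the text; the paper's exact formula may differ.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # min-heap of (confidence, tie_breaker, memory)

    def add(self, memory, n_samples, start_frame, current_frame):
        conf = n_samples * (current_frame - start_frame)
        heapq.heappush(self.items, (conf, id(memory), memory))
        if len(self.items) > self.capacity:
            heapq.heappop(self.items)  # drop the lowest-confidence memory
```

A min-heap keeps the eviction of the weakest memory at O(log n) per update, which matters when the pool is refreshed every frame.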

4 Experiments

Our framework is implemented in Matlab and runs at real-time speeds on a desktop with an Intel Xeon(R) 3.5GHz CPU, a Tesla K40c video card and 32GB RAM. The adaptiveness ratio $\eta$ is set empirically and fixed through all experiments. The stopping factor $\epsilon$ is decided adaptively as a multiple of the average covariance of the samples in the first frames of each video. HOG [Dalal and Triggs2005] is chosen as the feature $\phi$. The maximum number of memories and the balance factor $\mu$ are also fixed.

Figure 3: Comparison on one long video. Blue, magenta, red and white represent results from MOSSE, ACT, Struck and VTD. Our results are shown as bold colored boxes with a dot on the top-right corner; each color indicates the active memory at that frame, shown in the bottom row. Learned appearances from MOSSE and ACT at each frame are also shown in the top row. As illustrated, the target undergoes an out-of-plane rotation that brings about drift error. When he turns his head back to the front, our method uses earlier reliable memories to recover from drifting. Note that these memories were built before drift error accumulated during the out-of-plane rotation period.

4.1 Evaluation of Temporally Constrained Clustering

In order to validate our assumption that temporally constrained clustering on sequentially tracked samples forms reliable and recognizable patterns, we perform Alg. 1 offline on the positive samples produced by our tracking results. Note that our algorithm gives exactly the same result in the online and offline settings, since previously fixed clusters have no effect on how later samples are clustered. Due to space limitations, we present illustrative results from two sequences, in Fig. 2 and Fig. 3.

Fig. 2 shows our detailed results on a sequence in which the target experiences illumination variation and in-plane and out-of-plane rotation over a long span of frames. The left part shows the distance matrix described in Alg. 1. A dark blue (light yellow) pixel implies a small (large) distance between samples $\mathbf{x}_i$ and $\mathbf{x}_j$ in the feature space $\phi$. Different colored diagonal bounding boxes represent different temporally constrained clusters. The right part shows six representative clusters, corresponding to the colored bounding boxes on the matrix. Two of the memories are large clusters containing many samples with highly correlated appearance (blue color). Another memory represents a cluster with only a few samples: its late emergence and limited number of samples result in a very low confidence, so it is not likely to replace any existing reliable memories.

Fig. 3 shows mid-term results on a long sequence, compared with two closely related trackers, MOSSE and ACT, together with their learned appearances at these frames. Several other state-of-the-art methods are also shown for comparison. MOSSE and ACT keep only one memory, which is generated from former appearances and gradually adapts to the current one. Though they are very robust to illumination changes, drift error still accumulates across frames, and they can hardly recover from drifts due to the conciseness of their models. Our method, with its learned reliable memories, is very robust to appearance changes and can recover from drifts when it observes appearances familiar from its memories.

4.2 Boosting by Deep CNN

Our tracker’s inherent requirement to efficiently search for familiar patterns (memories) at the global scale of the frame overlaps with the object detection task [Girshick2015, He et al.2015]. Recently, with the fast development of convolutional neural networks (CNN) [Krizhevsky et al.2012, Zeiler and Fergus2014], Faster-RCNN [Ren et al.2015] achieves near real-time detection speed by sharing convolutional layers between the object proposal and detection stages. To equip our tracker with a global vision for its reliable memories, we fine-tune the FC-layers of a Faster-RCNN detector (ZF-Net) once we have learned sufficient memories in a video, which helps our tracker resolve local-minimum issues caused by its limited effective detection range. Though supplied only with coarse detections that carry a risk of false alarms, our tracker can start from a local region close to the target and then ensure accurate and smooth tracking results. Note that we tune the CNN only once, with around 150 seconds of running time on one Tesla K40c for 3,000 iterations. When the tracking task is long, e.g., more than 3,000 frames, the amortized cost is small, which is certainly worthwhile for the significant improvement in robustness. Afterwards, we perform CNN detection periodically every few frames, each detection taking a fraction of a second.

Figure 4: Tracking result comparison on sequences from the OTB-2013 dataset. Our tracker, denoted RMT, achieves the top performance under the success-plot evaluation standard. MEEM, TGPR, DSST and KCF also perform close to our tracker. Only the top-performing trackers are shown for clarity.

4.3 Quantitative Evaluation

We first evaluate our method on challenging sequences from OTB-2013 [Wu et al.2013] against 14 state-of-the-art methods: ACT [Danelljan et al.2014b], ASLA [Jia et al.2012], CSK [Henriques et al.2012], CXT [Dinh et al.2011], DSST [Danelljan et al.2014a], KCF [Henriques et al.2015], LOT [Oron et al.2012], MEEM [Zhang et al.2014a], MOSSE [Bolme et al.2010], SCM [Zhong et al.2012], Struck [Hare et al.2011], TGPR [Gao et al.2014], TLD [Kalal et al.2012] and VTD [Kwon and Lee2010]. We employ the code released publicly (e.g., via OTB-2013) or by the authors, and all parameters are fixed for each tracker during testing. Fig. 4 shows the success plots on the whole dataset under the one-pass evaluation (OPE) standard. Our tracker, denoted RMT (Reliable Memory Tracker), obtains the best performance, while MEEM, TGPR, KCF and DSST also provide competitive results. TGPR's idea of building one tracker on auxiliary (very early) samples, and MEEM's idea of using tracker snapshots, can both be interpreted as making use of early-formed reliable memory patterns, which is closely related to our method. DSST designs a very concise pyramid representation for object scale estimation and employs robust dense HOG features to accurately estimate target motion. Our tracker outperforms the others in most challenging scenarios, e.g., occlusion, out-of-plane rotation, out of view and fast motion, as illustrated in Fig. 4. The main reason is that our tracker possesses a number of very reliable memories and a global vision that help it regain focus on the target after drastic appearance changes.

Sequence | Frames | Average center-location errors (pixels) of the seven evaluated trackers
Motocross | 2,035 | 295.9 / 181.5 / 182.5 / 67.5 / 44.7 / 33.4 / 21.5
Volkswagon | 4,000 | 60.6 / 114.1 / 41.3 / 122.7 / 15.9 / 51.1 / 12.3
Carchase | 4,000 | 125.0 / 129.4 / 98.0 / 132.6 / 34.4 / 38.1 / 34.1
Panda | 3,000 | 64.8 / 83.3 / 64.5 / 71.4 / 27.1 / 97.9 / 23.9
Overall | 13,035 | 118.5 / 122.3 / 86.1 / 105.3 / 28.7 / 55.1 / 23.1
Table 1: Tracking result comparison based on average center-location error in pixels (smaller is better) on four long-term videos, over 13,000 frames in total. Average performances are weighted by frame number for fairness.

To explore the robustness of our tracker and validate its resistance to drift error on long-term challenging tasks, we run it on four long sequences from [Kalal et al.2012], over 13,000 frames in total. We also evaluate the convolution-filter based methods most closely related to ours: MOSSE [Bolme et al.2010], KCF [Henriques et al.2015], ACT [Danelljan et al.2014b] and DSST [Danelljan et al.2014a], together with MEEM [Zhang et al.2014a] and the detector-based method TLD [Kalal et al.2012] (shown in Tab. 1). To make a fair comparison, we re-labeled the initial frame to ensure that no tracker loses focus at the beginning. While MOSSE often loses track in very early frames, KCF, ACT and DSST are able to track the target stably for hundreds of frames but usually cannot maintain focus after 600 frames. MEEM performs favorably for thousands of frames with its impressive robustness, but it cannot adapt to scale changes and thus still produces inaccurate results. Our tracker and TLD outperform the other five trackers on all videos, since both have a global vision to search for the target. However, based on an online random-forest model, TLD slowly takes in false positive samples, which eventually leads to false detections and inaccurate tracking results. In contrast, guided by the CNN detector trained with our reliable memories, our tracker is affected by only a very limited number of false detections. It robustly tracks the target across all frames, giving accurate target location and scale until the last frame of all four videos. A video clip with more detailed illustration and qualitative comparison is also provided.


5 Conclusion

In this paper, we propose a novel tracking framework, which explores temporally correlated appearance clusters across tracked samples and then preserves reliable memories for robust visual tracking. A novel clustering method with temporal constraints is carefully designed to help our tracker retrieve good memories from a vast number of samples for accurate detection, while still ensuring real-time performance. Experiments show that our tracker performs favorably against other state-of-the-art methods, with an outstanding ability to recover from drift error in long-term tracking tasks.


  • [Avidan2004] Shai Avidan. Support vector tracking. PAMI, 26(8):1064–1072, 2004.
  • [Avidan2007] S. Avidan. Ensemble Tracking. PAMI, 29(2):261, 2007.
  • [Babenko et al.2011] Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. Robust object tracking with online multiple instance learning. PAMI, 33(8):1619–1632, 2011.
  • [Bolme et al.2010] David S. Bolme, J. Ross Beveridge, Bruce A. Draper, and Yui Man Lui. Visual object tracking using adaptive correlation filters. In CVPR, pages 2544–2550, 2010.
  • [Collins and Liu2003] Robert Collins and Yanxi Liu. On-line selection of discriminative tracking features. In ICCV, pages 346–352, 2003.
  • [Dalal and Triggs2005] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893, 2005.
  • [Danelljan et al.2014a] Martin Danelljan, Gustav Häger, Fahad Khan, and Michael Felsberg. Accurate scale estimation for robust visual tracking. In BMVC, 2014.
  • [Danelljan et al.2014b] Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg, and Joost van de Weijer. Adaptive color attributes for real-time visual tracking. In CVPR, pages 1090–1097, 2014.
  • [Dinh et al.2011] Thang Ba Dinh, Nam Vo, and Gérard Medioni. Context tracker: Exploring supporters and distracters in unconstrained environments. In CVPR, pages 1177–1184. IEEE, 2011.
  • [Gao et al.2014] Jin Gao, Haibin Ling, Weiming Hu, and Junliang Xing. Transfer learning based visual tracking with gaussian processes regression. In ECCV, pages 188–203. Springer, 2014.
  • [Girshick2015] Ross Girshick. Fast r-cnn. In ICCV, 2015.
  • [Grabner et al.2008] Helmut Grabner, Christian Leistner, and Horst Bischof. Semi-supervised on-line boosting for robust tracking. In ECCV, pages 234–247. Springer, 2008.
  • [Gray2005] Robert M Gray. Toeplitz and circulant matrices: A review. Communications and Information Theory, 2(3):155–239, 2005.
  • [Hare et al.2011] Sam Hare, Amir Saffari, and Philip HS Torr. Struck: Structured output tracking with kernels. In ICCV, pages 263–270. IEEE, 2011.
  • [He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI, 37(9):1904–1916, 2015.
  • [Henriques et al.2012] João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, pages 702–715, 2012.
  • [Henriques et al.2015] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. PAMI, 2015.
  • [Jepson et al.2003] Allan D Jepson, David J Fleet, and Thomas F El-Maraghi. Robust online appearance models for visual tracking. PAMI, 25(10):1296–1311, 2003.
  • [Jia et al.2012] Xu Jia, Huchuan Lu, and Ming-Hsuan Yang. Visual tracking via adaptive structural local sparse appearance model. In CVPR, pages 1822–1829. IEEE, 2012.
  • [Kalal et al.2012] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Tracking-learning-detection. PAMI, 34(7):1409–1422, 2012.
  • [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105. Curran Associates, Inc., 2012.
  • [Kwon and Lee2010] Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition. In CVPR, pages 1269–1276, 2010.
  • [Mei and Ling2011] Xue Mei and Haibin Ling. Robust visual tracking and vehicle classification via sparse representation. PAMI, 33(11):2259–2272, 2011.
  • [Nummiaro et al.2003] Katja Nummiaro, Esther Koller-Meier, and Luc Van Gool. An adaptive color-based particle filter. IVC, 21(1):99–110, 2003.
  • [Oron et al.2012] Shaul Oron, Aharon Bar-Hillel, Dan Levi, and Shai Avidan. Locally orderless tracking. In CVPR, pages 1940–1947. IEEE, 2012.
  • [Ren et al.2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [Rifkin et al.2003] Ryan Rifkin, Gene Yeo, and Tomaso Poggio. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.
  • [Ross et al.2008] David A Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. Incremental learning for robust visual tracking. IJCV, 77(1-3):125–141, 2008.
  • [Saffari et al.2009] Amir Saffari, Christian Leistner, Jakob Santner, Martin Godec, and Horst Bischof. On-line random forests. In ICCVW, pages 1393–1400. IEEE, 2009.
  • [Schölkopf et al.2001] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In COLT, pages 416–426, 2001.
  • [Viola and Jones2001] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages I–511, 2001.
  • [Wang and Lu2014] Dong Wang and Huchuan Lu. Visual tracking via probability continuous outlier model. In CVPR, pages 3478–3485. IEEE, 2014.
  • [Wu et al.2013] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In CVPR, 2013.
  • [Yilmaz et al.2006] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Comput. Surv., 38(4), 2006.
  • [Zeiler and Fergus2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
  • [Zhang et al.2014a] Jianming Zhang, Shugao Ma, and Stan Sclaroff. Meem: Robust tracking via multiple experts using entropy minimization. In ECCV, pages 188–203. Springer, 2014.
  • [Zhang et al.2014b] Tianzhu Zhang, Si Liu, Narendra Ahuja, Ming-Hsuan Yang, and Bernard Ghanem. Robust visual tracking via consistent low-rank sparse learning. IJCV, pages 1–20, 2014.
  • [Zhong et al.2012] Wei Zhong, Huchuan Lu, and Ming-Hsuan Yang. Robust object tracking via sparsity-based collaborative model. In CVPR, pages 1838–1845. IEEE, 2012.