SPiKeS: Superpixel-Keypoints Structure for Robust Visual Tracking

10/23/2016, by François-Xavier Derue et al.

In visual tracking, part-based trackers are attractive since they are robust against occlusion and deformation. However, a part represented by a rectangular patch does not account for the shape of the target, while a superpixel does thanks to its boundary evidence. Nevertheless, tracking superpixels is difficult due to their lack of discriminative power. Therefore, to enable superpixels to be tracked discriminatively as object parts, we propose to enhance them with keypoints. By combining the properties of these two features, we build a novel element designated as a Superpixel-Keypoints structure (SPiKeS). Being discriminative, these new object parts can be located efficiently by a simple nearest neighbor matching process. Then, in a tracking process, each match votes for the target's center to give its location. In addition, the interesting properties of our new feature allow the development of an efficient model update for more robust tracking. According to experimental results, our SPiKeS-based tracker proves to be robust in many challenging scenarios by performing favorably against the state-of-the-art.




1 Introduction

A robot needs to track its target to interact with it. In visual surveillance, suspicious behaviors can be detected thanks to tracking. Hands are tracked for gesture recognition. These examples are only a small part of a wide range of tracking applications, which encourages many research efforts to focus on this topic. When no prior information about the object to track is available, tracking is referred to as model-free tracking. In a video, the goal is to locate a particular target given its location only in the first frame. This task is challenging because of numerous factors such as illumination variation modifying object color, occlusion hiding some parts, or new parts appearing if the viewpoint changes. While some of these issues are handled efficiently by different techniques, it is challenging for a single tracker to handle them all.

Trackers are generally split into two categories: discriminative and generative. Discriminative trackers consider tracking as a binary classification problem. Samples from foreground and background are selected to train a classifier that is able to separate the target from the rest of the scene. Afterwards, this target detection yields a location estimation. This is the typical “tracking-by-detection” framework followed by many discriminative trackers [2, 3, 4, 5], although Struck [6] achieves the classification and location in one step. In most approaches, samples are selected randomly, limiting their number for computational efficiency. Instead of random samples, Henriques et al. [7] proposed to select all the samples and exploited their redundancy to build a kernel classifier that tracks very quickly in the Fourier domain. Due to its simplicity and rapidity, many recent trackers build upon it [8, 9, 10].

Figure 1:

Decomposition of a frame into SPiKeS. Superpixels (black) structured by keypoints (red dots) linked by vectors (green).

In generative trackers, only foreground information models the target appearance and the tracking task aims to find the most similar image region to this model. In [11], the target is represented with a sparse model by using templates. The tracking location is the patch whose projection error into the template space is minimum. To account for appearance change and different kinds of motion, Kwon et al. [12] build different observation and motion models, so that each pair can be used within a basic tracker. These multiple basic trackers are then integrated into a main tracker, which is more robust thanks to the interaction between its components.

Although these methods can handle some appearance alterations, they are not robust against deformation and occlusion due to their holistic representation. These issues are usually handled by the family of part-based trackers. As the model is decomposed into several parts, an occlusion only affects some of them, without preventing the others from tracking the target. Typically, usual approaches consider the parts as rectangular patches structured in a grid [13, 14, 15, 16]. However, non-rectangular targets are not well represented because background patches inside the bounding box inevitably affect the model and make it drift. To address this, Li et al. [10] assign a reliability to patches so that noisy background patches do not affect the tracking.

Another part-based approach consists of oversegmenting the target into superpixels. Thanks to their boundary evidence, they take the shape into account better. In [17], a map is built showing the probability that a superpixel belongs to the target, and the target location is the area with maximum likelihood. This tracker shows good performance, but it needs a model that has to be learned in the first frames. Therefore, it needs manual annotation or another tracker for the initialization step. Recent approaches such as [18] and [19] propose to integrate superpixels in a matching-tracking framework. An appearance model is built with superpixels and each of them attempts to find a match in the new frame in order to locate the target. One common problem is the low discriminative power of superpixels, resulting in ambiguous matches. It then requires a complex matching strategy.

Keypoints are better features for matching. Because of their saliency and invariance to transformations, a keypoint-based appearance model can be matched efficiently even in case of occlusion and deformation. Nonetheless, keypoint-based trackers [20, 21, 22, 23] often fail to represent uniform regions, where no keypoints can be found.

Therefore, we hypothesize that superpixels and keypoints can complement each other. An object can always be segmented into superpixels but their lack of discriminative power makes them hard to match. Conversely, keypoints are more reliable to match but they poorly represent uniform-colored and non-textured regions. In our method, we propose to combine the assets of these two features in a single one: a Superpixel-Keypoints structure (SPiKeS). This is our first contribution. Figure 1 illustrates a frame decomposed into SPiKeS. Notice that keypoints contributing to a SPiKeS can be inside the superpixel or nearby. A single keypoint can contribute to many SPiKeS. Incomplete SPiKeS are possible if there is no keypoint around. In that case, they are only described with the superpixel.

Our second contribution is the design of a tracker that capitalizes on the SPiKeS. Experimental results show that our SPiKeS-based tracker performs well in numerous challenging situations and performs favorably against state-of-the-art trackers.

The paper is structured as follows. Section 2 presents works related to ours, i.e. trackers based on superpixels or keypoints. Our combination of these two features to build a SPiKeS is described in Section 3. Then, section 4 shows how this new feature can be integrated into a tracking framework for robust target location estimation. Finally, the evaluation of section 5 compares the proposed tracker to the state-of-the-art.

2 Related work

The idea of combining keypoints with other features for tracking has been exploited in [24]. RGB and LBP histograms are extracted from patches to create an appearance model. SIFT keypoints [25] are then detected and their spatial arrangement is described by a circular histogram that represents the global geometric structure of the target. Our method is more flexible against deformation, as each SPiKeS has its own keypoint structure, which allows local deformations.

Instead of patches, Liu et al. [26] proposed a tracker based on superpixels and SURF keypoints [27]. But unlike our proposition, their matching step only involves keypoints. The superpixels are only used for their boundary evidence, which is a useful clue when updating the model. Indeed, a new keypoint belonging to the same superpixel as a matched keypoint tends to be part of the target, since every point within a superpixel is likely to belong to the same object. We also consider this benefit in our method, but in addition, because we match superpixels, new keypoints can still be added even if there is no matching keypoint inside them. Therefore, our model update is more accurate.

Our localization process is inspired by [20, 21]. Their approach assigns to each matching keypoint a vote for the center of the target, allowing keypoints to locate the target independently from each other. Hierarchical clustering then converges to a consensus of votes such that outliers are removed. Finally, the selected votes estimate the position as a simple center of mass. Furthermore, Bouachir et al. [23] proposed to weight the votes according to the reliability of keypoints. We do the same, but instead of voting with keypoints, we vote with SPiKeS.

3 Superpixel-Keypoints Structure

As shown in figure 2, a Superpixel-Keypoints Structure (SPiKeS) consists of a superpixel and all the keypoints found in a region of radius $R$ around that superpixel’s center. This implies that keypoints can be inside or outside the superpixel. Each keypoint is linked to the superpixel’s center by a vector with a magnitude and an orientation. Therefore, a SPiKeS is a superpixel that has acquired a spatial structure of keypoints, making it more discriminative. A SPiKeS without any keypoints is simply a superpixel.

Figure 2: SPiKeS representation. Keypoints are found inside or nearby the superpixel, in a region of radius $R$ around its center. Keypoints’ relative positions are given by vectors.

3.1 SPiKeS definition

Let $s$ be a superpixel and $\{k_1, \dots, k_N\}$ the set of keypoints around $s$. We write the associated SPiKeS, denoted by $s^+$, as

$$s^+ = \{\, s,\ (k_i, v_i)_{i=1}^{N} \,\}, \qquad v_i = c_{k_i} - c_s \qquad (1)$$

with $c_s$ and $c_{k_i}$ the centers of superpixel $s$ and keypoint $k_i$ respectively. $N$ is the total number of keypoints found in a description region of radius $R$ centered on $c_s$. Therefore, we define a descriptor for a SPiKeS $s^+$ as being $D(s^+) = \{\, h,\ (d_i, v_i)_{i=1}^{N} \,\}$ with

  • $h$: HSV histogram of $s$

  • $d_i$: the descriptor of keypoint $k_i$.

  • $v_i$: the vector from the superpixel’s center $c_s$ to $c_{k_i}$
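As an illustration of the definition above, the grouping of keypoints into SPiKeS can be sketched with NumPy. All names here (`build_spikes`, the dictionary fields) are hypothetical conveniences, not part of the reference implementation:

```python
import numpy as np

def build_spikes(sp_centers, kp_positions, radius):
    """For each superpixel center, collect the keypoints lying within the
    description region of the given radius, together with the vectors
    linking the center to each keypoint (the v_i of the definition)."""
    spikes = []
    for c in sp_centers:
        diff = kp_positions - c                # vectors from center to all keypoints
        dist = np.linalg.norm(diff, axis=1)
        inside = dist <= radius                # keypoints in the description region
        spikes.append({
            "center": c,
            "kp_idx": np.flatnonzero(inside),  # indices of contributing keypoints
            "vectors": diff[inside],           # their relative positions
        })
    return spikes
```

Note that, as in the text, a keypoint may fall in the region of several superpixels and thus contribute to several SPiKeS, while a superpixel with no nearby keypoint yields an empty structure.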

3.2 SPiKeS comparison

In order to compare two SPiKeS, we propose a measure of similarity based on their descriptors. Let $S(p^+, q^+)$ be the similarity score between SPiKeS $p^+$ and $q^+$. Because both the color information of the superpixel and the keypoint structure are available, the score is a contribution of two terms:

$$S(p^+, q^+) = S_c(p^+, q^+) + S_k(p^+, q^+) \qquad (2)$$

The first term represents the similarity between the superpixels’ color histograms,

$$S_c(p^+, q^+) = 1 - d_B(h_p, h_q) \qquad (3)$$

with $d_B$ being the Bhattacharyya distance measure.

The second term represents the similarity between keypoint structures. The higher the number of matching keypoints between $p^+$ and $q^+$, the higher the score. Moreover, if both keypoints of a matching pair are positioned similarly with respect to their superpixel’s center, the score should also increase. Thus we define

$$S_k(p^+, q^+) = \sum_{i=1}^{N_p} \sum_{j=1}^{N_q} w_{ij}\, m_{ij} \qquad (4)$$

with $m_{ij} = 1$ if $k_i$ and $k_j$ match, else $m_{ij} = 0$. The factor $w_{ij}$ weights the contribution of a keypoint match by comparing edges $v_i$ and $v_j$. We compute $w_{ij}$ with the vector difference magnitude normalized by the diameter $2R$ of the description region:

$$w_{ij} = 1 - \frac{\| \hat{v}_i - \hat{v}_j \|}{2R} \qquad (5)$$

Note that to benefit from keypoint rotation invariance, $\hat{v}_i$ and $\hat{v}_j$ are the vectors $v_i$ and $v_j$ reoriented according to the principal orientations given by keypoints $k_i$ and $k_j$ respectively.

Finally, a threshold $\tau_c$ on the color term ensures a minimum of color similarity, to handle the case of wrongly matching keypoints resulting in a high $S_k$.
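The similarity measure described above — a color term plus a keypoint-structure term — can be sketched as follows. The exact normalizations and the color cut-off value are assumptions; `matches` is a hypothetical precomputed list of matching keypoint index pairs:

```python
import numpy as np

def color_similarity(h_p, h_q):
    """1 minus the Bhattacharyya (Hellinger) distance between two
    normalized histograms; higher means more similar."""
    bc = np.sum(np.sqrt(h_p * h_q))            # Bhattacharyya coefficient
    return 1.0 - np.sqrt(max(0.0, 1.0 - bc))

def structure_similarity(vec_p, vec_q, matches, radius):
    """vec_p / vec_q: relative keypoint vectors of the two SPiKeS
    (already reoriented for rotation invariance); matches: (i, j) pairs
    of matching keypoints. Each match contributes a weight that grows
    with the geometric agreement of the two vectors."""
    score = 0.0
    for i, j in matches:
        diff = np.linalg.norm(vec_p[i] - vec_q[j])
        score += max(0.0, 1.0 - diff / (2.0 * radius))  # normalized by diameter
    return score

def spikes_similarity(h_p, h_q, vec_p, vec_q, matches, radius, tau_color=0.2):
    s_color = color_similarity(h_p, h_q)
    if s_color < tau_color:   # minimum color similarity guards against
        return 0.0            # spurious keypoint matches
    return s_color + structure_similarity(vec_p, vec_q, matches, radius)
```

A SPiKeS pair with no keypoint match is thus scored by color alone, which is why the color threshold alone bounds the minimum valid score.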

4 The SPiKeS Tracker

In model-free tracking, the information about the target location in the first frame is given by a bounding box. The SPiKeS are extracted from it to represent the appearance model. In the subsequent frames, after oversegmentation and keypoint detection, we build SPiKeS and locate those that match the model. Then, each matching SPiKeS votes for a position in the frame. The target’s center is estimated from all the votes. If no occlusion is detected, the model is updated. These tracking steps are illustrated in figure 3.

Figure 3: Tracking steps of our SPiKeS-based tracker. Keypoint detection (b) and superpixel segmentation (c) are performed on an input frame (a). Each superpixel forms a SPiKeS with its surrounding keypoints (d). Our SPiKeS model is matched with the new SPiKeS, and the matching ones vote for the target’s center (f). The model is updated from the estimated bounding box (g) if no occlusion occurs.

4.1 Model

From the initial bounding box, we first detect keypoints, store them in a foreground pool $P_f$ and combine them with the extracted superpixels to form our model of SPiKeS: $\mathcal{M} = \{ s_i^+ \}_{i=1}^{N_\mathcal{M}}$. Then, each SPiKeS $s_i^+$ is assigned a vote vector $u_i$ such that it can locate the target’s center independently:

$$u_i^0 = c^0 - c_i^0 \qquad (6)$$

with $c^0$ the target’s center at time $t = 0$, known as the center of the initial bounding box. We refer to $c_i$ as the center of $s_i^+$, which is equivalent to its superpixel’s center.

In addition, we extract keypoints in a region surrounding the bounding box to keep a keypoint background model $P_b$, which will help to detect occlusions similarly to [28].

4.2 Matching

During tracking, we extract a pool of SPiKeS $\{ t_j^+ \}_{j=1}^{N_T}$ from the entire incoming frame. Afterwards, we apply a greedy matching algorithm. The first step is looking for the nearest neighbour $t_{j^*(i)}^+$ of every $s_i^+$:

$$j^*(i) = \operatorname*{arg\,max}_{j} S(s_i^+, t_j^+) \qquad (7)$$

However, a given $t_j^+$ could be the nearest neighbour of several different $s_i^+$, meaning a many-to-one match. Since a one-to-one match is required, only the highest score is kept:

$$\hat{i}(j) = \operatorname*{arg\,max}_{i \,:\, j^*(i) = j} S(s_i^+, t_j^+) \qquad (8)$$

At this point, there are one-to-one matches that we refer to as pairs of matches $(s_i^+, t_j^+)$.

The next step consists in the rejection of wrong pairs of matches. Firstly, a given SPiKeS of the model may not have a valid match in a given new frame, e.g. when a part is not visible. In this case, the nearest neighbour has a low matching score relative to a threshold and can be discarded. We set a different value for the threshold according to the presence or absence of matching keypoints. Indeed, if there are no keypoint matches, only the color provides the match between SPiKeS. As we already set a color threshold $\tau_c$, the minimum value of the matching score is $\tau_c$ according to equations 2 and 3. On the other hand, if there are matching keypoints, the score will always be higher than this minimum value, thus the threshold is set higher.

Secondly, as we assume the target motion is smooth and continuous in time, a match is also considered inconsistent if the displacement between $c_i^{t-1}$ and $c_j^t$ is too large with respect to recent motion.

Formally, a matching pair $(s_i^+, t_j^+)$ is valid and not discarded if and only if

$$S(s_i^+, t_j^+) \geq \tau, \qquad \tau = \begin{cases} \tau_c & \text{if } N_f = 0 \\ \tau_s > \tau_c & \text{otherwise} \end{cases} \qquad (9)$$

$$\| c_j^t - c_i^{t-1} \| \leq \alpha\, \| \hat{c}^{t-1} - \hat{c}^{t-2} \| \qquad (10)$$

with $N_f$ the total number of foreground keypoint matches within the pair, $\tau_s$ a score threshold parameter and $\alpha$ a motion constraint parameter.
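The greedy matching described above — nearest neighbour per model SPiKeS, then resolution of many-to-one conflicts, then rejection of weak matches — can be sketched over a precomputed score matrix. The function name and the single `min_score` cut-off are simplifying assumptions, and the motion-consistency test is omitted:

```python
import numpy as np

def greedy_one_to_one(scores, min_score=0.0):
    """scores[i, j]: similarity between model SPiKeS i and frame SPiKeS j.
    Returns valid (i, j) pairs: each model SPiKeS picks its nearest
    neighbour, then many-to-one conflicts keep only the best scorer."""
    nn = scores.argmax(axis=1)        # nearest neighbour of each model SPiKeS
    best_for_j = {}
    for i, j in enumerate(nn):
        if scores[i, j] < min_score:
            continue                  # reject weak matches (score threshold)
        if j not in best_for_j or scores[i, j] > scores[best_for_j[j], j]:
            best_for_j[j] = i         # resolve many-to-one: keep highest score
    return [(i, j) for j, i in best_for_j.items()]
```

A full implementation would additionally apply the per-pair threshold choice (color-only vs. keypoint-supported matches) and the displacement constraint before accepting a pair.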

4.3 Location Estimation

Once the retained matching pairs have been determined, each $t_j^+$ votes for a position in the frame according to the vote vector $u_i^{t-1}$ given by its respective $s_i^+$:

$$c_i^{vote} = c_j^t + u_i^{t-1} \qquad (11)$$

The estimated target location is computed by a weighted average of the votes:

$$\hat{c}^t = \frac{\sum_i \omega_i\, \phi_i\, c_i^{vote}}{\sum_i \omega_i\, \phi_i} \qquad (12)$$

The factors $\omega_i$ and $\phi_i$, as introduced in [23], are the persistence and predictive factors of $s_i^+$. They give more importance to SPiKeS that often match and vote correctly for the target’s center. More details are given in the next section.
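A minimal sketch of the weighted vote average, assuming the per-SPiKeS votes and the two reliability factors are already computed (names hypothetical):

```python
import numpy as np

def estimate_center(votes, persistence, predictive):
    """votes: (n, 2) positions voted by the matched SPiKeS;
    persistence / predictive: per-SPiKeS reliability factors.
    The target center is the weighted mean of the votes."""
    w = np.asarray(persistence, dtype=float) * np.asarray(predictive, dtype=float)
    return (np.asarray(votes, dtype=float) * w[:, None]).sum(axis=0) / w.sum()
```

Reliable parts (high persistence and predictive factors) thus pull the estimate toward their vote, while unreliable ones contribute little.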

4.4 Update

Section 4.1 introduced $P_f$ and $P_b$, our models of foreground and background keypoints. During the matching process, keypoints belonging to $P_b$ are matched at the same time as the foreground ones. Once the new bounding box has been evaluated, if the number of keypoints inside it that match the background model exceeds a threshold, an occlusion is detected. In that case, no update takes place and the next frame is processed. Otherwise, if no occlusion occurs, the following update scheme is applied.

Step 1: Descriptors and votes update. For each valid match $(s_i^+, t_j^+)$, the SPiKeS descriptor $D_i$ defined in subsection 3.1 is updated with:

$$D_i^t = (1 - \eta)\, D_i^{t-1} + \eta\, D_j^t \qquad (13)$$

with $\eta$ an interpolation factor. This simple formula adapts the model to gradual changes of illumination by updating the color of the superpixels and the position and description of the keypoints.

For a non-rigid target, local parts tend to move with respect to the center. Vote vectors need to be modified according to the SPiKeS’ new positions to take local deformations into account:

$$u_i^t = (1 - \eta)\, u_i^{t-1} + \eta\, (\hat{c}^t - c_j^t) \qquad (14)$$
As SPiKeS are the “parts” of our model, these terms are used interchangeably. A part that matches more often is easier to identify and constitutes a stable part of the model. Consequently, this part should have more weight in the final vote because it has proved its reliability by the persistence of its matches. This persistence is interpreted as a factor $\omega_i$. At initialization, we consider every SPiKeS from the initial bounding box equally reliable and set an initial weight $\omega_i^0$, which is updated as

$$\omega_i^t = (1 - \gamma)\, \omega_i^{t-1} + \gamma\, m_i^t \qquad (15)$$

with $m_i^t = 1$ if $s_i^+$ is a matching SPiKeS, $m_i^t = 0$ otherwise, and $\gamma$ a learning factor.

However, suppose an unexpected part of the background is included in our model. It could match as often as a foreground part if it is also present in the other frames. In that case, the persistence factors would be the same, while the foreground part should have more importance. It can be observed in figure 4 that, as a background SPiKeS does not follow the target, the center estimated by its vote will be far from the predicted location, whereas the votes of foreground SPiKeS will be closer. To leverage this behaviour, we introduce a predictive factor $\phi_i$. Given a factor $\phi_i^{t-1}$ at time $t-1$ for a SPiKeS belonging to the model, the predictive factor is updated such that it increases if the local prediction given by the vote is near the final location:

$$\phi_i^t = (1 - \gamma)\, \phi_i^{t-1} + \gamma\, \exp\!\left( -\frac{\| c_i^{vote} - \hat{c}^t \|}{2R} \right) \qquad (16)$$

These two factors $\omega_i$ and $\phi_i$ allow a SPiKeS to gain reliability if it often matches and votes correctly.
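The two running updates can be sketched as exponential-forgetting rules. The learning rate and the way the vote error is turned into a proximity score are assumed values here, not the paper's exact choices:

```python
import numpy as np

def update_factors(persistence, predictive, matched, votes, center,
                   lr=0.1, scale=20.0):
    """Running updates of the per-SPiKeS reliability factors.
    matched[i]: True if model SPiKeS i found a valid match this frame;
    votes[i]: its voted center position. `lr` (learning rate) and
    `scale` (vote-error normalization, in pixels) are assumed values."""
    matched = np.asarray(matched, dtype=float)
    # persistence grows for parts that keep matching, decays otherwise
    persistence = (1.0 - lr) * persistence + lr * matched
    # predictive grows when the vote lands near the estimated center
    err = np.linalg.norm(np.asarray(votes, dtype=float) - center, axis=1)
    proximity = np.exp(-err / scale)
    predictive = np.where(matched > 0,
                          (1.0 - lr) * predictive + lr * proximity,
                          predictive)  # unmatched parts keep their old value
    return persistence, predictive
```

Under this scheme, a background part that matches often but votes far from the target keeps a high persistence yet sees its predictive factor shrink, which is exactly the behaviour motivated above.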

Figure 4: A wrong vote by a background SPiKeS included in the model (cyan) results in a weak predictive factor $\phi$.

Step 2: SPiKeS insertion. To handle appearance changes, such as a pose change resulting in new parts not visible in the initial bounding box, one needs to add these new parts to the model. The main problem at this step is to avoid adding background parts that would make the model drift. Therefore, instead of naively adding all the SPiKeS from the bounding box, we select only superpixels and keypoints that make good SPiKeS candidates. Figure 5 illustrates how superpixels and keypoints help each other in selecting good candidates. As stated in the introduction, a keypoint inside a matching SPiKeS is assumed to belong to the target because of the boundary evidence given by the superpixel. Indeed, all the points inside that area are likely to belong to the same object. Similarly, a superpixel containing a matching foreground keypoint is more likely to belong to the target.

Figure 5: A new keypoint (yellow, right) inside a matching superpixel (cyan) can be added to because this superpixel belongs to the target, unlike the red keypoint. In the same way, a new superpixel (green, right) can be added to because it includes a matching keypoint (white).

Let $\mathcal{S}$ and $\mathcal{K}$ be the sets of superpixel and keypoint candidates that meet these conditions. At first, the set $\mathcal{K}$ is added to the foreground keypoint pool $P_f$. Then, the old SPiKeS from $\mathcal{M}$ refresh their structure with these new keypoints. Finally, we add the new SPiKeS made from the superpixels in $\mathcal{S}$ and the updated $P_f$.

In order to complete the keypoint-based background model, keypoints detected around the estimated bounding box are added to $P_b$ if they did not match the background keypoints.

Step 3: SPiKeS deletion. As we add SPiKeS, our model grows and increases the complexity of the matching process. Furthermore, some SPiKeS may be irrelevant, such as redundant or background SPiKeS, and need to be deleted. To keep a reasonable number of SPiKeS in our model, once a maximum size is exceeded, the weakest SPiKeS are removed based on their persistence factor $\omega_i$. The same discarding method applies for $P_f$ and $P_b$: a persistence factor is assigned to each keypoint, updated similarly to equation 15, so that the weakest ones can be identified.

Figure 6: Precision and Success plots for the one-pass evaluation (OPE) on OTTB. The number in brackets is the number of videos in the subset.

5 Experiments

In this section, we first present details of our implementation and the values for our parameters. Afterwards, we evaluate our method with the procedure proposed by [29] and compare our results to state-of-the-art trackers.

5.1 Experimental setup

For the oversegmentation, we choose SEEDS superpixels [30], which have smooth boundaries and similar shapes, in addition to being produced by one of the fastest superpixel segmentation methods in the literature. The size of a superpixel depends on the dimensions of the initial bounding box: the frame is segmented into superpixels whose diameter yields about 30 superpixels inside the initial bounding box. Their HSV color histogram is quantized and normalized. Similarly to [31], Grabcut [32] is used on the first frame to select foreground superpixels inside the given bounding box. This process makes the model more accurate, as it avoids including background superpixels in the initial model. As for the keypoints and their descriptors, we use the SIFT algorithm [25], which produces scale- and rotation-invariant keypoints robust against illumination variation. A match between keypoints is defined as proposed in [25] with a ratio threshold. When building the SPiKeS, each superpixel searches for its keypoints in a surrounding region of radius $R$. We limit the size of $\mathcal{M}$ to 3 times the number of superpixels in the initial bounding box; the keypoint pools $P_f$ and $P_b$ are limited in size as well. During the matching process, the color threshold $\tau_c$, the score parameter $\tau_s$ and the motion constraint factor $\alpha$ are fixed, as are the occlusion parameter at the update stage. A smooth appearance adaptation is obtained with the interpolation factor $\eta$ and the learning factor $\gamma$. Finally, when a new SPiKeS is added, it starts with a weak persistence factor such that it can be discarded quickly if it does not match in the following frames. Our results can be reproduced with our C++ implementation available online at https://github.com/fderue/SPiKeS_T. The following evaluation gives an average of 3 frames per second on a 3.4 GHz CPU with 8 GB of memory, without code optimization. Note that most of the time is spent on superpixel segmentation, keypoint computation and matching, which are tasks that could be implemented on GPU to improve execution speed.

5.2 Evaluation

5.2.1 Comparison to the state-of-the-art

The CVPR2013 Online Object Tracking Benchmark (OTTB) of Wu et al. [29] allows us to evaluate our approach against 29 state-of-the-art trackers over a dataset of 51 challenging sequences. The given groundtruth is a rectangular bounding box whose center corresponds to the target location. We also added a more recent tracker, KCF [9], as its code is available online.

After running the one-pass evaluation (OPE), we obtain two types of graphs based on different metrics. The precision plot shows the percentage of frames for which the center location error (CLE) is lower than a Location Error Threshold (LET), with CLE computed as the Euclidean distance between the tracker’s estimated location and the groundtruth’s center. On this plot, trackers are ranked by the precision obtained for LET = 20 pixels. The second graph is the success plot. It represents the percentage of frames for which the overlap ratio (OR) is larger than a given threshold. This ratio is computed between the intersection and the union of the bounding box $B_T$ given by the tracker and the groundtruth $B_{GT}$,

$$OR = \frac{|B_T \cap B_{GT}|}{|B_T \cup B_{GT}|}$$

with $|\cdot|$ denoting the number of pixels of the covered surface. The ranking on this plot employs the area under the curve (AUC) value, as it measures the overall performance instead of the success obtained for a single threshold.
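The overlap ratio used by the success plot is the standard intersection-over-union of axis-aligned boxes; a minimal sketch:

```python
def overlap_ratio(box_a, box_b):
    """Boxes as (x, y, w, h). Returns intersection-over-union,
    i.e. the success-plot overlap ratio."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```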

Figure 7: Comparison of a superpixel-based tracker (DGT), a keypoint-based tracker (CMT) and ours (SPiKeS-T) on OTTB.

Figure 6(a) shows the overall plots obtained from the whole dataset, while plots 6(b)-6(j) are obtained from subgroups gathering videos with the same challenging factor. Only the top ten methods are shown for clarity. We observe that our tracker (SPiKeS-T) gives promising performance, since it ranks first for almost all of these cases. However, it only reaches second place on the overall plot, after KCF. KCF has the benefit of scale adaptation, unlike our method, which explains why it tracks better on sequences with scale variation (fig. 6(g)).

As Struck does not adapt to scale change either, we can fairly compare to it on the overall success and precision plots, where our method reaches better performance on both. This is mainly due to our part-based model, which shows the best results against deformation, as seen in figure 6(c). Indeed, our local parts, the SPiKeS, are very flexible, as we do not enforce rigid connections between them. Each one is matched regardless of the others, allowing large deformations. Moreover, a superpixel is a deformable part itself and better represents local deformation. This advantage has been exploited in our update scheme, making our tracker more robust against background clutter than the other trackers, as we observe in figure 6(b). For example, SCM updates its generative model with rectangular patches, which are less reliable than superpixels as patches cannot adapt to the shape. Qualitative results for the top five trackers are also presented in figure 8.


SPiKeS-T   KCF   SCM   Struck   TLD

Figure 8: Qualitative results of top five trackers for sequences bolt, woman, david3, singer2 from top to bottom.

5.2.2 Comparison to related trackers

As our goal is to show the benefits of the SPiKeS for tracking, we also compare our method to two specific recent trackers: CMT [21] and DGT [18]. Both are part-based trackers which, like ours, locate their target with votes. However, the former is a keypoint-only tracker while the latter is a superpixel-only tracker. As their code is available online, we evaluate them on the benchmark of Wu et al. [29], keeping the default parameters given by the authors. Results in figure 7 show that our approach outperforms the other two, demonstrating that combining superpixels and keypoints leads to more robust tracking than using these features alone. More specifically, the results can be explained for different situations:

  • Deformation: As they are all part-based trackers, they are more suited to handle deformations. To alleviate the lack of discriminativity of superpixels, DGT employs spectral matching to match a graph of superpixels. This technique requires the computation of an affinity matrix. However, for that matrix to be computationally manageable, constraints on the deformation must be set. Consequently, this tracker fails in case of heavy deformation. As for CMT, keypoints can be difficult to match when the target undergoes deformation, since some keypoints will disappear and new ones appear. Therefore, only a few matches will determine the location, which will be inaccurate if some of the matches are wrong. On the contrary, as an image can always be segmented into the same number of superpixels, numerous SPiKeS are candidates to be matched even in case of deformation. Moreover, since SPiKeS may have keypoints in common, a single keypoint can lead to several SPiKeS matches, resulting in a more accurate location.

  • Background clutter: In this situation, the background distracts the tracking and often leads to model drift. Where DGT will match wrong superpixels and CMT false keypoints, the greater discriminative power of a SPiKeS helps in avoiding such ambiguous matches. Indeed, the color of a superpixel can prevent bad keypoint matches, while the structure of keypoints can differentiate two superpixels of similar color. Compared to CMT, the boundary evidence brought by a superpixel avoids adding noisy keypoints to the model, as presented in figure 5 in the previous section. Furthermore, even if noisy SPiKeS are added to our model, the persistence and predictive factors favor reliable SPiKeS, which also prevents the model from drifting.

  • Occlusion: On this curve, we see that DGT is less efficient. If the occluder has a similar color to the target, it will be classified as foreground and no occlusion will be detected. Thus, DGT will not be able to avoid updating its model, which will make it drift. To detect occlusion, keypoints seem better suited, but SPiKeS still have an advantage over keypoint-only trackers. In case of a missed occlusion detection and an unwanted update, it is less probable to add bad keypoints to a SPiKeS thanks to the boundary of the superpixel. In contrast, as keypoint-only trackers have no clue as to whether a new keypoint belongs to the target, their model is more likely to drift due to a background keypoint added erroneously.

  • Illumination variation: In case of illumination variation, DGT tends to fail as it relies only on color. Moreover, unlike CMT, which detects BRISK keypoints [33], our tracker uses SIFT keypoints, which are designed to be robust against illumination variation.

  • Fast motion: The constraints on the affinity matrix computed by DGT also limit the motion of each of its superpixels. This is why it performs poorly when there is fast motion. Our tracker adapts its motion constraint according to the target’s motion.

It is also interesting to see the influence of other types of superpixels and keypoints used to build the SPiKeS. Figure 9 compares different combinations of features: SIFT [25] and SURF [27] for keypoints, and SLIC [34] and SEEDS [30] for superpixels. Although the results are quite similar, the best combination is not a surprise: SEEDS has been shown to fit boundaries better than SLIC in [30], and SIFT is more robust than SURF to illumination changes [35].

Figure 9: Influence of different superpixels-keypoints combinations on the overall performance on OTTB.

6 Conclusion

In this paper, we proposed a novel feature combining superpixels and keypoints that we call SPiKeS. We showed that this new feature can be matched efficiently by a simple nearest neighbor technique. We then developed a SPiKeS-based tracker that leverages this matching to accurately locate target parts in a new frame. Furthermore, based on the SPiKeS properties, we provided a reliable update scheme that prevents the model from drifting. Finally, the evaluation against the state-of-the-art shows promising results, as our results are close to those of the KCF tracker, even outperforming it in many scenarios, despite the fact that our tracker does not yet include an adaptation to scale variation. In addition, our superior performance compared to superpixel-only and keypoint-only trackers demonstrates the benefits of fully combining these two features for more robust tracking. As a final word, we point out that the use of SPiKeS could advantageously be extended to other applications such as object detection and foreground segmentation.

This work is supported by Fonds de recherche du Québec - Nature et technologies (FRQ-NT) team grant #172083. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.


  • [2] S. Avidan, “Support vector tracking,” PAMI, vol. 26, no. 8, pp. 1064–1072, 2004.
  • [3] Q. Bai, Z. Wu, S. Sclaroff, M. Betke, and C. Monnier, “Randomized ensemble tracking,” in ICCV, 2013, vol. 1, pp. 2040–2047.
  • [4] B. Babenko, M. Yang, and S. Belongie, “Visual tracking with online multiple instance learning,” PAMI, vol. 33, no. 8, pp. 1619–1632, 2011.
  • [5] Q. Zhou, H. Lu, and M.-H. Yang, “Online multiple support instance tracking,” in Proc. IEEE Conf. Automatic Face and Gesture Recognition, 2011, pp. 545–552.
  • [6] S. Hare, A. Saffari, and P. H. Torr., “Structured output tracking with kernels,” in ICCV, 2011, pp. 263–270.
  • [7] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in ECCV, 2012.
  • [8] M. Danelljan, F. S. Khan, M. Felsberg, and J. V. de Weijer, “Adaptive color attributes for real-time visual tracking,” in CVPR, 2014.
  • [9] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” TPAMI, vol. 37, no. 3, pp. 583–596, 2015.
  • [10] Y. Li, J. Zhu, and S. C. Hoi, “Reliable patch trackers: Robust visual tracking by exploiting reliable patches,” in CVPR, 2015, pp. 353–361.
  • [11] X. Mei and H. Ling, “Robust visual tracking using l1 minimization,” in ICCV, 2009.
  • [12] J. Kwon and K. M. Lee, “Visual tracking decomposition,” in CVPR, 2010.
  • [13] A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in CVPR, 2006, vol. 1, pp. 798–805.
  • [14] W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparsity-based collaborative model,” in CVPR, 2012, pp. 1838–1845.
  • [15] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in CVPR, 2012, pp. 1822–1829.
  • [16] S. He, Q. Yang, R. Lau, J. Wang, and M.-H. Yang, “Visual tracking via locality sensitive histograms,” in CVPR, 2013.
  • [17] S. Wang, H. Lu, F. Yang, and M.-H. Yang, “Superpixel tracking,” in ICCV, 2011, pp. 1323–1330.
  • [18] Z. Cai, L. Wen, Z. Lei, N. Vasconcelos, and S. Z. Li, “Robust deformable and occluded object tracking with dynamic graph,” IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5497–5509, 2014.
  • [19] J. Wang and Y. Yagi, “Many-to-many superpixel matching for robust tracking,” IEEE Transactions on Cybernetics, vol. 44, no. 7, pp. 1237–1248, 2014.
  • [20] G. Nebehay and R. Pflugfelder, “Consensus-based matching and tracking of keypoints for object tracking,” in WACV, 2014.
  • [21] G. Nebehay and R. Pflugfelder, “Clustering of static-adaptive correspondences for deformable object tracking,” in CVPR, 2015.
  • [22] S. Hare, A. Saffari, and P. H. Torr, “Efficient online structured output learning for keypoint-based object tracking,” in CVPR, 2012, pp. 1894–1901.
  • [23] W. Bouachir and G.-A. Bilodeau, “Part-based tracking via salient collaborating features,” in WACV, 2015.
  • [24] F. Yang, H. Lu, and M.-H. Yang, “Learning structured visual dictionary for object tracking,” Image and Vision Computing, vol. 31, no. 12, pp. 992–999, 2013.
  • [25] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.
  • [26] Y. Liu, W. Zhou, H. Yin, and N. Yu, “Tracking based on SURF and superpixel,” in Sixth International Conference on Image and Graphics, 2011, pp. 714–719.
  • [27] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in ECCV, 2006.
  • [28] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, “MUlti-Store Tracker (MUSTer): A cognitive psychology inspired approach to object tracking,” in CVPR, 2015, pp. 749–758.
  • [29] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,” in CVPR, 2013.
  • [30] M. Van den Bergh, X. Boix, G. Roig, and L. Van Gool, “SEEDS: Superpixels extracted via energy-driven sampling,” IJCV, 2014.
  • [31] Z. Hong, C. Wang, X. Mei, D. Prokhorov, and D. Tao, “Tracking using multilevel quantizations,” in ECCV, 2014.
  • [32] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 309–314, 2004.
  • [33] S. Leutenegger, M. Chli, and R. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in ICCV, 2011, pp. 2548–2555.
  • [34] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” TPAMI, vol. 34, no. 11, pp. 2274–2282, 2012.
  • [35] N. Y. Khan, B. McCane, and G. Wyvill, “SIFT and SURF performance evaluation against various image deformations on benchmark dataset,” in International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2011, pp. 501–506.