I Introduction
Visual tracking has attracted much research interest in the field of computer vision, because it is a critical step in various applications, including video surveillance, sports analysis, autonomous driving, etc. Despite promising progress over the past decade, designing a robust tracker that can handle appearance changes caused by critical situations, such as large deformation, illumination variation, partial and full occlusion, and background clutter, remains very challenging. In particular, deformation and occlusion are the two most notable challenges that degrade tracking performance.
For tracking scenarios where the target appearance is relatively stable, methods based on global appearance models can achieve satisfactory performance [20, 15, 33, 45, 9, 17]. However, when large deformation or occlusion occurs, such approaches usually fail to track the target robustly. To counter this problem, many approaches based on models of local parts have received attention [34, 30, 19, 32, 44]. Moreover, several methods to represent the target's geometric structure have been proposed, such as the structural Support Vector Machine (SVM)
[42], Markov Random Field (MRF) [31, 18], keypoint constellations [29, 28], and graph models [35, 6]. However, most approaches that consider pairwise relations between target parts are easily disturbed by errors in pairwise affinities, making it difficult to preserve the geometric structure underlying the target representation.

In this paper, we present a novel Geometric hyperGraph Tracker (GGT) to handle the visual tracking task, especially for deformable targets. Different from previous works that consider pairwise geometric relations between local parts, our method exploits high-order relations among more than two correspondences based on a geometric hypergraph. Specifically, the geometric hypergraph is constructed and learned based on the target part set and the candidate part set. The candidate part set consists of candidate parts extracted from the searching area in the current frame; the target part set serves as the part representation of the target and consists of the target parts up to the previous frame. We then generate possible correspondences between parts in the two sets, which are defined as correspondence hypotheses. In Fig. 1, we give a schematic diagram of constructing the hypergraph, where vertices encode correspondence hypotheses and hyperedges encode high-order geometric relations among several correspondence hypotheses. Thus the geometric structure of the target can be effectively characterized by the hypergraph, providing more discriminative power to extract the common appearance and geometric properties of correspondences from noise. Moreover, reliable correspondence hypotheses form a set of modes in which a large number of hyperedges are involved with high confidence, whereas false correspondences are involved in very few hyperedges, with low confidence. For easier reading, we first define the structural correspondence mode as:
Definition 1. A structural correspondence mode is a group of reliable correspondences between target parts with similar appearance and consistent geometric structure, corresponding to a local maximum of the overall confidence on the geometric hypergraph.
The present work makes the following contributions:

The geometric hypergraph is used to represent the target, which fully exploits high-order geometric relations among correspondence hypotheses in consecutive frames.

The confidence-aware sampling method is proposed to approximate the geometric hypergraph, which not only alleviates sensitivity to noise, but also scales to large hypergraphs. Structural correspondence modes are then sought on the hypergraph efficiently via the pairwise coordinate update method in [24].

Our method is compared to existing methods on the VOT2014 and DeformSOT datasets. The experimental results demonstrate the effectiveness and robustness of the proposed model.
The rest of the paper is organized as follows. Section II reviews relevant previous works. The methodology is described in Section III and the model optimization in Section IV; Section V details the tracking procedure. In Section VI, we evaluate the proposed algorithm on two tracking datasets against existing methods. Section VII concludes the paper with discussions on future work.
II Related Works
Tracking methods based on modeling relations between target parts have been shown to be less susceptible to the problems posed by object deformation and occlusion. Recently, many works have focused on how to incorporate geometric information as an important clue to facilitate visual tracking.
Keypoint Based Tracking Methods. Keypoint based trackers capture geometric structure by using the displacements of target parts to vote for the target center in consecutive frames. Hare et al. [16] combine feature matching, learning, and object pose estimation into a coherent structured output learning framework, enabling real-time keypoint-based object detection and tracking. Yang
et al. [41] propose a visual tracking algorithm that incorporates SIFT features from interest points to represent appearance and exploits their geometric structure, learning a structured visual dictionary to enhance the discriminative strength between the foreground object and the background. Yi et al. [43] propose a tracking method that uses the "motion saliency" and "descriptor saliency" of local features and performs tracking with the generalized Hough transform; the tracking result is obtained by combining the votes of each local feature of the target and its surroundings. Guo et al. [14] formulate tracking and recognition as a maximum a posteriori estimation problem under a manifold representation learnt from collections of local features, preserving local appearance similarity and spatial structure. Nebehay and Pflugfelder [29] develop a keypoint-based tracking method in a combined matching-and-tracking framework, where each keypoint casts votes for the object center. An improved algorithm in [28] employs a geometric dissimilarity measure to separate inlier correspondences from outliers by considering both static and adaptive correspondences. In
[4], keypoints are considered as elementary predictors that localize the target in a collaborative search strategy, where the persistence, spatial consistency, and predictive power of a local feature are used to identify the most reliable features for tracking. However, keypoint based trackers focus on modeling the displacements between parts and the corresponding target center, which is insufficient to fully exploit the relations between local parts for geometric structure representation.

Part Based Tracking Methods. To better handle shape deformation and partial occlusion, part based methods are gaining popularity in visual tracking. Wen et al. [36] present a discriminative learning method to infer the position, shape and size of each part, using the Metropolis-Hastings algorithm integrated with an online SVM. Wang and Nevatia [35]
propose to track non-rigid objects with multiple related parts, modeling tracking as a Dynamic Bayesian Network in which the spatial relations among parts are formulated probabilistically. Improving on [15], Yao et al. [42] introduce a part-based tracking algorithm with online latent structured learning, using a global object box and a small number of part boxes to approximate the irregular object and reduce visual drift. Cehovin et al. [7] employ a global representation to model the target's global visual properties probabilistically, while low-level patches are constrained and updated with the global model during tracking. A dynamic structure graph based tracker in [6] formulates tracking as subgraph matching between the geometric structure graph of the target and that of the candidate target proposals. Nam et al. [27] use a new graphical model to adapt to sequence structure and propagate the posterior over time, where each vertex has a single outgoing edge but may have multiple incoming edges. Hong et al. [18] propose an MRF-based tracker that considers geometric structure through a hierarchical appearance representation, exploiting shared information across multi-level quantizations of the image space, i.e., pixels, superpixels and bounding boxes. In [25], a real-time tracking method is proposed based on parts with multiple correlation filters, in which a Bayesian inference framework and a structural constraint mask are adopted to handle various appearance changes. However, existing part based methods give little consideration to exploiting high-order geometric relations among target parts for more robustness.
Segmentation Based Tracking Methods. Segmentation based methods capture geometric information by determining the precise location of each pixel of the target. Based on the generalized Hough transform, Godec et al. [13] develop improved online Hough Forests and couple voting based detection and back-projection with a rough segmentation based on GrabCut. Duffner and Garcia present a pixel-based non-rigid object tracking method in [11], which consists of a detector based on a generalized Hough transform with pixel-based descriptors, and a probabilistic segmentation method based on a global model of foreground and background. Recently, Wen et al. [38] develop a joint online tracking and segmentation algorithm, which integrates multi-part tracking and segmentation into a unified energy optimization framework.
III Methodology
We first introduce the terms and notation used in the sequel. A hypergraph is a generalization of a graph in which an edge (strictly speaking, a hyperedge) can connect more than two vertices, while the edges of an ordinary graph connect exactly two vertices; the order of the hypergraph denotes the number of vertices each hyperedge connects. An unconnected graph is a graph without edges between vertices.
Although our method is related to three previous works, namely SPT [34], DGT [6] and TCP [22], there are significant differences between our method and them, which are summarized below.

Though both our method and SPT use a superpixel representation, our method represents the geometric information of target parts more effectively, which leads to improved performance on more complex scenes. When the hypergraph degenerates into an unconnected graph, SPT can be regarded as a special case of the proposed algorithm.

DGT uses a graph to exploit pairwise geometric relations between neighboring parts. In contrast, our method employs a hypergraph that considers high-order geometric relations among correspondence hypotheses, which better handles abrupt deformation, motion change and target context. When the hypergraph reduces to an ordinary graph, DGT can be regarded as a special case of the proposed algorithm.

TCP mainly exploits temporal high-order relations among different parts in consecutive frames, ignoring the spatial geometric structure of local parts. In contrast, our method focuses on modeling spatial high-order relations among correspondence hypotheses. In addition, the temporal relations of parts are also considered when updating the hypergraph.
In this work, the tracking problem is formulated as a mode-seeking problem on the geometric hypergraph. In Section III-A, we construct the hypergraph based on the target part set and candidate part set. We then give the detailed formulation in Section III-B and define the corresponding confidence measure in Section III-C.
III-A Geometric Hypergraph
The superpixel representation is more flexible for a deformable target than a holistic representation, but has low discriminative power because of the small part size. We therefore construct a geometric hypergraph to alleviate this problem with geometric constraints. Given the annotated bounding box in the first frame, the target part set is first initialized, and the candidate part set is determined by coarse labeling of superpixels in subsequent frames: similar to [6], we first use the SLIC algorithm [1] to over-segment the searching area of the target into multiple parts (superpixels), and employ the Graph Cut method [5] to coarsely separate foreground parts from the background, as shown in the top-left of Fig. 2(a). Based on the two part sets, we construct the vertex set and hyperedge set of the geometric hypergraph as
(1) 
where each hyperedge connects a fixed number of distinct vertices, given by the order of the hypergraph, without conflicts or duplicates. The distance between two parts is measured as the Euclidean distance between their centers in the image plane, and the distance threshold is set according to the number of superpixels in the searching area together with its width and height in the current frame, as shown in Fig. 2.
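To make the construction concrete, the following sketch builds the vertex and hyperedge sets from two part sets. The exact symbols of Eq. (1) are not recoverable from the text, so the threshold formula, the rule that hyperedge vertices share no target or candidate part, and all function and parameter names here are illustrative assumptions.

```python
import itertools
import math

def build_hypergraph(target_parts, candidate_parts, search_w, search_h,
                     n_superpixels, order=3):
    """Sketch of the geometric-hypergraph construction (Section III-A).

    target_parts / candidate_parts: lists of (x, y) part centers.
    Every (target, candidate) pair is a correspondence-hypothesis vertex;
    hyperedges group `order` distinct vertices whose candidate parts lie
    within a distance threshold tied to the mean superpixel diameter.
    """
    # Vertices: all correspondence hypotheses between the two part sets.
    vertices = [(i, j) for i in range(len(target_parts))
                       for j in range(len(candidate_parts))]

    # Distance threshold ~ a couple of mean superpixel diameters (assumed form).
    mean_diam = math.sqrt(search_w * search_h / n_superpixels)
    d_max = 2.0 * mean_diam

    def close(v1, v2):
        (x1, y1), (x2, y2) = candidate_parts[v1[1]], candidate_parts[v2[1]]
        return math.hypot(x1 - x2, y1 - y2) < d_max

    # Hyperedges: `order`-tuples of vertices, pairwise within d_max, with
    # no two vertices sharing a target part or a candidate part.
    hyperedges = []
    for combo in itertools.combinations(vertices, order):
        tgt = {v[0] for v in combo}
        cand = {v[1] for v in combo}
        if len(tgt) == order and len(cand) == order and \
           all(close(a, b) for a, b in itertools.combinations(combo, 2)):
            hyperedges.append(combo)
    return vertices, hyperedges
```

Enumerating all tuples is only feasible for tiny part sets; the confidence-aware sampling of Section IV-B replaces this exhaustive step in practice.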
III-B Formulation
As analyzed in the introduction, multiple correspondences with similar structural geometric properties form a set of structural correspondence modes. By measuring the overall confidence of modes, the tracking problem is formulated as
(2) 
where a structural correspondence mode comprises a set of vertices, and the confidence measure function reflects the confidence distribution within the mode, as described below.
III-C Confidence Measure
We design two terms to encode both the association confidence of vertices and the geometric confidence among them, i.e.,
(3) 
where the two terms are computed over the vertex set and the hyperedge set of the mode, respectively, and are weighted by two balancing factors.
Association Confidence. The association confidence encodes the probability that the two parts of a vertex belong to the same class, which is defined as
(4)
where the distance between the appearance features of the two parts is used, and a scaling parameter controls the importance of appearance similarity. In the experiments, the appearance feature is a concatenation of an HSV color feature and an LBP texture feature.
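A minimal sketch of the association confidence, assuming a Gaussian kernel on the Euclidean feature distance; the kernel form and the `sigma` parameter are assumptions, since the symbols of Eq. (4) are not recoverable from the text:

```python
import numpy as np

def association_confidence(feat_target, feat_candidate, sigma=0.1):
    """Sketch of the association confidence in Eq. (4): a Gaussian kernel
    on the appearance-feature distance between the two parts of a vertex.
    The paper concatenates HSV color and LBP texture features; any fixed
    feature vector works for this sketch."""
    d = np.linalg.norm(np.asarray(feat_target, dtype=float)
                       - np.asarray(feat_candidate, dtype=float))
    return float(np.exp(-d ** 2 / sigma ** 2))
```

Identical features give confidence 1, and the confidence decays quickly as the features diverge, at a rate set by `sigma`.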
Geometric Confidence. The geometric confidence describes the geometric relation among the correspondence hypotheses in a hyperedge. When the order is larger than two, the structure is a hypergraph describing high-order geometric relations among correspondence hypotheses. As the order is reduced, it degenerates into a graph that considers pairwise geometric relations, and further into an unconnected graph that ignores geometric relations. The geometric confidence is therefore computed differently for each order, as discussed below.
III-C1 Unconnected Graph
In this case, the hypergraph becomes an unconnected graph, and visual tracking depends only on the association confidence, without any geometric structural constraints. Similar to SPT [34], this is effectively a part-based template matching method. The appearance information encoded in small superpixels is usually weak, resulting in worse performance in scenarios with complex appearance variation.
III-C2 Graph
In this case, each edge provides complementary pairwise geometric information beyond the appearance used by SPT [34]; DGT [6] falls into this category. The pairwise similarity comparing two correspondence hypotheses is calculated as
(5) 
where the edge connects parts in the target part set and parts in the candidate part set. The consistency of the two supporters is measured by the location displacement between two neighboring correspondence hypotheses, as shown in Fig. 3(a), and a scaling parameter controls the importance of the geometric constraint.
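The pairwise consistency of Eq. (5) can be sketched as follows, comparing the displacement vectors of two correspondence hypotheses; the Gaussian kernel form and `sigma_g` are assumptions, as the original symbols are not recoverable:

```python
import math

def pairwise_geometric_confidence(t1, c1, t2, c2, sigma_g=10.0):
    """Sketch of the pairwise geometric confidence in Eq. (5): two
    correspondence hypotheses (t1 -> c1) and (t2 -> c2) are consistent
    when their location displacements agree. Inputs are (x, y) centers;
    displacements are measured in pixels."""
    dx = (c1[0] - t1[0]) - (c2[0] - t2[0])
    dy = (c1[1] - t1[1]) - (c2[1] - t2[1])
    return math.exp(-(dx * dx + dy * dy) / sigma_g ** 2)
```

Two hypotheses that move their parts by the same displacement score 1; the score decays as the displacements diverge, which is exactly why this measure breaks down under scale change.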
III-C3 Hypergraph
As shown in Fig. 3(a), the supporters provide a pairwise relation measure restricted to distances, making it hard to handle large scale changes and leading to wrong correspondences between target parts. In contrast, as shown in Fig. 3(b), we exploit the angle information of triplets of correspondence hypotheses to achieve scale invariance, leading to more correct associations between target parts. For example, three correspondence hypotheses form two triangles, one in the target part set and one in the candidate part set. Even when the target scale changes drastically, the angles (high-order geometric relations) remain more stable than the relative displacements between parts (pairwise geometric relations). For better understanding, a real example of structural correspondence in waterski is shown in Fig. 1(b). Similar to [21], the geometric confidence is calculated by comparing corresponding angles, as
(6) 
where the corresponding angles are measured at the parts of each vertex, in the target part set and in the candidate part set, respectively.
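A sketch of the angle comparison in Eq. (6): compute the interior angles of the triangle formed by the three target parts and of the triangle formed by the corresponding candidate parts, then penalize their squared differences. The Gaussian kernel form and `sigma_a` are assumptions; only the use of angles, per [21], is from the text.

```python
import math

def triangle_angles(p1, p2, p3):
    """Interior angles (radians) of the triangle p1-p2-p3."""
    def ang(a, b, c):  # angle at vertex a
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - a[0], c[1] - a[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        n = math.hypot(*v1) * math.hypot(*v2)
        return math.acos(max(-1.0, min(1.0, dot / n)))
    return ang(p1, p2, p3), ang(p2, p1, p3), ang(p3, p1, p2)

def hyperedge_geometric_confidence(tgt_triplet, cand_triplet, sigma_a=0.5):
    """Sketch of Eq. (6): compare corresponding angles of the triangle of
    three target parts and the triangle of their candidate parts. Angles
    are scale-invariant, so the confidence stays high under scale change."""
    err = sum((a - b) ** 2 for a, b in
              zip(triangle_angles(*tgt_triplet), triangle_angles(*cand_triplet)))
    return math.exp(-err / sigma_a ** 2)
```

A uniformly scaled triangle keeps all three angles, so the confidence stays at 1 while the pairwise displacement measure of Eq. (5) would collapse.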
IV Optimization
Given the geometric hypergraph, the mode-seeking problem is solved by searching the hypergraph (Section IV-A). Before that, we propose a confidence-aware sampling technique to improve the effectiveness of the proposed method (Section IV-B).
IV-A Mode-Seeking Problem
Since each maximum of the objective in Section III-B corresponds to a structural correspondence mode, we search the hypergraph exhaustively by taking each vertex of the hypergraph as a starting point. Each mode has its own vertex set and hyperedge set, and we associate with it a probability vector over all vertices of the hypergraph, whose entries are positive for vertices belonging to the mode and zero otherwise. Combined with (3), the problem in Section III-B is cast as optimizing this probability vector and further rewritten as
(7) 
in which the first term of the objective penalizes the inclusion of vertices with low association confidence, and the second term encourages the inclusion of hyperedges with large geometric confidence. Essentially, this is an NP-hard combinatorial optimization problem. To solve it, the binary constraint on the probability vector is relaxed to a continuous one bounded by a constant, so that a mode contains at least a minimal number of vertices. To avoid the degeneracy problem, we require this minimal number of vertices in a mode to guarantee that adequate structural correspondences are included. The pairwise coordinate update method [24] is then used to solve the relaxed problem efficiently; refer to [24] for more details of the optimization strategy.

IV-B Confidence-aware Sampling
Suppose the target part set and candidate part set each contain many parts; the number of correspondence hypotheses grows as their product, and the number of full-affinity hyperedges grows polynomially with the order of the hypergraph, demanding a huge amount of memory. It therefore becomes necessary to reduce the computational complexity by introducing a sparse hypergraph structure containing only significant hypotheses. To this end, we develop a confidence-aware sampling method as follows.

Firstly, we reduce the number of vertices deterministically by introducing two thresholds. We assume target parts move smoothly in consecutive frames, which means that their appearance changes little within a very short time interval. To remove noise, for each target part we keep only a bounded number of correspondence hypotheses, retaining those with the highest association confidence among the hypotheses whose confidence exceeds an appearance threshold.

Secondly, the number of hyperedges is greatly decreased probabilistically. Based on the simple assumption that a vertex with higher association confidence is more likely to be a reliable correspondence, we sample more hyperedges around vertices with higher association confidence. Specifically, starting from each vertex in the reduced vertex set, we sample up to a fixed maximal number of hyperedges, each comprising three vertices without conflicts, using the normalized association confidence as the sampling probability.
Different from other MRF or graph based approaches that consider pairwise relations between the nearest neighboring vertices, we sample hyperedges randomly without distance constraints to fully exploit the geometric information, so that the hypergraph spans globally over all correspondence hypotheses. An additional benefit is that context information between target parts and background parts can be considered for more robustness.
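The two sampling steps above can be sketched as follows; the parameter names (`k_max`, `tau`, `n_edges_per_vertex`) and the exact sampling mechanics are illustrative assumptions, not the paper's notation.

```python
import random

def confidence_aware_sampling(vertices, assoc_conf, k_max=5, tau=0.3,
                              n_edges_per_vertex=50, rng=None):
    """Sketch of the confidence-aware sampling of Section IV-B.

    vertices: correspondence hypotheses as (target_id, candidate_id) pairs.
    assoc_conf: dict mapping each vertex to its association confidence.
    Step 1 (deterministic): per target part, keep at most k_max hypotheses
    whose association confidence exceeds tau.
    Step 2 (probabilistic): from each kept vertex, sample up to
    n_edges_per_vertex 3-vertex hyperedges, drawing partners with
    probability proportional to their association confidence."""
    rng = rng or random.Random(0)

    # Step 1: threshold + top-k per target part.
    by_target = {}
    for v in vertices:
        if assoc_conf[v] > tau:
            by_target.setdefault(v[0], []).append(v)
    kept = []
    for vs in by_target.values():
        vs.sort(key=lambda v: assoc_conf[v], reverse=True)
        kept.extend(vs[:k_max])

    # Step 2: confidence-weighted hyperedge sampling, no distance limits.
    weights = [assoc_conf[v] for v in kept]
    hyperedges = set()
    for v in kept:
        for _ in range(n_edges_per_vertex):
            others = rng.choices(kept, weights=weights, k=2)
            edge = tuple(sorted({v, *others}))
            if len(edge) == 3:  # reject draws with duplicate vertices
                hyperedges.add(edge)
    return kept, sorted(hyperedges)
```

Because hyperedges are deduplicated in a set, high-confidence regions of the hypergraph end up densely covered while low-confidence vertices contribute few or no hyperedges.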
Based on the confidence-aware sampling method, we sample vertices and hyperedges to obtain an approximate geometric hypergraph, and then perform mode-seeking directly on the approximation instead of the full hypergraph. Specifically, the reduced vertex set and hyperedge set are given as
(8) 
where the number of vertices containing each target part and the number of hyperedges containing each vertex are bounded by the sampling limits. The sampling scheme still finds enough relevant correspondence hypotheses, while drastically decreasing the numbers of vertices and hyperedges; empirically, the large majority of redundant vertices and hyperedges are removed.
V Tracking
V-A Extracting Reliable Target Parts
Given the optimized probability vector, we can determine the vertices belonging to the corresponding mode. Since the hypergraph is searched starting from every vertex, one vertex may appear in multiple modes. The conflicts among modes must be resolved to obtain the reliable target parts; the whole procedure is summarized in Algorithm 1.
V-B Reliable Target Parts Based Voting
After obtaining the reliable target parts, we determine the target state in the current frame, including the center and scale of the target, by voting based on the reliable target parts. Similar to [6], we construct a confidence map to represent the location probability of the target in the searching area as
(9) 
where each position in the searching area is assigned a constant according to whether it lies in the region of target parts belonging to the extracted modes, in the region of other candidate parts, or in the background.
To find the bounding box, parameterized by its center and scale, that covers the most foreground regions, we form the following optimization problem
(10) 
where the region is determined by the center and scale of the bounding box.
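The voting of Eqs. (9)-(10) can be sketched as follows. The region constants `c_rel` and `c_cand` are illustrative stand-ins for the paper's weights, and for clarity the box search below is exhaustive over a fixed box size, whereas the paper samples states of varying center and scale randomly.

```python
import numpy as np

def confidence_map(shape, reliable_mask, candidate_mask, c_rel=1.0, c_cand=0.2):
    """Sketch of Eq. (9): positions inside reliable target parts score
    highest, positions inside other candidate parts score lower, and
    the background scores zero."""
    cmap = np.zeros(shape, dtype=float)
    cmap[candidate_mask] = c_cand
    cmap[reliable_mask] = c_rel  # reliable parts override candidate score
    return cmap

def best_box(cmap, box_w, box_h):
    """Brute-force version of Eq. (10): the top-left corner maximizing the
    summed confidence inside a box_w x box_h window, via an integral image."""
    h, w = cmap.shape
    integral = cmap.cumsum(0).cumsum(1)
    best, best_score = None, -1.0
    for y in range(h - box_h + 1):
        for x in range(w - box_w + 1):
            s = integral[y + box_h - 1, x + box_w - 1]
            if y > 0:
                s -= integral[y - 1, x + box_w - 1]
            if x > 0:
                s -= integral[y + box_h - 1, x - 1]
            if y > 0 and x > 0:
                s += integral[y - 1, x - 1]
            if s > best_score:
                best_score, best = s, (x, y)
    return best, best_score
```

The integral image makes each window sum O(1), so even this exhaustive variant is cheap for small searching areas.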
The target center in the current frame is largely determined by the center in the previous frame through a geometric constraint. To reduce computational complexity, we first estimate a rough target center as the weighted mean of the reliable target part centers, i.e.,
(11) 
where the weights depend on the optimal center in the previous frame and on the confidence of the mode containing each reliable target part in the current frame. After that, we modify the target center with a displacement perturbation term and adjust the target scale with a scale perturbation term for a visually better location. The maximal values of the two perturbation terms are set to the mean diameter of the candidate parts in the current frame. The final target state is obtained by optimizing (10) with a sampling strategy, namely selecting the state with the maximal score from numerous randomly sampled states. Assembling all parts belonging to the target, we find the optimal target state, as shown in Fig. 2(b).
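The rough center estimate of Eq. (11) can be sketched as a confidence-weighted mean of the reliable part centers blended with the previous frame's center; the blending weight `alpha` and the exact blending form are assumptions, since the paper's symbols are not recoverable here.

```python
def rough_target_center(part_centers, mode_confidences, prev_center, alpha=0.5):
    """Sketch of Eq. (11): a rough target center as the confidence-weighted
    mean of reliable-part centers, pulled toward the previous frame's
    optimal center as a temporal smoothness prior.

    part_centers: list of (x, y) centers of reliable target parts.
    mode_confidences: confidence of the mode containing each part."""
    total = sum(mode_confidences)
    wx = sum(w * c[0] for w, c in zip(mode_confidences, part_centers)) / total
    wy = sum(w * c[1] for w, c in zip(mode_confidences, part_centers)) / total
    return (alpha * prev_center[0] + (1.0 - alpha) * wx,
            alpha * prev_center[1] + (1.0 - alpha) * wy)
```

The perturbation terms are then small random offsets added to this rough center (and to the scale) before re-scoring candidate states with Eq. (10).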
V-C Online Updating of the Hypergraph
To handle significant changes of target appearance, the geometric hypergraph is updated in two respects, i.e., the target part set and the candidate part set. As illustrated in Fig. 2(a), based on the parsed reliable target part set, an old target part is deleted if it is not involved in any structural correspondence for a fixed number of frames, while a new part not involved in existing modes is added if its geometric distance to all other parts is larger than a threshold (set to twice the mean diameter of candidate parts in the current frame), which preserves the spatial sparsity of the target part set. On the other hand, the appearance model in the MRF based segmentation method is updated every frame to generate the candidate part set, similar to [6].
VI Experiments
VI-A Datasets and Protocols
VI-A1 VOT2014 dataset
The VOT2014 dataset [12] is widely used in the tracking community and consists of representative sequences selected from a larger pool. Each sequence is annotated with several attributes, such as occlusion and illumination change.
We evaluate the tracking methods following two protocols of the VOT2014 challenge, i.e., Baseline and Region_noise. Baseline corresponds to the experimental setting where the tracker is run on each sequence multiple times, initialized on the ground-truth bounding box, and the average statistical scores of the measures are reported. Region_noise corresponds to the setting where the tracker is initialized with noisy bounding boxes, randomly perturbed in proportion to the ground-truth bounding box size in each sequence. As defined in [8], two performance metrics are reported: Accuracy (the average overlap between the bounding box predicted by the tracker and the ground-truth one) and Robustness (the number of re-initializations after the overlap measure drops to zero).
VI-A2 DeformSOT dataset
To further evaluate the performance of trackers under deformation and occlusion, we collect the DeformSOT dataset, which includes challenging sequences with different targets undergoing varying levels of deformation and occlusion in unconstrained environments. The dataset is diverse in object categories, camera viewpoints, sequence lengths and difficulty levels. For comparison, we categorize the challenges of the sequences into six classes: large deformation, severe occlusion, abnormal movement, illumination variation, scale change and background clutter.
We use two popular measures for evaluation, i.e., the precision plot and the success plot. The precision plot shows the percentage of successfully tracked frames versus the center location error in pixels, and trackers are ranked by the precision score at a fixed pixel threshold; the success plot shows the percentage of successfully tracked frames versus the bounding box overlap threshold, where the Area Under the Curve (AUC) is used as the success score for ranking. We run One-Pass Evaluation (OPE), Spatial Robustness Evaluation (SRE) and Temporal Robustness Evaluation (TRE) for both measures (see definitions in [40]) on the dataset.
VI-B Implementation Details
The proposed tracker is implemented in MATLAB and C and runs on a machine with an Intel i7 processor. First, we study the influence of several important parameters, with experiments performed on sequences selected from the DeformSOT dataset covering different kinds of challenges.
VI-B1 Order of Hypergraph
The order of the hypergraph decides how geometric relations among correspondence hypotheses are considered. We compare tracking with different orders of the hypergraph, denoted as GGTor variants, where the corresponding geometric confidence is calculated as in Section III-C. As shown in Fig. 4, the variant considering high-order geometric relations performs best. In contrast, the variants that consider only pairwise relations, or no relations between parts, suffer a large accuracy loss. This indicates the importance and effectiveness of our high-order representation, which fully integrates the geometric structural information of the target.
VI-B2 Number of Pixels in Each Superpixel
The number of pixels per superpixel controls the size of parts and the number of vertices in the hypergraph. As Fig. 5 shows, we consider different numbers of pixels in each superpixel, denoted as GGTsp variants. If the number of pixels in each superpixel is too large (e.g., GGTsp200), it is hard to exploit discriminative geometric structure cues of local parts to handle deformation. On the other hand, if it is too small (e.g., GGTsp30), the large number of hypotheses increases the computational complexity considerably without apparent performance improvement (e.g., GGTsp30 ranks second in success score and first in precision score).
VI-B3 Weight of Geometric Confidence
The weight of the geometric confidence indicates its importance in (3). Here we fix the association weight and enumerate the geometric weight, denoted as GGTgc variants. Based on the performance in Fig. 6, an appropriate factor helps the tracker achieve higher performance by neither underestimating nor overestimating the geometric information.
VI-B4 Maximal Number of Sampled Hyperedges
The maximal number of sampled hyperedges controls how much geometric information is exploited. We report the performance with different numbers of hyperedges in Fig. 7, denoted as GGThe variants. If the number of hyperedges is too small (e.g., GGThe25), it is insufficient to exploit high-order geometric information, leaving few discriminative structure cues to handle deformation; if the number is too large (e.g., GGThe250), the large number of hyperedges introduces many noisy relations, which is harmful because reliable correspondences are sparse.
Based on the above parameter analysis, we fix all parameters of our algorithm empirically, including the order of the geometric hypergraph; the size of the searching area relative to the previous target size; the number of pixels in each superpixel and the range of the number of superpixels for the SLIC over-segmentation; the number of bins per HSV channel used to represent the appearance of target parts; the weights in (3); the scaling parameters in (4)-(6); the appearance threshold and the maximal number of sampled hyperedges in the sampling method; and the region constants in (9).
VI-C Evaluations on the VOT2014 Dataset
We compare our approach to several algorithms including the winner of the VOT2014 challenge, DSST [9], and two of the top-performing trackers of the online tracking benchmark [40], namely Struck [15] and KCF [17]. Furthermore, we include the keypoint based CMT [29] and IIVTv2 [43], the part based DGT [6], LGTv1 [7], OGT [27], and PTp [11], as well as the baseline trackers FRT [2], CT [46], and MIL [3]. To ensure a fair comparison, all results are taken from the original submissions to the VOT2014 challenge by the corresponding authors or the VOT committee.
Qualitative Evaluation. Examples of visual tracking results of the top trackers are shown in Fig. 8. We can observe that our hypergraph based tracker performs favorably against other graph-based trackers, such as DGT [6], LGT [7], and OGT [27]. For example, DGT [6] and OGT [27] fail to adjust to the scale change of the target in car. When the figure skater in skating moves under the challenges of background clutter and illumination variation, some trackers fail to locate the target well (e.g., OGT [27] and DGT [6]). Besides, DSST [9] fails to track the hand in hand1 and hand2. This is because the high-order geometric relations in our method capture invariant properties of local parts, such as angles, rather than vulnerable pairwise affinities, giving more tolerance to drastic rotation or appearance variations.
Quantitative Evaluation. Table I shows the average performance of the compared trackers. As these results show, our algorithm achieves the overall best robustness score and comparable accuracy among all the compared methods. Moreover, the considerable improvement under Region_noise indicates that the spatial high-order representation in our method resists noise effectively and recovers from initialization errors, gaining improvements in both accuracy and robustness.
TABLE I: Accuracy (Acc.) and robustness (Rob.) scores (Sc.) and ranks (Rk.) on the VOT2014 dataset.

                 Baseline                     Region_noise                 Overall
Tracker      Acc. Sc./Rk.   Rob. Sc./Rk.  Acc. Sc./Rk.   Rob. Sc./Rk.  Acc. Sc./Rk.   Rob. Sc./Rk.
GGT           0.58/6.16      0.55/4.98     0.57/4.81      0.65/4.93     0.57/5.48      0.59/4.95
DSST [9]      0.62/4.48      1.16/6.32     0.58/4.01      1.28/6.22     0.60/4.25      1.22/6.27
DGT [6]       0.58/5.81      1.00/5.02     0.58/4.97      1.17/5.31     0.58/5.39      1.09/5.16
KCF [17]      0.63/4.22      1.32/6.53     0.58/4.50      1.52/6.62     0.61/4.36      1.42/6.57
LGTv1 [7]     0.47/9.29      0.66/5.96     0.46/8.73      0.64/5.42     0.47/9.01      0.65/5.69
Struck [15]   0.52/8.04      2.16/8.64     0.49/7.90      2.22/8.16     0.51/7.97      2.19/8.40
OGT [27]      0.55/7.09      3.34/9.78     0.51/7.19      3.37/10.30    0.53/7.14      3.36/10.04
PTp [11]      0.47/10.98     1.40/7.20     0.45/9.77      1.46/7.33     0.46/10.38     1.43/7.26
CMT [29]      0.48/9.18      2.64/9.16     0.44/9.97      2.64/9.14     0.46/9.58      2.64/9.15
FoT [39]      0.51/8.44      2.28/9.69     0.48/9.13      2.71/10.59    0.50/8.79      2.50/10.14
IIVTv2 [43]   0.47/9.30      3.19/9.70     0.45/9.96      3.13/9.14     0.46/9.63      3.16/9.42
FSDT [12]     0.47/9.87      3.08/11.26    0.46/9.36      2.77/10.38    0.47/9.62      2.93/10.82
IVT [23]      0.47/9.87      2.76/10.44    0.44/10.69     2.86/10.20    0.46/10.28     2.81/10.32
CT [46]       0.43/11.76     3.12/10.23    0.43/11.04     3.34/10.45    0.43/11.40     3.23/10.34
FRT [2]       0.48/9.17      3.32/12.20    0.44/10.20     3.46/12.29    0.46/9.69      3.39/12.24
MIL [3]       0.40/12.03     2.27/8.80     0.35/13.67     2.60/9.67     0.38/12.85     2.44/9.23
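The accuracy and robustness measures reported above follow the VOT protocol: accuracy is, roughly, the average bounding-box overlap on frames where the tracker is not in a failure state, while robustness counts failures (re-initializations). A minimal sketch with hypothetical helper names, not the exact VOT toolkit implementation:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def vot_scores(pred_boxes, gt_boxes, failure_flags):
    """Accuracy: mean overlap on non-failure frames; robustness: #failures."""
    overlaps = [iou(p, g) for p, g, f in
                zip(pred_boxes, gt_boxes, failure_flags) if not f]
    accuracy = float(np.mean(overlaps)) if overlaps else 0.0
    robustness = sum(failure_flags)
    return accuracy, robustness
```

The ranks in Table I are then obtained by averaging per-sequence ranks across trackers, which this sketch omits.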
VI-D Evaluations on the DeformSOT dataset
We evaluate the proposed algorithm against existing methods, including holistic model-based trackers (i.e., IVT [23], L1T [26], TLD [20], MIL [3], Struck [15], MTT [47], CT [46], CN [10], STT [37], and STC [45]) and part-based trackers (i.e., Frag [2], SPT [34], SCM [48], LOT [30], ASLA [19], LSL [42], LGT [7], DGT [6], and TCP [22]). For a fair comparison, we use the same initial bounding box of each sequence for all trackers. The experimental results of the other trackers are reproduced from the available source codes with the recommended parameters.
As shown in Fig. 9, the evaluation results under OPE, SRE, and TRE indicate that our GGT tracker performs favorably against the other compared methods. In addition, Fig. 11 shows the tracking results of the top trackers on several sequences.
Attribute-based Evaluation. We also compare the performance of all tracking algorithms on videos with varying degrees of challenging factors, as shown in Fig. 10.
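The success and precision plots behind Figs. 9 and 10 are summarized by two standard scalar scores from the benchmark protocol [40]; a minimal sketch (illustrative helper names, not the benchmark toolkit):

```python
import numpy as np

def success_auc(overlaps, thresholds=np.linspace(0, 1, 21)):
    """Success score: fraction of frames whose overlap exceeds each
    threshold, averaged over thresholds (area under the success curve)."""
    overlaps = np.asarray(overlaps)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))

def precision_at(center_errors, threshold=20.0):
    """Precision score: fraction of frames whose center location error
    (pixels) is within the given threshold."""
    errors = np.asarray(center_errors)
    return float((errors <= threshold).mean())
```

OPE, SRE, and TRE differ only in how the tracker is initialized (one pass, spatially perturbed, or temporally restarted) before these scores are computed.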
VI-D1 Large Deformation
Existing part-based trackers [34, 7, 42, 6] mainly consider vulnerable pairwise geometric relations and are therefore prone to fail on sequences with significant target deformation (e.g., boarding in Fig. 11). According to Fig. 10(a)(g), our tracker performs favorably against the other methods because the high-order triangle relations, rather than varying pairwise displacements, preserve invariant angles, which helps remove noise from a large set of correspondence hypotheses.
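As an illustration of why angles are a more stable cue than pairwise displacements, the following sketch compares two triangles of part centers by their interior angles (hypothetical helper names and Gaussian kernel; not the paper's exact affinity function):

```python
import math

def triangle_angles(p1, p2, p3):
    """Sorted interior angles (radians) of a triangle of part centers."""
    def ang(a, b, c):  # angle at vertex a
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - a[0], c[1] - a[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        n = math.hypot(*v1) * math.hypot(*v2)
        return math.acos(max(-1.0, min(1.0, dot / n)))
    return sorted([ang(p1, p2, p3), ang(p2, p1, p3), ang(p3, p1, p2)])

def triangle_similarity(tri_a, tri_b, sigma=0.2):
    """Gaussian similarity on the angle differences of two part triangles:
    high for similar shapes regardless of their scale or position."""
    d = sum((a - b) ** 2 for a, b in
            zip(triangle_angles(*tri_a), triangle_angles(*tri_b)))
    return math.exp(-d / (2 * sigma ** 2))
```

Two triangles related by translation or scaling keep identical angles and hence maximal similarity, whereas a pairwise-distance affinity would change.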
VI-D2 Severe Occlusion
Some trackers [26, 20, 34, 48, 19, 42] drift away from the target or do not scale well when the target is heavily occluded (e.g., boarding, carscale, run and waterski in Fig. 11). In contrast, our method tracks the target relatively accurately because the structural correspondence modes exploit the invariant local geometric structure of target parts. This limits the influence of occlusion, as long as enough structural correspondence modes are detected to identify target parts that vote for the target state.
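The voting step can be sketched as a confidence-weighted vote of the surviving parts for the target center (an illustrative simplification; the function name and the offset representation are assumptions, not the paper's exact formulation):

```python
import numpy as np

def vote_target_center(part_positions, part_offsets, part_weights):
    """Each reliable part votes for the target center through its stored
    offset; the estimate is the confidence-weighted mean of the votes,
    so it degrades gracefully when some parts are occluded."""
    votes = np.asarray(part_positions, dtype=float) + \
            np.asarray(part_offsets, dtype=float)
    w = np.asarray(part_weights, dtype=float)
    center = (votes * w[:, None]).sum(axis=0) / w.sum()
    return tuple(center)
```

Setting an occluded part's weight to zero simply removes its vote, which mirrors why the state can still be recovered under partial occlusion.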
VI-D3 Abnormal Movement
Abnormal movements include various kinds of non-rigid change, such as abrupt motion, pose variation, and rotation. For example, SCM [48] and TCP [22] drift away when the gymnast jumps to grab the bars in unevenbars. By comparison, our method performs well in estimating both scale and position on these challenging sequences, which can be attributed to two reasons. First, the hypergraph is constructed with coarse target parts to exclude unnecessary background parts (see an example in Fig. 2(a)). Second, based on the modes, the reliable target parts are determined under noise and vote for the optimal target state.
VI-D4 Illumination Variation
Some trackers [19, 10] are robust to appearance changes caused by illumination variation; however, compared to our method, they perform poorly on sequences that simultaneously undergo other challenges, such as large deformation and abnormal movement (see bike in Fig. 11). Our advantage can be attributed to the use of geometric hypergraph learning to adapt to the appearance variation of local parts across consecutive frames.
VI-D5 Scale Change
For sequences with significant target scale change (e.g., boarding and carscale in Fig. 11), our tracker performs favorably against other methods [19, 6, 22], as shown in Fig. 10(e)(k). This is because we employ the angles of the triangle to measure the similarity of correspondence hypotheses, and these angles are invariant to scale change (see Section IV-B for details). In contrast, DGT [6] only considers neighboring pairwise relations between local parts, making it less flexible in handling changes of the target scale.
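A quick numerical check of the scale-invariance argument: when the target doubles in size, pairwise distances between parts change, while the triangle angles do not (an illustrative sketch; names are hypothetical):

```python
import math

def angles_of(pts):
    """Sorted interior angles of a triangle of part centers (radians)."""
    def ang(a, b, c):  # angle at vertex a
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - a[0], c[1] - a[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        return math.acos(dot / (math.hypot(*v1) * math.hypot(*v2)))
    a, b, c = pts
    return sorted([ang(a, b, c), ang(b, a, c), ang(c, a, b)])

tri = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0)]
scaled = [(2 * x, 2 * y) for x, y in tri]  # target doubles in size

# The pairwise cue (distance between two parts) changes with scale ...
d_before = math.dist(tri[0], tri[1])
d_after = math.dist(scaled[0], scaled[1])
# ... while the high-order cue (triangle angles) stays the same.
```

This is exactly why a pairwise-relation model must re-estimate its displacement statistics after a scale change, whereas the angle-based hyperedge weight does not.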
VI-D6 Background Clutter
In these sequences, the background surrounding the target has a similar appearance, which causes trackers to drift from the intended target to nearby objects (e.g., football in Fig. 11). To handle this problem, some methods [37, 45] exploit the context information around the target, while others [7, 6] employ a graph-based representation to capture the geometric structure of the target. Owing to the proposed confidence-aware sampling method, which imposes no distance constraint, the sampled representative hyperedges not only consider the relations between target parts and background parts (context), but also model the inlier geometric relations among local target parts (structure). Overall, our method ranks first in success score in Fig. 10(f) and second in precision score in Fig. 10(i).
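The confidence-aware sampling idea can be sketched as drawing k-vertex hyperedges with probability proportional to the vertex confidences and no spatial constraint, so both target-target and target-background relations can appear (a simplification with hypothetical names; the actual sampler is detailed in the paper):

```python
import numpy as np

def sample_hyperedges(confidences, n_vertices, n_samples, k=3, seed=0):
    """Draw distinct k-vertex hyperedges; each vertex is picked with
    probability proportional to its confidence, with no distance
    constraint on which vertices may share a hyperedge."""
    rng = np.random.default_rng(seed)
    p = np.asarray(confidences, dtype=float)
    p = p / p.sum()
    edges = set()
    while len(edges) < n_samples:
        e = tuple(sorted(rng.choice(n_vertices, size=k, replace=False, p=p)))
        edges.add(e)
    return sorted(edges)
```

Low-confidence (likely false) correspondence hypotheses are rarely sampled, which is how the method keeps the hypergraph small without discarding context.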
VII Conclusion and Future Work
In this paper, we describe the Geometric hyperGraph Tracker (GGT), based on geometric hypergraph learning for visual tracking, in which high-order geometric relations among correspondence hypotheses are integrated in a dynamically constructed geometric hypergraph. Our method is general in that traditional graph-based tracking methods can be viewed as special cases of the proposed algorithm with lower-order hypergraphs. In addition, a confidence-aware sampling method is developed to reduce the computational complexity and the scale of the hypergraph for better efficiency. Experiments on the VOT2014 dataset and the DeformSOT dataset, which include large deformation and severe occlusion challenges, demonstrate the favorable performance of the proposed method compared with existing methods.
Several aspects of our method can be improved in future work. To characterize more kinds of graph-based trackers, we can exploit high-order temporal and spatial relations among a large number of correspondence hypotheses in multiple consecutive frames simultaneously. However, considering more spatio-temporal relations in the model implies higher computational complexity. Therefore, one future direction is to introduce a more effective mechanism for selecting vertices and hyperedges to reduce redundant hypotheses. Another possible direction is to learn a holistic target representation that is updated jointly with the simple superpixel representation for better robustness and discriminability.
References
[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[2] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragments-based tracking using the integral histogram. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 798–805, 2006.
[3] B. Babenko, M.-H. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1619–1632, 2011.
[4] W. Bouachir and G. Bilodeau. Collaborative part-based tracking using salient local predictors. Computer Vision and Image Understanding, 137:88–101, 2015.
[5] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.
 [6] Z. Cai, L. Wen, Z. Lei, N. Vasconcelos, and S. Z. Li. Robust deformable and occluded object tracking with dynamic graph. IEEE Transactions on Image Processing, 23(12):5497–5509, 2014.
[7] L. Cehovin, M. Kristan, and A. Leonardis. Robust visual tracking using an adaptive coupled-layer visual model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):941–953, 2013.
 [8] L. Cehovin, M. Kristan, and A. Leonardis. Is my new tracker really better than yours? In IEEE Winter Conference on Applications of Computer Vision, pages 540–547, 2014.
 [9] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In Proceedings of British Machine Vision Conference, 2014.
[10] M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. Van de Weijer. Adaptive color attributes for real-time visual tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[11] S. Duffner and C. Garcia. PixelTrack: A fast adaptive algorithm for tracking non-rigid objects. In Proceedings of the IEEE International Conference on Computer Vision, pages 2480–2487, 2013.
 [12] M. K. et al. The visual object tracking VOT2014 challenge results. In Workshops in Conjunction with European Conference on Computer Vision, pages 191–217, 2014.
[13] M. Godec, P. M. Roth, and H. Bischof. Hough-based tracking of non-rigid objects. In Proceedings of the IEEE International Conference on Computer Vision, pages 81–88, 2011.
 [14] Y. Guo, Y. Chen, F. Tang, A. Li, W. Luo, and M. Liu. Object tracking using learned feature manifolds. Computer Vision and Image Understanding, 118:128–139, 2014.
 [15] S. Hare, A. Saffari, and P. H. Torr. Struck: Structured output tracking with kernels. In Proceedings of the IEEE International Conference on Computer Vision, pages 263–270, 2011.
[16] S. Hare, A. Saffari, and P. H. S. Torr. Efficient online structured output learning for keypoint-based object tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1894–1901, 2012.
[17] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
 [18] Z. Hong, C. Wang, X. Mei, D. Prokhorov, and D. Tao. Tracking using multilevel quantizations. In European Conference on Computer Vision, volume 8694, pages 155–171, 2014.
 [19] X. Jia, H. Lu, and M. Yang. Visual tracking via adaptive structural local sparse appearance model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1822–1829, 2012.

[20] Z. Kalal, J. Matas, and K. Mikolajczyk. P-N learning: Bootstrapping binary classifiers by structural constraints. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 49–56, 2010.
[21] J. Lee, M. Cho, and K. M. Lee. Hypergraph matching via reweighted random walks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1633–1640, 2011.
 [22] W. Li, L. Wen, M. C. Chuah, Y. Zhang, Z. Lei, and S. Z. Li. Online visual tracking using temporally coherent part cluster. In IEEE Winter Conference on Applications of Computer Vision, pages 9–16, 2015.
[23] J. Lim, D. A. Ross, R.-S. Lin, and M.-H. Yang. Incremental learning for visual tracking. In Advances in Neural Information Processing Systems, pages 793–800, 2004.
 [24] H. Liu, X. Yang, L. J. Latecki, and S. Yan. Dense neighborhoods on affinity graph. International Journal of Computer Vision, 98(1):65–82, 2012.
[25] T. Liu, G. Wang, and Q. Yang. Real-time part-based visual tracking via adaptive correlation filters. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 4902–4912, 2015.
[26] X. Mei and H. Ling. Robust visual tracking using ℓ1 minimization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1436–1443, 2009.
[27] H. Nam, S. Hong, and B. Han. Online graph-based tracking. In European Conference on Computer Vision, pages 112–126, 2014.
[28] G. Nebehay and R. Pflugfelder. Clustering of static-adaptive correspondences for deformable object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2784–2791, 2015.
[29] G. Nebehay and R. P. Pflugfelder. Consensus-based matching and tracking of keypoints for object tracking. In Winter Conference on Applications of Computer Vision, pages 862–869, 2014.
 [30] S. Oron, A. BarHillel, D. Levi, and S. Avidan. Locally orderless tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1940–1947, 2012.
 [31] X. Ren and J. Malik. Tracking as repeated figure/ground segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
[32] J. Wang and Y. Yagi. Many-to-many superpixel matching for robust tracking. IEEE Transactions on Cybernetics, 44(7):1237–1248, 2014.

[33] Q. Wang, F. Chen, and W. Xu. Tracking by third-order tensor representation. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 41(2):385–396, 2011.
[34] S. Wang, H. Lu, F. Yang, and M.-H. Yang. Superpixel tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 1323–1330, 2011.
 [35] W. Wang and R. Nevatia. Robust object tracking using constellation model with superpixel. In Asian Conference on Computer Vision, pages 191–204, 2012.
 [36] L. Wen, Z. Cai, D. Du, Z. Lei, and S. Z. Li. Learning discriminative hidden structural parts for visual tracking. In Workshops in Conjunction with Asian Conference on Computer Vision, pages 262–276, 2014.
[37] L. Wen, Z. Cai, Z. Lei, D. Yi, and S. Z. Li. Online spatio-temporal structural context learning for visual tracking. In European Conference on Computer Vision, pages 716–729, 2012.
[38] L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang. JOTS: Joint online tracking and segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2226–2234, 2015.
 [39] A. Wendel, S. Sternig, and M. Godec. Robustifying the flock of trackers. In Computer Vision Winter Workshop, page 91. Citeseer, 2011.
[40] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2411–2418, 2013.
 [41] F. Yang, H. Lu, and M. Yang. Learning structured visual dictionary for object tracking. Image Vision Computing, 31(12):992–999, 2013.
[42] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel. Part-based visual tracking with online latent structural learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2363–2370, 2013.
[43] K. M. Yi, H. Jeong, B. Heo, H. J. Chang, and J. Y. Choi. Initialization-insensitive visual tracking through voting with salient local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 2912–2919, 2013.

[44] X. Yu, J. Yang, T. Wang, and T. S. Huang. Key point detection by max pooling for tracking. IEEE Transactions on Cybernetics, 45(3):444–452, 2015.
[45] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M. Yang. Fast visual tracking via dense spatio-temporal context learning. In European Conference on Computer Vision, pages 127–141, 2014.
[46] K. Zhang, L. Zhang, and M.-H. Yang. Real-time compressive tracking. In European Conference on Computer Vision, pages 864–877, 2012.
[47] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja. Robust visual tracking via multi-task sparse learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2042–2049, 2012.
[48] W. Zhong, H. Lu, and M. Yang. Robust object tracking via sparsity-based collaborative model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1838–1845, 2012.