Geometric Hypergraph Learning for Visual Tracking

03/18/2016 · by Dawei Du, et al.

Graph-based representations are widely used in the visual tracking field to find correct correspondences between target parts in consecutive frames. However, most graph-based trackers consider only pairwise geometric relations between local parts. They do not make full use of the target's intrinsic structure, so the representation is easily disturbed by errors in pairwise affinities when large deformation and occlusion occur. In this paper, we propose a geometric hypergraph learning based tracking method, which fully exploits high-order geometric relations among multiple correspondences of parts in consecutive frames. Visual tracking is then formulated as a mode-seeking problem on the hypergraph, in which vertices represent correspondence hypotheses and hyperedges describe high-order geometric relations. Besides, a confidence-aware sampling method is developed to select representative vertices and hyperedges to construct the geometric hypergraph, for more robustness and scalability. Experiments are carried out on two challenging datasets (VOT2014 and Deform-SOT) to demonstrate that the proposed method performs favorably against other existing trackers.


I Introduction

Visual tracking has attracted much research interest in the computer vision field because it is a critical step in various applications, including video surveillance, sports analysis, autonomous driving, etc. Despite promising progress over the past decade, it remains very challenging to design a robust tracker that can handle appearance changes caused by various critical situations, such as large deformation, illumination variation, partial and full occlusion, and background clutter. In particular, deformation and occlusion are the two most notable challenges that degrade tracking performance.

For tracking scenarios where the target appearance is relatively stable, methods based on global appearance models can achieve satisfactory performance [20, 15, 33, 45, 9, 17]. However, when large deformation and occlusion occur, such approaches usually fail to track the target robustly. To counter this problem, many approaches based on models of local parts have received attention [34, 30, 19, 32, 44]. Moreover, several methods to represent the target's geometric structure have been proposed, such as the structural Support Vector Machine (SVM) [42], Markov Random Fields (MRF) [31, 18], keypoint constellations [29, 28], and graph models [35, 6]. However, most approaches consider only pairwise relations between target parts, which are easily disturbed by errors in pairwise affinities, making it difficult to preserve the geometric structure underlying the target representation.

Fig. 1: Top: between the target part set and the candidate part set, correspondence hypotheses are generated and constrained by the high-order relations among them. Bottom: the geometric hypergraph is constructed based on the two part sets, and the mode is extracted by searching the hypergraph. For clarity, only a few vertices and hyperedges are shown.

In this paper, we present a novel Geometric hyperGraph Tracker (GGT) to handle the visual tracking task, especially for deformable targets. Different from previous works that consider pairwise geometric relations between local parts, our method exploits high-order relations among more than two correspondences based on the geometric hypergraph. Specifically, the geometric hypergraph is constructed and learned based on the target part set and the candidate part set. The candidate part set consists of candidate parts extracted from the searching area in the current frame, while the target part set, our part representation of the target, consists of the target parts accumulated up to the previous frame. We then generate possible correspondences between parts in the two sets, which are defined as correspondence hypotheses. In Fig. 1, we give a schematic diagram of constructing the hypergraph, where vertices encode correspondence hypotheses and hyperedges encode high-order geometric relations among several correspondence hypotheses. Thus the geometric structure of the target can be effectively characterized by the hypergraph, giving more discriminative power to separate the common appearance and geometric properties of correct correspondences from noise. Moreover, reliable correspondence hypotheses form a set of modes in which a large number of high-confidence hyperedges are involved, while false correspondences touch only a few hyperedges, all with low confidence. For easier reading, we first define the structural correspondence mode as:

Definition 1. A structural correspondence mode is a group of reliable correspondences between target parts with similar appearance and consistent geometric structure, whose inter-connections attain a local maximum of the overall confidence on the geometric hypergraph.

The present work makes the following contributions:

  • The geometric hypergraph is used to represent the target, which fully exploits high-order geometric relations among correspondence hypotheses in consecutive frames.

  • The confidence-aware sampling method is proposed to approximate the geometric hypergraph, which not only alleviates sensitivity to noise but also scales to large hypergraphs. Structural correspondence modes are then sought efficiently on the hypergraph with the pairwise coordinate update method of [24].

  • Our method is compared to existing methods on the VOT2014 and Deform-SOT datasets. The experimental results demonstrate the effectiveness and robustness of the proposed model.

The rest of the paper is organized as follows. Section II reviews relevant previous works. The methodology is described in Section III, the model optimization in Section IV, and the tracking pipeline in Section V. In Section VI, we evaluate the proposed algorithm on two tracking datasets against other existing methods. We conclude the paper with discussions on future work in Section VII.

II Related Works

Tracking methods based on modeling relations between target parts have been shown to be less susceptible to the problems posed by object deformation and occlusion. Recently, many works have focused on how to incorporate geometric information as an important clue to facilitate visual tracking.

Keypoint Based Tracking Methods. Keypoint based trackers use the displacements of target parts to vote for the target center in consecutive frames, thereby accounting for the geometric structure. Hare et al. [16] combine feature matching, learning, and object pose estimation into a coherent structured output learning framework for real-time keypoint-based object detection and tracking. Yang et al. [41] propose a visual tracking algorithm that incorporates SIFT features from interest points to represent appearance and exploits their geometric structure, learning a structured visual dictionary to enhance discrimination between the foreground object and the background. Yi et al. [43] propose a tracking method using "motion saliency" and "descriptor saliency" of local features, obtaining the tracking result by combining the generalized Hough transform votes of each local feature of the target and its surroundings. Guo et al. [14] formulate tracking and recognition as a maximum a posteriori estimation problem under a manifold representation learnt from collections of local features, preserving local appearance similarity and spatial structure. Nebehay and Pflugfelder [29] develop a keypoint-based tracker in a combined matching-and-tracking framework, where each keypoint casts votes for the object center; an improved algorithm [28] employs a geometric dissimilarity measure to separate inlier correspondences from outliers by considering both static and adaptive correspondences. In [4], keypoints are treated as elementary predictors localizing the target in a collaborative search strategy, where the persistence, spatial consistency, and predictive power of a local feature are used to select the most reliable features for tracking. However, keypoint based trackers focus on modeling the displacements between parts and the target center, which is insufficient to fully exploit relations between local parts for geometric structure representation.

Part Based Tracking Methods. To better handle shape deformation and partial occlusion, part based methods are gaining popularity in visual tracking. Wen et al. [36] present a discriminative learning method to infer the position, shape, and size of each part, using the Metropolis-Hastings algorithm integrated with an online SVM. Wang and Nevatia [35] propose to track non-rigid objects with multiple related parts, modeling tracking as a Dynamic Bayesian Network in which the spatial relations among parts are formulated probabilistically. Improving on [15], Yao et al. [42] introduce a part-based tracking algorithm with online latent structured learning, using a global object box and a small number of part boxes to approximate the irregular object and reduce visual drift. Cehovin et al. [7] employ a global representation to probabilistically model the target's global visual properties, while low-level patches are constrained and updated with the global model during tracking. A dynamic structure graph based tracker [6] formulates tracking as subgraph matching between the geometric structure graph of the target and that of the candidate target proposals. Nam et al. [27] use a new graphical model to adapt to sequence structure and propagate the posterior over time, where each vertex has a single outgoing edge but may have multiple incoming edges. Hong et al. [18] propose an MRF-based tracker that considers geometric structure through a hierarchical appearance representation, exploiting shared information across multi-level quantizations of the image space, i.e., pixels, superpixels, and bounding boxes. In [25], a real-time tracking method is proposed based on parts with multiple correlation filters, in which a Bayesian inference framework and a structural constraint mask handle various appearance changes. However, existing part based methods give little consideration to exploiting high-order geometric relations among target parts for more robustness.

Segmentation Based Tracking Methods. Segmentation based methods consider geometric information by determining the precise location of each target pixel. Based on the generalized Hough transform, Godec et al. [13] develop improved online Hough Forests and couple the voting based detection and back-projection with a rough segmentation based on GrabCut. Duffner and Garcia [11] present a pixel-based non-rigid object tracking method consisting of a detector based on a generalized Hough transform with pixel-based descriptors and a probabilistic segmentation method based on a global model of foreground and background. Recently, Wen et al. [38] develop a joint online tracking and segmentation algorithm that integrates multi-part tracking and segmentation into a unified energy optimization framework.

III Methodology

Fig. 2: Tracking process on the waterski sequence. (a) Given the candidate part set, we aim to find reliable correspondences with the target part set. This is done by seeking structural correspondence modes with tolerance of deformation and scale change; the green arrows indicate the displacements between corresponding parts in the two sets. Reliable target parts are then determined, and the geometric hypergraph is updated incrementally. (b) The confidence map is constructed based on the reliable target part set, and the tracking result is output by uniform sampling on the confidence map.

We first introduce terms and notations used in the sequel. The order of the hypergraph is the number of vertices each hyperedge connects. A hypergraph is a generalization of a graph in which an edge (a hyperedge, strictly speaking) can connect more than two vertices, whereas the edges of an ordinary graph connect exactly two vertices. An unconnected graph is a graph without edges between vertices.

Although our method is related to three previous works, SPT [34], DGT [6], and TCP [22], there are significant differences between our method and them, summarized below.

  • Though both our method and SPT use a superpixel representation, our method represents the geometric information of target parts more effectively, which leads to improved performance on more complex scenes. When the hypergraph degenerates into an unconnected graph (order one), SPT can be regarded as a special case of the proposed algorithm.

  • DGT uses a graph to exploit pairwise geometric relations between neighboring parts. In contrast, our method employs a hypergraph that considers high-order geometric relations among correspondence hypotheses to better handle abrupt deformation, motion change, and target context. When the hypergraph reduces to a normal graph (order two), DGT can be regarded as a special case of the proposed algorithm.

  • TCP mainly exploits temporal high-order relations among different parts in consecutive frames, ignoring the geometric structure information of local parts spatially. In contrast, our method focuses on modeling the spatial high-order relations among correspondence hypotheses. Besides, the temporal relations of parts are also considered when updating the hypergraph.

In this work, the tracking problem is formulated as the mode-seeking problem on the geometric hypergraph. In Section III-A, we construct the hypergraph based on the target part set and candidate part set. Then we give the detailed formulation in Section III-B and define corresponding confidence measure in Section III-C.

III-A Geometric Hypergraph

The superpixel representation is more flexible for deformable targets than a holistic representation, but it has low discriminative power because of the small size of each part. We therefore construct a geometric hypergraph to alleviate this problem with geometric constraints. Given the annotated bounding box in the first frame, the target part set is first initialized, and the candidate part set is determined by coarse labeling of superpixels in subsequent frames: similar to [6], we first use the SLIC algorithm [1] to over-segment the searching area of the target into multiple parts (superpixels), and then employ the Graph Cut method [5] to coarsely separate the foreground parts from the background, as shown in the top-left of Fig. 2(a). Based on the two part sets, we construct the vertex set and hyperedge set of the geometric hypergraph as

(1)

where the vertex set contains all correspondence hypotheses, and each hyperedge connects as many vertices as the order of the hypergraph, without conflicts or duplicates, provided the Euclidean distances between the centers of the involved parts in the image plane fall below a distance threshold. The threshold is set adaptively according to the number of superpixels in the searching area and the area's width and height in the current frame, as shown in Fig. 2.
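To make the construction concrete, here is a minimal sketch of the vertex and hyperedge sets of Eq. (1). The function name, the fixed distance threshold, and the representation of parts as (x, y) centers are illustrative assumptions, not the paper's notation; the paper's adaptive threshold is replaced by a constant.

```python
from itertools import combinations
import math

def build_geometric_hypergraph(target_parts, candidate_parts, order=3, dist_thresh=40.0):
    """Sketch of Eq. (1): vertices are correspondence hypotheses (one target
    part paired with one candidate part); hyperedges group `order` distinct
    vertices whose candidate parts lie mutually within `dist_thresh`.
    Parts are given as (x, y) centres in the image plane."""
    # Every (target part, candidate part) pair is a correspondence hypothesis.
    vertices = [(i, j) for i in range(len(target_parts))
                       for j in range(len(candidate_parts))]

    def valid(edge):
        # Keep a hyperedge only if it has no duplicated target or candidate
        # part (no conflicts) and all its candidate parts are mutually close.
        tgt = {v[0] for v in edge}
        cand = {v[1] for v in edge}
        if len(tgt) < len(edge) or len(cand) < len(edge):
            return False
        return all(math.dist(candidate_parts[a[1]], candidate_parts[b[1]]) < dist_thresh
                   for a, b in combinations(edge, 2))

    hyperedges = [e for e in combinations(vertices, order) if valid(e)]
    return vertices, hyperedges
```

For realistic part counts this exhaustive enumeration is intractable, which is exactly why the confidence-aware sampling of Section IV-B is needed.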

III-B Formulation

As analyzed in the introduction, multiple correspondences with similar structural geometric properties form a set of structural correspondence modes. By measuring the overall confidence of modes, the tracking problem is formulated as

(2)

where the structural correspondence mode comprises a subset of vertices, and the confidence measure function reflects the confidence distribution in the mode, as described below.

III-C Confidence Measure

We design two terms to encode both the association confidence of vertices and the geometric confidence among them, i.e.,

(3)

where the first term sums over the vertex set of the mode and the second over its hyperedge set, with two balancing factors weighting the terms.

Association Confidence. The association confidence encodes the probability that the two parts of a vertex belong to the same class, and is defined as

(4)

where the exponent involves the distance between the appearance features of the two parts, scaled by a parameter controlling the importance of appearance similarity. In the experiments, the appearance feature is the concatenation of an HSV color feature and LBP texture.
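Since the exact kernel of Eq. (4) is not legible here, the following sketch assumes the common Gaussian form over the appearance-feature distance; the function name, the squared distance, and the default scaling parameter are assumptions.

```python
import numpy as np

def association_confidence(feat_target, feat_candidate, sigma_a=0.5):
    """Sketch of Eq. (4): appearance similarity of the two parts in a
    correspondence hypothesis, assuming a Gaussian kernel over the feature
    distance. The paper concatenates HSV colour and LBP texture features."""
    d = np.linalg.norm(np.asarray(feat_target, float) - np.asarray(feat_candidate, float))
    return float(np.exp(-d**2 / sigma_a**2))
```

Identical features give confidence 1, and the confidence decays monotonically as the appearance distance grows.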

Geometric Confidence. The geometric confidence describes the geometric relation among the correspondence hypotheses in a hyperedge. If the order of the graph is larger than two, it is a hypergraph describing high-order geometric relations among correspondence hypotheses. As the order decreases, it degenerates into an ordinary graph considering pairwise geometric relations, and further into an unconnected graph ignoring geometric relations. Therefore, the calculation of the geometric confidence differs with the order of the hypergraph, as discussed below.

III-C1 Unconnected Graph

If the order is one, the hypergraph becomes an unconnected graph. Visual tracking then depends only on the association confidence, without any geometric structural constraints. Similar to SPT [34], this is effectively a part-based template matching method. The appearance information encoded in the association confidence is usually weak, especially for small superpixels, resulting in worse performance in scenarios with complex appearance variation.

III-C2 Graph

If the order is two, the model provides complementary pairwise geometric information for each edge, beyond the appearance information used by SPT [34]; DGT [6] falls into this category. The pairwise similarity comparing two correspondence hypotheses is calculated as

(5)

where one pair of parts comes from the target part set and the corresponding pair from the candidate part set. The consistency of the two supporters is calculated from the location displacements of the two neighboring correspondence hypotheses, as shown in Fig. 3(a), and a scaling parameter controls the importance of the geometric constraint.
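The pairwise consistency of Eq. (5) can be sketched as follows, under the assumption of a Gaussian kernel on the displacement mismatch; the function name and the default scale are illustrative.

```python
import numpy as np

def pairwise_geometric_confidence(t1, t2, c1, c2, sigma_g=10.0):
    """Sketch of Eq. (5): consistency of two correspondence hypotheses
    (t1 -> c1) and (t2 -> c2). Under near-rigid motion the displacement
    t1 - t2 in the target frame should match c1 - c2 in the candidate frame;
    the Gaussian kernel form and sigma_g are assumptions."""
    d_target = np.asarray(t1, float) - np.asarray(t2, float)
    d_candidate = np.asarray(c1, float) - np.asarray(c2, float)
    err = np.linalg.norm(d_target - d_candidate)
    return float(np.exp(-err**2 / sigma_g**2))
```

A pure translation of both parts leaves the displacement unchanged and yields confidence 1, while a scale change stretches the candidate displacement and lowers the score, which is precisely the weakness the high-order term below addresses.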

III-C3 Hypergraph

As shown in Fig. 3(a), the supporters provide a pairwise relation measure restricted to distances, which makes it hard to handle large scale changes and induces wrong correspondences between target parts. In contrast, as shown in Fig. 3(b), we exploit the angle information of triplets of correspondence hypotheses to achieve scale invariance, leading to more correct associations between target parts. For example, three correspondence hypotheses form two triangles, one in each frame. Although the target scale may change drastically, the angles (high-order geometric relations) remain more stable than the relative displacements between parts (pairwise geometric relations). For better understanding, a real example of structural correspondence in waterski is shown in Fig. 1(b). Similar to [21], the geometric confidence is calculated by comparing corresponding angles, as

(6)

where the compared quantities are the corresponding interior angles formed by the triplet of parts in the target part set and in the candidate part set, respectively.
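The angle comparison of Eq. (6) can be sketched as below. The kernel form and the scale parameter are assumptions; the scale invariance of the interior angles, which motivates the high-order term, follows directly.

```python
import math

def triangle_angles(a, b, c):
    """Interior angles (radians) of the triangle with vertices a, b, c."""
    def ang(p, q, r):  # angle at p between rays p->q and p->r
        v1 = (q[0] - p[0], q[1] - p[1])
        v2 = (r[0] - p[0], r[1] - p[1])
        cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.acos(max(-1.0, min(1.0, cos)))
    return ang(a, b, c), ang(b, a, c), ang(c, a, b)

def highorder_geometric_confidence(tgt_triplet, cand_triplet, sigma_h=0.5):
    """Sketch of Eq. (6): compare corresponding interior angles of the two
    triangles formed by a triplet of correspondence hypotheses. Angles are
    invariant to uniform scaling; the Gaussian kernel and sigma_h are
    assumptions."""
    diff = sum(abs(a - b) for a, b in
               zip(triangle_angles(*tgt_triplet), triangle_angles(*cand_triplet)))
    return math.exp(-diff**2 / sigma_h**2)
```

A uniformly scaled copy of a triangle keeps all three angles, so the confidence stays at its maximum even under drastic scale change, whereas a sheared triangle is penalized.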

Fig. 3: (a) Pairwise geometric relations. When large scale changes occur, wrong correspondences between target parts are easily induced, since the supporters (blue arrows) are no longer reliable. (b) High-order geometric relations. Different from the pairwise measure, the angles of the triplet hypotheses are invariant to large scale changes, leading to correct correspondences.

IV Optimization

Given the geometric hypergraph, the mode-seeking problem is solved by searching the hypergraph (Section IV-A). Before that, we propose a confidence-aware sampling technique to improve the effectiveness of the proposed method (Section IV-B).

IV-A Mode-Seeking Problem

Since each maximum of (2) corresponds to a structural correspondence mode, we exhaustively search the hypergraph by taking each vertex as a starting point. For a mode with its vertex set and hyperedge set, consider the vector containing the probability of each vertex in the hypergraph belonging to the mode, with one entry per vertex, positive for vertices inside the mode and zero otherwise. Combined with (3), the problem in (2) is cast as optimizing this probability vector and further rewritten as

(7)

in which the first term of the objective penalizes the inclusion of vertices with low association confidence, and the second term encourages the inclusion of hyperedges with large geometric confidence. Essentially, this is an NP-hard combinatorial optimization problem. To solve it, the integrality constraint on the probability vector is relaxed to a box constraint with a constant upper bound; since the entries of the probability vector sum to one, a mode satisfying the constraint contains at least the reciprocal of that bound in vertices. To avoid the degeneracy problem, we require a minimal number of vertices in a mode, guaranteeing that adequate structural correspondences are included. The pairwise coordinate update method [24] is then used to solve the relaxed problem in (7) efficiently; refer to [24] for details of the optimization strategy.
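A generic simplification of a pairwise coordinate update can be sketched as follows: repeatedly pick two coordinates and transfer probability mass from the one with the smaller partial derivative to the one with the larger, which preserves the simplex constraint and the box bound. This is a toy illustration of the idea, not the actual update rule of [24].

```python
import random

def pairwise_coordinate_ascent(grad, x, eps, iters=1000, seed=0):
    """Toy sketch of pairwise coordinate ascent on the relaxed problem (7).
    `grad(x)` returns the gradient of the relaxed mode-confidence objective;
    `eps` is the per-entry upper bound whose reciprocal lower-bounds the mode
    size. Mass moves between two random coordinates per step, keeping sum(x)
    fixed and every entry within [0, eps]."""
    rng = random.Random(seed)
    n = len(x)
    for _ in range(iters):
        i, j = rng.sample(range(n), 2)
        g = grad(x)
        if g[i] < g[j]:
            i, j = j, i                  # move mass toward the larger gradient
        step = min(eps - x[i], x[j])     # respect the box constraints
        x[i] += step
        x[j] -= step
    return x
```

For a linear objective the iterate concentrates its mass, up to the bound, on the highest-gradient coordinates, which mimics how a mode absorbs its most confident vertices.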

IV-B Confidence-aware Sampling

Suppose the target part set and candidate part set each consist of n parts; there are then up to n² correspondence hypotheses. For a hypergraph of order g, the number of full-affinity hyperedges grows on the order of n to the power 2g, which demands a huge amount of memory. It then becomes necessary to reduce the computational complexity by introducing a sparse hypergraph structure over the significant hypotheses. To this end, we develop a confidence-aware sampling method as follows.

  1. Firstly, we reduce the number of vertices deterministically by introducing two thresholds. We assume target parts move smoothly in consecutive frames, meaning that appearances change little in a short time interval. To remove noise, for each target part we keep only a few correspondence hypotheses: at most a fixed number with the highest association confidences, each exceeding an appearance threshold.

  2. Secondly, the number of hyperedges is greatly decreased probabilistically. Based on the simple assumption that a vertex with higher association confidence is more likely to be a reliable correspondence, we sample more hyperedges around vertices with higher association confidence. Specifically, starting from each vertex in the reduced vertex set, we sample a bounded number of hyperedges comprising three vertices without conflicts, using the normalized confidence as the sampling probability and a constant as the maximal number of sampled hyperedges per vertex.

Different from other MRF or graph based approaches that consider pairwise relations only between nearest neighboring vertices, we sample hyperedges randomly without distance constraints to exploit the geometric information fully, so that the hypergraph spans globally over all correspondence hypotheses. An additional benefit is that context information between target parts and background parts can be considered for more robustness.
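The two-stage sampling above can be sketched as follows. All parameter names (top_k, theta, max_edges) are illustrative stand-ins for the stripped symbols, and the conflict check on sampled hyperedges is omitted for brevity.

```python
import random

def confidence_aware_sampling(vertices, confidence, n_targets, top_k=5,
                              theta=0.2, max_edges=50, order=3, seed=0):
    """Sketch of the two-stage confidence-aware sampling. Stage 1 keeps, for
    each target part, at most `top_k` hypotheses whose association confidence
    exceeds `theta`. Stage 2 draws up to `max_edges` hyperedges per kept
    vertex, choosing partner vertices with probability proportional to their
    confidence. Vertices are (target_index, candidate_index) pairs."""
    rng = random.Random(seed)
    # Stage 1: deterministic vertex reduction (threshold, then top-k).
    kept = []
    for t in range(n_targets):
        cands = [v for v in vertices if v[0] == t and confidence[v] > theta]
        cands.sort(key=lambda v: confidence[v], reverse=True)
        kept.extend(cands[:top_k])
    # Stage 2: probabilistic hyperedge sampling around confident vertices.
    edges = set()
    for v in kept:
        others = [u for u in kept if u != v]
        if len(others) < order - 1:
            continue
        weights = [confidence[u] for u in others]
        for _ in range(max_edges):
            pick = set()
            while len(pick) < order - 1:
                pick.add(rng.choices(others, weights=weights)[0])
            edges.add(tuple(sorted({v} | pick)))
    return kept, edges
```

Stage 1 bounds the vertex count per target part; stage 2 bounds the hyperedge count per vertex, so the approximate hypergraph stays sparse regardless of how many raw hypotheses exist.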

Based on the confidence-aware sampling method, we sample vertices and hyperedges of the full hypergraph, obtaining an approximate geometric hypergraph on which mode-seeking is then performed directly. Specifically, the reduced vertex set and hyperedge set are given as

(8)

where the first count limits the number of vertices retained per target part and the second limits the number of hyperedges retained per vertex. The sampling scheme ensures that enough relevant correspondence hypotheses are found, while sharply decreasing the numbers of vertices and hyperedges; empirically, the vast majority of redundant vertices and hyperedges are removed.

V Tracking

V-A Extracting Reliable Target Parts

Given the optimized probability vector, we can determine the vertices belonging to the corresponding mode, namely those with positive probability. Since the hypergraph is searched starting from each vertex, one vertex may appear in multiple modes. The conflicts among modes must be removed to find the reliable target parts; the whole procedure is summarized in Algorithm 1.

0:  structural correspondence mode set D
0:  reliable target part set
1:  Sort the mode set D based on the confidence values in descending order
2:  Initialize the parsed mode set (without conflicts) as empty
3:  for each non-empty mode in D do
4:     if it has no intersection with any member of the parsed set then
5:         Add it to the parsed mode set
6:     else
7:         Remove the parts that overlap with the parsed modes
8:         Add the pruned mode to the parsed mode set
9:     end if
10:  end for
11:  Obtain the reliable target part set as the union of the parsed modes
Algorithm 1 Extracting Reliable Target Parts
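Algorithm 1 can be rendered compactly in Python; the representation of a mode as a frozenset of vertices is an implementation choice, not the paper's notation.

```python
def extract_reliable_parts(modes):
    """Python rendering of Algorithm 1. `modes` maps each structural
    correspondence mode (a frozenset of parts/vertices) to its confidence.
    Modes are parsed in descending confidence; parts already claimed by a
    stronger mode are removed from weaker ones, and the union of the
    surviving modes gives the reliable target part set."""
    parsed = []  # modes kept so far, conflicts removed
    for mode, conf in sorted(modes.items(), key=lambda kv: kv[1], reverse=True):
        pruned = mode - set().union(*parsed) if parsed else mode
        if pruned:  # empty modes are dropped entirely
            parsed.append(pruned)
    return set().union(*parsed) if parsed else set()
```

Processing in descending confidence means a vertex shared by two modes is always attributed to the more confident one, which is the conflict-removal rule of lines 4-8.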

V-B Reliable Target Parts Based Voting

After obtaining the reliable target parts, we determine the target state in the current frame, including the center and scale of the target, by voting based on the reliable target parts. Similar to [6], we construct a confidence map representing the location probability of the target in the searching area as

(9)

where the map assigns, to each position in the searching area, a value depending on whether it falls in the region of target parts belonging to the extracted modes or in the region of other candidate parts, with constants controlling the influence of each type of region.

To find the bounding box that covers the most foreground with respect to its center and scale, we form the following optimization problem

(10)

where the objective integrates the confidence map over the region with the given center and scale.

The target center in the current frame is largely determined by the center in the previous frame through the geometric constraint. To reduce computational complexity, we first estimate a rough target center by calculating the confidence-weighted mean of the reliable target part centers, i.e.,

(11)

where the weight of each reliable target part is the confidence of the mode containing it, relative to the optimal center in the previous frame. After that, we modify the target center with a displacement perturbation term and adjust the target scale with a scale perturbation term for a visually better location. The maximal values of the two perturbation terms are set to the mean diameter of candidate parts in the current frame. The final target state is obtained by optimizing (10) with a sampling strategy, namely selecting the state with the maximal score out of numerous randomly sampled states. Assembling all parts belonging to the target, we find the optimal target state, as shown in Fig. 2(b).
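The voting and perturbation steps above can be sketched as follows. This is a simplification: the dependence of Eq. (11) on the previous center is folded into a plain weighted mean, and `score` stands in for the confidence-map integral of Eq. (10).

```python
import random

def estimate_center(parts, weights):
    """Sketch of Eq. (11): rough target centre as the confidence-weighted mean
    of the centres of the reliable target parts (the explicit relation to the
    previous-frame centre is omitted in this simplification)."""
    wsum = sum(weights)
    return (sum(w * p[0] for p, w in zip(parts, weights)) / wsum,
            sum(w * p[1] for p, w in zip(parts, weights)) / wsum)

def refine_state(score, center, scale, max_shift, max_dscale,
                 n_samples=200, seed=0):
    """Sketch of the sampling strategy for Eq. (10): perturb the rough centre
    and scale at random and keep the state with the maximal score. `score` is
    assumed to integrate the confidence map of Eq. (9) over the box given by
    a (center, scale) pair."""
    rng = random.Random(seed)
    best, best_s = (center, scale), score(center, scale)
    for _ in range(n_samples):
        c = (center[0] + rng.uniform(-max_shift, max_shift),
             center[1] + rng.uniform(-max_shift, max_shift))
        s = scale + rng.uniform(-max_dscale, max_dscale)
        if score(c, s) > best_s:
            best, best_s = (c, s), score(c, s)
    return best
```

Random-state sampling avoids an exhaustive sliding-window search over all centers and scales while still concentrating candidates around the weighted-mean estimate.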

V-C Online Updating of Hypergraph

To handle significant changes of target appearance, the geometric hypergraph is updated in two aspects, i.e., the target part set and the candidate part set. As illustrated in Fig. 2(a), based on the parsed reliable target part set, an old target part is deleted if it has not been involved in any structural correspondence for a fixed number of frames, while a new part not involved in existing modes is added provided its geometric distance to every other part is larger than a threshold (set to twice the mean diameter of candidate parts in the current frame), preserving the spatial sparsity of the target part set. In addition, the appearance model in the MRF based segmentation method is updated to regenerate the candidate part set every frame, similar to [6].

VI Experiments

VI-A Datasets and Protocols

VI-A1 VOT2014 dataset

The VOT2014 dataset [12] is popular in the tracking community; it consists of representative sequences selected from a larger pool, each annotated with several attributes such as occlusion and illumination change.

We evaluate the tracking methods following two protocols of the VOT2014 challenge, i.e., Baseline and Region_noise. In Baseline, the tracker is run on each sequence several times, initialized on the groundtruth bounding box, and average statistics of the measures are reported. In Region_noise, the tracker is initialized with noisy bounding boxes, randomly perturbed relative to the groundtruth bounding box size, in each sequence. As defined in [8], two performance metrics are reported in the experiment: Accuracy (average overlap between the bounding box predicted by the tracker and the groundtruth one) and Robustness (number of re-initializations once the overlap drops to zero).

VI-A2 Deform-SOT dataset

To further evaluate the performance of trackers under deformation and occlusion, we collect the Deform-SOT dataset, which includes challenging sequences with different targets undergoing varying levels of deformation and occlusion in unconstrained environments. The dataset is diverse in object categories, camera viewpoints, sequence lengths, and difficulty. We categorize the challenges of the sequences into six classes for comparison: large deformation, severe occlusion, abnormal movement, illumination variation, scale change, and background clutter.

We use two popular measures for evaluation, i.e., the precision plot and the success plot. The precision plot shows the percentage of successfully tracked frames vs. the center location error in pixels, ranking the trackers by the precision score at a fixed pixel threshold; the success plot draws the percentage of successfully tracked frames vs. the bounding box overlap threshold, where the Area Under the Curve is used as the success score for ranking. We run One-Pass Evaluation (OPE), Spatial Robustness Evaluation (SRE), and Temporal Robustness Evaluation (TRE) for the two measures (see definitions in [40]) on the dataset.
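The two measures can be computed as follows. The 20-pixel precision threshold is an assumption (it is the value commonly used with the benchmark of [40]; the exact number is not legible in this text), and the AUC is taken as the mean success rate over evenly spaced overlap thresholds.

```python
import numpy as np

def success_score(overlaps, thresholds=np.linspace(0, 1, 21)):
    """Success plot: for each overlap threshold, the fraction of frames whose
    predicted box overlaps the groundtruth by more than that threshold. The
    success score is the area under this curve, computed here as the mean
    success rate over evenly spaced thresholds."""
    overlaps = np.asarray(overlaps, dtype=float)
    curve = np.array([(overlaps > t).mean() for t in thresholds])
    return float(curve.mean()), curve

def precision_score(center_errors, pixel_threshold=20.0):
    """Precision plot value at a fixed centre-location error; the 20-pixel
    threshold is assumed, following common usage with [40]."""
    errs = np.asarray(center_errors, dtype=float)
    return float((errs <= pixel_threshold).mean())
```

Both functions take per-frame values (IoU overlaps, center errors in pixels) for one sequence or pooled over a dataset.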

VI-B Implementation Details

The proposed tracker is implemented in MATLAB and C and runs at several frames per second on a machine with an Intel i7 processor. We first study the influence of several important parameters, with experiments performed on sequences selected from the Deform-SOT dataset covering different kinds of challenges.

VI-B1 Order of Hypergraph

The order of the hypergraph decides how we consider the geometric relations among correspondence hypotheses. We compare tracking with different orders of hypergraph, denoted GGT-or, where the corresponding geometric confidence is calculated as in Section III-C. As shown in Fig. 4, the variant considering high-order geometric relations performs best, whereas the variants considering only pairwise relations or no relations between parts suffer a large accuracy loss. This indicates the importance and effectiveness of our high-order representation, which fully integrates geometric structural information for the target.

Fig. 4: Performance vs. order of hypergraph.

VI-B2 Number of Pixels in Each Superpixel

The number of pixels in each superpixel controls the size of parts and the number of vertices in the hypergraph. As Fig. 5 shows, we consider different numbers of pixels per superpixel, denoted GGT-sp. If each superpixel is too large (e.g., GGT-sp200), it is hard to exploit discriminative geometric structure cues of local parts to handle deformation. If it is too small (e.g., GGT-sp30), the large number of hypotheses increases the computational complexity considerably without apparent performance improvement (GGT-sp30 ranks second in success score and first in precision score).

Fig. 5: Performance vs. number of pixels in superpixel.

VI-B3 Weight of Geometric Confidence

The weight of geometric confidence indicates the importance of the geometric term. Here we fix the association weight and enumerate the geometric weight in (3), denoted GGT-gc. Based on the performance in Fig. 6, an appropriate factor helps the tracker achieve higher performance by neither underestimating nor overestimating the geometric information.

Fig. 6: Performance vs. weight of geometric confidence.

VI-B4 Maximal Number of Sampled Hyperedges

The maximal number of sampled hyperedges controls how much geometric information is incorporated. We report the performance with different numbers of hyperedges in Fig. 7, denoted as GGT-he (). If the number of hyperedges is too small (e.g., GGT-he25), high-order geometric information is exploited insufficiently, leaving fewer discriminative structure cues to handle deformation; if it is too large (e.g., GGT-he250), the large number of hyperedges introduces many noisy relations, which is harmful because reliable correspondences are sparse.
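A minimal sketch of confidence-aware sampling, under the assumption that hypotheses are first filtered by an appearance-confidence threshold and triplets are then drawn with probability proportional to confidence, up to a fixed budget (the function name, the threshold `tau`, and the biasing scheme are illustrative, not the paper's exact procedure):

```python
import math
import random

def sample_hyperedges(hypotheses, confidences, tau=0.6, max_edges=100, seed=0):
    """Keep hypotheses whose appearance confidence exceeds tau, then draw
    up to max_edges distinct triplets, biased toward confident hypotheses."""
    rng = random.Random(seed)
    keep = [i for i, c in enumerate(confidences) if c >= tau]
    if len(keep) < 3:
        return []
    weights = [confidences[i] for i in keep]
    # Never ask for more distinct triplets than actually exist.
    limit = min(max_edges, math.comb(len(keep), 3))
    edges = set()
    while len(edges) < limit:
        trip = tuple(sorted(rng.choices(keep, weights=weights, k=3)))
        if len(set(trip)) == 3:  # discard degenerate draws with repeats
            edges.add(trip)
    return [tuple(hypotheses[i] for i in t) for t in edges]
```

Capping the budget at `max_edges` keeps the hypergraph small, which is the efficiency/robustness trade-off this subsection studies.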

Fig. 7: Performance vs. number of hyperedges.

Based on the above parameter analysis, we fix all parameters in our algorithm empirically. The order of the geometric hypergraph is set as . For the search area, we search for the target location in the current frame within times the size of the previous one. For the SLIC over-segmentation method, the number of pixels in each superpixel is set as , and the range of the number of superpixels is . We use bins for each channel of the HSV feature to represent the appearance of target parts. The weights in (3) are set as . The scaling parameters are in (4), and in (5)(6). In the sampling method, the appearance threshold is set as , and the maximal number of sampled hyperedges is set as . In (9), the term .

VI-C Evaluations on the VOT2014 Dataset

We compare our approach to several algorithms including the winner of the VOT2014 challenge, DSST [9], and two of the top-performing trackers of the online tracking benchmark [40], namely Struck [15] and KCF [17]. Furthermore, we include key-point based CMT [29] and IIVTv2 [43], the part based DGT [6], LGTv1 [7], OGT [27], and PTp [11], as well as the baseline trackers including FRT [2], CT [46], and MIL [3]. To ensure a fair comparison, all the results are copied from the original submissions to the VOT2014 challenge by the corresponding authors or the VOT committee.

Fig. 8: Tracking results of trackers (i.e., GGT, DSST [9], DGT [6], LGTv1 [7] and OGT [27]), denoted in different colors, on the VOT2014 dataset (from top to bottom: car, fish1, hand1, hand2, and skating, respectively). If a tracker is not shown in some frames, it has failed and will be re-initialized later (e.g., DSST [9] fails in hand1 ). Results are best viewed by zooming the digital edition of the figure.

Qualitative Evaluation. Examples of visual tracking results of the top trackers are shown in Fig. 8. We observe that our hypergraph based tracker performs favorably against other graph-based trackers, such as DGT [6], LGT [7], and OGT [27]. For example, DGT [6] and OGT [27] do not adapt to the scale change of the target in car. When the figure skater in skating moves under the challenges of background clutter and illumination variation, some trackers fail to locate the target well (e.g., OGT [27] in , and DGT [6] in ). Besides, DSST [9] fails to track the hand in hand1 and hand2 . This is because the high-order geometric relations in our method capture invariant properties of local parts, such as angles, rather than vulnerable pairwise affinities, rendering more tolerance to drastic rotation or appearance variations.

Quantitative Evaluation. Table I shows the average performance of the compared trackers. As these results show, our algorithm achieves the overall best robustness score and comparable accuracy among all the compared methods. Moreover, the considerable improvement at the Region_noise level indicates that the spatial high-order representation in our method can resist noise effectively and recover from initialization errors, gaining improvements in terms of both accuracy and robustness.

Baseline Region_noise Overall
Acc. Sc./Acc. Rk. Rob. Sc./Rob. Rk. Acc. Sc./Acc. Rk. Rob. Sc./Rob. Rk. Acc. Sc./Acc. Rk. Rob. Sc./Rob. Rk.
GGT 0.58/6.16 0.55/4.98 0.57/4.81 0.65/4.93 0.57/5.48 0.59/4.95
DSST [9] 0.62/4.48 1.16/6.32 0.58/4.01 1.28/6.22 0.60/4.25 1.22/6.27
DGT [6] 0.58/5.81 1.00/5.02 0.58/4.97 1.17/5.31 0.58/5.39 1.09/5.16
KCF [17] 0.63/4.22 1.32/6.53 0.58/4.50 1.52/6.62 0.61/4.36 1.42/6.57
LGTv1 [7] 0.47/9.29 0.66/5.96 0.46/8.73 0.64/5.42 0.47/9.01 0.65/5.69
Struck [15] 0.52/8.04 2.16/8.64 0.49/7.90 2.22/8.16 0.51/7.97 2.19/8.40
OGT [27] 0.55/7.09 3.34/9.78 0.51/7.19 3.37/10.30 0.53/7.14 3.36/10.04
PTp [11] 0.47/10.98 1.40/7.20 0.45/9.77 1.46/7.33 0.46/10.38 1.43/7.26
CMT [29] 0.48/9.18 2.64/9.16 0.44/9.97 2.64/9.14 0.46/9.58 2.64/9.15
FoT [39] 0.51/8.44 2.28/9.69 0.48/9.13 2.71/10.59 0.50/8.79 2.50/10.14
IIVTv2 [43] 0.47/9.30 3.19/9.70 0.45/9.96 3.13/9.14 0.46/9.63 3.16/9.42
FSDT [12] 0.47/9.87 3.08/11.26 0.46/9.36 2.77/10.38 0.47/9.62 2.93/10.82
IVT [23] 0.47/9.87 2.76/10.44 0.44/10.69 2.86/10.20 0.46/10.28 2.81/10.32
CT [46] 0.43/11.76 3.12/10.23 0.43/11.04 3.34/10.45 0.43/11.40 3.23/10.34
FRT [2] 0.48/9.17 3.32/12.20 0.44/10.20 3.46/12.29 0.46/9.69 3.39/12.24
MIL [3] 0.40/12.03 2.27/8.80 0.35/13.67 2.60/9.67 0.38/12.85 2.44/9.23
TABLE I: Tracking Results on the VOT2014 dataset. Accuracy scores and ranks (Acc. Sc. and Acc. Rk. for short) are reported as well as the Robustness ones. The first, second and third best values are highlighted by red, blue and green color, respectively.

VI-D Evaluations on the Deform-SOT Dataset

We evaluate the proposed algorithm against existing methods, including holistic model based trackers (i.e., IVT [23], L1T [26], TLD [20], MIL [3], Struck [15], MTT [47], CT [46], CN [10], STT [37] and STC [45]) and part based trackers (i.e., Frag [2], SPT [34], SCM [48], LOT [30], ASLA [19], LSL [42], LGT [7], DGT [6], and TCP [22]). For fair comparison, we use the same initial bounding box of each sequence for all trackers. The experimental results of the other trackers are reproduced from the available source codes with the recommended parameters.

As shown in Fig. 9, the evaluation results on OPE, SRE and TRE indicate that our GGT tracker performs favorably against the other compared methods. In addition, Fig. 11 shows the tracking results of the top trackers on several sequences.

Fig. 9: Precision plot and success plot over the Deform-SOT dataset using OPE, SRE and TRE. Best viewed in color.

Attribute-based Evaluation. We also compare the performance of all tracking algorithms on videos with varying degrees of challenging factors, shown in Fig. 10.

Fig. 10: The plots of OPE with different attributes. Best viewed in color. For better clarity, the top trackers are shown.

VI-D1 Large Deformation

Existing part based trackers [34, 7, 42, 6] mainly consider vulnerable pairwise geometric relations and are thus prone to failure on sequences with significant target deformation (e.g., boarding in Fig. 11). According to Fig. 10(a)(g), our tracker performs favorably against other methods because high-order triangle geometric relations, instead of varying pairwise displacements, preserve invariant angles and thereby remove noise from a large set of correspondence hypotheses.

VI-D2 Severe Occlusion

Some trackers [26, 20, 34, 48, 19, 42] drift away from the target or do not scale well when the target is heavily occluded (e.g., boarding, carscale, run and waterski in Fig. 11). In contrast, our method tracks the target relatively accurately because the structural correspondence modes exploit the invariant local geometric structure of target parts. This information limits the influence of occlusion as long as enough structural correspondence modes are detected to parse target parts and vote for the target state.
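The voting step can be sketched as a confidence-weighted average, under the assumption that each reliable part remembers an offset to the target center (the representation below is illustrative; the paper's exact voting scheme is not reproduced here):

```python
def vote_target_center(parts, weights):
    """Confidence-weighted vote for the target center.

    Each part is (x, y, dx, dy): its position in the current frame plus
    its remembered offset to the target center. Occluded parts simply
    contribute no vote, so the estimate survives partial occlusion as
    long as enough reliable parts remain.
    """
    total = sum(weights)
    cx = sum(w * (x + dx) for (x, y, dx, dy), w in zip(parts, weights)) / total
    cy = sum(w * (y + dy) for (x, y, dx, dy), w in zip(parts, weights)) / total
    return cx, cy
```

Because the estimate pools many independent votes, a few occluded or mismatched parts shift the center only slightly.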

VI-D3 Abnormal Movement

Abnormal movements include all kinds of non-rigid change, such as abrupt motion, pose variation, and rotation. For example, SCM [48] and TCP [22] drift away when the gymnast jumps to grab the bars in uneven-bars . By comparison, our method estimates both scale and position well on these challenging sequences, which can be attributed to two reasons. First, the hypergraph is constructed with coarse target parts to remove unnecessary background parts (see an example in Fig. 2(a)). Second, based on the modes, the reliable target parts are determined despite noise and vote for the optimal target state.

VI-D4 Illumination Variation

Some trackers [19, 10] are insensitive to appearance changes caused by illumination variation; however, compared to our method, they perform poorly on sequences that simultaneously undergo other challenges such as large deformation and abnormal movement (see bike in Fig. 11). Our advantage can be attributed to the use of geometric hypergraph learning to adapt to the local parts' appearance variation in consecutive frames.

VI-D5 Scale Change

On sequences with significant target scale change (e.g., boarding and carscale in Fig. 11), our tracker performs favorably against other methods [19, 6, 22], as shown in Fig. 10(e)(k). This is because we employ the angles of the triangle to measure the similarity of several correspondence hypotheses, which are invariant to scale change (see more in Section IV-B). Different from our algorithm, DGT [6] considers only neighboring pairwise relations between local parts, making it less flexible in handling changes of the target scale.
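The scale-invariance claim is easy to check numerically: for a triangle of part centers (hypothetical coordinates below), uniform scaling changes every pairwise distance but leaves every interior angle intact.

```python
import math

def angle_at(o, p, q):
    """Interior angle at vertex o of triangle (o, p, q), in radians."""
    v1 = (p[0] - o[0], p[1] - o[1])
    v2 = (q[0] - o[0], q[1] - o[1])
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding

tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
doubled = [(2.0 * x, 2.0 * y) for x, y in tri]

# Pairwise distances change under scaling...
assert math.dist(doubled[0], doubled[1]) == 2.0 * math.dist(tri[0], tri[1])
# ...but the interior angles do not.
assert abs(angle_at(*tri) - angle_at(*doubled)) < 1e-9
```

This is precisely why a pairwise-distance affinity degrades under scale change while an angle-based one does not.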

VI-D6 Background Clutter

When the background surrounding the target has a similar appearance, trackers tend to drift from the intended target to other nearby objects (e.g., football in Fig. 11). To handle this problem, some methods [37, 45] exploit the context information around the target, while others [7, 6] employ a graph based representation to capture the geometric structure of the target. Owing to the proposed confidence-aware sampling method without a distance constraint, the sampled representative hyperedges not only consider the relations between target parts and background parts (context), but also simultaneously model the inlier geometric relations among local target parts (structure). As a whole, our method ranks first in success score in Fig. 10(f) and second in precision score in Fig. 10(i).

Fig. 11: Tracking results of the top trackers, denoted in different colors and line styles, on the Deform-SOT dataset (from left to right and top to bottom: bike, boarding, bolt, carscale, football, run, uneven-bars, and waterski, respectively). Results are best viewed by zooming the digital edition of the figure.

VII Conclusion and Future Work

In this paper, we describe the Geometric hyperGraph Tracker (GGT) based on geometric hypergraph learning for visual tracking, where -order geometric relations among correspondence hypotheses are integrated in a dynamically constructed geometric hypergraph. Our method is general in that traditional graph-based tracking methods can be viewed as special cases of the proposed algorithm with a lower-order hypergraph. In addition, a confidence-aware sampling method is developed to reduce the computational complexity and the scale of the hypergraph for better efficiency. Experiments are carried out on the VOT2014 and Deform-SOT datasets, which include large deformation and severe occlusion challenges, and demonstrate the favorable performance of the proposed method compared to other existing methods.

Some aspects of our method can be further improved in future work. To characterize more kinds of graph-based trackers, we could exploit high-order temporal and spatial relations among a large number of correspondence hypotheses in multiple consecutive frames simultaneously. However, considering more spatio-temporal relations in the model implies higher computational complexity. Therefore, one future direction is to introduce a more effective mechanism for selecting vertices and hyperedges to reduce redundant hypotheses. Another possible direction is to learn a holistic target representation that is updated jointly with the simple superpixel representation for more robustness and discriminability.

References

  • [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
  • [2] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragments-based tracking using the integral histogram. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 798–805, 2006.
  • [3] B. Babenko, M.-H. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1619–1632, 2011.
  • [4] W. Bouachir and G. Bilodeau. Collaborative part-based tracking using salient local predictors. Computer Vision and Image Understanding, 137:88–101, 2015.
  • [5] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.
  • [6] Z. Cai, L. Wen, Z. Lei, N. Vasconcelos, and S. Z. Li. Robust deformable and occluded object tracking with dynamic graph. IEEE Transactions on Image Processing, 23(12):5497–5509, 2014.
  • [7] L. Cehovin, M. Kristan, and A. Leonardis. Robust visual tracking using an adaptive coupled-layer visual model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):941–953, 2013.
  • [8] L. Cehovin, M. Kristan, and A. Leonardis. Is my new tracker really better than yours? In IEEE Winter Conference on Applications of Computer Vision, pages 540–547, 2014.
  • [9] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In Proceedings of British Machine Vision Conference, 2014.
  • [10] M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. Van de Weijer. Adaptive color attributes for real-time visual tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [11] S. Duffner and C. Garcia. Pixeltrack: A fast adaptive algorithm for tracking non-rigid objects. In Proceedings of the IEEE International Conference on Computer Vision, pages 2480–2487, 2013.
  • [12] M. Kristan et al. The visual object tracking VOT2014 challenge results. In Workshops in Conjunction with European Conference on Computer Vision, pages 191–217, 2014.
  • [13] M. Godec, P. M. Roth, and H. Bischof. Hough-based tracking of non-rigid objects. In Proceedings of the IEEE International Conference on Computer Vision, pages 81–88, 2011.
  • [14] Y. Guo, Y. Chen, F. Tang, A. Li, W. Luo, and M. Liu. Object tracking using learned feature manifolds. Computer Vision and Image Understanding, 118:128–139, 2014.
  • [15] S. Hare, A. Saffari, and P. H. Torr. Struck: Structured output tracking with kernels. In Proceedings of the IEEE International Conference on Computer Vision, pages 263–270, 2011.
  • [16] S. Hare, A. Saffari, and P. H. S. Torr. Efficient online structured output learning for keypoint-based object tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1894–1901, 2012.
  • [17] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
  • [18] Z. Hong, C. Wang, X. Mei, D. Prokhorov, and D. Tao. Tracking using multilevel quantizations. In European Conference on Computer Vision, volume 8694, pages 155–171, 2014.
  • [19] X. Jia, H. Lu, and M. Yang. Visual tracking via adaptive structural local sparse appearance model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1822–1829, 2012.
  • [20] Z. Kalal, J. Matas, and K. Mikolajczyk. P-N learning: Bootstrapping binary classifiers by structural constraints. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 49–56, 2010.
  • [21] J. Lee, M. Cho, and K. M. Lee. Hyper-graph matching via reweighted random walks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1633–1640, 2011.
  • [22] W. Li, L. Wen, M. C. Chuah, Y. Zhang, Z. Lei, and S. Z. Li. Online visual tracking using temporally coherent part cluster. In IEEE Winter Conference on Applications of Computer Vision, pages 9–16, 2015.
  • [23] J. Lim, D. A. Ross, R.-S. Lin, and M.-H. Yang. Incremental learning for visual tracking. In Advances in Neural Information Processing Systems, pages 793–800, 2004.
  • [24] H. Liu, X. Yang, L. J. Latecki, and S. Yan. Dense neighborhoods on affinity graph. International Journal of Computer Vision, 98(1):65–82, 2012.
  • [25] T. Liu, G. Wang, and Q. Yang. Real-time part-based visual tracking via adaptive correlation filters. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 4902–4912, 2015.
  • [26] X. Mei and H. Ling. Robust visual tracking using ℓ1 minimization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1436–1443, 2009.
  • [27] H. Nam, S. Hong, and B. Han. Online graph-based tracking. In European Conference on Computer Vision, pages 112–126, 2014.
  • [28] G. Nebehay and R. Pflugfelder. Clustering of static-adaptive correspondences for deformable object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2784–2791, 2015.
  • [29] G. Nebehay and R. P. Pflugfelder. Consensus-based matching and tracking of keypoints for object tracking. In Winter Conference on Applications of Computer Vision, pages 862–869, 2014.
  • [30] S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan. Locally orderless tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1940–1947, 2012.
  • [31] X. Ren and J. Malik. Tracking as repeated figure/ground segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
  • [32] J. Wang and Y. Yagi. Many-to-many superpixel matching for robust tracking. IEEE Transactions on Cybernetics, 44(7):1237–1248, 2014.
  • [33] Q. Wang, F. Chen, and W. Xu. Tracking by third-order tensor representation. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 41(2):385–396, 2011.
  • [34] S. Wang, H. Lu, F. Yang, and M.-H. Yang. Superpixel tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 1323–1330, 2011.
  • [35] W. Wang and R. Nevatia. Robust object tracking using constellation model with superpixel. In Asian Conference on Computer Vision, pages 191–204, 2012.
  • [36] L. Wen, Z. Cai, D. Du, Z. Lei, and S. Z. Li. Learning discriminative hidden structural parts for visual tracking. In Workshops in Conjunction with Asian Conference on Computer Vision, pages 262–276, 2014.
  • [37] L. Wen, Z. Cai, Z. Lei, D. Yi, and S. Z. Li. Online spatio-temporal structural context learning for visual tracking. In European Conference on Computer Vision, pages 716–729, 2012.
  • [38] L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang. JOTS: Joint online tracking and segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2226–2234, 2015.
  • [39] A. Wendel, S. Sternig, and M. Godec. Robustifying the flock of trackers. In Computer Vision Winter Workshop, page 91. Citeseer, 2011.
  • [40] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2411–2418, 2013.
  • [41] F. Yang, H. Lu, and M. Yang. Learning structured visual dictionary for object tracking. Image Vision Computing, 31(12):992–999, 2013.
  • [42] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel. Part-based visual tracking with online latent structural learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2363–2370, 2013.
  • [43] K. M. Yi, H. Jeong, B. Heo, H. J. Chang, and J. Y. Choi. Initialization-insensitive visual tracking through voting with salient local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 2912–2919, 2013.
  • [44] X. Yu, J. Yang, T. Wang, and T. S. Huang. Key point detection by max pooling for tracking. IEEE Transactions on Cybernetics, 45(3):444–452, 2015.
  • [45] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M. Yang. Fast visual tracking via dense spatio-temporal context learning. In European Conference on Computer Vision, pages 127–141, 2014.
  • [46] K. Zhang, L. Zhang, and M.-H. Yang. Real-time compressive tracking. In European Conference on Computer Vision, pages 864–877. Springer, 2012.
  • [47] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja. Robust visual tracking via multi-task sparse learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2042–2049, 2012.
  • [48] W. Zhong, H. Lu, and M. Yang. Robust object tracking via sparsity-based collaborative model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1838–1845, 2012.