Introduction
Multi-object tracking (MOT) is an important problem in computer vision with many applications, such as surveillance, behavior analysis, and sports video analysis. Although the performance of MOT has improved significantly in recent years [Choi2015, Kim et al.2015, Wen et al.2016, Tang et al.2017], it remains a challenging problem due to factors such as missed detections, false detections, and identity switches.
An automatic MOT system usually employs a pre-trained object detector to locate candidate object regions in each frame, and then matches the detections across frames to form target trajectories. Most existing methods only consider the pairwise dependencies of detections (e.g., [Rezatofighi et al.2015, Dehghan, Assari, and Shah2015, Milan, Schindler, and Roth2016, Fagot-Bouquet et al.2016]), and do not take full advantage of the high-order dependencies among multiple targets across frames. This strategy is less effective when nearby objects with similar appearance or motion patterns occlude each other in the video. Several recent methods [Kim et al.2015, Collins2012, Shi et al.2014, Wen et al.2014, Wen et al.2016] attempt to use high-order information to improve the tracking performance, such as dense structure search on a hypergraph [Wen et al.2014, Wen et al.2016], tensor power iterations [Shi et al.2014], high-order motion constraints [Collins2012, Butt and Collins2013], and multiple hypothesis tracking [Kim et al.2015]. However, these methods exploit only fixed degrees of dependencies among objects, which limits the flexibility of the hypergraph model (a hypergraph is a generalization of a conventional graph where an edge can join more than two nodes) in complex environments, and calls for adaptive dependency patterns. As shown in Figure 1, a uniform hypergraph with a single fixed degree is unable to describe the dependencies between the tracklets of the illustrated targets correctly. In contrast, a non-uniform hypergraph adapts better to different degrees of dependencies among tracklets, and achieves more reliable performance.
In this paper, we describe a new non-uniform hypergraph learning based tracker (NT), which has much stronger descriptive power to accommodate different tracking scenarios than the conventional graph [Dehghan, Assari, and Shah2015] or uniform hypergraph [Wen et al.2014, Wen et al.2016]. The nodes in the hypergraph correspond to the tracklets (the term "tracklet" indicates a fragment of a target trajectory; notably, the input detection responses in each frame can be treated as tracklets of length one), and the hyperedges with different degrees encode similarities among tracklets to assemble various kinds of appearance and motion patterns. The tracking problem is formulated as searching dense structures on the non-uniform hypergraph. Different from previous methods [Wen et al.2014, Wen et al.2016], we do not fix the degree of the hypergraph model, but mix hyperedges of different degrees and learn their relative weights automatically from the data using the structural support vector machine (SSVM) method [Joachims, Finley, and Yu2009]. We propose an efficient approximation algorithm that searches for these dense structures and uses them to form long object trajectories.
In addition, to achieve both accuracy and efficiency, we use a near-online strategy for MOT, i.e., we perform the dense structure searching on the non-uniform hypergraph to generate short tracklets in a temporal window, and then associate those short tracklets to the tracked targets to get the final trajectories of targets at the current time stamp. This process is carried out repeatedly to complete the tracking task in a video.
The main contributions are summarized as follows. (1) We propose a non-uniform hypergraph learning based near-online MOT method, which assembles hyperedges with different degrees to encode various types of dependencies among objects. (2) The weights of hyperedges with different degrees in the non-uniform hypergraph are learned from data using the SSVM algorithm. (3) We propose an efficient approximation algorithm to solve the dense structure searching problem on the non-uniform hypergraph.
Related Work
MOT methods can be roughly classified into three categories: 1) online, 2) offline, and 3) near-online strategies. When a tracking error occurs, it is hard for online methods (e.g., [Yang et al.2014, Xiang, Alahi, and Savarese2015, Yoon et al.2016]) to recover from it because of imprecise appearance or motion measurements. Thus, many algorithms adopt an offline strategy (e.g., [Berclaz et al.2011, Tang et al.2017, Milan, Schindler, and Roth2016]). To make the association step efficient, [Berclaz et al.2011] formulate the association as a constrained flow optimization problem, solved by the k-shortest paths algorithm. [Tang et al.2017] present a graph-based formulation that links and clusters person hypotheses over time by solving an instance of a minimum cost lifted multicut problem. In addition, [Milan, Schindler, and Roth2016] pose MOT as the minimization of a unified discrete-continuous energy function using the L-BFGS and QPBO algorithms. However, because only associations between pairs of detections in a local temporal domain are considered, these methods do not perform well when multiple similar objects appear in proximity against cluttered backgrounds.
To alleviate this problem, [Dehghan, Assari, and Shah2015] use a graph to integrate all the relations among objects in a batch of frames and formulate the MOT problem as a Generalized Maximum Multi Clique problem on the graph. [Wen et al.2014] exploit motion information to help tracking and formulate MOT as dense structure searching on a uniform hypergraph, in which the nodes correspond to tracklets and the edges encode the high-order dependencies among tracklets. To further improve the efficiency, an approximate RANSAC-style approach is proposed in [Wen et al.2016] to complete the dense structure searching.
Besides, [Choi2015] designs a near-online strategy that inherits the advantages of both online and offline approaches. The tracking problem is formulated as a data association between targets and detections in a temporal window, which is performed repeatedly at every frame. In this way, the algorithm is able to fix any association error made in the past when more detections are provided. [Wang and Fowlkes2015] present an end-to-end framework to learn the parameters of min-cost flow for MOT using a tracking-specific loss function in the SSVM framework. In contrast, our approach uses the non-uniform hypergraph to describe the high-order dependencies among tracklets, and uses the SSVM framework to learn the weights of the hyperedges with different degrees.
Non-uniform Hypergraph
Definition. A hypergraph is a generalization of a conventional graph, where an edge can join more than two nodes. We use to denote a (weighted) hypergraph, where is the node set, is the -th node and is the total number of nodes, is the set of hyperedges, and is the affinity set corresponding to the edges/hyperedges. Specifically, we define , where is the set of self-loops, is the set of conventional graph edges, is the set of hyperedges with degree , , and is the maximal degree of hyperedges. If all hyperedges in have the same cardinality , is a -uniform hypergraph (i.e., for ); otherwise, is a non-uniform hypergraph. For node , we denote its neighborhood as , which is the set of nodes connected to .
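As an illustration of this definition, the following Python sketch stores hyperedges grouped by degree, so that self-loops, conventional edges, and higher-degree hyperedges live in one structure. The class name `Hypergraph` and its methods are hypothetical and are not taken from the authors' implementation; nodes are assumed to be integer indices.

```python
from collections import defaultdict


class Hypergraph:
    """Weighted hypergraph whose edges/hyperedges are grouped by degree (illustrative)."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        # edges_by_degree[d] maps a sorted tuple of d node indices to its affinity.
        self.edges_by_degree = defaultdict(dict)

    def add_edge(self, nodes, affinity):
        nodes = tuple(sorted(nodes))
        if len(set(nodes)) != len(nodes):
            raise ValueError("an edge/hyperedge must not repeat a node")
        if any(v < 0 or v >= self.num_nodes for v in nodes):
            raise ValueError("node index out of range")
        self.edges_by_degree[len(nodes)][nodes] = affinity

    def degrees(self):
        """Sorted list of degrees that have at least one edge/hyperedge."""
        return sorted(d for d, edges in self.edges_by_degree.items() if edges)

    def is_uniform(self):
        """True if all stored edges/hyperedges have the same cardinality."""
        return len(self.degrees()) <= 1

    def neighborhood(self, v):
        """Nodes that share at least one edge/hyperedge with node v (v excluded)."""
        neighbors = set()
        for edges in self.edges_by_degree.values():
            for nodes in edges:
                if v in nodes:
                    neighbors.update(nodes)
        neighbors.discard(v)
        return neighbors
```

For example, a graph holding one self-loop, one pairwise edge, and one degree-3 hyperedge reports three distinct degrees and is therefore non-uniform.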
Similar to [Wen et al.2016], we define a dense structure on as a sub-hypergraph that has the maximum affinities combining all hyperedges, edges and self-loops of nodes. We introduce an indicator variable , such that , and , where is the number of nodes in the dense structure. The affinity summation of the hyperedges, edges and self-loops of nodes of the dense structure can be calculated as
(1) |
where , is the indicator variable corresponding to node (), i.e., if node belongs to the dense structure; otherwise, . Thus, indicates the confidence of the hyperedge (), edge (), or self-loop () included in the dense structure. Weights are used to balance the significance of different degrees of hyperedges. (Notably, in this paper, we use the term "affinity" to indicate the value associated with each edge/hyperedge, which reflects the similarities of the nodes in the corresponding edge/hyperedge. Meanwhile, the term "weight" indicates the numbers used to balance the significance of different degrees of hyperedges, edges and self-loops in dense structure searching. The weights of -th hyperedges may consist of terms, e.g., the weights of the second degree hyperedges may consist of the appearance similarity and motion consistency between two tracklets. In such cases, the weight is a vector with the size , and the affinity is also a vector with the size .) The affinity summation from degree to in (1) describes the overall affinity score combining all the hyperedges, edges, and self-loops of the nodes in the dense structure. Thus, we need to maximize the overall affinity score to exploit the dense structures to complete multi-object tracking.
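A minimal sketch of how such an affinity summation could be evaluated, assuming that the score is the sum over degrees of the degree weight times each hyperedge affinity times the product of the indicator entries of its nodes. This form is inferred from the description above rather than copied from (1), scalar weights and affinities are assumed, and the function name and dictionary layout are illustrative only.

```python
from math import prod


def dense_structure_score(edges_by_degree, weights, y):
    """Weighted affinity summation over self-loops, edges, and hyperedges.

    edges_by_degree: {degree: {tuple_of_node_indices: affinity}}
    weights:         {degree: weight balancing that degree} (assumed scalar)
    y:               indicator values, one per node
    """
    total = 0.0
    for degree, edges in edges_by_degree.items():
        w = weights.get(degree, 0.0)
        for nodes, affinity in edges.items():
            # The product of indicators acts as the confidence that the
            # whole edge/hyperedge lies inside the dense structure.
            total += w * affinity * prod(y[v] for v in nodes)
    return total
```

Setting the weight of every degree except one to zero reduces this score to a uniform-hypergraph objective, which mirrors the special case discussed later in this section.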
MOT formulation. We use the non-uniform hypergraph to encode the relations among different tracklets. For each video clip, MOT is initialized by the tracklets. (Our definition of a tracklet covers both a single detection, i.e., , and a continuous sequence of detections, i.e., the frame index set corresponding to the detections on the tracklet, where is an integer, and , .) Let be the tracklet set in the video sequence, where is the -th tracklet. consists of frame detections, and , where and are center location and dimension of the detection, and is the corresponding frame index.
We formulate the MOT problem as searching dense structures on a non-uniform hypergraph . (Specifically, we only consider the edges/hyperedges with no duplicate nodes, i.e., each edge/hyperedge contains different nodes.) We set every node in as the starting point, and search the corresponding dense structure from its neighborhood. Specifically, for a starting point , we initialize the indicator variable , , where is the number of nodes in ’s neighborhood. For node , the dense structure searching problem is formulated as
(2) |
where is the neighborhood of node . Notably, the constraint indicates that the node is included in the searched dense structure, and indicates that the -th node in is included in the searched dense structure, otherwise, .
The problem in (2) is a combinatorial optimization problem, since we cannot know the number of nodes in the dense structure in advance. To reduce the complexity of this NP-hard problem, we relax the constraint to . In addition, we set a minimal size of the sub-hypergraph to be a constant number to avoid the degeneracy, i.e., . Thus, the constraint is converted to . We would like to highlight that the objective function for dense structure exploiting in [Wen et al.2016] is a specific case of (2), i.e., if we set for a specific , and make , , the non-uniform hypergraph will degenerate into a -uniform hypergraph, and the objective in (2) becomes similar to that in [Wen et al.2016]. However, the optimization algorithm in [Wen et al.2016] for the uniform hypergraph model cannot be directly applied to solve the problem in (2).
After exploiting the dense structures, the radical post-processing strategy presented in [Wen et al.2014] is adopted to remove the conflicts among the searched dense structures. Then, we stitch the tracklets in each post-processed dense structure to form long trajectories.
Enforcing edge/hyperedge constraints. In practical MOT scenarios, the objects obey two physical constraints: 1) one object cannot occupy two different places at the same time; 2) the velocity of an object is below a certain maximum possible velocity. As such, in constructing the hypergraph, two nodes connected by one edge/hyperedge should not overlap in time, and the distance between the last detection of the earlier tracklet and the first detection of the later tracklet should not be larger than the maximal distance that the object can reach with the maximal possible velocity. These two constraints reduce the number of edges and hyperedges and hence the computational complexity.
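The two constraints can be sketched as a simple gating test between a pair of tracklets. This is an illustrative sketch only: the `Detection` type, the function name `can_link`, and the choice of pixels per frame as the unit of `max_speed` are assumptions, not details given in the paper.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Detection:
    frame: int
    x: float  # center x
    y: float  # center y


def can_link(track_a, track_b, max_speed):
    """Return True if two tracklets may share an edge/hyperedge.

    Each tracklet is a list of Detection sorted by frame.
    max_speed is a hypothetical bound in pixels per frame.
    """
    # Order the pair so that `first` starts no later than `second`.
    if track_a[0].frame <= track_b[0].frame:
        first, second = track_a, track_b
    else:
        first, second = track_b, track_a
    gap = second[0].frame - first[-1].frame
    if gap <= 0:
        # Constraint 1: the tracklets overlap in time.
        return False
    # Constraint 2: the jump between the tracklets must be reachable.
    dist = hypot(second[0].x - first[-1].x, second[0].y - first[-1].y)
    return dist <= max_speed * gap
```

Applying such a test before creating edges and hyperedges prunes implausible combinations early, which is where the reduction in complexity comes from.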
Calculating self-loop affinity. We associate a node with a score to reflect its reliability as a true tracklet of an object, i.e., , where () is the confidence score of the tracklet calculated by averaging the scores of all detections in the tracklet.
Calculating edge affinity. The edges in the hypergraph encode the similarities between two nodes (tracklets), and consist of three terms: HSV histogram similarity , CNN feature similarity , and local motion similarity , i.e., .
Specifically, the HSV histogram similarity is calculated as , where is the cosine similarity between the HSV histograms of the detections in the last frame of (i.e., ) and the first frame of (i.e., ). Moreover, the CNN feature similarity is calculated as , where and are the CNN features of the detections in the last frame of and the first frame of .
Finally, the similarity between two bounding boxes based on the generalized KLT tracker [Zhou, Tang, and Wang2013] is calculated as , where and are the areas of the detections in the last frame of and the first frame of , and is the number of point trajectories generated by KLT tracker across the bounding boxes of both the first frame of and first frame of .
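The appearance term relies on a cosine similarity between two histograms. A minimal sketch is given below, assuming that the histograms are already available as equal-length lists of non-negative bin counts; the function name is illustrative, and the extraction of HSV histograms from image patches is outside its scope.

```python
from math import sqrt


def cosine_similarity(h1, h2):
    """Cosine similarity between two equal-length feature vectors."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = sqrt(sum(a * a for a in h1)) * sqrt(sum(b * b for b in h2))
    # An all-zero histogram carries no appearance evidence; return 0 by convention.
    return dot / norm if norm > 0 else 0.0
```

The same function applies unchanged to CNN feature vectors if a cosine measure is used for them as well, which is an assumption here rather than something the text states.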
Calculating hyperedge affinity. We count the number of local point trajectories passing through the regions of to calculate the affinities of hyperedges, which encodes the motion consistency of tracklets . Thus, for the -th hyperedge with degree , the affinity is calculated as , where measures the number of local point trajectories crossing all regions of , is the length of tracklet , is the -th detection on , and is the area of the detection .
Near-online tracking. It is difficult to handle all detections in a long video sequence at once, since constructing non-uniform hypergraphs and performing dense structure search on all detections requires large memory and computational resources. In order to achieve both accuracy and efficiency, inspired by [Choi2015], we use a near-online strategy for MOT. Specifically, after getting video frames at time , we construct a non-uniform hypergraph to describe the hybrid orders of dependencies among detections and search the dense structures on the hypergraph to generate short tracklets in the temporal window . Then, we construct a conventional graph (the conventional graph is a special case of the non-uniform hypergraph, which only includes the conventional edges in the graph, i.e., ) to describe the associations between the tracked targets and the short tracklets within . After that, we perform the dense structure searching on the conventional graph to associate the short tracklets and the tracked targets to get the final trajectories at the current time stamp. This process is carried out repeatedly every frames to complete the tracking task in the whole video.
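The scheduling of the temporal windows can be sketched as follows. The window length and the step between two runs are left as parameters because their values are not recoverable from the text; the function name and the half-open frame ranges are illustrative assumptions.

```python
def temporal_windows(num_frames, window, step):
    """Yield (start, end) frame ranges, end exclusive, one per tracking run.

    A run is triggered every `step` frames and looks back over at most
    `window` frames; the last run is clipped to the end of the video.
    """
    for t in range(step, num_frames + step, step):
        end = min(t, num_frames)
        start = max(0, end - window)
        yield (start, end)


def run_near_online(num_frames, window, step, process_window):
    """Call `process_window(start, end)` for each window and collect results.

    `process_window` is a hypothetical callback that would build the
    hypergraph for the window, search dense structures, and associate the
    resulting short tracklets with the tracked targets.
    """
    return [process_window(s, e) for s, e in temporal_windows(num_frames, window, step)]
```

With a window longer than the step, consecutive windows overlap, which is what allows earlier association errors to be revisited when more detections arrive.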
Inference
For efficiency, we use the simple pairwise update algorithm [Liu et al.2012] to solve the dense structure searching problem on hypergraph corresponding to node in (2). We first form the Lagrangian of the problem as
(3) |
where , , and are Lagrangian multipliers with , , and , . Any local maximizer of the objective function must satisfy the Karush-Kuhn-Tucker (KKT) conditions [Kuhn and Tucker1951], i.e.,
(4) |
We define as reward at node , which is calculated as
Since , , , , we have that if , then . Meanwhile, since , , and , we have that if , then . In this way, for node , the KKT conditions can be further rewritten as
(5) |
Based on and , we can partition the solution space into three disjoint subsets, , , and . Thus, similar to Theorem 1 in [Liu et al.2012], we find that there exists an appropriate , such that (1) the rewards at all nodes in are no larger than ; (2) the rewards at all nodes in are equal to ; and (3) the rewards at all nodes in are larger than .
A simple pairwise updating method is used to optimize (2). That is, we can increase one component and decrease another one appropriately, to increase the objective . To be specific, we first introduce another variable that is defined as: , for and ; , for ; and , for , where is the updated indicator variable in optimization process. Then, the change of objective after updating is
(6) |
where .
To maximize the objective difference , we select the updating step as follows (in general, we can assume ; when , we can exchange indexes and to maximize ; please see the supplementary material for more details):
(7) |
We use a heuristic strategy to compute a local maximizer of (2), i.e., we gradually select pairs of nodes to maximize the increase of by updating the indicator variable based on the updating step calculated by (7). Specifically, from (6) and (7), we find that (1) if , there exists such that the objective can be increased by updating based on (6); (2) when and , the objective can be increased by increasing either or , and decreasing the other one; (3) when and , the objective will not be affected by changing .
Thus, in each iteration, we can select node with the largest reward from set , i.e., , and node with the smallest reward from set , i.e., , satisfying , to increase by increasing and decreasing with an appropriate in (7). This process is iterated until the reward of equals . If cannot be increased according to (6), then is already a local maximizer. The overall procedure is summarized in Algorithm 1.
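To make the pairwise updating idea concrete, the sketch below applies it to a deliberately simplified case: only degree-2 edges, a symmetric affinity matrix `A` with zero diagonal, the objective y^T A y, and the constraints that the entries of y sum to one and lie in [0, eps]. This is not the update rule of (7), which covers all degrees; the function names and the tolerance are assumptions. For this simplified objective, moving mass alpha from node j to node i changes the objective by alpha (r_i - r_j) - 2 A_ij alpha^2, where r = 2 A y are the rewards.

```python
def rewards(A, y):
    """Partial derivatives of y^T A y for symmetric A: r = 2 A y."""
    n = len(y)
    return [2.0 * sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]


def objective(A, y):
    n = len(y)
    return sum(A[i][j] * y[i] * y[j] for i in range(n) for j in range(n))


def pairwise_update(A, y, eps, max_iter=1000, tol=1e-9):
    """Greedy pairwise updates for the degree-2 special case (illustrative)."""
    y = list(y)
    for _ in range(max_iter):
        r = rewards(A, y)
        can_grow = [k for k in range(len(y)) if y[k] < eps - tol]
        can_shrink = [k for k in range(len(y)) if y[k] > tol]
        if not can_grow or not can_shrink:
            break
        i = max(can_grow, key=lambda k: r[k])    # largest reward, may increase
        j = min(can_shrink, key=lambda k: r[k])  # smallest reward, may decrease
        if r[i] - r[j] <= tol:
            break  # no pair can increase the objective: local maximizer
        bound = min(eps - y[i], y[j])            # keeps 0 <= y <= eps
        curvature = 2.0 * A[i][j]
        if curvature <= 0:
            alpha = bound                        # change is linear in alpha
        else:
            alpha = min(bound, (r[i] - r[j]) / (2.0 * curvature))
        y[i] += alpha
        y[j] -= alpha
    return y
```

Each accepted step strictly increases the simplified objective while preserving the sum and box constraints, which mirrors the stopping condition described above.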
Learning
Instead of selecting the weights in (1) empirically, we use a structured SVM [Joachims, Finley, and Yu2009] to learn automatically from the training data. Specifically, given a set of ground-truth bounding boxes of objects in the -th training video (, where is the total number of training videos), we aim to recover the trajectories of objects, which is equivalent to clustering the input bounding boxes into several groups. That is, we obtain the indicator variables of the clusters , where () is the indicator variable of the -th target, and is the total number of targets in the video. The bounding boxes in each group belong to the same target.
The function defined in (1) can be rewritten as a linear function of , i.e., , where
We aim to find the optimal weights by maximizing the objective function with the same input object detections. Then, the objective using an SSVM with margin rescaling is formulated as
(8) |
Intuitively, this formulation requires that the score of any ground-truth annotated video must be larger than the score of any other result by the loss minus the slack variable . The constant adjusts the importance of minimizing the slack variables. The loss function measures how incorrect is according to the weighted Hamming loss in [Wang and Fowlkes2015]. Meanwhile, the SSVM formulation in (8) has an exponential number of constraints for each training sequence. We use a cutting-plane algorithm [Joachims, Finley, and Yu2009] to solve this problem, which has time complexity linear in the number of training examples.
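The margin-rescaling constraint can be illustrated with a small sketch that computes the slack a given weight vector needs on one training example. It assumes a linear score (weights dotted with a feature vector) and a plain per-element weighted Hamming loss; the exact loss of [Wang and Fowlkes2015] and the feature map of (1) are not reproduced, and all names below are hypothetical.

```python
def score(w, phi):
    """Linear score of a labeling with feature vector phi."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def hamming_loss(y_true, y_pred, element_weights=None):
    """Weighted count of positions where the two labelings disagree."""
    if element_weights is None:
        element_weights = [1.0] * len(y_true)
    return sum(wt for wt, a, b in zip(element_weights, y_true, y_pred) if a != b)


def required_slack(w, y_true, phi_true, candidates):
    """Smallest slack satisfying the margin-rescaling constraints.

    candidates: list of (y_pred, phi_pred) pairs standing in for the
    exponentially many alternative labelings.
    """
    worst = 0.0
    for y_pred, phi_pred in candidates:
        violation = hamming_loss(y_true, y_pred) + score(w, phi_pred) - score(w, phi_true)
        worst = max(worst, violation)
    return worst
```

A cutting-plane solver repeatedly searches for the candidate with the largest violation and adds only that constraint, which avoids enumerating all labelings.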
Experiments
We conduct experiments on several popular MOT evaluation datasets, i.e., the multi-pedestrian tracking [Wen et al.2016] (including the PETS09 and ParkingLot sequences), MOT2016 [Milan et al.2016], and multi-face tracking [Wen et al.2016] datasets, to evaluate the performance of the proposed MOT method (denoted as NT subsequently; the source code is available at https://github.com/longyin880815). We use the MOT2016-train set to train the set-to-set recognition model [Liu, Yan, and Ouyang2017] to calculate the CNN feature similarity, and the multi-pedestrian tracking dataset to analyze the influence of the degree of the hypergraph on tracking performance. In addition, we conduct an ablation study to demonstrate the effectiveness of the non-uniform hypergraph and SSVM learning.
Evaluation Metrics. Following previous MOT methods, we use the widely adopted multi-object tracking accuracy (MOTA) metric [Bernardin and Stiefelhagen2008] to compare the performance of the trackers. MOTA is a cumulative measure combining false negatives (FN), false positives (FP), and identity switches (IDS). We report mostly tracked (MT), mostly lost (ML), FP, FN, IDS, and the fragmentation of the tracked objects (FM) to measure a tracker comprehensively. In addition, for the multi-pedestrian and multi-face tracking datasets [Wen et al.2016], we also report the multi-object tracking precision (MOTP) score, which computes the total error of tracked positions compared with the manually annotated ground-truth, normalized by the hit/miss threshold value. For MOT2016, following its evaluation protocol, we use the ID F1 score (IDF1) [Ristani et al.2016] instead of MOTP, which is the ratio of correctly identified detections over the average number of ground-truth and computed detections.
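For reference, the two summary scores can be computed from the error counts as sketched below, using the standard definitions MOTA = 1 - (FN + FP + IDS) / GT and IDF1 = 2 IDTP / (2 IDTP + IDFP + IDFN). The function and argument names are illustrative, and `num_gt` denotes the total number of ground-truth detections over all frames.

```python
def mota(fn, fp, ids, num_gt):
    """Multi-object tracking accuracy from cumulative error counts."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + ids) / num_gt


def idf1(idtp, idfp, idfn):
    """ID F1 score: correctly identified detections over the average
    number of ground-truth (idtp + idfn) and computed (idtp + idfp) detections."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom > 0 else 0.0
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth detections.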
Parameters. We conduct an experiment to select the maximal degree of the hypergraph . We vary the maximal degree from 2 to 5 while keeping other parameters fixed, and denote the resulting models as NT_d(2), ..., NT_d(5). For each maximal degree, we use the sequences in the training set of MOT2016 to learn the weights of different degrees of hyperedges using SSVM, and use the sequences in the multi-pedestrian tracking dataset for testing. The uniform average performance of the trackers on the multi-pedestrian tracking dataset is presented in Table 1. Specifically, we divide each sequence in the MOT2016 train-set into non-overlapping sequences of frames. Then, we take the detections that have more than overlap with the ground-truth as true detections to collect training samples for the weights learning.
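The preparation of training samples can be sketched as follows, assuming that "overlap" means intersection over union (IoU) between boxes given as (x, y, width, height). The chunk length and the overlap threshold are parameters here because their values are not recoverable from the text; all function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def true_detections(detections, ground_truth, threshold):
    """Keep detections overlapping some ground-truth box by more than threshold."""
    return [d for d in detections if any(iou(d, g) > threshold for g in ground_truth)]


def split_sequence(num_frames, chunk):
    """Split frame indices into non-overlapping (start, end) ranges, end exclusive."""
    return [(s, min(s + chunk, num_frames)) for s in range(0, num_frames, chunk)]
```

In this sketch the filter is applied frame by frame, so `detections` and `ground_truth` are the boxes of a single frame.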
As shown in Table 1, NT achieves the best MOTA with the maximal degree set to 4, together with lower IDS and FM scores than the variants with smaller maximal degrees. We notice that the performance of NT decreases when the maximal degree is increased to 5. This may be because a hypergraph with an excessively high degree fails to describe the motion patterns of objects well, particularly for objects moving fast with drastic changes of direction. Thus, we set the maximal degree to 4 in our experiments, and the learned weights of different degrees of hyperedges are , , , and . The batch size in near-online tracking is set to . The minimal size of the sub-hypergraph is set as . We fix all parameters to these values in the experiments.
Variants | Weights | MOTA | MOTP | IDS | FM |
---|---|---|---|---|---|
NT_d(2) | learned | 67.5 | 62.4 | 103.7 | 92.2 |
NT_d(3) | learned | 68.8 | 64.5 | 83.8 | 76.2 |
NT_d(4) | learned | 68.9 | 65.0 | 68.3 | 68.8 |
NT_d(5) | learned | 68.5 | 64.7 | 61.5 | 63.7 |
NT_r(4) | learned, , | 68.4 | 63.5 | 72.7 | 74.2 |
NT_r(5) | learned, , | 67.6 | 63.5 | 64.3 | 66.0 |
NT_e(2) | , | 67.1 | 62.6 | 103.7 | 87.0 |
NT_e(3) | , | 67.5 | 63.7 | 103.3 | 87.5 |
NT_e(4) | , | 67.4 | 63.7 | 104.0 | 86.7 |
NT_e(5) | , | 67.1 | 64.6 | 93.2 | 81.7 |
Ablation Study. To demonstrate the contribution of the non-uniform hypergraph, we construct two variants of the proposed NT tracker by removing the hyperedges with certain degrees, i.e., NT_r(4) and NT_r(5), and evaluate them on the multi-pedestrian tracking dataset [Wen et al.2016]; the results are shown in Table 1. Removing these hyperedges negatively affects the performance (MOTA drops by 0.5 and 0.9 points relative to NT_d(4) and NT_d(5), respectively), which shows that exploiting different degrees of dependencies among objects is important for MOT performance.
Besides, to demonstrate the contribution of SSVM, Table 1 also presents the performance of non-uniform hypergraph based trackers with equal weights for different degrees of hyperedges on the multi-pedestrian tracking dataset, denoted as NT_e(2), ..., NT_e(5). The NT_d variants achieve consistently higher MOTA than the NT_e variants with the same maximal degree, e.g., NT_d(2) vs. NT_e(2), and NT_d(5) vs. NT_e(5). The results show that using SSVM to learn the weights of hyperedges of different degrees can improve the performance.
Multi-Pedestrian Tracking. We perform experiments for the multi-pedestrian tracking on five sequences from the PETS09 dataset [Ellis and Ferryman2010]: S2L1 ( frames), S2L2 ( frames), S2L3 ( frames), S1L1-1 ( frames), and S1L1-2 ( frames), and ParkingLot sequence from [Zamir, Dehghan, and Shah2012] ( frames). These sequences are captured in the crowded surveillance scenes with frequent occlusions, abrupt motion, illumination changes, etc. Following [Wen et al.2016, Andriyenko, Schindler, and Roth2012], we report the uniform average scores on different metrics over sequences of the proposed NT algorithm, as well as five state-of-the-art trackers, i.e., KSP [Berclaz et al.2011], DPMF [Pirsiavash, Ramanan, and Fowlkes2011], CEM [Andriyenko and Schindler2011], DCT [Andriyenko, Schindler, and Roth2012] and FH^{2}T [Wen et al.2016], in Table 2. The tracking results of previous methods are taken from [Wen et al.2016]. For fair and comprehensive comparisons, we use the same frame detections, ground-truth annotations as well as the evaluation protocol provided by the authors of [Wen et al.2016]. We train the set-to-set recognition method [Liu, Yan, and Ouyang2017] based on the pre-trained GoogLeNet [Szegedy et al.2015] on the training set of MOT2016 to extract the CNN features of the detections.
Method | MOTA | MOTP | MT[%] | ML[%] | FP | FN | IDS | FM |
---|---|---|---|---|---|---|---|---|
KSP | 45.5 | 67.1 | 33.4 | 35.6 | 107.8 | 2223.2 | 42.2 | 49.8 |
DPMF | 51.6 | 70.0 | 21.5 | 27.0 | 68.8 | 1897.0 | 61.8 | 80.7 |
CEM | 55.7 | 66.6 | 30.1 | 21.7 | 127.3 | 1652.8 | 63.7 | 56.7 |
DCT | 58.1 | 67.6 | 43.1 | 21.3 | 119.5 | 1610.2 | 64.2 | 53.2 |
FH^{2}T | 66.2 | 64.9 | 54.3 | 14.7 | 194.5 | 1150.8 | 45.2 | 73.7 |
NT | 68.9 | 65.0 | 58.2 | 9.6 | 252.7 | 974.3 | 68.3 | 68.8 |
As shown in Table 2, our NT tracker performs better than the state-of-the-art methods on several important metrics (e.g., MOTA, MT, and ML). Specifically, NT improves the average MOTA and MT scores by 2.7 and 3.9 points, and reduces the average ML score by 5.1 points, against the second best tracker FH^{2}T [Wen et al.2016]. This may be attributed to our use of a non-uniform hypergraph instead of the uniform hypergraph in [Wen et al.2016], which helps especially when tracking in crowded scenes with different motions and frequent occlusions of objects. In addition, we notice that the FH^{2}T method [Wen et al.2016] performs better than methods that only consider the similarities between pairs of tracklets, e.g., DPMF [Pirsiavash, Ramanan, and Fowlkes2011] and DCT [Andriyenko, Schindler, and Roth2012] (FH^{2}T achieves 14.6 and 8.1 points higher average MOTA than DPMF and DCT, respectively), which indicates that exploiting the high-order similarities among multiple tracklets is crucial for MOT.
MOT2016 Benchmark. The MOT2016 benchmark [Milan et al.2016] is a collection of video sequences (/ for training and testing, respectively), with relatively high variation in object movements, camera motion, viewing angle and crowd density. The benchmark primarily focuses on pedestrian tracking. The ground-truths for the testing set are strictly invisible to all methods, i.e., all results on the testing set were submitted to the respective testing servers for evaluation. We use the training set to learn the parameters of the proposed algorithm, and submit our results on the testing set for evaluation, as shown in Table 3. For a fair comparison with the state-of-the-art MOT methods, we use the reference object detections provided by the benchmark [Milan et al.2016]. We train the set-to-set recognition method [Liu, Yan, and Ouyang2017] based on the pre-trained GoogLeNet [Szegedy et al.2015] on the training set of MOT2016 to extract the CNN features of the detections.
Method | MOTA | IDF1 | MT[%] | ML[%] | FP | FN | IDS | FM | Hz |
---|---|---|---|---|---|---|---|---|---|
online: | |||||||||
EAMTT | 38.8 | 42.4 | 7.9 | 49.1 | 8,114 | 102,452 | 965 | 1,657 | 11.8 |
DCCRF | 44.8 | 39.7 | 14.1 | 42.3 | 5,613 | 94,133 | 968 | 1,378 | 0.1 |
STAM | 46.0 | 50.0 | 14.6 | 43.6 | 6,895 | 91,117 | 473 | 1,422 | 0.2 |
AMIR | 47.2 | 46.3 | 14.0 | 41.6 | 2,681 | 92,856 | 774 | 1,675 | 1.0 |
offline: | |||||||||
Quad | 44.1 | 38.3 | 14.6 | 44.9 | 6,388 | 94,775 | 745 | 1,096 | 1.8 |
INT | 45.4 | 37.7 | 18.1 | 38.7 | 13,407 | 85,547 | 600 | 930 | 4.3 |
MHT | 45.8 | 46.1 | 16.2 | 43.2 | 6,412 | 91,758 | 590 | 781 | 0.8 |
NLPa | 47.6 | 47.3 | 17.0 | 40.4 | 5,844 | 89,093 | 629 | 768 | 8.3 |
FWT | 47.8 | 44.3 | 19.1 | 38.2 | 8,886 | 85,487 | 852 | 1,534 | 0.6 |
LMP | 48.8 | 51.3 | 18.2 | 40.1 | 6,654 | 86,245 | 481 | 595 | 0.5 |
near-online: | |||||||||
NOMT | 46.4 | 53.3 | 18.3 | 41.4 | 9,753 | 87,565 | 359 | 504 | 2.6 |
Ours | 47.5 | 43.6 | 19.4 | 36.9 | 13,002 | 81,762 | 1,035 | 1,408 | 0.8 |
In Table 3, NT (listed as Ours) is compared with the state-of-the-art methods including EAMTT [Sanchez-Matilla, Poiesi, and Cavallaro2016], Quad [Son et al.2017], MHT [Kim et al.2015], STAM [Chu et al.2017], NOMT [Choi2015], AMIR [Sadeghian, Alahi, and Savarese2017], NLPa [Levinkov et al.2017], FWT [Henschel et al.2017], LMP [Tang et al.2017], INT [Lan et al.2018], and DCCRF [Zhou et al.2018]. Our NT method performs on par with the state-of-the-art trackers (e.g., FWT and LMP) in terms of tracking accuracy. Specifically, LMP uses additional person re-identification datasets to train a deep StackNet with body part fusion to associate pedestrians across frames, achieving the top tracking accuracy (i.e., 48.8 MOTA), while FWT incorporates multiple detectors to improve the tracking performance. In contrast to these methods, which use complex appearance models, our NT algorithm focuses on exploiting different degrees of dependencies among tracklets to assemble various kinds of appearance and motion patterns. The appearance modeling strategies proposed in those methods are complementary to our NT tracker. Meanwhile, we notice that NT achieves better performance than the high-order information based MHT in terms of tracking accuracy (47.5 vs. 45.8 MOTA), which implies that exploiting adaptive dependencies among objects is important for MOT.
Multi-Face Tracking. In addition to pedestrian tracking, we also evaluate NT on the SubwayFaces dataset used in [Wen et al.2016]. The dataset consists of four sequences, namely S001, S002, S003, and S004 with , , , and frames, captured from surveillance videos in a subway with manual annotations. We compare our approach with five state-of-the-art MOT algorithms, i.e., CEM [Andriyenko and Schindler2011], KSP [Berclaz et al.2011], DCT [Andriyenko, Schindler, and Roth2012], DPMF [Pirsiavash, Ramanan, and Fowlkes2011] and FH^{2}T [Wen et al.2016], with uniform average scores on different metrics over sequences presented in Table 4. We use the same input detections, ground-truth annotations and evaluation protocol as [Wen et al.2016], and the results of the state-of-the-art trackers in Table 4 are taken from [Wen et al.2016]. We use the pre-trained AlexNet [Krizhevsky, Sutskever, and Hinton2012] to extract the CNN features of the detected faces.
Method | MOTA | MOTP | MT[%] | ML[%] | FP | FN | IDS | FM |
---|---|---|---|---|---|---|---|---|
CEM | 18.9 | 71.4 | 18.8 | 37.4 | 1185.3 | 4095.3 | 69.8 | 100.3 |
KSP | 32.8 | 74.0 | 15.1 | 32.2 | 648.5 | 3589.3 | 70.0 | 82.3 |
DCT | 37.6 | 73.7 | 25.5 | 12.6 | 1235.0 | 2691.0 | 66.3 | 59.3 |
DPMF | 42.6 | 73.7 | 24.6 | 14.3 | 679.0 | 2858.3 | 62.8 | 74.0 |
FH^{2}T | 45.8 | 73.4 | 27.4 | 11.5 | 742.3 | 2634.0 | 43.0 | 57.3 |
NT | 53.1 | 70.4 | 34.2 | 8.5 | 648.5 | 2292.8 | 37.5 | 36.3 |
As presented in Table 4, our approach achieves the best performance on almost all evaluation metrics except MOTP. Specifically, the NT method produces 7.3 and 6.8 points higher average MOTA and MT scores, and a 3.0 points lower average ML score, compared to the second best FH^{2}T tracker. The evaluated sequences are recorded in unconstrained scenes with fast motion, illumination variations, motion blur and frequent occlusions. Since different degrees of dependencies among objects are considered, our method is able to exploit different types of motion patterns to improve the tracking performance, as indicated by the best scores on almost all metrics (i.e., MOTA, MT, ML, FP, FN, IDS, and FM). Meanwhile, compared with the state-of-the-art methods, our approach tracks the objects more robustly even when occlusions occur, as indicated by the IDS, FM and FN scores. However, linear interpolation is used in our method to estimate the occluded parts of the trajectories, which is not accurate enough to achieve a good MOTP score, especially for crowded scenes containing non-linear motion patterns.
Running Time. We implement the NT algorithm in C++ without any code optimization. To measure the running time of NT, we run it five times using a single thread on a laptop with a GHz Intel processor and GB memory. Given the detections with the corresponding CNN features, the average speeds on the multi-pedestrian tracking dataset, MOT2016 dataset, and multi-face tracking dataset are , , and frames per second (FPS), respectively.
Conclusions
In this work, we propose a non-uniform hypergraph learning based near-online MOT method, which assembles different degrees of dependencies among tracklets in a unified objective. In contrast to previous graph or hypergraph based methods, our formulation exploits different high-degree cues among multiple tracklets in a computationally efficient way. Extensive experiments on several datasets, including the multi-pedestrian and multi-face tracking datasets and the MOT2016 benchmark, show that our method achieves performance comparable to state-of-the-art methods. For future work, we plan to investigate and compare different optimization strategies to solve the dense structure searching problem on non-uniform hypergraphs.
Acknowledgments
Dawei Du and Siwei Lyu are supported by US NSF IIS-1816227 and the National Natural Science Foundation of China under Grant 61771341.
Appendix A The Proof of Calculating the Updating Step .
We present the proof of calculating the updating step . As discussed in the paper, the objective of dense structure searching on non-uniform hypergraph is defined as
(9) |
We use the pairwise updating scheme to search the dense structures on the hypergraph to complete the tracking task. Specifically, we increase one component and decrease another one appropriately, to increase , i.e.,
(10) |
where is the updated indicator variable in the optimization process, and .
The objective with the updated indicator variable is calculated as:
(11) |
The difference of objective after updating is
(12) |
Then, we rewrite the difference of objective as
(13) |
where
(14) | ||||
(15) |
As discussed in the paper, we select an appropriate updating step to maximize the objective difference . (When and , we have . In this case we cannot select any to increase the objective, so we ignore it in the discussion.) Based on the updating strategy presented in (10), we have two constraints on , i.e.,