A Hybrid Data Association Framework for Robust Online Multi-Object Tracking

03/31/2017 ∙ by Min Yang, et al. ∙ 0

Global optimization algorithms have shown impressive performance in data-association based multi-object tracking, but handling online data remains a difficult hurdle to overcome. In this paper, we present a hybrid data association framework with a min-cost multi-commodity network flow for robust online multi-object tracking. We build local target-specific models interleaved with global optimization of the optimal data association over multiple video frames. More specifically, in the min-cost multi-commodity network flow, the target-specific similarities are online learned to enforce the local consistency for reducing the complexity of the global data association. Meanwhile, the global data association taking multiple video frames into account alleviates irrecoverable errors caused by the local data association between adjacent frames. To ensure the efficiency of online tracking, we give an efficient near-optimal solution to the proposed min-cost multi-commodity flow problem, and provide the empirical proof of its sub-optimality. The comprehensive experiments on real data demonstrate the superior tracking performance of our approach in various challenging situations.




I Introduction

Online multi-object tracking aims to estimate the spatio-temporal trajectories of multiple objects in an online video stream (i.e., the video is provided frame-by-frame), which is a fundamental problem for numerous real-time applications, such as video surveillance, autonomous driving, and robot navigation. Assuming that an object detector is available to detect potential locations of multiple objects in each frame, the tracking problem reduces to a data association procedure that links these individual detections to form consistent trajectories.

Data association is a challenging problem in many situations, especially in complex scenes, due to the presence of occlusions, inaccurate detections, and interactions among similar-looking objects. Standard approaches for data association recursively link detections frame by frame [breitenstein2011online, shu2012part, possegger2014Occlusion, xiang2015learning, solera2015learning, yang2016temporal, hong2016online, Milan2016arxiv], resulting in a bipartite matching between the existing trajectories and the newly obtained detections, as shown in Fig. 1(a). These approaches are temporally local and computationally efficient, making them suitable for the online setting. However, using only local information for data association might lead to irrecoverable errors when an object is undetected or is confused with clutter. To overcome this shortcoming, global data association over entire video frames (or a batch of frames) has been used to infer optimal trajectories [zhang2008global, berclaz2011multiple, luo2015automatic, chari2015pairwise, dehghan2015gmmcp, tang2015subgraph, wang2016tracklet, tang2016multi, maksai2016globally], as shown in Fig. 1(b). Such a data association problem can be solved in an optimization framework with carefully designed cost functions. Unfortunately, global association methods cannot be directly applied to online video streams. Overlapping temporal windows are a common choice to handle online data [berclaz2011multiple, chari2015pairwise, dehghan2015gmmcp], but the connection between consecutive batches remains an open problem.
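The local, frame-to-frame association described above can be illustrated with a minimal sketch. It assumes a hypothetical cost matrix `cost[i][j]` for linking existing trajectory `i` to new detection `j`, and it enumerates all one-to-one assignments by brute force, which is only practical for tiny examples; the cited trackers typically rely on a polynomial-time assignment solver such as the Hungarian algorithm instead.

```python
from itertools import permutations


def local_association(cost):
    """Return the min-cost one-to-one assignment of trajectories to detections.

    cost[i][j] is a hypothetical cost of linking trajectory i to detection j.
    Brute force over permutations; assumes len(cost) <= len(cost[0]).
    """
    n_traj, n_det = len(cost), len(cost[0])
    best_total, best_assign = float("inf"), None
    for perm in permutations(range(n_det), n_traj):
        total = sum(cost[i][perm[i]] for i in range(n_traj))
        if total < best_total:
            best_total, best_assign = total, perm
    return list(best_assign), best_total


# Two existing trajectories, three detections in the next frame (made-up costs).
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9]]
assignment, total = local_association(cost)
print(assignment, total)  # trajectory 0 -> detection 0, trajectory 1 -> detection 1
```

Because the decision uses only the next frame, a missed or confused detection in that frame cannot be corrected later, which is the failure mode that motivates looking at several frames at once.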

Fig. 1: Illustration of our hybrid data association approach. (a) Local association is performed between two consecutive frames and , and a bipartite matching between the existing trajectories (marked as color arrows) at the current frame and the detections (marked as circles) from the next frame is usually solved. (b) Global association is performed over a batch of frames (length in this example), and an optimization problem is usually solved to infer optimal trajectories based on pairwise affinities between detections. (c) The hybrid association finds globally optimal associations for the existing trajectories within a temporally local window (length in this example), and the target-specific information from the existing trajectories provides helpful local constraints to guide the global optimization.

In this paper, we propose a hybrid data association framework for online multi-object tracking, which combines the strengths of both local and global data association methods. The core of our approach lies in the association between the existing trajectories and the detections from multiple video frames within a temporal window, as shown in Fig. 1(c). We exploit a min-cost multi-commodity flow defined on a cost-flow network constructed from the detections of multiple frames. The proposed min-cost multi-commodity network is able to formulate a hybrid data association strategy to handle online data with an efficient near-optimal solution.

In our framework, concretely, all possible associations among the detections are represented by edges in the network, where the corresponding edge costs account for the association likelihoods. Each existing trajectory is then supposed to be a specific commodity, and its optimal associations can be found by sending specific commodity flows through the network with a minimum cost. To this end, the following three challenges need to be studied: (i) identifying newly appeared objects automatically; (ii) computing edge cost for different commodities; (iii) solving the min-cost multi-commodity flow problem efficiently. By addressing these challenges, we bring the following three contributions:

  • We introduce a dummy commodity into our network to automatically identify a new object. The dummy commodity corresponds to a target-independent model, and its commodity flows indicate the permissible tracks of objects newly appeared in a temporal window.

  • We present an online discriminative appearance modeling approach to build target-specific models for different existing trajectories. The edge costs of multiple commodities in the network are estimated by exploiting the target-specific information to discriminate a specific target from both other targets and the background.

  • We propose a near-optimal algorithm for the min-cost multi-commodity flow problem, and provide empirical evidence of its sub-optimality. By using a reformulation and a column generation strategy, our solution is efficient and performs well in multi-object tracking.

The proposed hybrid strategy offers several advantages over existing methods. First, it makes the global optimization of trajectories applicable to online data. The local association between consecutive frames is extended to account for more hypotheses from multiple frames. Irrecoverable errors caused by noisy detections or frequent occlusions can be alleviated to improve online tracking performance. Second, the target-specific information from the existing trajectories is explicitly modeled to guide the global optimization over the current batch of frames. In practice, it enforces local constraints that reduce the complexity of the optimization problem, as the associated detections are restricted to be consistent with the target-specific models. We believe that the techniques described in this paper are of wide interest due to their efficiency and performance. Both qualitative explanation and experimental confirmation are provided to support this claim.

The rest of the paper is organized as follows. Section II reviews the related work. In Section III, we describe the details of our online multi-object tracking method using the hybrid data association, including the min-cost multi-commodity flow formulation and its edge costs. Section IV presents the near-optimal solution of our model. We report and discuss the experimental results in Section V, and conclude the paper in Section VI.

II Related Work

In multi-object tracking, data association based methods fall into a sub-domain known as the tracking-by-detection technique, which has shown impressive tracking performance in unconstrained environments. A thorough review can be found in [luo2014multiple]. As evidenced in Section I, the local association method has aroused considerable research interest. Especially with the success of recurrent neural networks (RNNs) in the computer vision community [karpathy2015visualizing], RNN-based methods have made significant advances on MOT problems. Building on the pioneering work of Ondruska and Posner [ondruska2016deep], RNN-based methods quickly attracted significant interest for modeling the local association, and inspired a number of extensions including [MilanAAAIRNNTracking, alahi2016social, sadeghian2017tracking]. Nevertheless, RNNs usually come with high computational and memory demands during both model training and inference. Here we explicitly enforce locality in the global data association formulation, and introduce a hybrid data association framework that is able to integrate the advantages of both local and global association methods.

Maintaining locality for global data association is critical for multi-object tracking performance, since global optimization without local constraints might scale poorly for complex scenarios and long batches. Many global association methods enforce locality by iteratively optimizing trajectories [huang2008robust, yang2012multi, Milan:2014:CEM, Milan2016PAMI], or using tracklets (i.e., short-term trajectory fragments) instead of individual detections [zhang2008global, yang2014multi, wang2016tracklet]. However, these strategies are hardly applicable to online video streams. Alternatively, one can divide an online video stream into consecutive batches with temporal sliding windows, and apply global data association to each video batch [berclaz2011multiple, chari2015pairwise, dehghan2015gmmcp]. In order to produce consistent trajectories, the connection between optimized trajectories from adjacent batches needs to be considered. However, most existing methods adopt heuristic strategies to connect adjacent batches and cannot ensure the optimality of the trajectories.

To retain the ability to handle online data, we explicitly model the target-specific information from previous observations, similar to local data association methods, to cooperate with the global data association over multiple frames. Integrating local and global data association is rarely mentioned in the literature. Lenz et al. [lenz2015followme] proposed an approximate online solution to the min-cost network flow problem with bounded memory and computation. The local consistency, however, is ignored in the optimization of trajectories. Choi [choi2015NOMT] proposed a near-online multi-object tracking method to formulate the data association between previously tracked objects and detections in a temporal window. The method has a similar problem setting to ours; the difference is that a highly non-convex formulation is adopted to select appropriate hypotheses for the objects. The solution heavily relies on both the affinity measures and the generated trajectory hypotheses. In contrast, we use a more compact formulation, i.e., the min-cost multi-commodity flow, to address the hybrid data association. The target-specific information contained in the existing trajectories is incorporated into the flow costs in a natural way, ensuring that the objective is still convex. We also propose an optimization algorithm for the network flow problem, and show its effectiveness in multi-object tracking.

Recently, multi-commodity flow has been introduced into multi-object tracking in [ben2014multi, dehghan2015target]. Ben Shitrit et al. [ben2014multi] employed the multi-commodity network to account for different appearance groups which are fixed beforehand. Each appearance group (e.g., a basketball team) is supposed to be a specific commodity in the network, and solving multi-commodity flow problems is able to distinguish different appearance groups during the optimization process. Dehghan et al. [dehghan2015target] have focused on integrating object detector learning and multi-object tracking, where the multi-commodity network is used to track a fixed number of objects in a short video batch. Our approach is different from these methods in that we use a multi-commodity network to formulate a hybrid data association strategy to handle online data. Furthermore, a high-quality near-optimal solution to the min-cost multi-commodity flow problem can be achieved by an efficient algorithm, especially when the number of objects (commodities) is relatively large. Thus we do not need to heuristically prune the graph [ben2014multi] or iteratively relax the hard constraints [dehghan2015target].

Fig. 2: Illustration of the online multi-object tracking process with our hybrid data association. At each time step , we solve a data association problem between the set of existing trajectories and the set of detection responses in a temporal window . After that, the trajectory set is updated to by incorporating the associated detections at frame , and the temporal window moves one time step forward.

III Hybrid Data Association

Let denote the set of detections from the video with the -th detection and the number of detections. Assume that, at each time step , we have a set of existing trajectories and observe multiple video frames in a temporal window . A set of detection responses is obtained by applying an object detector to each video frame within the temporal window. The task of hybrid data association is to find globally optimal associations of over the detections , and simultaneously identify newly appeared objects. Then the trajectory set is updated to by incorporating the associated detections at the frame , and the temporal window moves one time step forward, as shown in Fig. 2. In practice, this causes a latency of to output tracking results, as the trajectories at frame are not updated until the frame is observed. Nevertheless, our approach operates in a fully online manner and thus is capable of handling online data. Note that the traditional local or global data association methods can be regarded as special cases of the proposed hybrid framework by setting the length of the temporal window to or (total length of the video), respectively.
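The sliding-window procedure above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: `window_len`, `associate`, and the dictionary-based trajectory store are hypothetical names, the window is taken to start at the frame being committed, and end-of-stream handling is omitted. In this sketch a frame is finalized only after the last frame of its window has been observed, so the output lags by `window_len - 1` frames.

```python
def hybrid_tracking_loop(detections_per_frame, window_len, associate):
    """Slide a temporal window over the stream, one frame at a time.

    detections_per_frame: list where entry t holds the detections of frame t.
    window_len: number of frames in the window (hypothetical name).
    associate: callback (trajectories, window) -> {trajectory id: detection}
               giving the detections to commit for the first window frame.
    """
    trajectories = {}   # trajectory id -> list of (frame, detection)
    outputs = []        # (last observed frame, committed frame)
    num_frames = len(detections_per_frame)
    for t in range(num_frames - window_len + 1):
        window = detections_per_frame[t:t + window_len]
        committed = associate(trajectories, window)
        for traj_id, det in committed.items():
            trajectories.setdefault(traj_id, []).append((t, det))
        # Frame t is finalized only after frame t + window_len - 1 is seen.
        outputs.append((t + window_len - 1, t))
    return trajectories, outputs


# Toy association: detections carry their identity as a label (an assumption
# made only so that the loop can be run without a real solver).
stream = [["a", "b"], ["a"], ["a", "b"], ["b"]]
trajs, outs = hybrid_tracking_loop(stream, 2, lambda tr, w: {d: d for d in w[0]})
print(outs)
```

Setting `window_len` to 1 gives the purely local case with no latency, and setting it to the length of the video gives a single global problem, mirroring the two special cases noted in the text.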

In this section, the data association between and is formulated as a min-cost multi-commodity flow problem, as in Fig. 3. For the convenience of discussion, we drop the time index in the following description, and denote the current set of existing trajectories as , where is the -th existing trajectory and is the number of existing trajectories.

III-A Our min-cost multi-commodity flow

Given the set of existing trajectories and the set of detections , we introduce a directed network with multiple sources and sinks , . The directed network is constructed by the set of detections . Each detection corresponds to a pair of nodes in connected by an observation edge with cost and flow . The cost indicates the confidence of observing the detection , and the flow encodes the selection of the detection in some tracks. Each transition between a pair of detections is represented by a transition edge with cost and flow . The cost represents the coherence between detections and , and the flow indicates that the two detections are connected through the same track. The set of permissible transitions between detections is denoted as . It could be a subset of all pairs of detections in successive frames by using choice heuristics (e.g., spatial proximity). Finally, the source and sink are introduced with track start edges (with cost and flow ) and track termination edges (with cost and flow ). Then the multi-object tracking problem is formulated as sending a set of flows from the source to sink , which minimizes the total cost


In this work, each existing trajectory is supposed to be a target-specific commodity which corresponds to a source-sink pair . Specifically, sources and sinks are introduced with track start edges and track termination edges connected to all detections, indicating that the existing trajectories or newly appeared trajectories are allowed to start and terminate at any detection from the temporal window. For each commodity , sending flows from to through the network incurs a specific set of edge costs. Formally, we use , , , and to represent the amount of the -th commodity flows on the observation edge , the transition edge , the track start edges , and the track terminate edge , respectively. The corresponding edge costs, in a similar way, are denoted as , , , and .

To identify newly appeared objects, we add a dummy commodity with the source and sink to represent a target-independent model. We call a flow sent from to the -th commodity flow. That is, the source and sink are extended to account for multiple commodities (see an example in Fig. 3). Then the optimal associations of over can be found by sending the -th commodity flow through the network. It leads to a multi-commodity flow problem in the community of network flow [ahuja1993network].
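To make the network construction concrete, the following sketch builds a tiny instance and evaluates the cost of one flow path for a given commodity. The data layout is an assumption made for illustration: commodity `0` stands for the dummy commodity, `costs[k]` is a hypothetical dictionary holding the observation, transition, start, and termination costs seen by commodity `k`, and all numeric values are invented except the dummy start cost of 10, which follows the value reported later in the text.

```python
from dataclasses import dataclass, field


@dataclass
class FlowNetwork:
    frames: dict                                     # detection id -> frame index
    transitions: set = field(default_factory=set)    # permissible (i, j) pairs
    costs: dict = field(default_factory=dict)        # commodity k -> edge costs


def build_transitions(frames, allowed):
    """Keep pairs in successive frames that pass a gating heuristic."""
    return {(i, j) for i in frames for j in frames
            if frames[j] == frames[i] + 1 and allowed(i, j)}


def path_cost(net, k, path):
    """Cost of sending one unit of commodity k along a source-to-sink path."""
    c = net.costs[k]
    for i, j in zip(path, path[1:]):
        if (i, j) not in net.transitions:
            raise ValueError(f"transition {(i, j)} is not permissible")
    total = c["start"][path[0]] + c["end"][path[-1]]
    total += sum(c["obs"][i] for i in path)
    total += sum(c["trans"][(i, j)] for i, j in zip(path, path[1:]))
    return total


frames = {"a": 1, "b": 2, "c": 2}
net = FlowNetwork(frames, build_transitions(frames, lambda i, j: True))
zero_end = {"a": 0.0, "b": 0.0, "c": 0.0}
net.costs[1] = {  # target-specific costs of existing trajectory 1
    "start": {"a": 0.1, "b": 1.0, "c": 1.0}, "end": zero_end,
    "obs": {"a": -0.8, "b": -0.7, "c": 0.3},
    "trans": {("a", "b"): -0.5, ("a", "c"): 0.2}}
net.costs[0] = {  # dummy commodity: target-independent costs
    "start": {"a": 10.0, "b": 10.0, "c": 10.0}, "end": zero_end,
    "obs": {"a": -0.9, "b": -0.8, "c": -0.9},
    "trans": {("a", "b"): -0.6, ("a", "c"): -0.3}}
print(path_cost(net, 1, ["a", "b"]), path_cost(net, 0, ["c"]))
```

The same path has a different cost for each commodity, which is how target-specific information enters the global problem; the large start cost makes a dummy path attractive only when no existing trajectory explains the detections well.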

With the network , the hybrid data association problem is formulated as finding an optimal set of flows between multiple source and sink pairs , which minimizes the total cost


Intuitively, each flow path connects a set of coherent detections over time and thus can be interpreted as an object track. In practice, the flows should be subject to the following constraints to satisfy the physical conditions of the real world:


The constraint (3) is an edge capacity constraint which means that each detection belongs to at most one track. The flow conservation constraint (4) encodes that the sum of flows arriving at any detection is equal to the flow of its observation edge , which is also the sum of outgoing flows from the detection . The constraints (3), (4), and (5) ensure that all permissible flows in the network come in the form of flow paths from sources to sinks, and also ensure that there is no overlap between multiple paths. The flow variables , , , act as binary indicators taking the value 1 when the corresponding edge is selected in a flow path of the commodity . The constraint (6) restricts the total amount of flows sent from to to be a certain value . Consequently, each flow path in the network can be interpreted as an object track which connects a set of coherent detections over time. A flow path of commodity with is the success track of the existing trajectory within the temporal window. We thus set for to ensure that each existing trajectory has only one success track. For the dummy commodity, we set to capture a sufficient number of new objects.

To simplify the notation, we collect the flow variables , , , in a long vector and the edge cost variables , , , and in a long vector , respectively. Then the optimization problem that minimizes the cost (2) with constraints (3), (4), (5), and (6) can be rewritten as


where the constraints are rearranged into the matrix form. The vectors with all zero and one entries are denoted as and , respectively.

Fig. 3: An example of the directed network with multiple sources and sinks. Each detection is represented by a pair of nodes connected by an observation edge. Possible transitions between detections are modeled by transition edges. To allow tracks to start and terminate at any detections from the video, each detection is connected to both a source and a sink . We use , , , and to represent the amount of the -th commodity flows on the observation edge , the transition edge , the track start edges , and the track terminate edge , respectively. We add a dummy commodity with the source and sink to represent a target-independent model

III-B Computing edge costs

In our min-cost multi-commodity flow formulation, sending flows of a commodity through the network incurs a specific set of edge costs . Therefore, local information contained in the existing trajectories can be incorporated into the edge costs in a natural way, and thus guides the global data association over multiple video frames. In this subsection, we show that the edge costs can be computed by exploiting the target-specific information from the existing trajectories.

III-B1 Observation cost

Given an existing trajectory and a detection , the observation cost encodes the possibility of belonging to . is computed by


where is the similarity function used to recognize the specific object corresponding to , and and are the appearance features of the existing trajectory and the detection , respectively. We use Convolutional Neural Network (CNN) features to capture the appearance information of an object, as described in Section V-C. The appearance feature of is represented by the average feature vector over the last frames, and the appearance feature of is extracted from the image region corresponding to its location. The similarity function is designed to assign high similarity scores to pairs of appearance features when both of them originate from the same object corresponding to , while producing low similarity scores when at least one of them originates from another object. We utilize an online similarity learning approach to learn the target-specific similarity function , as described in Section III-C. For the dummy commodity, we set to the negative detector score of the detection .

Note that the observation costs take negative values when the appearance similarity scores or the detector scores are larger than zero, which facilitates the generation of long trajectories. Furthermore, the observation costs taking negative values ensure the appearance consistency for each trajectory since the total cost of the network flows is minimized in our model.

III-B2 Transition cost

The transition cost indicates the confidence of connecting the detections and in the same success track of , which can be computed by


where and are the appearance features of the detection and the detection , respectively. For the dummy commodity, the transition cost is computed by using the cosine of the angle between two appearance feature vectors as a target-independent similarity function.
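As an illustration of the target-independent similarity used for the dummy commodity, the sketch below computes the cosine of the angle between two feature vectors. Turning it into a cost by negation is an assumption made here, chosen to match the sign convention stated for the observation cost (positive similarity gives negative cost); the function names are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0


def dummy_transition_cost(x_i, x_j):
    """Assumed sign convention: cost = -similarity, so similar pairs are cheap."""
    return -cosine_similarity(x_i, x_j)


print(dummy_transition_cost([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(dummy_transition_cost([1.0, 0.0], [0.0, 1.0]))  # orthogonal features
```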

III-B3 Track start/termination cost

The track start cost encodes the possibility that a success track of the starts at the detection . Given the frame index of the detection , we use a constant velocity model to obtain a prediction of at frame , denoted as . Then the track start cost is given by


where is a decay factor (set to 0.95) which discounts long term prediction, is the last associated frame of , and the function denotes the overlap rate between two bounding boxes. For the dummy commodity, we set the track start cost to be a large positive value (10 in our implementation) to reduce the priority of identifying new objects while facilitating the association of the existing trajectories.
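The ingredients of the track start cost can be sketched as follows. This is not the paper's formula: the box format `(x, y, w, h)`, the per-frame velocity, the use of intersection-over-union as the overlap rate, and the use of the frame gap as the exponent of the decay factor are all assumptions. The sketch only computes a discounted overlap between the constant-velocity prediction and a detection; how that quantity is mapped to a cost is left out.

```python
def predict_box(last_box, velocity, last_frame, frame):
    """Constant-velocity prediction; boxes are (x, y, w, h), velocity is per frame."""
    dt = frame - last_frame
    x, y, w, h = last_box
    return (x + velocity[0] * dt, y + velocity[1] * dt, w, h)


def overlap_rate(a, b):
    """Intersection over union of two (x, y, w, h) boxes (one common choice)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def discounted_overlap(last_box, velocity, last_frame, det_box, det_frame, gamma=0.95):
    """Overlap with the prediction, discounted by gamma per elapsed frame."""
    pred = predict_box(last_box, velocity, last_frame, det_frame)
    return gamma ** (det_frame - last_frame) * overlap_rate(pred, det_box)


# Trajectory last seen at frame 5 moving 2 px/frame to the right.
score = discounted_overlap((0, 0, 10, 10), (2, 0), 5, (4, 0, 10, 10), 7)
print(score)
```

The discount makes a perfect spatial match two frames ahead count for less than the same match in the next frame, which reflects the growing uncertainty of long-term prediction.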

Similarly, the track termination cost encodes the possibility that a success track of the ends at the detection . Assuming that an object trajectory ends at all detections with the same probability, we simply set for all .

III-C Online similarity learning

Given an existing trajectory , we learn a target-specific similarity function to distinguish the corresponding object from the others. Formally, we use a parametric similarity function that has a bi-linear form to estimate the appearance similarity between two appearance features and ,


where with the dimensionality of appearance features. The task of online similarity learning is to estimate an appropriate parameter matrix for the existing trajectory in the process of the online tracking.

At each time , we assume that a detection from time , whose appearance feature is denoted as , is associated with the existing trajectory . The parameter matrix of at the current time needs to be updated to account for the newly observed appearance feature . The principle of updating is to recognize as a relevant appearance and as irrelevant appearances. We therefore construct a set of triplets , where is the appearance feature of at the current time . Each triplet indicates that the similarity between and should be larger than the similarity between and . Forcing the current matrix to satisfy the triplet set leads to the updated matrix at time .

We here present an incremental update algorithm to satisfy the triplets sequentially [chechik2010large]. Without loss of generality, assume that we have a parameter matrix at the -th iteration and observe a triplet . The goal of incremental updating is to obtain a new matrix satisfying


which means that it fulfills the definition of a triplet with a safety margin of . Meanwhile, applying the Passive-Aggressive algorithm [crammer2006online] to maintain smoothness, the new matrix is selected to remain close to the previous matrix .

We define a hinge loss function to measure the confidence that a matrix satisfies the triplet ,


Then the problem of incremental updating can be expressed as


where is the Frobenius norm, is a slack variable, and is a parameter that controls the trade-off between preserving smoothness and minimizing the loss on the current triplet.

Since Eq. (14) is a constrained convex optimization problem, we can directly derive its optimal solution by using the Karush-Kuhn-Tucker (KKT) conditions,


According to Eq. (15), the update only happens when the hinge loss on the triplet is larger than zero.
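For reference, the online similarity learning scheme of [chechik2010large] with the Passive-Aggressive update of [crammer2006online], which this subsection follows, is usually written as below. The symbols ($\mathbf{W}$ for the parameter matrix, $\mathbf{x}$ for the query feature, $\mathbf{x}^{+}$ and $\mathbf{x}^{-}$ for the relevant and irrelevant features, $C$ for the trade-off parameter, $\xi$ for the slack variable) and the unit safety margin are those of that standard formulation and are assumptions here; they may differ from the notation used in Eqs. (11) to (15).

```latex
\begin{align}
S_{\mathbf{W}}(\mathbf{x}, \mathbf{y}) &= \mathbf{x}^{\top} \mathbf{W} \mathbf{y}, \\
\ell_{\mathbf{W}}(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-})
  &= \max\bigl(0,\; 1 - S_{\mathbf{W}}(\mathbf{x}, \mathbf{x}^{+})
     + S_{\mathbf{W}}(\mathbf{x}, \mathbf{x}^{-})\bigr), \\
\mathbf{W}^{i} &= \arg\min_{\mathbf{W}}\;
  \tfrac{1}{2}\lVert \mathbf{W} - \mathbf{W}^{i-1} \rVert_{F}^{2} + C\,\xi
  \quad \text{s.t.} \quad
  \ell_{\mathbf{W}}(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-}) \le \xi,\;\; \xi \ge 0, \\
\mathbf{W}^{i} &= \mathbf{W}^{i-1} + \tau\,\mathbf{V}, \qquad
  \tau = \min\Bigl(C,\;
    \frac{\ell_{\mathbf{W}^{i-1}}(\mathbf{x}, \mathbf{x}^{+}, \mathbf{x}^{-})}
         {\lVert \mathbf{V} \rVert_{F}^{2}}\Bigr), \qquad
  \mathbf{V} = \mathbf{x}\,(\mathbf{x}^{+} - \mathbf{x}^{-})^{\top}.
\end{align}
```

When the hinge loss is zero, $\tau = 0$ and the matrix is left unchanged, which matches the remark that the update only happens when the loss on the triplet is positive.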

To summarize, for each existing trajectory at time , we incrementally update the similarity function parameterized by the matrix through the following steps:

  • construct the triplet set ;

  • sequentially update the matrix by applying Eq. (15) to each triplet in , one by one;

  • obtain the updated matrix at the time .

Note that the parameter matrix of the existing trajectory is initialized to an identity matrix when the trajectory is initialized. The incremental update on each iteration, as defined by Eq. (15), only involves a few matrix operations and thus is extremely efficient. Moreover, the entire online similarity learning process for each trajectory is independent and can be performed in parallel to further improve the computational efficiency.
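The update steps summarized above can be sketched in a few lines. This is a minimal illustration in the style of the cited Passive-Aggressive similarity learning, not the authors' implementation: the function names, the trade-off value `C=0.1`, and the unit margin are assumptions, and matrices are stored as plain lists of rows to keep the example dependency-free.

```python
def bilinear(W, x, y):
    """Bilinear similarity x^T W y with W stored as a list of rows."""
    return sum(x[a] * W[a][b] * y[b] for a in range(len(x)) for b in range(len(y)))


def pa_update(W, x, x_pos, x_neg, C=0.1, margin=1.0):
    """One Passive-Aggressive step on the triplet (x, x_pos, x_neg)."""
    loss = max(0.0, margin - bilinear(W, x, x_pos) + bilinear(W, x, x_neg))
    if loss == 0.0:
        return W                                   # triplet already satisfied
    diff = [p - n for p, n in zip(x_pos, x_neg)]
    V = [[xa * db for db in diff] for xa in x]     # outer product x (x+ - x-)^T
    v_norm_sq = sum(v * v for row in V for v in row)
    if v_norm_sq == 0.0:
        return W
    tau = min(C, loss / v_norm_sq)                 # step size capped by C
    return [[W[a][b] + tau * V[a][b] for b in range(len(diff))]
            for a in range(len(x))]


def update_trajectory_model(W, x_traj, x_new, x_others, C=0.1):
    """Process one triplet per irrelevant appearance, sequentially."""
    for x_neg in x_others:
        W = pa_update(W, x_traj, x_new, x_neg, C)
    return W


identity = [[1.0, 0.0], [0.0, 1.0]]                # initial matrix for a new trajectory
x_traj, x_new, x_other = [1.0, 0.0], [0.9, 0.1], [0.8, 0.3]
W = update_trajectory_model(identity, x_traj, x_new, [x_other])
before = bilinear(identity, x_traj, x_new) - bilinear(identity, x_traj, x_other)
after = bilinear(W, x_traj, x_new) - bilinear(W, x_traj, x_other)
print(before, after)
```

Each step touches only one rank-one matrix, so the cost per triplet is quadratic in the feature dimension, and the models of different trajectories can be updated independently.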

IV Optimization

Finding a global minimum to the hybrid data association problem (7) is an Integer Linear Program (ILP), which is NP-hard. In addition, the optimal solution to its Linear Program (LP) relaxation is not guaranteed to be integral, although integrality is an important requirement for the generation of reasonable object trajectories. In this section, by exploring the special structure of the constraints, we propose an efficient optimization algorithm that is able to provide near-optimal integer solutions with empirical sub-optimality certificates.

IV-A Dantzig-Wolfe decomposition

Since most constraints in the problem (7) only involve a single commodity, we use the Dantzig-Wolfe decomposition [dantzig1960decomposition] to reformulate the “relatively easy” constraints. Specifically, we consider the nonnegativity constraints and the flow conservation constraints that are exactly identical for each commodity . All feasible flow vectors can be treated as points lying on the polyhedron . It is a cone and has a single vertex and a finite number of rays . By the Minkowski-Weyl theorem [schrijver1998theory], we can represent a flow vector as


where is the associated non-negative coefficient. In our case, the rays form the basis of the null space defined by the constraint matrix in the flow conservation constraints , which correspond to indicator vectors of all possible paths from the source to the sink in our network.

Substituting the equation (16) into (7), we can rewrite the formulation as


The formulation (17) can be seen as a path flow formulation that is equivalent to the original edge flow formulation (7). The variable is interpreted as the -th commodity flow on the path corresponding to , indicating whether the path is selected by the -th commodity or not.
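The relation between the path flow and edge flow views can be checked on a toy example. In the sketch below, a path is a list of detection ids, observation edges are encoded as `(i, i)`, and `"s"` and `"t"` stand for the source and sink; all of these encodings are assumptions made for illustration. Any non-negative combination of path indicator vectors yields an edge flow that satisfies flow conservation at every detection.

```python
def path_to_edges(path):
    """Edges used by a source-to-sink path through a list of detections."""
    edges = [("s", path[0])]                     # track start edge
    edges += [(i, i) for i in path]              # observation edges
    edges += list(zip(path, path[1:]))           # transition edges
    edges.append((path[-1], "t"))                # track termination edge
    return edges


def edge_flow(paths, coeffs):
    """Edge flow as a non-negative combination of path indicator vectors."""
    flow = {}
    for path, lam in zip(paths, coeffs):
        for e in path_to_edges(path):
            flow[e] = flow.get(e, 0.0) + lam
    return flow


def conserves_flow(flow, detections):
    """Incoming flow == observation flow == outgoing flow at every detection."""
    for d in detections:
        inflow = sum(v for (u, w), v in flow.items() if w == d and u != d)
        outflow = sum(v for (u, w), v in flow.items() if u == d and w != d)
        obs = flow.get((d, d), 0.0)
        if abs(inflow - obs) > 1e-9 or abs(outflow - obs) > 1e-9:
            return False
    return True


flow = edge_flow([["a", "b"], ["a", "c"]], [0.5, 0.5])
print(flow[("a", "a")], conserves_flow(flow, ["a", "b", "c"]))
```

The fractional coefficients in this example also show why the LP relaxation needs care: two half-selected paths share detection `a` without violating its capacity, yet they do not correspond to a valid track.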

IV-B Column generation

Enumerating all possible paths to construct the complete set leads to a very large number of variables for optimization. In practice, only a few paths among are needed to achieve the optimal solution. We thus use the column generation [ford1958suggested] process to dynamically find the critical paths. In the following, we consider the LP relaxation of (17), denoted as the master LP (MLP), by removing the integer constraints, and show later how to obtain a near-optimal integer solution.

Formally, the MLP problem can be expressed as


where is the whole index set of all possible paths. The dual problem of the MLP, denoted as DMLP, has the form


where are the dual variables of the primal variables . By duality theory, any dual feasible solution of the DMLP provides a lower bound on the MLP, which is the foundation of the column generation algorithm.

Assume that, at the iteration , only a subset of paths with is available. Solving the MLP on the subset gives rise to the restricted master linear program (RMLP),


Let and be the optimal primal and dual solutions to the RMLP, respectively. We need to check whether the optimal solution to the RMLP is also optimal for the MLP, and decide whether the current path set needs to be augmented. This can be done by solving the following pricing problem:


In our case, the pricing problem turns into a shortest path problem with regard to the modified edge cost for the commodity , which can be solved very efficiently by dynamic programming. With the optimal solution to the pricing problem, we have the following proposition.
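The shortest path computation behind the pricing problem can be sketched with a forward dynamic program over detections sorted by frame. The names, the cost dictionaries, and in particular the sign convention for the dual prices are assumptions: `mu[i]` is taken to be the (non-positive) dual price of the capacity constraint of detection `i`, subtracted from its observation cost, and `nu_k` the dual price of the flow-amount constraint of commodity `k`.

```python
def shortest_path(frames, obs, trans, start, end):
    """Min-cost source-to-sink path in the layered detection graph.

    frames: detection id -> frame index; obs/start/end: detection id -> cost;
    trans: (i, j) -> cost, where the frame of j is later than the frame of i.
    """
    order = sorted(frames, key=frames.get)       # earlier frames first
    incoming = {}
    for (i, j), c in trans.items():
        incoming.setdefault(j, []).append((i, c))
    best, prev = {}, {}                          # best[i]: cheapest path ending at i
    for j in order:
        best[j], prev[j] = start[j] + obs[j], None
        for i, c in incoming.get(j, []):
            cand = best[i] + c + obs[j]
            if cand < best[j]:
                best[j], prev[j] = cand, i
    last = min(order, key=lambda i: best[i] + end[i])
    cost, path = best[last] + end[last], []
    while last is not None:                      # backtrack to the source
        path.append(last)
        last = prev[last]
    return path[::-1], cost


def price_out(frames, costs, mu, nu_k):
    """Shortest path under modified costs and its reduced cost (assumed signs)."""
    modified_obs = {i: costs["obs"][i] - mu[i] for i in frames}
    path, cost = shortest_path(frames, modified_obs, costs["trans"],
                               costs["start"], costs["end"])
    return path, cost - nu_k


frames = {"a": 1, "b": 2, "c": 2}
costs = {"obs": {"a": -0.8, "b": -0.7, "c": -0.6},
         "trans": {("a", "b"): -0.5, ("a", "c"): -0.4},
         "start": {"a": 0.1, "b": 0.1, "c": 0.1},
         "end": {"a": 0.0, "b": 0.0, "c": 0.0}}
free = price_out(frames, costs, {"a": 0.0, "b": 0.0, "c": 0.0}, 0.0)
priced = price_out(frames, costs, {"a": 0.0, "b": -1.0, "c": 0.0}, 0.0)
print(free, priced)
```

In the second call detection `b` carries a dual price, as it would when another commodity already claims it, so the cheapest path for this commodity moves to `c`. Each call runs in time linear in the number of edges.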

Proposition 1

If holds for all , the optimal primal solution to the RMLP optimally solves the MLP.

Given the optimal primal solution to the RMLP , we can validate that is a feasible solution to the MLP by setting for those paths not in the current set . Therefore, the optimal value of the RMLP gives an upper bound on the MLP,


where and are the optimal value of the RMLP and the MLP, respectively.

Due to the definition of the pricing problem (21), when holds for all , we have


It can be rewritten as


which implies that the optimal dual solution to the RMLP is also a feasible solution to the DMLP given by (19). By duality theory, the solution provides a lower (dual) bound on the MLP; we therefore have


Note that the above equation uses the fact that the optimal primal solution and the optimal dual solution to the RMLP give exactly the same optimal value of the objective function.

With the equations (22) and (25), we can conclude that the RMLP and the MLP have the same optimal value if holds for all . Therefore, the optimal primal solution to the RMLP optimally solves the MLP. This completes the proof.

If the condition of Proposition 1 is not satisfied, i.e., for some , the shortest path provided by the pricing problem (21) has a negative reduced cost. We introduce into the subset , and repeat the process for the next iteration to decrease the objective value of the MLP.

To obtain a near-optimal integer solution to the ILP (17), one can retain the feasible solution with the minimum objective value once the RMLP provides an integer solution during the column generation process (which happens very frequently in practice). Since the optimal solution to the MLP gives a lower bound for the ILP, the difference between the objective value of the returned integer solution and the lower bound is thus an upper bound certificate on its sub-optimality. In our experiments, we obtained small sub-optimality certificates for the returned integer solutions, indicating that our optimization algorithm based on column generation is stable, as summarized in Algorithm 1.
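The certificate described above is a simple difference, sketched here with hypothetical names. It assumes that `integer_objective` is the objective value of the retained integer solution and that `lower_bound` is a valid lower bound on the ILP, such as the optimal value of the MLP; the numbers in the example are invented.

```python
def suboptimality_certificate(integer_objective, lower_bound):
    """Upper bound on how far the integer solution can be from the ILP optimum."""
    gap = integer_objective - lower_bound
    if gap < -1e-9:
        raise ValueError("lower bound exceeds the integer objective")
    return max(gap, 0.0)


print(suboptimality_certificate(-41.5, -42.0))
```

A certificate of zero proves that the returned integer solution is optimal for the ILP, while a small positive value bounds the loss without requiring the ILP itself to be solved.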

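Input: the edge costs and track numbers for all commodities.
Output: the near-optimal integer solution to the problem (7) and its sub-optimality certificate.
Initialize: the initial path set consists of the shortest paths of all commodities with regard to the edge costs.
for each iteration up to ITERMAX do
    solve the RMLP defined by (20) on the current path set to get the optimal primal and dual solutions;
    (retain the integer solution)
    if the primal solution is integer then
        retain it, together with its objective value, if that value is the minimum among the integer solutions found so far;
    end if
    (find shortest paths)
    for each commodity do
        solve the pricing problem (21) to obtain the shortest path and its reduced cost;
    end for
    (optimality check)
    if the reduced cost is non-negative for all commodities then
        break;
    end if
    (augment the path set)
    add the shortest paths with negative reduced cost to the path set.
end for
return the retained integer solution and its sub-optimality certificate.
Algorithm 1  The Hybrid Data Association via Column Generation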
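The sub-optimality certificate returned by Algorithm 1 follows directly from two objective values. The sketch below is a minimal illustration; the function name and the example numbers are hypothetical and are not taken from the experiments.

```python
def suboptimality_certificate(integer_value: float, lp_lower_bound: float) -> float:
    """Upper bound on the distance between the retained integer solution
    and the ILP optimum, for a minimization problem.

    Because lp_lower_bound <= ILP optimum <= integer_value, the gap
    integer_value - lp_lower_bound bounds the sub-optimality from above."""
    gap = integer_value - lp_lower_bound
    if gap < -1e-9:
        raise ValueError("an integer solution cannot beat the LP lower bound")
    return max(gap, 0.0)


# A tight relaxation: the integer solution attains the LP bound.
tight = suboptimality_certificate(-12.0, -12.0)
# A loose relaxation: the integer solution is at most 0.5 from optimal.
loose = suboptimality_certificate(-11.5, -12.0)
```

A certificate of zero proves that the retained integer solution is optimal for the ILP, while a positive value only bounds how far from optimal it can be.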

V Experiments

In this section, we evaluate our approach on real-world videos to demonstrate its effectiveness. Specifically, the performance of our approach is analyzed in three aspects. (i) We evaluate the influence of the length of the temporal window on multi-object tracking performance for our hybrid data association framework; (ii) We compare the column generation (CG) solver introduced in this paper and the exact integer linear programming (ILP) solver in terms of sub-optimality, convergence speed, and MOTA score; (iii) We show that our approach produces superior tracking results over the state-of-the-art via both quantitative and qualitative evaluation.

Fig. 4: Influence of the length of the temporal window on tracking performance. Panels (a) MOTA, (b) IDS, and (c) FG plot the corresponding scores on the PETS dataset.

V-A Datasets

We use two publicly available benchmark datasets, i.e., the PETS 2009 dataset and the MOTChallenge 2015 dataset, for performance evaluation. The details are listed as follows.

V-A1 PETS 2009

The PETS 2009 dataset [ellis2009pets] shows an outdoor scene where numerous pedestrians enter, exit, and interact with each other frequently. The images of the dataset are recorded in pixels at fps. The major challenges of this dataset are frequent occlusions, caused either by interactions between people or by a static traffic sign. In addition to the widely used S2L1 and S2L2 sequences, we also evaluate our approach on the more challenging S2L3 sequence, which captures much denser crowds. The input detections and ground truth of these sequences are from Milan et al. [Milan:2014:CEM].

In our experiments, we use the PETS 2009 dataset for diagnostic analysis, including the investigation of the influence of the critical temporal window parameter (see Section V-D) and the comparison between the proposed CG solver and the ILP solver (see Section V-E). The reason is that the S2L1, S2L2, and S2L3 sequences from the PETS 2009 dataset correspond, respectively, to three representative application scenarios of multi-object tracking with low, high, and crowded object densities.

V-A2 MOTChallenge 2015

The MOTChallenge 2015 dataset gathers various existing and new challenging video sequences to evaluate the performance of multi-object tracking methods. Since our method performs tracking in image coordinates, we use the 2D MOT 2015 sequences in the MOTChallenge 2015. The sequences are composed of training and testing video sequences in which the challenges include camera motion, low viewpoint, varying frame rates, and severe weather conditions. The training sequences contain over frames ( minutes) and annotated trajectories ( bounding boxes). The benchmark releases the ground truth of the training sequences publicly, and thus one can use the training sequences to determine the set of system parameters. The testing sequences contain over frames ( minutes) and annotated trajectories ( bounding boxes), while the annotations are not available to avoid (over)fitting of the competing methods to the specific sequences.

Since it is hard for methods to fine-tune on such a large amount of data, we use the testing sequences from the MOTChallenge 2015 dataset for quantitative comparison against various state-of-the-art trackers in our experiments (see Section V-F). Moreover, the tracking results of all competing methods are automatically evaluated by the benchmark and the performance scores are published online, making the quantitative comparison strictly fair.

V-B Evaluation Metrics

We use the widely accepted CLEAR MOT performance metrics [keni2008evaluating] for performance evaluation. They include the multiple object tracking precision (MOTP), which measures the average overlap rate between estimated trajectories and the ground truth, and the multiple object tracking accuracy (MOTA), which is a cumulative accuracy combining false positives (FP), false negatives (FN), and identity switches (IDS). We also report performance scores defined by Li et al. [li2009learning], including the percentage of mostly tracked (MT) ground truth trajectories, the percentage of mostly lost (ML) ground truth trajectories, and the number of times that a ground truth trajectory is interrupted (Frag, denoted FG in the figures and tables). To be specific, a ground truth trajectory is determined to be mostly tracked if and only if it is covered by the estimated trajectories with percentage larger than , while a ground truth trajectory is determined to be mostly lost when the coverage percentage is less than . Additionally, we report the false positive ratio to account for the accuracy of identifying true targets, which is measured by the number of false alarms per frame (FAF). Here, means that higher scores indicate better results, and represents that lower is better.
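As a concrete illustration of these scores, the sketch below computes MOTA from accumulated error counts using the standard CLEAR MOT definition, and labels a ground truth trajectory by its coverage ratio. The function names, the default thresholds of 0.8 and 0.2 (values commonly used with these metrics), the label `PT` for the remaining trajectories, and the example counts are all assumptions made for illustration.

```python
def mota(fn: int, fp: int, ids: int, num_gt: int) -> float:
    """CLEAR MOT accuracy: 1 - (FN + FP + IDS) / (number of ground truth
    objects summed over all frames). It can be negative when the tracker
    makes more errors than there are ground truth objects."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + ids) / num_gt


def coverage_label(tracked_frames: int, total_frames: int,
                   mt_thresh: float = 0.8, ml_thresh: float = 0.2) -> str:
    """Label a ground truth trajectory by the fraction of its frames that
    is covered by estimated trajectories. The thresholds are assumed defaults."""
    ratio = tracked_frames / total_frames
    if ratio > mt_thresh:
        return "MT"  # mostly tracked
    if ratio < ml_thresh:
        return "ML"  # mostly lost
    return "PT"      # neither mostly tracked nor mostly lost


# Hypothetical counts: 200 ground truth boxes, 30 misses, 15 false alarms, 5 switches.
score = mota(fn=30, fp=15, ids=5, num_gt=200)
labels = [coverage_label(9, 10), coverage_label(5, 10), coverage_label(1, 10)]
```

Because the three error types are summed with equal weight, a tracker that lowers FN and IDS can reach a higher MOTA even when its FP count is moderately higher.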

V-C Appearance feature

As for the appearance features, we utilize the region-based CNN features proposed in [girshick2016region], where the deep neural network is trained on the ImageNet dataset and fine-tuned on the PASCAL VOC dataset. To obtain a more generic deep representation, we follow the strategy in [babenko2015aggregating] and use sum pooling to aggregate the output of the last convolutional layer, rather than directly using the features from the last fully-connected layer. For each detection region, the final feature vector is -dimensional with better time and space complexity. Considering that objects of interest tend to be located close to the geometrical center of an image, we also apply the centering prior to the sum pooling strategy to improve the accuracy, which assigns larger weights to the features from the center of the region.
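The pooling step can be sketched as follows. This is a minimal pure-Python illustration, assuming a Gaussian centering prior whose width is a fixed fraction of the smaller spatial dimension; the function name, the `sigma_scale` parameter, and the toy feature maps are hypothetical, and the actual pipeline may include further steps such as dimensionality reduction or normalization.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][column][channel]


def center_prior_sum_pool(fmap: FeatureMap, sigma_scale: float = 1.0 / 3.0) -> List[float]:
    """Sum-pool a convolutional feature map over its spatial positions,
    weighting each position by a Gaussian centred on the region centre."""
    height, width = len(fmap), len(fmap[0])
    channels = len(fmap[0][0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    # Assumed width of the prior: a fraction of half the smaller dimension.
    sigma = sigma_scale * min(height, width) / 2.0
    pooled = [0.0] * channels
    for y in range(height):
        for x in range(width):
            sq_dist = (y - cy) ** 2 + (x - cx) ** 2
            weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
            for c in range(channels):
                pooled[c] += weight * fmap[y][x][c]
    return pooled


def one_hot_map(row: int, col: int) -> FeatureMap:
    """3x3 single-channel map with one unit activation (toy input)."""
    return [[[1.0 if (y, x) == (row, col) else 0.0] for x in range(3)]
            for y in range(3)]


center_response = center_prior_sum_pool(one_hot_map(1, 1))
corner_response = center_prior_sum_pool(one_hot_map(0, 0))
```

The same activation contributes far more when it sits at the center of the region than at a corner, which is the effect the centering prior is meant to achieve for detections whose borders contain background.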

V-D Influence of large temporal window

The length of the temporal window determines the number of video frames in which the existing trajectories can find their associations, and is thus critical for the proposed hybrid association framework. Intuitively, taking more frames into account should be helpful for handling inaccurate detections and occlusions. To study the influence of the window length on multi-object tracking performance, we conduct an experiment with varying window lengths on the PETS dataset. Fig. 4 shows the MOTA, IDS, and FG scores as a function of the window length.

We can observe from Fig. 4 that enlarging the temporal window improves the overall performance and clearly reduces the number of ID switches and trajectory fragments, especially compared with the purely local method obtained with the shortest temporal window. This result indicates the importance of the data association across multiple frames, which our hybrid data association framework can leverage. As we claimed, integrating the local target-specific model with the global optimization over multiple frames is able to alleviate the irrecoverable errors caused by making decisions with only local information. Inaccuracy brought by false alarms and short-term occlusions can be effectively resolved to improve the multi-object tracking performance.

Fig. 5: Sub-optimality comparison between the CG and ILP solvers. The sub-optimality certificates are reported for each frame of the (a) PETS-S2L1, (b) PETS-S2L2, and (c) PETS-S2L3 sequences. The certificates provided by the CG solver are quite small (equal to zero in most of the cases) and comparable to the ILP solver, indicating that the CG solver is stable.

On the other hand, the performance decreases when the temporal window is unduly large. The reason is that the local consistency enforced by target-specific models becomes inaccurate with a long temporal distance. Specifically, due to appearance variations, the target-specific similarity functions obtained by online learning might be inaccurate when they are used to evaluate the object appearances coming from the future. Minimizing the edge costs in the multi-commodity network therefore becomes unreliable for producing consistent flows (trajectories). Similarly, the constant velocity model used to estimate the track start cost might provide unstable long-term predictions and thus degrade the tracking accuracy. To achieve a tradeoff between local consistency and global association, we select a single window length for our hybrid data association approach and keep it fixed throughout the following experiments.

V-E Solver comparison

Video; CG Solver: Run time (s), MOTA (%); ILP Solver: Run time (s), MOTA (%)
TABLE I: Comparison of tracking performance and convergence speed of the CG and ILP solvers on the PETS dataset.

In this paper, we introduce a column generation (CG) based solver for the min-cost multi-commodity flow problem arising in multi-object tracking. Alternatively, one can solve the problem directly using existing integer linear programming packages. To compare the proposed CG solver with the standard ILP solver, we report the sub-optimality certificates of the solutions provided by both solvers for the PETS dataset in Fig. 5. The sub-optimality certificates are computed as described in Section IV-B. For the ILP solver, we employ the commercial software Gurobi, which represents the state of the art in ILP.

Overall, the certificates provided by the CG solver are quite small (equal to zero in most of the cases) and comparable to the ILP solver, indicating that the CG solver is stable. As can be observed in Fig. 5(a), the CG solver provides zero certificates on each frame of the PETS-S2L1 sequence, while the ILP solver provides certificates very close to zero. This demonstrates that the CG solver finds the exact optimal integer solution to the min-cost multi-commodity flow problem when the ILP has a tight relaxation to an LP. For the situations where the ILP is not equivalent to an LP, caused by the close interactions of multiple objects, the CG solver provides a near-optimal solution in an efficient way by using a column generation process, as shown in Fig. 5(b) and Fig. 5(c).

To further demonstrate the superiority of the CG solver in terms of multi-object tracking, we report the tracking performance (the MOTA score) and convergence speed (the average run time per frame) of both the CG solver and the ILP solver for the three sequences with varying object densities in the PETS dataset. Results are shown in Table I. As can be observed, the CG solver achieves better results than the ILP solver at a significantly faster speed. For each sequence, the CG solver achieves higher MOTA scores than the ILP solver, indicating that the near-optimal solutions produced by the CG solver are more meaningful for multi-object tracking. This is owing to the path-flow reformulation involved in the CG solver, which establishes a direct connection between the solution and the estimated trajectories. Furthermore, favorable convergence speed is provided by the CG solver even though the number of objects increases quickly from the sequence PETS-S2L1 ( objects per frame) to PETS-S2L3 ( objects per frame).

V-F Comparison with the state-of-the-art

We now compare our approach with the state-of-the-art methods on the MOTChallenge 2015 dataset. The state-of-the-art methods are selected with available corresponding publications at the time of our submission to the test bench, including TC_ODAL [BaeY2014robust], RMOT [yoon2015bayesian], MDP [xiang2015learning], SCEA [hong2016online], TDAM [yang2016temporal], DP_NMS [pirsiavash2011globally], SMOT [dicle2013way], TBD [geiger20143d], CEM [Milan:2014:CEM], MotiCon [leal2014learning], SegTrack [milan2015joint], MHT_DAM [kim2015multiple], JPDA_m [hamid2015joint], TSMLCDE [wang2016tracklet], and NOMT [choi2015NOMT]. Note that the TC_ODAL, RMOT, MDP, SCEA, and TDAM trackers are local data-association methods; the NOMT tracker and our approach perform data association in a hybrid way, while the other trackers are global data-association methods.

Method MOTA[%] MOTP[%] FAF MT[%] ML[%] FP FN IDS FG
Local methods
TC_ODAL [BaeY2014robust]
RMOT [yoon2015bayesian]
MDP [xiang2015learning] 38.4
TDAM [yang2016temporal] 72.8 39.1 30,617
SCEA [hong2016online] 1.0 6,060
Global methods
DP_NMS [pirsiavash2011globally]
SMOT [dicle2013way]
TBD [geiger20143d]
CEM [Milan:2014:CEM]
MotiCon [leal2014learning]
SegTrack [milan2015joint] 737
MHT_DAM [kim2015multiple] 16.0
JPDA_m [hamid2015joint] 1.1 6,373 365
TSMLCDE [wang2016tracklet] 34.3±13.1 14.0
Hybrid methods
NOMT [choi2015NOMT]
HybridDAT 35.0±15.0 72.6 31,140 358
TABLE II: Quantitative comparison results of our approach (denoted as HybridDAT) with other state-of-the-art methods on the MOTChallenge 2015 dataset. We group the result listings into local, global, and hybrid methods. Bold scores highlight the best results while italic scores indicate the second best ones. (accessed on 7/6/2016)

Table II lists detailed quantitative comparison results on the MOTChallenge 2015 dataset, where the results are grouped into local, global, and hybrid data-association methods (the comparison is also available at the website of the MOTChallenge, http://motchallenge.net/results/2D_MOT_2015/). With only the provided detections and a simple dynamic model, our approach shows very competitive performance with the best MOTA score. This demonstrates that our approach performs favorably against the state-of-the-art and is suitable for various unconstrained environments. In particular, the MOTA score and the number of ID switches are substantially improved compared with both local and global data-association methods. This is attributed to the hybrid data association framework, which is able to find optimal associations for the existing trajectories over multiple video frames. Errors caused by inaccurate detections and occlusions, which are the most challenging issues in complex scenes, are significantly alleviated by our approach to produce consistent trajectories.

As expected, hybrid data-association methods perform better than both local and global methods by a large margin. This superior performance is mainly due to the integration of local target-specific models and global optimization over multiple frames. Compared with the local methods, hybrid data association takes multiple frames into account and is therefore much more stable against noise when association decisions are made. Moreover, compared with the global methods, hybrid data association utilizes the local target-specific models to ensure the local consistency of estimated trajectories, while retaining the ability to handle online data. Benefitting from the superiority of hybrid data association, the NOMT tracker also achieves good scores on the challenging dataset, as can be observed in Table II. In contrast, our approach produces clearly lower FN and IDS scores with a reasonable number of false alarms, and thus provides a better MOTA score. This is because our min-cost multi-commodity flow formulation models the multi-object tracking problem in a compact form and enables the efficient near-optimal solution to obtain more accurate trajectories.

On the other hand, our approach produces slightly more fragmented trajectories in return. The reason is that our approach performs multi-object tracking in an online manner, even though the global optimization over multiple frames is involved. Our approach tends to terminate a trajectory when it has no associated detections in the future frames, which increases the FG scores. The number of ID switches is significantly reduced due to the consideration of multiple future frames, as shown in Table II.

Several qualitative examples of tracking results produced by our approach on the MOTChallenge 2015 are shown in Fig. 6. Consistency of the estimated trajectories is indicated by bounding boxes of the same color on the same object over time. Our method is able to accurately track the objects despite the interference of abundant false positive detections, short-term occlusions, abrupt motions, etc. (Videos suitable for qualitative evaluation of the results across all frames are available at the website of the MOTChallenge http://motchallenge.net/results/2D_MOT_2015/, as well as the detailed tracking results provided by our approach and the state-of-the-art algorithms.)

Fig. 6: Sample tracking results of our approach on five representative testing video sequences of the MOTChallenge 2015 dataset (i.e., PETS09-S2L2, ETH-Jelmoli, ADL-Rundle-1, Venice-1, and KITTI-19). At each frame, we show the bounding boxes together with the past trajectories (last 30 frames). The color of the bounding boxes and trajectories indicates the ID of the tracked objects. Best viewed in color. (Refer to the tracking videos for more detailed results.)

VI Conclusion

In this paper, we have proposed a hybrid data association framework for multi-object tracking. Instead of only considering local associations between adjacent video frames, we explored the superior abilities of global optimization over multiple frames to carry out online tracking. It was formulated as a min-cost multi-commodity flow problem where the local target-specific information is modeled to cooperate with the global association. We employed a powerful online similarity learning algorithm to explicitly build target-specific appearance models to compute the edge costs of our multi-commodity network, improving the discriminative ability of the framework. In addition, we introduced an efficient and effective solution with empirical sub-optimality certificates, and validated its superiority in terms of multi-object tracking. Extensive experiments on various challenging datasets have demonstrated that our approach outperforms the state-of-the-art methods.

Our future work will explore more effective approaches to learn edge costs for the multi-commodity network since it is the most critical issue for good performance. Online similarity learning is just one example of using the appearance cue to compute edge costs, and we believe that our hybrid data association framework can be further improved in terms of multi-object tracking by introducing more useful cues such as motion and shape.