I Introduction
Online multi-object tracking aims to estimate the spatiotemporal trajectories of multiple objects in an online video stream (i.e., the video is provided frame by frame). It is a fundamental problem for numerous real-time applications, such as video surveillance, autonomous driving, and robot navigation. Assuming that an object detector is available to detect potential locations of multiple objects in each frame, the tracking problem reduces to a data association procedure that links these individual detections to form consistent trajectories.
Data association is challenging in many situations, especially in complex scenes, due to occlusions, inaccurate detections, and interactions among similar-looking objects. Standard approaches to data association recursively link detections frame by frame [breitenstein2011online, shu2012part, possegger2014Occlusion, xiang2015learning, solera2015learning, yang2016temporal, hong2016online, Milan2016arxiv], resulting in a bipartite matching between the existing trajectories and the newly obtained detections, as shown in Fig. 1(a). These approaches are temporally local and computationally efficient, making them suitable for the online setting. However, using only local information for data association might lead to irrecoverable errors when an object is undetected or is confused with clutter. To overcome this shortcoming, global data association over entire video frames (or a batch of frames) has been used to infer optimal trajectories [zhang2008global, berclaz2011multiple, luo2015automatic, chari2015pairwise, dehghan2015gmmcp, tang2015subgraph, wang2016tracklet, tang2016multi, maksai2016globally], as shown in Fig. 1(b). Such a data association problem can be solved in an optimization framework with carefully designed cost functions. Unfortunately, global association methods cannot be directly applied to online video streams. Overlapping temporal windows are a common choice for handling online data [berclaz2011multiple, chari2015pairwise, dehghan2015gmmcp], but the connection between consecutive batches remains an open problem.
In this paper, we propose a hybrid data association framework for online multi-object tracking that combines the advantages of both local and global data association methods. The core of our approach lies in the association between the existing trajectories and the detections from multiple video frames within a temporal window, as shown in Fig. 1(c). We exploit a min-cost multi-commodity flow defined on a cost-flow network constructed from the detections in multiple frames. The proposed min-cost multi-commodity network formulates a hybrid data association strategy that handles online data with an efficient near-optimal solution.
Concretely, in our framework all possible associations among the detections are represented by edges in the network, where the corresponding edge costs account for the association likelihoods. Each existing trajectory is then treated as a specific commodity, and its optimal associations can be found by sending the corresponding commodity flows through the network at minimum cost. To this end, the following three challenges need to be addressed: (i) identifying newly appeared objects automatically; (ii) computing edge costs for different commodities; (iii) solving the min-cost multi-commodity flow problem efficiently. By addressing these challenges, we make the following three contributions:

We introduce a dummy commodity into our network to automatically identify new objects. The dummy commodity corresponds to a target-independent model, and its commodity flows indicate the permissible tracks of objects that newly appear in a temporal window.

We present an online discriminative appearance modeling approach to build target-specific models for different existing trajectories. The edge costs of multiple commodities in the network are estimated by exploiting the target-specific information to discriminate a specific target from both other targets and the background.

We propose a near-optimal solution algorithm for the min-cost multi-commodity flow problem, and provide empirical certificates of its suboptimality. By using reformulation and a column generation strategy, our solution is efficient and performs well in multi-object tracking.
The proposed hybrid strategy offers several advantages over existing methods. First, it makes the global optimization of trajectories applicable to online data. The local association between consecutive frames is extended to account for more hypotheses from multiple frames. Irrecoverable errors caused by noisy detections or frequent occlusions can be alleviated to improve online tracking performance. Second, the target-specific information from the existing trajectories is explicitly modeled to guide the global optimization over the current batch of frames. In practice, it enforces local constraints that reduce the complexity of the optimization problem, as the associated detections are restricted to be consistent with the target-specific models. We believe that the techniques described in this paper are of wide interest due to their efficiency and performance. Both qualitative explanation and experimental confirmation are provided to support this claim.
The rest of the paper is organized as follows. Section II reviews the related work. In Section III, we describe the details of our online multi-object tracking method using hybrid data association, including the min-cost multi-commodity flow formulation and its edge costs. Section IV presents the near-optimal solution algorithm for our model. We report and discuss the experimental results in Section V, and conclude the paper in Section VI.
II Related Work
In multi-object tracking, data-association-based methods fall into a subdomain known as tracking-by-detection, which has shown impressive tracking performance in unconstrained environments. A thorough review can be found in [luo2014multiple]. As discussed in Section I, local association methods have attracted considerable research interest. In particular, following the success of recurrent neural networks (RNNs) in the computer vision community [karpathy2015visualizing], RNN-based methods have made significant advances on MOT problems. Building on the pioneering work of Ondruska and Posner [ondruska2016deep], RNN-based methods quickly attracted attention for modeling local association, and inspired a number of extensions including [MilanAAAIRNNTracking, alahi2016social, sadeghian2017tracking]. Nevertheless, RNNs usually come with high computational and memory demands during both training and inference. We instead explicitly enforce locality in the global data association formulation, and introduce a hybrid data association framework that integrates the advantages of both local and global association methods.
Maintaining locality for global data association is critical for multi-object tracking performance, since without local constraints global optimization might scale poorly to complex scenarios and long batches. Many global association methods enforce locality by iteratively optimizing trajectories [huang2008robust, yang2012multi, Milan:2014:CEM, Milan2016PAMI], or by using tracklets (i.e., short-term trajectory fragments) instead of individual detections [zhang2008global, yang2014multi, wang2016tracklet]. However, these strategies are hard to apply to online video streams. Alternatively, one can divide an online video stream into consecutive batches with temporal sliding windows, and apply global data association to each video batch [berclaz2011multiple, chari2015pairwise, dehghan2015gmmcp]. To produce consistent trajectories, the connection between optimized trajectories from adjacent batches needs to be considered. However, most existing methods adopt heuristic strategies to connect adjacent batches and cannot ensure the optimality of the trajectories.
To retain the ability to handle online data, we explicitly model the target-specific information from previous observations, as local data association methods do, and let it cooperate with the global data association over multiple frames. Integrating local and global data association is rarely addressed in the literature. Lenz et al. [lenz2015followme] proposed an approximate online solution to the min-cost network flow problem with bounded memory and computation. The local consistency, however, is ignored in the optimization of trajectories. Choi [choi2015NOMT] proposed a near-online multi-object tracking method that formulates the data association between previously tracked objects and detections in a temporal window. The method has a problem setting similar to ours; the difference is that it adopts a highly non-convex formulation to select appropriate hypotheses for the objects, and its solution relies heavily on both the affinity measures and the generated trajectory hypotheses. In contrast, we use a more compact formulation, i.e., the min-cost multi-commodity flow, to address the hybrid data association. The target-specific information contained in the existing trajectories is incorporated into the flow costs in a natural way, ensuring that the objective is still convex. We also propose an optimization algorithm for the network flow problem, and show its effectiveness in multi-object tracking.
Recently, multi-commodity flow has been introduced into multi-object tracking in [ben2014multi, dehghan2015target]. Ben Shitrit et al. [ben2014multi] employed the multi-commodity network to account for different appearance groups which are fixed beforehand. Each appearance group (e.g., a basketball team) is treated as a specific commodity in the network, and solving the multi-commodity flow problem distinguishes the different appearance groups during the optimization process. Dehghan et al. [dehghan2015target] focused on integrating object detector learning and multi-object tracking, where the multi-commodity network is used to track a fixed number of objects in a short video batch. Our approach differs from these methods in that we use a multi-commodity network to formulate a hybrid data association strategy for handling online data. Furthermore, a high-quality near-optimal solution to the min-cost multi-commodity flow problem can be obtained by an efficient algorithm, even when the number of objects (commodities) is relatively large. Thus we do not need to heuristically prune the graph [ben2014multi] or iteratively relax the hard constraints [dehghan2015target].
III Hybrid Data Association
Let denote the set of detections from the video with the th detection and the number of detections. Assume that, at each time step , we have a set of existing trajectories and observe multiple video frames in a temporal window . A set of detection responses is obtained by applying an object detector to each video frame within the temporal window. The task of hybrid data association is to find globally optimal associations of over the detections , and simultaneously identify newly appeared objects. Then the trajectory set is updated to by incorporating the associated detections at the frame , and the temporal window moves one time step forward, as shown in Fig. 2. In practice, it causes a latency of to output tracking results, as the trajectories at frame is not updated until the frame is observed. Nevertheless, our approach operates in a fully online manner and thus is capable of handling online data. Note that the traditional local or global data association methods can be regarded as special cases of the proposed hybrid framework by adjusting the length of the temporal window as or (total length of the video), respectively.
In this section, the data association between and is formulated as a mincost multicommodity flow problem, as in Fig. 3. For the convenience of discussion, we drop the time index in the following description, and denote the current set of existing trajectories as , where is the th existing trajectory and is the number of existing trajectories.
III-A Our min-cost multi-commodity flow
Given the set of existing trajectories and the set of detections , we introduce a directed network with multiple sources and sinks , . The directed network is constructed by the set of detections . Each detection corresponds to a pair of nodes in connected by an observation edge with cost and flow . The cost indicates the confidence of observing the detection , and the flow encodes the selection of the detection in some tracks. Each transition between a pair of detections is represented by a transition edge with cost and flow . The cost represents the coherence between detections and , and the flow indicates that the two detections are connected through the same track. The set of permissible transitions between detections is denoted as . It could be a subset of all pairs of detections in successive frames by using choice heuristics (e.g., spatial proximity). Finally, the source and sink are introduced with track start edges (with cost and flow ) and track termination edges (with cost and flow ). Then the multiobject tracking problem is formulated as sending a set of flows from the source to sink , which minimizes the total cost
(1) 
In this work, each existing trajectory is supposed to be a targetspecific commodity which corresponds to a sourcesink pair . Specifically, sources and sinks are introduced with track start edges and track termination edges connected to all detections, indicating that the existing trajectories or newly appeared trajectories are allowed to start and terminate at any detection from the temporal window. For each commodity , sending flows from to through the network incurs a specific set of edge costs. Formally, we use , , , and to represent the amount of the th commodity flows on the observation edge , the transition edge , the track start edges , and the track terminate edge , respectively. The corresponding edge costs, in a similar way, are denoted as , , , and .
To identify newly appeared objects, we add a dummy commodity with the source and sink to represent a targetindependent model. We call a flow sent from to the th commodity flow. That is, the source and sink are extended to account for multiple commodities (see an example in Fig. 3). Then the optimal associations of over can be found by sending the th commodity flow through the network. It leads to a multicommodity flow problem in the community of network flow [ahuja1993network].
With the network , the hybrid data association problem is formulated as finding an optimal set of flows between multiple source and sink pairs , which minimizes the total cost
(2) 
Intuitively, each flow path connects a set of coherent detections over time and thus can be interpreted as an object track. In practice, the flow should be subject to the following constraints to satisfy the physical conditions of the real world:
(3)  
(4)  
(5)  
(6) 
The constraint (3) is an edge capacity constraint, which means that each detection belongs to at most one track. The flow conservation constraint (4) encodes that the sum of flows arriving at any detection is equal to the flow of its observation edge , which is also the sum of outgoing flows from the detection . The constraints (3), (4), and (5) ensure that all permissible flows in the network come in the form of flow paths from sources to sinks, and also ensure that there is no overlap between multiple paths. The flow variables , , , act as binary indicators taking the value when the corresponding edge is selected in a flow path of the commodity . The constraint (6) restricts the total amount of flows sent from to to be a certain value . Consequently, each flow path in the network can be interpreted as an object track which connects a set of coherent detections over time. A flow path of commodity with is the success track of the existing trajectory within the temporal window. We thus set for to ensure that each existing trajectory has only one success track. For the dummy commodity, we set to capture a sufficient number of new objects.
To simplify the notation, we collect the flow variables , , , in a long vector and the edge cost variables , , , and in a long vector , respectively. Then the optimization problem that minimizes the cost (2) with constraints (3), (4), (5), and (6) can be rewritten as
(7) 
where the constraints are rearranged into the matrix form. The vectors with all zero and one entries are denoted as and , respectively.
III-B Computing edge costs
In our mincost multicommodity flow formulation, sending flows of a commodity through the network incurs a specific set of edge costs . Therefore, local information contained in the existing trajectories can be incorporated into the edge costs in a natural way, and thus guides the global data association over multiple video frames. In this subsection, we show that the edge costs can be computed by exploiting the targetspecific information from the existing trajectories.
III-B1 Observation cost
Given an existing trajectory and a detection , the observation cost encodes the possibility of belonging to . is computed by
(8) 
where is the similarity function used to recognize the specific object corresponding to , and and are the appearance features of the existing trajectory and the detection , respectively. We use Convolutional Neural Network (CNN) features to capture the appearance information of an object, as described in Section V-C. The appearance feature of is represented by the average feature vector over the last frames, and the appearance feature of is extracted from the image region corresponding to its location. The similarity function is designed to assign high similarity scores to pairs of appearance features when both of them originate from the object corresponding to , and low similarity scores when at least one of them originates from another object. We utilize an online similarity learning approach to learn the target-specific similarity function , as described in Section III-C. For the dummy commodity, we set to the negative detector score of the detection .
Note that the observation costs take negative values when the appearance similarity scores or the detector scores are larger than zero, which facilitates the generation of long trajectories. Furthermore, since the total cost of the network flows is minimized in our model, negative observation costs encourage appearance consistency within each trajectory.
III-B2 Transition cost
The transition cost indicates the confidence of connecting the detections and in the same success track of , which can be computed by
(9) 
where and are the appearance features of the detection and the detection , respectively. For the dummy commodity, the transition cost is computed by using the cosine of the angle between two appearance feature vectors as a target-independent similarity function.
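The target-independent similarity used for the dummy commodity can be illustrated with a short Python sketch. It only shows the cosine of the angle between two feature vectors; turning the similarity into a cost by negating it is an assumption made here for illustration (mirroring the negative observation costs described earlier), and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two appearance feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def dummy_transition_cost(feat_i, feat_j):
    """Hypothetical cost for the dummy commodity.

    Assumption: the cost is the negated similarity, so that visually
    similar detection pairs are cheap to connect.
    """
    return -cosine_similarity(feat_i, feat_j)
```

Because the cosine is invariant to the scale of the feature vectors, this similarity does not depend on any per-target model, which is what makes it usable before a trajectory exists.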
III-B3 Track start/termination cost
The track start cost encodes the possibility that a success track of the starts at the detection . Given the frame index of the detection , we use a constant velocity model to obtain a prediction of at frame , denoted as . Then the track start cost is given by
(10) 
where is a decay factor (set to 0.95) which discounts long term prediction, is the last associated frame of , and the function denotes the overlap rate between two bounding boxes. For the dummy commodity, we set the track start cost to be a large positive value (10 in our implementation) to reduce the priority of identifying new objects while facilitating the association of the existing trajectories.
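As a rough illustration of the ingredients of the track start cost, the sketch below combines a constant-velocity prediction, an overlap rate, and the 0.95 decay factor. It is not a reproduction of Eq. (10): the intersection-over-union overlap, the negated product, the box format `(x1, y1, x2, y2)`, and all function names are assumptions made for illustration.

```python
def overlap_rate(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def predict_box(last_box, velocity, frame_gap):
    """Constant-velocity prediction: shift the box by velocity * frame_gap."""
    dx, dy = velocity
    x1, y1, x2, y2 = last_box
    return (x1 + dx * frame_gap, y1 + dy * frame_gap,
            x2 + dx * frame_gap, y2 + dy * frame_gap)


def start_cost(last_box, velocity, last_frame, det_box, det_frame, decay=0.95):
    """Hypothetical start cost: discounted overlap, negated so that a good
    match between the prediction and the detection is cheap."""
    frame_gap = det_frame - last_frame
    predicted = predict_box(last_box, velocity, frame_gap)
    return -(decay ** frame_gap) * overlap_rate(predicted, det_box)
```

With this form, a detection that matches the prediction perfectly two frames after the last association would receive a cost of -0.95^2, and the discount grows with the length of the prediction horizon.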
Similarly, the track termination cost encodes the possibility that a success track of the ends at the detection . Assuming that an object trajectory ends at any detection with the same probability, we simply set for all .
III-C Online similarity learning
Given an existing trajectory , we learn a targetspecific similarity function to distinguish the corresponding object from the others. Formally, we use a parametric similarity function that has a bilinear form to estimate the appearance similarity between two appearance features and ,
(11) 
where with the dimensionality of appearance features. The task of online similarity learning is to estimate an appropriate parameter matrix for the existing trajectory in the process of the online tracking.
At each time , we assume that a detection from time , whose appearance feature is denoted as , is associated with the existing trajectory . The parameter matrix of at the current time needs to be updated to account for the newly observed appearance feature . The principle of updating is to recognize as a relevant appearance and as irrelevant appearances. We therefore construct a set of triplets , where is the appearance feature of at the current time . Each triplet indicates that the similarity between and is larger than the similarity between and . Forcing the current matrix to satisfy the triplet set leads to the updated matrix at time .
We here present an incremental update algorithm to satisfy the triplets sequentially [chechik2010large]. Without loss of generality, assume that we have a parameter matrix at the th iteration and observe a triplet . The goal of incremental updating is to obtain a new matrix satisfying
(12) 
which means that it fulfills the definition of a triplet with a safety margin of . Meanwhile, following the Passive-Aggressive algorithm [crammer2006online], the new matrix is chosen to remain close to the previous matrix in order to maintain smoothness.
We define a hinge loss function to measure the confidence that a matrix satisfies the triplet ,
(13) 
Then the problem of incremental updating can be expressed as
(14) 
where is the Frobenius norm, is a slack variable, and is a parameter that controls the tradeoff between preserving smoothness and minimizing the loss on the current triplet.
Since Eq. (14) is a constrained convex optimization problem, we can directly derive its optimal solution by using the Karush-Kuhn-Tucker (KKT) conditions,
(15) 
According to Eq. (15), the update only happens when the hinge loss on the triplet is larger than zero.
To summarize, for each existing trajectory at time , we incrementally update the similarity function parameterized by the matrix through the following steps:

construct the triplet set ;

sequentially update the matrix by using the triplets in one by one with Eq. (15);

obtain the updated matrix at the time .
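The steps above can be sketched in Python as follows. This is a minimal sketch of a Passive-Aggressive triplet update for a bilinear similarity in the style of [chechik2010large], and it is not guaranteed to match Eq. (15) term by term: the default unit margin, the default value of the trade-off parameter `C`, and the function names are assumptions.

```python
def bilinear(W, a, b):
    """Bilinear similarity a^T W b, with W stored as a list of rows."""
    return sum(a[i] * W[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))


def pa_update(W, p, p_pos, p_neg, C=0.1, margin=1.0):
    """One Passive-Aggressive step on the triplet (p, p_pos, p_neg).

    The matrix changes only when the hinge loss is positive, and the step
    size is capped by C so that the new matrix stays close to the old one.
    """
    loss = max(0.0, margin - bilinear(W, p, p_pos) + bilinear(W, p, p_neg))
    if loss == 0.0:
        return W  # triplet already satisfied with the margin
    diff = [x - y for x, y in zip(p_pos, p_neg)]
    V = [[pi * dj for dj in diff] for pi in p]  # gradient direction
    norm_sq = sum(v * v for row in V for v in row)
    if norm_sq == 0.0:
        return W  # degenerate triplet, nothing to learn
    tau = min(C, loss / norm_sq)
    return [[W[i][j] + tau * V[i][j] for j in range(len(W[i]))]
            for i in range(len(W))]


def update_trajectory_model(W, triplets, C=0.1):
    """Apply the update to every triplet of the current time step in turn."""
    for p, p_pos, p_neg in triplets:
        W = pa_update(W, p, p_pos, p_neg, C=C)
    return W
```

Starting from an identity matrix, each call touches only a rank-one direction, which is consistent with the observation that an update requires only a few matrix operations.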
Note that the parameter matrix of the existing trajectory is initialized to an identity matrix when the trajectory is initialized. The incremental update on each iteration, as defined by Eq. (15), involves only a few matrix operations and thus is efficient. Moreover, the online similarity learning process for each trajectory is independent and can be performed in parallel to further improve the computational efficiency.
IV Optimization
Finding a global minimum of the hybrid data association problem (7) is an Integer Linear Program (ILP), which is NP-hard. In addition, the optimal solution to its Linear Program (LP) relaxation is not guaranteed to be integral, whereas integrality is an important requirement for generating reasonable object trajectories. In this section, by exploiting the special structure of the constraints, we propose an efficient optimization algorithm that provides near-optimal integer solutions with empirical suboptimality certificates.
IV-A Dantzig-Wolfe decomposition
Since most constraints in the problem (7) involve only a single commodity, we use the Dantzig-Wolfe decomposition [dantzig1960decomposition] to reformulate the “relatively easy” constraints. Specifically, we consider the nonnegativity constraints and the flow conservation constraints that are exactly identical for each commodity . All feasible flow vectors can be treated as points lying on the polyhedron . It is a cone and has a single vertex and a finite number of rays . By the Minkowski-Weyl theorem [schrijver1998theory], we can represent a flow vector as
(16) 
where is the associated nonnegative coefficient. In our case, the rays form the basis of the null space defined by the constraint matrix in the flow conservation constraints , which correspond to indicator vectors of all possible paths from the source to the sink in our network.
Substituting the equation (16) into (7), we can rewrite the formulation as
(17) 
The formulation (17) can be seen as a path flow formulation that is equivalent to the original edge flow formulation (7). The variable is interpreted as the th commodity flow on the path corresponding to , indicating whether the path is selected by the th commodity or not.
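To make the path flow view concrete, the following Python sketch selects one candidate path per commodity so that no detection is shared and the total cost is minimal. It uses exhaustive enumeration on a toy instance purely for illustration; it is not the column generation algorithm of this paper, and the data layout and function name are assumptions.

```python
from itertools import product


def best_path_assignment(candidate_paths):
    """Pick one path per commodity with disjoint detections and minimum cost.

    candidate_paths maps a commodity id to a list of (detections, cost)
    pairs, where detections is a tuple of detection ids on that path.
    Returns (best_cost, {commodity: (detections, cost)}), or (None, None)
    when no conflict-free combination exists.
    """
    commodities = sorted(candidate_paths)
    best_cost, best_choice = None, None
    for choice in product(*(candidate_paths[k] for k in commodities)):
        used = [d for detections, _ in choice for d in detections]
        if len(used) != len(set(used)):
            continue  # a detection would belong to more than one track
        cost = sum(c for _, c in choice)
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, dict(zip(commodities, choice))
    return best_cost, best_choice
```

The number of combinations grows exponentially with the number of commodities, which is why enumerating the complete path set is impractical beyond toy examples.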
IV-B Column generation
Enumerating all possible paths to construct the complete set leads to a very large number of variables for optimization. In practice, only a few paths among are needed to achieve the optimal solution. We thus use the column generation [ford1958suggested] process to dynamically find the critical paths. In the following, we consider the LP relaxation of (17), denoted as the master LP (MLP), obtained by removing the integer constraints, and show later how to obtain a near-optimal integer solution.
Formally, the MLP problem can be expressed as
(18) 
where is the whole index set of all possible paths. The dual problem of the MLP, denoted as DMLP, has the form
(19) 
where are the dual variables of the primal variables . By duality theory, any feasible solution of the DMLP provides a lower bound on the optimal value of the MLP, which is the foundation of the column generation algorithm.
Assume that, at the iteration , only a subset of paths with is available. Solving the MLP on the subset gives rise to the restricted master linear program (RMLP),
(20) 
Let and be the optimal primal and dual solutions to the RMLP, respectively. We need to check whether the optimal solution to the RMLP is also optimal for the MLP, and decide whether the current path set needs to be augmented. This can be done by solving the following pricing problem:
(21) 
In our case, the pricing problem turns into a shortest path problem with regard to the modified edge cost for the commodity , which can be solved very efficiently by dynamic programming. With the optimal solution to the pricing problem, we have the following proposition.
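A dynamic-programming shortest path of this kind can be sketched in Python as below. The sketch assumes the network is acyclic and that nodes are supplied in topological order (for instance, ordered by frame index); how the modified edge costs are derived from the dual variables is left to the caller, and the function name and data layout are assumptions.

```python
def shortest_path_dag(nodes, edges, source, sink):
    """Shortest source-to-sink path in a directed acyclic graph.

    nodes: list of node ids in topological order.
    edges: dict mapping (u, v) to the (possibly negative) cost of u -> v.
    Returns (cost, path); cost is infinite and path is empty when the sink
    cannot be reached.
    """
    inf = float("inf")
    dist = {n: inf for n in nodes}
    prev = {}
    dist[source] = 0.0
    outgoing = {n: [] for n in nodes}
    for (u, v), cost in edges.items():
        outgoing[u].append((v, cost))
    # One pass in topological order relaxes every edge exactly once.
    for u in nodes:
        if dist[u] == inf:
            continue
        for v, cost in outgoing[u]:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                prev[v] = u
    if dist[sink] == inf:
        return inf, []
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[sink], path[::-1]
```

Negative edge costs are harmless here because the graph has no cycles, and the running time is linear in the number of nodes and edges.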
Proposition 1
If holds for all , the optimal primal solution to the RMLP optimally solves the MLP.
Given the optimal primal solution to the RMLP , we can validate that is a feasible solution to the MLP by setting for those paths not in the current set . Therefore, the optimal value of the RMLP gives an upper bound on the MLP,
(22) 
where and are the optimal value of the RMLP and the MLP, respectively.
Due to the definition of the pricing problem (21), when holds for all , we have
(23) 
It can be rewritten as
(24) 
which implies that the optimal dual solution to the RMLP is also a feasible solution to the DMLP given by (19). By duality theory, the solution provides a lower (dual) bound on the MLP, and we therefore have
(25) 
Note that the above equation uses the fact that the optimal primal solution and the optimal dual solution to the RMLP give exactly the same optimal value of the objective function.
With the equations (22) and (25), we can conclude that the RMLP and the MLP have the same optimal value if holds for all . Therefore, the optimal primal solution to the RMLP optimally solves the MLP. This completes the proof.
If the condition of Proposition 1 is not satisfied, i.e., for some , then the shortest path provided by the pricing problem (21) has a negative reduced cost. We introduce into the subset , and repeat the process for the next iteration to decrease the objective value of the MLP.
To obtain a near-optimal integer solution to the ILP (17), one can retain the feasible solution with the minimum objective value whenever the RMLP provides an integer solution during the column generation process (which happens very frequently in practice). Since the optimal value of the MLP gives a lower bound for the ILP, the difference between the objective value of the returned integer solution and this lower bound is an upper bound certificate on its suboptimality. In our experiments, we obtained small suboptimality certificates for the returned integer solutions, indicating that our optimization algorithm based on column generation is stable. The procedure is summarized in Algorithm 1.
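The bookkeeping behind such a certificate is simple, and a small Python sketch may help. It assumes that the objective value and integrality of each RMLP solution have already been recorded by some external LP solver; the function name and the history format are hypothetical.

```python
def suboptimality_certificate(rmlp_history, mlp_lower_bound):
    """Best integral objective seen so far and its gap to the LP bound.

    rmlp_history: list of (objective_value, is_integral) pairs, one per
    column generation iteration.
    mlp_lower_bound: optimal value of the LP relaxation, a lower bound on
    the integer optimum.
    Returns (best_integral_value, gap), or (None, None) if no iteration
    produced an integral solution.
    """
    integral_values = [value for value, is_integral in rmlp_history
                       if is_integral]
    if not integral_values:
        return None, None
    best = min(integral_values)
    # The true integer optimum lies between the bound and `best`, so the
    # returned solution is at most `gap` away from optimal.
    return best, best - mlp_lower_bound
```

A gap of zero would certify that the retained integer solution is optimal for the ILP; a small positive gap bounds how far from optimal it can be.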
V Experiments
In this section, we evaluate our approach on real-world videos to demonstrate its effectiveness. Specifically, the performance of our approach is analyzed in three aspects. (i) We evaluate the influence of the length of the temporal window, i.e., , on multi-object tracking performance for our hybrid data association framework; (ii) We compare the column generation (CG) solver introduced in this paper and the exact integer linear programming (ILP) solver in terms of suboptimality, convergence speed, and MOTA score; (iii) We show that our approach produces superior tracking results over the state of the art via both quantitative and qualitative evaluation.
V-A Datasets
We use two publicly available benchmark datasets, i.e., the PETS 2009 dataset and the MOTChallenge 2015 dataset, for performance evaluation. The details are listed as follows.
V-A1 PETS 2009
The PETS 2009 dataset [ellis2009pets] shows an outdoor scene where numerous pedestrians enter, exit, and interact with each other frequently. The images of the dataset are recorded in pixels at fps. The major challenges of this dataset are frequent occlusions, caused either by interactions between people or by a static traffic sign. In addition to the widely used S2L1 and S2L2 sequences, we also evaluate our approach on the more challenging S2L3 sequence, which captures much denser crowds. The input detections and ground truth of these sequences are from Milan et al. [Milan:2014:CEM].
In our experiments, we use the PETS 2009 dataset for diagnostic analysis, including the investigation of the influence of the critical parameter (see Section V-D) and the comparison between the proposed CG solver and the ILP solver (see Section V-E). The reason is that the S2L1, S2L2, and S2L3 sequences from the PETS 2009 dataset, respectively, correspond to three representative application scenarios of multi-object tracking with low, high, and crowded object densities.
V-A2 MOTChallenge 2015
The MOTChallenge 2015 dataset gathers various existing and new challenging video sequences to evaluate the performance of multi-object tracking methods. Since our method performs tracking in image coordinates, we use the 2D MOT 2015 sequences in the MOTChallenge 2015. The sequences are composed of training and testing video sequences in which the challenges include camera motion, low viewpoint, varying frame rates, and severe weather conditions. The training sequences contain over frames ( minutes) and annotated trajectories ( bounding boxes). The benchmark releases the ground truth of the training sequences publicly, and thus one can use the training sequences to determine the set of system parameters. The testing sequences contain over frames ( minutes) and annotated trajectories ( bounding boxes), while the annotations are not available to avoid (over)fitting of the competing methods to the specific sequences.
Since it is hard for methods to fine-tune on such a large amount of data, we use the testing sequences from the MOTChallenge 2015 dataset for quantitative comparison against various state-of-the-art trackers in our experiments (see Section V-F). Moreover, the tracking results of all competing methods are automatically evaluated by the benchmark and the performance scores are published online, making the quantitative comparison strictly fair.
V-B Evaluation Metrics
We use the widely accepted CLEAR MOT performance metrics [keni2008evaluating] for performance evaluation which include the multiple object tracking precision (MOTP) that measures average overlap rate between estimated trajectories and the ground truth, the multiple object tracking accuracy (MOTA) that is a cumulative accuracy combining false positives (FP), false negatives (FN) and identity switches (IDS). We also report performance scores defined by Li et al. [li2009learning], including the percentage of mostly tracked (MT) ground truth trajectories, the percentage of mostly lost (ML) ground truth trajectories, and the number of times that a ground truth trajectory is interrupted (Frag). To be specific, a ground truth trajectory is determined to be mostly tracked if and only if it is covered by the estimated trajectories with percentage larger than , while a ground truth trajectory is determined to be mostly lost when the coverage percentage is less than . Additionally, we report the false positive ratio to account for the accuracy of identifying true targets, which is measured by the number of false alarms per frame (FAF). Here, means that higher scores indicate better results, and represents that lower is better.
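For readers unfamiliar with these scores, the sketch below computes MOTA from accumulated error counts using the standard CLEAR MOT definition, and labels a ground truth trajectory from its coverage. The coverage thresholds are deliberately left as required arguments rather than given defaults; the function names are assumptions.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """Multiple object tracking accuracy from counts summed over all frames.

    MOTA = 1 - (FN + FP + IDS) / (number of ground truth boxes).
    The score can be negative when the errors outnumber the ground truth.
    """
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_boxes


def classify_coverage(coverage, mt_threshold, ml_threshold):
    """Label one ground truth trajectory from its covered fraction.

    coverage, mt_threshold, and ml_threshold are fractions in [0, 1];
    the thresholds must be supplied by the caller.
    """
    if coverage > mt_threshold:
        return "MT"  # mostly tracked
    if coverage < ml_threshold:
        return "ML"  # mostly lost
    return "partial"
```

Because all three error types enter the numerator with equal weight, a tracker can trade false positives for false negatives without changing MOTA, which is one reason the additional MT, ML, and Frag scores are reported alongside it.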
VC Appearance feature
For the appearance features, we use the region-based CNN features proposed in [girshick2016region], where the deep neural network is trained on the ImageNet dataset and fine-tuned on the PASCAL VOC dataset. To obtain a more generic deep representation, we follow the strategy in [babenko2015aggregating] and use sum pooling to aggregate the output of the last convolutional layer, rather than directly using the features from the last fully-connected layer. For each detection region, the final feature vector is dimensional with better time and space complexity. Considering that objects of interest tend to be located close to the geometrical center of an image, we also apply the centering prior to the sum pooling strategy to improve the accuracy, which assigns larger weights to the features from the center of the region.
VD Influence of large temporal window
The length of the temporal window () determines the number of video frames in which the existing trajectories can find their associations, and is thus critical for the proposed hybrid association framework. Intuitively, taking more frames into account should help in handling inaccurate detections and occlusions. To study the influence of the window length on multiobject tracking performance, we conduct an experiment with varying window lengths on the PETS dataset. Fig. 4 shows the MOTA, IDS, and FG scores as a function of the window length.
We can observe from Fig. 4 that enlarging the temporal window improves the overall performance and noticeably reduces the number of ID switches and trajectory fragments, especially compared with the purely local method when the length of temporal window is set to . This result indicates the importance of data association across multiple frames, which our hybrid data association framework can leverage. As we claimed, integrating the local target-specific model with the global optimization over multiple frames alleviates the irrecoverable errors caused by making decisions with only local information. Inaccuracies caused by false alarms and short-term occlusions can thus be resolved, which improves the multiobject tracking performance.
On the other hand, the performance decreases when the temporal window is unduly large (). The reason is that the local consistency enforced by target-specific models becomes inaccurate over a long temporal distance. Specifically, due to appearance variations, the target-specific similarity functions obtained by online learning might be inaccurate when they are used to evaluate object appearances from frames far in the future. Minimizing the edge costs in the multicommodity network therefore becomes unreliable for producing consistent flows (trajectories). Similarly, the constant velocity model used to estimate the track start cost might provide unstable long-term predictions and thus degrade the tracking accuracy. To achieve a tradeoff between local consistency and global association, we set for our hybrid data association approach and keep it fixed throughout the following experiments.
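The instability of long-term constant velocity predictions can be illustrated with a small sketch: any error in the estimated velocity is multiplied by the prediction horizon, so the positional error grows linearly with the number of frames. This is only a toy illustration under that assumption, not the track start cost itself, and the function names are hypothetical.

```python
import math


def predict_constant_velocity(position, velocity, horizon):
    """Extrapolate a 2D position `horizon` frames ahead, assuming constant velocity."""
    x, y = position
    vx, vy = velocity
    return (x + vx * horizon, y + vy * horizon)


def prediction_gap(position, true_velocity, estimated_velocity, horizon):
    """Euclidean distance between the predictions made with the true and the
    estimated velocity; equals horizon * ||true_velocity - estimated_velocity||."""
    tx, ty = predict_constant_velocity(position, true_velocity, horizon)
    ex, ey = predict_constant_velocity(position, estimated_velocity, horizon)
    return math.hypot(tx - ex, ty - ey)


if __name__ == "__main__":
    # A velocity error of one pixel per frame, evaluated at increasing horizons.
    for horizon in (1, 5, 10, 20):
        gap = prediction_gap((0.0, 0.0), (3.0, 4.0), (3.0, 3.0), horizon)
        print(horizon, gap)
```

In this toy setting, a one pixel per frame velocity error yields a ten pixel positional error at a horizon of ten frames, which is consistent with the observation that a very long temporal window degrades accuracy.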
VE Solver comparison
Video  CG Solver: Run time (s)  CG Solver: MOTA (%)  ILP Solver: Run time (s)  ILP Solver: MOTA (%)
PETSS2L1
PETSS2L2
PETSS2L3
In this paper, we introduce a column generation (CG) based solver for the min-cost multicommodity flow problem that arises in multiobject tracking. Alternatively, one can solve the problem directly using existing integer linear programming (ILP) packages. To compare the proposed CG solver with a standard ILP solver, we report the suboptimality certificates of the solutions provided by both solvers for the PETS dataset in Fig. 5. The suboptimality certificates are computed as described in Section IVB. For the ILP solver, we employ the commercial software Gurobi, which represents the state of the art in ILP.
Overall, the certificates provided by the CG solver are quite small (equal to zero in most cases) and comparable to those of the ILP solver, indicating that the CG solver is stable. As can be observed in Fig. 5(a), the CG solver provides zero certificates on each frame of the PETSS2L1 sequence, while the ILP solver provides certificates very close to zero. This demonstrates that the CG solver finds the optimal integer solution to the min-cost multicommodity flow problem when the ILP has a tight relaxation to an LP. In situations where the ILP is not equivalent to an LP, which are caused by close interactions of multiple objects, the CG solver efficiently provides a near-optimal solution through the column generation process, as shown in Fig. 5(b) and Fig. 5(c).
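The exact certificate is the one defined in Section IVB. As a loose illustration only, the sketch below assumes a common form of suboptimality certificate for a minimization problem: the gap between the cost of an integer-feasible solution and a lower bound such as the optimum of the LP relaxation. A zero gap certifies that the integer solution is optimal. The function names and the tolerance are hypothetical.

```python
def suboptimality_gap(integer_cost: float, lower_bound: float, tol: float = 1e-9) -> float:
    """Gap between a feasible integer solution and a lower bound (minimization).

    Assumed form of a certificate, not necessarily the one used in Section IVB.
    """
    gap = integer_cost - lower_bound
    if gap < -tol:
        # A feasible solution can never cost less than a valid lower bound.
        raise ValueError("lower bound exceeds the cost of the feasible solution")
    return max(gap, 0.0)


def is_certified_optimal(integer_cost: float, lower_bound: float, tol: float = 1e-9) -> bool:
    """True when the gap vanishes, i.e. the relaxation is tight for this instance."""
    return suboptimality_gap(integer_cost, lower_bound, tol) <= tol
```

Under this assumed definition, a frame on which the LP relaxation is tight gives a gap of zero, while close interactions among objects may leave a small positive gap that bounds how far the returned solution is from the optimum.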
To further compare the solvers in terms of multiobject tracking, we report the tracking performance (the MOTA score) and convergence speed (the average run time per frame) of both the CG solver and the ILP solver for the three sequences with varying object densities in the PETS dataset. Results are shown in Table I. As can be observed, the CG solver achieves better results than the ILP solver at a significantly faster speed. For each sequence, the CG solver achieves a higher MOTA score than the ILP solver, indicating that the near-optimal solutions produced by the CG solver are more meaningful for multiobject tracking. This is owing to the path-flow reformulation used in the CG solver, which establishes a direct correspondence between the solution and the estimated trajectories. Furthermore, the CG solver maintains a favorable convergence speed even though the number of objects increases quickly from the sequence PETSS2L1 ( objects per frame) to PETSS2L3 ( objects per frame).
VF Comparison with the state-of-the-art
We now compare our approach with the state-of-the-art methods on the MOTChallenge 2015 dataset. We select the methods whose corresponding publications were available at the time of our submission to the benchmark, including TC_ODAL [BaeY2014robust], RMOT [yoon2015bayesian], MDP [xiang2015learning], SCEA [hong2016online], TDAM [yang2016temporal], DP_NMS [pirsiavash2011globally], SMOT [dicle2013way], TBD [geiger20143d], CEM [Milan:2014:CEM], MotiCon [leal2014learning], SegTrack [milan2015joint], MHT_DAM [kim2015multiple], JPDA_m [hamid2015joint], TSMLCDE [wang2016tracklet], and NOMT [choi2015NOMT]. Note that the TC_ODAL, RMOT, MDP, SCEA, and TDAM trackers are local data-association methods, the NOMT tracker and our approach perform data association in a hybrid way, and the other trackers are global data-association methods.
Method  MOTA[%]  MOTP[%]  FAF  MT[%]  ML[%]  FP  FN  IDS  FG
Local
TC_ODAL [BaeY2014robust]
RMOT [yoon2015bayesian]
MDP [xiang2015learning]  38.4
TDAM [yang2016temporal]  72.8  39.1  30,617
SCEA [hong2016online]  1.0  6,060
Global
DP_NMS [pirsiavash2011globally]
SMOT [dicle2013way]
TBD [geiger20143d]
CEM [Milan:2014:CEM]
MotiCon [leal2014learning]
SegTrack [milan2015joint]  737
MHT_DAM [kim2015multiple]  16.0
JPDA_m [hamid2015joint]  1.1  6,373  365
TSMLCDE [wang2016tracklet]  34.3 ± 13.1  14.0
Hybrid
NOMT [choi2015NOMT]
HybridDAT  35.0 ± 15.0  72.6  31,140  358
Table II lists detailed quantitative comparison results on the MOTChallenge 2015 dataset, where the results are grouped into local, global, and hybrid data-association methods (the comparison is also available at the website of the MOTChallenge, http://motchallenge.net/results/2D_MOT_2015/). With only the provided detections and a simple dynamic model, our approach shows very competitive performance with the best MOTA score. This demonstrates that our approach performs favorably against the state-of-the-art and is suitable for various unconstrained environments. In particular, the MOTA score and the number of ID switches are substantially improved compared with both local and global data-association methods. We attribute this to the hybrid data association framework, which is able to find optimal associations for the existing trajectories over multiple video frames. Errors caused by inaccurate detections and occlusions, which are the most challenging issues in complex scenes, are significantly alleviated by our approach, resulting in consistent trajectories.
As expected, hybrid data-association methods perform better than both local and global methods by a large margin. This superior performance is mainly due to the integration of local target-specific models and global optimization over multiple frames. Compared with the local methods, hybrid data association takes multiple frames into account and is therefore much more stable against noise when association decisions are made. Moreover, compared with the global methods, hybrid data association utilizes the local target-specific models to ensure the local consistency of the estimated trajectories, while retaining the ability to handle online data. Benefiting from hybrid data association, the NOMT tracker also achieves good scores on the challenging dataset, as can be observed in Table II. In comparison, our approach produces noticeably lower FN and IDS scores with a reasonable number of false alarms, and thus provides a better MOTA score. This is because our min-cost multicommodity flow formulation models the multiobject tracking problem in a compact form and admits an efficient near-optimal solution, which yields more accurate trajectories.
On the other hand, our approach produces slightly more fragmented trajectories in return. The reason is that our approach performs multiobject tracking in an online manner, even though global optimization over multiple frames is involved. Our approach tends to terminate a trajectory when it has no associated detections in the future frames, which increases the FG score. The number of ID switches is significantly reduced due to the consideration of multiple future frames, as shown in Table II.
Several qualitative examples of tracking results produced by our approach on the MOTChallenge 2015 are shown in Fig. 6. Consistency of the estimated trajectories is indicated by bounding boxes of the same color on the same object over time. Our method is able to accurately track the objects despite the interference of abundant false positive detections, short-term occlusions, abrupt motions, etc. (Videos suitable for qualitative evaluation of the results across all frames are available at the website of the MOTChallenge http://motchallenge.net/results/2D_MOT_2015/, as well as the detailed tracking results provided by our approach and the state-of-the-art algorithms.)
VI Conclusion
In this paper, we have proposed a hybrid data association framework for multiobject tracking. Instead of only considering local associations between adjacent video frames, we explored the superior abilities of global optimization over multiple frames to carry out online tracking. It was formulated as a min-cost multicommodity flow problem where the local target-specific information is modeled to cooperate with the global association. We employed a powerful online similarity learning algorithm to explicitly build target-specific appearance models to compute the edge costs of our multicommodity network, improving the discriminative ability of the framework. In addition, we introduced an efficient and effective solution with empirical suboptimality certificates, and validated its superiority in terms of multiobject tracking. Extensive experiments on various challenging datasets have demonstrated that our approach outperforms the state-of-the-art methods.
Our future work will explore more effective approaches to learn edge costs for the multicommodity network since it is the most critical issue for good performance. Online similarity learning is just one example of using the appearance cue to compute edge costs, and we believe that our hybrid data association framework can be further improved in terms of multiobject tracking by introducing more useful cues such as motion and shape.