1 Introduction
Multiple object tracking, and in particular people tracking, is one of the key problems in computer vision, with potential impact on many applications such as video surveillance or crowd analysis [1]. A common approach to generate the trajectories of multiple people is tracking-by-detection: first, a person detector is applied to each individual frame to find the putative locations of people. Then, these hypotheses are linked across frames to form trajectories. By building on the advances in person detection over the last decade, tracking-by-detection has been very successful [15, 16, 37, 57]. However, the dependence on detection results, typically bounding boxes, is also a major limitation. A lot of potentially useful information is lost during non-maxima suppression. A tracker typically does not use image data directly, except in the form of appearance models that discriminate between different people. Recently, a number of approaches [14, 21, 44] have proposed to use other image features aside from full-body detections, with the main goal of recovering partially occluded pedestrians.
In this paper, we present a framework for offline multiple object tracking using two detector types, namely full-body detections together with head detections. Heads can be detected very accurately, as they are barely prone to pose variations or occlusions. This is especially useful in crowded scenarios: Fig. 1 shows a heavily occluded pedestrian. While the full-body detector is unable to detect that person due to the occlusion, the head is still visible, so that our tracker localizes the pedestrian correctly.
Our tracking formulation ensures long-term temporal consistency by taking all detections assigned to a person into account (we denote detection-to-person assignments as labelings). Our clustering concept therefore shares similarities with correlation clustering approaches [56, 57, 63, 16], but we propose a very efficient labeling formulation that avoids the exponential growth in the number of constraints. Thanks to our powerful solver, we are able to optimize our problem globally on the input detections, without the need for potentially error-prone tracklets.
We compute the best labeling by solving a Binary Quadratic Problem (BQP). A straightforward approach to solve that BQP would be to optimize an equivalent Binary Linear Program (BLP) using branch-and-bound. However, due to the high dimensionality of the problem, such a BLP is computationally expensive and memory demanding.
We propose to use the Frank-Wolfe algorithm (FW) to solve the relaxation of the BQP. With a standard implementation of FW, the result is often far away from the binary optimal solution. Therefore, we propose several crucial improvements that lead in practice to a much better solution in terms of the objective value and the tracking performance, as we show in Section 4.2. At the same time, the proposed algorithm is much faster than the standard branch-and-bound approach. Finally, an analysis of the fusion of head detections with full-body detections shows that the best tracking accuracy is obtained by using both input sources. The fusion especially helps to remove false positive full-body detections that are not consistent with the head detections and to recover heavily occluded persons.
1.1 Contributions
To summarize, our contribution is threefold:

We propose a novel detector fusion multi-object tracking system, which solves a graph labeling problem and is represented by a BQP with very few constraints.

We propose a new solver that significantly improves over standard BQP solvers when applied to our discrete optimization problem.

We present detailed evaluations of the improvements due to our solver as well as the detector fusion. Our framework sets a new state-of-the-art in tracking.
1.2 Related Work
Data association models. Tracking-by-detection has become the standard paradigm for multi-object tracking. It splits the problem into two steps: object detection and data association. In crowded environments, where occlusions are common, even state-of-the-art detectors [18, 19, 48, 62] are prone to false alarms and missed detections. The goal of the data association step is then to fill in the gaps between detections and to filter out false positives. In order to do this robustly, data association is mostly performed for all frames and all trajectories simultaneously. This is usually done in discrete space, using graph-based methods [25, 54, 64, 5, 47, 9] or BLPs [28, 12].
Most of these trackers were derived from a Markov chain model [64]. Recent systems utilize correlation clustering based formulations that ensure consistency within all links of a trajectory [16, 44, 56, 57, 58, 59, 63]. Initially, simplified models were used, which created trajectories iteratively by computing one best clique [63] or dominant set [59] corresponding to exactly one person and then removing the respective detections from the loop. This concept has been extended [16, 44] to obtain trajectories in a global manner, for all persons at the same time. However, the inference relies on potentially error-prone initial tracklets to keep the approach computationally feasible. In contrast, our solver is fast and accurate enough to optimize directly on the detections, thereby avoiding error propagation that might have been introduced by the tracklets. Further progress has been made by computing the correlation clustering directly on the input detections [56, 57], using a BLP with a huge set of clique constraints that grows exponentially. Accordingly, a heuristic solver has to be applied. In contrast, our formulation needs only very few constraints, making it suitable for large numbers of detections.
BQP Optimization. Tracking methods that need to solve a BQP have been rare so far, due to the computational challenge, although many advanced tracking models are naturally expressed as a BQP. For instance, the Markov model [64] can be augmented by one additional detector [12], resulting in a BQP. While this problem can be solved by rewriting the BQP as an equivalent BLP, we show in our experiments that this simple trick is not applicable to our more demanding correlation clustering based model, due to the problem size of our BQP. Another work [17] formulates online tracking via a BQP and solves it using the Frank-Wolfe algorithm, which is also the basis for our solver. While [17] shows good performance, we propose a hierarchical solving scheme that can be easily integrated into their formulation, thereby further improving their result. Furthermore, the Frank-Wolfe algorithm requires the step size for each iterate update to be computed. We derive an optimal, algebraic computation that is cheap to evaluate and improves over existing methods [2, 17, 35]. Note that our improvements may be applied to methods in other fields of computer vision as well, such as person re-identification [2], co-localization [29] or object segmentation [53].
Incorporating different features. Limiting the input of the tracker to a single detector clearly has several drawbacks, since much of the information in the image is not taken into account, potentially ignoring semi-occluded objects. In recent literature, several works have started incorporating different image features for the task of multi-target tracking. A few works use supervoxels as input for tracking, obtaining a silhouette of the pedestrian as a byproduct. In [14], the optimization is done via greedy propagation, while in [44], supervoxel labeling is formulated as a CRF.
There are several works that use dense point tracks (DPT) [10] or KLT [60, 42] together with detections to improve tracking performance. In [4], corner features are tracked using KLT to obtain a motion model between detections. In [20], multi-target tracking is tackled by clustering DPTs, which are further combined with detection-based tracklets in a two-step approach in [21]. Further improvement is achieved using a globally optimal fusion formulation [26].
In [12], a BQP fuses head and body detections to track pedestrians, modeling non-maxima suppression as well as overlap consistency between features. In contrast to our model, only co-occurrences of active features are considered, while we directly model the grouping of features to different persons, which allows us to ensure consistency within each cluster over long time periods. The per-person consistency is also not considered in the extension [53] to motion segmentation using superpixels.
2 Detector Fusion for Multi-Target Tracking
In this section, we describe the data association that couples multiple detectors and detections in a correlation clustering fashion to ensure long-term temporal consistency. As correlation clustering is NP-complete [3], we rely on finding a good approximation to the solution. We propose to use a BQP formulation for the clustering problem that can be well approximated using the Frank-Wolfe [22] solver. In particular, we compute the relaxed solution of the BQP first and obtain the binary solution by an efficient rounding step afterwards. Frank-Wolfe is well suited for continuous quadratic problems with linear constraints, as each iteration step involves solving a computationally efficient linear optimization problem.
When applied to a non-convex problem, like our model, the Frank-Wolfe algorithm delivers only a local optimum [33]. Hence, simply applying the standard algorithm will result in a solution that is far away from the global optimum. We thus focus on enhancing the solution of Frank-Wolfe by: (i) regularizing the cost function, (ii) computing the optimal step size within the solver's algorithm algebraically and (iii) introducing a hierarchical solving scheme that enhances the solution produced by the Frank-Wolfe algorithm.
Our regularizer prevents the Frank-Wolfe algorithm from falling too quickly into a local optimum. The hierarchical solving scheme gains its improvement by revoking or connecting clusters of the discretized solution, while guaranteeing that each such step is solved optimally. The presented approach is not specific to the Frank-Wolfe solver: it can be applied after any approximating algorithm, and it further allows errors introduced by the initial solver to be corrected.
Experiments in Sect. 4 show that our proposed solver provides good solutions close to the estimated bound, while being considerably faster than the commercial solver Gurobi [23], which uses the branch-and-bound/cut algorithm [36, 45] to find the globally optimal solution.
2.1 Joint Data Association
We cast the data association using two detectors as a graph labeling problem: Consider a weighted complete graph , where the vertex set consists of all input detections. We set . Each node has costs reflecting the likelihood of being a correct detection. An edge encodes a possible linking of two detections to the same person. The nodes are labeled , if and belong to person . Likewise, reflects how likely and belong to the same person.
Finally, the goal of the data association problem is then to find the labeling for all detection nodes that minimizes the total costs.
Hence, for each node , consider a decision variable that equals , if node has label , and otherwise. For being an upper bound on the number of persons, let
. Then, the vector
stacks all decision variables in a vector.
Given the unary and pairwise potentials
(1) 
and
(2) 
we define the cost function
(3) 
Finally, our tracking model BQP is described by the labeling problem:
(4) 
where and
(5) 
The constraints (5) ensure that each detection is assigned to at most one label , i.e. to at most one person. Note that BQP has only linear constraints.
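The labeling view of the problem can be illustrated with a small sketch. It assumes, since the equations are not reproduced here, that the cost of a labeling is the sum of the unary costs of all selected detections plus the pairwise costs of all detection pairs that share a label. The names `labeling_cost`, `to_indicator` and `is_feasible` and the toy costs are hypothetical and only serve the illustration.

```python
from itertools import combinations


def labeling_cost(labels, unary, pairwise):
    """Cost of a labeling under the assumed objective.

    labels[d] is the person label of detection d, or None if d is rejected.
    unary[d] is the node cost; pairwise[(d, e)] with d < e is the edge cost.
    """
    total = 0.0
    for d, lab in enumerate(labels):
        if lab is not None:
            total += unary[d]
    for d, e in combinations(range(len(labels)), 2):
        if labels[d] is not None and labels[d] == labels[e]:
            total += pairwise.get((d, e), 0.0)
    return total


def to_indicator(labels, num_labels):
    """Stack the binary decision variables: one row per detection, one column per label."""
    return [[1 if lab == l else 0 for l in range(num_labels)] for lab in labels]


def is_feasible(x):
    """Each detection carries at most one label and every entry is binary."""
    return all(sum(row) <= 1 and all(v in (0, 1) for v in row) for row in x)
```

In this representation a rejected detection corresponds to an all-zero row, which is why the constraint is an inequality ("at most one label") rather than an equality.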
We model the binomial distribution for the selection of nodes and edges using logistic regression. Then, finding the most likely selection is equivalent to solving BQP if the costs are defined via the logit function, see [56]. Therefore, we set the unary costs as
(6) 
where the probability of a detection is inferred from the detection's score. For the pairwise costs, we learn model parameters to obtain probabilities
(7) 
given and a feature vector . Since we model using logistic regression, the pairwise costs are
(8) 
In Sect. 3, we describe the model features used for the classifier.
Detections that are temporally too far apart cannot be compared reliably or meaningfully. For such edges, we set the weight to zero. This strategy effectively sparsifies the graph and keeps the proposed approach memory and computationally efficient.
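The mapping from probabilities to costs can be sketched as follows. The sketch assumes the cost is the negative log-odds log((1 - p) / p), as is common in correlation clustering formulations such as [56]; the exact form of Eqs. (6) and (8) is not reproduced here, and `max_gap` is a hypothetical parameter for the temporal cut-off.

```python
import math


def logit_cost(p):
    """Cost of selecting an item with probability p of being correct.

    Assumed form: log((1 - p) / p). Likely items (p > 0.5) get a negative
    cost (a reward), unlikely items get a positive cost (a penalty).
    """
    return math.log((1.0 - p) / p)


def edge_cost(p_same_person, frame_gap, max_gap):
    """Pairwise cost with sparsification: edges spanning too many frames cost zero."""
    if frame_gap > max_gap:
        return 0.0
    return logit_cost(p_same_person)
```

Under this assumption a probability of exactly 0.5 yields a cost of zero, so an uninformative edge neither encourages nor discourages a joint label, which is consistent with setting the weight of temporally distant edges to zero.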
2.2 Frank-Wolfe Optimization
Solving BQP is a challenging task, since it is an NP-hard problem [51] and our domain space is very high-dimensional. We thus follow common practice and consider the relaxed problem:
(9) 
However, even the relaxation is still NP-hard to solve [46], as the cost function is non-convex in general. Thus, even for commercial quadratic solvers like Gurobi [23], solving BQP or QP is computationally very expensive.
This paper proposes to use the Frank-Wolfe algorithm to approximate QP, and points out ways to further improve the solution. We present pseudocode of the standard Frank-Wolfe algorithm, together with a discretization step, in Alg. 1, and evaluate it as a baseline in Sect. 4.
Frank-Wolfe minimizes the linear approximation of the cost function at the current solution (Ln. 5 of Alg. 1). The next iterate is the point on the segment between the current solution and this minimizer that minimizes the cost function (Ln. 6 and Ln. 16). For the optimal step size in Ln. 6, we present an efficient algebraic description in Sect. 2.3. The algorithm is stopped in case of a small duality gap or once a maximal number of iterations is reached.
The binary solution equals either a binarized iterate (Ln. 11-15) or the minimizer of the linear subproblem (Ln. 7-10): as the constraint matrix corresponding to our feasible set is totally unimodular, that minimizer is already binary and thus feasible [52, 29].
In order to enhance the convergence rate, our implementation uses a slightly improved variant of the algorithm that adds so-called away-steps. We refer the interested reader to [34] for further details.
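The iteration described above can be sketched in a few lines. This is a minimal illustration of plain Frank-Wolfe with an exact line search, not the authors' Alg. 1: it omits away-steps and the discretization, and it assumes an objective consisting of unary costs `u[d]` plus pairwise costs `w[(d, e)]` for pairs that share a label. All function names are hypothetical.

```python
def pair_term(x, y, w):
    """Bilinear pairwise part: sum over edges (d, e) and labels l of w * x[d][l] * y[e][l]."""
    return sum(wt * sum(a * b for a, b in zip(x[d], y[e])) for (d, e), wt in w.items())


def objective(x, u, w):
    return sum(u[d] * sum(row) for d, row in enumerate(x)) + pair_term(x, x, w)


def gradient(x, u, w):
    g = [[u[d]] * len(row) for d, row in enumerate(x)]
    for (d, e), wt in w.items():
        for l in range(len(x[d])):
            g[d][l] += wt * x[e][l]
            g[e][l] += wt * x[d][l]
    return g


def frank_wolfe(u, w, num_labels, max_iter=100, eps=1e-9):
    n = len(u)
    x = [[0.0] * num_labels for _ in range(n)]  # feasible start: nothing selected
    for _ in range(max_iter):
        g = gradient(x, u, w)
        # Linear subproblem: per detection, pick the label with the most
        # negative gradient entry, or no label if none is negative.
        s = [[0.0] * num_labels for _ in range(n)]
        for d in range(n):
            best = min(range(num_labels), key=g[d].__getitem__)
            if g[d][best] < 0:
                s[d][best] = 1.0
        direction = [[si - xi for si, xi in zip(sr, xr)] for sr, xr in zip(s, x)]
        slope = sum(gi * di for gr, dr in zip(g, direction) for gi, di in zip(gr, dr))
        gap = -slope  # Frank-Wolfe duality gap
        if gap < eps:
            break
        # Along the segment the objective is f(x) + t * slope + t^2 * curv.
        curv = pair_term(direction, direction, w)
        step = min(1.0, gap / (2.0 * curv)) if curv > 0 else 1.0
        x = [[xi + step * di for xi, di in zip(xr, dr)] for xr, dr in zip(x, direction)]
    return x
```

Because the objective restricted to the segment is a one-dimensional quadratic, the step size has a closed form: the clipped vertex of the parabola if it opens upwards, and the far endpoint otherwise. Each iterate is a convex combination of feasible points and therefore stays feasible.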
Using and , we define
(10) 
Then, we obtain in matrixvector form:
(11) 
Due to the design of our problem BQP, we can run Alg. 1 without the need to store the huge matrix or the vector. Instead, all computations of Alg. 1 can be deduced from the upper triangular matrix of and from . Therefore, our approach is memory efficient.
BINARIZE: In order to obtain feasible, binary vectors, we discretize an iterate by selecting the closest feasible point in w.r.t. euclidean distance. To this end, let be the vector with all entries equal to . It is straightforward to show that
(12)  
(13) 
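The rounding step can be sketched as follows, assuming the feasible binary set allows at most one label per detection. Under that assumption, the closest feasible point in Euclidean distance is found row by row: a detection keeps its largest entry if that entry exceeds 1/2 and is rejected otherwise. The function name `binarize` mirrors the text; the tie-breaking at exactly 1/2 is a choice made here.

```python
def binarize(x):
    """Round each row of a relaxed solution to the closest feasible binary row.

    For a row r, the squared distance to the unit vector of label l is
    |r|^2 - 2 * r[l] + 1, and to the zero vector it is |r|^2. Hence the best
    label is the largest entry, and it beats the zero vector iff r[l] > 1/2.
    """
    rounded = []
    for row in x:
        best = max(range(len(row)), key=row.__getitem__)
        out = [0] * len(row)
        if row[best] > 0.5:
            out[best] = 1
        rounded.append(out)
    return rounded
```

For example, a row `[0.2, 0.7]` is rounded to the second label, while `[0.4, 0.4]` is rejected because no single label is closer than the all-zero assignment.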
2.3 Computing the Optimal Step Size
2.4 Regularization of the Objective Function
Since our cost function is non-convex, Frank-Wolfe delivers only a local optimum [33]. Given , our next proposed improvement is to replace the objective function by
(14) 
For , we have . Using has the effect of pushing the FW algorithm towards discrete solutions, as has its minimum at and , within . For , we observed better behavior in staying out of local optima, as for a value sufficiently large, becomes convex [8, 11, 24]. On the other hand, a high value brings the optimal solution too close to the constant vector. For , we set and . Starting with , we compute QP, using in Alg. 1. Empirically, we observed that a short number of iterations of Alg. 1 corresponds to a too strong convexification term, resulting in a bad local optimum. Thus, if Alg. 1 terminates in too few steps (which we set to 10), we set , and run Alg. 1 again with the updated function . In all our experiments, an appropriate was found in at most two calls of Alg. 1. In Sect. 4, we demonstrate the impact of using the modified cost function, with the solver we call .
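Why such a term can change the shape of the relaxed objective without changing the binary problem can be checked on a toy example. The sketch below uses a generic regularizer of the form mu * sum(v^2 - v), which vanishes on binary vectors; the sign convention and scaling of the parameter in Eq. (14) may differ, and the toy objective `f` is hypothetical.

```python
from itertools import product


def f(x):
    """Toy non-convex objective 2 * x0 * x1 (its Hessian has eigenvalues -2 and 2)."""
    return 2.0 * x[0] * x[1]


def f_reg(x, mu):
    """Add mu * sum(v^2 - v); the added term is zero whenever every v is 0 or 1."""
    return f(x) + mu * sum(v * v - v for v in x)


def is_midpoint_convex(func, points):
    """Check func((a + b) / 2) <= (func(a) + func(b)) / 2 for all sample pairs."""
    for a, b in product(points, repeat=2):
        mid = [(p + q) / 2 for p, q in zip(a, b)]
        if func(mid) > (func(a) + func(b)) / 2 + 1e-12:
            return False
    return True
```

With `mu = 1` the toy objective becomes (x0 + x1)^2 - (x0 + x1), which is convex, while its values on binary points are unchanged. A positive `mu` pulls the relaxed minimizer towards the constant vector 1/2, and a negative `mu` rewards entries at 0 and 1, which matches the trade-off described in the text.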
2.5 Hierarchical Solving Scheme
Since delivers only a local optimum, we propose a new hierarchical solving scheme that enhances the solution of by removing, correcting and connecting clusters, thus resulting in an improved objective value. Our approach is computationally efficient and continues optimizing problem BQP. Compared to other hierarchical approaches like [27] that define specific parameter changes in each iteration, our formulation is generic and can be applied to many clustering problems without the need of heuristically set parameter update rules.
In the following, we present all parts of our proposed solving scheme and present a pseudocode in Alg. 2.
CorrectionContraction: Let be the current best labeling of . Initially, we obtain using . We apply a relabeling strategy that corrects obvious errors within the clusters that may have been introduced due to the rounding or local optimality. For , let be the set of all adjacent nodes that have the same label as . If , or, if and , we assign a new and unique label to (see Fig.2 middle). Let be the set comprising all nodes labeled . We build a contracted graph by using these virtual, new nodes: We set and connects any two different vertices. Accordingly, we obtain the stacked decisions variables for the current labeling of .
LabelExpand: Let the current labeling result in clusters. To compute the optimal labeling on according to BQP, we define the unary costs
(15) 
and pairwise costs
(16) 
Consider the stacked decision variables where equals 1, if (and thus ) and if was not a rejected node by ; and otherwise. Then, assigns each node of a unique label, except for nodes that have been rejected by . Therefore, sums up only the unary costs (15), which equal or are improved by the refinement, implying
(17) 
Furthermore, solving BQP results in a solution with
(18) 
The result is converted to a labeling by graph expansion: All nodes are assigned the new label of , according to , see also Fig. 2. Thus, the hierarchical step can improve the last solution, since
(19) 
The graph contraction reduces the dimensionality significantly: There are nodes to be labeled using at most labels, w.r.t. BQP. If is small enough, we can solve BQP quickly to optimality using Gurobi [23]. Otherwise we use the solver. The algorithm is stopped once no new clusters are merged. We demonstrate the effect of the hierarchical solving scheme in Sect. 4.
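The contraction step can be sketched as below. It assumes the natural aggregation of costs, which may differ in detail from Eqs. (15) and (16): the unary cost of a virtual node is the sum of all unary and internal pairwise costs of its cluster, and the pairwise cost between two virtual nodes is the sum of all edge costs running between the two clusters. Nodes that belong to no cluster (rejected nodes) are simply skipped here; the function name `contract` is hypothetical.

```python
def contract(clusters, unary, pairwise):
    """Contract each cluster of detections into one virtual node.

    clusters: list of lists of detection indices.
    Returns (cluster_unary, cluster_pairwise) for the contracted graph.
    """
    owner = {d: k for k, members in enumerate(clusters) for d in members}
    c_unary = [sum(unary[d] for d in members) for members in clusters]
    c_pair = {}
    for (d, e), wt in pairwise.items():
        if d not in owner or e not in owner:
            continue  # edge touches a rejected detection
        a, b = owner[d], owner[e]
        if a == b:
            c_unary[a] += wt  # internal edge becomes part of the node cost
        else:
            key = (min(a, b), max(a, b))
            c_pair[key] = c_pair.get(key, 0.0) + wt
    return c_unary, c_pair
```

With this aggregation, giving every virtual node its own label reproduces the cost of the current labeling, so any labeling of the contracted graph with a lower cost is an improvement on the original problem as well.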
3 Regression Training
In the following, we introduce spatial and temporal costs, which describe how likely two detections within the same frame and between different frames, respectively, belong to the same person. For each cost type, we train a logistic regression model to obtain its weights, as described in Sect. 2.1.
For our tracking system, we consider two input sources: (i) head and (ii) full-body detections (see also Fig. 1).
Head detections. To obtain accurate head detections, we employ [55], which is based on Convolutional Neural Networks, and fine-tune it on the MOT16 training set [43].
Relative positioning. In order to obtain meaningful features between differently sized boxes, the features have to be formulated with respect to the different scales.
To this end, consider a person detection box with the positions of lower left, upper left and upper right corners and , respectively and . For a pixel , we obtain barycentric coordinates of w.r.t. , so that (see Fig. 3). We fix a standard box . Then, is mapped to , keeping the relative position as in . Now, all subsequent distance measurements are computed using the mapped position w.r.t. .
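The mapping to the standard box can be sketched as follows. The sketch computes barycentric coordinates of a pixel with respect to three box corners and re-applies them to the corners of a reference box; the corner coordinates of the standard box used as default (`std`) are a hypothetical choice, since the actual values are not given here.

```python
def to_standard_box(p, lower_left, upper_left, upper_right,
                    std=((0.0, 2.0), (0.0, 0.0), (1.0, 0.0))):
    """Map pixel p from a detection box to a standard box via barycentric coordinates.

    Solves p = la * A + lb * B + lc * C with la + lb + lc = 1 for the three
    given corners (A, B, C), then returns la * A' + lb * B' + lc * C' for the
    corresponding corners of the standard box.
    """
    ax, ay = lower_left
    bx, by = upper_left[0] - ax, upper_left[1] - ay
    cx, cy = upper_right[0] - ax, upper_right[1] - ay
    px, py = p[0] - ax, p[1] - ay
    det = bx * cy - cx * by  # non-zero for a non-degenerate box
    lb = (px * cy - cx * py) / det
    lc = (bx * py - px * by) / det
    la = 1.0 - lb - lc
    (sax, say), (sbx, sby), (scx, scy) = std
    return (la * sax + lb * sbx + lc * scx, la * say + lb * sby + lc * scy)
```

Since barycentric coordinates are invariant to the size and position of the box, the center of any detection box maps to the center of the standard box, and distances measured afterwards are comparable across detections of different scales.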
Spatial costs. We introduce two features that relate the position of the head to the full-body box.
For a pair of head and full-body detections, we mirror the head detection to the left half of the detection box, thereby making the position robust against different orientations of the person. From the MOT16 training data, we learned the expected relative position of a head w.r.t. the standard detection box, corresponding to a full-body detection of the same person. The first feature measures the distance between the detected and the expected position. The second feature uses the angle between the expected and the detected position, with the anchor at the box's center (see Fig. 3).
We set the spatial costs between detections from the same detector to a constant high value.
Temporal costs. Temporal costs are defined via correspondences of pixels between two frames. DeepMatching [61] (DM) provides such assignments, which are more reliable than spatio-temporal affinities, see [57]. Given two rectangles, DM samples a number of pixels in each of them and reports the correspondences found between them. When comparing two heads or two full-body detections, we use the correspondence-based features of [57]. As head detections are significantly smaller than full-body detections, we use only a single temporal feature between a head and a full-body detection. From the MOT16 training data, we learned the mean ratios between a head and a body detection w.r.t. width and height, respectively, if both belong to the same person. From these, we obtain two further features that relate the observed width and height ratios of a detected head and full-body pair to the learned means.
4 Experimental Results
In this section, we first analyze the gain in both speed and tracking performance achieved by our proposed solver. Next, we investigate the impact of the detector fusion on the tracking performance, using the training sequences of the challenging MOT16 benchmark [43]. This benchmark consists of 7 sequences for training and 7 for testing, with footage of crowded scenes. In the last experiment, we show our performance on the test sets of the benchmarks MOT16 and MOT17, where we achieve state-of-the-art performance. We evaluate our experiments using well-established tracking metrics [6, 41, 49].
4.1 Implementation Details
In our implementation, we set the temporal costs of two nodes that are more than a fixed number of frames apart to zero. The maximal number of labels is fixed. We process a sequence in batches with a bounded number of nodes. We stop the Frank-Wolfe iterations of Alg. 1 once the duality gap falls below a threshold or a maximal number of iterations is reached.
4.2 Frank-Wolfe Optimization
Our first experiment analyzes the impact of our modifications on the Frank-Wolfe optimization. To this end, we choose a representative batch of frames from the MOT16-13 training sequence and perform tracking using full-body detections only. The number of decision variables is the number of detections in the batch times the maximal number of labels. In Tab. 1, we show the number of iterations performed by the solver until the duality gap is below the defined threshold, the runtime, the final objective value, and the corresponding Multiple Object Tracking Accuracy (MOTA).
Method  Iters  Time[sec]  Obj Value  MOTA
FW  16  0.7  3060  14.2
FW+r  676  27  5481  26.8
FW+r+h  -  27+0.5  5925  27.5
Gurobi  -  1000  (5531)  24.9
Gurobi bound  -  1000  (5973)  -
Our proposed modification improves the objective value considerably compared to the standard Frank-Wolfe algorithm (FW). This naturally translates into an almost doubled MOTA, 14.2% vs 27.5%. Note also that the objective value comes very close to the global optimum. The commercial solver Gurobi [23], which uses the branch-and-bound algorithm, is still far away from the global optimum after 1000 seconds, while we obtain a much better energy in under 28 seconds (Tab. 1). While Gurobi was not able to compute the global optimum in the given time span, it delivers at each time step a lower bound (Gurobi bound) on the optimal value, showing that the optimal solution to the BQP cannot be better than the bound reported in Tab. 1.
The energy evolution of the different solvers is plotted in Fig. 4. Here we clearly see where FW stops (red line), how our modification FW+r improves the energy by a large margin (blue line), and how finally FW+r+h (green line) comes even closer to the estimated lower bound (purple line), as provided by Gurobi. In contrast, Gurobi (yellow line) converges much more slowly.
To separate the quality of our solver from the detections, we further evaluate the performance on ground-truth person detections for 40 frames of each MOT16 training sequence in Tab. 2, where we also report the (relative) duality gap to the optimal solution (GAP). The results show a consistent and large improvement of the hierarchical concept over FW+r. At the same time, the solutions are close to optimality w.r.t. the objective value and w.r.t. tracking performance. The sequences MOT16-05 and MOT16-11 both contain many partial occlusions, which makes it difficult for the DM features to be correct in every situation and thus results in lower tracking scores. This shows that a second type of detections (head detections) is necessary for high-quality tracking results. On the other hand, the solver reaches the perfect result on MOT16-09 (which has far fewer occlusions), thereby justifying our solver.
Seq  FW+r IDF1  FW+r ID  FW+r FM  FW+r MOTA  FW+r GAP  FW+r+h IDF1  FW+r+h ID  FW+r+h FM  FW+r+h MOTA  FW+r+h GAP
02  87.4  5  1  84.0  6.424  90.9  3  0  90.8  0.428
04  85.0  5  0  73.2  7.506  92.4  0  0  85.8  0.120
05  57.4  10  8  74.2  9.130  70.1  8  7  75.1  0.071
09  80.6  3  0  98.9  5.353  100.0  0  0  100.0  0.000
10  82.0  10  6  80.4  7.410  87.0  7  6  89.4  0.638
11  76.8  13  2  78.2  12.846  89.4  5  3  96.3  0.084
13  87.2  10  2  85.3  10.332  96.3  2  3  96.9  0.434
4.3 Ablation studies on head and body detections
We analyze how our formulation exploits the information from two detectors. For this experiment, we use all MOT16 training sequences and compare tracking with full-body detections only (B) against tracking with body and head detections (B+H). We use the body detections provided by the benchmark, while we train the head detector and the regression model on the MOT16 training sequences in a leave-one-out fashion.
In Tab. 3, we report several ablation results with (i) different inputs (body and heads) and (ii) different solvers. In particular, our proposed FW+r+h (Ours) is compared to tracking heads and bodies independently and then using our solver to fuse them. The tracklets are computed by our system (Ours-fusion) and by LP2D [39] (LP2D-fusion). We use the affinities as defined in Sect. 3, but set the spatial and temporal costs between two tracklets that originate from the same detector to a constant high value, as the tracklets already separate the persons (Sect. 3*). We further report the quality of the head trajectories, which we evaluated on the head ground-truth boxes.
Feature  Affinities  Solver  SolverID  MOTA  MT  FP  FN  IDs
H  2D dist  LP2D [39]  1  14.9  70  14829  50991  472
H  Sect. 3  Ours  2  16  70  14168  50959  331
B  2D dist  LP2D [39]  3  31.7  44  3557  71332  467
B  Sect. 3  Ours  4  33.0  76  11949  61603  378
B  [16]  GMMCP [16]  5  33.7  46  4053  68675  499
B+H  Sect. 3*  LP2D-fusion  6  33.0  54  3501  70163  358
B+H  Sect. 3*  Ours-fusion  7  34.2  87  11852  60401  376
B+H  Sect. 3  FW  8  31.1  75  5315  69563  1207
B+H  Sect. 3  FW+r  9  33.4  82  6497  66238  807
B+H  Sect. 3  FW+r+h (Ours)  10  38.2  86  4972  62935  372
B+H  Sect. 3  NLLMPa [40]  11  37.4  86  4954  63831  336

On full-body detections, our system performs comparably to SolverID 5, using their defined affinities. By using the two detectors, our system significantly improves almost all relevant tracking metrics, justifying our tracking framework (SolverID 10 vs 4). Due to the coupling of head detections with full-body detections, the number of false positives (FP) is more than halved and the system is less prone to partial occlusions, which results in an increase in the number of mostly tracked (MT) trajectories. Overall, the MOTA score increases by more than 5pp (percentage points). Performing the fusion directly on the input detections is clearly more effective than using initial tracklets: SolverID 6 and 7 use our solver and precomputed trajectories from SolverID 3 and 4, where the gain is no more than 1.3pp, justifying our fusion concept. Another heuristic solver [40] (SolverID 11) performed worse on the fusion than FW+r+h, using exactly the same graph. The comparison of SolverID 8-10 shows the improvement on MOT16 train due to the regularizer and the hierarchical step (up to 7.1pp on the MOTA score).
4.4 Benchmark Evaluation
We evaluate the tracking performance of our formulation with body and heads on the benchmarks MOT16 and MOT17, with the full-body detections provided by the benchmarks. Due to space constraints, we show some of the best performing published trackers in Tab. 4, as well as the worst performing tracker. For the full table of results, please visit the benchmark's website.
Our system produces slightly more identity switches. This can be resolved in future work with more advanced features that include a foreground/background mask in each detection, or in a post-processing step where tracklet consistency is checked, though this is beyond the scope of this paper. However, our proposed tracker performs on par with the state-of-the-art in terms of tracking accuracy on MOT16 and sets a new state-of-the-art on MOT17. Furthermore, the tracker won, together with [30], the MOT 2017 Tracking challenge at CVPR 2017 (https://motchallenge.net/MOT17_results_2017_07_26.html). Note that the MOTA metric is regarded as the most representative metric [38]. With our proposed formulation, we have the lowest ML (mostly lost) score among all trackers in both benchmarks, showing that we can recover more trajectories than any other tracker. Our MT score is also the highest on the MOT16 benchmark and ranks second on the MOT17 benchmark, demonstrating that we recover very long trajectories. In contrast, the GMMCP model approach is not able to produce long-term consistent trajectories, possibly due to erroneous initial tracklets that could not be connected (we used the official code of [16] to produce the results). We note that the LMP tracker uses very advanced and stable convolutional neural network image features that can reliably link boxes over 200 image frames, thus resulting in a better MOTA score.
Method  Rank  MOTA  IDF1  MT  ML  FP  FN  ID 
MOT16  
LMP [58]  1  48.8  51.3  18.2  40.1  6654  86245  481 
Ours  2  47.8  47.8  19.1  38.2  8886  85487  852 
NLLMPa [40]  3  47.6  47.3  17.0  40.4  5844  89093  629 
AMIR [50]  4  47.2  46.3  14.0  41.6  2681  92856  774 
NOMT [15]  5  46.4  53.3  18.3  41.4  9753  87565  359 
GMMCP [16]  15  38.1  35.5  8.6  50.9  6607  105315  937 
DP_NMS [47]  23  26.6  31.2  4.1  67.5  3689  130557  365 
MOT17  
Ours  1  51.3  47.6  21.4  35.2  24101  247921  2648 
MHT_DAM[31]  2  50.7  47.2  20.8  36.9  22875  252889  2314 
EDMT17[13]  3  50.0  51.3  21.6  36.3  32279  247297  2264 
GMPHD_KCF[32]  6  30.5  35.7  9.6  41.8  107802  277542  6774 
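The MOTA values in the table can be related to the error counts with the CLEAR MOT definition [6], MOTA = 1 - (FP + FN + ID) / GT, where GT is the number of ground-truth boxes. The following sketch recomputes two entries; the ground-truth counts of 182,326 boxes for the MOT16 test set and 564,228 for the MOT17 test set are not stated in this text and are assumed here from the public benchmark statistics.

```python
def mota(false_positives, false_negatives, id_switches, num_gt_boxes):
    """Multiple Object Tracking Accuracy as defined by the CLEAR MOT metrics."""
    errors = false_positives + false_negatives + id_switches
    return 1.0 - errors / num_gt_boxes


# Assumed ground-truth box counts of the test sets (not given in the text).
MOT16_TEST_GT = 182326
MOT17_TEST_GT = 564228

ours_mot16 = 100 * mota(8886, 85487, 852, MOT16_TEST_GT)
ours_mot17 = 100 * mota(24101, 247921, 2648, MOT17_TEST_GT)
```

Under these assumed counts, the recomputed values round to the MOTA scores listed for our tracker on both benchmarks, which also makes clear that false negatives dominate the error budget of all listed trackers.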
5 Conclusion
We presented a global formulation for multi-detector multi-target tracking and showed its state-of-the-art performance with head and full-body detectors. We proposed to cast the problem as a quadratic program, which is solved efficiently via the Frank-Wolfe algorithm. We improved the solver in three ways: (i) regarding time, by providing a complete and efficient computation of the optimal step size, (ii) regarding minimization, by a reformulation of the objective function that results in better discrete solutions, and (iii) by a hierarchical solving scheme that improves a feasible solution, often to near optimality, and yet is fast and easy to integrate.
The detector fusion delivered superior results when compared to single detector tracking, thus demonstrating the benefits of our formulation. The overall performance on two challenging tracking benchmarks showed state-of-the-art results.
References
 [1] A. Alahi, V. Ramanathan, and L. Fei-Fei. Socially-aware large-scale crowd forecasting. In CVPR, 2014.
 [2] S. M. Assari, H. Idrees, and M. Shah. Re-identification of humans in crowds using personal, social and environmental constraints. arXiv preprint arXiv:1612.02155, 2016.
 [3] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.
 [4] B. Benfold and I. Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.
 [5] J. Berclaz, F. Fleuret, E. Türetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. TPAMI, 2011.
 [6] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008.
 [7] D. P. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, 1999.
 [8] A. Billionnet, S. Elloumi, and A. Lambert. Extending the QCR method to general mixed-integer programs. Mathematical Programming, 131(1):381–401, 2012.
 [9] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
 [10] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010.
 [11] S. Burer and A. N. Letchford. Non-convex mixed-integer nonlinear programming: A survey. Surveys in Operations Research and Management Science, 17(2):97–106, 2012.
 [12] V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic. On pairwise costs for network flow multi-object tracking. In CVPR, 2015.
 [13] J. Chen, H. Sheng, Y. Zhang, and Z. Xiong. Enhancing detection model for multiple hypothesis tracking. In CVPRW, 2017.
 [14] S. Chen, A. Fern, and S. Todorovic. Multi-object tracking via constrained sequential labeling. In CVPR, 2014.
 [15] W. Choi. Near-online multi-target tracking with aggregated local flow descriptor. In ICCV, 2015.
 [16] A. Dehghan, S. Assari, and M. Shah. GMMCP tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In CVPR, 2015.
 [17] A. Dehghan and M. Shah. Binary Quadratic Programing for Online Tracking of Hundreds of People in Extremely Crowded Scenes. TPAMI, 2017.
 [18] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. TPAMI, 2014.
 [19] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
 [20] K. Fragkiadaki and J. Shi. Detection free tracking: Exploiting motion and topology for segmenting and tracking under entanglement. In CVPR, 2011.
 [21] K. Fragkiadaki, W. Zhang, G. Zhang, and J. Shi. Two-granularity tracking: Mediating trajectory and detection graphs for tracking under occlusions. In ECCV, 2012.
 [22] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
 [23] Gurobi Optimization, Inc. Gurobi Optimizer Reference Manual. 2015.
 [24] P. L. Hammer and A. A. Rubin. Some remarks on quadratic programming with 0-1 variables. Revue française d’informatique et de recherche opérationnelle. Série verte, 1970.
 [25] R. Henschel, L. Leal-Taixé, and B. Rosenhahn. Efficient multiple people tracking using minimum cost arborescences. In GCPR, 2014.
 [26] R. Henschel, L. Leal-Taixé, B. Rosenhahn, and K. Schindler. Tracking with multi-level features. arXiv preprint arXiv:1607.07304, 2016.
 [27] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In ECCV, pages 788–801, 2008.
 [28] H. Jiang, S. Fels, and J. Little. A linear programming approach for multiple object tracking. In CVPR, 2007.
 [29] A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In ECCV, 2014.
 [30] M. Keuper, S. Tang, Z. Yu, B. Andres, T. Brox, and B. Schiele. A multi-cut formulation for joint segmentation and tracking of multiple objects. arXiv preprint arXiv:1607.06317, 2016.
 [31] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg. Multiple hypothesis tracking revisited: Blending in modern appearance model. In ICCV, 2015.
 [32] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora. Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data. In IEEE AVSS Workshop, 2017.
 [33] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
 [34] S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In NIPS, 2015.
 [35] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013.
 [36] A. H. Land and A. G. Doig. An automatic method of solving discrete programming problems. Econometrica: Journal of the Econometric Society, 1960.
 [37] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese. Learning an image-based motion context for multiple people tracking. In CVPR, 2014.
 [38] L. Leal-Taixé, A. Milan, K. Schindler, D. Cremers, I. Reid, and S. Roth. Tracking the trackers: An analysis of the state of the art in multiple object tracking. arXiv preprint arXiv:1704.02781, 2017.
 [39] L. Leal-Taixé, G. Pons-Moll, and B. Rosenhahn. Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. In ICCV Workshops (1st Workshop on Modeling, Simulation and Visual Analysis of Large Crowds), 2011.
 [40] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR, 2017.
 [41] Y. Li, C. Huang, and R. Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In CVPR, 2009.
 [42] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.
 [43] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
 [44] A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid. Joint tracking and segmentation of multiple targets. In CVPR, 2015.
 [45] J. E. Mitchell. Branch-and-cut algorithms for combinatorial optimization problems. Handbook of Applied Optimization, pages 65–77, 2002.
 [46] P. M. Pardalos and S. A. Vavasis. Quadratic programming with one negative eigenvalue is NP-hard. Journal of Global Optimization, 1(1):15–22, 1991.
 [47] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
 [48] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [49] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, 2016.
 [50] A. Sadeghian, A. Alahi, and S. Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV, 2017.
 [51] S. Sahni. Computationally related problems. SIAM Journal on Computing, 1974.
 [52] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.
 [53] G. Seguin, P. Bojanowski, R. Lajugie, and I. Laptev. Instance-level video segmentation from object tracks. In CVPR, 2016.
 [54] F. Solera, S. Calderara, and R. Cucchiara. Learning to divide and conquer for online multi-target tracking. In CVPR, 2015.
 [55] R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In CVPR, 2016.
 [56] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multi-target tracking. In CVPR, 2015.
 [57] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Multi-person tracking by multicuts and deep matching. In ECCV Workshops (Benchmarking Multi-Target Tracking), 2016.
 [58] S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person re-identification. In CVPR, 2017.
 [59] Y. T. Tesfaye, E. Zemene, M. Pelillo, and A. Prati. Multi-object tracking using dominant sets. IET Computer Vision, 10(4):289–297, 2016.
 [60] C. Tomasi and T. Kanade. Detection and tracking of point features. 1991.
 [61] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, 2013.
 [62] F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In CVPR, 2016.
 [63] A. Zamir, A. Dehghan, and M. Shah. GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs. In ECCV, 2012.
 [64] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008.
5 Conclusion
We presented a global formulation for multidetector multitarget tracking, and showed its stateoftheart performance with head and fullbody detectors. We proposed to cast the problem into a quadratic program, which is solved efficiently via the FrankWolfe algorithm. We improved the solver in three ways; (i) regarding time by providing complete and efficient computation of the optimal step size and (ii) regarding minimization by a reformulation of the objective function, resulting in better discrete solutions. Finally (iii), we showed that our hierarchical solving scheme improves a feasible solution, often close to optimality and yet is easy to integrate and fast.
The detector fusion delivered superior results when compared to single detector tracking, thus proving the benefits of our formulation. The overall performance on two challenging tracking benchmarks showed stateoftheart results.
References
 [1] A. Alahi, V. Ramanathan, and L. FeiFei. Sociallyaware largescale crowd forecasting. In CVPR, 2014.
 [2] S. M. Assari, H. Idrees, and M. Shah. Reidentification of humans in crowds using personal, social and environmental constraints. arXiv preprint arXiv:1612.02155, 2016.
 [3] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(13):89–113, 2004.
 [4] B. Benfold and I. Reid. Stable multitarget tracking in realtime surveillance video. In CVPR, 2011.
 [5] J. Berclaz, F. Fleuret, E. Türetken, and P. Fua. Multiple object tracking using kshortest paths optimization. TPAMI, 2011.
 [6] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008.
 [7] D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
 [8] A. Billionnet, S. Elloumi, and A. Lambert. Extending the qcr method to general mixedinteger programs. Mathematical programming, 131(1):381–401, 2012.
 [9] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
 [10] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010.
 [11] S. Burer and A. N. Letchford. Nonconvex mixedinteger nonlinear programming: A survey. Surveys in Operations Research and Management Science, 17(2):97–106, 2012.
 [12] V. Chari, S. LacosteJulien, I. Laptev, and J. Sivic. On pairwise costs for network flow multiobject tracking. In CVPR, 2015.
 [13] J. Chen, H. Sheng, Y. Zhang, and Z. Xiong. Enhancing detection model for multiple hypothesis tracking. In CVPRW, 2017.
 [14] S. Chen, A. Fern, and S. Todorovic. Multiobject tracking via constrained sequential labeling. In CVPR, 2014.
 [15] W. Choi. Nearonline multitarget tracking with aggregated local flow descriptor. In ICCV, 2015.
 [16] A. Dehghan, S. Assari, and M. Shah. Gmmcptracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In CVPR, 2015.
 [17] A. Dehghan and M. Shah. Binary Quadratic Programing for Online Tracking of Hundreds of People in Extremely Crowded Scenes. TPAMI, 2017.
 [18] P. Dolĺar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. TPAMI, 2014.
 [19] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained partbased models. TPAMI, 2010.
 [20] K. Fragkiadaki and J. Shi. Detection free tracking: Exploiting motion and topology for segmenting and tracking under entanglement. In CVPR, 2011.
 [21] K. Fragkiadaki, W. Zhang, G. Zhang, and J. Shi. Twogranularity tracking: mediating trajectory and detections graphs for tracking under occlusions. In ECCV, 2012.
 [22] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
 [23] I. Gurobi Optimization. Gurobi Optimizer Reference Manual. 2015.
 [24] P. L. Hammer and A. A. Rubin. Some remarks on quadratic programming with 01 variables. Revue française d’informatique et de recherche opérationnelle. Série verte, 1970.
 [25] R. Henschel, L. Laura LealTaixé, and B. Rosenhahn. Efficient multiple people tracking using minimum cost arborescences. In GCPR, 2014.
 [26] R. Henschel, L. LealTaixé, B. Rosenhahn, and K. Schindler. Tracking with multilevel features. arXiv preprint arXiv:1607.07304, 2016.
 [27] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In European Conference on Computer Vision, pages 788–801. Springer, 2008.
 [28] H. Jiang, S. Fels, and J. Little. A linear programming approach for multiple object tracking. In CVPR, 2007.
 [29] A. Joulin, K. Tang, and F. F. Li. Efficient image and video colocalization with frankwolfe algorithm. In ECCV, 2014.
 [30] M. Keuper, S. Tang, Y. Zhongjie, B. Andres, T. Brox, and B. Schiele. A multicut formulation for joint segmentation and tracking of multiple objects. arXiv preprint arXiv:1607.06317, 2016.
 [31] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg. Multiple hypothesis tracking revisited: Blending in modern appearance model. In ICCV, 2015.
 [32] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora. Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multiobject tracking in video data. IEEE AVSS Workshop, 2017.
 [33] S. LacosteJulien. Convergence rate of frankwolfe for nonconvex objectives. arXiv preprint arXiv:1607.00345, 2016.
 [34] S. LacosteJulien and M. Jaggi. On the global linear convergence of frankwolfe optimization variants. NIPS, 2015.
 [35] S. LacosteJulien, M. Jaggi, M. Schmidt, and P. Pletscher. BlockCoordinate FrankWolfe Optimization for Structural SVMs. In ICML, 2013.
 [36] A. H. Land and A. G. Doig. An automatic method of solving discrete programming problems. Econometrica: Journal of the Econometric Society, 1960.
 [37] L. LealTaixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese. Learning an imagebased motion context for multiple people tracking. In CVPR, 2014.
 [38] L. LealTaixé, A. Milan, K. Schindler, D. Cremers, I. Reid, and S. Roth. Tracking the trackers: an analysis of the state of the art in multiple object tracking. arXiv preprint arXiv:1704.02781, 2017.
 [39] L. LealTaixé, G. PonsMoll, and B. Rosenhahn. Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. ICCV. 1st Workshop on Modeling, Simulation and Visual Analysis of Large Crowds, 2011.
 [40] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR, 2017.
 [41] Y. Li, C. Huang, and R. Nevatia. Learning to associate: Hybridboosted multitarget tracker for crowded scene. In CVPR, 2009.
 [42] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.
 [43] A. Milan, L. LealTaixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multiobject tracking. arXiv preprint arXiv:1603.00831, 2016.
 [44] A. Milan, L. LealTaixé, K. Schindler, and I. Reid. Joint tracking and segmentation of multiple targets. In CVPR, 2015.

[45]
J. E. Mitchell.
Branchandcut algorithms for combinatorial optimization problems.
Handbook of applied optimization, pages 65–77, 2002. 
[46]
P. M. Pardalos and S. A. Vavasis.
Quadratic programming with one negative eigenvalue is nphard.
Journal of Global Optimization, 1(1):15–22, 1991.  [47] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Globallyoptimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
 [48] S. Ren, R. G. K. He, and J. Sun. Faster rcnn: Towards realtime object detection with region proposal networks. NIPS, 2015.
 [49] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multitarget, multicamera tracking. In ECCV, 2016.
 [50] A. Sadeghian, A. Alahi, and S. Savarese. Tracking the untrackable: Learning to track multiple cues with longterm dependencies. ICCV, 2017.
 [51] S. Sahni. Computationally related problems. SIAM Journal on Computing, 1974.
 [52] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.
 [53] G. Seguin, P. Bojanowski, R. Lajugie, and I. Laptev. Instancelevel video segmentation from object tracks. In CVPR, 2016.
 [54] F. Solera, S. Calderara, and R. Cucchiara. Learning to divide and conquer for online multitarget tracking. In CVPR, 2015.
 [55] R. Stewart, M. Andriluka, and A. Y. Ng. Endtoend people detection in crowded scenes. In CVPR, 2016.
 [56] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multitarget tracking. In CVPR, 2015.
 [57] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Multiperson tracking by multicuts and deep matching. In ECCV Workshops  Benchmarking MultiTarget Tracking, 2016.
 [58] S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person reidentification. In CVPR, 2017.
 [59] Y. T. Tesfaye, E. Zemene, M. Pelillo, and A. Prati. Multiobject tracking using dominant sets. IET computer vision, 10(4):289–297, 2016.
 [60] C. Tomasi and T. Kanade. Detection and tracking of point features. 1991.
 [61] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, 2013.
 [62] F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In CVPR, 2016.
 [63] A. Zamir, A. Dehghan, and M. Shah. Gmcptracker: Global multiobject tracking using generalized minimum clique graphs. In ECCV, 2012.
 [64] L. Zhang, Y. Li, and R. Nevatia. Global data association for multiobject tracking using network flows. In CVPR, 2008.
References
 [1] A. Alahi, V. Ramanathan, and L. FeiFei. Sociallyaware largescale crowd forecasting. In CVPR, 2014.
 [2] S. M. Assari, H. Idrees, and M. Shah. Reidentification of humans in crowds using personal, social and environmental constraints. arXiv preprint arXiv:1612.02155, 2016.
 [3] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(13):89–113, 2004.
 [4] B. Benfold and I. Reid. Stable multitarget tracking in realtime surveillance video. In CVPR, 2011.
 [5] J. Berclaz, F. Fleuret, E. Türetken, and P. Fua. Multiple object tracking using kshortest paths optimization. TPAMI, 2011.
 [6] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008.
 [7] D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
 [8] A. Billionnet, S. Elloumi, and A. Lambert. Extending the qcr method to general mixedinteger programs. Mathematical programming, 131(1):381–401, 2012.
 [9] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
 [10] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010.
 [11] S. Burer and A. N. Letchford. Nonconvex mixedinteger nonlinear programming: A survey. Surveys in Operations Research and Management Science, 17(2):97–106, 2012.
 [12] V. Chari, S. LacosteJulien, I. Laptev, and J. Sivic. On pairwise costs for network flow multiobject tracking. In CVPR, 2015.
 [13] J. Chen, H. Sheng, Y. Zhang, and Z. Xiong. Enhancing detection model for multiple hypothesis tracking. In CVPRW, 2017.
 [14] S. Chen, A. Fern, and S. Todorovic. Multiobject tracking via constrained sequential labeling. In CVPR, 2014.
 [15] W. Choi. Nearonline multitarget tracking with aggregated local flow descriptor. In ICCV, 2015.
 [16] A. Dehghan, S. Assari, and M. Shah. Gmmcptracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In CVPR, 2015.
 [17] A. Dehghan and M. Shah. Binary Quadratic Programing for Online Tracking of Hundreds of People in Extremely Crowded Scenes. TPAMI, 2017.
 [18] P. Dolĺar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. TPAMI, 2014.
 [19] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained partbased models. TPAMI, 2010.
 [20] K. Fragkiadaki and J. Shi. Detection free tracking: Exploiting motion and topology for segmenting and tracking under entanglement. In CVPR, 2011.
 [21] K. Fragkiadaki, W. Zhang, G. Zhang, and J. Shi. Twogranularity tracking: mediating trajectory and detections graphs for tracking under occlusions. In ECCV, 2012.
 [22] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
 [23] I. Gurobi Optimization. Gurobi Optimizer Reference Manual. 2015.
 [24] P. L. Hammer and A. A. Rubin. Some remarks on quadratic programming with 01 variables. Revue française d’informatique et de recherche opérationnelle. Série verte, 1970.
 [25] R. Henschel, L. Laura LealTaixé, and B. Rosenhahn. Efficient multiple people tracking using minimum cost arborescences. In GCPR, 2014.
 [26] R. Henschel, L. LealTaixé, B. Rosenhahn, and K. Schindler. Tracking with multilevel features. arXiv preprint arXiv:1607.07304, 2016.
 [27] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In European Conference on Computer Vision, pages 788–801. Springer, 2008.
 [28] H. Jiang, S. Fels, and J. Little. A linear programming approach for multiple object tracking. In CVPR, 2007.
 [29] A. Joulin, K. Tang, and F. F. Li. Efficient image and video colocalization with frankwolfe algorithm. In ECCV, 2014.
 [30] M. Keuper, S. Tang, Y. Zhongjie, B. Andres, T. Brox, and B. Schiele. A multicut formulation for joint segmentation and tracking of multiple objects. arXiv preprint arXiv:1607.06317, 2016.
 [31] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg. Multiple hypothesis tracking revisited: Blending in modern appearance model. In ICCV, 2015.
 [32] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora. Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multiobject tracking in video data. IEEE AVSS Workshop, 2017.
 [33] S. LacosteJulien. Convergence rate of frankwolfe for nonconvex objectives. arXiv preprint arXiv:1607.00345, 2016.
 [34] S. LacosteJulien and M. Jaggi. On the global linear convergence of frankwolfe optimization variants. NIPS, 2015.
 [35] S. LacosteJulien, M. Jaggi, M. Schmidt, and P. Pletscher. BlockCoordinate FrankWolfe Optimization for Structural SVMs. In ICML, 2013.
 [36] A. H. Land and A. G. Doig. An automatic method of solving discrete programming problems. Econometrica: Journal of the Econometric Society, 1960.
 [37] L. LealTaixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese. Learning an imagebased motion context for multiple people tracking. In CVPR, 2014.
 [38] L. LealTaixé, A. Milan, K. Schindler, D. Cremers, I. Reid, and S. Roth. Tracking the trackers: an analysis of the state of the art in multiple object tracking. arXiv preprint arXiv:1704.02781, 2017.
 [39] L. LealTaixé, G. PonsMoll, and B. Rosenhahn. Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. ICCV. 1st Workshop on Modeling, Simulation and Visual Analysis of Large Crowds, 2011.
 [40] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition & node labeling: Problem, algorithms, applications. In CVPR, 2017.
 [41] Y. Li, C. Huang, and R. Nevatia. Learning to associate: Hybridboosted multitarget tracker for crowded scene. In CVPR, 2009.
 [42] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.
 [43] A. Milan, L. LealTaixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multiobject tracking. arXiv preprint arXiv:1603.00831, 2016.
 [44] A. Milan, L. LealTaixé, K. Schindler, and I. Reid. Joint tracking and segmentation of multiple targets. In CVPR, 2015.

[45]
J. E. Mitchell.
Branchandcut algorithms for combinatorial optimization problems.
Handbook of applied optimization, pages 65–77, 2002. 
[46]
P. M. Pardalos and S. A. Vavasis.
Quadratic programming with one negative eigenvalue is nphard.
Journal of Global Optimization, 1(1):15–22, 1991.  [47] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Globallyoptimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
 [48] S. Ren, R. G. K. He, and J. Sun. Faster rcnn: Towards realtime object detection with region proposal networks. NIPS, 2015.
 [49] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multitarget, multicamera tracking. In ECCV, 2016.
 [50] A. Sadeghian, A. Alahi, and S. Savarese. Tracking the untrackable: Learning to track multiple cues with longterm dependencies. ICCV, 2017.
 [51] S. Sahni. Computationally related problems. SIAM Journal on Computing, 1974.
 [52] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.
 [53] G. Seguin, P. Bojanowski, R. Lajugie, and I. Laptev. Instancelevel video segmentation from object tracks. In CVPR, 2016.
 [54] F. Solera, S. Calderara, and R. Cucchiara. Learning to divide and conquer for online multitarget tracking. In CVPR, 2015.
 [55] R. Stewart, M. Andriluka, and A. Y. Ng. Endtoend people detection in crowded scenes. In CVPR, 2016.
 [56] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multitarget tracking. In CVPR, 2015.
 [57] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Multiperson tracking by multicuts and deep matching. In ECCV Workshops  Benchmarking MultiTarget Tracking, 2016.
 [58] S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person reidentification. In CVPR, 2017.
 [59] Y. T. Tesfaye, E. Zemene, M. Pelillo, and A. Prati. Multiobject tracking using dominant sets. IET computer vision, 10(4):289–297, 2016.
 [60] C. Tomasi and T. Kanade. Detection and tracking of point features. 1991.
 [61] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, 2013.
 [62] F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In CVPR, 2016.
 [63] A. Zamir, A. Dehghan, and M. Shah. Gmcptracker: Global multiobject tracking using generalized minimum clique graphs. In ECCV, 2012.
 [64] L. Zhang, Y. Li, and R. Nevatia. Global data association for multiobject tracking using network flows. In CVPR, 2008.