1 Introduction
The task of visual multi-object tracking is to recover spatio-temporal trajectories for a number of objects in a video sequence. Tracking multiple objects, such as people or vehicles, has a wide range of applications from robotics to video surveillance [28]. Despite recent progress in the field [3, 5, 8, 20, 21, 22, 27], tracking remains a challenging problem, especially in crowded and cluttered scenes.
With the advances in object detection, “tracking-by-detection” has recently become a popular paradigm for object tracking [5, 8, 13, 17]. Given object detections in every frame of a video sequence, tracking is formulated as the selection and clustering of corresponding object detections over time. Such selection and clustering problems can be solved in an optimization framework using carefully designed cost functions. Given an appropriate cost function, tracking-by-detection is typically set up as a MAP estimation problem [29]. Among different formulations of this problem, min-cost network flow [2] is particularly attractive as it allows for optimal and efficient solutions [22].
The energy minimization approach to tracking enables global solutions to track selection and avoids early and error-prone local decisions. Moreover, it allows for principled modeling of interactions among different tracks. In the past, models of track interactions have been shown to improve human tracking in crowds [21], to identify unusual behavior [15], and to resolve ambiguous tracks [20, 22]. Such previous methods, however, either resort to local non-convex optimization [21, 15, 20] or use greedy methods to enforce interactions [22].
Unlike previous work, we propose to model track interactions within the min-cost network flow tracking approach. We introduce pairwise costs to the objective function and design a convex relaxation solution with an efficient rounding heuristic. Although our final integer solution can be suboptimal, our method is generic and empirically provides certificates of small suboptimality. Tracking results using two particular examples of pairwise costs discussed in this paper are illustrated in Figure 1.
In summary, this paper makes the following contributions:

We propose a new non-greedy approach to optimize pairwise terms within a min-cost network flow framework. Our solution is generic and allows the simultaneous optimization of any type of pairwise costs.

We propose a global optimization strategy with a convex relaxation that allows us to minimize pairwise costs using linear optimization, and a principled Frank-Wolfe style rounding procedure to obtain integer solutions with a certificate of suboptimality. The optimization procedure is empirically stable, allowing the practitioner to focus on modeling.

To illustrate our method, we propose two particular examples of pairwise costs: the first discourages significant overlaps between distinct tracks; the second models the spatial co-occurrence of different types of detections. This allows us to better model complex dynamic scenes with substantial clutter and partial occlusions.

Using our method, we show improved tracking results on several real-world videos. In addition, we propose a new strategy to evaluate tracking results that better measures the longevity of overlap between output tracks and ground truth.
This paper is organized as follows. Section 2 presents related work and an overview of our approach. Section 3 summarizes min-cost flow tracking. Section 4 describes our optimization framework with the pairwise costs presented in Sections 4.1.1 and 4.1.2. The optimization strategy is described in Section 5, with the initial quadratic optimization formulation in Section 5.1 and the subsequent linear relaxation in Section 5.2. Finally, we present results of our method and compare them to the state of the art on challenging datasets in Section 6, and conclude with a discussion in Section 7.
2 Related work
Recent approaches have formulated multi-frame, multi-object tracking as a min-cost network flow optimization problem [29, 22, 5], where the optimal flow in a connected graph of detections encodes the selected tracks. While earlier min-cost network flow optimization methods used linear programming, more recently proposed solutions include push-relabel methods [29], successive shortest paths [22, 5], and dynamic programming [22]. To ensure globally optimal and efficient solutions, previous methods have often restricted the cost to unary terms over all edges. While non-unary terms break the optimality of solutions in general, dependencies between detections have been enforced by greedy approaches, such as greedily eliminating the overlapping detections after each step of a sequential selection of distinct tracks in [22]. This non-global optimization approach, however, cannot recover from early suboptimal decisions.
Additional dependencies among detections can also be incorporated into min-cost network flow tracking by modifying the underlying graph structure. Butt and Collins [8] follow this approach and minimize the modified objective using Lagrangian methods. While the method works well for the particular type of cost introduced, generalizing it to new types of pairwise costs would require appropriate modifications of the graph structure, which is nontrivial in general. Moreover, combining multiple costs within such a framework would be difficult. In contrast, our framework allows the addition of terms without any modification to the underlying optimization framework.
Brendel et al. [7] and Milan et al. [20, 19] formulate the problem in a framework that first selects tracklets and then connects them using a learned distance measure [7] or a CRF [20, 19]. Long-term occlusions are handled in [7] by merging appearance and motion similarity. While [20, 19] propose to alternate between discrete and continuous optimizations in order to minimize several cost functions, the presence of two levels of optimization makes theoretical or empirical guarantees of optimality hard to give. Unlike these works, we use a convex relaxation that allows us to give an empirical guarantee of optimality for our solutions.
Other methods [17, 26, 27] use offline or online training to learn a similarity measure between tracklets. These methods do not provide any optimality guarantee, though. In addition, training might be difficult in some conditions. For example, online training to discriminate appearances might be erroneous when objects move very close to each other (Figure 1). We avoid such problems by using pairwise terms to robustify the tracker to detection errors.
Incorporation of pairwise terms into the min-cost network flow formulation has been previously attempted by Choi and Savarese [9]. Their work, however, is focused on jointly optimizing tracking and activity recognition. In contrast, we focus on tracking in particular, and propose a generic framework enabling the inclusion of multiple types of pairwise costs and providing empirical measures of small suboptimality.
2.1 Overview of our approach
We propose an algorithm that incorporates quadratic pairwise costs into the traditional min-cost flow network. Unlike previous methods [5, 17], which either build on top of min-cost flow solutions [20] or change the network structure [8], we propose a modification to the standard optimization algorithm. Such quadratic costs can represent several useful properties, such as similar motion of people in a rally or co-occurrence of tracks for different parts of the same object instance.
While obtaining the global optimum is NP-hard in this case [18], we outline an approach that obtains near-optimal solutions and empirically verify their quality. We present a linear relaxation of the quadratic term that is fast to optimize, followed by a Frank-Wolfe based rounding heuristic to obtain an integer solution.
3 Background: Min-cost flow tracking
In this section, we describe the traditional formulation of multi-object tracking as a min-cost flow optimization problem [29]. We extend this framework in Section 4.
Given a video with objects in motion, the goal is to simultaneously track moving objects in a “detect-and-track” framework [29]. The input to the approach is twofold. First, a set of candidate object locations is assumed to be given, provided, for example, as the output of an object detector. Henceforth we refer to these locations as detections. The approach also requires a measure of correspondence between detections across video frames. This could be obtained, for example, from optical flow or from some other form of correspondence. Based on these inputs, the tracking problem is set up as a joint optimization problem of simultaneously selecting detections of objects and connections between them across video frames. Such a problem can be modeled through a MAP objective [29] with specific constraints encoding the structure of the tracks. The MAP optimization problem can be cast as the following integer linear program (ILP):
(1)  
The above formulation encodes the joint selection of tracks using the following selection variables: is a binary indicator variable taking the value when the detection is selected in some track; is a binary indicator variable taking the value when detection and detection are connected through the same track in nearby time frames. The index ranges over possible detections across the whole video. denotes the cost of selecting detection in a specific frame (and represents the negative detection confidence) while represents the negative of the correspondence strength between detections and . The set of possible connections between detections is represented by and could be a subset of all pairs of detections in nearby frames by using choice heuristics (such as spatial proximity). The quality of track selection is quantified by the objective in (1).
The constraint , which has the structure of a flow conservation constraint [2], encodes the correct claimed semantic that can take the value if and only if both and take the value , and moreover, that each detection belongs to at most one track, enforcing the fact that two objects cannot occupy the same space. Finally, the constraint ensures that exactly tracks are selected (dummy “source” and “sink” variables with the fixed value are added; the connection variables and represent the start and end of tracks respectively).
We have grouped the linear constraints in (1) under the name as they actually correspond to constraints in a min-cost network flow problem where one would like to push units of flow with minimum cost in a network with unit capacity edges. In fact, these linear constraints have the property of being totally unimodular [2]. This implies that the polytope they determine has only vertices with integer coordinates, and so relaxing the integer constraints in (1) and solving it as a linear program is still guaranteed to produce integer solutions, making it a tight relaxation. Figure 2 illustrates the correspondence between a network flow structure and the formulation (1).
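For orientation, a program with the structure described above (linear detection and connection costs, flow conservation at every detection, a fixed number of tracks, binary variables) can be written as in the sketch below. This is an illustration in assumed notation, not a reproduction of the original equation: the symbols $x_i$ (detection selected), $y_{ij}$ (connection selected), $c_i$, $c_{ij}$ (their costs), $E$ (allowed connections, here taken to include the source and sink edges), $k$ (number of tracks) and the dummy nodes $s$, $t$ are all chosen here for readability.

```latex
\begin{equation*}
\begin{aligned}
\min_{x,\,y}\quad & \sum_{i} c_i\, x_i \;+\; \sum_{ij \in E} c_{ij}\, y_{ij} \\
\text{s.t.}\quad & x_i \;=\; \sum_{j:\, ij \in E} y_{ij} \;=\; \sum_{j:\, ji \in E} y_{ji}
      \qquad \forall i \quad \text{(flow conservation)}, \\
& \sum_{i} y_{si} \;=\; k \;=\; \sum_{i} y_{it}
      \qquad \text{(exactly $k$ tracks start and end)}, \\
& x_i \in \{0,1\}, \quad y_{ij} \in \{0,1\}.
\end{aligned}
\end{equation*}
```

Under this reading, the first two constraint lines are the network flow constraints, and dropping the last line gives the linear programming relaxation, which total unimodularity makes tight.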
To summarize, the above optimization problem with relaxed integer constraint can be solved efficiently using existing network flow or linear algebra packages [2], and provides a convenient framework to transform the tracking problem into a track selection problem. We use this conversion as a starting point and add further constraints and costs to the selection process, steering it toward desirable solutions in the challenging scenarios shown in later sections.
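To make the track selection view concrete, the following minimal sketch solves a toy instance with successive shortest paths, one of the solver families mentioned in Section 2. It assumes the usual node-splitting construction (each detection becomes an "in" and an "out" node joined by an edge carrying the detection cost) and unit capacities. The four detections, all cost values and the helper names `min_cost_flow`, `node_in` and `node_out` are hypothetical and only serve the illustration.

```python
INF = float("inf")

def min_cost_flow(num_nodes, edges, source, sink, k):
    """Send k units of flow through unit-capacity edges at minimum total cost.

    edges: list of (tail, head, cost). Returns (total_cost, used_flags),
    where used_flags[e] is True if edge e carries one unit of flow.
    """
    adj = [[] for _ in range(num_nodes)]
    head, cap, cost = [], [], []
    for u, v, w in edges:  # forward edge 2e and residual edge 2e+1
        adj[u].append(len(head)); head.append(v); cap.append(1); cost.append(w)
        adj[v].append(len(head)); head.append(u); cap.append(0); cost.append(-w)
    total = 0.0
    for _unit in range(k):
        # Bellman-Ford on the residual graph (costs may be negative).
        dist = [INF] * num_nodes
        pred = [-1] * num_nodes
        dist[source] = 0.0
        for _pass in range(num_nodes - 1):
            changed = False
            for u in range(num_nodes):
                if dist[u] == INF:
                    continue
                for e in adj[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[head[e]] - 1e-12:
                        dist[head[e]] = dist[u] + cost[e]
                        pred[head[e]] = e
                        changed = True
            if not changed:
                break
        if dist[sink] == INF:
            raise ValueError("the graph cannot carry k units of flow")
        v = sink
        while v != source:  # augment one unit along the shortest path
            e = pred[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = head[e ^ 1]
        total += dist[sink]
    used = [cap[2 * idx] == 0 for idx in range(len(edges))]
    return total, used

# Toy data (hypothetical): detections 0, 1 in frame 1 and 2, 3 in frame 2.
det_cost = [-1.0, -0.8, -0.9, -0.2]  # negative detection confidences
link_cost = {(0, 2): -0.5, (0, 3): -0.1, (1, 2): -0.1, (1, 3): -0.4}
SOURCE, SINK = 0, 1

def node_in(i):
    return 2 + 2 * i

def node_out(i):
    return 3 + 2 * i

edges, labels = [], []
for i, c in enumerate(det_cost):
    edges.append((SOURCE, node_in(i), 0.0)); labels.append(("start", i))
    edges.append((node_in(i), node_out(i), c)); labels.append(("det", i))
    edges.append((node_out(i), SINK, 0.0)); labels.append(("end", i))
for (i, j), c in link_cost.items():
    edges.append((node_out(i), node_in(j), c)); labels.append(("link", (i, j)))

total, used = min_cost_flow(2 + 2 * len(det_cost), edges, SOURCE, SINK, k=2)
selected = sorted(lab[1] for lab, u in zip(labels, used) if u and lab[0] == "det")
links = sorted(lab[1] for lab, u in zip(labels, used) if u and lab[0] == "link")
print(total, selected, links)
```

The "det" edges play the role of the detection selection variables and the "link" edges the role of the connection variables; every unit of flow from source to sink traces one track.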
4 Modeling pairwise costs with an IQP
The above formulation in (1) represents a linear objective with linear equality constraints (where the integer constraint is not needed). While linear terms are both simple and easy to minimize, higher order models can represent more useful properties [21]. We propose adding a quadratic cost between pairs of selection variables. To simplify the notation for the optimization sections, we collect the and variables in a long vector . The product then encodes joint selection of and . These choices could correspond to a pair of connections, a pair of detections, or even a connection and a detection. A term of the form can then either encourage or discourage the joint selection of and by having negative or positive, respectively. Our approach is to consider a small set of such joint selections, and add the term to the objective. Our new optimization problem can thus be expressed as the integer quadratic program (IQP):
(2)  
where the matrix is sparse with for .
Unfortunately, the above formulation can encode the quadratic assignment problem, which is NP-hard to optimize in general [18]. Nevertheless, we propose an efficient (convex) linear relaxation in Section 5 as well as a powerful rounding heuristic that provides empirical certificates of suboptimality. Our main modeling strategy is thus twofold: first, we encode our prior knowledge about the joint selection of variables using the sparse cost matrix (which can be arbitrary); second, we can add additional constraints to the IQP as long as they can be encoded as network flow constraints (this is a requirement of our rounding heuristic presented in Section 5.3). In the rest of this section, we provide two examples of pairwise costs used in our experiments. We then focus on the optimization aspects in Section 5.
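The effect of a sparse pairwise term can be seen on a tiny example. The sketch below evaluates a linear-plus-pairwise objective for binary selection vectors, with the pairwise weights stored as a dictionary over index pairs (each unordered pair listed once). The function name `iqp_objective`, the weights and the three-variable instance are hypothetical, and the flow constraints are deliberately ignored so that the influence of the pairwise terms is isolated.

```python
from itertools import product

def iqp_objective(z, c, q):
    """Linear cost plus sparse pairwise cost for a binary selection vector z."""
    linear = sum(ci * zi for ci, zi in zip(c, z))
    pairwise = sum(w * z[i] * z[j] for (i, j), w in q.items())
    return linear + pairwise

c = [-1.0, -0.9, -0.8]   # unary costs: every variable looks attractive alone
q = {
    (0, 1): 2.0,         # positive weight: discourage selecting 0 and 1 together
    (0, 2): -0.5,        # negative weight: encourage selecting 0 and 2 together
}

candidates = list(product((0, 1), repeat=len(c)))
best_linear = min(candidates, key=lambda z: iqp_objective(z, c, {}))
best_pairwise = min(candidates, key=lambda z: iqp_objective(z, c, q))
print(best_linear, best_pairwise)
```

With the unary costs alone all three variables are selected; once the pairwise weights are added, the minimizer drops variable 1 and keeps the encouraged pair.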
4.1 Designing pairwise costs
In the following subsections, we show how some traditional constraints [15, 21] can be incorporated in our quadratic min-cost network flow framework. We focus on elements that cannot be simply encoded with traditional linear terms in (1).
4.1.1 Overlap penalty
Object detectors often produce multiple responses per object. This issue is typically addressed by a non-maximum suppression (NMS) step, which retains the most confident detections within spatial neighborhoods. While NMS works well for tracking isolated objects, the independent decisions produced by NMS for each object and frame often become suboptimal in crowded scenes where multiple objects may occupy the same spatial neighborhood. To address this problem, we avoid taking independent decisions and propose to discourage overlapping detections within the network flow tracking framework. For this purpose we extend the cost function with the following pairwise overlap cost:
(3)  
where and represent two selection variables associated with sufficiently overlapping detections and (the overlap threshold is set to 0.5 in our experiments).
In previous approaches like [22], NMS was implemented in a greedy fashion. Greedy approaches, however, have the disadvantage of making irreversible decisions in the early stages of optimization. In contrast, our approach of incorporating the cost (3) into the overall cost function ensures that NMS is optimized simultaneously with other tracking objectives. As a result, overlapping detections may be tolerated, for example, in situations when two tracks intersect. On the other hand, continuously overlapping tracks resulting from multiple outputs of detectors will be discouraged.
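A minimal sketch of how the set of penalized pairs could be assembled is given below. It assumes that overlap is measured by the Jaccard similarity (intersection over union) of bounding boxes, that only detections in the same frame are paired, and that every qualifying pair receives the same constant penalty; these choices, the `penalty` value and the function names are assumptions for illustration. Only the 0.5 threshold is taken from the text.

```python
from itertools import combinations

def iou(a, b):
    """Jaccard similarity of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def overlap_costs(detections, threshold=0.5, penalty=1.0):
    """Return {(i, j): penalty} for same-frame detections overlapping above threshold.

    detections: list of (frame, box). The constant penalty is hypothetical.
    """
    costs = {}
    for (i, (fi, bi)), (j, (fj, bj)) in combinations(enumerate(detections), 2):
        if fi == fj and iou(bi, bj) > threshold:
            costs[(i, j)] = penalty
    return costs

detections = [
    (0, (0, 0, 10, 10)),    # two near-duplicate responses on one person
    (0, (1, 0, 11, 10)),
    (0, (20, 20, 30, 30)),  # a different person in the same frame
    (1, (0, 0, 10, 10)),    # same location, but in the next frame
]
pairs = overlap_costs(detections)
print(pairs)
```

The resulting dictionary has the same shape as a sparse pairwise cost matrix: only the near-duplicate pair in frame 0 is penalized, so the optimizer remains free to keep both detections if the rest of the objective justifies it.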
4.1.2 Enforcing consistency between two signals
In many tracking scenarios, multiple signals are available for use. For example, we might have a body detector as well as a head detector. In case they give complementary information about the presence of the object, we can be more robust to detection noise by ensuring that the two tracks are consistent using a pairwise cost.
For example, let and denote the selection variables (detection or connection) for the head and body respectively. Each set can be associated with its own flow feasible set and . We can encourage the consistent “co-occurrence” of the two flows by adding the following negative cost:
(4)  
In our experiments, we say that and are consistent in two scenarios. Either and are detection variables such that their corresponding boxes overlap more than . Or we have a head detection with a box that intersects the edge connecting its respective body detection boxes (and similarly for a body detection and head edge). For the body detection box, we only consider its top 25% region when computing overlap or looking at intersection. The idea behind the latter possibility is to be more robust to missing detections on some frames: it corresponds to a situation where a head and body detection would have overlapped if we were interpolating detections along an edge that skips frames. Note that the cost (4) is difficult to minimize greedily, since both head and body tracks need to be optimized simultaneously.
5 Optimization
In the previous section, we presented examples of quadratic cost functions that we could include in our extension to the min-cost flow network formulation to encourage co-occurrence preferences for individual variables in the minimization. Finding a global minimum is NP-hard [18] if we keep the integer constraints on the variables (which is necessary to ensure the correct track encoding). Our suggested strategy is to instead find a global solution to the relaxed version of the problem with the integer constraints removed, and then use a powerful heuristic to search for a nearby integer solution that satisfies the flow constraints (see Section 5.3). By comparing the objective value of the “rounded” integer solution with that of the global solution to the relaxed problem, which provides a lower bound, we obtain a certificate of suboptimality. In our experiments, we observed that the suboptimality upper bounds were quite small, indicating that our optimization framework is stable and that we can instead focus on designing good cost functions. We now describe several approaches to optimize (2).
5.1 Quadratic optimization
If is positive definite, then the quadratic program (QP) in (2) with relaxed integer constraint is convex and can be robustly optimized using interior point methods implemented in commercial solvers such as MOSEK or CPLEX. These methods can scale to medium-size problems (a few million variables, which translates to several hundred frames with a high number of detections for our datasets) by exploiting the sparseness of suggested in Section 4.1.
In our general formulation, is not necessarily positive definite. We can nevertheless use a standard trick to make it positive semidefinite by defining its diagonal entries to be , while using as the linear coefficient for in the objective. As for binary variables, this transformation still yields an (equivalent) IQP. is now positive semidefinite [11, Thm. 6.1.10], and so the relaxation gives a convex problem.
In order to scale to very large datasets (billions of variables), one could use the Frank-Wolfe algorithm [12], a first-order gradient-based method that iteratively minimizes a linearization of the quadratic objective. An advantage of this approach is that each step of the Frank-Wolfe algorithm reduces in our case to the minimization of a min-cost network flow problem, which can scale to much larger sizes than a generic linear program solver. Moreover, each step of this algorithm yields an integer solution. Thus, while optimizing the relaxed objective (which will provide a lower bound certificate), we can keep track of which integer iterate had the best objective thus far. This perspective also motivates a powerful rounding heuristic that we describe in Section 5.3. Building on a preliminary version of our paper, [14] used this approach successfully for performing efficient co-localization in videos, where the constraint set also had a network flow structure.
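The diagonal shift used in this subsection to convexify the relaxation can be checked numerically. The sketch below assumes one common form of the trick: each diagonal entry is set to the sum of absolute off-diagonal entries in its row (which makes a symmetric matrix diagonally dominant and hence positive semidefinite), and the linear coefficient is reduced by the same amount, which leaves the objective unchanged on binary vectors because the square of a binary variable equals the variable. The exact shift used by the authors is not recoverable from the text, so the function `convexify` and the toy matrix are illustrative assumptions.

```python
import random
from itertools import product

def objective(z, c, Q):
    """Evaluate c^T z + z^T Q z for a dense matrix Q given as a list of rows."""
    n = len(c)
    return (sum(c[i] * z[i] for i in range(n))
            + sum(Q[i][j] * z[i] * z[j] for i in range(n) for j in range(n)))

def convexify(c, Q):
    """Shift the diagonal of a symmetric Q to make it diagonally dominant.

    The linear term absorbs the change, so binary objective values are preserved.
    """
    n = len(c)
    shift = [sum(abs(Q[i][j]) for j in range(n) if j != i) for i in range(n)]
    Q_new = [row[:] for row in Q]
    c_new = list(c)
    for i in range(n):
        c_new[i] = c[i] + Q[i][i] - shift[i]
        Q_new[i][i] = shift[i]
    return c_new, Q_new

c = [-1.0, -0.9, -0.8]
Q = [[0.0, 1.0, -0.25],
     [1.0, 0.0, 0.0],
     [-0.25, 0.0, 0.0]]   # symmetric and indefinite
c_new, Q_new = convexify(c, Q)

# 1) The two objectives agree on every binary vector.
same_on_binary = all(
    abs(objective(z, c, Q) - objective(z, c_new, Q_new)) < 1e-9
    for z in product((0, 1), repeat=len(c))
)

# 2) The shifted quadratic form is non-negative on random real vectors.
rng = random.Random(0)
zeros = [0.0] * len(c)
samples = [[rng.uniform(-1.0, 1.0) for _ in c] for _ in range(1000)]
quad_nonneg = all(objective(v, zeros, Q_new) >= -1e-9 for v in samples)
print(same_on_binary, quad_nonneg)
```

The random sampling in the second check is only a sanity test; the guarantee itself comes from diagonal dominance of the shifted matrix.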
5.2 Equivalent integer linear program
Another way to make the approach more scalable is to transform the integer QP (2) into an equivalent integer linear program (ILP) by introducing well-chosen additional variables and constraints. We present such an approach in this section, which generalizes the line of reasoning from [16].
We introduce a new set of variables that encode the joint selection of the edge and , and thus we would like to enforce . The quadratic cost component could then be replaced with a linear cost . An equivalent integer linear program is thus the following:
(5)  
Here and represent the vectors whose elements are and respectively. The new constraint enforces that should be if and are both ; while the pair of constraints and enforce if either or is zero. We call these constraints ‘’ as it turns out that they define a polytope which can be obtained by a projection of the local marginal consistency polytope for the overcomplete representation of a discrete Markov random field (MRF) [24, (4.6)] with edges defined by the nonzero entries of . More specifically, this representation defines one indicator variable per possible joint assignment of values on the cliques of the MRF. If we apply Fourier-Motzkin elimination [6, 24] to the local consistency polytope to eliminate the extra variables and keep only the three variables for each edge, then we recover the constraints for . Removing the integer constraint in (5) thus yields an LP relaxation that is similar to the one for MAP inference in MRFs, but with additional constraints, yielding a crucial structural difference with previous work.
An advantage of this formulation is that its relaxed form is an LP, which can usually be optimized by MOSEK or CPLEX at a larger scale than the QP formulation, even though there is an increase in the number of variables and constraints. Note though that the number of new variables created is the same as the number of nonzero coefficients in the sparse , which was indicated by the set in (5) to stress that we do not need to look at all pairs of edges. In exploratory experiments, we observed that the LP relaxation yielded solutions of similar quality to the QP relaxation, but was faster to optimize; we have thus focused on the LP relaxation in our experiments. Another advantage of (5) is that we can easily generalize it to handle higher order terms in the objective. For a clique of decision variables that we want to encourage or discourage jointly, we introduce a new variable . This semantic can be readily enforced with the constraints for all , and , which generalizes for higher order terms and yields another ILP that can be relaxed to an LP.
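The linearization can be sanity-checked by enumeration. The sketch below assumes the standard product-linearization constraints: the auxiliary variable is at most each of the variables it couples, at least their sum minus the clique size plus one, and non-negative. The function `u_bounds` (a hypothetical helper) returns the interval these constraints leave for the auxiliary variable; for binary inputs the interval collapses to the product, while for fractional inputs it does not, which is why the LP relaxation can return fractional solutions.

```python
from itertools import product

def u_bounds(z):
    """Interval allowed for the auxiliary variable coupling the entries of z.

    Assumed constraints: u <= z_i for every i, u >= sum(z) - (len(z) - 1), u >= 0.
    """
    lower = max(0.0, sum(z) - (len(z) - 1))
    upper = min(z)
    return lower, upper

def product_of(z):
    result = 1
    for zi in z:
        result *= zi
    return result

# Binary inputs: the only feasible value is the product, for pairs and triples.
exact_on_binary = all(
    u_bounds(z) == (product_of(z), product_of(z))
    for size in (2, 3)
    for z in product((0, 1), repeat=size)
)

# Fractional inputs: a whole interval is feasible, so the relaxation is loose.
loose_interval = u_bounds((0.5, 0.5))
print(exact_on_binary, loose_interval)
```

At the fractional point (0.5, 0.5) any value between 0 and 0.5 is allowed, whereas the true product would be 0.25; an LP solver will pick whichever end of the interval lowers the cost.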
5.3 Frank-Wolfe rounding heuristic
The solution of the LP relaxation of (5) can have fractional components because the additional linear constraints from break the total unimodularity property, in contrast to which yields a polytope with only integer vertices. Since naively rounding the obtained fractional variables to the nearest integer might not result in a feasible point (in other words, a valid flow), we need a strategy to obtain an integer solution with cost similar to the minimum. Given the relaxed global solution , the simplest approach would be to look for the point closest in Euclidean norm in which is integer. As for binary variables, we have which is a linear function of . We can thus obtain the closest integer point by solving an LP over , as all its vertices are integer. We call this approach Hamming rounding as reduces to the Hamming distance when evaluated on pairs of binary vectors. On the other hand, the closest point in Euclidean norm does not necessarily yield a good objective value (as the search is agnostic to the objective). Inspired by the Frank-Wolfe algorithm, our suggested heuristic is to minimize instead the first-order linear underestimator of the quadratic objective constructed with the gradient at the relaxed global solution . Specifically, we obtain the following LP, which has the usual network flow constraint structure and thus can be solved very efficiently:
(6) 
The objective here can be interpreted as modifying the distance function on binary vectors to take the cost function into consideration. As previously mentioned, the relaxed LP solution provides a lower bound on the true ILP (which is equivalent to the IQP) solution. The difference between the objective evaluated on any feasible integer solution and the lower bound is thus an upper bound certificate on its suboptimality. In our experiments, we obtained small suboptimality certificates () for our returned integer solutions, indicating that our rounding heuristic was effective at returning near-globally optimal solutions (we note that we define and so that the objective is normalized between and ). We also observed that Hamming rounding generally produced a suboptimality that was around to times worse than the solution produced by Frank-Wolfe rounding. These worse objective values also translated into worse tracking accuracy (see Appendix A in the supplementary material; the supplementary material, with videos and code, is available at [1]). We finally note that in contrast to the previous work [8], which could not guarantee that their algorithm would converge to an integer solution, our approach will always give some integer solution (by solving a simple min-cost network flow problem), and can provide a certificate of suboptimality a posteriori.
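The difference between the two rounding schemes can be illustrated on a toy problem. In the sketch below the integer vertices of the feasible polytope are simply enumerated (binary vectors with exactly two ones, standing in for the flow polytope, where a min-cost flow solver would be used instead), the quadratic matrix is assumed symmetric so that the gradient is the linear cost plus twice the matrix-vector product, and the fractional point `z_relaxed` is a made-up stand-in for a relaxed solution. Gaps are reported against the brute-force integer optimum, which is only possible because the example is tiny; in practice the relaxed optimum plays the role of the lower bound.

```python
from itertools import combinations

def objective(z, c, Q):
    n = len(c)
    return (sum(c[i] * z[i] for i in range(n))
            + sum(Q[i][j] * z[i] * z[j] for i in range(n) for j in range(n)))

def gradient(z, c, Q):
    """Gradient of c^T z + z^T Q z for symmetric Q."""
    n = len(c)
    return [c[i] + 2.0 * sum(Q[i][j] * z[j] for j in range(n)) for i in range(n)]

def linear_argmin(weights, vertices):
    """Linear minimization oracle over an explicit list of integer vertices."""
    return min(vertices, key=lambda z: sum(w * zi for w, zi in zip(weights, z)))

n = 4
vertices = [tuple(1 if i in pair else 0 for i in range(n))
            for pair in combinations(range(n), 2)]

c = [-1.0, -1.0, -0.9, -0.8]
Q = [[0.0] * n for _ in range(n)]
Q[0][1] = Q[1][0] = 1.0              # selecting 0 and 1 together is penalized
z_relaxed = [0.6, 0.55, 0.45, 0.4]   # hypothetical fractional solution

# Frank-Wolfe rounding: minimize the linearization of the objective at z_relaxed.
fw = linear_argmin(gradient(z_relaxed, c, Q), vertices)
# Hamming rounding: squared distance to z_relaxed is linear on binary vectors,
# with weights (1 - 2 * z_relaxed_i) up to an additive constant.
hamming = linear_argmin([1.0 - 2.0 * zi for zi in z_relaxed], vertices)

best = min(objective(z, c, Q) for z in vertices)
gap_fw = objective(fw, c, Q) - best
gap_hamming = objective(hamming, c, Q) - best
print(fw, gap_fw, hamming, gap_hamming)
```

In this instance Hamming rounding picks the two largest fractional entries and pays the pairwise penalty, while the gradient-based rounding avoids that pair. Neither is guaranteed to reach the integer optimum, which is why the certificate is reported alongside the solution.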
6 Experiments
In this section, we evaluate our approach on several real-world videos and compare results to state-of-the-art methods [4, 20, 22]. First we illustrate the effect of the two pairwise costs proposed in Section 4.1 and evaluate the improvement over basic min-cost network flow tracking. We also argue that the standard MOTA score is often insufficient to capture the quality of tracking results and propose a new measure for tracking evaluation, termed the re-detection measure (Section 6.2.1).
Second, we evaluate our method on six videos from the two standard datasets PETS and TUD. For both of these datasets, we obtain part of the input (person detections) from Milan et al. [20], and show improvements over their approach using the standard MOTA metrics.
6.1 Tracking datasets
We test our algorithm on several publicly available videos. The first video, MarchingRally, corresponds to a crowd walking in a rally along a street (see Figure 1, top row). The video consists of 120 frames recorded at 25 fps, and has about 50 people. This video is challenging due to the high number of people moving close to each other. We have manually annotated ground truth tracks for all people in this video for the purpose of tracking evaluation. The original MarchingRally video and the corresponding ground truth tracks are available from [1].
The second video, illustrated in Figure 1 (bottom row), is called TownCenter [4] and consists of 4500 frames recorded at 25 fps. The video shows approximately 230 people walking across the street. Finally, we use videos from the well-known PETS and TUD datasets. These videos depict frequently occluded people moving in multiple directions.
Preprocessing.
We run a “head” detector [23] to detect heads of people in every frame of the MarchingRally and PETS videos. While we use only head detections for the MarchingRally sequence, for PETS we use our head detections in combination with readily available body detections from [20]. Head detections complement frequently overlapping body detections and help resolve partial occlusions as well as ID switches. For each of these videos, we run a KLT tracker after initializing features within detection bounding boxes. Finally, for every pair of nearby frames (at most 10 frames apart), we connect pairs of detections with high correspondence strength. The strength of correspondence between two detections is the ratio of the number of their common KLT tracks to the total number of KLT tracks passing through both detections.
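The correspondence computation can be sketched as follows. The sketch reads "the total number of KLT tracks passing through both detections" as the size of the union of the two track sets, so that the strength is an intersection-over-union of feature track identifiers; this reading, the `min_strength` cut-off, the `max_gap` default and the function names are assumptions made for illustration.

```python
def correspondence_strength(tracks_a, tracks_b):
    """Shared KLT track ids divided by all ids seen in either detection."""
    union = tracks_a | tracks_b
    return len(tracks_a & tracks_b) / len(union) if union else 0.0

def candidate_links(detections, max_gap=10, min_strength=0.5):
    """Return {(i, j): strength} for detection pairs in nearby frames.

    detections: list of (frame, set_of_klt_track_ids). A link goes forward in
    time and is kept only if its strength reaches the hypothetical cut-off.
    """
    links = {}
    for i, (frame_i, tracks_i) in enumerate(detections):
        for j, (frame_j, tracks_j) in enumerate(detections):
            if 0 < frame_j - frame_i <= max_gap:
                strength = correspondence_strength(tracks_i, tracks_j)
                if strength >= min_strength:
                    links[(i, j)] = strength
    return links

detections = [
    (0, {1, 2, 3, 4}),
    (1, {2, 3, 4, 5}),    # shares three of five feature tracks with detection 0
    (1, {7, 8}),          # unrelated features
    (20, {1, 2, 3, 4}),   # same features, but too many frames later
]
links = candidate_links(detections)
print(links)
```

The negative of such a strength could then serve as the connection cost in the flow formulation.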
6.2 Tracking in video experiment
6.2.1 Evaluation strategy
Evaluating results of multi-object tracking is nontrivial because errors might be present in various forms including ID switches, broken tracks, imprecisely localized tracks and false tracks. Measures such as MOTA [4, 20] combine different errors into a single score and enable the global ranking of tracking methods. Such measures, however, lack interpretability. On the other hand, independent assessment of different errors can also be misleading. For example, in dense crowd videos such as in Figure 1(a), tracks may have relatively low localization error while being incorrect due to switches between neighboring people. Similarly, a low number of ID switches can be a consequence of many broken tracks.
We argue that a meaningful evaluation of tracking methods should be related to a task. One task with particular relevance to crowd videos is to detect the location of a given person after frames. To evaluate the performance of tracking methods on such a task we propose the re-detection measure as described below.
Re-detection measure.
The proposed re-detection measure evaluates the ability of a tracker to find the correct location of a given object after frames. The measure is inspired by the common evaluation procedure for object detection in still images [10] and extends it to tracking. For each pair of detections and associated to the same track by a tracker, we check if there exists a ground truth track that overlaps with and on frames and respectively. (The overlap between ground truth and detections is measured by the standard Jaccard similarity of corresponding bounding boxes.) If the answer is negative, the subtrack is labeled as false positive. Otherwise, it is labeled as true positive unless there exist multiple subtracks overlapping with the same ground truth. To avoid multiple responses, in the latter case only one subtrack is labeled as true positive while the others are declared false positives.
For the given we collect subtracks from all video intervals and sort them according to their confidence. (The confidence for a subtrack in this paper is given by the sum of its constituent detection confidences and correspondence strengths.) Given the subtrack labels defined above, we evaluate Precision-Recall and Average Precision (AP). High AP values indicate good performance of the tracker in the re-detection task. On the other hand, common errors such as ID switches and imprecise localization reduce AP values. Note that in the case of , our measure reduces to the standard measure for object detection. Larger values of enable evaluation of re-detection for longer time intervals. To compare different methods, we plot the re-detection AP for different values of as illustrated in Figure 3.
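Once subtracks have been labeled, the scoring step is a standard ranking evaluation. The sketch below computes a non-interpolated average precision from confidence-sorted labels; the detection benchmark cited above uses an interpolated variant, so this is a simplification, and the function name, the example scores and the choice of normalizing by the number of ground-truth subtracks are assumptions.

```python
def average_precision(scored_labels, num_ground_truth):
    """Non-interpolated AP for a list of (confidence, is_true_positive) pairs.

    Precision is accumulated at the rank of every true positive and divided by
    the number of ground-truth subtracks, so missed ones lower the score.
    """
    ranked = sorted(scored_labels, key=lambda item: -item[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0

# Hypothetical subtracks for one time interval: (confidence, label).
subtracks = [(0.4, True), (0.9, True), (0.8, False), (0.7, True)]
ap = average_precision(subtracks, num_ground_truth=4)
print(round(ap, 4))
```

Repeating this computation for several interval lengths yields one AP value per interval, which is the kind of curve plotted for comparing trackers.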
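| Dataset | Method | Rcll | Prcn | GT | MT | PT | ML | FP | FN | IDs | FM | MOTA | MOTP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TUD Stadtmitte | NF | 67.9 | 72.0 | 10 | 4 | 6 | 0 | 305 | 371 | 26 | 26 | 39.3 | 59.5 |
| TUD Stadtmitte | NF+pairwise | 59.6 | 89.9 | 10 | 2 | 8 | 0 | 77 | 467 | 15 | 22 | 51.6 | 61.6 |
| TUD Stadtmitte | Milan [20] | 69.1 | 85.6 | 10 | 4 | 6 | 0 | 134 | 457 | 15 | 13 | 56.2 | 61.6 |
| PETS S2L1 | NF | 93.7 | 83.4 | 19 | 17 | 2 | 0 | 870 | 293 | 64 | 66 | 73.6 | 72.9 |
| PETS S2L1 | NF+pairwise | 92.4 | 94.3 | 19 | 18 | 1 | 0 | 262 | 354 | 56 | 74 | 85.5 | 76.2 |
| PETS S2L1 | Milan [20] | 96.8 | 94.1 | 19 | 18 | 1 | 0 | 282 | 148 | 22 | 15 | 90.3 | 74.3 |
| PETS S2L2 | NF | 47.7 | 87.6 | 43 | 1 | 37 | 5 | 693 | 5383 | 291 | 531 | 38.1 | 60.7 |
| PETS S2L2 | NF+pairwise | 60.6 | 88.6 | 43 | 6 | 34 | 3 | 807 | 4050 | 244 | 379 | 50.4 | 60.6 |
| PETS S2L2 | Milan [20] | 65.1 | 92.4 | 43 | 11 | 31 | 1 | 549 | 3592 | 167 | 153 | 58.1 | 59.8 |
| PETS S2L3 | NF | 44.5 | 92.2 | 44 | 9 | 15 | 20 | 164 | 2428 | 121 | 189 | 38.0 | 69.3 |
| PETS S2L3 | NF+pairwise | 45.5 | 91.2 | 44 | 12 | 15 | 17 | 155 | 2125 | 44 | 50 | 40.3 | 61.2 |
| PETS S2L3 | Milan [20] | 43.0 | 94.2 | 44 | 8 | 17 | 19 | 115 | 2493 | 27 | 22 | 39.8 | 65.0 |
| PETS S1L1-2 | NF | 62.9 | 89.1 | 44 | 18 | 15 | 11 | 295 | 1425 | 289 | 140 | 47.8 | 65.2 |
| PETS S1L1-2 | NF+pairwise | 68.9 | 92.0 | 44 | 20 | 16 | 8 | 230 | 1198 | 35 | 74 | 62.0 | 62.1 |
| PETS S1L1-2 | Milan [20] | 64.9 | 92.4 | 44 | 21 | 12 | 11 | 169 | 1349 | 22 | 19 | 60.0 | 61.9 |
| PETS S1L2-1 | NF | 31.3 | 87.4 | 42 | 4 | 15 | 23 | 208 | 3501 | 101 | 243 | 23.7 | 57.9 |
| PETS S1L2-1 | NF+pairwise | 37.9 | 89.6 | 42 | 4 | 20 | 18 | 223 | 3141 | 67 | 122 | 32.2 | 55.0 |
| PETS S1L2-1 | Milan [20] | 30.9 | 98.3 | 42 | 2 | 19 | 21 | 27 | 3494 | 42 | 34 | 29.6 | 58.8 |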
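The MOTA column can be related to the error counts in the same row through the standard CLEAR MOT definition, MOTA = 1 - (FP + FN + IDs) / (number of ground-truth boxes). The table does not list the number of ground-truth boxes, so the sketch below infers it from the recall and the number of misses; that inference and the helper names are assumptions, and the two rows used are the NF and NF+pairwise rows for TUD Stadtmitte.

```python
def ground_truth_boxes(false_negatives, recall_percent):
    """Infer the number of ground-truth boxes from misses and recall."""
    return false_negatives / (1.0 - recall_percent / 100.0)

def mota(false_positives, false_negatives, id_switches, num_ground_truth):
    """CLEAR MOT accuracy in percent."""
    errors = false_positives + false_negatives + id_switches
    return 100.0 * (1.0 - errors / num_ground_truth)

# TUD Stadtmitte, NF row: Rcll 67.9, FP 305, FN 371, IDs 26.
gt_nf = ground_truth_boxes(371, 67.9)
mota_nf = mota(305, 371, 26, gt_nf)

# TUD Stadtmitte, NF+pairwise row: Rcll 59.6, FP 77, FN 467, IDs 15.
gt_pairwise = ground_truth_boxes(467, 59.6)
mota_pairwise = mota(77, 467, 15, gt_pairwise)

print(round(mota_nf, 1), round(mota_pairwise, 1))
```

Both rows imply roughly the same number of ground-truth boxes, as expected for a single sequence, and the recomputed scores agree with the tabulated MOTA values up to rounding.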
6.2.2 Experimental results
We compare our algorithm with state-of-the-art approaches on several video sequences. For the MarchingRally and TownCenter sequences, the baseline approaches for comparison are a greedy implementation of the basic min-cost network flow algorithm with the greedy NMS heuristic from [22], and a network flow (NF) implementation as a linear program. In all graphs in Figure 3, the corresponding results are represented by black (“Greedy + NMS”) and blue (“NF Basic”) curves. We note that we perform a careful grid search over the parameter space for all three algorithms and show the results corresponding to the best parameters, to make sure the differences observed do not arise from different parameter choices, but rather from limitations of the framework. On the other hand, we have used only one fixed set of parameter values to produce the results on the different sequences in the PETS and TUD datasets given in Table 1. See [1] for the parameters used and information about the runtime.
In the MarchingRally video sequence, several people are moving in a crowd in a similar direction. The viewing angle and the number of people aggravate the issue of clutter, which causes tracking algorithms to fail by confusing track identities. Our algorithm with overlap constraints (red curve) outperforms the state of the art by a large margin. Figure 3(a) shows the re-detection accuracy with and without the overlap constraints. Note that the difference in performance between our algorithm and [22] grows with the re-detection time interval. In fact, for the intervals of frames or more, our algorithm outperforms the baseline by over AP.
The TownCenter sequence is a video with two complementary sets of detections corresponding to heads and upper bodies. While head detections are noisy but have high recall, body detections are more precise but are also prone to more clutter. In such a case, as shown in Figure 3(b), we leverage body detections to improve noisy head tracks. Again in this case, there is more than improvement in AP over the head baseline. Finally, the table in Figure 3 compares our method with a state-of-the-art algorithm [4] in terms of the traditional MOTA evaluation measure. Note that while we compare with a “greedy” version of the overlap term [22], designing a greedy version of the co-occurrence term is not obvious.
For the PETS and TUD sequences, we compare the results of our method based on MOTA metrics with those presented in Milan et al. [20]. These sequences are challenging for a variety of reasons. First, there is a crowd of people walking in different directions and criss-crossing each other, which makes sustained tracking difficult. Second, few full-body detections are available per frame in each video, which makes adding new terms to the objective function difficult. Third, since people walk side by side, there is a lot of overlap between detections that belong to two different persons, hence enforcing the overlap criterion is difficult. Nevertheless, as can be seen in Table 1, our method generally achieves MOTA, MOTP and recall scores comparable to those of [20]. This shows that our method is able to address complex scenarios effectively and that our cost function is easy to adapt to general scenarios. Note also that the camera angles in PETS and TUD are very different from each other, which suggests that our algorithm is sufficiently robust to these changes. Thus, we estimate trajectories better (the sum of MT and PT of our method is usually high). This also results from the use of both overlap and co-occurrence terms in our approach, which can take into account head detections as additional information.
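For reference, the MOTA score used in this comparison follows the standard CLEAR MOT definition, which penalizes false positives, misses and identity switches relative to the total number of ground-truth detections. The following is a minimal Python sketch of that formula; the function name `mota` and the toy counts are illustrative assumptions, not values taken from Table 1.

```python
def mota(false_positives, false_negatives, id_switches, num_gt_detections):
    """Multiple Object Tracking Accuracy (standard CLEAR MOT definition).

    num_gt_detections is the total number of ground-truth boxes summed
    over all frames (an input assumed to be available to the caller).
    """
    if num_gt_detections <= 0:
        raise ValueError("num_gt_detections must be positive")
    errors = false_positives + false_negatives + id_switches
    return 1.0 - errors / num_gt_detections


# Toy counts chosen for illustration only.
score = mota(false_positives=10, false_negatives=20, id_switches=5,
             num_gt_detections=100)
print(f"MOTA = {100 * score:.1f}%")
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth detections.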
7 Discussion and conclusion
We have presented a generic optimization procedure enabling the addition of quadratic costs to min-cost network flow tracking methods. Our method enables modeling of track interactions in a principled way and provides empirical certificates of small suboptimality. We have shown practical benefits of our method for two particular examples of pairwise costs on challenging video sequences.
Combining different types of pairwise costs into a single (linear) cost opens up the possibility of tracking complicated motions. Moreover, while complex cost functions have more tunable parameters, they could be learnt from labeled data using structured output learning [16]. This opens up the possibility of learning quadratic costs for specific crowd actions such as panic, street crossing or stampede.
8 Acknowledgements
This research was supported in part by the projects FluidTracks, EIT ICT Labs, Google research award, ERC grant Activia (no. 307574) and ERC grant LEAP (no. 336845). We thank Patrick Perez for discussions on the multi-target tracking evaluation.
References
 [1] http://www.di.ens.fr/willow/research/flowtrack/.
 [2] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network flows: theory, algorithms, and applications. Prentice-Hall, Inc., 1993.
 [3] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In CVPR, 2008.
 [4] B. Benfold and I. Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.
 [5] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. TPAMI, 33(9):1806–1819, 2011.
 [6] D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.
 [7] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
 [8] A. A. Butt and R. T. Collins. Multi-target tracking by Lagrangian relaxation to min-cost network flow. In CVPR, 2013.
 [9] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In ECCV, 2012.
 [10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. IJCV, 2010.
 [11] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
 [12] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
 [13] H. Jiang, S. Fels, and J. Little. A linear programming approach for multiple object tracking. In CVPR, 2007.
 [14] A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In ECCV, 2014.
 [15] L. Kratz and K. Nishino. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In CVPR, 2009.
 [16] S. Lacoste-Julien, B. Taskar, D. Klein, and M. I. Jordan. Word alignment via quadratic assignment. In NAACL, 2006.
 [17] Y. Li, C. Huang, and R. Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In CVPR, 2009.
 [18] E. M. Loiola, N. M. M. de Abreu, P. O. Boaventura-Netto, P. Hahn, and T. Querido. A survey for the quadratic assignment problem. European Journal of Operational Research, 176(2), 2007.
 [19] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multitarget tracking. TPAMI, 36(1):58–72, 2014.
 [20] A. Milan, K. Schindler, and S. Roth. Detection- and trajectory-level exclusion in multiple object tracking. In CVPR, 2013.
 [21] S. Pellegrini, A. Ess, K. Schindler, and L. van Gool. You’ll never walk alone: Modeling social behavior for multi-target tracking. In CVPR, 2009.
 [22] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
 [23] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert. Density-aware person detection and tracking in crowds. In ICCV, 2011.

 [24] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
 [25] B. Wu and R. Nevatia. Tracking of multiple, partially occluded humans based on static body part detection. In CVPR, 2006.
 [26] B. Yang, C. Huang, and R. Nevatia. Learning affinities and dependencies for multi-target tracking using a CRF model. In CVPR, 2011.
 [27] B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In CVPR, 2012.
 [28] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Comput. Surv., 38(4), 2006.
 [29] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008.
Appendix A Superiority of Frank-Wolfe rounding heuristic vs. Hamming rounding
In Section 5.3, we described two approaches to round the fractional solution obtained after optimizing the LP relaxation (5.2). “Rounding” here means finding a valid track encoding for prediction, i.e. a feasible point with integer coordinates. The first approach is to find the vertex (binary vector) of the feasible set with minimal Euclidean distance to the fractional solution. We call this approach Hamming rounding; it is standard for problems operating on binary vectors. We also proposed a novel alternative rounding heuristic called Frank-Wolfe rounding, which instead minimizes the linear approximation of the quadratic objective (4) at the fractional solution, and is given by problem (5.3). In our experiments, we observed that Frank-Wolfe rounding yielded solutions with better objective values, as well as better tracking accuracy, than Hamming rounding. We illustrate these observations in this section.
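To make the difference between the two rounding rules concrete, the following Python sketch applies both to a toy problem. Everything in it is an assumption made for illustration: the objective is taken to have the form c·z + zᵀQz, the feasible vertices are enumerated explicitly (in the actual method both roundings would be solved over the flow constraints rather than by enumeration), and the fractional point is hand-picked instead of being the output of the LP relaxation.

```python
def objective(c, Q, z):
    """Assumed quadratic objective: c.z + z^T Q z."""
    n = len(z)
    linear = sum(c[i] * z[i] for i in range(n))
    quadratic = sum(Q[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return linear + quadratic


def gradient(c, Q, z):
    """Gradient of the assumed objective: c + (Q + Q^T) z."""
    n = len(z)
    return [c[i] + sum((Q[i][j] + Q[j][i]) * z[j] for j in range(n))
            for i in range(n)]


def frank_wolfe_rounding(c, Q, z_frac, vertices):
    """Pick the vertex minimizing the linearization of the objective at z_frac."""
    g = gradient(c, Q, z_frac)
    return min(vertices, key=lambda v: sum(gi * vi for gi, vi in zip(g, v)))


def hamming_rounding(z_frac, vertices):
    """Pick the vertex closest to z_frac in squared Euclidean distance."""
    return min(vertices,
               key=lambda v: sum((vi - zi) ** 2 for vi, zi in zip(v, z_frac)))


# Toy problem: two detections with a hypothetical overlap penalty of 3.
c = [-1.0, -0.8]              # unary costs (negative favors selection)
Q = [[0.0, 3.0], [0.0, 0.0]]  # pairwise cost between detections 0 and 1
vertices = [(0, 0), (1, 0), (0, 1), (1, 1)]
z_frac = [0.6, 0.55]          # hand-picked fractional point

z_fw = frank_wolfe_rounding(c, Q, z_frac, vertices)
z_ham = hamming_rounding(z_frac, vertices)
print("Frank-Wolfe:", z_fw, objective(c, Q, z_fw))
print("Hamming:    ", z_ham, objective(c, Q, z_ham))
```

In this toy case Hamming rounding selects both overlapping detections because the fractional point lies closest to (1, 1), whereas Frank-Wolfe rounding uses the gradient, which already reflects the pairwise penalty, and avoids that vertex. Neither result is the global optimum here, which is consistent with both rules being heuristics.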
For the MarchingRally experiment (where we only have head detections), we parameterized the objective with two parameters: a multiplicative constant in front of the detection confidences, and the value of the overlap penalty mentioned in (3) (set to a constant). (Footnote 9: We suppose a multiplicative constant of one in front of the correspondence strengths; changing it as well would just amount to multiplying the whole objective by a constant, which would not change the solution.) In Table 2, we compare the suboptimality certificate values for Frank-Wolfe rounding vs. Hamming rounding for 6 different parameter settings on the MarchingRally dataset. More specifically, for each parameter setting, we first obtain the global relaxed solution to the LP relaxation (5.2), then we round either by Frank-Wolfe rounding or by Hamming rounding and compare their suboptimality certificates. We also compare their re-detection accuracy in Figure 4, which shows that Frank-Wolfe rounding systematically yields better results than Hamming rounding.
Case 1 in Table 2 is the reference case where we use the best parameter values found by grid search, which were used to produce the results in Figure 3(a) in the paper. For Cases 2 and 3, we vary the overlap penalty weight. Case 2 uses a very low value for the overlap term, encouraging tracks to criss-cross each other, while Case 3 has a very high overlap weight, which means that even a small amount of overlap is unacceptable. Results for these cases are shown in the first row of Figure 4. The next three cases vary the weight for detection confidence. In particular, in Case 6 the presence of a negative weight actually “discourages” any detections from being picked unless they are connected to edges with extremely high connection strength. This results in poor performance as shown in Figure 4, but note that even here the Hamming rounding results are worse than the Frank-Wolfe rounding ones. Also note that worse suboptimality certificates usually result in worse tracking.
Case    Detection weight  Overlap penalty  FW       Ham.
Case 1  0.1               0.0223           4.7e-03  1.4e-02
Case 2  0.1               0.0007           8.7e-06  9.3e-03
Case 3  0.1               2.23             4.3e-03  1.0e-01
Case 4  3.0               0.0223           9.3e-06  8.9e-03
Case 5  0.074             0.0223           3.1e-02  1.0e-01
Case 6  -1.0              0.0223           1.0e-01  1.3e-01
Appendix B Video Results
The images in Figure 5 show the tracks overlaid on top of the first frame of the MarchingRally sequence. Each track is shown in a separate color. The output on the top illustrates our result (NF+Overlap) and the one on the bottom illustrates the results of [22] (Greedy + NMS). Note how in our case one gets non-overlapping tracks, while in the case of [22] there are places where tracks overlap and criss-cross. We highlight this in videos available from [1] by drawing cyan-colored boxes at places where such ID swaps happen. See Figure 3(a) for the corresponding re-detection curves. For the more classical metrics, the (MOTA, MOTP, ID swaps) numbers for NF+Overlap are (27.7%, 66.5%, 11) vs. (22.5%, 66.0%, 24) for Greedy+NMS.
Appendix C Runtime and Constraints
For the PETS and TownCenter datasets, we typically have approximately 10–40 detections per frame. For PETS data, each detection is connected to a detection in another frame (with a pairwise term) if they are less than 6 frames apart. On average, each detection is connected to about 10 other detections for pairwise terms (overlap+CO), which means the number of pairwise terms is linear in the number of unary terms. For TownCenter data, we connect detections over 30 frames to account for slower motion of people and missing detections, resulting in about 15 pairwise terms (overlap+CO) per detection on average. While our algorithm runs in about 5–10 seconds on the PETS dataset, it takes about 30–45 minutes on the TownCenter dataset. This difference is due to the larger number of frames in the TownCenter dataset (one order of magnitude greater than for the PETS videos), and also the larger number of pairwise terms per detection on average, resulting in an LP with about 5 million variables in comparison to about 50 thousand for the PETS sequences.