Learning to Divide and Conquer for Online Multi-Target Tracking

09/14/2015 ∙ by Francesco Solera, et al.

Online Multiple Target Tracking (MTT) is often addressed within the tracking-by-detection paradigm. Detections are first extracted independently in each frame, and object trajectories are then built by maximizing specifically designed coherence functions. Nevertheless, ambiguities arise in the presence of occlusions or detection errors. In this paper we claim that the ambiguities in tracking could be solved by a selective use of the features, by working with more reliable features when possible and exploiting a deeper representation of the target only when necessary. To this end, we propose an online divide and conquer tracker for static camera scenes, which partitions the assignment problem into local subproblems and solves them by selectively choosing and combining the best features. The complete framework is cast as a structural learning task that unifies these phases and learns tracker parameters from examples. Experiments on two different datasets highlight a significant improvement in tracking performance (MOTA +10%) over the state of the art.




1 Introduction

Multiple Target Tracking (MTT) is the task of extracting the continuous path of relevant objects across a set of subsequent frames. Due to the recent advances in object detection [10, 4], the problem of MTT is often addressed within the tracking-by-detection paradigm. Detections are first extracted independently in each frame, and object trajectories are then built by maximizing specifically designed coherence functions [19, 5, 21, 2, 9, 24]. Tracking objects through detections can mitigate the drifting behaviors introduced by prediction steps but, on the other hand, it forces the tracker to work in adverse conditions, due to the frequent occurrence of false and missed detections.

The majority of approaches address MTT offline, i.e. by exploiting detections from a set of frames [19, 5, 9] through global optimization. Offline methods benefit from the larger portion of the video sequence they have available to establish spatio-temporal coherence, but cannot be used in real-time applications. Conversely, online methods track the targets frame by frame; they have a wider range of applications but must be both accurate and fast despite working with less data. In this context, the robustness of the features plays a major role in the online MTT task. Some approaches claim the adoption of complex target models [2, 24] to be the solution, while others argue that this complexity may affect long-term robustness [23]. For instance, in large crowds people's appearance is rarely informative. As a consequence, tracking robustness is often achieved by focusing on spatial features [21], which are found to be more reliable than visual ones.

Figure 1: The scene is partitioned into local zones. Green zones are those where the same number of tracks and detections are present. Red zones are those where missed and false detections (white dashed contours) are discovered, and where solving the associations may call for complex appearance or motion features.

We do believe that many of the ambiguities in tracking could be solved by a selective use of the features, by working with more reliable features if possible and exploiting a deeper representation of the target only if necessary. In fact, a simple spatial association is often sufficient while, as clutter or confusion arise, an improved association scheme on more complex features is needed (Fig. 1).
In this paper a novel approach for online MTT in static camera scenes is proposed. The method selects the most suitable features to solve the frame-by-frame associations depending on the surrounding scene complexity.
Specifically, our contributions are:

  • an online method based on Correlation Clustering that learns to divide the global association task into smaller and localized association subproblems (Sec. 5),

  • a novel extension to the Hungarian association scheme, flexible enough to be applied to any set of preferred features and able to conquer trivial and complex subproblems by selectively combining the features (Sec. 6),

  • an online Latent Structural SVM (LSSVM) framework to combine the divide and conquer steps and to learn from examples all the tracker parameters (Sec. 7).

The algorithm works by alternating between (a) learning the affinity measure of the Correlation Clustering as a latent variable and (b) learning the optimal combinations of both simple and complex features to be used as cost functions by the Hungarian algorithm. Results on public benchmarks show a major improvement in tracking accuracy over current state-of-the-art online trackers (+10% MOTA).

The work takes inspiration from the human perceptive behavior, further introduced in Sec. 3. According to the widely accepted two-streams hypothesis by Goodale and Milner [12], the use of motion and appearance information is localized in the temporal lobe (what pathway), while basic spatial cues are processed in the parietal lobe (where pathway). This suggests our brain processes and exploits information in different and specific ways as well.

2 Related works

Tab. 1 gives an overview of recent tracking-by-detection approaches, separating online and offline methods, and indicates the adoption of tracklets (T), appearance models (A) and complex learning schemes (L). Offline methods [5, 19, 13, 18] are out of the scope of the paper and are reported for the sake of completeness.
Tracklets are the result of an intermediate hierarchical association of the detections and are commonly used by both offline and online solutions [18, 13, 25]. In these approaches, high-confidence associations link detections in a pre-processing step, and optimization techniques are then employed to link tracklets into trajectories. Nevertheless, creating tracklets involves solving a frame-by-frame assignment problem by thresholding the final association cost, and errors in the tracklets affect the tracking results as well.

In addition, online methods often try to compensate for the lack of spatio-temporal information through the use of appearance or other complex feature models. The appearance model is typically handled by adopting a classifier for each tracked target [24], and data association is often finalized through an averaged sum of classifier scores [7, 2]. As a consequence, learning is performed on the target models, not on the associations. Moreover, online methods also need to cope with drifting when updating their target models. One possible solution is to avoid updating the model when uncertainties are detected in the data, i.e. when a detection cannot be paired to a sufficiently close previous trajectory [2]. Nevertheless, any error introduced into the model can rapidly lead to tracking drift and wrong appearance learning. Building on these considerations, Possegger et al. [21] do not consider appearance at all and only work with distance and motion data.

Offline methods
Berclaz et al. [5] 2011
Milan et al. [19] 2014
Hoffman et al. [13] 2013
Li et al. [18] 2009
Online methods
Yang and Nevatia [25] 2014
Breitenstein et al. [7] 2009
Bae and Yoon [2] 2014
Possegger et al. [21] 2014
Wu et al. [24] 2013
Our proposal 2015
Table 1: Overview of offline and online related works in terms of code availability (C), appearance models (A), tracklets computation (T), associations learning (L) and presence in the MOT Challenge competition (M). In our method, use of appearance set to  means only when needed.

Differently from the aforementioned online learning methods, our approach is not hierarchical and we do not compute intermediate tracklets, because errors in the tracklets corrupt the learning data. Similarly to [2], we model a score of uncertainty, but based on distance information only and not on the target model, since distance cannot drift over time. This enables us to invoke appearance and other less stable features only when truly needed, as in the case of missing detections, occluded objects or new tracks.

3 Related perception studies

The proposed method is inspired by the human cognitive ability to solve the tracking task. Events such as eye movements, blinks and occlusions disrupt the input to our visual system, introducing challenges similar to the ones encountered in real-world video sequences and detections. Perception psychologists have studied the mechanisms employed by our brain during multiple object tracking since the ’80s [14, 22, 1], though only recently have MRI experiments been used to confirm and validate the proposed theories. One of the preeminent theories is given in a seminal work by Kahneman, Treisman and Gibbs in 1992 [14]. They proposed the theory of Object Files to explain the dominant role of spatial information in preserving target identity, a paradigm called Spatio-Temporal Dominance. Accordingly, target correspondence is computed on the basis of spatio-temporal continuity and does not consult non-spatial properties of the target. If spatio-temporal information is consistent with the interpretation of a continuous target, the correspondence will be established even if appearance features are inconsistent.

“Up in the sky, look: It’s a bird. It’s a plane. It’s Superman!” This well-known quote, from the short animated movie of the same name (1941), suggests that the people pointing at Superman changed their visual perception of the target to the extent of giving him a completely different meaning, while never doubting that they kept referring to the same object. Nevertheless, when correspondence cannot be firmly established on the basis of spatial information, appearance, motion, and other complex features can be consulted as well.

In particular, in [14] the tracking process is divided into a circular pipeline of three steps (Fig. 2, top row). The correspondence step uses only positional information and aims at establishing whether a detected object is a new target or an existing one appearing at a different location. The review step activates when ambiguity in the assignments arises, and recomputes uncertain target links by also taking into account more complex features. Finally, the impletion step assesses and induces the perception of the targets' temporal coherence.

Figure 2: First row shows the human tracking process according to Kahneman, Treisman and Gibbs theory [14]. Below a schematic view of the inference and learning steps underpinning our method.

4 The proposal

As depicted in Fig. 2, the proposed method relates the three steps of correspondence, review and impletion to a divide and conquer approach. Targets are divided in the where pathway by checking for incongruences in spatial coherence. The tracking solution is then conquered by associating coherent elements in the where (spatial) domain and incoherent ones in the what (visual) domain.

The core of the proposal is twofold. First, we introduce a method to divide potential associations between detections and tracks into local clusters, or zones. A zone can be either simple or complex, calling for different features to complete the association. Targets can be directly associated to their closest detections if they are inside a simple zone (one with the same number of tracks and detections, green area in Fig. 3(b)). Conversely, targets inside complex zones (red in Fig. 3(b)) are subject to a deeper evaluation where appearance, motion and other features may be involved.

Second, we cast the problems of splitting potential associations and solving them by selecting and weighting the features inside a unified structural learning framework that aims at the best set of partitions and adapts from scene to scene.

4.1 Problem formulation

Online MTT is typically solved by optimizing, at frame , a generic assignment function for a set of tracks and current detections :



is a permutation vector of

and is a cost matrix. The cost matrix is designed to include dummy rows and columns to account for new detected objects () or leaving targets (). More formally, if matrix contains association costs for currently tracked targets and detections, the cost matrix is:


where , contain the cost of creating a new track on the diagonal and elsewhere. Similarly, is a full matrix of value .
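The padded cost matrix described above can be illustrated with a minimal sketch. The block layout, the names `padded_cost_matrix` and `solve_assignment`, and the `birth_cost` / `death_cost` values are assumptions made for illustration; a brute-force search over permutations stands in for the Hungarian algorithm, which is only viable for very small problems.

```python
from itertools import permutations

INF = float("inf")

def padded_cost_matrix(costs, birth_cost, death_cost):
    """Square matrix with dummy rows and columns (layout assumed for illustration).

    costs[i][j] is the cost of linking track i to detection j.
    """
    n, m = len(costs), len(costs[0])
    size = n + m
    full = [[INF] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            full[i][j] = costs[i][j]
        full[i][m + i] = death_cost       # track i is left without a detection
    for j in range(m):
        full[n + j][j] = birth_cost       # detection j starts a new track
        for i in range(n):
            full[n + j][m + i] = 0.0      # dummy-to-dummy pairs cost nothing
    return full

def solve_assignment(matrix):
    """Brute-force minimum-cost assignment (stand-in for the Hungarian algorithm)."""
    size = len(matrix)
    best = min(permutations(range(size)),
               key=lambda p: sum(matrix[r][p[r]] for r in range(size)))
    return list(best)

# Hypothetical frame: two tracks, three detections; the third detection is far from both tracks.
costs = [[1.0, 9.0, 8.0],
         [9.0, 2.0, 8.0]]
assignment = solve_assignment(padded_cost_matrix(costs, birth_cost=3.0, death_cost=3.0))
links = {i: j for i, j in enumerate(assignment[:2]) if j < 3}
new_tracks = [j for j in range(3) if assignment.index(j) >= 2]
```

In this toy frame both tracks keep their nearby detection, and the third detection is cheaper to explain as a new track than as a match.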

The formulation in Eq. (1) evaluates all the associations through the same cost function, built upon a preferred set of features. In order to consider different cost functions for specific subsets of associations, we reformulate Eq. (1) as:


where we make explicit the different contributions of trivial and difficult associations, whose costs are given by the functions and respectively. Associations are locally partitioned into zones as shown in Fig. 3(b). Hereinafter, we use the term zone z interchangeably for a portion of the scene and for the set of detections and tracks that lie within it. A zone can be simple or complex to solve depending on the set of associations it involves.

(a) frame input
(b) divide associations
(c) conquer associations
Figure 3: Overview of the inference procedure. (a) In the image, targets are represented by bird's-eye view sketches (shaded when occluded) and detections by crosses. (b) In the divide step, detections and non-occluded targets are spatially clustered into zones. A zone with an equal number of targets and detections is simple (solid green contours), complex otherwise (dashed red contours). (c) Associations in simple zones are independently solved by means of distance features only. Complex zones are solved by considering more complex features such as appearance or motion and accounting for potentially occluded targets, which are shared across all the complex zones.

5 Learning to divide

In this section, we propose a method to generate zones and decide whether associations in those zones are simple or difficult. A zone can be defined as a heterogeneous set of tracks and detections characterized by spatial proximity. Even if simple, the concept of proximity may vary across sequences, and the importance of distances on each axis depends on the targets' dominant flows in the scene. Zones are computed through the Correlation Clustering (CC) method [3] on the cost matrix, suitably modified to obtain an affinity matrix as required by the CC algorithm. To move from the cost features (distances) of the cost matrix to the affinity features of the affinity matrix, the cost feature vector is augmented with its similarity counterpart and the affinity value is computed as the scalar product between this vector and a parameter vector:


where and are the -th track and -th detection respectively. The parameter vector has the triple advantage of weighting distances on each axis differently, avoiding the need to set thresholds in the affinity computation, and controlling the compactness and balance of the clusters. Further details on learning are provided in the following sections.
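A small sketch can make the thresholdless affinity concrete. It assumes normalized per-axis distances and takes `1 - d` as the "similarity counterpart" of each distance; both choices, the function name `affinity`, and the example weights are hypothetical and not taken from the paper.

```python
def affinity(track_xy, det_xy, theta, scale=1.0):
    """Scalar product between a weight vector and distances plus their similarity counterparts."""
    dx = abs(track_xy[0] - det_xy[0]) / scale
    dy = abs(track_xy[1] - det_xy[1]) / scale
    # Distances followed by their assumed similarity counterparts (1 - distance).
    features = [dx, dy, 1.0 - dx, 1.0 - dy]
    return sum(w * f for w, f in zip(theta, features))

# Hypothetical weights: the x axis counts twice as much as the y axis.
theta = [-1.0, -0.5, 1.0, 0.5]
near = affinity((0.0, 0.0), (0.1, 0.1), theta)   # close pair
far = affinity((0.0, 0.0), (2.0, 2.0), theta)    # distant pair
```

With weights of this kind the sign of the affinity separates attracting pairs from repelling ones, so no explicit distance threshold has to be chosen by hand.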

To prevent the creation of clusters composed only of detections or tracks, a symmetric version of is created having a zero block diagonal structure:


With this construction, two tracks (detections) can be in the same cluster only if they are close to a common detection (track). The CC algorithm, applied on , efficiently partitions the scene into a set of zones so that the sum of the affinities between track-detection pairs in the same zone is maximized:


Finally, a zone is defined as simple if it contains an equal number of targets and detections; otherwise it is complex. As previously stated, associations in a complex zone cannot be solved with distance information only (Fig. 3(b)), but require more informative features to disambiguate the decision.
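The divide step can be sketched as follows. Note the substitution: instead of Correlation Clustering, this simplified stand-in groups tracks and detections into connected components over positive track-detection affinities, which respects the zero block-diagonal structure (tracks are linked only through detections and vice versa) but does not maximize the CC objective. The function name `divide` and the toy affinities are assumptions.

```python
def divide(affinity, n_tracks, n_dets):
    """Group tracks and detections into zones and label each zone.

    affinity[i][j] is the affinity between track i and detection j.
    Connected components over positive affinities are used here as a
    simplified stand-in for Correlation Clustering.
    """
    parent = list(range(n_tracks + n_dets))   # tracks first, then detections

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]     # path halving
            a = parent[a]
        return a

    for i in range(n_tracks):
        for j in range(n_dets):
            if affinity[i][j] > 0:
                parent[find(i)] = find(n_tracks + j)

    groups = {}
    for node in range(n_tracks + n_dets):
        groups.setdefault(find(node), []).append(node)

    zones = []
    for members in groups.values():
        tracks = [v for v in members if v < n_tracks]
        dets = [v - n_tracks for v in members if v >= n_tracks]
        kind = "simple" if len(tracks) == len(dets) else "complex"
        zones.append((tracks, dets, kind))
    return zones

# Hypothetical affinities: track 0 matches detection 0; tracks 1 and 2 compete
# for detection 1; detection 2 is close to no track.
A = [[ 1.0, -1.0, -1.0],
     [-1.0,  0.8, -1.0],
     [-1.0,  0.6, -1.0]]
zones = divide(A, n_tracks=3, n_dets=3)
```

Here the first zone is simple (one track, one detection), while the zone with two tracks and one detection and the isolated detection are both complex.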

6 Learning to conquer

The divide mechanism brings the advantage of splitting the problem into smaller local subproblems. Associations belonging to simple zones can be independently solved through any bipartite matching algorithm. The complete tracking problem must also deal with occluded targets. We consider a target as occluded when it is not associated to a detection (e.g. a missed detection in frame occurred, shaded people in Fig. 3). Since occluded targets are representations of disappeared objects, they are not included in the zones at the current frame. All the subproblems related to complex zones are consequently connected by sharing the whole set of occluded targets. In order to simultaneously solve the whole set of subproblems, we construct an augmented version of the matrix in Eq. (2) where the block accounts for potential associations between occluded tracks and current detections:


is a -diagonal matrix ( elsewhere) used to keep occluded tracks still occluded in the current frame. The solution of the optimization problem in Eq. 1 on matrix , obtained by applying the Hungarian algorithm, provides the final tracking associations for this frame.

More precisely, thanks to the peculiar block structure of , a single call to the Hungarian algorithm solves the partitioned association problem in Eq. (3), subject to the constraint that each occluded element can be inserted in a single complex zone subproblem solution. In , simple zone subproblems are isolated by setting the association cost outside the zone to . Similarly, complex zones result in independent blocks as well, but are connected through the presence of occluded elements, i.e. the non-infinite entries in .

By casting the problem using the cost matrix , it is possible to learn, in a joint framework, how to combine features in order to obtain a suitable cost for both the association (either in simple or complex zones) and the partition into zones. To this end we introduce a linear -parametrization on and with a mask vector that selects the features according to the complexity of the zone the pair belongs to:


being the Hadamard product. The feature vector contains both simple and complex information between the -th track and the -th detection:


where are distance functions between track and detection on complex features and respectively. Precisely, selectively activates features according to the following rules:


where the pair target-detection in may (a) belong to the same simple zone, (b) be composed of elements belonging to complex zones and (c) have elements belonging to different zones.
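The masking idea can be sketched in a few lines. The exact activation rules are not reproduced here; the sketch assumes that (a) pairs in the same simple zone use only the simple features, (b) pairs in the same complex zone, or an occluded track paired with a detection in a complex zone, use all features, and (c) every other pair gets an infinite cost. The function name, argument layout and example values are hypothetical.

```python
INF = float("inf")

def association_cost(features, theta, n_simple, track_zone, det_zone, zone_kind):
    """Masked linear cost: theta . (mask * features), with the mask chosen by zone type.

    features lists the simple (distance) features first, then the complex ones.
    track_zone is None for an occluded track, which belongs to no zone.
    """
    occluded = track_zone is None
    if not occluded and track_zone != det_zone:
        return INF                                  # (c) elements in different zones
    if zone_kind[det_zone] == "simple":
        if occluded:
            return INF                              # occluded tracks only enter complex zones
        mask = [1.0] * n_simple + [0.0] * (len(features) - n_simple)   # (a)
    else:
        mask = [1.0] * len(features)                # (b) all features active
    return sum(t * m * f for t, m, f in zip(theta, mask, features))

# Hypothetical pair: two distance features and one appearance feature.
features = [0.2, 0.3, 0.9]
theta = [1.0, 1.0, 2.0]
zone_kind = {0: "simple", 1: "complex"}
in_simple = association_cost(features, theta, 2, 0, 0, zone_kind)
in_complex = association_cost(features, theta, 2, 1, 1, zone_kind)
across = association_cost(features, theta, 2, 0, 1, zone_kind)
occluded = association_cost(features, theta, 2, None, 1, zone_kind)
```

The same weight vector serves both zone types; only the mask decides which of its entries contribute to a given pair.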
The feature vector is computed only on pairs of (possibly occluded) tracks and detections. To extend the parametrization to the whole matrix , it is sufficient to set outside and . Analogously, for elements outside or , we set and when and respectively. The learning procedure in Sec. 7 computes the best weight vector and consequently is learnt as a bias term. Recall that governs tracks initiation and termination. Eq. (3) becomes a linear combination of the weights and a feature map :


The feature map is a function evaluating how well the set of zones and the proposed tracking solution for frame fit on the input data and .

Given a set of weights , the tracking problem in Eq. (11) can be solved by first computing the zones through the divide step on matrix of Eq. (5) and then by conquering the associations in each zone through the Hungarian method on matrix . Note that now and is a subset of .

1:  Let for
2:  for  to  do
3:     Compute simple features for learning to divide Eq. (9)
4:     Latent completion: through Correlation Clustering on of Eq. (5)
5:     Compute complex features for learning to conquer Eq. (9)
6:     Max Oracle: ( through Hungarian on Eq. (14)
7:     Let and
8:     Let and clip to
9:     Update and
10:     Update and
11:  end for
Algorithm 1 Block-Coordinate Primal-Dual Frank-Wolfe Algorithm for learning on a sequence of frames
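As a rough illustration of the update performed for each frame, the sketch below implements a single block-coordinate Frank-Wolfe step in the form given by Lacoste-Julien et al. [16] for structural SVMs. It assumes that the oracle output has already been turned into a feature difference `psi` (ground-truth feature map minus oracle feature map, computed after latent completion) and a loss value `loss`; the names `bcfw_step`, `lam` and `n` are illustrative, and the bookkeeping of the global loss term is omitted.

```python
def bcfw_step(w, w_i, l_i, psi, loss, lam, n):
    """One block-coordinate Frank-Wolfe update for training example i.

    w     : current global weight vector (sum of all per-example blocks)
    w_i   : block of w contributed by example i
    l_i   : loss term accumulated for example i
    psi   : feature difference returned by the max oracle (assumed given)
    loss  : structured loss of the oracle output (assumed given)
    lam   : regularization constant, n : number of training examples
    """
    w_s = [p / (lam * n) for p in psi]
    l_s = loss / n
    diff = [a - b for a, b in zip(w_i, w_s)]
    denom = lam * sum(d * d for d in diff)
    if denom == 0.0:
        return w, w_i, l_i, 0.0                     # nothing to update
    # Optimal step size from line search, clipped to [0, 1].
    gamma = (lam * sum(d * x for d, x in zip(diff, w)) - l_i + l_s) / denom
    gamma = min(1.0, max(0.0, gamma))
    new_w_i = [(1.0 - gamma) * a + gamma * b for a, b in zip(w_i, w_s)]
    new_l_i = (1.0 - gamma) * l_i + gamma * l_s
    new_w = [x + a - b for x, a, b in zip(w, new_w_i, w_i)]
    return new_w, new_w_i, new_l_i, gamma

# Hypothetical first update starting from zero weights.
w, w_i, l_i, gamma = bcfw_step([0.0, 0.0], [0.0, 0.0], 0.0,
                               psi=[1.0, -1.0], loss=1.0, lam=0.5, n=2)
```

Because the step size comes from a closed-form line search, no learning rate has to be tuned.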

7 Online subgradient optimization

The problem of Eq. (11) requires identifying the complex structured object such that is the set of zones that best explains the -th frame tracking solution for an input . Zones are modelled as latent variables, since they remain unobserved during training. To this end, we learn the weight vector in through Latent Structural SVM [26] by solving the following unconstrained optimization problem over the training set :


with being the structured hinge-loss. results from solving the loss-augmented maximization problem



is a loss function that measures the error of predicting the output instead of the correct output while assuming to hold instead of , and we defined for notation convenience.

Solving Eq. (13) is equivalent to finding the output-latent pair generating the most violated constraint, for a given input and a latent setting .

(a) Inference Step
(b) Maximization Oracle
Figure 4: Thanks to the choice of the Hamming loss, the maximization oracle is reduced to an assignment problem efficiently solved through the Hungarian algorithm, as for the inference step.

Despite the generality of the learning framework, the loss function is problem dependent and must be carefully chosen. In particular, we adopted the Hamming loss function which, substituted in Eq. (13), behaves linearly, making the maximization oracle solvable as a standard assignment problem (Fig. 4(b)):


where was dropped as not dependent on either or .
The learning step of Eq. (12) can be efficiently solved online, under the premise that the dual formulation of LSSVM results in a continuously differentiable convex objective after latent completion. We designed a modified version of the Block-Coordinate Frank-Wolfe algorithm [16], presented in Alg. 1. The main insight is that the linear subproblem employed by Frank-Wolfe (line 6) is equivalent to the loss-augmented decoding subproblem of Eq. (14), which can be solved efficiently through the Hungarian algorithm [15]. To deal with latent variables during optimization, we added the latent completion step (line 4) where, given an input/output pair, the latent variable that best explains the solution to the observed data is found. Through the latent completion step, the objective function optimized by Frank-Wolfe is guaranteed to be convex.
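The reduction of the maximization oracle to an assignment problem can be sketched as follows. The example assumes an unnormalized Hamming loss that counts the rows assigned differently from the ground truth; because this loss decomposes over rows, it can be folded into the cost matrix by subtracting 1 from every entry that disagrees with the ground truth. The function names are hypothetical, and a brute-force search again stands in for the Hungarian algorithm.

```python
from itertools import permutations

def best_assignment(cost):
    """Brute-force minimum-cost assignment (stand-in for the Hungarian algorithm)."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda p: sum(cost[r][p[r]] for r in range(n)))

def loss_augmented(cost, truth):
    """Fold the Hamming loss into the costs: entries that disagree with the
    ground-truth assignment become cheaper by 1, so minimizing the new cost
    maximizes (loss - original cost)."""
    return [[c - (0.0 if truth[r] == j else 1.0) for j, c in enumerate(row)]
            for r, row in enumerate(cost)]

# Hypothetical 2x2 problem whose ground truth is the identity assignment.
cost = [[1.0, 1.5],
        [1.5, 1.0]]
truth = (0, 1)
prediction = best_assignment(cost)                           # plain inference
violator = best_assignment(loss_augmented(cost, truth))      # maximization oracle
```

Inference and the oracle therefore share the same solver and differ only in the matrix they receive.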

Figure 5: Tracking results on PETS09-S2L3, 1shatian3 and GVEII from the MCD dataset (top row). AVG-TownCentre, ADL-Rundle-3 and Venice-1 from the MOT Challenge sequences (bottom). Next to images, simple (green) and complex (red) zones are displayed.

8 Experimental results

In this section we present two different experiments that highlight the improvement of our method over state-of-the-art trackers in static camera sequences. The first experiment stresses the method in cluttered scenarios with moderate crowding, where our divide and conquer approach gives its major benefits in terms of both computational speed and performance. The second experiment is on the publicly available MOT Challenge dataset, which is becoming a standard for comparing tracking-by-detection methods. Tests were evaluated employing the CLEAR MOT [6] measures and trajectory-based measures (MT, ML, FRG) as suggested in [20]. All the detections, where not provided by the authors, have been computed using the method in [10] as suggested by the protocol in [20]. Results are averaged per experiment in order to give a quick overview of the tracker performance. Individual sequence results are provided in the additional material. To train the parameters acting on the complex zones, the LSSVM has been trained with ground truth (GT) trajectories and the addition of different levels of random noise simulating missed and false detections. In all the tests, occluded object locations are updated over time using a Kalman Filter with a constant velocity state transition model, and discarded if not reassociated after 15 frames.
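The handling of occluded objects can be illustrated with a minimal sketch of the prediction step only. It propagates the state mean under a constant velocity model and drops a track after 15 unmatched frames; the covariance propagation and the measurement update of a full Kalman Filter are omitted, and the dictionary layout and names are assumptions.

```python
MAX_OCCLUDED_FRAMES = 15   # discard threshold stated in the text

def predict_occluded(track, dt=1.0):
    """Advance an occluded track by one frame with a constant velocity model.

    track is a dict with keys x, y, vx, vy and missed (frames without a match).
    Returns the updated track, or None once it has been occluded for too long.
    Only the state mean is propagated; covariance handling is left out.
    """
    moved = dict(track)
    moved["x"] = track["x"] + dt * track["vx"]
    moved["y"] = track["y"] + dt * track["vy"]
    moved["missed"] = track["missed"] + 1
    return moved if moved["missed"] <= MAX_OCCLUDED_FRAMES else None

# Hypothetical target moving one unit right and two units down per frame.
track = {"x": 0.0, "y": 0.0, "vx": 1.0, "vy": 2.0, "missed": 0}
for _ in range(15):
    track = predict_occluded(track)
after_15 = track
after_16 = predict_occluded(track)
```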

8.1 Features

The strength of the proposal is the joint LSSVM framework that learns to weight features for both partitioning the scene and associating targets. On these premises, we purposely adopted standard features. Without loss of generality, the method can be expanded through additional and more complex features as well. The features always refer to a single detection and a single track , occluded or not, and its associated history, in compliance with Eq. (9).

In the experiments, the appearance of the targets is modeled through a color histogram in the RGB space. Every time a new detection is associated to a track, its appearance information is stored in the track history. The appearance feature is then computed as the average value of the Kullback-Leibler distance of the detection histogram from the previous instances of the track. Additionally, we designed tracks to contain their full trajectories over time. Having the trajectories available, we modeled the motion coherence of a detection w.r.t. a track by evaluating the smoothness of the manifold fitted on the joint set of the newly detected point and the track spatial history. More precisely, given a detected point, an approximate value of the Ricci curvature is computed by considering only the subset of detections of the trajectory lying inside a given neighborhood of the detected point. An extensive presentation of this feature is in [11].
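The appearance feature can be sketched as an average Kullback-Leibler divergence between the detection histogram and the histograms stored in the track history. The direction of the divergence (detection relative to each stored instance), the additive smoothing constant `eps`, and the function names are assumptions made for this illustration.

```python
from math import log

def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) between two histograms, smoothed and normalized first."""
    p = [v + eps for v in p]
    q = [v + eps for v in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * log((a / sp) / (b / sq)) for a, b in zip(p, q))

def appearance_cost(det_hist, track_history):
    """Average divergence of a detection histogram from the stored track instances."""
    return sum(kl_divergence(det_hist, h) for h in track_history) / len(track_history)

# Hypothetical 4-bin histograms.
history = [[10, 5, 3, 2], [9, 6, 3, 2]]
same = appearance_cost([10, 5, 3, 2], [[10, 5, 3, 2]])
similar = appearance_cost([10, 6, 2, 2], history)
different = appearance_cost([1, 2, 7, 10], history)
```

A detection that looks like the stored instances yields a small cost, while one with a very different color distribution is penalized.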

Figure 6: Tracks length curves (TL) on MCD sequences. The gray shaded area indicates the performances reached by a simple global NN algorithm (lower bound) and the highest score obtained for each track combining all different methods results (upper bound).

8.2 Datasets and Settings

Mildly Crowded Dataset (MCD): the dataset is a collection of moderately crowded videos taken from public benchmarks with the addition of ad-hoc sequences. It consists of 4 sequences: the well-known PETS09 S2L2 and S2L3 sequences, and 2 new sequences. GVEII is characterized by a high number of pedestrians crossing the scene (up to 107 people per frame), while 1shatian3, captured by [27], is a sequence characterized by high density and clutter (up to 227 people per frame). A single training stage was performed by gathering the first of each video. These frames have not been used at test time.

MOT Challenge: the dataset consists of several publicly available sequences in different scenarios. Detections and annotations are provided by the MOTChallenge website. In our tests we consider the subset of sequences coming from fixed cameras, since distances are not meaningful in the moving camera setting: TUD-Crossing, PETS09-S2L2, AVG-TownCentre, ADL-Rundle-3, KITTI-16 and Venice-1.
Learning was performed on a distinct set of sequences provided on the website for training.

Method                  Appearance  MOTA  MOTP  MT  ML  IDS  FRG
LDCT                    w.n.        47.7  68.8  88  26  209  103
LDCT (all features)                 40.6  66.3  61  43  446  193
LDCT (only simple)                  36.4  64.7  58  50  586  276
Bae and Yoon [2]                    39.0  65.8  84  35  637  289
Possegger et al. [21]               38.7  65.0  79  37  455  440
Milan et al. [19]                   40.6  66.7  64  42  242  141
Table 2: Average results on MCD. In the appearance column, w.n. is when needed. More details on the light gray baselines in the text.
Method    MOTA  MOTP  MT  ML  FP    FN    IDS  FRG
LDCT      43.1  74.5  9   10  682   2780  161  187
RMOT      30.4  70.2  2   27  1011  3259  74   125
TC_ODAL   24.2  70.9  1   31  1047  3528  75   152
MotiCon   32.0  70.6  2   30  777   3280  110  105
SegTrack  32.3  72.1  3   38  520   3454  80   76
CEM       28.1  71.2  5   24  1256  3088  87   97
SMOT      23.9  71.7  2   27  706   3627  120  208
TBD       28.0  71.3  3   25  1233  3083  192  193
DP_NMS    22.7  71.4  3   17  1062  3052  529  325
Table 3: Averaged results of our method (LDCT) and the other MOT Challenge competitors on the 6 fixed camera sequences. See: http://www.motchallenge.net for detailed results.
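For readers less familiar with the CLEAR MOT measures, the sketch below computes MOTA from its standard definition: one minus the ratio between the total number of errors (missed targets, false positives and identity switches) and the total number of ground-truth objects over all frames. The counts in the example are hypothetical and are not taken from the tables.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    """Multiple Object Tracking Accuracy as defined by the CLEAR MOT measures.

    All arguments are totals accumulated over every frame of a sequence.
    The value can be negative when the errors outnumber the ground-truth objects.
    """
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_objects

# Hypothetical counts for a short sequence.
score = mota(false_negatives=30, false_positives=10, id_switches=5, num_gt_objects=100)
```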

8.3 Comparative evaluation

Results on MCD: Quantitative results of our proposal on the MCD dataset, compared with state-of-the-art trackers, are presented in Tab. 2, while visual results are in Fig. 5. We compared against two very recent online methods [21, 2] that focus either on target motion or appearance. Moreover, the offline method [19] has been considered, being one of the most effective MTT methods to date. In the challenging MCD sequences, we outperform the competitors in terms of MT values while also having the lowest number of IDS and FRG. This is mainly due to the selective use of the proper features depending on the outcome of the divide phase of our algorithm, which allows our tracker to take the best of both worlds with respect to [21] and [2]. The MOTA measure is higher as well, confirming the overall quality of the proposed tracking scheme. Additionally, in Fig. 6 we report the track length curves (TL) on the MCD dataset. The TL curve is computed by considering the length of the correctly tracked GT trajectories plotted in descending order. The plot gives information on the ability of the tracker to build continuous and correct tracks for all the ground truth elements in the scene, neglecting the amount of false tracks inserted. Our AUC is always greater than that of the competitors thanks to the adoption of complex zones, which effectively deal with occluded or disappeared objects and keep the tracks longer.

To evaluate the improvement due to the adoption of the divide and conquer steps, which is the foundation of our tracker, in Tab. 2 we also test two baselines: when either all features or spatial features only were used for all the assignments independently of the zone type. In both tests, the divide step, the parameter learning and occlusion handling remain as previously described. Improvement of the complete method (dark gray) over these baselines (light gray) suggests that complex features are indeed more beneficial when used selectively.

Results on MOT Challenge: Tab. 3 summarizes the accuracy of our method compared to other state-of-the-art algorithms on the MOT Challenge dataset. Similarly to the MCD experiment, we observe that our algorithm outperforms the other state-of-the-art methods. Our method achieves the best results in most of the metrics, keeping IDS and FRG relatively low as well. In turn, our method records the highest MOTA by a significant margin (+10%). Excellent results on this dataset highlight the generalization ability of our method, which was trained on sequences different from (although similar to) the ones in the test evaluation. Fig. 5 shows some qualitative examples of our results.
Furthermore, our online tracker has been designed to be considerably fast. We report an average performance of 10 fps on the MOT Challenge sequences. The runtime is strongly influenced by the number of detections as well as by the number of tracks created up to a specific frame. The performance is in line with or faster than the majority of current methods, which report an average of 3-5 fps.

The computational complexity of solving Eq. (1) using the Hungarian algorithm is with the number of tracks and detections to be associated and the number of occluded tracks. Since the complexity of the divide step is linear in the number of targets, our algorithm reduces the assignment complexity to . The first term applies to simple zones and is linear in , being dominated by , the average number of detections in every partition (). The second term modulates the complexity of the association algorithm in complex zones by the factor , the percentage of complex zones in the scene. Finally, the term is related to the recall of the chosen detector. As an example, can be realistically set to and, if the percentage of complex zones is , the algorithm is faster than its original counterpart.
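The intuition behind the reduction can be illustrated with back-of-the-envelope arithmetic. The sketch below is not the complexity expression of the text: it simply assumes a cubic cost for an assignment over a given number of elements and compares one global problem with the same elements split into independent zones. The sizes are hypothetical.

```python
def cubic_cost(sizes):
    """Total cost of solving independent assignment problems, assuming each
    one scales with the cube of its size (constants ignored)."""
    return sum(s ** 3 for s in sizes)

# Hypothetical scene with 60 elements, either solved globally
# or split into 20 independent zones of 3 elements each.
global_cost = cubic_cost([60])
zoned_cost = cubic_cost([3] * 20)
speedup = global_cost / zoned_cost
```

Under these assumptions the partitioned problem is cheaper by a factor equal to the square of the number of zones, which is why small zones matter more than the exact constants.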

9 Conclusion

In this work, we proposed an enhanced version of the Hungarian online association model to match recent advances in features and cope with the peculiarities of different sequences. The algorithm is able to learn to effectively partition the scene and choose the proper feature combination to solve simple and complex associations in an online fashion. As observed in the experiments, the benefits of our divide and conquer approach are evident in terms of both the computational complexity of the problem and tracking accuracy.

The proposed tracking framework can be extended or enriched with a different set of simple and complex features, and it can learn to identify the relevant ones for the specific scenario (although the analogy with cognitive theory holds for spatial features only). This leaves considerable room for improvement by allowing the community to test the method with more complex and sophisticated features. We invite the reader to download the code and to test it by adding her favorite features.

Appendix A: Block-Coordinate Frank-Wolfe optimization of Latent Structural SVM

In a recent paper, Lacoste-Julien et al. [16] demonstrated the efficient use of Block-Coordinate Frank-Wolfe optimization for training structural SVMs. They noted that, given access to a maximization oracle, subgradient methods can be applied to solve the non-smooth unconstrained problem of Eq. (15). The notation follows the one used in the paper.


where is exactly the optimal value for the necessary max oracle. The Lagrange dual of the above -slack formulation of Eq. (15) has potential support vectors. Writing for the dual variables associated with the -th training example and potential output , the dual problem is given by


Here matrix and vector are constructed by simple Lagrangian derivation. Only two requirements need to be satisfied in order to apply the Frank-Wolfe algorithm to the problem of Eq. (16):

  • the domain of has to be compact, and

  • the convex objective has to be continuously differentiable.

Observe that the domain of Eq. (16) is the product of probability simplices, . It is thus compact by the geometrical definition of simplex. We now present matrix and vector for the latent formulation of SSVM and check that is continuously differentiable. Recalling that for LLSVM the loss augmented decoding subproblem is expressed as


and omitting the Lagrangian dual derivation as it is a simple mathematical procedure, we obtain to be composed of a set of columns:


where indicates the row index. Analogously, the column vector is built as follows.


The function is now differentiated by


where is the stationarity KKT condition that has to hold in order to make the duality strong. By substituting the definition of and for specific values of we obtain


This is the same hinge loss as in Eq. (17). So once again, the intuition holds that the linear subproblem the Frank-Wolfe algorithm has to solve is strongly connected to the loss-augmented decoding subproblem. Nevertheless, as opposed to the non-latent case, here the hinge loss also depends on the latent variable, which makes the problem non-convex. Thus, before computing the , a latent completion step (line 4 of Alg. 1) is needed in order to ensure to be continuously differentiable over the whole domain except for a finite number of points (a sufficient condition). Once these precautions are taken, the latent formulation reduces to the standard SSVM case and, as such, all convergence results also apply to the latent case.
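To make the role of the linear subproblem more concrete, the following is a minimal sketch of plain Frank-Wolfe on a single probability simplex. It is not the block-coordinate variant of [16] and not the latent SSVM dual: the quadratic objective, the `target` point, and the function names are assumptions chosen purely for illustration. It shows that, on a simplex, the linear subproblem reduces to picking the vertex with the smallest gradient entry, which is the analogue of calling a max oracle.

```python
def frank_wolfe_simplex(grad, dim, iters=2000):
    """Minimize a smooth convex function over the probability simplex.

    `grad` maps a point (list of floats) to its gradient (list of floats).
    """
    alpha = [1.0 / dim] * dim  # feasible start: barycenter of the simplex
    for k in range(iters):
        g = grad(alpha)
        # Linear subproblem: a linear function over a simplex is minimized
        # at a vertex, i.e. at the coordinate with the smallest gradient.
        j = min(range(dim), key=lambda i: g[i])
        gamma = 2.0 / (k + 2.0)  # standard diminishing step size
        # Convex combination of the current point and the chosen vertex,
        # so the iterate never leaves the simplex.
        alpha = [(1.0 - gamma) * a for a in alpha]
        alpha[j] += gamma
    return alpha


# Toy objective (an assumption): f(alpha) = 0.5 * ||alpha - target||^2,
# whose minimizer over the simplex is `target` itself.
target = [0.2, 0.5, 0.3]


def toy_grad(alpha):
    return [a - t for a, t in zip(alpha, target)]


solution = frank_wolfe_simplex(toy_grad, dim=3)
```

Because every update is a convex combination of feasible points, no projection step is needed; this is what makes the compactness of the domain and the differentiability of the objective the only two requirements to check.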

Appendix B: Computational complexity details

In this section we provide some details on the computational complexity results presented in the paper. Recall that if the data association problem is tackled as a whole, the complexity of finding a minimum-cost perfect matching amounts to


according to widely used Hungarian implementations, where is the number of currently active tracks and detections and is the number of occluded tracks that can still be considered for association.

For our method, two distinct contributions to the complexity of the overall algorithm have to be considered:

  • divide, which is accomplished through a correlation clustering (CC) step on elements.

  • conquer, i.e. the application of the Hungarian algorithm to the generated subproblems.

The division step can be linear or even sublinear in the number of elements to be split. This can be obtained through one of the many recent approximate solutions to CC (which is NP-hard to solve optimally) [17, 8]. The conquer step has to be evaluated considering that complex and simple zones are solved differently. Simple zones result in independent subproblems that can be solved directly through the Hungarian algorithm. Conversely, the complex zones are not independent, and occluded targets have to be considered as well in the overall association subproblem.

Let be the number of clusters created by the CC and suppose a uniform partition of tracks and detections among these clusters. To simplify the notation, let us call the average number of elements (tracks and detections) inside a zone. If is the overall number of active tracks and detections, each cluster has approximately tracks and detections that need to be associated. Independently of the zone, these two quantities must coincide for the zone to be simple, so the complexity of solving a simple zone is . Note that for large , will typically be much smaller than . If simple clusters are a fraction of the overall number of clusters , the final complexity of solving simple zones is which reduces to in the worst-case hypothesis. Note that these subproblems can be solved in parallel and typically when is large. This is because , the number of interfering tracks/detections, is limited by the non-maximum suppression response of a detector.

Complex zones have to be solved all together due to the shared occluded tracks. If is the number of simple groups, the number of complex groups is or for notational convenience. Since the numbers of tracks and detections are no longer equal, the number of rows/columns to consider for each group is . We thus obtain a complexity of and, due to the addition of occluded targets, the complexity increases to . Summing up, we obtain that the overall complexity of the conquer step is


Note that when , all zones are simple and only the first term matters; while if then the contribution of the first term vanishes and the second term reduces to a standard Hungarian over all the tracks/detections as in Eq. (22).
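The two limiting cases can be checked numerically with a small operation-count sketch. It is only illustrative: the cubic cost model, the function names, and the convention that a zone's size is the number of rows/columns of its cost matrix are assumptions, not the exact expressions of the paper. Simple zones are solved independently, while complex zones are merged into a single problem that also includes the occluded tracks.

```python
def hungarian_cost(size):
    """Cubic operation count commonly quoted for Hungarian implementations."""
    return size ** 3


def global_cost(n_elements, n_occluded):
    """Cost of solving the association as one problem over everything."""
    return hungarian_cost(n_elements + n_occluded)


def divide_and_conquer_cost(zone_sizes, is_complex, n_occluded):
    """Cost of the conquer step after the scene has been partitioned.

    zone_sizes: cost-matrix size of each zone (hypothetical convention).
    is_complex: one boolean per zone, True if the zone is complex.
    """
    # Simple zones: independent subproblems, one Hungarian call each.
    simple = sum(hungarian_cost(s)
                 for s, c in zip(zone_sizes, is_complex) if not c)
    # Complex zones: solved jointly because they share occluded tracks.
    complex_elements = sum(s for s, c in zip(zone_sizes, is_complex) if c)
    if complex_elements == 0:
        return simple  # all zones simple: only the first term remains
    return simple + hungarian_cost(complex_elements + n_occluded)


zones = [10] * 10  # ten zones of ten elements each (toy numbers)
all_simple = divide_and_conquer_cost(zones, [False] * 10, n_occluded=5)
all_complex = divide_and_conquer_cost(zones, [True] * 10, n_occluded=5)
mixed = divide_and_conquer_cost(zones, [True] * 2 + [False] * 8, n_occluded=5)
whole = global_cost(100, n_occluded=5)
```

With these toy numbers, the all-complex case costs exactly as much as the global problem, while the mixed case is cheaper by well over an order of magnitude, which matches the qualitative behavior described above.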

Appendix C: Kalman filtering

In order to propagate track positions through occlusions, we employed a simple Kalman filter predictor with a constant velocity model. This basically means that, while unobserved, tracks keep moving under the assumption that their velocity does not change over time. More formally, the standard discrete Kalman filter formulation, when no input is considered, is:

$$\mathbf{x}_k = A\,\mathbf{x}_{k-1} + \mathbf{w}_k, \qquad \mathbf{z}_k = H\,\mathbf{x}_k + \mathbf{v}_k,$$

where the first equation is the state equation and the second one is the measurement equation. Here $\mathbf{x}_k$ represents the state of a track at time $k$, while $\mathbf{z}_k$ is its measured position. $H$ is the matrix that relates the state to the measurement. $A$ is called the state space matrix and describes how the model evolves over time according to its intrinsic physical properties. $\mathbf{v}_k$ and $\mathbf{w}_k$ are the measurement and state noise random variables. During occlusions, the observation cannot be directly measured, so we need to rely on the second relation, $\mathbf{z}_k = H\,\mathbf{x}_k$, and cannot correct the model. The state of a track is usually represented by a four-dimensional vector containing its position and velocity:

$$\mathbf{x}_k = \begin{pmatrix} x_k & y_k & \dot{x}_k & \dot{y}_k \end{pmatrix}^{\top}.$$

To describe a constant velocity linear model, we need to specify $A$ and $H$ as follows (with a unit time step between frames):

$$A = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad H = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}.$$

By substituting the equations, and ignoring the noise terms for the sake of simplicity, we obtain the measurement vector:

$$\mathbf{z}_k = H A\,\mathbf{x}_{k-1} = \begin{pmatrix} x_{k-1} + \dot{x}_{k-1} \\ y_{k-1} + \dot{y}_{k-1} \end{pmatrix},$$

which corresponds exactly to a constant velocity model due to the identity 2x2 submatrix in the lower right corner of $A$.
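The prediction-only behavior during an occlusion can be sketched in a few lines. This is a minimal illustration under stated assumptions: the state layout (x, y, vx, vy), a unit time step between frames, and the names `A`, `H`, `matvec`, and `predict_track` are choices made here for the example; noise terms and the correction step are deliberately left out, since no measurement is available while a track is occluded.

```python
# State transition for a constant velocity model with unit time step:
# position is advanced by velocity, velocity is left unchanged.
A = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Measurement matrix: only the position part of the state is observed.
H = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def predict_track(state, n_frames):
    """Propagate a track through `n_frames` occluded frames.

    Returns the final state and the list of predicted positions.
    """
    positions = []
    for _ in range(n_frames):
        state = matvec(A, state)             # state equation, noise ignored
        positions.append(matvec(H, state))   # predicted measurement
    return state, positions


# A track at (10, 5) moving 2 px/frame right and 1 px/frame up.
final_state, predicted = predict_track([10.0, 5.0, 2.0, -1.0], n_frames=3)
```

Since the velocity rows of `A` form an identity block, the velocity entries of `final_state` are identical to the initial ones, and the predicted positions lie on a straight line.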

Appendix D: Detailed experimental results

In the paper we had to omit some details of the experimental results. Due to space limitations, we presented the results only as averages over the whole sequence set of the Mildly Crowded Dataset (MCD) and the fixed camera sequences of the MOT Challenge (MOT) benchmark. In Tab. 4 and Tab. 5 we report per-sequence results. In particular, for MCD we also report results for the considered competitors, while for the MOT benchmark results are reported for our method only and we refer the reader to the benchmark site for competitors' results.

| Sequence | Method | onl. | app. | MOTA | MOTP | MT | ML | IDS | FRG |
|---|---|---|---|---|---|---|---|---|---|
| PETS09-S2-L2 (42 pedestrians, up to 33 per frame) | LDCT (our) | | w.n. | 47.4 | 70.8 | 6 | 3 | 297 | 300 |
| | LDCT (all features) | | | 41.3 | 69.7 | 4 | 7 | 411 | 252 |
| | LDCT (only simple) | | | 35.7 | 68.8 | 3 | 9 | 497 | 323 |
| | Bae and Yoon [2] | | | 30.2 | 69.2 | 1 | 8 | 284 | 499 |
| | Possegger et al. [21] | | | 40.0 | 68.6 | 8 | 3 | 211 | 342 |
| | Milan et al. [19] | | | 44.9 | 70.2 | 5 | 6 | 150 | 165 |
| PETS09-S2-L3 (44 pedestrians, up to 42 per frame) | LDCT (our) | | w.n. | 35.2 | 66.7 | 6 | 15 | 120 | 12 |
| | LDCT (all features) | | | 30.6 | 65.1 | 1 | 20 | 235 | 45 |
| | LDCT (only simple) | | | 26.1 | 63.2 | 1 | 25 | 316 | 62 |
| | Bae and Yoon [2] | | | 28.8 | 62.3 | 8 | 17 | 96 | 150 |
| | Possegger et al. [21] | | | 32.2 | 64.1 | 5 | 12 | 79 | 111 |
| | Milan et al. [19] | | | 31.3 | 64.6 | 7 | 23 | 71 | 56 |
| GVEII (630 pedestrians, up to 107 per frame) | LDCT (our) | | w.n. | 65.6 | 73.5 | 208 | 63 | 285 | 71 |
| | LDCT (all features) | | | 55.6 | 70.5 | 172 | 101 | 548 | 320 |
| | LDCT (only simple) | | | 50.9 | 67.9 | 151 | 113 | 753 | 418 |
| | Bae and Yoon [2] | | | 57.9 | 71.1 | 200 | 75 | 1023 | 320 |
| | Possegger et al. [21] | | | 51.1 | 69.8 | 153 | 98 | 844 | 652 |
| | Milan et al. [19] | | | 49.3 | 71.2 | 147 | 87 | 312 | 244 |
| 1shatian3 (239 pedestrians, up to 227 per frame) | LDCT (our) | | w.n. | 42.6 | 61.0 | 133 | 23 | 137 | 32 |
| | LDCT (all features) | | | 34.7 | 59.9 | 68 | 45 | 592 | 154 |
| | LDCT (only simple) | | | 32.8 | 58.9 | 79 | 52 | 776 | 301 |
| | Bae and Yoon [2] | | | 38.9 | 60.7 | 150 | 40 | 1146 | 185 |
| | Possegger et al. [21] | | | 31.5 | 57.4 | 110 | 35 | 686 | 654 |
| | Milan et al. [19] | | | 36.8 | 60.8 | 98 | 50 | 435 | 98 |

Table 4: Comparison of the proposed method with state-of-the-art methods on the MCD dataset. In the appearance (app.) column, w.n. means when needed. For each sequence, we also ran our code by always associating based on the whole feature set (all features) and on simple features only (only simple) as baselines.

| Sequence | MOTA | MOTP | GT | MT | ML | FP | FN | IDS | FRG |
|---|---|---|---|---|---|---|---|---|---|
| TUD-Crossing | 67.7 | 82.9 | 13 | 9 | 1 | 100 | 205 | 51 | 30 |
| PETS09-S2L2 | 47.4 | 70.8 | 42 | 6 | 3 | 995 | 3779 | 297 | 300 |
| AVG-TownCentre | 31.7 | 72.2 | 226 | 36 | 24 | 1878 | 2608 | 395 | 509 |
| ADL-Rundle-3 | 25.2 | 73.4 | 44 | 2 | 27 | 453 | 7039 | 110 | 101 |
| KITTI-16 | 53 | 79 | 17 | 2 | 3 | 91 | 665 | 44 | 57 |
| Venice-1 | 33.5 | 68.4 | 17 | 0 | 5 | 578 | 2384 | 73 | 129 |
| Mean scores | 43.1 | 74.5 | 59.8 | 9.2 | 10.5 | 682.5 | 2780 | 161.7 | 187.7 |

Table 5: Per-sequence results of our method on the MOT Challenge fixed camera sequences. The last row contains the mean values and is the one reported in the paper for comparison. Refer to the benchmark website (http://motchallenge.net/results_detail) for competitors' detailed results.


References

  • [1] G. A. Alvarez and S. L. Franconeri. How many objects can you track?: Evidence for a resource-limited attentive tracking mechanism. Journal of Vision, 7(13), Oct. 2007.
  • [2] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1218–1225, June 2014.
  • [3] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56:89–113, July 2004.
  • [4] R. Benenson, M. Omran, J. Hosang, and B. Schiele. Ten years of pedestrian detection, what have we learned? In European Conference on Computer Vision Workshops, pages 613–627. 2015.
  • [5] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1806–1819, Sept. 2011.
  • [6] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP Journal on Image and Video Processing, 2008(1):246309, 2008.
  • [7] M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1515–1522, Sept 2009.
  • [8] E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2–3):172 – 187, 2006. Approximation and Online Algorithms.
  • [9] C. Dicle, O. Camps, and M. Sznaier. The way they move: Tracking multiple targets with similar appearance. In 2013 IEEE International Conference on Computer Vision (ICCV), pages 2304–2311, Dec. 2013.
  • [10] P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, Aug. 2014.
  • [11] D. Gong, X. Zhao, and G. G. Medioni. Robust multiple manifold structure learning. In ICML, 2012.
  • [12] M. A. Goodale and A. Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20 – 25, 1992.
  • [13] M. Hofmann, M. Haag, and G. Rigoll. Unified hierarchical multi-object tracking using global data association. In 2013 IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), pages 22–28, Jan. 2013.
  • [14] D. Kahneman, A. Treisman, and B. J. Gibbs. The reviewing of object files: Object-specific integration of information. Cognitive Psychology, 24(2):175–219, Apr. 1992.
  • [15] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, Mar. 1955.
  • [16] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In International Conference on Machine Learning, 2013.
  • [17] J. Li, X. Huang, C. Selke, and J. Yong. A fast algorithm for finding correlation clusters in noise data. In Z.-H. Zhou, H. Li, and Q. Yang, editors, Advances in Knowledge Discovery and Data Mining, volume 4426 of Lecture Notes in Computer Science, pages 639–647. Springer Berlin Heidelberg, 2007.
  • [18] Y. Li, C. Huang, and R. Nevatia. Learning to associate: Hybridboosted multi-target tracker for crowded scene. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2953–2960, June 2009.
  • [19] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multitarget tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):58–72, Jan. 2014.
  • [20] A. Milan, L. Leal-Taixé, K. Schindler, S. Roth, and I. Reid. MOT Challenge. http://www.motchallenge.net, 2014.
  • [21] H. Possegger, T. Mauthner, P. M. Roth, and H. Bischof. Occlusion geodesics for online multi-object tracking. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1306–1313, June 2014.
  • [22] Z. Pylyshyn. The role of location indexes in spatial perception: a sketch of the FINST spatial-index model. Cognition, 32(1):65–97, June 1989.
  • [23] A. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1442–1468, July 2014.
  • [24] Z. Wu, J. Zhang, and M. Betke. Online motion agreement tracking. In Proceedings of the British Machine Vision Conference. BMVA Press, 2013.
  • [25] B. Yang and R. Nevatia. Multi-target tracking by online learning a crf model of appearance and motion patterns. Int. J. Comput. Vision, 107(2):203–217, Apr. 2014.
  • [26] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1169–1176, New York, NY, USA, 2009. ACM.
  • [27] B. Zhou, X. Tang, H. Zhang, and X. Wang. Measuring crowd collectiveness. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1586–1599, Aug 2014.