A common concern in Multi Object Tracking (MOT) approaches is to prevent identity switching, the erroneous merging of trajectories corresponding to different targets into a single one. This is difficult in crowded scenes, especially when the appearances of the individual target objects are not distinctive enough. Many recent approaches rely on tracklets—short trajectory segments—rather than individual detections, to keep track of the target objects. Tracklets can be merged into longer trajectories, which can be split again when an identity switch occurs.
Most state-of-the-art approaches rely on deep networks, often in the form of RNN architectures that operate on such tracklets [34, 18, 25, 56, 27]. This requires training the sequence models and is subject to one or both of two well-known problems, which our approach overcomes:
Loss-evaluation mismatch. It occurs when a model is trained by optimizing a metric that is poorly aligned with the performance actually desired at inference time. In MOT, one example is the use of a classification loss to create trajectories that should be optimal for a tracking-specific metric, such as MOTA or IDF. To eliminate this mismatch, we introduce an original way to score tracklets that is an explicit proxy for the IDF metric and can be computed without the ground truth. We use it to identify how confidently the person is tracked, predict tighter bounding box locations, and estimate whether the real trajectory extends beyond the observed tracklet.
Exposure bias. It stems from the model not being exposed to its own errors during training and results in very different data distributions being observed during training and during inference/tracking. We remove this bias by introducing a much more exhaustive, yet computationally feasible, approach to exploiting the data while training the model. To this end, during training, we do not limit ourselves to only using tracklets made of detections of one or two people as in [40, 35, 48]. Instead, we consider any grouping of tracklets produced by the tracking algorithm to be a potential trajectory but prevent a combinatorial explosion by controlling the number of tracklets that share many common detections. This yields a much richer training dataset, solves the exposure bias problem, and enables our algorithm to handle confusing situations in which a tracking algorithm may easily switch from one person to the next or miss someone altogether. Fig. 1 depicts one such case. Note that this can be done even when appearance information is not available.
Our contribution is therefore a solution to these two problems. By integrating it into an algorithm that only uses very simple features (bounding boxes and detector confidence), we outperform other approaches that do not use appearance features either. By also exploiting appearance-based features, we similarly outperform state-of-the-art approaches that do. Taken together, these results demonstrate the effectiveness of our training procedure.
In the remainder of this paper, we first briefly review related work and current approaches to mitigating loss-evaluation mismatch and exposure bias. We then introduce our approach to tracking; it is a variation of multiple hypothesis tracking designed for learning to efficiently score the tracklets. Next, we describe the exact form of our scoring function and its ability to reduce both mismatch and bias. Finally, we present our results.
2 Related work
Multiple Object Tracking (MOT) has a long tradition, going back many years for applications such as radar tracking. With the recent improvements of object detectors, the tracking-by-detection paradigm has become a de facto standard and has proven effective for many applications such as surveillance or sports player tracking. It involves first detecting the target objects in individual frames, associating these detections into short but reliable trajectories known as tracklets, and then concatenating these tracklets into longer trajectories. They can then be used to solve tasks such as social scene understanding [1, 3], future location prediction, or human dynamic modeling.
When grouping individual detections into trajectories, it is difficult to guarantee that each resulting trajectory represents the whole track of a single individual, that is, that there are no identity switches.
Many approaches rely on appearance [17, 29, 57, 59, 11, 33, 47], motion, or social cues [20, 44]. They are mostly used to associate pairs of detections, and only account for very short-term correlations. However, since people's trajectories are often predictable over many frames once a few have been seen, superior performance could be obtained by modeling behavior over longer time periods [22, 28, 37]. The increasing availability of annotated training data and benchmarks, such as MOT15-17 [30, 39], DukeMTMC, PathTrack, and Wildtrack, now makes it possible to learn the data association models required to leverage this knowledge. Since this is what our method does, we briefly review here a few state-of-the-art approaches to achieving this goal.
2.1 Modeling Longer Sequences
One of the first recent approaches to modeling long trajectories relies on a recurrent neural network. The algorithm estimates ground-plane occupancy, but does not perform explicit data association. Another approach performs data association without using appearance features by predicting the future location of the target. Several MOT approaches have followed, using sequence models to make data association more robust for the purpose of people re-identification [48, 35], learning better social models, forecasting future locations [31, 54], or joint detection, tracking, and activity recognition.
These models are usually trained on sample trajectories that perfectly match a single person's trajectory or only marginally deviate from it, making them vulnerable to exposure bias. Furthermore, the loss function is usually designed primarily for localization or identification rather than fidelity to a ground truth trajectory. This introduces a loss-evaluation mismatch with the metric, usually IDF or MOTA, which reflects more reliably the desirable behavior of the algorithm.
Most state-of-the-art approaches that use sequence models rely on one of two optimization techniques, hierarchical clustering for data association [49, 59, 46, 34, 18, 25] or multiple hypothesis tracking [56, 27, 10]. The former involves valid groups of observations without shared hypotheses while the latter allows for conflicting sets of hypotheses to be present until the final solution is found. The approach most similar to ours also uses a combination of a multiple hypothesis tracker and a sequence model for scoring. However, its training procedure mostly relies on ground truth information and is therefore more subject to exposure bias. Another closely related method trains a sequence model for data association solely from geometric features and is therefore well-suited for comparison with our approach when also using only geometric cues. These methods are all recent and collectively represent the current state-of-the-art. In Section 5, we will therefore treat them as baselines against which we can compare our approach.
2.2 Reducing Bias and Loss-Evaluation Mismatch
Since exposure bias and loss-evaluation mismatch are also problems in Natural Language Processing (NLP), and in particular in machine translation, several methods have been proposed in these fields to reduce them [45, 4]. Most of them, however, operate under the assumption that output sequences can comprise any character from a predefined set. As a result, they typically rely on a beam-search procedure, which itself frequently uses a language model to produce a diverse set of candidates that contains the correct one. More generally, techniques that allow training models without a differentiable relation between inputs and outputs, such as policy gradient, straight-through estimation, and Gumbel-Softmax, can be seen as methods for reducing exposure bias.
Unfortunately, in the case of MOT, the detections form a spatio-temporal graph in which many nearly identical trajectories can be built. This can easily overwhelm standard beam-search techniques: when limiting oneself to only the top scoring candidates to prevent a combinatorial explosion, it can easily happen that only a set of very similar but spurious trajectories is considered and the real one is ignored. This failure mode has been addressed in the context of single-object tracking and future location prediction in [21, 36] with a tracking policy learned by reinforcement learning, and elsewhere by introducing a spatio-temporal attention mechanism over a batch of images, which ensures that there is no exposure bias within the batch. However, the latter algorithm relies on historical positive samples from already obtained tracks, which re-introduces the bias. For MOT, a reinforcement learning-based approach has been proposed to decide whether to create new tracklets or terminate old ones. This is also addressed in other work, but there the learning of sequence models is done independently and is still subject to exposure bias. Another approach attempts to explicitly optimize for the IDF metric by refining the output of other tracking methods. This reduces the loss-evaluation mismatch, but the sequence scoring model is hard-coded rather than learned, and we will show that learning it yields better results.
3 Tracklet-Based Tracking
Our approach to tracking relies on creating and merging tracklets to build high-scoring trajectories as in multiple hypothesis tracking. In this section, we formalize it and describe its components, assuming that the scoring function is given. The scoring function and how it is learned will be discussed in the following section.
Let us consider a video sequence on which we run a people detection algorithm on each frame individually. This yields a set of people detections, each described by four numbers that are the coordinates of the corresponding bounding box in the image. We represent a tracklet as a matrix with one column per frame, where each column holds the bounding box of the tracked person in that frame. In practice, tracklets only rarely span the whole sequence. We handle this by setting a column to zero for frames in which the person's location is unknown. The first non-zero column of a tracklet is therefore its start and the last its end. Two tracklets can be merged into a single one if there is no single frame in which they contain different detections.
Let us assume a feature function that assigns a fixed-dimensional feature vector to each column of a tracklet. In practice, these features can be bounding box coordinates, confidence level, and shift from the nearest detection in a previous frame. They can also be image-based features associated with the detection, and we list them all in Section 5.3. Let us further assume that we can compute from these features a score that is high when the tracklet truly represents a single person's trajectory and low otherwise. Tracking can then be understood as building the set of non-overlapping tracklets that maximizes the objective function, namely the sum of the scores of the selected tracklets.
In the remainder of this section, we will assume that this scoring function is given and assigns low scores to the wide range of bad candidate trajectories that can be generated, and high scores to the real ones.
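For concreteness, the objective can be written compactly as below. The notation is an assumption introduced here for illustration only: $\mathcal{T}$ stands for a candidate set of pairwise non-overlapping tracklets, $\Phi$ for the feature function, and $S$ for the scoring function.

```latex
\mathcal{T}^{*} \;=\; \operatorname*{arg\,max}_{\mathcal{T}}
\sum_{T \in \mathcal{T}} S\bigl(\Phi(T)\bigr)
\qquad \text{subject to no two tracklets in } \mathcal{T} \text{ overlapping.}
```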
3.2 Creating and Merging Tracklets
We iteratively merge tracklets to create ever longer candidate trajectories that include the real ones while suppressing many candidates to prevent a computationally infeasible combinatorial explosion. We then select an optimal subset greedily. We consider two trajectories to be overlapping when they have a large intersection over union. More specifically, they overlap if the total number of pixels shared by the bounding boxes of the two tracklets, normalized by the minimum of the sum of the areas of the bounding boxes in each of them, is above a threshold. We also eliminate tracklets that are either too short relative to the length of the batch or whose score is below another threshold. Both thresholds are hyper-parameters that we estimate on a validation set. The outlined procedure involves the two main steps described below.
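The overlap measure described above can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are given as `(x1, y1, x2, y2)` corners, and a missing detection is represented by `None` instead of a zero column. The names `Box`, `Tracklet`, `shared_area`, and `tracklet_overlap` are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); corner format is an assumption
Tracklet = List[Optional[Box]]           # one entry per frame, None when the location is unknown


def box_area(box: Box) -> float:
    """Area of a single bounding box."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def shared_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def tracklet_overlap(t1: Tracklet, t2: Tracklet) -> float:
    """Shared pixels over all frames, normalized by the smaller total box area."""
    shared = sum(
        shared_area(a, b)
        for a, b in zip(t1, t2)
        if a is not None and b is not None
    )
    area1 = sum(box_area(a) for a in t1 if a is not None)
    area2 = sum(box_area(b) for b in t2 if b is not None)
    smaller = min(area1, area2)
    return shared / smaller if smaller > 0 else 0.0
```

Two tracklets would then be treated as overlapping when `tracklet_overlap` exceeds the validation-tuned threshold. Normalizing by the smaller total area means that a short tracklet fully contained in a long one still counts as overlapping.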
3.2.1 Generating Candidate Trajectories
The set of candidate trajectories must contain the real ones but its size must be kept small enough to prevent a combinatorial explosion. To this end, we take the initial set of detections to be the initial tracklet set and then iterate the following two steps.
Growing: Merge pairs of tracklets that can be merged and whose merger contains exactly one more non-zero detection than the larger of the two. The merged tracklet includes the non-zero detections from both of them.
Pruning: For each tracklet, among all the mergers formed with it during the growing phase, retain only the one that maximizes the score.
This process keeps the number of hypotheses linear with respect to the number of detections. Yet, it retains a candidate for every possible detection. This prevents the algorithm from losing people and terminating trajectories too early even if mistakes are made early in the pruning process. We give an example in Fig. 2(b). In the appendix, we compare this heuristic to several others and show that it is effective at preventing combinatorial explosion without losing valid hypotheses.
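The growing and pruning steps can be illustrated with the sketch below. It is a simplified reading of the procedure, not the authors' implementation: tracklets are tuples holding one detection identifier per frame (`None` when unknown), all pairs are tested exhaustively, and the function names and the `score` callable are hypothetical.

```python
from itertools import combinations
from typing import Callable, Hashable, Optional, Sequence, Set, Tuple

Tracklet = Tuple[Optional[Hashable], ...]  # detection id per frame, None = unknown


def can_merge(a: Tracklet, b: Tracklet) -> bool:
    """Mergeable if no frame holds two different detections."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))


def merge(a: Tracklet, b: Tracklet) -> Tracklet:
    """Union of the detections of two mergeable tracklets."""
    return tuple(x if x is not None else y for x, y in zip(a, b))


def size(t: Tracklet) -> int:
    """Number of frames with a known detection."""
    return sum(x is not None for x in t)


def grow_and_prune(
    detections: Sequence[Sequence[Hashable]],
    score: Callable[[Tracklet], float],
    iterations: int,
) -> Set[Tracklet]:
    n_frames = len(detections)
    # Initial tracklet set: one single-detection tracklet per detection.
    tracklets: Set[Tracklet] = {
        tuple(det if f == frame else None for f in range(n_frames))
        for frame, ids in enumerate(detections)
        for det in ids
    }
    for _ in range(iterations):
        best = {}  # parent tracklet -> (score, best merger involving it)
        # Growing: consider mergers that add exactly one detection.
        for a, b in combinations(sorted(tracklets, key=repr), 2):
            if not can_merge(a, b):
                continue
            merged = merge(a, b)
            if size(merged) != max(size(a), size(b)) + 1:
                continue
            s = score(merged)
            # Pruning: each parent keeps only its highest-scoring merger.
            for parent in (a, b):
                if parent not in best or s > best[parent][0]:
                    best[parent] = (s, merged)
        tracklets |= {merged for _, merged in best.values()}
    return tracklets
```

Because every tracklet contributes at most one merger per iteration, the number of hypotheses added in each round is bounded by the number already present, which is what keeps the growth under control in this sketch.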
3.2.2 Selecting Candidates
Given the resulting set of tracklets, we want to select a compatible subset that maximizes our objective function. To this end, we select a subset of hypotheses with the best possible sum of scores, subject to a non-overlapping constraint. We do this greedily, starting with the highest scoring trajectories. As discussed in the appendix, we also tried a more sophisticated approach that casts the selection as an integer program solved optimally, and the results are similar.
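A greedy selection of this kind might look like the sketch below. The `score` and `overlap` callables, the `max_overlap` and `min_score` thresholds, and the function name are placeholders introduced for illustration; the actual overlap test and threshold values are those of Section 3.2.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def select_greedy(
    candidates: Iterable[T],
    score: Callable[[T], float],
    overlap: Callable[[T, T], float],
    max_overlap: float,
    min_score: float = float("-inf"),
) -> List[T]:
    """Pick candidates by decreasing score, skipping any that overlap a previous pick."""
    selected: List[T] = []
    for candidate in sorted(candidates, key=score, reverse=True):
        if score(candidate) < min_score:
            break  # candidates are sorted, so all remaining ones score lower
        if all(overlap(candidate, kept) <= max_overlap for kept in selected):
            selected.append(candidate)
    return selected
```

The greedy pass is quadratic in the number of candidates in the worst case, which is acceptable here because pruning keeps the candidate set small.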
4 Learning to Score
The scoring function of Eq. 1 is at the heart of the tracking procedure of Section 3. When we create and merge tracklets, we want it to favor those that can be associated with a single person without identity switch, that is, those that score well in terms of the IDF metric. We choose IDF over other popular alternatives such as MOTA because it has been shown to be more sensitive to identity switches.
In the remainder of this section, we first define the scoring function, which we implement using the deep network depicted by Fig. 2(a). We then describe how we train it.
4.1 Defining the Scoring Function
Ideally, the score of every tracklet should equal its IDF with respect to the corresponding ground truth trajectory. Unfortunately, at inference time, the ground truth is unknown by definition. To overcome this difficulty, recall that the IDF of a tracklet with respect to a ground truth trajectory is defined as twice the number of detections matched by the ground truth, divided by the sum of the total lengths of the two, where matching is based on the intersection over union of the bounding boxes. To approximate it without knowing the ground truth, we replace the ground truth quantities in this definition by two per-frame predictions, assuming that the deep network of Fig. 2(a) has been trained to regress them from the tracklet features:
the prediction of the intersection over union of the tracklet and ground truth boxes in each frame;
the prediction of whether the ground truth trajectory exists in each frame.
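The definition and its approximation can be made concrete with the sketch below. It is an illustration under stated assumptions rather than the exact formulas of the paper: a detection counts as matched when its intersection over union exceeds `match_iou=0.5`, the ground truth is considered present when the predicted presence exceeds `present_threshold=0.5`, boxes use `(x1, y1, x2, y2)` corners, and missing entries are `None`. All names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); corner format is an assumption


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, width) * max(0.0, height)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def idf(
    tracklet: List[Optional[Box]],
    truth: List[Optional[Box]],
    match_iou: float = 0.5,  # assumed matching threshold
) -> float:
    """Twice the number of matched detections over the sum of the two lengths."""
    matched = sum(
        1
        for t, g in zip(tracklet, truth)
        if t is not None and g is not None and iou(t, g) > match_iou
    )
    total = sum(t is not None for t in tracklet) + sum(g is not None for g in truth)
    return 2.0 * matched / total if total else 0.0


def idf_proxy(
    tracklet: List[Optional[Box]],
    predicted_iou: Sequence[float],
    predicted_presence: Sequence[float],
    match_iou: float = 0.5,          # assumed matching threshold
    present_threshold: float = 0.5,  # assumed presence threshold
) -> float:
    """Same ratio, with ground truth quantities replaced by network predictions."""
    matched = sum(
        1 for t, p in zip(tracklet, predicted_iou) if t is not None and p > match_iou
    )
    tracklet_length = sum(t is not None for t in tracklet)
    truth_length = sum(1 for p in predicted_presence if p > present_threshold)
    total = tracklet_length + truth_length
    return 2.0 * matched / total if total else 0.0
```

When the predictions are accurate, `idf_proxy` returns the same value as `idf`, which is why the score can serve as a stand-in for the metric at inference time.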
We also train our network to predict the shift that must be applied to each observed bounding box to produce the ground truth bounding box. It is not used to compute the score, but can be used during inference to better align the observed bounding boxes with the ground truth.
To train the network to predict the three quantities introduced above (overlap, ground truth presence, and bounding box shift), we define a loss function that is the sum of the errors between these predictions and the ground truth. The loss also involves the bounding box obtained by shifting the observed one by the predicted amount.
Arguably, we could have trained the network to directly regress to IDF instead of first estimating these per-frame quantities and then using the approximation of Eq. 3 to compute it. However, our experiments have shown that asking for more detailed feedback at every time step, as we do, forces the network to better understand motion, while a good estimate of IDF can often be produced by an average prediction.
We chose not to apply any weight factors to the components of the loss function because they can be seen as identifying false positive and false negative errors, and we wanted to weigh the two equally.
4.2 Training Procedure
The key to avoiding exposure bias while training the network is to supply a rich training set. To this end, we alternate between the following two steps:
Run the hypothesis generation algorithm of Section 3.2 using the current network weights when evaluating the scoring function;
Add the newly created tracklets to the training set and perform a single epoch of training.
In addition to learning the network weights, this procedure helps refine the final tracking result: The tracking procedure of Section 3 makes discrete choices about which hypotheses to pick or discard, which is non-differentiable. We nevertheless help it make the best choices by training the model on all candidates, both good and bad, encountered during tracking. In other words, our approach makes discrete choices during training and then updates the parameters based on all hypotheses that could have been selected, which is similar in spirit to using a straight-through estimator.
While simple in principle, this training procedure must be carefully designed for optimal performance. We list here the most important details of our implementation and study their impact in the ablation study.
We start the process with random network weights and stop it when the training set size grows by less than a preset amount over a preset number of iterations. We then fully train the model on the whole resulting training set. This process can be understood as a slow traversal of the search space. It starts with an untrained model that selects random hypotheses. Then, as the training progresses, new hypotheses are added and help the network both to differentiate between good and bad alternatives and to pick the best ones with increasing confidence.
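The alternation between hypothesis generation and training, together with the stopping rule, can be sketched as follows. The `generate` and `train_epoch` callables stand in for the tracker and the network update, and `min_growth`, `patience`, and `max_rounds` are hypothetical parameters with illustrative defaults, not values from the paper.

```python
from typing import Callable, Hashable, Iterable, List, Set


def build_training_set(
    generate: Callable[[], Iterable[Hashable]],
    train_epoch: Callable[[Set[Hashable]], None],
    min_growth: float = 0.05,  # assumed relative growth threshold
    patience: int = 3,         # assumed number of iterations to look back
    max_rounds: int = 100,     # safety cap, not part of the described procedure
) -> Set[Hashable]:
    training_set: Set[Hashable] = set()
    sizes: List[int] = [0]
    for _ in range(max_rounds):
        # Step 1: run hypothesis generation with the current network weights.
        training_set |= set(generate())
        # Step 2: perform a single epoch of training on everything seen so far.
        train_epoch(training_set)
        sizes.append(len(training_set))
        # Stop once the set has barely grown over the last `patience` rounds.
        if len(sizes) > patience:
            previous = sizes[-1 - patience]
            if previous > 0 and sizes[-1] - previous < min_growth * previous:
                break
    return training_set
```

After this loop terminates, the model would be fully trained on the returned set, as described above.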
During inference, we grow each tracklet by merging it with the one that yields the highest possible score. By contrast, during training, we make the training set more diverse by randomizing the merging process. To do that, we sample among the candidates for merger with probabilities given by a softmax over the scores of the merged results, each multiplied by a weight coefficient. We initially set the coefficient so that the optimal pair is almost always chosen and then progressively reduce it to increase the randomness.
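The randomized merging can be illustrated with a weighted softmax, as in the sketch below. The function names and the way the weight is passed are assumptions; the schedule for decreasing the weight is not shown.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def merge_probabilities(scores: Sequence[float], weight: float) -> List[float]:
    """Softmax over weight * score, computed in a numerically stable way."""
    scaled = [weight * s for s in scores]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_merger(
    candidates: Sequence[T],
    scores: Sequence[float],
    weight: float,
    rng: random.Random,
) -> T:
    """Draw one merge candidate; a large weight approaches the greedy choice."""
    probabilities = merge_probabilities(scores, weight)
    return rng.choices(candidates, weights=probabilities, k=1)[0]
```

With a large weight the best-scoring merger is chosen almost surely, which mirrors inference, while a weight of zero yields a uniform draw over all candidates.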
Balancing the dataset.
One potential difficulty is that this procedure may result in an unbalanced training set in terms of the IDF values to which we want to regress. We solve this by splitting the dataset into 10 groups by IDF value, selecting all samples from the smallest group, and then the same number from each other group. This enables us to perform hard-mining by selecting samples at random and retaining only those that contribute most to the loss, as controlled by a hard-mining parameter.
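A possible reading of the balancing and hard-mining steps is sketched below. Several choices are assumptions made for illustration: the 10 groups are taken to be equal-width IDF intervals, empty groups are skipped, and hard-mining is interpreted as drawing `k * batch_size` samples and keeping the `batch_size` with the largest loss.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def balance_by_idf(
    samples: Sequence[T],
    idf_values: Sequence[float],
    rng: random.Random,
    n_groups: int = 10,
) -> List[T]:
    """Keep the same number of samples from every non-empty IDF group."""
    groups: List[List[T]] = [[] for _ in range(n_groups)]
    for sample, value in zip(samples, idf_values):
        index = min(int(value * n_groups), n_groups - 1)  # IDF of 1.0 goes to the last group
        groups[index].append(sample)
    non_empty = [g for g in groups if g]
    if not non_empty:
        return []
    smallest = min(len(g) for g in non_empty)
    balanced: List[T] = []
    for group in non_empty:
        balanced.extend(rng.sample(group, smallest))
    return balanced


def hard_mine(
    samples: Sequence[T],
    loss: Callable[[T], float],
    batch_size: int,
    k: int,
    rng: random.Random,
) -> List[T]:
    """Draw k * batch_size samples at random and keep the batch_size hardest ones."""
    pool = rng.sample(list(samples), min(len(samples), k * batch_size))
    return sorted(pool, key=loss, reverse=True)[:batch_size]
```

Balancing first matters because hard-mining on an unbalanced set would mostly surface samples from whichever IDF range dominates it.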
We now present the datasets we use, baselines we compare against, our results, and finally a qualitative analysis.
We used the following publicly available datasets to benchmark our approach.
DukeMTMC. It contains 8 sequences, with 50 minutes of training data, and testing sequences of 10 minutes ("Hard", a dense crowd traversing several camera views) and 25 minutes ("Easy") with hidden ground truth, at 60fps.
MOT17. It contains 7 training-testing sequence pairs with similar statistics and hidden ground truth for the test sequences, spanning 785 trajectories and both static and moving cameras. For each, there are 3 different sets of detections obtained using different algorithms, which makes it possible to evaluate the quality of the tracking without overfitting to a specific detector.
MOT15. It contains 11 training and 11 testing sequences, with moving and stationary cameras in various settings. The ground truth for testing is hidden, and for each testing sequence there is a sequence with roughly similar statistics in the training data.
We compared against a number of recent algorithms that collectively represent the state-of-the-art. We distinguish below between those that do not use appearance cues and those that do.
Algorithms that ignore Appearance Cues.
RNN relies on a recurrent neural network and is similar to ours in spirit because it uses an RNN for tracking in a straightforward way. However, it is trained using a different loss and a different approach to creating the training data.
PTRACK aims to improve the results of other methods by refining the trajectories they produce, to maximize an approximation of the IDF metric. The approximation is hand-designed, and not learned as in our approach.
Algorithms that exploit Appearance Cues.
BIPCC clusters detections with similar appearance by solving a binary integer problem. This is a baseline method for the DukeMTMC dataset.
DMAN uses dual attention networks to perform data association by focusing on relevant image parts and temporal fragments.
JCC handles multiple object tracking and motion segmentation as a joint co-clustering problem. It solves it by local search to group pixels and bounding boxes. This returns both tracks and segmentation.
MOTDT17 performs hierarchical data association by grouping detections using a learned re-identification metric, exploiting geometric features, and a Kalman filter.
MHTBLSTM resembles our approach in spirit. It uses a multiple hypothesis tracker and a sequence model to score the tracks. However, it is trained using only ground-truth sequences with at most one false positive and sometimes missed detections.
EDMT17 relies on a multiple hypothesis tracker. Its growing and pruning phases depend on learned detection-detection and detection-scene association models that are used to better score detections and hypotheses.
FWT solves a binary quadratic problem to optimally group head and body detections, obtained separately.
We will show in Section 5 that we outperform both classes of methods when using the same setting as they do.
5.3 Experimental Protocol
In this section, we describe the features we use in practice along with our approach to batch processing, training, and choosing hyperparameters.
For a fair comparison against the two classes of baselines described above, we use either features in which appearance plays no part or features that encode actual image information. We describe them below.
We use the following simple features that can be computed from the detections without further reference to the images:
Bounding box coordinates and confidence.
Bounding box shift with respect to the previous and next detection in the tracklet.
Social feature: a description of the detections in the vicinity. It comprises offsets to the nearest detections and their confidence values. All values are expressed relative to image size for better generalization.
As a basis for appearance, we used the 128-dimensional vector produced from a bounding box by a re-identification model. The distance in Euclidean space between such vectors indicates the similarity between people's appearances and the likelihood that they are the same person. To this end, we provide the following additional features in our appearance-based model:
Appearance vector for each bounding box.
Euclidean distance from the appearance in the bounding box to the appearance that best represents the trajectory so far before the current batch, if one is available. To pick the appearance that best represents the trajectory so far, we computed the Euclidean distances between each pair of appearances in the trajectory, and picked the one with the smallest sum of distances to all others.
Crowd density feature: distance from the center of the current bounding box to the center of the nearest 1st, 5th, and 20th detection in the current frame. As we discuss in the ablation study, this feature had an impact on the behavior of our model with appearance in very dense crowd scenarios.
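Two of the features above lend themselves to a short illustration: choosing the representative appearance as the vector with the smallest sum of distances to all others, and the crowd density feature built from the 1st, 5th, and 20th nearest detections. The function names and the `missing` fallback used when a frame has too few detections are assumptions made for this sketch.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def representative_appearance(vectors: Sequence[Vector]) -> Vector:
    """Return the vector with the smallest sum of distances to all others."""
    best_index = min(
        range(len(vectors)),
        key=lambda i: sum(euclidean(vectors[i], other) for other in vectors),
    )
    return vectors[best_index]


def crowd_density(
    center: Vector,
    other_centers: Sequence[Vector],
    ranks: Sequence[int] = (1, 5, 20),
    missing: float = float("inf"),  # placeholder when fewer detections exist
) -> List[float]:
    """Distances from `center` to its rank-th nearest neighbours in the frame."""
    distances = sorted(euclidean(center, other) for other in other_centers)
    return [distances[r - 1] if r <= len(distances) else missing for r in ranks]
```

Choosing an actual observed vector rather than the mean keeps the representative on the manifold of real appearances, which is one plausible motivation for the smallest-sum-of-distances rule.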
In Section 3, we focused on processing a batch of images. In practice, we process longer sequences by splitting them into overlapping batches, shifting each one by a fixed number of frames. While pruning hypotheses, we never suppress all those that can be merged with trajectories from the previous batch. This ensures that we can incorporate all tracks from the previous batch. We used 3-second long batches for training. During inference, we observed that our model is able to generalize beyond 3s, and having longer batches can be beneficial in cases of long occlusions. Inference therefore used 6-second long batches.
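Splitting a sequence into overlapping batches can be sketched as below. The function name and the half-open `(start, end)` frame ranges are illustrative choices, and the shift is left as a parameter because its value is not specified here.

```python
from typing import List, Tuple


def batch_ranges(n_frames: int, batch_length: int, shift: int) -> List[Tuple[int, int]]:
    """Half-open (start, end) frame ranges of overlapping batches covering the sequence."""
    if not 0 < shift <= batch_length:
        raise ValueError("shift must be positive and no larger than batch_length")
    ranges: List[Tuple[int, int]] = []
    if n_frames <= 0:
        return ranges
    start = 0
    while True:
        end = min(start + batch_length, n_frames)
        ranges.append((start, end))
        if end >= n_frames:
            break
        start += shift
    return ranges
```

For example, at 60fps a 6-second batch corresponds to `batch_length=360`; consecutive batches then share `batch_length - shift` frames, which is where tracks from the previous batch are carried over.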
Training and Hyperparameters
For all datasets and sequences, cross-validation showed that a single setting of the two thresholds of Sec. 3.2, together with a hard-mining parameter of Sec. 4.2 equal to 3, is near-optimal. For DukeMTMC, we selected a validation set of 15,000 frames for each camera, pre-trained the model on data from all cameras simultaneously, and performed a final training on the training data for each individual sequence. We used only DukeMTMC training data to train the appearance model. For each MOT15 pair of training and testing sequences, we used the training sequence for validation purposes and the remaining training sequences to learn the network weights. For MOT17, we pre-trained our model on PathTrack, trained the appearance model on the CUHK03 dataset, and used the MOT17 training sequences for validation purposes. More details are in the appendix.
5.4 Comparative Performance
We compared on DukeMTMC and MOT15 against methods that ignore appearance features because their results are reported on these two datasets. For the same reason, we used DukeMTMC and MOT17 to compare against those that exploit appearance. We summarize the results below, reporting the IDF and MOTA tracking metrics and the number of identity switches (IDs), and provide a much more detailed breakdown in the appendix. We present some tracking results in Fig. 3 and Fig. 4.
Comparing to Algorithms that exploit Appearance.
On DukeMTMC, our approach performs best both for the Easy and Hard sequences in terms of IDF, MOTA, and the raw number of identity switches. Furthermore, unlike other top scoring methods that use re-identification networks pre-trained on additional datasets, ours was trained using only the DukeMTMC training data.
On MOT17, our approach is best in terms of both the IDF metric and the number of identity switches. However, it does poorly on MOTA. Strikingly, FWT does the exact opposite: it yields the best MOTA and the worst IDF on this dataset. Careful examination of the trajectories shows that this comes from producing many short trajectories that increase the overall number of tracked detections, and therefore MOTA, at the cost of assigning many spurious identities, increasing fragmentation, and decreasing IDF. This example illustrates why we believe IDF to be the more meaningful metric and why we have designed our tracklet scoring function to be a proxy for it.
Comparing to Algorithms that ignore Appearance.
We report our results on MOT15 in Tab. 2 and on DukeMTMC in Tab. 1. On the MOT15 dataset, the method most similar to ours is RNN, which also uses an RNN to perform data association. Despite the fact that RNN uses external data to pre-train its model, and we use only the MOT15 training data, our approach outperforms it by a large margin. Another interesting comparison is with SORT, which performs nearly as well as our approach. However, it cannot leverage training data effectively. To show this, we additionally ran it on the validation data we used for DukeMTMC, where there is much more training data than in MOT15. This resulted in a MOTA score of 49.9 and an IDF score of 24.9, whereas our method reaches 70.0 and 74.6 on the same data.
Remarkably, on the DukeMTMC dataset, even though we ignored appearance for the purpose of this comparison, our approach also outperforms or rivals some of the methods that exploit it [46, 49]. This shows that our training procedure is powerful enough to overcome this serious handicap.
We now analyze briefly some key components of our approach and provide additional details in the appendix.
We performed training on a single 2.5GHz CPU, and all other actions (computing IDF values for dataset balancing, generating training data, etc.) in parallel on 20 such CPUs. Training data contained at most tracklets (DukeMTMC dataset, camera 6), resulting in at most training data points after balancing the dataset. Generating training data took under 6 hours, and training on it achieved the best validation scores within 30 epochs, taking under 10 minutes each. Inference runs at about 2 frames per second. However, adding a cutoff on sequence scores in the pruning step of Sec. 3.2.1 speeds up our Python implementation to 30fps, at the cost of a very small performance decrease (IDF of 71 instead of 74.6).
The last 15,000 frames of the training sequences of DukeMTMC were used for an ablation study. We varied the three main components of our solution to show their effect on the tracking accuracy: data composition, scoring function, and training procedure. We report the drop in IDF when applying such changes. Creating a fixed training set by considering tracklets with at most one identity switch as in [48, 35] decreased performance (-3.9). Pruning hypotheses based on their scores or total count resulted in either a computational explosion or reduced performance (-20). Computing the loss on the prediction of IDF, regressing the IDF value directly, not regressing bounding box shifts, or using a standard classification loss were equally counter-productive (-5.1, -13.2, -2.2, -32.8). Not balancing the training set or not using hard-mining also adversely affected the results (-4.7, -2.5). Selecting the final solution using an Integer Program instead of a greedy algorithm, pre-training the model with each type of features separately, or training a deeper network had no significant effect.
We also evaluated how different features affect the quality of the solution. Appearance features improved overall IDF from 74.6 to 82.5, with the appearance distance feature having the biggest effect. The crowd density feature mostly affected crowded scenarios, where our merging procedure preferred to merge detections that are further apart in time but more visually similar, compared to less crowded scenarios, where it preferred to merge detections based more on spatial vicinity. The social feature mostly affected the appearance-less model, helping to preserve identities by ensuring that detections of the surrounding people are consistent throughout the trajectory, improving IDF from 67.5 to 74.6. Probabilistic merging from Sec. 4.2 was vital to fuse appearance-based and geometry-based features together. Without it, picking only the best candidate resulted in a model that performed merges mostly either based on the appearance information (largely ignoring spatial vicinity), or based on the spatial and motion information (largely ignoring appearance information).
We have introduced a training procedure that significantly boosts the performance of sequence models by iteratively building a rich training set. We have also developed a sophisticated model that can regress from tracklets to the IDF multiple target tracking metric. We have shown that our approach outperforms state-of-the-art ones on several challenging benchmarks both in scenarios where appearance is used and where it is not. In the second case, we can even come close to what appearance-based methods can do without using it. This could prove extremely useful to solve problems in which appearance is hard to use, such as cell or animal tracking.
In future work, we will extend our data association procedure to account for more advanced appearance features, such as 2D and 3D pose. We will also look into further reducing the loss-evaluation mismatch by using the actual IDF, instead of our proposed IDF regressor, which would require the use of reinforcement learning.
Appendix A An appendix
A.1 Ablation study
| 1 | Dataset: all pairs | 71.5 | Loss on IDF | 69.5 | -hardmining | 72.1 | Batch 6 | 72.6 |
| 2 | Dataset: mix of two | 70.7 | Regressing IDF | 63.4 | -balanced dataset | 69.9 | IP solution | 73.8 |
| 3 | Selected only | 63.6 | -bbox regression | 72.4 | pretraining | 71.9 | | |
| 4 | Pruning by score | - | -bbox loss | 66.3 | 2 layer LSTM | 74.1 | | |
| 5 | Pruning by count | 54.2 | classification | 41.8 | | | | |
The ablation study was performed on the validation data of DukeMTMC, using the last 15000 training frames in each camera view.
Tab. 4 reports the results, organized in four columns.
Changes in dataset. We quantify the impact of degrading the training dataset generation procedure of Section 3 by:
1) using random tracklets between all pairs of detections;
2) using tracklets obtained by combining at most two ground truth trajectories;
3) adding to the training data only the tracklets present in the final solution, rather than all tracklets observed during the growing phase;
4) pruning with the predicted score of the tracklet as a cutoff;
5) pruning by retaining a fixed number of tracklets with the best scores.
1), 2), and 3) yield smaller and less diverse training data, which had a detrimental effect on the tracking results. 4) did not allow us to train any reasonable model because of the explosion in the number of trajectories with very similar scores, which were all added to the training data. 5) proved ineffective for the same reason: the training data contained many very similar trajectories.
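To illustrate why pruning by count can leave many near-duplicate trajectories in the training data, the following minimal sketch contrasts top-k pruning with a variant that caps how many retained tracklets share most of their detections. It is only an illustration under stated assumptions: the tracklet representation, the `overlap_ratio` definition, and the `max_similar` and `min_overlap` parameters are hypothetical choices made for this example.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical representation: (set of detection ids, predicted score).
Tracklet = Tuple[FrozenSet[int], float]


def prune_by_count(tracklets: List[Tracklet], k: int) -> List[Tracklet]:
    """Keep the k best-scoring tracklets, regardless of how similar they are."""
    return sorted(tracklets, key=lambda t: t[1], reverse=True)[:k]


def overlap_ratio(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Fraction of shared detections relative to the shorter tracklet (assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def prune_by_shared_detections(
    tracklets: List[Tracklet], max_similar: int, min_overlap: float
) -> List[Tracklet]:
    """Keep a tracklet only if fewer than max_similar already-kept ones overlap it heavily."""
    kept: List[Tracklet] = []
    for dets, score in sorted(tracklets, key=lambda t: t[1], reverse=True):
        similar = sum(1 for kd, _ in kept if overlap_ratio(dets, kd) >= min_overlap)
        if similar < max_similar:
            kept.append((dets, score))
    return kept
```

On a toy set with three near-identical high-scoring tracklets and one distinct lower-scoring one, `prune_by_count` with `k=3` keeps only the near-duplicates, while the second variant keeps one of them plus the distinct tracklet.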
Changes in the loss function. We modify the loss function described in Section 4.1 by:
1) using a loss on IDF only;
2) regressing the value of IDF directly, without splitting the task into accounting for false positives and false negatives;
3) not modifying the input detections based on the regression of bounding box shifts;
4) removing the bounding box component from the loss function;
5) posing the task as classification, where a tracklet belongs to the positive class iff all its detections overlap with some ground truth trajectory with IoU of at least 0.5.
1) resulted in a small decrease, probably because multiple loss components act as regularizers. 2) gave even worse results, because understanding the behaviour of the IDF function is much harder than understanding the behaviour of false positives and false negatives, which we otherwise regress separately. The difference between 3) and 4) shows that simply having the bounding box component as part of the loss function improves the results, acting as a regularizer. 5) does not yield a good model because there are many overlapping sequences, some of which have IoU greater than 0.5 in every frame and some of which do not, and it is hard for the model to distinguish between the two.
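The labeling rule of the classification variant in 5) can be sketched as follows. This is a minimal illustration, assuming boxes in `(x1, y1, x2, y2)` corner format and tracklets stored as frame-to-box dictionaries; it also assumes that all detections must match the same ground truth trajectory. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive(
    tracklet: Dict[int, Box],
    gt_trajectories: List[Dict[int, Box]],
    threshold: float = 0.5,
) -> bool:
    """True iff every detection overlaps one single ground truth trajectory enough."""
    if not tracklet:
        return False
    for gt in gt_trajectories:
        if all(f in gt and iou(box, gt[f]) >= threshold for f, box in tracklet.items()):
            return True
    return False
```

Two tracklets that differ by a single poorly overlapping frame receive opposite labels under this rule, which is the kind of hard-to-separate pair mentioned above.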
Changes in the training procedure. We modify the training procedure of Section 4.2 by:
1) not using hard-mining;
2) not balancing the dataset;
3) pre-training the embeddings for appearance and geometric features separately;
4) using a 2-layer LSTM instead of a single layer, as depicted in Fig. 2(a).
Changes in the tracking procedure. We modify the tracking procedure of Section 3 by:
1) using a shorter batch in tracking (3s, same as during training, instead of 6s);
2) selecting the final solution with an IP that maximizes the objective of Eq. 1, rather than greedily adding trajectories one by one.
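The difference between greedy and exact selection can be illustrated on a toy problem. The sketch below is not the objective of Eq. 1: it assumes, purely for illustration, that the objective is the sum of candidate scores and that selected trajectories may not share detections. The exhaustive search stands in for an IP solver on tiny inputs; all names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple

# Hypothetical representation: (set of detection ids, score).
Candidate = Tuple[FrozenSet[int], float]


def total_score(selection: Sequence[Candidate]) -> float:
    return sum(score for _, score in selection)


def is_consistent(selection: Sequence[Candidate]) -> bool:
    """Assumed constraint: no detection is used by two selected trajectories."""
    used: set = set()
    for dets, _ in selection:
        if not used.isdisjoint(dets):
            return False
        used |= dets
    return True


def greedy_select(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Add trajectories one by one, best score first, skipping conflicting ones."""
    selected: List[Candidate] = []
    used: set = set()
    for dets, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if score > 0 and used.isdisjoint(dets):
            selected.append((dets, score))
            used |= dets
    return selected


def exhaustive_select(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Exact maximizer by enumeration; only feasible for a handful of candidates."""
    best: List[Candidate] = []
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            if is_consistent(subset) and total_score(subset) > total_score(best):
                best = list(subset)
    return best
```

With candidates `{1, 2}` scored 0.6 and `{1}`, `{2}` scored 0.5 each, the greedy pass keeps the single long candidate, while the exhaustive search selects the two short ones.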
Additionally, while it may seem logical to train the network to predict the IoU between the modified bounding box and the ground truth bounding box, in practice this makes the network harder to train: it finds an easy solution of regressing empty bounding boxes, which never intersect with the ground truth, and thus always makes a perfect prediction. Instead, we use the network during inference in the autocontext mode: we predict the bounding boxes, update the input tracklet with them, and then regress the intersection over union of the new tracklet to compute the tracklet score.
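The autocontext inference described above amounts to two passes over the tracklet. The following minimal sketch shows the control flow only; `predict_shifts` and `predict_iou` are hypothetical stand-ins for the two network outputs, and the `(x, y, width, height)` box format and additive shifts are assumptions of this example.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]    # assumed (x, y, width, height)
Shift = Tuple[float, float, float, float]  # assumed additive (dx, dy, dw, dh)


def apply_shift(box: Box, shift: Shift) -> Box:
    """Apply a predicted shift, keeping width and height non-negative."""
    x, y, w, h = box
    dx, dy, dw, dh = shift
    return (x + dx, y + dy, max(0.0, w + dw), max(0.0, h + dh))


def autocontext_inference(
    tracklet: Sequence[Box],
    predict_shifts: Callable[[Sequence[Box]], List[Shift]],
    predict_iou: Callable[[Sequence[Box]], List[float]],
) -> Tuple[List[Box], List[float]]:
    """First pass refines the boxes; second pass regresses IoU on the refined tracklet."""
    shifts = predict_shifts(tracklet)
    refined = [apply_shift(box, shift) for box, shift in zip(tracklet, shifts)]
    return refined, predict_iou(refined)
```

Because the IoU is regressed for the updated tracklet rather than predicted jointly with the shifts, the degenerate solution of emitting empty boxes is not rewarded in this setup.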
A.2 Detailed Benchmark Results
|Measure||Better||Perfect||Description|
|MOTA||higher||100||Multiple Object Tracking Accuracy. This measure combines three error sources: false positives, missed targets and identity switches.|
|MOTP||higher||100||Multiple Object Tracking Precision. The misalignment between the annotated and the predicted bounding boxes.|
|IDF1||higher||100||IDF. The ratio of correctly identified detections over the average number of ground-truth and computed detections.|
|FAF||lower||0||The average number of false alarms per frame.|
|MT||higher||100||Mostly tracked targets. The ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span.|
|ML||lower||0||Mostly lost targets. The ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span.|
|FP||lower||0||The total number of false positives.|
|FN||lower||0||The total number of false negatives (missed targets).|
|ID Sw.||lower||0||The total number of identity switches.|
|Frag.||lower||0||The total number of times a trajectory is fragmented (i.e. interrupted during tracking).|
|Hz||higher||Inf.||Processing speed (in frames per second excluding the detector) on the benchmark.|
|Method||MOTA||IDF1||MT||ML||FP||FN||ID Sw.||Frag.||Hz||Hardware|
|OURS-geom||22.2||27.2||3.1||61.6||5,591||41,531||700||1,240||8.9||2.5 GHz CPU|
|SORT||21.7||26.8||3.7||49.1||8,422||38,454||1,231||2,005||1,112.1||1.8 GHz CPU|
|LP2D||19.8||—||6.7||41.2||11,580||36,045||1,649||1,712||112.1||2.6Hz 16 CPU|
We describe the tracking metrics in Tab. 5 and give full results for all benchmarks in Tab. 6, 7, and 8. Legend information and results for the MOT15 dataset were collected from the benchmark website https://motchallenge.net/ on the 6th of May, 2018, while results for the MOT17 and DukeMTMC datasets were collected on the 30th of October, 2018. Our tracker results are available there under the names SAS and SAS_full for the DukeMTMC benchmark, SAS_MOT15 for the MOT15 benchmark, and SAS_MOT17 for the MOT17 benchmark.
We also report a comparison to SORT on the validation data we used for the DukeMTMC dataset in Tab. 9. We tuned the parameters of the method (max_age, min_hits, detection quality cutoff) by grid search on the same data we used for training in the ablation study.
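For reference, the main aggregate metrics in the legend can be computed from error counts as sketched below. The formulas follow the standard definitions of MOTA and IDF1 and the 80% / 20% coverage thresholds stated in the legend; the function names and the per-trajectory coverage input are assumptions of this example, and the values are returned as fractions rather than percentages.

```python
from typing import Sequence, Tuple


def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + ID switches) / number of ground truth detections."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)


def mostly_tracked_lost(coverages: Sequence[float]) -> Tuple[float, float]:
    """Fractions of ground truth trajectories covered at least 80% (MT) or at most 20% (ML)."""
    n = len(coverages)
    mt = sum(1 for c in coverages if c >= 0.8) / n
    ml = sum(1 for c in coverages if c <= 0.2) / n
    return mt, ml
```

For example, 20 misses, 10 false positives, and 5 identity switches over 100 ground truth detections give a MOTA of 0.65, which can be negative when the errors outnumber the ground truth detections.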
A.3 Training Protocol
We trained the model with Adam with a fixed learning rate of 0.001. Our embedding layer consists of a fully connected layer followed by a batch normalization layer. The size of the LSTM hidden state was 300. In all cases we kept, and trained with batches of length 3s. Thanks to the abundance of training data, we used a frame rate of 3 fps for the DukeMTMC dataset. For the MOT15 and MOT17 datasets, we trained the model at the maximum frame rate each sequence allowed, to increase the amount of training data. During inference, we used batches of length 6s. We used the bounding box shift regression only in combination with the DPM detector, as it did not prove useful for other types of detectors. Nevertheless, we kept the corresponding term as a part of the loss function. We plan to make our implementation (in Python, using TensorFlow) publicly available upon acceptance of the paper.
For each MOT15 sequence group (KITTI, ADL, etc.), we trained on all training sequences outside the group, used the training sequences of the group for validation, and ran inference on the test sequences from the same group. For MOT17, we used PathTrack for pre-training the model and the training sequences for validation. We trained the re-identification network on the CUHK03 dataset.
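The per-group protocol for MOT15 is a leave-one-group-out split, which can be sketched as follows. The sketch assumes that the group is the part of the sequence name before the first hyphen; the helper names are hypothetical, and the sequence list in the usage note is illustrative rather than the full training set.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def group_of(sequence_name: str) -> str:
    """Assumed convention: the group is the prefix before the first hyphen."""
    return sequence_name.split("-")[0]


def leave_one_group_out(
    train_sequences: Sequence[str],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (group, sequences to train on, held-out sequences for validation)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in train_sequences:
        groups[group_of(name)].append(name)
    for group, held_out in groups.items():
        train = [name for name in train_sequences if group_of(name) != group]
        yield group, train, held_out
```

Calling `leave_one_group_out(["KITTI-13", "KITTI-17", "ADL-Rundle-6", "ADL-Rundle-8", "Venice-2"])` yields one split per group; the model trained on a given split would then be applied to the test sequences of the held-out group.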
The coefficient by which we multiply the probabilities before the softmax to allow probabilistic merging, as described in the paper, was annealed from 10 to 0.1 over 30 epochs.
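A minimal sketch of this annealed sampling is given below. The endpoints (10 and 0.1) and the number of epochs (30) come from the text; the geometric interpolation between them is an assumption of this example, as the shape of the schedule is not specified, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def annealed_coefficient(
    epoch: int, start: float = 10.0, end: float = 0.1, num_epochs: int = 30
) -> float:
    """Coefficient for a 0-based epoch; geometric interpolation is an assumption."""
    if num_epochs < 2:
        return end
    t = min(max(epoch, 0), num_epochs - 1) / (num_epochs - 1)
    return start * (end / start) ** t


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_merge(scores: Sequence[float], coefficient: float, rng: random.Random) -> int:
    """Sample the index of a merge candidate from softmax(coefficient * scores)."""
    probs = softmax([coefficient * s for s in scores])
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```

A larger coefficient concentrates the probability mass on the best-scoring candidate, while a smaller one spreads it more evenly across candidates, so the coefficient controls how far the sampling departs from always picking the best candidate.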
-  A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-fei, and S. Savarese. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In Conference on Computer Vision and Pattern Recognition, 2014.
-  M. Andriluka, S. Roth, and B. Schiele. People-Tracking-By-Detection and People-Detection-By-Tracking. In Conference on Computer Vision and Pattern Recognition, June 2008.
-  T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, and S. Savarese. Social Scene Understanding: End-To-End Multi-Person Action Localization and Collective Activity Recognition. In Conference on Computer Vision and Pattern Recognition, 2017.
-  S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In Advances in Neural Information Processing Systems, 2015.
-  Y. Bengio, N. Léonard, and A. Courville. Estimating or Propagating Gradients through Stochastic Neurons for Conditional Computation. ARXIV, 2013.
-  K. Bernardin and R. Stiefelhagen. Evaluating Multiple Object Tracking Performance: the Clear Mot Metrics. EURASIP Journal on Image and Video Processing, 2008, 2008.
-  A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple Online and Realtime Tracking. In International Conference on Image Processing, 2016.
-  S. Blackman. Multiple-Target Tracking with Radar Applications. Artech House, 1986.
-  T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, L. Lettry, P. Fua, L. V. Gool, and F. Fleuret. The Wildtrack Multi-Camera Person Dataset. In Conference on Computer Vision and Pattern Recognition, 2018.
-  J. Chen, H. Sheng, Y. Zhang, and Z. Xiong. Enhancing detection model for multiple hypothesis tracking. In Conference on Computer Vision and Pattern Recognition, 2017.
-  L. Chen, H. Ai, C. Shang, Z. Zhuang, and B. Bai. Online Multi-Object Tracking with Convolutional Neural Networks. In International Conference on Image Processing, 2017.
-  Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu. Online Multi-Object Tracking Using Cnn-Based Single Object Tracker with Spatial-Temporal Attention Mechanism. In International Conference on Computer Vision, 2017.
-  C. Dicle, O. I. Camps, and M. Sznaier. The Way They Move: Tracking Multiple Targets with Similar Appearance. In International Conference on Computer Vision, 2013.
-  P. Felzenszwalb, D. Mcallester, and D. Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. In Conference on Computer Vision and Pattern Recognition, June 2008.
-  K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik. Recurrent Network Models for Human Dynamics. In International Conference on Computer Vision, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  D. Held, S. Thrun, and S. Savarese. Learning to Track at 100 Fps with Deep Regression Networks. In European Conference on Computer Vision, 2016.
-  R. Henschel, L. Leal-Taixé, D. Cremers, and B. Rosenhahn. Fusion of head and full-body detectors for multi-object tracking. In Conference on Computer Vision and Pattern Recognition Workshops, 2018.
-  A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
-  M. Hu, S. Ali, and M. Shah. Detecting Global Motion Patterns in Complex Videos. In ICPR, 2008.
-  J. S. III and D. Ramanan. Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning. In International Conference on Computer Vision, 2017.
-  U. Iqbal, A. Milan, and J. Gall. Posetrack: Joint Multi-Person Pose Estimation and Tracking. In Conference on Computer Vision and Pattern Recognition, 2017.
-  E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
-  J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M. Yang. Online multi-object tracking with dual matching attention networks. In European Conference on Computer Vision, 2018.
-  M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele. Motion segmentation & multiple object tracking by correlation co-clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  C. Kim, F. Li, A. Ciptadi, and J. Rehg. Multiple Hypothesis Tracking Revisited. In International Conference on Computer Vision, 2015.
-  C. Kim, F. Li, and J. M. Rehg. Multi-object tracking with neural gating using bilinear lstm. In European Conference on Computer Vision, 2018.
-  Y. J. Koh and C.-S. Kim. CDTS: Collaborative Detection, Tracking, and Segmentation for Online Multiple Object Segmentation in Videos. In International Conference on Computer Vision, 2017.
-  L. Leal-taixé, C. Canton-ferrer, and K. Schindler. Learning by Tracking: Siamese CNN for Robust Target Association. In Conference on Computer Vision and Pattern Recognition, 2016.
-  L. Leal-taixe, A. Milan, I. Reid, S. Roth, and K. Schindler. Motchallenge 2015: Towards a Benchmark for Multi-Target Tracking. In ARXIV, 2015.
-  N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. Desire: Distant Future Prediction in Dynamic Scenes with Interacting Agents. In Conference on Computer Vision and Pattern Recognition, 2017.
-  W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In Conference on Computer Vision and Pattern Recognition, 2014.
-  Z. Lin, H. Zheng, B. Ke, and L. Chen. Online Multi-Object Tracking Based on Hierarchical Association and Sparse Representation. In International Conference on Image Processing, 2017.
-  C. Long, A. Haizhou, Z. Zijie, and S. Chong. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In International Conference on Multimedia and Expo, 2018.
-  C. Ma, C. Yang, F. Yang, Y. Zhuang, Z. Zhang, H. Jia, and X. Xie. Trajectory Factory: Tracklet Cleaving and Re-Connection by Deep Siamese Bi-GRU for Multiple Object Tracking. In ICME, 2018.
-  W.-C. Ma, D.-A. Huang, N. Lee, and K. M. Kitani. Forecasting Interactive Dynamics of Pedestrians with Fictitious Play. In Conference on Computer Vision and Pattern Recognition, 2017.
-  A. Maksai, X. Wang, F. Fleuret, and P. Fua. Globally Consistent Multi-People Tracking Using Motion Patterns. In International Conference on Computer Vision, 2017.
-  S. Manen, M. Gygli, D. Dai, and L. V. Gool. Pathtrack: Fast Trajectory Annotation with Path Supervision. In International Conference on Computer Vision, 2017.
-  A. Milan, L. Leal-taixe, I. Reid, S. Roth, and K. Schindler. Mot16: A Benchmark for Multi-Object Tracking. arXiv preprint arXiv:1603.00831, 2016.
-  A. Milan, S. Rezatofighi, A. Dick, I. Reid, and K. Schindler. Online Multi-Target Tracking Using Recurrent Neural Networks. In AAAI, 2017.
-  S. Murray. Real-Time Multiple Object Tracking-A Study on the Importance of Speed. arXiv preprint arXiv:1709.03572, 2017.
-  P. Ondruska, J. Dequaire, D. Wang, and I. Posner. End-To-End Tracking and Semantic Segmentation Using Recurrent Neural Networks. In Workshop on limits and potentials of Deep Learning in Robotics, 2016.
-  P. Ondruska and I. Posner. Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks. In AAAI, 2016.
-  S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You’ll Never Walk Alone: Modeling Social Behavior for Multi-Target Tracking. In International Conference on Computer Vision, 2009.
-  M. A. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence Level Training with Recurrent Neural Networks. In ICLR, 2015.
-  E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In European Conference on Computer Vision, 2016.
-  E. Ristani and C. Tomasi. Features for Multi-Target Multi-Camera Tracking and Re-Identification. In Conference on Computer Vision and Pattern Recognition, 2018.
-  A. Sadeghian, A. Alahi, and S. Savarese. Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies. In International Conference on Computer Vision, 2017.
-  Y. T. Tesfaye, E. Zemene, A. Prati, M. Pelillo, and M. Shah. Multi-Target Tracking in Multiple Non-Overlapping Cameras Using Constrained Dominant Sets. arXiv preprint arXiv:1706.06196, 2017.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. In Conference on Computer Vision and Pattern Recognition, 2015.
-  S. E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional Pose Machines. In Conference on Computer Vision and Pattern Recognition, 2016.
-  L. Williams and D. Jacobs. Stochastic Completion Fields: A Neural Model of Illusory Contour Shape and Salience. Neural Computation, 9(4):837–858, 1997.
-  Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation. arXiv Preprint, 2016.
-  J. Xiang, G. Zhang, J. Hou, N. Sang, and R. Huang. Multiple Target Tracking by Learning Feature Representation and Distance Metric Jointly. arXiv Preprint, 2018.
-  Y. Xiang, A. Alahi, and S. Savarese. Learning to Track: Online Multi-Object Tracking by Decision Making. In Conference on Computer Vision and Pattern Recognition, 2015.
-  K. Yoon, Y. m. Song, and M. Jeon. Multiple Hypothesis Tracking Algorithm for Multi-Target Multi-Camera Tracking with Disjoint Views. IET Image Processing, 2018.
-  M. Zhai, M. J. Roshtkhari, and G. Mori. Deep Learning of Appearance Models for Online Object Tracking. arXiv preprint arXiv:1607.02568, 2016.
-  X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun. Alignedreid: Surpassing Human-Level Performance in Person Re-Identification. arXiv preprint arXiv:1711.08184, 2017.
-  Z. Zhang, J. Wu, lx. Zhang, and C. Zhang. Multi-Target, Multi-Camera Tracking by Hierarchical Clustering: Recent Progress on DukeMTMC Project. arXiv preprint arXiv:1712.09531, 2017.