Scene understanding from video remains one of the big challenges of computer vision. Humans are often the center of attention of a scene, which leads to the fundamental problem of detecting and tracking them in a video. Tracking-by-detection has emerged as the preferred paradigm to solve the problem of tracking multiple objects as it simplifies the task by breaking it into two steps: (i) detecting object locations independently in each frame, (ii) linking corresponding detections across time to form trajectories. The linking step, or data association, is a challenging task on its own, due to missing and spurious detections, occlusions, and target interactions in crowded environments. To address these issues, research in this area has produced increasingly complex models that achieve only marginally better results, e.g., multiple object tracking accuracy has only improved 2.4% in the last two years on the MOT16 MOTChallenge  benchmark.
In this paper, we push tracking-by-detection to the limit by using only an object detection method to perform tracking. We show that one can achieve state-of-the-art tracking results by training a neural network only on the task of detection. As indicated by the blue arrows in Figure 1, the regressor of an object detector such as Faster R-CNN is sufficient to construct object trajectories in a multitude of challenging tracking scenarios. This raises an interesting question that we discuss in this paper: If a detector can solve most of the tracking problems, what are the real situations where a dedicated tracking algorithm is necessary? We hope our work and the presented Tracktor allow researchers to focus on the still unsolved critical challenges of multi-object tracking.
This paper presents four main contributions:
We introduce the Tracktor which tackles multi-object tracking by exploiting the regression head of a detector to perform temporal realignment of object bounding boxes.
We present two simple extensions to our model, a re-identification Siamese network and a motion model. This tracker largely outperforms all trackers in three challenging multi-object tracking benchmarks.
We conduct a detailed analysis on failure cases and challenging tracking scenarios, and show none of the dedicated tracking methods perform substantially better than our regression approach.
We propose our method as a new tracking paradigm which exploits the detector and allows researchers to focus on the remaining complex tracking challenges. This includes an extensive study on promising future research directions.
1.1 Related work
Several computer vision tasks such as surveillance, activity recognition or autonomous driving rely on object trajectories as input. Despite the vast literature on multi-object tracking [47, 43], it still remains a challenging problem, especially in crowded environments where occlusions and false detections are common. Most state-of-the-art works follow the tracking-by-detection paradigm which heavily relies on the performance of the underlying detection method.
Recently, neural network based detectors have clearly outperformed all other methods for detection [37, 56, 1]. The family of detectors that evolved to Faster R-CNN, and further detectors such as SDP, rely on object proposals which are passed to an object classification and a bounding box regression head of a neural network. The latter refines bounding boxes to fit tightly around the object. In this paper, we show that one can rethink the use of this regressor for tracking purposes.
Tracking as a graph problem. The data association problem deals with keeping the identity of the tracked objects given the available detections. This can be done on a frame by frame basis for online applications [6, 17, 53] or track-by-track. Since video analysis can be done offline, batch methods are preferred since they are more robust to occlusions. A common formalism is to represent the problem as a graph, where each detection is a node, and edges indicate a possible link. The data association can then be formulated as a maximum flow or, equivalently, minimum cost problem with either fixed costs based on distance [29, 54, 72], including motion models, or learned costs. Alternative formulations typically lead to more involved optimization problems, including minimum cliques, general-purpose solvers like MCMC, or multi-cuts. A recent trend is to design ever more complex models which include other vision input such as reconstruction for multi-camera sequences [45, 65], activity recognition, segmentation, keypoint trajectories or joint detection. In general, the significantly higher computational costs do not translate to significantly higher accuracy. In fact, in this work, we show that we can outperform all graph-based trackers significantly while keeping the tracker online. Even within a graphical model optimization, one needs to define a measure to identify whether two bounding boxes belong to the same person or not. This can be done by analyzing either the appearance of the pedestrian, or its motion.
Appearance models and re-identification. Discriminating and re-identifying (reID) objects by appearance is a problem in particular in crowded scenes with many object-object occlusions. In the exhaustive literature that uses appearance models or reID methods to improve multi-object tracking, color-based models are very common. However, these are not always reliable for pedestrian tracking, since people can wear very similar clothes, and color statistics are often contaminated by background pixels and illumination changes. Ideas from person re-identification have been borrowed and adapted to “re-identify” targets during tracking, and a CRF model has been learned to better distinguish pedestrians with similar appearance. Both appearance and short-term motion in the form of optical flow can be used as input to a Siamese neural network to decide whether two boxes belong to the same track or not. Recently, learned reID features were shown to be important for multi-object tracking. We confirm this view in our experiments.
Motion models and trajectory prediction. Several works resort to motion to discriminate between pedestrians, especially in highly crowded scenes. The most common assumption is the one of constant velocity (CVA) [13, 3], but pedestrian motion gets more complex in crowded scenarios, for which researchers have turned to the more expressive Social Force Model [61, 53, 66, 44]. Such a model can also be learned from data. Deep Learning has been extensively used to learn social etiquette in crowded scenarios for trajectory prediction [44, 2, 59]. Networks trained for single object tracking have also been used to create tracklets for further post-processing into trajectories. Recently, [8, 55] proposed to use reinforcement learning to predict the position of an object in the next frame. While the former focuses on single object tracking, the authors of the latter train a multi-object pedestrian tracker composed of a bounding box predictor and a decision network for collaborative decision making between tracked objects.
Video object detection. Multi-object tracking without frame-to-frame identity prediction is a subproblem usually referred to as video object detection. In order to improve detections, many methods exploit spatio-temporal consistencies of object positions. Two approaches generate multi-frame bounding box tuplet proposals and extract detection scores and features with a CNN and an LSTM, respectively. Recently, object detections have been improved by applying optical flow to propagate scores between frames. Finally, D&T proposes to solve the tracking and detection problem jointly. Its authors propose a network which processes two consecutive frames and exploits tracking ground truth data to improve detection regression, thereby generating two-frame tracklets. With a subsequent offline method these tracklets are combined into multi-frame tracks. However, we show that our regression tracker is not only online, but superior in dealing with object occlusions. In particular, we do not only temporally align detections, but preserve their identity.
2 A detector is all you need
We propose to convert a detector into a Tracktor performing multiple object tracking. Several CNN-based detection algorithms [56, 68] contain some form of bounding box refinement through regression. We propose an exploitation of such a regressor for the task of tracking. This has two key advantages: (i) we do not require any tracking specific training, and (ii) we do not perform any complex optimization at test time, hence our tracker is online. Furthermore, we show that our method achieves state-of-the-art performance on several challenging tracking scenarios.
2.1 Object detector
The core element of our tracking pipeline is a regression-based detector. In our case, we train a Faster R-CNN with ResNet-101 and Feature Pyramid Networks (FPN) on the MOT17Det pedestrian detection dataset.
To perform object detection, Faster R-CNN applies a Region Proposal Network to generate a multitude of bounding box proposals for each potential object. Feature maps for each proposal are extracted via Region of Interest (RoI) pooling, and passed to the classification and regression heads. The classification head assigns an object score to the proposal; in our case, it evaluates the likelihood of the proposal showing a pedestrian. The regression head refines the bounding box location tightly around an object. The detector yields the final set of object detections by applying non-maximum suppression (NMS) to the refined bounding box proposals. Our presented method exploits the aforementioned ability to regress and classify bounding boxes to perform multi-object tracking.
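The final NMS step of the detector can be illustrated with a minimal, framework-free sketch. It is not the detector's actual implementation: the corner box format `(x1, y1, x2, y2)` and the function names `iou` and `nms` are assumptions made for this example only.

```python
from typing import List, Sequence

# Assumption of this sketch: boxes are given as corners (x1, y1, x2, y2).
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float) -> List[int]:
    """Greedy NMS: return the indices of the kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

With two heavily overlapping proposals and one distant proposal, only the higher-scoring box of the overlapping pair and the distant box survive.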
The challenge of multi-object tracking is to extract the spatial and temporal positions, i.e., trajectories, of objects given a frame by frame video sequence. Such a trajectory is defined as a list of ordered object bounding boxes $T_k = \{b^k_{t_1}, b^k_{t_2}, \dots\}$, where a bounding box is defined by its coordinates $b^k_t = (x, y, w, h)$, and $t$ represents a frame of the video. We denote the set of object bounding boxes in frame $t$ with $B_t = \{b^{k_1}_t, b^{k_2}_t, \dots\}$. Note that each $T_k$ or $B_t$ can contain fewer elements than the total number of frames or trajectories in a sequence, respectively. At $t = 0$, our tracker initializes tracks from the first set of detections $D_0 = \{d^1_0, d^2_0, \dots\} = B_0$. In Figure 1 we illustrate the two subsequent processing steps (the nuts and bolts of our method) for a given frame $t$ for all $t > 0$, namely, the bounding box regression and track initialization.
Bounding box regression. The first step, denoted with blue arrows, exploits the bounding box regression to extend active trajectories to the current frame $t$. This is achieved by regressing the bounding box $b^k_{t-1}$ of trajectory $k$ in frame $t-1$ to the object’s new position $b^k_t$ at frame $t$. In the case of Faster R-CNN, this corresponds to applying RoI pooling on the features of the current frame but with the previous bounding box coordinates. Our assumption is that the target has moved only slightly between frames, which usually holds at high frame rates. The identity is automatically transferred from the previous to the regressed bounding box, effectively creating a trajectory. This is repeated for all subsequent frames.
After the bounding box regression, our tracker considers two cases for killing (deactivating) a trajectory: (i) an object leaving the frame or occluded by a non-object is killed if its new classification score $s^k_t$ is below $\sigma_{\text{active}}$, and (ii) occlusions between objects are handled by applying non-maximum suppression (NMS) to all remaining bounding boxes $B_t$ and their corresponding scores with an Intersection over Union (IoU) threshold $\lambda_{\text{active}}$.
Bounding box initialization. In order to account for new targets, the object detector also provides the detections $D_t$ for the entire frame $t$. This second step, indicated in Figure 1 with red arrows, is analogous to the first initialization at $t = 0$. However, a detection from $D_t$ starts a trajectory only if its IoU with each of the already active trajectories $b^k_t$ is smaller than $\lambda_{\text{new}}$. That is, we consider a detection for a new trajectory only if it is covering a potentially new object that is not explained by any trajectory. It should be noted again that our Tracktor does not require any tracking specific training or optimization and solely relies on an object detection method. This allows us to directly benefit from improved object detection methods and, most importantly, enables a comparatively cheap transfer to different tracking datasets or scenarios in which no ground truth tracking but only detection data is available.
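The two processing steps described above (regression with killing, then initialization) can be summarized in a short sketch of one tracker update. This is a simplified illustration under stated assumptions, not the authors' code: the detector's regression and classification heads are stood in for by a hypothetical callable `regress`, boxes use an assumed corner format `(x1, y1, x2, y2)`, and the default threshold values are placeholders rather than values given in the text.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed corner format (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tracktor_step(
    tracks: Dict[int, Box],
    detections: List[Box],
    regress: Callable[[List[Box]], List[Tuple[Box, float]]],  # hypothetical detector head
    next_id: int,
    sigma_active: float = 0.5,   # placeholder threshold, not from the text
    lambda_active: float = 0.6,  # placeholder threshold, not from the text
    lambda_new: float = 0.3,     # placeholder threshold, not from the text
) -> Tuple[Dict[int, Box], int]:
    """Advance all active tracks by one frame and start new ones from detections."""
    ids = list(tracks)
    # Step 1: regress the previous boxes on the features of the current frame.
    regressed = regress([tracks[k] for k in ids])

    # Kill case (i): drop tracks whose new classification score is too low.
    alive = [(k, box, s) for k, (box, s) in zip(ids, regressed) if s >= sigma_active]

    # Kill case (ii): greedy NMS among the surviving tracks, best score first.
    alive.sort(key=lambda item: item[2], reverse=True)
    active: Dict[int, Box] = {}
    for k, box, _ in alive:
        if all(iou(box, kept) <= lambda_active for kept in active.values()):
            active[k] = box

    # Step 2: a detection starts a track only if no active track explains it.
    regressed_boxes = list(active.values())
    for det in detections:
        if all(iou(det, box) < lambda_new for box in regressed_boxes):
            active[next_id] = det
            next_id += 1
    return active, next_id
```

Note that identities are carried over simply by keeping the dictionary key of a regressed box, which mirrors the statement that the identity is transferred automatically.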
2.3 Tracking extensions
In this section, we present two straightforward extensions to our vanilla Tracktor: a motion model and a re-identification algorithm. Both are aimed at improving identity preservation across frames and are common examples of techniques used to enhance, e.g., graph-based tracking methods [44, 67, 40].
Motion model. Our previous assumption that the position of an object changes only slightly from frame to frame does not hold in two scenarios: large camera motion and low video frame rates. In extreme cases, the bounding boxes from frame $t-1$ might not contain the tracked object in frame $t$ at all. Therefore, we apply two types of motion models that improve the bounding box position in future frames. For sequences with a moving camera, we apply a straightforward camera motion compensation (CMC) by aligning frames via image registration using Enhanced Correlation Coefficient (ECC) maximization. For sequences with comparatively low frame rates, we apply a constant velocity assumption (CVA) for all objects as in [13, 3].
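Both motion models amount to moving the previous bounding box before it is regressed. The sketch below illustrates this under explicit assumptions: boxes are in an assumed corner format `(x1, y1, x2, y2)`, and the 2x3 affine `warp` is assumed to come from an image registration step such as ECC, which is not shown. The function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed corner format (x1, y1, x2, y2)
Warp = Tuple[Tuple[float, float, float], Tuple[float, float, float]]  # 2x3 affine


def constant_velocity_step(prev_box: Box, curr_box: Box) -> Box:
    """CVA: shift curr_box by the center displacement observed since prev_box."""
    dx = (curr_box[0] + curr_box[2]) / 2 - (prev_box[0] + prev_box[2]) / 2
    dy = (curr_box[1] + curr_box[3]) / 2 - (prev_box[1] + prev_box[3]) / 2
    x1, y1, x2, y2 = curr_box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


def warp_box(box: Box, warp: Warp) -> Box:
    """CMC: map a box through an affine frame-to-frame warp.

    All four corners are transformed and the enclosing axis-aligned box is
    returned, so that a rotation component does not shrink the box.
    """
    x1, y1, x2, y2 = box
    (a, b, tx), (c, d, ty) = warp
    corners = [(x1, y1), (x2, y1), (x1, y2), (x2, y2)]
    xs = [a * x + b * y + tx for x, y in corners]
    ys = [c * x + d * y + ty for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

For a pure translation warp, `warp_box` simply shifts the box; for a box that moved two pixels to the right between the last two frames, `constant_velocity_step` predicts two further pixels of motion.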
Re-identification. In order to keep our tracker online, we suggest a short-term re-identification (reID) based on appearance vectors generated by a Siamese neural network [7, 27, 58]. To that end, we store killed (deactivated) tracks in their non-regressed version for a fixed number of frames. We then compare the distance in the embedding space of the deactivated with the newly detected tracks and re-identify via a threshold. The embedding space distance is computed from the appearance feature vectors that a Siamese CNN produces for each of the bounding boxes. It should be noted that the reID network is indeed trained on tracking ground truth data. To minimize the risk of false reIDs, we only consider pairs of deactivated and new bounding boxes with a sufficiently large IoU. The motion model is continuously applied to the deactivated tracks.
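The matching rule of the short-term reID can be sketched as follows. This is an illustration only: the appearance vectors are assumed to be precomputed by some embedding network, the Euclidean distance is an assumed choice of embedding distance, and the names `reidentify`, `dist_threshold` and `iou_threshold` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed corner format (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def reidentify(
    new_box: Box,
    new_feat: List[float],
    inactive: Dict[int, Tuple[Box, List[float]]],
    dist_threshold: float,
    iou_threshold: float,
) -> Optional[int]:
    """Return the id of the best matching deactivated track, or None."""
    best_id, best_dist = None, dist_threshold
    for track_id, (box, feat) in inactive.items():
        # Gate by spatial overlap first to reduce the risk of false reIDs.
        if iou(new_box, box) < iou_threshold:
            continue
        # Assumed embedding distance: Euclidean distance between feature vectors.
        dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(new_feat, feat)))
        if dist < best_dist:
            best_id, best_dist = track_id, dist
    return best_id
```

A caller would assign the returned identity to the new detection instead of creating a fresh track, and would drop entries from `inactive` once they have been stored for the fixed number of frames.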
We demonstrate the tracking performance of our proposed Tracktor tracker as well as its extension Tracktor++ on several datasets focusing on pedestrian tracking (Tracktor code: https://github.com/coming_soon). In addition, we perform an ablation study of the aforementioned extensions and further show that our tracker outperforms state-of-the-art methods in tracking accuracy and excels at identity preservation.
MOTChallenge. The multi-object tracking benchmark MOTChallenge (https://motchallenge.net) consists of several challenging pedestrian tracking sequences, with frequent occlusions and crowded scenes. Sequences vary in their angle of view, size of objects, camera motion and frame rate. The challenge contains three separate tracking benchmarks, namely 2D MOT 2015, MOT16 and MOT17. The MOT17 test set includes a total of 7 sequences, each of which is provided with three sets of public detections. The detections originate from different object detectors, each with increasing performance, namely DPM, Faster R-CNN and SDP. Our object detector is trained on the MOT17Det detection benchmark, which contains the same images as MOT17. The MOT16 benchmark also contains the same sequences as MOT17 but only provides DPM public detections. The 2D MOT 2015 benchmark provides ACF detections for 11 sequences. The complexity of the tracking problem requires several metrics to measure different aspects of a tracker’s performance. The Multiple Object Tracking Accuracy (MOTA) and ID F1 Score (IDF1) quantify two of the main aspects, namely, object coverage and identity.
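For readers unfamiliar with the two metrics, the following sketch states their standard definitions in code. The counts passed in are assumed to be accumulated over all frames of a sequence; the function names and the toy numbers in the usage note are illustrative only and are not results from this paper.

```python
def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + ID Sw.) / (number of ground truth boxes)."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN), computed on identity matches."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

For example, 10 misses, 5 false positives and 5 identity switches over 100 ground truth boxes give a MOTA of 0.8. Because all three error types are summed, MOTA can become negative for very poor trackers.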
Public detections. For a fair comparison with other tracking methods, we perform all experiments with the public detections provided by MOTChallenge. That is, all methods compared in this paper, including our approach and its extension, process the same precomputed frame by frame detections. For our method, a new trajectory is only initialized from a public detection bounding box, i.e., we never use our object detector to detect a new bounding box. We only apply the bounding box regressor and classifier to obtain new bounding boxes $b^k_t$ and classification scores $s^k_t$, respectively. The MOTChallenge public benchmark includes multiple methods [33, 10, 15] which classify the given detections with trained neural networks; hence, we consider our processing of the given detections as public as well.
3.1 Ablation study
|Method||MOTA||IDF1||MT||ML||FP||FN||ID Sw.|
|Tracktor++ (reID + CMC)||61.9||64.7||35.3||21.4||323||42454||326|
The ablation study on the MOT17  training set in Table 1 is intended to show three aspects: (i) the superiority of our approach to apply a detector for tracking, (ii) the potential from an improved object detection method and (iii) improvements from extending our vanilla Tracktor with tracking specific methods, namely, re-identification (reID) and camera motion compensation (CMC). It should be noted, that although MOT17Det and MOT17 contain the same images we refrained from a cross-validation on the training set as our vanilla Tracktor was never trained on tracking ground truth data. The video object detector and tracker D&T  trains a detector on tracking ground truth data which generates two-frame tracklets. However, despite a subsequent offline dynamic programming track generation their detector based tracker is inferior to our online regression based track generation over multiple frames. In addition, we demonstrate the potential of our framework with respect to improved detection methods by showing the tracking performance of Tracktor-no-FPN, i.e., our approach and a Faster R-CNN without Feature Pyramid Networks (FPN) . Despite the simple nature of our extensions, their contribution is significant towards the drastic reduction of identity switches and an increment of the IDF1 measure. In the next section, we show that this effect successfully translates to a comparison with other state-of-the-art methods on the test set.
3.2 Benchmark evaluation
We evaluate the performance of our Tracktor++ on the test set of the respective benchmark, without any training or optimization on the tracking train set. Table 2 presents the overall results accumulated over all sequences, and for MOT17 over all three sets of public detections. For our comparison, we only consider officially published and peer-reviewed entries in the MOTChallenge benchmark. A detailed summary of our results on individual sequences will be provided in the supplementary material. For all sequences, camera motion compensation (CMC) and reID are used. The only low frame rate sequence is the 2D MOT 2015 AVG-TownCentre, for which we apply the mentioned constant velocity assumption (CVA). For the two autonomous driving sequences, originally from the KITTI benchmark, we apply the rotation as well as translation camera motion compensation. Note that we use the same Tracktor++ tracker, trained on MOT17Det object detections, for all benchmarks. As we show, it is able to achieve a new state-of-the-art in terms of MOTA on all three challenges.
In particular, our results on MOT16 demonstrate the ability of our tracker to cope with detections of comparatively minor performance. Due to the nature of our tracker and the robustness of the frame by frame bounding box regression, we outperform all other trackers on MOT16 by a large margin, specifically in terms of false negatives (FN) and identity preserving (IDF1). It should be noted, that we also provide a new state-of-the-art on 2D MOT 2015, even though the characteristics of the scenes are very different from MOT17. We do not use MOT15 training sequences, which further illustrates the generalization strength of our tracker.
|Method||Online||Graph||reID||Appearance model||Motion model||Other|
|FWT|| ||Dense|| || || ||Face detection|
|jCC|| ||Dense|| || || ||Point trajectories|
The superior performance of our tracker without any tracking specific training or optimization demands a more thorough analysis. Without sophisticated tracking methods, it is not expected to excel in crowded and occluded, but rather only in benevolent, tracking scenarios. This raises the question of whether more common tracking methods also fail to specifically address these complex scenarios. Our experiments and the subsequent analysis are meant to demonstrate the strengths of our approach for easy tracking scenarios and motivate future research to focus on remaining complex tracking problems. In particular, we question the common execution of tracking-by-detection and suggest a new tracking paradigm. The subsequent analysis is conducted on the MOT17 training data and we compare all top performing methods with publicly shared data.
4.1 Tracking challenges
For a better understanding of our tracker we want to analyse challenging tracking scenarios and compare its strengths and weaknesses to other trackers. To this end, we summarize their fundamental characteristics in Table 3. FWT  and jCC  both apply a dense offline graph optimization on all detections in a given sequence. In contrast, MHT_DAM  limits its optimization to a sparse forward view of hypothetical trajectories.
Object visibility. Intuitively, we expect diminished tracking performance for object-object or object-non-object occlusions, i.e., for targets with diminished visibility. In Figure 2, we compare the ratio of successfully tracked bounding boxes with respect to their visibility. The transparent red bar indicates the occurrences of ground truth bounding boxes for each visibility, and illustrates the proportionate impact of a visibility on the overall performance of the trackers. Our method achieves superior performance even for partially occluded bounding boxes with visibilities as low as 0.3. Neither the identity preserving aspects of MHT_DAM and MOTDT17 nor the offline interpolation capabilities of MHT_DAM and jCC seem to successfully tackle highly occluded objects. The high MOTA in Table 2 of all presented methods is largely due to the unbalanced distribution of ground truth visibilities. As expected, our extended version only achieves minor improvements over our vanilla Tracktor.
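The kind of per-visibility statistic discussed here can be computed with a simple binning routine. The sketch below is an assumption about how such an analysis might be set up, not the evaluation code behind Figure 2: each ground truth box is assumed to come with a visibility value in [0, 1] and a flag stating whether the tracker covered it, and the number of bins is arbitrary.

```python
from typing import Iterable, List, Optional, Tuple


def tracked_ratio_by_visibility(
    records: Iterable[Tuple[float, bool]], num_bins: int = 10
) -> List[Optional[float]]:
    """Ratio of tracked ground truth boxes per visibility bin.

    records holds one (visibility, tracked) pair per ground truth box.
    Bins without any ground truth box yield None instead of a ratio.
    """
    hits = [0] * num_bins
    totals = [0] * num_bins
    for visibility, tracked in records:
        # A visibility of exactly 1.0 is assigned to the last bin.
        b = min(int(visibility * num_bins), num_bins - 1)
        totals[b] += 1
        hits[b] += int(tracked)
    return [h / t if t else None for h, t in zip(hits, totals)]
```

Returning the per-bin totals alongside the ratios would reproduce the information carried by the transparent red bars, i.e., how strongly each visibility range weighs on the overall score.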
Object size. In view of the large fraction of visible but not tracked objects in Figure 2, we argue that the trackability of an object is not only dependent on its visibility, but also its size. Therefore, we conduct the same comparison as for the visibility but for the size of an object. In the first row of Figure 3 we assume the height of a pedestrian to be proportional to its size and compare on all three MOT17 public detection sets. All methods performed similarly well for object heights larger than 250 pixels. To demonstrate their shortcomings even for highly visible objects we only compare objects with a visibility larger than 0.9. As expected, the trackability of an object decreases drastically with its size across all three detection sets. Our tracker shows its strength in compensating for insufficient DPM and Faster R-CNN detections for all object sizes. All methods except MOTDT17 benefit from the additional small detections provided by SDP. For our tracker this is largely due to the Feature Pyramid Network extension of our Faster R-CNN detector. However, the learned appearance model and reID of the online MOTDT17 method seem generally vulnerable to small detections. Appearance models generally suffer from small object sizes and few observed pixels. In conclusion, except for our compensation of inferior detections, none of the trackers exhibit a notably better performance with respect to varying object sizes.
Robustness to detections. The performance of tracking-by-detection methods with respect to visibility and size is inherently limited by the robustness of the underlying detection method. However, as observed for the object size, trackers differ in their ability to cope with, or benefit from, varying quality of detections. In the second row of Figure 3, we quantify this ability in terms of detection gaps and their coverage by the tracker. We define a detection gap as an undetected part of a ground truth trajectory that was detected at least once, and compare the coverage of each gap vs. the gap length. Intuitively, long gaps are harder to compensate for, as the online or offline tracker has to perform a longer hallucination or interpolation, respectively. We indicate the occurrences of gap lengths over the respective set of detections in transparent red. For DPM and Faster R-CNN detections, two solutions lead to notable gap coverage: (i) offline interpolation such as in jCC, or (ii) motion prediction with Kalman filter and reID as in MOTDT17. Compared to the graph-based jCC method, the online MOTDT17 method excels at covering particularly long gaps. However, none of these dedicated tracking methods yields similar robustness to our frame by frame regression tracker, which achieves far superior coverage. This holds especially true for very long detection gaps. Offline methods benefit the most from improved SDP detections, and neither our tracker nor MOTDT17 shows notable robustness to gap length on them.
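A possible way to extract detection gaps and their coverage from per-frame flags is sketched below. It encodes one reading of the definition, stated here as an assumption: a gap is a maximal run of frames without a detection that follows the first detection of a ground truth trajectory, and a trailing run at the end of the trajectory also counts as a gap. The function name and input format are hypothetical.

```python
from typing import List, Tuple


def gap_coverage(detected: List[bool], tracked: List[bool]) -> List[Tuple[int, float]]:
    """Return (gap_length, covered_fraction) for every detection gap.

    detected[t] states whether the ground truth box in frame t was detected,
    tracked[t] whether the tracker covered it. Frames before the first
    detection are ignored (assumption of this sketch).
    """
    gaps: List[Tuple[int, float]] = []
    seen_detection = False
    start = None  # first frame index of the gap currently being scanned
    for t, det in enumerate(detected):
        if det:
            if start is not None:
                length = t - start
                gaps.append((length, sum(tracked[start:t]) / length))
                start = None
            seen_detection = True
        elif seen_detection and start is None:
            start = t
    if start is not None:  # gap running until the end of the trajectory
        length = len(detected) - start
        gaps.append((length, sum(tracked[start:]) / length))
    return gaps
```

Grouping the returned pairs by gap length and averaging the covered fraction gives a coverage-versus-gap-length curve of the kind compared in the text.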
Identity preservation. The results of our Tracktor++ summarized in Table 2 indicate an identity preservation performance in terms of IDF1 and identity switches comparable with dedicated tracking methods. This is achieved without any offline graph optimization as in jCC or eHAF. In particular, MOTDT17, which applies a sophisticated appearance model and reID, is not substantially superior to our regression tracker and its comparatively simple extensions. Moreover, our method excels in reducing the number of false positives in MOT17 as well as MOT16. In addition, we have shown that our Tracktor is capable of incorporating additional identity preserving extensions.
4.2 Oracle trackers
We have shown that none of the dedicated tracking methods specifically targets challenging tracking scenarios, i.e., objects under heavy occlusions or small objects. We therefore want to motivate our Tracktor as a new tracking paradigm. To this end, we analyse our performance two-fold: (i) the impact of the object detector on the killing policy and bounding box regression, (ii) identify performance upper bounds for potential extensions to our Tracktor. In Table 4 we present several oracle trackers by replacing parts of our algorithm with ground truth information. If not mentioned otherwise, all other tracking aspects are handled by our vanilla Tracktor. Their analysis should provide researchers with useful insights regarding the most promising research directions and extensions of our Tracktor.
Oracle detector. To illustrate the effect of a potentially perfect object detector, we introduce two oracles:
Oracle-Kill: Instead of killing with NMS or classification score we use ground truth information.
Oracle-REG: Instead of regression, we place the bounding boxes at their ground truth position.
While the perfect killing policy does not achieve notable improvements, a detector with perfect regression capabilities would yield substantial performance improvements with respect to MOTA, ID Sw. and FP.
Oracle extensions. Most notably, our extended Tracktor is already able to compensate for some of the object detector’s insufficiencies with respect to killing and regression. The following oracles ought to illustrate the potential performance gains and upper bounds for our Tracktor extended with perfect reID and motion model. In order to remain online, this excludes any form of hindsight tracking gap interpolation. To this end, we analyse two additional oracles:
Oracle-MM: A motion model places each bounding box at the center of the ground truth in the next frame.
Oracle-reID: Re-identification is performed with ground truth identities.
As expected, both oracles reduce the identity switches substantially. The combined Oracle-MM-reID, which represents the upper bound of our Tracktor++ tracker, promises significant improvements for the IDF1 measure due to its identity preserving characteristics.
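To make the Oracle-MM definition concrete, the sketch below moves a tracked box onto the ground truth center of the next frame. It assumes that the box keeps its current width and height, which the text does not specify, and uses an assumed corner format `(x1, y1, x2, y2)`; the function name is hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed corner format (x1, y1, x2, y2)


def oracle_motion_model(box: Box, gt_next: Box) -> Box:
    """Place box at the center of the next-frame ground truth box.

    Assumption of this sketch: only the position is taken from the ground
    truth, while the width and height of the tracked box are kept.
    """
    w, h = box[2] - box[0], box[3] - box[1]
    cx, cy = (gt_next[0] + gt_next[2]) / 2, (gt_next[1] + gt_next[3]) / 2
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

In such a setup the regression head would still refine the recentered box, so this oracle isolates the effect of a perfect position prior rather than of perfect regression.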
Oracle all. The omniscient Oracle-ALL performs ground truth killing, regression, reID and motion prediction. We consider its top MOTA of 77.0%, in combination with a high IDF1 and virtually no false positives, as the absolute upper bound of our Tracktor on Faster R-CNN public detections in an online setting.
4.3 Towards a new tracking paradigm
The substantial gap between Oracle-ALL (77.0% MOTA) and Oracle-MM-reID (62.4% MOTA) demonstrates the necessity of a perfect killing policy, or in more practical terms a motion prediction model that hallucinates the position of an object through long occlusions. Performing a linear interpolation of the bounding boxes as in Oracle-reID-INTER and Oracle-MM-reID-INTER does not yield a similar performance. This is largely due to wrong linear occlusion paths caused by long detection gaps and camera movement. We therefore propose two approaches to apply our Tracktor.
Tracktor with extensions. Apply our Tracktor in an online fashion to a given set of detections and extend it with tracking specific methods. Many scenarios with large and highly visible objects will be easily covered by our frame to frame bounding box regression. For the remaining a proper motion predictor that takes into account the layout of the scene and the configuration of objects seems most promising. In addition, such a hallucinating predictor reduces the necessity for advanced killing and re-identification policies.
Tracklet generation. By extending the tracking-by-detection paradigm, one could argue for a tracking-by-tracklet approach. Indeed, many algorithms already use tracklets as input [26, 71], as they are richer in information for computing motion or appearance models. Nonetheless, a specific tracking method is usually used to create these tracklets. We advocate the exploitation of the detector itself, not only to create sparse detections but frame to frame tracklets. In this view, the complex tracking cases remain to be tackled by a subsequent tracking method.
In this work, we have formally defined those hard cases, analyzing the situations in which not only our method but other dedicated tracking solutions fail. Hence, we raise the question whether current methods actually focus on the real challenges in multi-object tracking.
We have shown that the bounding box regressor of a trained Faster-RCNN detector is enough to solve most tracking scenarios present in current benchmarks. A detector converted to Tracktor needs no specific training on tracking ground truth data and is able to work in an online fashion. In addition, we have shown that our Tracktor is extendable with re-identification and camera motion compensation, providing a substantial new state-of-the-art on the MOTChallenge. We analyzed the performance of multiple dedicated tracking methods on challenging tracking scenarios and none yielded substantially better performance compared to our regression based Tracktor. We hope this work will establish a new tracking paradigm, allowing the full use of a detector’s capabilities.
-  J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. CVPR, 2017.
-  A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social lstm: Human trajectory prediction in crowded spaces. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  A. Andriyenko and K. Schindler. Multi-target tracking by continuous energy minimization. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1265–1272, 2011.
-  J. Berclaz, F. Fleuret, and P. Fua. Robust people tracking with global trajectory optimization. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 744–750, 2006.
-  J. Berclaz, F. Fleuret, E. Türetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(9):1806–1819, 2011.
-  M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. van Gool. Robust tracking-by-detection using a detector confidence particle filter. IEEE International Conference on Computer Vision (ICCV), pages 1515–1522, 2009.
-  J. Bromley, I. Guyon, Y. LeCun, E. Saeckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. NIPS, 1993.
-  B. Chen, D. Wang, P. Li, S. Wang, and H. Lu. Real-time 'actor-critic' tracking. In The European Conference on Computer Vision (ECCV), September 2018.
-  L. Chen, H. Ai, C. Shang, Z. Zhuang, and B. Bai. Online multi-object tracking with convolutional neural networks. pages 645–649, Sept 2017.
-  L. Chen, H. Ai, Z. Zhuang, and C. Shang. Real-time multiple people tracking with deeply learned candidate selection and person re-identification, 07 2018.
-  X. Chen and A. Gupta. An implementation of faster RCNN with study for region sampling. CoRR, abs/1702.02138, 2017.
-  W. Choi. Near-online multi-target tracking with aggregated local flow descriptor. ICCV, 2015.
-  W. Choi and S. Savarese. Multiple target tracking in world coordinate with single, minimally calibrated camera. European Conference on Computer Vision (ECCV), pages 553–567, 2010.
-  W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. European Conference on Computer Vision (ECCV), pages 215–230, 2012.
-  Y. chul Yoon, A. Boragule, Y. min Song, K. Yoon, and M. Jeon. Online multi-object tracking with historical appearance matching and scene adaptive detection filtering. AVSS, 2018.
-  P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. PAMI, 36(8):1532–1545, Aug. 2014.
-  A. Ess, B. Leibe, K. Schindler, and L. van Gool. A mobile vision system for robust multi-person tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
-  G. D. Evangelidis and E. Z. Psarakis. Parametric image alignment using enhanced correlation coefficient maximization. PAMI, 30(10):1858–1865, 2008.
-  K. Fang, Y. Xiang, and S. Savarese. Recurrent autoregressive networks for online multi-object tracking. WACV, abs/1711.02741, 2017.
-  C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. 2017 IEEE International Conference on Computer Vision (ICCV), pages 3057–3065, 2017.
-  P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32:1627–1645, 2009.
-  A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  R. B. Girshick. Fast r-cnn. ICCV, pages 1440–1448, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, abs/1512.03385, 2015.
-  R. Henschel, L. Leal-Taixé, D. Cremers, and B. Rosenhahn. Improvements to frank-wolfe optimization for multi-detector multi-object tracking. CVPR, abs/1705.08314, 2017.
-  R. Henschel, L. Leal-Taixé, and B. Rosenhahn. Efficient multiple people tracking using minimum cost arborescences. German Conference on Pattern Recognition (GCPR), 2014.
-  A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. CoRR, abs/1703.07737, 2017.
-  J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CVPR, abs/1611.10012, 2016.
-  H. Jiang, S. Fels, and J. Little. A linear programming approach for multiple object tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.
-  K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan, X. Liu, and X. Wang. Object detection in videos with tubelet proposal networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 889–897, 2017.
-  K. Kang, W. Ouyang, H. Li, and X. Wang. Object detection from video tubelets with convolutional neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 817–825, 2016.
-  R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, M. Boonstra, V. Korzhova, and J. Zhang. Framework for performance evaluation for face, text and vehicle detection and tracking in video: data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 31(2):319–336, 2009.
-  M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele. Motion segmentation & multiple object tracking by correlation co-clustering. PAMI, pages 1–1, 2018.
-  C. Kim, F. Li, A. Ciptadi, and J. Rehg. Multiple hypothesis tracking revisited: Blending in modern appearance model. ICCV, 2015.
-  C. Kim, F. Li, A. Ciptadi, and J. M. Rehg. Multiple hypothesis tracking revisited. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4696–4704, Dec 2015.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. Neural Information Processing Systems (NIPS), 2012.
-  H. W. Kuhn. The hungarian method for the assignment problem. Naval Res. Logist. Quart, pages 83–97, 1955.
-  C.-H. Kuo and R. Nevatia. How does person identity recognition help multi-person tracking? CVPR, 2011.
-  L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler. Learning by tracking: siamese cnn for robust target association. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR). DeepVision: Deep Learning for Computer Vision., 2016.
-  L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese. Learning an image-based motion context for multiple people tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv:1504.01942, 2015.
-  L. Leal-Taixé, A. Milan, K. Schindler, D. Cremers, I. D. Reid, and S. Roth. Tracking the trackers: An analysis of the state of the art in multiple object tracking. CoRR, abs/1704.02781, 2017.
-  L. Leal-Taixé, G. Pons-Moll, and B. Rosenhahn. Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. IEEE International Conference on Computer Vision (ICCV) Workshops. 1st Workshop on Modeling, Simulation and Visual Analysis of Large Crowds, 2011.
-  L. Leal-Taixé, G. Pons-Moll, and B. Rosenhahn. Branch-and-price global optimization for multi-view multi-target tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  W. Luo, X. Zhao, and T.-K. Kim. Multiple object tracking: A review. arXiv:1409.7618 [cs], Sept. 2014.
-  C. Ma, C. Yang, F. Yang, Y. Zhuang, Z. Zhang, H. Jia, and X. Xie. Trajectory factory: Tracklet cleaving and re-connection by deep siamese bi-gru for multiple object tracking. ICME, abs/1804.04555, 2018.
-  L. Ma, S. Tang, M. Black, and L. van Gool. Customized multi-person tracker. 2019.
-  A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. Mot16: A benchmark for multi-object tracking. arXiv:1603.00831, 2016.
-  A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid. Joint tracking and segmentation of multiple targets. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, Z. Wang, R. Wang, X. Wang, and W. Ouyang. T-cnn: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28:2896–2907, 2018.
-  S. Pellegrini, A. Ess, K. Schindler, and L. van Gool. You’ll never walk alone: modeling social behavior for multi-target tracking. IEEE International Conference on Computer Vision (ICCV), pages 261–268, 2009.
-  H. Pirsiavash, D. Ramanan, and C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1201–1208, 2011.
-  L. Ren, J. Lu, Z. Wang, Q. Tian, and J. Zhou. Collaborative deep reinforcement learning for multi-object tracking. ECCV, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Neural Information Processing Systems (NIPS), 2015.
-  E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. ECCV Workshops, 2016.
-  E. Ristani and C. Tomasi. Features for multi-target multi-camera tracking and re-identification. CVPR, 2018.
-  A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese. Learning social etiquette: Human trajectory prediction. European Conference on Computer Vision (ECCV), 2016.
-  A. Sadeghian, A. Alahi, and S. Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. ICCV, abs/1701.01909, 2017.
-  P. Scovanner and M. Tappen. Learning pedestrian dynamics from the real world. IEEE International Conference on Computer Vision (ICCV), pages 381–388, 2009.
-  H. Sheng, Y. Zhang, J. Chen, Z. Xiong, and J. Zhang. Heterogeneous association graph fusion for target association in multiple object tracking. IEEE Transactions on Circuits and Systems for Video Technology, 2018.
-  S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person re-identification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3701–3710, July 2017.
-  S. Tang, M. Andriluka, and B. Schiele. Multi people tracking with lifted multicut and person re-identification. CVPR, 2017.
-  Z. Wu, T. Kunz, and M. Betke. Efficient track linking methods for track graphs using network-flow and set-cover techniques. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1185–1192, 2011.
-  K. Yamaguchi, A. Berg, L. Ortiz, and T. Berg. Who are you with and where are you going? IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1345–1352, 2011.
-  B. Yang and R. Nevatia. An online learned crf model for multi-target tracking. CVPR, 2012.
-  F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. CVPR, 2016.
-  F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. CVPR, pages 2129–2137, 2016.
-  Q. Yu, G. Medioni, and I. Cohen. Multiple target tracking using spatio-temporal markov chain monte carlo data association. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.
-  A. Zamir, A. Dehghan, and M. Shah. Gmcp-tracker: Global multi-object tracking using generalized minimum clique graphs. ECCV, 2012.
-  L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
-  J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M. Yang. Online multi-object tracking with dual matching attention networks. ECCV, 2018.
Appendix A Implementation
To complete the presentation of our Tracktor tracker and its extensions, and to facilitate the reproduction of our results (code will be published), we provide additional implementation details and references.
In Algorithms 1 and 2, we present a structured pseudocode representation of our Tracktor for private and public detections, respectively. Algorithm 1 corresponds to the method illustrated in Figure 1 and Section 2.2 of our main work.
As mentioned before, our approach requires no dedicated training or optimization on tracking ground truth data and performs tracking only with an object detection method. To this end, we train the Faster R-CNN (FRCNN)  multi-object detector with Feature Pyramid Networks (FPN)  on the MOT17Det  dataset.
In addition, we follow the improvements suggested by . These include a replacement of the Region of Interest (RoI) pooling  by the crop and resize pooling suggested by Huang et al.  and training with a batch size of instead of while increasing the number of extracted regions from to . These changes and the addition of FPN ought to improve the detection results for comparatively small objects. We achieve the best results with a ResNet-101  as the underlying feature extractor. In Table 1 we compare the performance on the official MOT17Det detection benchmark for the three object detection methods mentioned in this work. The results demonstrate the incremental gain in detection performance of DPM , FRCNN and SDP  (ascending order). Our FRCNN implementation without FPN is on par with the official MOT17Det entry and represents the detector applied in the Tracktor-no-FPN variant of our ablation study in Section 3.1.
|FRCNN + FPN||0.81||70.2||14914||19196||96.5||83.2|
A.2 Tracking extensions
Our Tracktor++ tracker extends the Tracktor with two extensions specific to multi-pedestrian tracking, namely, a motion model and re-identification.
For the motion model via camera motion compensation (CMC), we apply image registration using Enhanced Correlation Coefficient (ECC) maximization as in . The underlying image registration supports either a Euclidean or an affine alignment mode. We apply the former for rotating camera movements, e.g., those caused by an unsteady camera. In the case of an additional camera translation, such as in the autonomous driving sequences of 2D MOT 2015 , we resort to the affine transformation. It should be noted that in MOT17 , camera translation is comparatively slow, and we therefore treat all sequences as only rotating. In addition, we present a second motion model which aims at facilitating the regression for sequences with low frame rates, i.e., large object displacements between frames. Before we perform bounding box regression, the constant velocity assumption (CVA) model shifts bounding boxes in the direction of their previous velocity. This is achieved by moving the center of the bounding box by the vectorial difference of the bounding box centers in the two previous frames. The CVA motion model is only applied to the AVG-TownCentre sequence of 2D MOT 2015.
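The constant velocity shift described above can be illustrated with a minimal Python sketch. The box layout `(x1, y1, x2, y2)` and the function names `center` and `cva_shift` are assumptions made for this illustration and are not taken from our implementation.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def cva_shift(prev_box: Box, prev_prev_box: Box) -> Box:
    """Shift the most recent box by the displacement between the
    centers of the two previous boxes (constant velocity assumption).
    Width and height are left unchanged."""
    cx1, cy1 = center(prev_box)
    cx0, cy0 = center(prev_prev_box)
    dx, dy = cx1 - cx0, cy1 - cy0
    x1, y1, x2, y2 = prev_box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


# The shifted box would then be passed to the bounding box regression.
shifted = cva_shift((10.0, 10.0, 20.0, 30.0), (6.0, 8.0, 16.0, 28.0))
```

In this example the center moved by (4, 2) between the two previous frames, so the box is shifted by the same vector before regression.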
Our short-term re-identification utilizes a Siamese neural network to compare bounding box features and return a measure of their identity. To this end, we train the TriNet  architecture, which is based on ResNet-50 , with the triplet loss and batch hard strategy as presented in . The network is optimized with Adam  with and a decaying learning rate as described in . Training samples with corresponding identities are generated from the MOT17 tracking ground truth training data. The TriNet architecture requires input data with a dimension of . To allow for subsequent data augmentation via horizontal flips and random cropping, each ground truth bounding box is cropped and resized to . A training batch consists of randomly selected identities, each of which is represented by different samples. Identities with fewer than 4 samples in the ground truth data are discarded.
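The batch construction described above can be sketched as follows. This is an illustrative sampler only: the function name `sample_pk_batch`, its parameters, and the values used in the example (2 identities, 4 samples each) are hypothetical and do not reproduce our training configuration.

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_pk_batch(
    samples_by_identity: Dict[int, List[str]],
    num_identities: int,
    samples_per_identity: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, str]]:
    """Draw a batch of `num_identities` random identities with
    `samples_per_identity` distinct samples each. Identities with too
    few samples are discarded before sampling."""
    rng = rng or random.Random()
    eligible = {
        pid: crops
        for pid, crops in samples_by_identity.items()
        if len(crops) >= samples_per_identity
    }
    if len(eligible) < num_identities:
        raise ValueError("not enough identities with sufficient samples")
    batch = []
    for pid in rng.sample(sorted(eligible), num_identities):
        for crop in rng.sample(eligible[pid], samples_per_identity):
            batch.append((pid, crop))
    return batch


# Hypothetical toy data: identity 3 has only two crops and is discarded.
data = {
    1: ["a0", "a1", "a2", "a3", "a4"],
    2: ["b0", "b1", "b2", "b3"],
    3: ["c0", "c1"],
    4: ["d0", "d1", "d2", "d3", "d4", "d5"],
}
batch = sample_pk_batch(data, num_identities=2, samples_per_identity=4,
                        rng=random.Random(0))
```

Grouping several samples per identity in every batch is what makes the batch hard strategy possible, since each anchor then has positives and negatives available within the same batch.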
Appendix B Experiments
A detailed summary of our official and published MOTChallenge benchmark results for our Tracktor++ tracker is presented in Table 2. For the corresponding results for each sequence and set of detections for the other trackers mentioned in this work we refer to the official MOTChallenge web page available at https://motchallenge.net.
B.1 Evaluation metrics
To measure the performance of a tracker, we referred to the Multiple Object Tracking Accuracy (MOTA)  and the ID F1 Score (IDF1) . However, previous tables such as Table 2 included additional informative metrics. The false positives (FP) and false negatives (FN) count the bounding boxes that cover no ground truth bounding box and the ground truth bounding boxes that are covered by no bounding box, respectively. To measure the track identity preserving capabilities, we report the total number of identity switches (ID Sw.), i.e., a bounding box covering a ground truth bounding box from a different track than in the previous frame. The mostly tracked (MT) and mostly lost (ML) metrics provide track-wise information on how many ground truth tracks are covered by bounding boxes for at least 80% or at most 20% of their length, respectively. MOTA and IDF1 are meaningful combinations of the aforementioned basic metrics. All metrics were computed using the official evaluation code provided by the MOTChallenge benchmark.
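As an illustration of how the basic counts combine, the following Python sketch computes MOTA using the standard CLEAR MOT definition, MOTA = 1 − (FN + FP + ID Sw.) / (number of ground truth boxes), and labels a ground truth track by its coverage. The function names and the label "PT" for the remaining tracks are assumptions of this sketch; the official evaluation code remains the authoritative reference.

```python
def mota(fp: int, fn: int, id_switches: int, num_gt_boxes: int) -> float:
    """MOTA following the standard CLEAR MOT definition."""
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    return 1.0 - (fp + fn + id_switches) / num_gt_boxes


def track_coverage_label(covered_frames: int, total_frames: int) -> str:
    """Label a ground truth track as mostly tracked (MT), mostly lost (ML),
    or neither ("PT", an assumed label for the remaining tracks)."""
    ratio = covered_frames / total_frames
    if ratio >= 0.8:
        return "MT"
    if ratio <= 0.2:
        return "ML"
    return "PT"


# Hypothetical counts, not taken from any benchmark result.
example_mota = mota(fp=10, fn=20, id_switches=5, num_gt_boxes=100)
```

Note that MOTA can become negative when the number of errors exceeds the number of ground truth boxes.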
B.2 Raw DPM detections
Like most object detection methods, DPM applies a final non-maximum suppression (NMS) step to a large set of raw detections. The MOT16  benchmark provides both the set before and the set after NMS as public DPM detections. However, this NMS step is performed with DPM classification scores and an unknown Intersection over Union (IoU) threshold. Therefore, we extracted our own classification scores for all raw detections and applied our own NMS step. Although they are not specifically provided, we followed the convention of also processing raw DPM detections for MOT17. Note that several other public trackers already work on raw detections [33, 10, 15] with their own classification scores and NMS procedures. Therefore, we consider the comparison with public trackers to be fair.
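For readers unfamiliar with the procedure, a minimal sketch of greedy NMS on scored boxes is given below. It illustrates the general technique under the assumption of `(x1, y1, x2, y2)` boxes; the function names and the threshold in the example are hypothetical and do not reflect the values used for the DPM detections.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already kept box does not exceed the threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


raw_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(raw_boxes, scores=[0.9, 0.8, 0.7], iou_threshold=0.5)
```

Because the outcome depends on both the scores and the IoU threshold, replacing either one, as we do for the raw DPM detections, can change which detections survive.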
B.3 Tracktor thresholds
To demonstrate the robustness of our tracker with respect to the classification score and IoU thresholds, we refrained from any sequence- or detection-specific fine-tuning. In particular, we performed our experiments on all benchmarks with , and , which were chosen to be optimal for the MOT17 training dataset. In general, a higher than introduces stability into the tracker, as fewer active tracks are killed by the NMS and fewer new tracks are initialized. A comparatively higher relaxes potential object-object occlusions and implies a certain confidence in the regression performance.
Appendix C Oracle trackers
In our main work, we conclude the analysis in Section 4 with a comparison of multiple oracle trackers that highlight the potential of future research directions. For each oracle, one or multiple aspects of our vanilla Tracktor are substituted with ground truth information, thereby simulating perfect behavior. For further understanding, we provide more details on the oracles for each of the distinct tracking aspects:
Oracle-Kill: This oracle kills tracks only if they have an IoU less than 0.5 with the corresponding ground truth bounding box. The matching between predicted and ground truth tracks is performed with the Hungarian  algorithm. In the case of an object-object occlusion (IoU ) the ground truth matching is applied to decide which of the objects is occluded by the other and therefore should be killed.
Oracle-REG: We simulate a perfect regression by matching tracks with an IoU threshold of 0.5 to the ground truth at frame . The regression oracle then sets track bounding boxes to the corresponding ground truth coordinates at frame .
Oracle-MM: A perfect motion model works analogously to Oracle-REG, but we only move the previous bounding box center to the center of the ground truth bounding box at frame . However, the bounding box height and width are still determined by the regression.
Oracle-reID: Again, we use the Hungarian algorithm to match the new set of detections to the ground truth data. Ground truth identity matches between inactive tracks and new detections yield a perfect re-identification.
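To make the difference between Oracle-REG and Oracle-MM concrete, the following Python sketch shows the box update each oracle performs once a track has been matched to a ground truth box. The matching itself is assumed to have happened already, and the function names and the `(x1, y1, x2, y2)` box layout are hypothetical choices for this illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def oracle_reg(track_box: Box, gt_box: Box) -> Box:
    """Perfect regression: the track box is replaced by the matched
    ground truth coordinates."""
    return gt_box


def oracle_mm(prev_box: Box, gt_box: Box) -> Box:
    """Perfect motion model: only the center of the previous box is moved
    to the ground truth center. Width and height are kept, so the
    subsequent regression still determines the final box size."""
    width = prev_box[2] - prev_box[0]
    height = prev_box[3] - prev_box[1]
    gt_cx = (gt_box[0] + gt_box[2]) / 2.0
    gt_cy = (gt_box[1] + gt_box[3]) / 2.0
    return (gt_cx - width / 2.0, gt_cy - height / 2.0,
            gt_cx + width / 2.0, gt_cy + height / 2.0)


prev = (0.0, 0.0, 10.0, 20.0)
gt = (2.0, 2.0, 14.0, 26.0)
reg_box = oracle_reg(prev, gt)
mm_box = oracle_mm(prev, gt)
```

Here Oracle-REG adopts the ground truth box in full, whereas Oracle-MM only recenters the previous box and leaves its 10 by 20 size untouched.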
|2D MOT 2015 |