Tracking-by-detection approaches have been leading the online (no peeking into the future) multi-object tracking (MOT) benchmarks, thanks to high-performing object detection models. A group of pragmatic tracking-by-detection approaches for 2D/3D MOT base their data association simply on bounding box overlap or object center distance and build upon the Kalman filter. The majority of the participants in the Waymo 2D and 3D tracking challenges rely on such methods. Although tracking-by-detection depends heavily on strong object detectors, better overall performance can still be achieved by improving the data association scheme and the tracking framework.
In recent literature, there is a trend toward joint detection and tracking in a single network, such as RetinaTrack, and CenterTrack and FairMOT, which are built on top of CenterNet. This paradigm was adopted by some participants in the 2D tracking challenge. The CenterTrack network learns the 2D offset of the same object between two adjacent frames and associates objects over time based on center distance. The overall idea of CenterTrack is simple yet effective. One limitation of CenterTrack is that it tracks through purely local center offsets and is therefore unable to handle long-term occlusion or missed detections.
For the 2D and 3D tracking challenges, we propose a unified and pragmatic framework named HorizonMOT that focuses on frame-to-frame prediction and association, and is applicable to both 2D camera-based tracking in the image space and LiDAR-based 3D tracking in the 3D world space, as shown in Figure 1. Our trackers are online: only detections up to the current frame are presented to the tracker, and the result for the current frame is produced immediately without latency. Our trackers belong to the tracking-by-detection paradigm.
2 Detection Network
2.1 2D Detection Network
High-performing detectors are the key to the success of tracking-by-detection approaches. We employ the one-stage, anchor-free, non-maximum suppression (NMS)-free CenterNet framework for 2D object detection. Under the CenterNet paradigm, many complicated perception tasks can be simplified into a unified framework of object center point detection plus regression of object properties such as bounding box size, 3D information (e.g. 3D location, 3D dimension, heading), pose, or embedding.
We use Hourglass as the CenterNet backbone. As illustrated in Figure 2, two hourglass blocks are stacked and the first one serves only to provide an auxiliary loss during training. We tried using both stacks for inference but it did not improve the results.
2.2 3D Detection Network
In the 3D detection track of the Waymo Open Dataset Challenges, our solution improves upon our baseline 3D point cloud detector AFDet and reached 1st place; we use the 3D detections produced by this solution as input to our 3D tracker.
3 Tracking Framework
Tracking-by-detection consists of the following components: 1) track creation and deletion; 2) state prediction and update using the Kalman filter; 3) association between tracks and detections. We assume that no ego-motion information and no future information is available. An illustration of our tracking framework and data association method is shown in Figure 3.
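For illustration, the per-frame loop combining these three components can be sketched as follows. This is a minimal sketch, not the exact implementation: the `Track` fields, the `step` signature, and the pluggable `associate` callback are hypothetical simplifications.

```python
class Track:
    """Minimal track stub (field names are illustrative)."""
    def __init__(self, det, tid):
        self.id, self.state, self.misses = tid, det, 0
    def predict(self):
        pass  # a constant-velocity Kalman prediction would go here
    def update(self, det):
        self.state, self.misses = det, 0  # reset the miss counter on association

def step(tracks, dets, associate, max_age, next_id):
    """One online tracking step: predict, associate, update, delete, create."""
    # 1) State prediction for every live track.
    for t in tracks:
        t.predict()
    # 2) Association between predicted tracks and current detections;
    #    `associate` returns (matches, unmatched track indices, unmatched det indices).
    matches, um_tracks, um_dets = associate(tracks, dets)
    # 3) Update matched tracks with their associated detections.
    for ti, di in matches:
        tracks[ti].update(dets[di])
    # 4) Age unmatched tracks and delete those not seen for too long.
    for ti in um_tracks:
        tracks[ti].misses += 1
    tracks = [t for t in tracks if t.misses <= max_age]
    # 5) Create a new track for every unmatched detection.
    for di in um_dets:
        tracks.append(Track(dets[di], next_id))
        next_id += 1
    return tracks, next_id
```

The output of the current frame is decided immediately after this step, which is what makes the tracker online.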
3.1 Track Creation and Deletion
Similar to prior work, a new track is created when a detection in the current frame is not associated with any track. Each track maintains a counter of frames since its last successful detection association, and the track is deleted once this counter exceeds a predefined maximum age.
3.2 State Prediction and Update Using Kalman Filter
In 2D tracking, for each track we define an eight-dimensional state space containing the box center (u, v), aspect ratio a, height h, and their respective velocities in the image space. The observation is the 2D detection box and its score s. We simply set the score of the track to the score of its associated detection. In 3D tracking we use a ten-dimensional state space containing the 3D location (x, y, z), the height, width and length (h, w, l), the heading θ, and the velocities of the 3D location, in the 3D world space. The observation is the 3D detection (x, y, z, h, w, l, θ, s), where s
is the detection score. At each frame, state prediction is performed first using a constant velocity model, followed by the track-detection association. The state of each track is updated if it is associated with a detection. In a Kalman filter, the estimated 2D/3D box is essentially a weighted average between the predicted state and the observation. In our experiments we use the observation directly as the output instead of this weighted average. If a track is not associated with any detection at the current frame, only the prediction step is performed and the track does not contribute to the output of the current frame.
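The constant-velocity prediction and update can be illustrated with a scalar Kalman filter over a single state component (e.g. the box center coordinate u). This is a simplified sketch: the tracker described above runs the multivariate analogue over its full 8-D/10-D state, and the noise parameters `q` and `r` here are arbitrary placeholders.

```python
class CVKalman1D:
    """1-D constant-velocity Kalman filter. State: [position, velocity]."""
    def __init__(self, z0, q=1e-2, r=1e-1):
        self.x = [z0, 0.0]                  # initial position, zero velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.q, self.r = q, r               # process / measurement noise

    def predict(self, dt=1.0):
        """Constant-velocity motion model: x' = F x with F = [[1, dt], [0, 1]]."""
        x, v = self.x
        self.x = [x + dt * v, v]
        p00, p01 = self.P[0]; p10, p11 = self.P[1]
        # P' = F P F^T + Q (with diagonal Q)
        self.P = [[p00 + dt * (p10 + p01) + dt * dt * p11 + self.q, p01 + dt * p11],
                  [p10 + dt * p11, p11 + self.q]]

    def update(self, z):
        """Measurement update with observation model H = [1, 0]."""
        s = self.P[0][0] + self.r           # innovation covariance
        k0 = self.P[0][0] / s               # Kalman gain
        k1 = self.P[1][0] / s
        y = z - self.x[0]                   # innovation (residual)
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        p00, p01 = self.P[0]; p10, p11 = self.P[1]
        # P' = (I - K H) P
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
```

Note that, as stated above, the raw observation is used as the tracker output rather than the filtered estimate; the filter still drives the prediction used for association.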
3.3 Association Metric
The association between detections of the current frame and tracks is based on association metrics, which are usually defined via 2D/3D IoU, the Mahalanobis distance of 2D/3D object centers, or the cosine distance between appearance/Re-ID features of 2D boxes. For 3D tracking in the 3D world space, one can also transform the 3D bounding boxes to the image space and compute an association metric based on the overlap of the 2D projections, as shown in Figure 4. In the 2D tracking challenge, we adopt 2D IoU and cosine distance, while in the 3D tracking challenge we employ the Euclidean distance with a Gaussian kernel (with a per-class kernel parameter) between 3D centers, which is better and faster than other metrics we tried on the validation set such as 3D box IoU, Bird's Eye View (BEV) box IoU (i.e. ignoring the vertical dimension), and Mahalanobis distance.
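The two metrics we use can be sketched as follows. The exact Gaussian kernel parameterization is not given in the text above, so the `1 - exp(-d^2 / (2 sigma^2))` form below is an assumption.

```python
import math

def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def gaussian_center_dist(c1, c2, sigma=1.0):
    """Euclidean distance between 3-D centers passed through a Gaussian
    kernel, mapped to a [0, 1) distance: 0 for identical centers,
    approaching 1 as centers move apart."""
    d2 = sum((p - q) ** 2 for p, q in zip(c1, c2))
    return 1.0 - math.exp(-d2 / (2.0 * sigma ** 2))
```

Association costs are then 1 − IoU (2D) or the kernelized center distance (3D), with per-class thresholds deciding which pairs are admissible.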
3.4 Three-stage Data Association
Typically, association between tracks and detections is formulated as an assignment problem and solved with the Hungarian algorithm. In the tracking challenges, we developed a three-stage data association scheme that applies to both 2D and 3D tracking and improves tracking performance. We first select a primary set of detections whose scores exceed a high threshold, and a secondary set whose scores fall within a lower range.
First-stage Association. We adopt the matching cascade proposed in prior work for the first stage. An association cost matrix is first calculated between the tracks and the primary set of detections. We exclude unlikely associations whose cost exceeds a specified threshold. We start from the most recently updated track (i.e. with the smallest track age) and iterate over tracks of increasing age, solving a linear assignment problem at each step.
Second-stage Association. In the second stage, association is performed between unmatched tracks with age less than 3 and the remaining detections in the primary set. Here we use a different association metric, or relax the condition of the first-stage metric (e.g. by enlarging the 2D bounding boxes to increase their overlap over time). The association is again solved as a linear assignment problem, and only admissible associations are kept by excluding pairs whose distance exceeds a specified threshold.
Third-stage Association. In the third matching stage, association is performed between the remaining unmatched tracks and the detections in the secondary set. This helps to account for objects with weak detections (e.g. caused by partial occlusion). Admissible associations with distance lower than a specified threshold are kept.
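The three stages can be sketched as below. This is a simplified illustration: a greedy matcher stands in for the Hungarian algorithm, the stage-1 age-ordered matching cascade is collapsed into a single pass, and the thresholds `t1`-`t3`, the `SimpleTrack` fields, and the metric callbacks are hypothetical.

```python
class SimpleTrack:
    """Minimal track stub; `age` counts frames since the last association."""
    def __init__(self, pos, age=0):
        self.pos, self.age = pos, age

def greedy_assign(cost, thresh):
    """Greedy stand-in for the Hungarian algorithm: repeatedly take the
    lowest-cost admissible (row, col) pair with cost <= thresh."""
    pairs = sorted((c, i, j) for i, row in enumerate(cost)
                   for j, c in enumerate(row) if c <= thresh)
    used_i, used_j, out = set(), set(), []
    for _, i, j in pairs:
        if i not in used_i and j not in used_j:
            out.append((i, j)); used_i.add(i); used_j.add(j)
    return out

def three_stage_associate(tracks, primary, secondary, dist, relaxed_dist,
                          t1, t2, t3):
    free_t = set(range(len(tracks)))
    free_p = set(range(len(primary)))
    free_s = set(range(len(secondary)))
    matches = []

    def stage(tids, dets, free_d, m, th, tag):
        tl, dl = sorted(tids), sorted(free_d)
        cost = [[m(tracks[a], dets[b]) for b in dl] for a in tl]
        for i, j in greedy_assign(cost, th):
            matches.append((tl[i], (tag, dl[j])))
            free_t.discard(tl[i]); free_d.discard(dl[j])

    # Stage 1: every track vs. the high-score (primary) detections.
    stage(free_t, primary, free_p, dist, t1, 'primary')
    # Stage 2: recently-updated unmatched tracks (age < 3) vs. remaining
    # primary detections, with a relaxed metric (e.g. enlarged boxes).
    stage({t for t in free_t if tracks[t].age < 3}, primary, free_p,
          relaxed_dist, t2, 'primary')
    # Stage 3: remaining unmatched tracks vs. low-score (secondary) detections.
    stage(free_t, secondary, free_s, dist, t3, 'secondary')
    return matches
```

Detections left unmatched after stage 3 spawn new tracks, and tracks left unmatched age toward deletion, as described in Section 3.1.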
3.5 Re-ID Features
Our 2D tracker also relies on Re-ID features extracted by a small independent network to complement the bounding-box-overlap-based association metrics. Re-ID or appearance features help handle long-term occlusion and objects with large displacement, which can cause IoU-based association metrics to fail. Many scenarios can lead to rapid displacements of objects in the image plane: for example, low frame rate, vehicles in the opposite traffic direction with high relative speed, and unaccounted camera motion such as large camera pitch motion caused by bumps on the ground.
Following prior work, we keep a gallery of the Re-ID features of each track's past associated detections, and the smallest cosine distance between the gallery and the detection's feature is used as the distance. We also introduce a maximum appearance distance to exclude unlikely associations.
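This gallery distance can be sketched as follows; returning `None` for inadmissible pairs is an illustrative convention, not necessarily how our implementation signals them.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def gallery_distance(gallery, feat, max_dist):
    """Smallest cosine distance between a detection feature and a track's
    gallery of past Re-ID features; None if it exceeds the maximum
    appearance distance (i.e. the association is inadmissible)."""
    d = min(cosine_distance(g, feat) for g in gallery)
    return d if d <= max_dist else None
```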
| 2D Tracking Parameters | Pedestrian | Vehicle | Cyclist |
| --- | --- | --- | --- |
| max_iou_dist (front left/right) | 0.97 | 0.93 | 0.97 |

| 3D Tracking Parameters | Pedestrian | Vehicle | Cyclist |
| --- | --- | --- | --- |
4 Experiments
4.1 Dataset and Evaluation Metric
Dataset. Our tracking algorithms are evaluated on the Waymo Open Dataset v1.2. We use its training set to train the 2D detection networks and the 2D Re-ID networks, its validation set to verify ideas and tune parameters, and its test set to generate our final submissions to the leaderboards.
Evaluation Metric. The Waymo Open Dataset adopts the CLEAR multiple object tracking metrics. MOTA, the main metric, takes into account the number of misses, false positives, and mismatches. It is computed for two difficulty levels: L1 metrics are calculated only over level-1 ground truth, while L2 metrics consider both level-1 and level-2 ground truth.
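The CLEAR MOT accuracy can be written compactly; this is the standard definition, shown here for reference.

```python
def mota(num_misses, num_false_positives, num_mismatches, num_gt):
    """CLEAR MOT accuracy: 1 - (misses + false positives + mismatches)
    divided by the total number of ground-truth objects over all frames."""
    return 1.0 - (num_misses + num_false_positives + num_mismatches) / num_gt
```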
4.2 Implementation Details
2D and 3D Detections. In contrast to the original CenterNet, we use Gaussian kernels that take into account the aspect ratio of the bounding box to encode training samples. During training we use a fixed input size and a learning rate on the order of 1e-4. Due to limited computational resources and the sheer size of the dataset, we first trained a main network, initialized with COCO-pretrained weights, on a subset of the training images covering all 3 object categories (i.e. car, pedestrian, cyclist) for 25 epochs. A daytime expert model and a nighttime expert model were fine-tuned from this main network using only the daytime or nighttime training images in the subset. To handle the imbalanced training data (pedestrian and especially cyclist have significantly fewer training samples than the vehicle class), we also fine-tuned an expert model using only images containing pedestrian and cyclist samples. We then fine-tuned 4 more models for 8-10 epochs each: on the entire validation set, on the entire training set, on the images in the entire training set containing pedestrian and cyclist samples, and on the nighttime images in the entire training set. At inference we use flip and multi-scale (0.5, 0.75, 1, 1.25, 1.5) augmentation. To serve as the tracker input, the outputs of the 8 models were merged by naive NMS with the IoU overlap threshold set to 0.5. Note that in the 2D detection challenge we used weighted boxes fusion instead to merge the results.
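The naive NMS used to merge the ensemble outputs can be sketched as follows (a minimal per-class sketch; the actual merging may differ in details such as score handling across models).

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Naive NMS: walk detections in descending score order and drop any box
    whose IoU with an already-kept box exceeds iou_thresh. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```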
As input to our 3D tracker, we rely on the 3D detections produced by our solution in the 3D detection challenge. Details of this solution can be found in our technical report for that challenge.
Re-ID Network. We use an independent Re-ID network with 11 convolutional layers, max-pooling and average-pooling layers, and a downsampling factor of 16. Input images are resized to fixed resolutions, one for pedestrian and another for car/cyclist. The network was trained from scratch as a classification network by adding a fully-connected layer; we prepared a total of 2844, 20041, and 906 unique objects for pedestrian, car, and cyclist respectively from a subset of the Waymo 2D training images. The classification layer is removed during inference and the 512-dimensional feature embedding serves as the Re-ID feature.
2D Tracking. Cosine distance between Re-ID features is used in the first-stage matching; 2D IoU distance is used in the second- and third-stage matching. We double or triple the size of the bounding boxes in the second and third stage respectively when calculating the IoU overlap, to account for objects with large displacement. Table 1 summarizes all the parameters used in our 2D tracking experiments. Note that we use different IoU matching thresholds for the front, front left/right, and side cameras. We allow a larger IoU distance (i.e. smaller overlap) in admissible associations for the front left/right and side cameras, since the displacement of some objects (especially pedestrians) tends to be very large there. We assign the score of the associated detection to the track as its score.
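The box enlargement used in the later stages can be sketched as below. Whether the report scales side lengths or area is not stated, so per-side scaling about the box center is an assumption.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def enlarge(box, factor):
    """Scale a (x1, y1, x2, y2) box about its center: each side length is
    multiplied by `factor` (assumed interpretation of 'double/triple the size')."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    hw = (box[2] - box[0]) / 2 * factor
    hh = (box[3] - box[1]) / 2 * factor
    return (cx - hw, cy - hh, cx + hw, cy + hh)
```

Two boxes of a fast-moving object that no longer overlap between frames can regain a positive IoU after enlargement, keeping the association admissible.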
3D Tracking. Euclidean distance with a Gaussian kernel between 3D centers is used throughout the three association stages. We use a different kernel parameter value for each class, as shown in Table 2. We also assign the score of the associated detection to the track as its score.
| Method | MOTA/L1 | MOTA/L2 |
| --- | --- | --- |
| + Third-stage association | 46.10 | 39.68 |
| + Re-ID models | 48.79 | 42.11 |
4.3 Results
As shown in Table 3 and Table 4, our tracking algorithms reached 1st place on the official Waymo Open Dataset 2D and 3D tracking leaderboards, achieving the highest MOTA/L2 scores. In particular, our trackers attain the lowest miss rates. Qualitative results are shown in Figure 5 and Figure 6.
4.4 Ablation Study
On the 2D tracking validation set with 202 sequences, we study the effect of introducing the 3rd-stage association and using the Re-ID models. Our baseline is produced without these two components. As shown in Table 5, the 3rd-stage association improves MOTA/L2 over the baseline, and the Re-ID models further improve it from 39.68 to 42.11.
5 Conclusion
An accurate, online, and unified 2D and 3D tracking framework is proposed, which achieved 1st place in the Waymo Open Dataset 2D and 3D tracking challenges. In the future, we will continue our ongoing work on the joint detection and tracking paradigm discussed above.
-  Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008.
-  Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. 2016 IEEE International Conference on Image Processing (ICIP), Sep 2016.
-  Erik Bochinski, Volker Eiselein, and Thomas Sikora. High-speed tracking-by-detection without using image information. In International Workshop on Traffic and Street Surveillance for Safety and Security at IEEE AVSS 2017, Lecce, Italy, Aug. 2017.
-  Hsu-kuang Chiu, Antonio Prioletti, Jie Li, and Jeannette Bohg. Probabilistic 3d multi-object tracking for autonomous driving. arXiv preprint arXiv:2001.05673, 2020.
-  Runzhou Ge, Zhuangzhuang Ding, Yihan Hu, Yu Wang, Sijia Chen, Li Huang, and Yuan Li. Afdet: Anchor free one stage 3d object detection. In CVPR Workshops, 2020.
-  Zhichao Lu, Vivek Rathod, Ronny Votel, and Jonathan Huang. Retinatrack: Online single stage joint detection and tracking. arXiv preprint arXiv:2003.13870, 2020.
-  Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. arXiv preprint arXiv:1912.04838, 2019.
-  Waymo. Waymo 2d tracking leaderboard. https://waymo.com/open/challenges/2d-tracking/, 2020.
-  Waymo. Waymo 3d tracking leaderboard. https://waymo.com/open/challenges/3d-tracking/, 2020.
-  Waymo. Waymo challenges. https://waymo.com/open/challenges/, 2020.
-  Xinshuo Weng and Kris Kitani. A baseline for 3d multi-object tracking. arXiv preprint arXiv:1907.03961, 2019.
-  Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE, 2017.
-  Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. A simple baseline for multi-object tracking. arXiv preprint arXiv:2004.01888, 2020.
-  Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. arXiv preprint arXiv:2004.01177, 2020.
-  Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
-  Zili Liu, Tu Zheng, Guodong Xu, Zheng Yang, Haifeng Liu, and Deng Cai. Training-time-friendly network for real-time object detection. arXiv preprint arXiv:1909.00700, 2019.