ByteTrack: Multi-Object Tracking by Associating Every Detection Box
Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects in videos. Most methods obtain identities by associating detection boxes whose scores are higher than a threshold. Objects with low detection scores, e.g. occluded objects, are simply thrown away, which causes a non-negligible number of missed true objects and fragmented trajectories. To solve this problem, we present a simple, effective and generic association method, called BYTE, tracking BY associaTing Every detection box instead of only the high score ones. For the low score detection boxes, we utilize their similarities with tracklets to recover true objects and filter out the background detections. We apply BYTE to 9 different state-of-the-art trackers and achieve consistent improvement on IDF1 score ranging from 1 to 10 points. To advance the state-of-the-art performance of MOT, we design a simple and strong tracker, named ByteTrack. For the first time, we achieve 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on the test set of MOT17 with 30 FPS running speed on a single V100 GPU. The source code, pre-trained models with deploy versions and tutorials of applying to other trackers are released at https://github.com/ifzhang/ByteTrack.
What is reasonable is real; and what is real is reasonable.
—— G. W. F. Hegel
Tracking by detection is currently the most effective paradigm for multi-object tracking (MOT). Due to the complex scenarios in video, detectors are prone to make imperfect predictions. State-of-the-art MOT methods [berclaz2011multiple, dicle2013way, milan2013continuous, bae2014robust, xiang2015learning, bewley2016simple, wojke2017simple, chen2018real, bergmann2019tracking, zhang2020fairmot, sun2020transtrack] need to deal with the true positive / false positive trade-off in detection boxes and therefore eliminate low confidence detection boxes [bernardin2008evaluating, luiten2021hota]. However, is it right to eliminate all low confidence detection boxes? Our answer is NO: as Hegel said “What is reasonable is real; that which is real is reasonable.” Low confidence detection boxes sometimes indicate the existence of objects, e.g. occluded objects. Filtering out these objects causes irreversible errors for MOT and leads to non-negligible missed detections and fragmented trajectories.
Figure 2 (a) and (b) show this problem. In frame $t_1$, we initialize three different tracklets as their scores are all higher than 0.5. However, in frame $t_2$ and frame $t_3$, when occlusion happens, the detection score corresponding to the red tracklet becomes lower, i.e. from 0.8 to 0.4 and then from 0.4 to 0.1. These detection boxes are eliminated by the thresholding mechanism and the red tracklet disappears accordingly. Nevertheless, if we take every detection box into consideration, more false positives are introduced immediately, e.g. the rightmost box in frame $t_3$ of Figure 2 (a). To the best of our knowledge, very few methods [khurana2020detecting, tokmakov2021learning] in MOT are able to handle this detection dilemma.
In this paper, we identify that the similarity with tracklets provides a strong cue to distinguish the objects and background in low score detection boxes. As shown in Figure 2 (c), two low score detection boxes are matched to the tracklets by the motion model’s predicted boxes, and thus the objects are correctly recovered. At the same time, the background box is removed since it has no matched tracklet.
To make full use of detection boxes from high scores to low ones in the matching process, we present a simple and effective association method, BYTE. It is so named because each detection box is a basic unit of the tracklet, like a byte in a computer program, and our tracking method values every detection box. We first match the high score detection boxes to the tracklets based on motion similarity. Similar to [bewley2016simple], we use a Kalman Filter [kalman1960new] to predict the location of the tracklets in the new frame. The motion similarity can be computed as the IoU of the predicted box and the detection box. Figure 2 (b) shows exactly the results after the first matching. Then, we perform the second matching between the unmatched tracklets, i.e. the tracklet in the red box, and the low score detection boxes. Figure 2 (c) shows the results after the second matching. The occluded person with low detection scores is matched correctly to the previous tracklet and the background is removed.
To evaluate the generalization ability of our proposed association method, we apply it to 9 different state-of-the-art trackers, including the Re-ID-based ones [wang2020towards, zhang2020fairmot, liang2020rethinking, pang2021quasi], motion-based ones [zhou2020tracking, wu2021track, peng2020chained], chain-based one [peng2020chained] and attention-based ones [sun2020transtrack, zeng2021motr]. We achieve notable improvements on almost all the metrics including MOTA, IDF1 score and ID switches. For example, we increase the MOTA of CenterTrack [zhou2020tracking] from 66.1 to 67.4, IDF1 from 64.2 to 74.0 and decrease the IDs from 528 to 144 on the half validation set of MOT17 [zhou2020tracking].
To advance the state-of-the-art performance of MOT, we propose a simple and strong tracker, named ByteTrack. We adopt a recent high-performance detector, YOLOX [ge2021yolox], to obtain the detection boxes and associate them with our proposed BYTE. On the MOT challenges, ByteTrack ranks 1st on both MOT17 [milan2016mot16] and MOT20 [dendorfer2020mot20], achieving 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA with 30 FPS running speed on a V100 GPU on MOT17, and 77.8 MOTA, 75.2 IDF1 and 61.3 HOTA on the crowded MOT20.
Our proposed method is the first work that achieves highly competitive tracking performance with an extremely simple motion model, without any Re-ID module or attention mechanism [zhang2020fairmot, liang2020rethinking, pang2021quasi, wang2020joint, zeng2021motr, sun2020transtrack]. It sheds light on the great potential of motion cues for handling occlusion and long-range association. We hope the efficiency and simplicity of ByteTrack can make it attractive in real applications.
Object detection and data association are two key components of multi-object tracking. Detection estimates the bounding boxes and association obtains the identities.
Object detection is one of the most active topics in computer vision and it is the basis of multi-object tracking. The MOT17 dataset[milan2016mot16] provides detection results obtained by popular detectors such as DPM [felzenszwalb2008discriminatively], Faster R-CNN [ren2015faster] and SDP [yang2016exploit]. A large number of methods [xu2019spatial, chu2019famnet, bergmann2019tracking, chen2018real, zhu2018online, braso2020learning, hornakova2020lifted] focus on improving the tracking performance based on these given detection results. The association ability of these methods can be fairly compared.
Tracking by detection. With the rapid development of object detection [ren2015faster, he2017mask, redmon2018yolov3, lin2017focal, cai2018cascade, fu2020model, sun2021sparse, peize2020onenet], more and more methods use more powerful detectors to obtain higher tracking performance. The one-stage object detector RetinaNet [lin2017focal] has been adopted by several methods such as [lu2020retinatrack, peng2020chained]. CenterNet [zhou2019objects] is the most popular detector, used by most methods [zhou2020tracking, zhang2020fairmot, wu2021track, zheng2021improving, wang2020joint, tokmakov2021learning, wang2021multiple] for its simplicity and efficiency. The YOLO series detectors [redmon2018yolov3, bochkovskiy2020yolov4] are also used by a large number of methods [wang2020towards, liang2020rethinking, liang2021one, chu2021transmot] for their excellent balance of accuracy and speed. Most of these methods directly use the detection boxes on a single image for tracking.
However, the number of missing detections and very low scoring detections begins to increase when occlusion or motion blur happens in the video sequence, as pointed out by video object detection methods [tang2019object, luo2019detect]. Therefore, information from the previous frames is usually leveraged to enhance the video detection performance.
Detection by tracking. Tracking can also be adopted to help obtain more accurate detection boxes. Some methods [sanchez2016online, zhu2018online, chu2019famnet, chu2019online, chu2021transmot, chen2018real] use single object tracking (SOT) [bertinetto2016fully] or a Kalman Filter [kalman1960new] to predict the location of the tracklets in the following frame and fuse the predicted boxes with the detection boxes to enhance the detection results. Other methods [zhang2018integrated, liang2021one] use tracked boxes in the previous frames to enhance the feature representation of the following frame. Recently, Transformer-based [vaswani2017attention, dosovitskiy2020vit, wang2021pvt, liu2021swin] detectors [carion2020end, zhu2020deformable] have been used by several methods [sun2020transtrack, meinhardt2021trackformer, zeng2021motr] for their strong ability to propagate boxes between frames. Our method also utilizes the similarity with tracklets to strengthen the reliability of detection boxes.
After obtaining the detection boxes from various detectors, most MOT methods [wang2020towards, zhang2020fairmot, pang2021quasi, lu2020retinatrack, liang2020rethinking, wu2021track, sun2020transtrack] only keep the detection boxes whose scores exceed a threshold, e.g. 0.5, and use those boxes as the input of data association. This is because the low score detection boxes contain many background regions, which harm the tracking performance. However, we observe that many occluded objects can be correctly detected but have low scores. To reduce missing detections and keep the persistence of trajectories, we keep all the detection boxes and associate across every one of them.
Data association is the core of multi-object tracking, which first computes the similarity between tracklets and detection boxes and then matches them according to the similarity.
Similarity metrics. Location, motion and appearance are useful cues for association. SORT [bewley2016simple] combines location and motion cues in a very simple way. It first uses a Kalman Filter [kalman1960new] to predict the location of the tracklets in the new frame and then computes the IoU between the detection boxes and the predicted boxes as the similarity. Some recent methods [zhou2020tracking, sun2020transtrack, wu2021track] design networks to learn object motions and achieve more robust results in cases of large camera motion or low frame rate. Location and motion similarity are accurate in short-range matching. Appearance similarity is helpful in long-range matching: an object can be re-identified using appearance similarity after being occluded for a long period of time. Appearance similarity can be measured by the cosine similarity of the Re-ID features. DeepSORT [wojke2017simple] adopts a stand-alone Re-ID model to extract appearance features from the detection boxes. Recently, joint detection and Re-ID models [wang2020towards, zhang2020fairmot, liang2020rethinking, lu2020retinatrack, zhang2021voxeltrack, pang2021quasi] have become more and more popular because of their simplicity and efficiency.
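As a small illustration of the appearance cue mentioned above, the following sketch computes the cosine similarity between two Re-ID feature vectors. The function name and the convention of returning 0.0 for an all-zero vector are assumptions made for this example; they are not taken from any of the cited trackers.

```python
import math


def cosine_similarity(feat_a, feat_b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(x * x for x in feat_a))
    norm_b = math.sqrt(sum(y * y for y in feat_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention for degenerate embeddings (not from the paper).
        return 0.0
    return dot / (norm_a * norm_b)
```

A value close to 1 suggests the two boxes show the same identity, while a value close to 0 suggests unrelated appearance.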
Matching strategy. After similarity computation, the matching strategy assigns identities to the objects. This can be done by the Hungarian Algorithm [kuhn1955hungarian] or greedy assignment [zhou2020tracking]. SORT [bewley2016simple] matches the detection boxes to the tracklets in a single matching step. DeepSORT [wojke2017simple] proposes a cascaded matching strategy which first matches the detection boxes to the most recent tracklets and then to the lost ones. MOTDT [chen2018real] first uses appearance similarity to match and then uses the IoU similarity to match the unmatched tracklets. QuasiDense [pang2021quasi] turns the appearance similarity into a probability by a bi-directional softmax operation and uses a nearest neighbor search to accomplish matching. The attention mechanism [vaswani2017attention] can directly propagate boxes between frames and perform association implicitly. Recent methods such as [meinhardt2021trackformer, zeng2021motr] propose track queries to find the location of the tracked objects in the following frames. The matching is implicitly performed in the attention interaction process.
All these methods focus on how to design better association methods. However, we argue that the detection boxes determine the upper bound of data association, and we focus on how to make use of detection boxes from high scores to low ones in the matching process.
We propose a simple, effective and generic data association method, BYTE. Different from previous methods [wang2020towards, zhang2020fairmot, liang2020rethinking, pang2021quasi], which only keep the high score detection boxes, we keep every detection box and separate them into high score ones and low score ones. We first associate the high score detection boxes to the tracklets. Some tracklets remain unmatched because they do not match any appropriate high score detection box, which usually happens when occlusion, motion blur or size change occurs. We then associate the low score detection boxes with these unmatched tracklets to recover the objects in low score detection boxes and filter out background simultaneously. The pseudo-code of BYTE is shown in Algorithm 1.
The input of BYTE is a video sequence V, along with an object detector Det and the Kalman Filter KF. We also set three thresholds $\tau_{high}$, $\tau_{low}$ and $\epsilon$. $\tau_{high}$ and $\tau_{low}$ are the detection score thresholds and $\epsilon$ is the tracking score threshold. The output of BYTE is the tracks $\mathcal{T}$ of the video, and each track contains the bounding box and identity of the object in each frame.
For each frame in the video, we predict the detection boxes and scores using the detector Det. We separate all the detection boxes into two parts $\mathcal{D}_{high}$ and $\mathcal{D}_{low}$ according to the detection score thresholds $\tau_{high}$ and $\tau_{low}$. For the detection boxes whose scores are higher than $\tau_{high}$, we put them into the high score detection boxes $\mathcal{D}_{high}$. For those whose scores range from $\tau_{low}$ to $\tau_{high}$, we put them into the low score detection boxes $\mathcal{D}_{low}$ (line 3 to 13 in Algorithm 1).
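The separation step can be sketched as follows. This is only a minimal illustration: the names `split_detections`, `tau_high` and `tau_low` are chosen here for the high and low detection score thresholds, the default values 0.6 and 0.1 are the defaults reported in the implementation details, and the handling of scores exactly equal to a threshold is an assumption.

```python
def split_detections(detections, tau_high=0.6, tau_low=0.1):
    """Split (box, score) pairs into high score and low score groups.

    Boxes scoring above tau_high go to the high group, boxes scoring in
    (tau_low, tau_high] go to the low group, and the rest are discarded.
    """
    d_high, d_low = [], []
    for box, score in detections:
        if score > tau_high:
            d_high.append((box, score))
        elif score > tau_low:
            d_low.append((box, score))
        # Scores at or below tau_low are dropped (assumed boundary handling).
    return d_high, d_low
```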
After separating the low score detection boxes and the high score detection boxes, we use the Kalman Filter KF to predict the new location of each track in $\mathcal{T}$ (line 14 to 16 in Algorithm 1).
The first association is performed between the high score detection boxes $\mathcal{D}_{high}$ and all the tracks $\mathcal{T}$ (including the lost tracks $\mathcal{T}_{lost}$). The similarity is computed as the IoU between the detection boxes $\mathcal{D}_{high}$ and the predicted boxes of the tracks $\mathcal{T}$. Then, we use the Hungarian Algorithm [kuhn1955hungarian] to finish the matching based on the similarity. In particular, if the IoU between the detection box and the tracklet box is smaller than 0.2, we reject the matching. We keep the unmatched detections in $\mathcal{D}_{remain}$ and the unmatched tracks in $\mathcal{T}_{remain}$ (line 17 to 19 in Algorithm 1).
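The IoU-based matching described above can be sketched with the standard library alone. Note that this example swaps in greedy assignment (highest IoU first) in place of the Hungarian Algorithm to stay dependency-free, so it is not the exact matching used in the paper; the 0.2 rejection threshold is taken from the text, while the function names and the (x1, y1, x2, y2) box layout are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(track_boxes, det_boxes, min_iou=0.2):
    """Greedy IoU matching (a stand-in for the Hungarian Algorithm).

    Returns (matches, unmatched_track_indices, unmatched_det_indices),
    where matches is a list of (track_index, det_index) pairs.
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(det_boxes)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break  # every remaining pair overlaps too little: reject
        if ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    unmatched_t = [i for i in range(len(track_boxes)) if i not in used_t]
    unmatched_d = [i for i in range(len(det_boxes)) if i not in used_d]
    return matches, unmatched_t, unmatched_d
```

Greedy assignment can differ from the optimal assignment when several boxes overlap heavily; an optimal solver is the safer choice in crowded scenes.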
BYTE is highly flexible and compatible with other association methods. For example, when BYTE is combined with DeepSORT [wojke2017simple], the Re-ID feature is added into the first association in Algorithm 1, while the other steps remain the same. In the experiments, we apply BYTE to 9 different state-of-the-art trackers and achieve notable improvements on almost all the metrics.
The second association is performed between the low score detection boxes $\mathcal{D}_{low}$ and the remaining tracks $\mathcal{T}_{remain}$ after the first association. We keep the unmatched tracks in $\mathcal{T}_{re-remain}$ and simply delete all the unmatched low score detection boxes, since we view them as background (line 20 to 21 in Algorithm 1).
We find it important to use IoU as the similarity in the second association, because the low score detection boxes usually suffer from severe occlusion or motion blur, so appearance features are not reliable. Thus, when applying BYTE to other Re-ID based trackers [wang2020towards, zhang2020fairmot, pang2021quasi], we do not use appearance similarity in the second association.
After the association, the unmatched tracks will be deleted from the tracklets. We do not list the procedure of track rebirth [wojke2017simple, chen2018real, zhou2020tracking] in Algorithm 1 for simplicity. In fact, preserving the identity of the tracks is necessary for long-range association. For the unmatched tracks $\mathcal{T}_{re-remain}$ after the second association, we put them into $\mathcal{T}_{lost}$. For each track in $\mathcal{T}_{lost}$, only when it has been lost for more than a certain number of frames, i.e. 30, do we delete it from the tracks $\mathcal{T}$. Otherwise, we keep the lost tracks $\mathcal{T}_{lost}$ in $\mathcal{T}$ (line 22 in Algorithm 1).
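The bookkeeping for lost tracks can be sketched as below, assuming each lost track carries a counter of consecutive frames without a match. The function name, the dictionary layout and the reset-on-match behaviour for track rebirth are assumptions made for illustration; only the 30-frame limit comes from the text.

```python
def update_lost(lost_age, unmatched_ids, matched_ids, max_lost=30):
    """Update per-track lost counters in place and return expired track ids.

    lost_age maps a track id to the number of consecutive frames it has
    been lost. A track is deleted once it has been lost for more than
    max_lost frames; a matched track is "reborn" and its counter removed.
    """
    for tid in matched_ids:
        lost_age.pop(tid, None)
    for tid in unmatched_ids:
        lost_age[tid] = lost_age.get(tid, 0) + 1
    expired = [tid for tid, age in lost_age.items() if age > max_lost]
    for tid in expired:
        del lost_age[tid]
    return expired
```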
Finally, we initialize new tracks from the unmatched high score detection boxes $\mathcal{D}_{remain}$ after the first association. For each detection box in $\mathcal{D}_{remain}$, if its detection score is higher than $\epsilon$ and the box exists for two consecutive frames, we initialize a new track (line 23 to 27 in Algorithm 1).
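One possible way to realize the two-consecutive-frames rule is sketched below. It assumes that a candidate box is confirmed when an unmatched high score box in the next frame overlaps it; this overlap test, the 0.2 IoU value reused for it, and all names are assumptions for illustration. The default of 0.7 for the initialization score is the value given in the implementation details.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def init_tracks(unmatched_dets, pending, next_id, eps=0.7, min_iou=0.2):
    """Create new tracks from unmatched high score detections.

    unmatched_dets: (box, score) pairs left over after the first association.
    pending: candidate boxes that were seen in the previous frame only.
    Returns (new_tracks, new_pending, next_id).
    """
    new_tracks, new_pending = {}, []
    for box, score in unmatched_dets:
        if score <= eps:
            continue  # below the track initialization score
        if any(iou(box, p) >= min_iou for p in pending):
            new_tracks[next_id] = box  # seen in two consecutive frames
            next_id += 1
        else:
            new_pending.append(box)  # wait for confirmation in the next frame
    return new_tracks, new_pending, next_id
```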
The output of each individual frame is the bounding boxes and identities of the tracks $\mathcal{T}$ in the current frame. Note that we do not output the boxes and identities of $\mathcal{T}_{lost}$.
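Putting the steps together, a single BYTE frame update can be sketched as follows. This is a simplified illustration under several assumptions: track boxes are taken as already predicted by the motion model, greedy IoU matching replaces the Hungarian Algorithm, the same 0.2 IoU threshold is used for both associations, and the function and variable names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(track_boxes, det_boxes, min_iou):
    """Greedy IoU matching (stand-in for the Hungarian Algorithm)."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(det_boxes)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    unmatched_t = [i for i in range(len(track_boxes)) if i not in used_t]
    unmatched_d = [i for i in range(len(det_boxes)) if i not in used_d]
    return matches, unmatched_t, unmatched_d


def byte_step(tracks, detections, tau_high=0.6, tau_low=0.1, min_iou=0.2):
    """One frame of BYTE association.

    tracks: dict of track id -> predicted box for the current frame.
    detections: list of (box, score) pairs from the detector.
    Returns (assigned, lost_ids, new_candidates).
    """
    d_high = [d for d in detections if d[1] > tau_high]
    d_low = [d for d in detections if tau_low < d[1] <= tau_high]
    ids = list(tracks)

    # First association: all tracks vs. high score boxes.
    m1, ut1, ud1 = associate([tracks[i] for i in ids], [d[0] for d in d_high], min_iou)
    assigned = {ids[ti]: d_high[di][0] for ti, di in m1}
    remain_ids = [ids[ti] for ti in ut1]

    # Second association: remaining tracks vs. low score boxes (IoU only).
    m2, ut2, _ = associate([tracks[i] for i in remain_ids], [d[0] for d in d_low], min_iou)
    for ti, di in m2:
        assigned[remain_ids[ti]] = d_low[di][0]

    lost_ids = [remain_ids[ti] for ti in ut2]      # still unmatched tracks
    new_candidates = [d_high[di] for di in ud1]    # unmatched high score boxes
    return assigned, lost_ids, new_candidates      # unmatched low boxes are dropped
```

In the scenario of Figure 2, the occluded person would appear in the low score group and be recovered by the second association, while a low score background box would match no track and be dropped.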
To advance the state-of-the-art performance of MOT, we design a simple and strong tracker, named ByteTrack, by equipping the high-performance detector YOLOX [ge2021yolox] with our association method BYTE.
YOLOX switches the YOLO series detectors [redmon2018yolov3, bochkovskiy2020yolov4] to an anchor-free manner and adopts other advanced detection techniques, including decoupled heads, strong data augmentations such as Mosaic [bochkovskiy2020yolov4] and Mixup [zhang2017mixup], and the effective label assignment strategy SimOTA [ge2021ota], to achieve state-of-the-art performance on object detection.
The backbone network is the same as YOLOv5 [yolov5] which adopts an advanced CSPNet [wang2020cspnet] backbone and an additional PAN [liu2018path] head. There are two decoupled heads after the backbone network, one for regression and the other for classification. An additional IoU-aware branch is added to the regression head to predict the IoU between the predicted box and the ground truth box. The regression head directly predicts four values in each location in the feature map, i.e., two offsets in terms of the left-top corner of the grid, and the height and width of the predicted box. The regression head is supervised by GIoU loss [rezatofighi2019generalized] and the classification and IoU heads are supervised by the binary cross entropy loss.
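For reference, the GIoU quantity behind the regression loss can be sketched as below, using the standard definition GIoU = IoU - (|C| - |A ∪ B|) / |C|, where C is the smallest box enclosing both A and B, and loss = 1 - GIoU. The function names and the (x1, y1, x2, y2) layout are assumptions, boxes are assumed to have positive area, and this is not the YOLOX implementation.

```python
def giou(a, b):
    """Generalized IoU of two positive-area boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def giou_loss(pred, target):
    """GIoU loss: 0 for a perfect match, approaching 2 for distant boxes."""
    return 1.0 - giou(pred, target)
```

Unlike plain IoU, the value stays informative for non-overlapping boxes, because the enclosing-box term grows with the distance between them.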
The SimOTA label assignment strategy automatically selects positive samples according to their cost to the ground truth annotations. The cost is computed as a weighted sum of the classification cost and the box location cost [zhu2020autoassign, carion2020end, peize2020onenet]. Then, it selects a dynamic number of top-k positive samples from a fixed-size area around the object center according to their cost. This advanced label assignment strategy notably increases the detection performance.
We note MOT17 [milan2016mot16] requires the bounding boxes [zhou2020tracking] covering the whole body, even though the object is occluded or partly out of the image. However, the default implementation of YOLOX clips the detection boxes inside the image area. To avoid the wrong detection results around the image boundary, we modify YOLOX in terms of data pre-processing and label assignment. We do not clip the bounding boxes inside the image during the data pre-processing and data augmentation procedure. We only delete the boxes which are fully outside the image after data augmentation. In the SimOTA label assignment strategy, the positive samples need to be around the center of the object, while the center of the whole body boxes may lie out of the image, so we clip the center of the object inside the image.
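The two boundary rules described above (drop only boxes that lie fully outside the image, and clip the object center rather than the box) can be sketched as follows. The function names and the choice of clipping the center to the closed range [0, width] x [0, height] are assumptions; the actual modification to YOLOX may differ in detail.

```python
def is_fully_outside(box, width, height):
    """True if an (x1, y1, x2, y2) box has no overlap with the image area."""
    x1, y1, x2, y2 = box
    return x2 <= 0 or y2 <= 0 or x1 >= width or y1 >= height


def clipped_center(box, width, height):
    """Center of the box, clipped into the image; the box itself is untouched."""
    x1, y1, x2, y2 = box
    cx = min(max((x1 + x2) / 2.0, 0.0), float(width))
    cy = min(max((y1 + y2) / 2.0, 0.0), float(height))
    return cx, cy


def filter_boxes(boxes, width, height):
    """Keep every box that is at least partly inside the image, unclipped."""
    return [b for b in boxes if not is_fully_outside(b, width, height)]
```

For a whole-body box such as (-300, 100, 100, 500) in a 1440 x 800 image, the box is kept at full size while its center (-100, 300) is moved to (0, 300) for label assignment.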
MOT20 [dendorfer2020mot20] clips the bounding box annotations inside the image, and thus we simply use the original YOLOX.
Datasets. We evaluate BYTE and ByteTrack on MOT17 [milan2016mot16] and MOT20 [dendorfer2020mot20] datasets under the “private detection” protocol. Both datasets contain training sets and test sets, without validation sets. For ablation studies, we use the first half of each video in the training set of MOT17 for training and the last half for validation following [zhou2020tracking]. We train on the combination of CrowdHuman dataset [shao2018crowdhuman] and MOT17 half training set following [zhou2020tracking, sun2020transtrack, zeng2021motr, wu2021track]. We add Cityperson [zhang2017citypersons] and ETHZ [ess2008mobile] for training following [wang2020towards, zhang2020fairmot, liang2020rethinking] when testing on the test set of MOT17.
Metrics. We use the CLEAR metrics [bernardin2008evaluating], including MOTA, FP, FN, IDs, etc., together with IDF1 [ristani2016performance] and HOTA [luiten2021hota], to evaluate different aspects of the tracking performance. MOTA is computed based on FP, FN and IDs. Since the numbers of FP and FN are larger than that of IDs, MOTA focuses more on the detection performance. IDF1 evaluates the identity preservation ability and focuses more on the association performance. HOTA is a very recently proposed metric which explicitly balances the effect of performing accurate detection, association and localization.
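As a reminder of how MOTA combines the three error counts, the sketch below uses the standard CLEAR MOT definition MOTA = 1 - (FP + FN + IDs) / GT, where GT is the total number of ground truth boxes. The function name and the numbers in the usage example are made up for illustration and are not results from the paper.

```python
def mota(fp, fn, ids, num_gt):
    """MOTA from false positives, false negatives, identity switches and GT count."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fp + fn + ids) / num_gt


# Illustrative numbers only: FP and FN dominate the error term,
# so halving the identity switches barely changes the score.
baseline = mota(fp=2000, fn=15000, ids=400, num_gt=50000)
fewer_switches = mota(fp=2000, fn=15000, ids=200, num_gt=50000)
```

Because identity switches are typically far fewer than FP and FN, a large relative drop in IDs moves MOTA only slightly, which is why IDF1 is reported alongside it.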
Implementation details. For BYTE, the default high detection score threshold is 0.6, the low threshold is 0.1 and the trajectory initialization score is 0.7, unless otherwise specified. In the linear assignment step, if the IoU between the detection box and the tracklet box is smaller than 0.2, the matching is rejected. Lost tracklets are kept for 30 frames in case they appear again.
For ByteTrack, the detector is YOLOX [ge2021yolox] with YOLOX-X as the backbone and a COCO-pretrained model [lin2014microsoft] as the initialized weights. The training schedule is 80 epochs on the combination of MOT17, CrowdHuman, Cityperson and ETHZ. The input image size is 1440 × 800 and the shortest side ranges from 576 to 1024 during multi-scale training. The data augmentation includes Mosaic [bochkovskiy2020yolov4] and Mixup [zhang2017mixup]. The model is trained on 8 NVIDIA Tesla V100 GPUs with a batch size of 48. The optimizer is SGD with weight decay of $5 \times 10^{-4}$ and momentum of 0.9. The initial learning rate is $10^{-3}$ with 1 epoch of warm-up and a cosine annealing schedule. The total training time is about 12 hours. Following [ge2021yolox], FPS is measured with FP16 precision [micikevicius2017mixed] and a batch size of 1 on a single GPU.
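The shape of a warm-up plus cosine annealing schedule over 80 epochs can be sketched as below. This is a generic illustration, not the YOLOX scheduler: the linear form of the warm-up, the final learning rate of zero, the per-epoch (rather than per-iteration) granularity and the function name are all assumptions, and `base_lr` must be supplied by the caller.

```python
import math


def learning_rate(epoch, base_lr, total_epochs=80, warmup_epochs=1, min_lr=0.0):
    """Learning rate at a (possibly fractional) epoch.

    Linear warm-up from 0 to base_lr over warmup_epochs, then cosine
    annealing from base_lr down to min_lr at total_epochs.
    """
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example with an illustrative base learning rate.
schedule = [learning_rate(e, base_lr=1e-3) for e in range(81)]
```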
Comparisons with other association methods. We compare BYTE with other popular association methods including SORT [bewley2016simple], DeepSORT [wojke2017simple] and MOTDT [chen2018real]. The results are shown in Table 1.
SORT can be seen as our baseline method because both methods only use the Kalman Filter to predict the object motion. We can see that BYTE improves the MOTA of SORT from 74.6 to 76.6 and IDF1 from 76.9 to 79.3, and decreases IDs from 291 to 159. This highlights the importance of the low score detection boxes and demonstrates the ability of BYTE to recover object boxes from low score ones.
DeepSORT uses additional Re-ID models to enhance the long-range association. Surprisingly, we find that BYTE also achieves additional gains compared with DeepSORT. This suggests that a simple Kalman Filter can perform long-range association and achieve better IDF1 and IDs when the detection boxes are accurate enough. We note that in severe occlusion cases, Re-ID features are vulnerable and may lead to more identity switches, whereas the motion model behaves more reliably.
MOTDT integrates motion-guided box propagation results and detection results to associate unreliable detection results with tracklets. Although it shares a similar motivation, MOTDT is behind BYTE by a large margin. Our explanation is that MOTDT uses propagated boxes as tracklet boxes, which may lead to localization drift in tracking. In contrast, BYTE uses low score detection boxes to re-associate the unmatched tracklets, so the tracklet boxes are more accurate.
Robustness to detection score threshold. The detection score threshold $\tau_{high}$ is a sensitive hyper-parameter and needs to be carefully tuned in the task of multi-object tracking. We change it from 0.2 to 0.8 and compare the MOTA and IDF1 scores of BYTE and SORT. The results are shown in Fig. 3. From the results we can see that BYTE is more robust to the detection score threshold than SORT. This is because the second association in BYTE recovers the objects whose scores are lower than $\tau_{high}$, and thus considers every detection box regardless of the change of $\tau_{high}$.
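Table 2: Results of applying BYTE to 9 different state-of-the-art trackers on the MOT17 validation set. "K" is short for Kalman Filter. Gains over the original tracker are given in parentheses.

| Method | Similarity | w/ BYTE | MOTA | IDF1 | FP | FN | IDs |
|---|---|---|---|---|---|---|---|
| JDE [wang2020towards] | Motion(K) + Re-ID | no | 60.0 | 63.6 | 2923 | 18158 | 473 |
| JDE | Motion(K) + Re-ID | yes | 60.3 (+0.3) | 64.1 (+0.5) | 3065 | 17912 | 418 |
| JDE | Motion(K) | yes | 60.6 (+0.6) | 66.0 (+2.4) | 3082 | 17771 | 360 |
| CSTrack [liang2020rethinking] | Motion(K) + Re-ID | no | 68.0 | 72.3 | 1846 | 15075 | 325 |
| CSTrack | Motion(K) + Re-ID | yes | 69.2 (+1.2) | 73.9 (+1.6) | 2160 | 14128 | 285 |
| CSTrack | Motion(K) | yes | 69.3 (+1.3) | 71.7 (-0.6) | 2202 | 14068 | 279 |
| FairMOT [zhang2020fairmot] | Motion(K) + Re-ID | no | 69.1 | 72.8 | 1976 | 14443 | 299 |
| FairMOT | Motion(K) + Re-ID | yes | 70.4 (+1.3) | 74.2 (+1.4) | 2288 | 13470 | 232 |
| FairMOT | Motion(K) | yes | 70.3 (+1.2) | 73.2 (+0.4) | 2189 | 13625 | 236 |
| TraDes [wu2021track] | Motion + Re-ID | no | 68.2 | 71.7 | 1913 | 14962 | 285 |
| TraDes | Motion + Re-ID | yes | 68.6 (+0.4) | 71.1 (-0.6) | 2253 | 14419 | 259 |
| TraDes | Motion(K) | yes | 67.9 (-0.3) | 72.0 (+0.3) | 1822 | 15345 | 178 |
| QuasiDense [pang2021quasi] | Motion(K) + Re-ID | yes | 67.7 (+0.4) | 72.0 (+4.2) | 2280 | 14856 | 281 |
| QuasiDense | Motion(K) | yes | 67.9 (+0.6) | 70.9 (+3.1) | 2310 | 14746 | 258 |
| CenterTrack [zhou2020tracking] | Motion | yes | 66.3 (+0.2) | 64.8 (+0.6) | 2376 | 15445 | 334 |
| CenterTrack | Motion(K) | yes | 67.4 (+1.3) | 74.0 (+9.8) | 1778 | 15641 | 144 |
| Chained-Tracker [peng2020chained] | Motion(K) | yes | 65.0 (+1.9) | 66.7 (+5.8) | 3303 | 15206 | 346 |
| TransTrack [sun2020transtrack] | Attention | yes | 68.6 (+1.5) | 69.0 (+0.7) | 2151 | 14515 | 232 |
| TransTrack | Motion(K) | yes | 68.3 (+1.2) | 72.4 (+4.1) | 1692 | 15189 | 181 |
| MOTR [zeng2021motr] | Attention | yes | 64.3 (-0.4) | 69.3 (+2.1) | 5787 | 13220 | 263 |
| MOTR | Motion(K) | yes | 65.7 (+1.0) | 68.4 (+1.2) | 1607 | 16651 | 260 |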
Analysis on low score detection boxes. To demonstrate the effectiveness of BYTE, we count the TPs and FPs in the low score boxes obtained by BYTE. We use the half training set of MOT17 and CrowdHuman for training and evaluate on the half validation set of MOT17. First, we keep all the low score detection boxes whose scores range from $\tau_{low}$ to $\tau_{high}$ and classify the TPs and FPs using ground truth annotations. Then, we select the tracking results obtained by BYTE from the low score detection boxes. The results of each sequence are shown in Fig. 4. We can see that BYTE obtains notably more TPs than FPs from the low score detection boxes, even though some sequences have many more FPs among all the detection boxes. The obtained TPs notably increase MOTA from 74.6 to 76.6, as shown in Table 1.
Applications on other trackers. We apply BYTE to 9 different state-of-the-art trackers, including JDE [wang2020towards], CSTrack [liang2020rethinking], FairMOT [zhang2020fairmot], TraDes [wu2021track], QuasiDense [pang2021quasi], CenterTrack [zhou2020tracking], Chained-Tracker [peng2020chained], TransTrack [sun2020transtrack] and MOTR [zeng2021motr]. Among these trackers, JDE, CSTrack, FairMOT and TraDes use a combination of motion and Re-ID similarity. QuasiDense uses Re-ID similarity alone. CenterTrack and TraDes predict the motion similarity with learned networks. Chained-Tracker adopts a chain structure, outputs the results of two consecutive frames simultaneously and associates in the same frame by IoU. TransTrack and MOTR use the attention mechanism to propagate boxes among frames. Their results are shown in the first line of each tracker in Table 2. To evaluate the effectiveness of BYTE, we design two different modes of applying BYTE to these trackers.
The first mode is to insert BYTE into the original association methods of the different trackers, as shown in the second line of the results of each tracker in Table 2. Taking FairMOT [zhang2020fairmot] as an example, after the original association is done, we select all the unmatched tracklets and associate them with the low score detection boxes following the second association in Algorithm 1. Note that for the low score objects, the Re-ID features are not reliable, so we only use the IoU between the detection boxes and the tracklet boxes after motion prediction as the similarity. We do not apply the first mode of BYTE to Chained-Tracker because we find it difficult to implement in the chain structure.
The second mode is to directly use the detection boxes of these trackers and associate them using the whole procedure in Algorithm 1, as shown in the third line of the results of each tracker in Table 2. We can see that in both modes, BYTE brings stable improvements on almost all the metrics, including MOTA, IDF1 and IDs. For example, BYTE improves CenterTrack by 1.3 MOTA and 9.8 IDF1, Chained-Tracker by 1.9 MOTA and 5.8 IDF1, and TransTrack by 1.2 MOTA and 4.1 IDF1. The results in Table 2 indicate that BYTE has strong generalization ability and can be easily applied to existing trackers to obtain performance gains.
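Table 4: Comparison of different training data on the MOT17 validation set. "CH" is short for CrowdHuman and "CE" for Cityperson and ETHZ.

| Training data | Images | MOTA | IDF1 | IDs |
|---|---|---|---|---|
| MOT17 + CH | 22.0K | 76.6 | 79.3 | 159 |
| MOT17 + CH + CE | 26.6K | 76.7 | 79.7 | 183 |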
Table 5: Comparison of different interpolation intervals on the MOT17 validation set. The best results are shown in bold.
Speed vs. accuracy. We evaluate the speed and accuracy of ByteTrack using different input image sizes during inference. All experiments use the same multi-scale training. The results are shown in Table 3. The input size during inference ranges from 512 × 928 to 800 × 1440. The running time of the detector ranges from 17.9 ms to 30.0 ms and the association time is always around 4.0 ms. ByteTrack can achieve 75.0 MOTA at 45.7 FPS and 76.6 MOTA at 29.6 FPS, which is an advantage in practical applications.
Training data. We evaluate ByteTrack on the half validation set of MOT17 using different combinations of training data. The results are shown in Table 4. When only using the half training set of MOT17, ByteTrack achieves 75.8 MOTA, which already outperforms most methods. This is because we use strong augmentations such as Mosaic [bochkovskiy2020yolov4] and Mixup [zhang2017mixup]. When further adding CrowdHuman, Cityperson and ETHZ for training, we achieve 76.7 MOTA and 79.7 IDF1. The large improvement in IDF1 arises because the CrowdHuman dataset helps the detector recognize occluded persons, which in turn makes the Kalman Filter generate smoother predictions and enhances the association ability of the tracker.
The experiments on training data suggest that ByteTrack is not data hungry. This is a big advantage for real applications compared with previous methods [zhang2020fairmot, liang2020rethinking, wang2021multiple, liang2021one] that require more than 7 data sources [milan2016mot16, ess2008mobile, zhang2017citypersons, xiao2017joint, zheng2017person, dollar2009pedestrian, shao2018crowdhuman] to achieve high performance.
Visualization results. We show some visualization results of difficult cases which ByteTrack is able to handle in Figure 5. We select 6 sequences from the half validation set of MOT17 and generate the visualization results using the model with 76.6 MOTA and 79.3 IDF1. The difficult cases include occlusion (i.e. MOT17-02, MOT17-04, MOT17-05, MOT17-09, MOT17-13), motion blur (i.e. MOT17-10, MOT17-13) and small objects (i.e. MOT17-13). The pedestrian marked with a red triangle in the middle frame has a low detection score and is recovered by our association method BYTE. The low score boxes not only decrease the number of missing detections, but also play an important role in long-range association. As we can see from all these difficult cases, ByteTrack does not introduce any identity switches and preserves the identities effectively.
Tracklet interpolation. We notice that there are some fully-occluded pedestrians in MOT17, whose visible ratio is 0 in the ground truth annotations. Since it is almost impossible to detect them by visual cues, we obtain these objects by tracklet interpolation.
Suppose we have a tracklet $T$ whose tracklet box is lost due to occlusion from frame $t_1$ to $t_2$. The tracklet box of $T$ at frame $t_1$ is $B_{t_1} \in \mathbb{R}^4$, which contains the top left and bottom right coordinates of the bounding box. Let $B_{t_2}$ represent the tracklet box of $T$ at frame $t_2$. We set a hyper-parameter $\sigma$ representing the maximum interval over which we perform tracklet interpolation, which means tracklet interpolation is performed when $t_2 - t_1 \leq \sigma$. The interpolated box of tracklet $T$ at frame $t$ can be computed as follows:

$$B_t = B_{t_1} + (B_{t_2} - B_{t_1}) \frac{t - t_1}{t_2 - t_1}, \quad t_1 < t < t_2.$$
As shown in Table 5, tracklet interpolation can improve MOTA from 76.6 to 78.3 and IDF1 from 79.3 to 80.2 when $\sigma$ is 20. Tracklet interpolation is an effective post-processing method to obtain the boxes of those fully-occluded objects.
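The linear interpolation described above can be sketched in a few lines. The function name, the dictionary return type and the argument names are assumptions for illustration; the default maximum interval of 20 is the value reported to work best in Table 5.

```python
def interpolate_boxes(t1, box1, t2, box2, max_interval=20):
    """Linearly interpolate boxes for the frames strictly between t1 and t2.

    box1 and box2 are (x1, y1, x2, y2) boxes at frames t1 and t2.
    Returns a dict frame -> box, empty if the gap exceeds max_interval
    or if there is no frame between t1 and t2.
    """
    gap = t2 - t1
    if gap > max_interval or gap < 2:
        return {}
    boxes = {}
    for t in range(t1 + 1, t2):
        w = (t - t1) / gap
        boxes[t] = tuple(b1 + (b2 - b1) * w for b1, b2 in zip(box1, box2))
    return boxes
```

Since this step needs the box at the later frame, it runs as offline post-processing after tracking has finished rather than inside the per-frame association.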
We compare ByteTrack with the state-of-the-art trackers on the test sets of MOT17 and MOT20 under the private detection protocol in Table 6 and Table 7, respectively. All the results are directly obtained from the official MOT Challenge evaluation server (https://motchallenge.net).
MOT17. ByteTrack ranks 1st among all the trackers on the leaderboard of MOT17. It not only achieves the best accuracy (i.e. 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA), but also runs at the highest speed (30 FPS). It outperforms the second-best tracker [yang2021remot] by a large margin (i.e. +3.3 MOTA, +5.3 IDF1 and +3.4 HOTA). Also, we use less training data than many high performance methods such as [zhang2020fairmot, liang2020rethinking, wang2021multiple, shan2020tracklets, liang2021one] (29K images vs. 73K images). It is worth noting that we only use the simplest similarity computation method, the Kalman Filter, in the association step, whereas other methods [zhang2020fairmot, liang2020rethinking, pang2021quasi, wang2020joint, zeng2021motr, sun2020transtrack] additionally use Re-ID similarity or attention mechanisms. All these indicate that ByteTrack is a simple and strong tracker.
MOT20. Compared with MOT17, MOT20 has much more crowded scenarios and occlusion cases. The average number of pedestrians in an image is 170 in the test set of MOT20. ByteTrack also ranks 1st among all the trackers on the leaderboard of MOT20 and outperforms other state-of-the-art trackers by a large margin on almost all the metrics. For example, it increases MOTA from 68.6 to 77.8 and IDF1 from 71.4 to 75.2, and decreases IDs by 71% from 4209 to 1223. It is worth noting that ByteTrack achieves extremely low identity switches, which further indicates that associating every detection box is very effective under occlusion.
We present a simple yet effective data association method BYTE for multi-object tracking. BYTE can be easily applied to existing trackers and achieve consistent improvements. We also propose a strong tracker ByteTrack, which achieves 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on MOT17 test set with 30 FPS, ranking 1st among all the trackers on the leaderboard. ByteTrack is very robust to occlusion for its accurate detection performance and the help of associating low score detection boxes. It also sheds light on how to make the best use of detection results to enhance multi-object tracking. We hope the high accuracy, fast speed and simplicity of ByteTrack can make it attractive in real applications.