YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/
In this report, we present some experienced improvements to the YOLO series, forming a new high-performance detector, YOLOX. We switch the YOLO detector to an anchor-free manner and adopt other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA, to achieve state-of-the-art results across a large range of model scales: for YOLO-Nano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L, with roughly the same number of parameters as YOLOv4-CSP and YOLOv5-L, we achieve 50.0% AP on COCO at about 69 FPS on a Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won the 1st Place on the Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report can provide useful experience for developers and researchers in practical scenes, and we also provide deploy versions with ONNX, TensorRT, ncnn, and OpenVINO supported. Source code is at https://github.com/Megvii-BaseDetection/YOLOX.
With the development of object detection, the YOLO series [yolo, yolo9000, yolov3, yolov4, yolov5] has always pursued the optimal speed and accuracy trade-off for real-time applications. These detectors adopt the most advanced detection technologies available at the time (e.g., anchors [FasterCNN] for YOLOv2 [yolo9000], Residual Net [ResNet] for YOLOv3 [yolov3]) and optimize the implementation for best practice. Currently, YOLOv5 [yolov5] holds the best trade-off performance with 48.2% AP on COCO at 13.7 ms. (Footnote 1: we choose the YOLOv5-L model at 640 × 640 resolution and test the model with FP16 precision and batch=1 on a V100 to align with the settings of YOLOv4 [yolov4] and YOLOv4-CSP [scaleyolo] for a fair comparison.)
Nevertheless, over the past two years, the major advances in object detection academia have focused on anchor-free detectors [fcos, centernet, cornernet], advanced label assignment strategies [freeanchor, atss, paa, autoassign, iqdet, ota], and end-to-end (NMS-free) detectors [detr, end2end, end2end2]. These have not been integrated into YOLO families yet, as YOLOv4 and YOLOv5 are still anchor-based detectors with hand-crafted assigning rules for training.
That’s what brings us here: delivering those recent advancements to the YOLO series with experienced optimization. Considering that YOLOv4 and YOLOv5 may be a little over-optimized for the anchor-based pipeline, we choose YOLOv3 [yolov3] as our starting point (we set YOLOv3-SPP as the default YOLOv3). Indeed, YOLOv3 is still one of the most widely used detectors in industry due to limited computation resources and insufficient software support in various practical applications.
As shown in Fig. 1, with the experienced updates of the above techniques, we boost YOLOv3 to 47.3% AP (YOLOX-DarkNet53) on COCO at 640 × 640 resolution, surpassing the current best practice of YOLOv3 (44.3% AP, ultralytics version; footnote 2: https://github.com/ultralytics/yolov3) by a large margin. Moreover, when switching to the advanced YOLOv5 architecture that adopts an advanced CSPNet [cspnet] backbone and an additional PAN [pan] head, YOLOX-L achieves 50.0% AP on COCO at 640 × 640 resolution, outperforming the counterpart YOLOv5-L by 1.8% AP. We also test our design strategies on models of small size. YOLOX-Tiny and YOLOX-Nano (only 0.91M parameters and 1.08G FLOPs) outperform the corresponding counterparts YOLOv4-Tiny and NanoDet (footnote 3: https://github.com/RangiLyu/nanodet) by 10% AP and 1.8% AP, respectively.
We have released our code at https://github.com/Megvii-BaseDetection/YOLOX, with ONNX, TensorRT, ncnn, and OpenVINO supported. One more thing worth mentioning: we won the 1st Place on the Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model.
We choose YOLOv3 [yolov3] with Darknet53 as our baseline. In the following part, we walk through the whole system design of YOLOX step by step.
Our training settings are mostly consistent from the baseline to our final model. We train the models for a total of 300 epochs with 5 epochs of warm-up on COCO train2017 [MSCOCO]. We use stochastic gradient descent (SGD) for training, with a learning rate of $lr \times \text{BatchSize}/64$ (linear scaling [linear]), an initial $lr = 0.01$, and the cosine lr schedule. The weight decay is 0.0005 and the SGD momentum is 0.9. The batch size is 128 by default for typical 8-GPU devices. Other batch sizes, including single-GPU training, also work well. The input size is uniformly drawn from 448 to 832 in strides of 32. FPS and latency in this report are all measured with FP16 precision and batch=1 on a single Tesla V100.
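The schedule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report gives the linear scaling rule, the 5 warm-up epochs, the 300 total epochs and the cosine schedule, but not the exact warm-up shape or the final learning rate, so the linear warm-up from zero and the decay to zero below are assumptions, and the function names are ours.

```python
import math
import random


def learning_rate(epoch, batch_size=128, base_lr=0.01,
                  total_epochs=300, warmup_epochs=5):
    """Learning rate at a (possibly fractional) epoch."""
    # Linear scaling rule from the text: lr * BatchSize / 64.
    peak_lr = base_lr * batch_size / 64
    if epoch < warmup_epochs:
        # Assumption: linear ramp from 0 up to the peak during warm-up.
        return peak_lr * epoch / warmup_epochs
    # Assumption: half-cosine decay from the peak down to 0.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def candidate_input_sizes():
    """Multi-scale training sizes: 448 to 832 in strides of 32."""
    return list(range(448, 832 + 1, 32))


def sample_input_size(rng=random):
    """Draw one training resolution uniformly from the candidates."""
    return rng.choice(candidate_input_sizes())
```

With the default batch size of 128 the peak learning rate is 0.02; a single-GPU run with batch size 16 would peak at 0.0025 under the same rule.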
Our baseline adopts the architecture of a DarkNet53 backbone and an SPP layer, referred to as YOLOv3-SPP in some papers [yolov4, yolov5]. We slightly change some training strategies compared to the original implementation [yolov3], adding EMA weight updating, a cosine lr schedule, IoU loss, and an IoU-aware branch. We use BCE loss for training the cls and obj branches, and IoU loss for training the reg branch. These general training tricks are orthogonal to the key improvements of YOLOX, so we put them on the baseline. Moreover, we only use RandomHorizontalFlip, ColorJitter, and multi-scale for data augmentation and discard the RandomResizedCrop strategy, because we found that RandomResizedCrop overlaps to some extent with the planned mosaic augmentation. With those enhancements, our baseline achieves 38.5% AP on COCO val, as shown in Tab. 2.
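For readers unfamiliar with the two loss types named above, the following is a minimal scalar sketch. The report does not state the exact form of its IoU loss, so `1 - IoU` is an assumption (other variants exist), and the function names and the `(x1, y1, x2, y2)` box format are chosen here for illustration only.

```python
import math


def bce_loss(prob, target, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred_box, gt_box):
    """Assumed form of the regression loss: 1 - IoU."""
    return 1.0 - iou(pred_box, gt_box)
```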
In object detection, the conflict between classification and regression tasks is a well-known problem [TSD, doubleheadrcnn]. Thus the decoupled head for classification and localization is widely used in most one-stage and two-stage detectors [focal-loss, fcos, decouplehead, doubleheadrcnn]. However, while the YOLO series’ backbones and feature pyramids (e.g., FPN [pfp], PAN [panet]) have continuously evolved, their detection heads remain coupled, as shown in Fig. 2.
Our two analytical experiments indicate that the coupled detection head may harm performance. 1) Replacing YOLO’s head with a decoupled one greatly improves the convergence speed, as shown in Fig. 3. 2) The decoupled head is essential to the end-to-end version of YOLO (described later). As Tab. 1 shows, the end-to-end property costs 4.2% AP with the coupled head, while the drop shrinks to 0.8% AP with a decoupled head. We thus replace the YOLO detection head with a lite decoupled head as in Fig. 2. Concretely, it contains a conv layer to reduce the channel dimension, followed by two parallel branches with two conv layers each. We report the inference time with batch=1 on a V100 in Tab. 2; the lite decoupled head adds 1.1 ms (11.6 ms vs. 10.5 ms).
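Table 1: AP (%) on COCO of end-to-end YOLO with a coupled and a decoupled head.

| Models | Coupled Head | Decoupled Head |
|---|---|---|
| End-to-end YOLO | 34.3 (-4.2) | 38.8 (-0.8) |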
We add Mosaic and MixUp into our augmentation strategies to boost YOLOX’s performance. Mosaic is an efficient augmentation strategy proposed by ultralytics-YOLOv3 (https://github.com/ultralytics/yolov3). It was then widely used in YOLOv4 [yolov4], YOLOv5 [yolov5], and other detectors [yolof]. MixUp [mixup] was originally designed for the image classification task but was later modified in BoF [bag] for object detection training. We adopt the MixUp and Mosaic implementations in our model and turn them off for the last 15 epochs, achieving 42.0% AP in Tab. 2. After using strong data augmentation, we found that ImageNet pre-training is no longer beneficial; we thus train all the following models from scratch.
Table 2: Roadmap of YOLOX-Darknet53 in terms of AP (%) on COCO val.

| Methods | AP (%) | Parameters | GFLOPs | Latency | FPS |
|---|---|---|---|---|---|
| YOLOv3-ultralytics | 44.3 | 63.00 M | 157.3 | 10.5 ms | 95.2 |
| YOLOv3 baseline | 38.5 | 63.00 M | 157.3 | 10.5 ms | 95.2 |
| +decoupled head | 39.6 (+1.1) | 63.86 M | 186.0 | 11.6 ms | 86.2 |
| +strong augmentation | 42.0 (+2.4) | 63.86 M | 186.0 | 11.6 ms | 86.2 |
| +anchor-free | 42.9 (+0.9) | 63.72 M | 185.3 | 11.1 ms | 90.1 |
| +multi positives | 45.0 (+2.1) | 63.72 M | 185.3 | 11.1 ms | 90.1 |
| +SimOTA | 47.3 (+2.3) | 63.72 M | 185.3 | 11.1 ms | 90.1 |
| +NMS free (optional) | 46.5 (-0.8) | 67.27 M | 205.1 | 13.5 ms | 74.1 |
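The latency and FPS figures in Tab. 2 appear to be related by FPS = 1000 / latency (in ms) at batch=1. The short check below reproduces the reported FPS values from the reported latencies; the function name is ours and the relation is inferred from the numbers rather than stated in the report.

```python
def fps_from_latency(latency_ms):
    """Frames per second at batch=1, given per-image latency in milliseconds."""
    return 1000.0 / latency_ms


# (latency in ms, FPS as reported in Tab. 2)
TAB2_PAIRS = [(10.5, 95.2), (11.6, 86.2), (11.1, 90.1), (13.5, 74.1)]

for latency_ms, reported_fps in TAB2_PAIRS:
    computed = round(fps_from_latency(latency_ms), 1)
    print(f"{latency_ms} ms -> {computed} FPS (reported {reported_fps})")
```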
Both YOLOv4 [yolov4] and YOLOv5 [yolov5] follow the original anchor-based pipeline of YOLOv3 [yolov3]. However, the anchor mechanism has many known problems. First, to achieve optimal detection performance, one needs to conduct clustering analysis to determine a set of optimal anchors before training. Those clustered anchors are domain-specific and generalize poorly. Second, the anchor mechanism increases the complexity of detection heads, as well as the number of predictions for each image. On some edge AI systems, moving such a large number of predictions between devices (e.g., from NPU to CPU) may become a potential bottleneck in terms of the overall latency.
Anchor-free detectors [fcos, centernet, cornernet] have developed rapidly in the past two years. These works have shown that the performance of anchor-free detectors can be on par with anchor-based detectors. The anchor-free mechanism significantly reduces the number of design parameters that need heuristic tuning and the many tricks involved (e.g., Anchor Clustering [yolo9000], Grid Sensitive [ppyolo]) for good performance, making the detector, especially its training and decoding phase, considerably simpler [fcos].
Switching YOLO to an anchor-free manner is quite simple. We reduce the predictions for each location from 3 to 1 and make them directly predict four values, i.e., two offsets relative to the top-left corner of the grid, and the height and width of the predicted box. We assign the center location of each object as the positive sample and pre-define a scale range, as done in [fcos], to designate the FPN level for each object. Such a modification reduces the parameters and GFLOPs of the detector and makes it faster, while obtaining better performance: 42.9% AP as shown in Tab. 2.
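To make the effect on the output size concrete, the sketch below counts predictions per image and decodes one anchor-free prediction. It is only an illustration: the three output strides (8, 16, 32) and the 640-pixel input are assumptions based on the usual YOLOv3 setup, the exponential width/height parameterisation is one common choice that the report does not spell out, and all names are ours.

```python
import math

STRIDES = (8, 16, 32)  # assumption: the usual three YOLOv3 output strides


def num_predictions(input_size, per_location, strides=STRIDES):
    """Total predictions for a square input, summed over all output levels."""
    return sum((input_size // s) ** 2 * per_location for s in strides)


def decode(raw, grid_x, grid_y, stride):
    """Map raw outputs (dx, dy, log_w, log_h) to a box (cx, cy, w, h) in pixels.

    dx, dy are offsets from the top-left corner of grid cell (grid_x, grid_y).
    Assumption: width and height are predicted in log space, in stride units.
    """
    dx, dy, log_w, log_h = raw
    cx = (grid_x + dx) * stride
    cy = (grid_y + dy) * stride
    w = math.exp(log_w) * stride
    h = math.exp(log_h) * stride
    return cx, cy, w, h


anchor_based = num_predictions(640, per_location=3)  # 3 anchors per location
anchor_free = num_predictions(640, per_location=1)   # 1 prediction per location
```

Under these assumptions the anchor-based head emits 25200 predictions per 640 × 640 image and the anchor-free head emits 8400, which is the reduction in data movement that the anchor discussion above refers to.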
To be consistent with the assigning rule of YOLOv3, the above anchor-free version selects only ONE positive sample (the center location) for each object and ignores other high-quality predictions. However, optimizing those high-quality predictions may also bring beneficial gradients, which may alleviate the extreme imbalance of positive/negative sampling during training. We simply assign the center area as positives, also named “center sampling” in FCOS [fcos]. The performance of the detector improves to 45.0% AP as in Tab. 2, already surpassing the current best practice of ultralytics-YOLOv3 (44.3% AP).
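The difference between the single-positive rule and center sampling can be illustrated as follows. This is a sketch, not the released implementation: the size of the center area is controlled here by a hypothetical `radius` parameter, where `radius=0` reproduces the single center location and `radius=1` gives a 3 × 3 neighbourhood, which we assume as the center area.

```python
def center_positives(box, stride, grid_w, grid_h, radius=1):
    """Grid cells treated as positives for one ground-truth box.

    box is (x1, y1, x2, y2) in pixels; returns a set of (gx, gy) cell indices
    on the feature map with the given stride.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    gx, gy = int(cx // stride), int(cy // stride)  # cell containing the center
    cells = set()
    for y in range(gy - radius, gy + radius + 1):
        for x in range(gx - radius, gx + radius + 1):
            if 0 <= x < grid_w and 0 <= y < grid_h:  # clip at the map border
                cells.add((x, y))
    return cells
```

For a box centered at (120, 130) on a stride-8 map, `radius=0` yields the single cell (15, 16), while `radius=1` yields nine cells around it.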
Advanced label assignment is another important progress of object detection in recent years. Based on our own study OTA [ota], we conclude four key insights for an advanced label assignment: 1) loss/quality aware, 2) center prior, 3) dynamic number of positive anchors for each ground-truth (abbreviated as dynamic top-k), and 4) global view. (Footnote 4: the term “anchor” refers to “anchor point” in the context of anchor-free detectors and “grid” in the context of YOLO.) OTA meets all four rules above, hence we choose it as a candidate label assigning strategy.
Specifically, OTA [ota] analyzes label assignment from a global perspective and formulates the assigning procedure as an Optimal Transport (OT) problem, producing the SOTA performance among the current assigning strategies [paa, autoassign, atss, iqdet, freeanchor]. However, in practice we found that solving the OT problem via the Sinkhorn-Knopp algorithm brings 25% extra training time, which is quite expensive for training 300 epochs. We thus simplify it to a dynamic top-k strategy, named SimOTA, to get an approximate solution.
We briefly introduce SimOTA here. SimOTA first calculates the pair-wise matching degree, represented by cost [ota, lla, paa, detr] or quality [defcn], for each prediction-gt pair. For example, in SimOTA, the cost between gt $g_i$ and prediction $p_j$ is calculated as:

$$c_{ij} = L_{ij}^{cls} + \lambda L_{ij}^{reg},$$

where $\lambda$ is a balancing coefficient, and $L_{ij}^{cls}$ and $L_{ij}^{reg}$ are the classification loss and regression loss between gt $g_i$ and prediction $p_j$. Then, for gt $g_i$, we select the top $k$ predictions with the least cost within a fixed center region as its positive samples. Finally, the corresponding grids of those positive predictions are assigned as positives, while the remaining grids are negatives. Note that the value of $k$ varies for different ground-truths. Please refer to the Dynamic $k$ Estimation strategy in OTA [ota] for more details.
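The procedure can be sketched in plain Python as below. This is a simplified illustration under several assumptions that the report does not state: the balancing coefficient value `lam=3.0`, the estimate of the dynamic number of positives as the integer part of the sum of the `top_q=10` largest IoUs (our reading of the dynamic estimation idea in OTA), and the tie-breaking rule that a prediction selected by several ground-truths goes to the one with the lowest cost. All function and parameter names are hypothetical.

```python
def simota_assign(cls_loss, reg_loss, iou, in_center, lam=3.0, top_q=10):
    """Simplified dynamic top-k assignment.

    All inputs are [num_gt][num_pred] nested lists; in_center[i][j] tells
    whether prediction j lies in the fixed center region of ground-truth i.
    Returns {prediction index: ground-truth index} for positive predictions;
    every prediction not in the result is a negative.
    """
    num_gt = len(cls_loss)
    num_pred = len(cls_loss[0]) if num_gt else 0
    assigned = {}  # prediction index -> (gt index, cost)
    for i in range(num_gt):
        candidates = [j for j in range(num_pred) if in_center[i][j]]
        # Pair-wise cost: classification loss + lam * regression loss.
        cost = {j: cls_loss[i][j] + lam * reg_loss[i][j] for j in candidates}
        # Assumed dynamic k: integer part of the summed top-q IoUs, at least 1.
        top_ious = sorted((iou[i][j] for j in candidates), reverse=True)[:top_q]
        k = max(1, int(sum(top_ious)))
        # Keep the k candidates with the least cost.
        for j in sorted(candidates, key=cost.get)[:k]:
            # Assumed conflict rule: the lowest-cost ground-truth wins.
            if j not in assigned or cost[j] < assigned[j][1]:
                assigned[j] = (i, cost[j])
    return {j: gt for j, (gt, _) in assigned.items()}


# Toy example: 2 ground-truths, 5 predictions.
iou_m = [[0.9, 0.8, 0.4, 0.0, 0.0], [0.0, 0.0, 0.5, 0.7, 0.3]]
reg_m = [[1.0 - v for v in row] for row in iou_m]
cls_m = [[0.2, 0.3, 0.9, 1.0, 1.0], [1.0, 1.0, 0.6, 0.2, 0.8]]
ctr_m = [[True, True, True, False, False], [False, False, True, True, True]]
positives = simota_assign(cls_m, reg_m, iou_m, ctr_m)
```

In the toy example the first ground-truth receives two positives (its IoUs sum to about 2.1) and the second receives one (its IoUs sum to 1.5), so the number of positives differs per ground-truth without any optimal-transport solver.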
SimOTA not only reduces the training time but also avoids the additional solver hyperparameters of the Sinkhorn-Knopp algorithm. As shown in Tab. 2, SimOTA raises the detector from 45.0% AP to 47.3% AP, higher than the SOTA ultralytics-YOLOv3 by 3.0% AP, showing the power of the advanced assigning strategy.
We follow [end2end2] to add two additional conv layers, one-to-one label assignment, and stop gradient. These enable the detector to run in an end-to-end manner, but slightly decrease the performance and the inference speed, as listed in Tab. 2. We thus leave it as an optional module which is not involved in our final models.
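Table 3: Comparison of YOLOX and YOLOv5 in terms of AP (%) on COCO.

| Models | AP (%) | Parameters | GFLOPs | Latency |
|---|---|---|---|---|
| YOLOv5-S | 36.7 | 7.3 M | 17.1 | 8.7 ms |
| YOLOX-S | 39.6 (+2.9) | 9.0 M | 26.8 | 9.8 ms |
| YOLOv5-M | 44.5 | 21.4 M | 51.4 | 11.1 ms |
| YOLOX-M | 46.4 (+1.9) | 25.3 M | 73.8 | 12.3 ms |
| YOLOv5-L | 48.2 | 47.1 M | 115.6 | 13.7 ms |
| YOLOX-L | 50.0 (+1.8) | 54.2 M | 155.6 | 14.5 ms |
| YOLOv5-X | 50.4 | 87.8 M | 219.0 | 16.0 ms |
| YOLOX-X | 51.2 (+0.8) | 99.1 M | 281.9 | 17.3 ms |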
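Table 4: Comparison of YOLOX-Tiny and YOLOX-Nano with their counterparts in terms of AP (%) on COCO.

| Models | AP (%) | Parameters | GFLOPs |
|---|---|---|---|
| YOLOv4-Tiny [scaleyolo] | 21.7 | 6.06 M | 6.96 |
| YOLOX-Tiny | 31.7 (+9.0) | 5.06 M | 6.45 |
| YOLOX-Nano | 25.3 (+1.8) | 0.91 M | 1.08 |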
Besides DarkNet53, we also test YOLOX on other backbones with different sizes, where YOLOX achieves consistent improvements against all the corresponding counterparts.
To give a fair comparison, we adopt the exact YOLOv5 backbone, including the modified CSPNet [cspnet], SiLU activation, and the PAN [pan] head. We also follow its scaling rule to produce the YOLOX-S, YOLOX-M, YOLOX-L, and YOLOX-X models. Compared to YOLOv5 in Tab. 3, our models get a consistent improvement of roughly 3.0% to 1.0% AP, with only a marginal increase in inference time (which comes from the decoupled head).
We further shrink our model to YOLOX-Tiny to compare with YOLOv4-Tiny [scaleyolo]. For mobile devices, we adopt depthwise convolution to construct a YOLOX-Nano model, which has only 0.91M parameters and 1.08G FLOPs. As shown in Tab. 4, YOLOX performs well with an even smaller model size than its counterparts.
In our experiments, all the models keep almost the same learning schedule and optimizing parameters as described in Section 2.1. However, we found that the suitable augmentation strategy varies across model sizes. As Tab. 5 shows, while applying MixUp to YOLOX-L can improve AP by 0.9%, it is better to weaken the augmentation for small models like YOLOX-Nano. Specifically, we remove the MixUp augmentation and weaken the mosaic (reducing the scale range from [0.1, 2.0] to [0.5, 1.5]) when training small models, i.e., YOLOX-S, YOLOX-Tiny, and YOLOX-Nano. Such a modification improves YOLOX-Nano’s AP from 24.0% to 25.3%.
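The size-dependent augmentation rule above amounts to a small configuration switch, sketched here. The dictionary keys and the function name are hypothetical and do not correspond to any option in the released code; only the model names, the two scale ranges, and the MixUp on/off choice come from the text.

```python
# Models that the text lists as "small" for augmentation purposes.
SMALL_MODELS = {"YOLOX-S", "YOLOX-Tiny", "YOLOX-Nano"}


def augmentation_settings(model_name):
    """Return illustrative augmentation settings for a given model."""
    if model_name in SMALL_MODELS:
        # Weakened mosaic scale range and no MixUp for small models.
        return {"mosaic_scale": (0.5, 1.5), "mixup": False}
    # Full-strength mosaic plus MixUp for larger models.
    return {"mosaic_scale": (0.1, 2.0), "mixup": True}
```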
For large models, we also found that stronger augmentation is more helpful. Indeed, our MixUp implementation is somewhat heavier than the original version in [bag]. Inspired by Copypaste [copypaste], we jitter both images by a randomly sampled scale factor before mixing them up. To understand the power of MixUp with scale jittering, we compare it with Copypaste on YOLOX-L. Note that Copypaste requires extra instance mask annotations while MixUp does not. As shown in Tab. 5, these two methods achieve competitive performance, indicating that MixUp with scale jittering is a qualified replacement for Copypaste when no instance mask annotation is available.
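Table 5: Effects of data augmentation under different model sizes.

| Models | Scale Jit. | Extra Aug. | AP (%) |
|---|---|---|---|
| YOLOX-Nano | [0.1, 2.0] | MixUp | 24.0 (-1.3) |
| YOLOX-L | [0.1, 2.0] | MixUp | 49.5 (+0.9) |
| YOLOX-L | [0.1, 2.0] | Copypaste [copypaste] | 49.4 |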
It is traditional to show a SOTA comparison table, as in Tab. 6. However, keep in mind that the inference speed of the models in this table is often uncontrolled, as speed varies with software and hardware. We thus use the same hardware and code base for all the YOLO series in Fig. 1, plotting a somewhat controlled speed/accuracy curve.
We notice that there are some high-performance YOLO series with larger model sizes, like Scale-YOLOv4 [scaleyolo] and YOLOv5-P6 [yolov5], and that the current Transformer-based detectors [swin] push the accuracy SOTA to 60 AP. Due to time and resource limitations, we did not explore those important features in this report. However, they are already in our scope.
Table 6: Speed and accuracy comparison of different object detectors on COCO.

| Method | Backbone | Size | FPS (V100) | AP (%) | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv3 + ASFF* [asff] | Darknet-53 | 608 | 45.5 | 42.4 | 63.0 | 47.4 | 25.5 | 45.7 | 52.3 |
| YOLOv3 + ASFF* [asff] | Darknet-53 | 800 | 29.4 | 43.9 | 64.1 | 49.2 | 27.0 | 46.6 | 53.4 |
| YOLOv4-CSP [scaleyolo] | Modified CSP | 640 | 73.0 | 47.5 | 66.2 | 51.7 | 28.2 | 51.2 | 59.8 |
| YOLOv5-M [yolov5] | Modified CSP v5 | 640 | 90.1 | 44.5 | 63.1 | - | - | - | - |
| YOLOv5-L [yolov5] | Modified CSP v5 | 640 | 73.0 | 48.2 | 66.9 | - | - | - | - |
| YOLOv5-X [yolov5] | Modified CSP v5 | 640 | 62.5 | 50.4 | 68.8 | - | - | - | - |
| YOLOX-M | Modified CSP v5 | 640 | 81.3 | 46.4 | 65.4 | 50.6 | 26.3 | 51.0 | 59.9 |
| YOLOX-L | Modified CSP v5 | 640 | 69.0 | 50.0 | 68.5 | 54.5 | 29.8 | 54.5 | 64.4 |
| YOLOX-X | Modified CSP v5 | 640 | 57.8 | 51.2 | 69.6 | 55.7 | 31.2 | 56.1 | 66.1 |
The Streaming Perception Challenge on WAD 2021 is a joint evaluation of accuracy and latency through a recently proposed metric: streaming accuracy [stream]. The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant, forcing the stack to consider the amount of streaming data that should be ignored while computation is occurring [stream]. We found that the best trade-off point for the metric on a 30 FPS data stream is a powerful model with an inference time within 33 ms. So we adopt a YOLOX-L model with TensorRT to produce our final model for the challenge, which won the 1st place. Please refer to the challenge website (footnote 5: https://eval.ai/web/challenges/challenge-page/800/overview) for more details.
In this report, we present some experienced updates to YOLO series, which forms a high-performance anchor-free detector called YOLOX. Equipped with some recent advanced detection techniques, i.e., decoupled head, anchor-free, and advanced label assigning strategy, YOLOX achieves a better trade-off between speed and accuracy than other counterparts across all model sizes. It is remarkable that we boost the architecture of YOLOv3, which is still one of the most widely used detectors in industry due to its broad compatibility, to 47.3% AP on COCO, surpassing the current best practice by 3.0% AP. We hope this report can help developers and researchers get better experience in practical scenes.