Object detection. EfficientDet-D5 level COCO AP in 20 epochs. SOTA single-stage detector on Waymo Open Dataset.
Benchmarks, such as COCO, play a crucial role in object detection. However, existing benchmarks are insufficient in scale variation, and their protocols are inadequate for fair comparison. In this paper, we introduce the Universal-Scale object detection Benchmark (USB). USB has variations in object scales and image domains by incorporating COCO with the recently proposed Waymo Open Dataset and Manga109-s dataset. To enable fair comparison, we propose USB protocols by defining multiple thresholds for training epochs and evaluation image resolutions. By analyzing methods on the proposed benchmark, we designed fast and accurate object detectors called UniverseNets, which surpassed all baselines on USB and achieved state-of-the-art results on existing benchmarks. Specifically, UniverseNets achieved 54.1% AP on COCO test-dev with 20 epochs of training, the top result among single-stage detectors on the Waymo Open Dataset Challenge 2020 2D detection, and first place in the NightOwls Detection Challenge 2020 all objects track. The code is available at https://github.com/shinya7y/UniverseNet.
Humans can detect various objects (see the teaser figure). One can detect close equipment in everyday scenes, far vehicles in traffic scenes, and texts and persons in manga (Japanese comics). If computers can automatically detect various objects, they will yield significant benefits to humans. For example, they will help impaired people and the elderly, save lives by autonomous driving, and provide safe entertainment during pandemics by automatic translation.
Problem 1: Variations in object scales and image domains remain limited. To realize human-level perception, computers must handle various object scales and image domains as humans can. Among various domains, the traffic and artificial domains have extensive scale variations (see Sec. 3). COCO is far from covering them. Nevertheless, the current computer vision community is overconfident in COCO results. For example, most studies on state-of-the-art methods in 2020 only report COCO results [69, 63, 31, 33, 12, 13] or those for bounding box object detection [59, 16, 5, 45]. Readers cannot assess whether these methods are specialized for COCO or generalizable to other datasets and domains.
Problem 2: Protocols for training and evaluation are not well established. There are standard experimental settings for the COCO benchmark [22, 10, 35, 36, 61, 69, 33]. Many works train detectors within 24 epochs using a learning rate of 0.01 or 0.02 and evaluate them on images within 1333×800. These settings are not obligations but non-binding agreements for fair comparison. Some works do not follow the standard settings for accurate and fast detectors. (Footnote 1: YOLOv4 was trained for 273 epochs, DETR for 500 epochs, EfficientDet-D6 for 300 epochs, and EfficientDet-D7x for 600 epochs. SpineNet uses a learning rate of 0.28, and YOLOv4 uses a searched learning rate of 0.00261. EfficientDet finely changes the image resolution from 512×512 to 1536×1536.) Their abnormal and scattered settings hinder the assessment of the most suitable method (see Figure 1). Furthermore, by “buying stronger results”, they build a barrier for those without considerable funds to develop and train detectors.
Problem 3: The analysis of methods for multi-scale object detection is insufficient. Numerous studies have proposed methods for multi-scale object detection [38, 47, 39, 35, 14]. In recent years, improvements for network components have made significant progress in COCO (e.g., Res2Net for the backbone, SEPC for the neck, and ATSS for the head). These works have an insufficient analysis of combinability, effectiveness, and characteristics, especially on datasets other than COCO.
This study makes the following three contributions to resolve the problems.
Contribution 1: We introduce the Universal-Scale object detection Benchmark (USB) that consists of three datasets. In addition to COCO, we selected the Waymo Open Dataset  and Manga109-s [40, 3] to cover various object scales and image domains. They are the largest public datasets in their domains and enable reliable comparisons. We conducted experiments using eight methods and found weaknesses of existing COCO-biased methods.
Contribution 2: We established the USB protocols for fair training and evaluation for more inclusive object detection research. USB protocols enable fair, easy, and scalable comparisons by defining multiple thresholds for training epochs and evaluation image resolutions.
Contribution 3: We designed fast and accurate object detectors called UniverseNets by analyzing methods developed for multi-scale object detection. UniverseNets outperformed all baselines on USB and achieved state-of-the-art results on existing benchmarks. In particular, our finding on USB enables a 9.3 points higher score than YOLOv4  on the Waymo Open Dataset Challenge 2020 2D detection.
Deep learning-based detectors dominate the recent progress in object detection. They can be divided [38, 10, 47] into single-stage detectors without region proposal [46, 39, 36] and multi-stage (including two-stage) detectors with region proposal [47, 35, 8]. Our UniverseNets are single-stage detectors for efficiency [33, 59, 5, 28].
Detecting multi-scale objects is a fundamental challenge in object detection [38, 7]. Various components have been improved, including backbones and modules [56, 26, 27, 19, 14], necks [35, 63, 59], heads and training sample selection [47, 39, 69], and multi-scale training and testing [48, 53, 69] (see Supplementary Material for details).
Some recent or concurrent works [67, 5, 29] have combined multiple methods. Unlike these works, we analyzed various components developed for scale variations, without computation for neural architecture search , long training (273 epochs) , and multi-stage detectors .
Most prior studies have evaluated detectors on limited image domains. We demonstrate the superior performance of our UniverseNets across various object scales and image domains through the proposed benchmark.
There are numerous object detection benchmarks. For specific (category) object detection, recent benchmarks such as WIDER FACE  and TinyPerson  contain tiny objects. Although they are useful for evaluation for a specific category, many applications should detect multiple categories. For autonomous driving, KITTI  and Waymo Open Dataset  mainly evaluate three categories (car, pedestrian, and cyclist) in their leaderboards. For generic object detection, Pascal VOC  and COCO  include 20 and 80 categories, respectively. The number of categories has been further expanded by recent benchmarks, such as Open Images , Objects365 , and LVIS . All the above datasets comprise photographs, whereas Clipart1k, Watercolor2k, Comic2k , and Manga109-s [40, 3] comprise artificial images.
Detectors evaluated on a specific dataset may perform worse on other datasets or domains. To address this issue, some benchmarks consist of multiple datasets. In the Robust Vision Challenge 2020 , detectors were evaluated on three datasets in the natural and traffic image domains. For universal-domain object detection, the Universal Object Detection Benchmark (UODB)  comprises 11 datasets in the natural, traffic, aerial, medical, and artificial image domains. Although it is suitable for evaluating detectors in various domains, variations in object scales are limited. Unlike UODB, our USB focuses on universal-scale object detection. Datasets in USB contain more instances, including tiny objects, than the datasets used in UODB.
As discussed in Sec. 1, the current benchmarks allow extremely unfair settings (e.g., 25× training epochs). We resolved this problem by establishing USB protocols for fair training and evaluation.
Here, we present the principle, datasets, protocols, and metrics of USB. See Supplementary Material for additional information.
We focus on the Universal-Scale Object Detection (USOD) task that aims to detect various objects in terms of object scales and image domains. Unlike separate discussions for multi-scale object detection (Sec. 2) and universal (-domain) object detection , USOD does not ignore the relation between scales and domains (Sec. 3.2).
For various applications and users, benchmark protocols should cover from short to long training and from small to large test scales. On the other hand, they should not be scattered for meaningful benchmarks. To satisfy the conflicting requirements, we define multiple thresholds for training epochs and evaluation image resolutions. Furthermore, we urge rich participants to report results with standard training settings. This request enables fair comparison and allows many people to develop and compare object detectors.
|Dataset||Domain||Color||Main sources of scale variation|
|COCO ||Natural||RGB||Categories, distance|
|Manga109-s [40, 3]||Artificial||Grayscale||Viewpoints, page layouts|
|Benchmark||Dataset||Domain||Classes||Boxes||Images||Boxes/image|
|USB (Ours)||COCO||Natural||80||897 k||123 k||7.3|
| ||WOD v1.2 f0||Traffic||3||1.0 M||100 k||10.0|
| ||Manga109-s [40, 3]||Artificial||4||401 k||8.5 k||47.0|
|UODB||COCO val2014||Natural||80||292 k||41 k||7.2|
| ||KITTI||Traffic||3||35 k||7.5 k||4.7|
| ||Comic2k||Artificial||6||6.4 k||2.0 k||3.2|
To establish USB, we selected the COCO , Waymo Open Dataset (WOD) , and Manga109-s (M109s) [40, 3]. WOD and M109s are the largest public datasets with many small objects in the traffic and artificial domains, respectively. Object scales in these domains vary significantly with distance and viewpoints, unlike those in the medical and aerial domains. The USB covers extensive scale variations quantitatively (Figure 2) and qualitatively (Table 1). As shown in Table 2, these three datasets in USB contain more instances than their counterpart datasets in UODB  (COCO  val2014 subset, KITTI , and Comic2k ). USOD needs to evaluate detectors on datasets with many instances because more instances enable more reliable comparisons of scale-wise metrics.
For the first dataset, we adopted the COCO dataset. COCO contains natural images of everyday scenes collected from the Internet. Annotations for 80 categories are used in the benchmark. As shown in the teaser figure (left), object scales mainly depend on the categories and distance. Although COCO contains objects smaller than those of Pascal VOC, objects in everyday scenes (especially indoor scenes) are relatively large. Since COCO is the current standard dataset for multi-scale object detection, we adopted the same training split train2017 (also known as trainval35k) as the COCO benchmark to eliminate the need for retraining across benchmarks. We adopted the val2017 split (also known as minival) as the test set.
For the second dataset, we adopted the WOD, which is a large-scale, diverse dataset for autonomous driving with many annotations for tiny objects (Figure 2). The images were recorded using five high-resolution cameras mounted on vehicles. As shown in the teaser figure (middle), object scales vary mainly with distance. The full data splits of the WOD are too large for benchmarking methods. Thus, we extracted 10% size subsets from the predefined training split (798 sequences) and validation split (202 sequences). Specifically, we extracted splits based on the ones place of the frame index (frames 0, 10, …, 190) in each sequence. We call the subsets f0train and f0val splits. Each sequence in the splits contains 20 frames (20 s, 1 Hz), and each frame contains five images for five cameras. We used three categories (vehicle, pedestrian, and cyclist) following the official ALL_NS setting used in WOD competitions.
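The frame selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the sketch assumes that every sequence has exactly 200 frames, which real sequences may not satisfy. Under that assumption, the image counts are an upper-bound estimate.

```python
from typing import List


def select_f0_frames(num_frames: int, step: int = 10) -> List[int]:
    """Return the frame indices whose ones place is 0 (0, 10, 20, ...)."""
    return [index for index in range(num_frames) if index % step == 0]


def count_f0_images(num_sequences: int,
                    frames_per_sequence: int = 200,  # assumption, not from the dataset
                    cameras: int = 5) -> int:
    """Estimate the number of images in an f0 subset."""
    kept_frames = len(select_f0_frames(frames_per_sequence))
    return num_sequences * kept_frames * cameras


frames = select_f0_frames(200)
print(frames[:3], frames[-1], len(frames))  # [0, 10, 20] 190 20
print(count_f0_images(798) + count_f0_images(202))  # 100000
```

With these assumptions, the estimate for f0train plus f0val is 100,000 images, which agrees with the roughly 100 k images listed for WOD v1.2 f0.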
For the third dataset, we adopted the M109s [40, 3]. M109s contains artificial images of manga (Japanese comics) and annotations for four categories (body, face, frame, and text). Many characteristics differ from those of natural images. Most images are grayscale. The objects are highly overlapped. As shown in the teaser figure (right), object scales vary unrestrictedly with viewpoints and page layouts. Small objects differ greatly from downsampled versions of large objects because small objects are drawn with simple lines and points. For example, small faces look like a sign (). This characteristic may ruin techniques developed mainly for natural images. Another challenge is ambiguity in annotations. Sometimes, a small-scale object is annotated, and sometimes, a similar scale object on another page is not annotated. Since annotating small objects is difficult and labor-intensive, this is an important and practical challenge. We carefully selected 68, 4, and 15 volumes for training, validation, and testing splits, and we call them the 68train, 4val, and 15test, respectively.
We selected the test splits from images with publicly available annotations to reduce the labor required for submissions. Participants should not fine-tune hyperparameters based on the test splits to prevent overfitting.
|USB 1.0||24||✗||—||2× schedule [22, 25]|
|USB 2.0||73||✗||USB 1.0||6× schedule|
|USB 3.0||300||✗||USB 1.0, 2.0||EfficientDet-D6|
|USB 3.1||300||✓||USB 1.0, 2.0, 3.0||YOLOv4|
For fair training, we propose the USB training protocols shown in Table 3. By analogy with the backward compatibility of the Universal Serial Bus, USB training protocols emphasize compatibility between protocols. Importantly, participants should report results with not only higher protocols but also lower protocols. For example, when a participant trains a model for 150 epochs with standard hyperparameters, it corresponds to USB 3.0. The participant should also report the results of models trained for 24 and 73 epochs in a paper. The readers of the paper can judge whether the method is useful for standard training epochs.
The maximum number of epochs for USB 1.0 is 24, which is the most popular setting in COCO (see Table 10). We adopted 73 epochs for USB 2.0, where models trained from scratch can catch up with those trained for 24 epochs from ImageNet pre-trained models. We adopted 300 epochs for USB 3.x such that YOLOv4 and most EfficientDet models correspond to this protocol. Models trained for more than 300 epochs are regarded as Freestyle. They are not suitable for benchmarking methods, although they may push the empirical limits of detectors [60, 9].
For models trained with mask annotations, 0.5 is added to their protocol number (e.g., USB 2.0 becomes USB 2.5). Results without mask annotations should also be reported if the algorithm permits.
For ease of comparison, we limit the pre-training datasets to the three and ImageNet (ILSVRC 1,000-class classification). Other datasets are welcome only when the results with and without additional datasets are reported. Participants should describe how to use the datasets. A possible way is to fine-tune the models on WOD and M109s from COCO pre-trained models. Another way is to train a single model jointly  on the three datasets.
In addition to long training schedules, hyperparameter optimization is resource-intensive. If authors of a paper fine-tune hyperparameters for their architecture, other people without sufficient computational resources cannot compare methods fairly. We recommend roughly tuning the minimum hyperparameters, such as batch sizes and learning rates (e.g., from a small set of candidate values). When participants optimize hyperparameters aggressively by manual fine-tuning or automatic algorithms, they should report both results with and without aggressive optimization.
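As a rough illustration, the training protocol rules can be written as a small lookup. This is a sketch, not part of the benchmark code: the function names are hypothetical, and it assumes that the ✓ column of the training protocol table marks aggressive hyperparameter optimization, so that any aggressively tuned model within 300 epochs falls under USB 3.1.

```python
from typing import List


def usb_training_protocol(epochs: int,
                          aggressive_hpo: bool = False,
                          uses_masks: bool = False) -> str:
    """Map a training setup to a USB training protocol name."""
    if epochs > 300:
        return "Freestyle"
    # Work in tenths to avoid floating-point noise in the version number.
    if aggressive_hpo:
        tenths = 31  # assumption: aggressive optimization implies USB 3.1
    elif epochs <= 24:
        tenths = 10
    elif epochs <= 73:
        tenths = 20
    else:
        tenths = 30
    if uses_masks:
        tenths += 5  # mask annotations add 0.5 to the protocol number
    return f"USB {tenths // 10}.{tenths % 10}"


def lower_protocol_epochs(epochs: int) -> List[int]:
    """Epoch budgets of lower protocols whose results should also be reported."""
    return [limit for limit in (24, 73) if limit < epochs]


print(usb_training_protocol(150))                       # USB 3.0
print(lower_protocol_epochs(150))                       # [24, 73]
print(usb_training_protocol(40, uses_masks=True))       # USB 2.5
print(usb_training_protocol(273, aggressive_hpo=True))  # USB 3.1
```

The 150-epoch case reproduces the example in the text: the model corresponds to USB 3.0, and results for 24 and 73 epochs should be reported as well.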
|Protocol||Max reso.||Typical scale||Reference|
|Standard USB||1,066,667||1333×800||Popular in COCO [37, 10, 22]|
|Mini USB||262,144||512×512||Popular in VOC [17, 39]|
|Micro USB||50,176||224×224||Popular in ImageNet [49, 26]|
|Large USB||2,457,600||1920×1280||WOD front cameras|
|Huge USB||7,526,400||3360×2240||WOD top methods (e.g., ours)|
For fair evaluation, we propose the USB evaluation protocols shown in Table 4. By analogy with the size variations of the Universal Serial Bus connectors for various devices, USB evaluation protocols have variations in test image scales for various devices and applications.
The maximum resolution for Standard USB follows the popular test scale of 1333×800 in the COCO benchmark (see Table 10 and [10, 22]). For Mini USB, we limit the resolution based on 512×512. This resolution is popular in the Pascal VOC benchmark [17, 39], which contains small images and large objects. It is also popular in real-time detectors [59, 5]. We adopted a further small-scale 224×224 for Micro USB. This resolution is popular in ImageNet classification [49, 26]. Although small object detection is extremely difficult, it is suitable for low-power devices. Additionally, this protocol enables people to manage object detection tasks using one or few GPUs. To cover larger test scales than Standard USB, we define Large USB and Huge USB based on WOD resolutions. The maximum resolution of the Huge USB is determined by the top methods on WOD (see Sec. 5.3). Although larger inputs (regarded as Freestyle) may be preferable for accuracy, excessively large inputs significantly reduce the practicality of detectors.
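The evaluation protocols can likewise be sketched as a threshold lookup. The sketch below assumes that "Max reso." is an upper limit on the pixel count (width × height) of the test image, which matches the typical scales in the table; the names `USB_EVAL_PROTOCOLS` and `usb_eval_protocol` are hypothetical.

```python
# (protocol name, maximum number of pixels), ordered from smallest to largest
USB_EVAL_PROTOCOLS = [
    ("Micro USB", 50_176),
    ("Mini USB", 262_144),
    ("Standard USB", 1_066_667),
    ("Large USB", 2_457_600),
    ("Huge USB", 7_526_400),
]


def usb_eval_protocol(width: int, height: int) -> str:
    """Return the smallest USB evaluation protocol that admits the test scale."""
    pixels = width * height
    for name, max_pixels in USB_EVAL_PROTOCOLS:
        if pixels <= max_pixels:
            return name
    return "Freestyle"  # larger than Huge USB


print(usb_eval_protocol(1333, 800))   # Standard USB
print(usb_eval_protocol(1493, 896))   # Large USB
print(usb_eval_protocol(3000, 1800))  # Huge USB
```

Under this reading, a 1493×896 test scale is Large USB and a 3000×1800 test scale is Huge USB.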
In addition to test image scales, the presence and degree of Test-Time Augmentation (TTA) make large differences in accuracy and inference time. When using TTA, participants should report its details (including the number of scales of multi-scale testing) and results without TTA.
We mainly use the COCO metrics [37, 34] to evaluate the performance of detectors on each dataset. We provide data format converters for WOD and M109s in our GitHub repository (anonymized for review).
We first describe the calculation of COCO metrics according to the official evaluation code. True or false positives are judged by measuring the Intersection over Union (IoU) between predicted bounding boxes and ground truth bounding boxes. For each category, the Average Precision (AP) is calculated as precision averaged over 101 recall thresholds $\{0, 0.01, \dots, 1\}$. The COCO-style AP (CAP) for a dataset $d$ is calculated as
$$\mathrm{CAP}_d = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|C_d|} \sum_{c \in C_d} \mathrm{AP}_{t,c},$$
where $T = \{0.5, 0.55, \dots, 0.95\}$ denotes the predefined 10 IoU thresholds, $C_d$ denotes categories in the dataset $d$, $|\cdot|$ denotes the cardinality of a set (e.g., $|C_d| = 80$ for COCO), and $\mathrm{AP}_{t,c}$ denotes AP for an IoU threshold $t$ and a category $c$. For detailed analysis, five additional AP metrics (averaged over categories) are evaluated. AP$_{50}$ and AP$_{75}$ denote AP at single IoU thresholds of $0.5$ and $0.75$, respectively. AP$_S$, AP$_M$, and AP$_L$ are variants of CAP, where target objects are limited to small (area $\le 32^2$), medium ($32^2 \le$ area $\le 96^2$), and large ($96^2 \le$ area) objects, respectively. The area is measured using mask annotations for COCO and bounding box annotations for WOD and M109s.
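The two building blocks of these metrics, box IoU and the scale buckets, can be sketched as follows. This is a simplified illustration and not the official evaluation code: boxes are assumed to be `(x1, y1, x2, y2)` tuples in continuous pixel coordinates, the helper names are hypothetical, the bucket limits are the standard COCO values of 32² and 96² pixels, and an area that falls exactly on a limit is assigned to the smaller bucket.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# The 10 IoU thresholds 0.50, 0.55, ..., 0.95 used by COCO-style AP.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def size_bucket(area: float) -> str:
    """Assign an object to the small, medium, or large bucket by its area."""
    if area <= 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"


iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
print(round(iou, 3))          # 0.333
print(size_bucket(50 * 50))   # medium
```

In this toy case the prediction overlaps half of the ground truth box, which gives an IoU of 1/3 and therefore a false positive at every threshold in `IOU_THRESHOLDS`.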
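As the primary metric for USB, we use the mean COCO-style AP (mCAP) averaged over all datasets $D$ as
$$\mathrm{mCAP} = \frac{1}{|D|} \sum_{d \in D} \mathrm{CAP}_d.$$
Since USB adopts the three datasets described in Sec. 3.2, $\mathrm{mCAP} = (\mathrm{CAP}_{\mathrm{COCO}} + \mathrm{CAP}_{\mathrm{WOD}} + \mathrm{CAP}_{\mathrm{M109s}}) / 3$. Similarly, we define five metrics from AP$_{50}$, AP$_{75}$, AP$_S$, AP$_M$, and AP$_L$ by averaging them over the datasets. We plan to define finer scale-wise metrics for USOD in future work.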
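The averaging behind CAP and mCAP can be sketched as below. The function names and the nested-dictionary layout are hypothetical. The per-dataset values in the usage example are taken from the Faster R-CNN row of the benchmark results, under the assumption that its last three columns are CAP on COCO, WOD, and M109s.

```python
from statistics import mean
from typing import Dict


def coco_style_ap(ap_table: Dict[float, Dict[str, float]]) -> float:
    """CAP: average AP over categories for each IoU threshold, then over thresholds.

    ap_table[t][c] holds the AP for IoU threshold t and category c.
    """
    return mean(mean(per_category.values()) for per_category in ap_table.values())


def mean_coco_style_ap(cap_per_dataset: Dict[str, float]) -> float:
    """mCAP: average CAP over the datasets of the benchmark."""
    return mean(cap_per_dataset.values())


# Toy AP table with two thresholds and two categories.
toy = {0.5: {"face": 0.8, "text": 0.6}, 0.75: {"face": 0.4, "text": 0.2}}
print(round(coco_style_ap(toy), 3))  # 0.5

# Assumed per-dataset CAP values for Faster R-CNN (in percent).
faster_rcnn = {"COCO": 37.4, "WOD": 34.5, "M109s": 65.8}
print(round(mean_coco_style_ap(faster_rcnn), 1))  # 45.9
```

The result of 45.9 agrees with the first value in the same row, which supports reading that column as mCAP.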
For ease of quantitative evaluation, we limit the number of detections per image to 100 across all categories, following the COCO benchmark . For qualitative evaluation, participants may raise the limit to 300 (1% of images in the M109s 15test set contain more than 100 annotations).
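A minimal sketch of the detection limit is shown below. It only illustrates the rule as stated here (a per-image cap across all categories) and is not taken from the official evaluation code; the dictionary layout with a `score` key and the function name are assumptions.

```python
from typing import Dict, List


def keep_top_detections(detections: List[Dict], max_dets: int = 100) -> List[Dict]:
    """Keep the max_dets highest-scoring detections of one image, across all categories."""
    ranked = sorted(detections, key=lambda det: det["score"], reverse=True)
    return ranked[:max_dets]


# Toy example: one image with 150 detections of increasing confidence.
dets = [{"category": "text", "score": i / 150} for i in range(150)]
kept = keep_top_detections(dets)
print(len(kept))                                  # 100
print(len(keep_top_detections(dets, max_dets=300)))  # 150
```

Raising `max_dets` to 300 corresponds to the relaxed limit suggested for qualitative evaluation.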
For fast and accurate detectors for USOD, we designed UniverseNets. Single-stage detectors were adopted for efficiency. See Supplementary Material for details of the methods and architectures used in UniverseNets.
, also known as the PyTorch style). The neck is FPN. We used focal loss , single-scale training, and single-scale testing.
Built on the RetinaNet baseline, we designed UniverseNet by collecting human wisdom about multi-scale object detection as of May 2020. We used ATSS  and SEPC without iBN  (hereafter referred to as ATSEPC). The backbone is Res2Net-50-v1b [19, 20]. Deformable Convolutional Networks (DCN)  were adopted in the backbone and neck. We used multi-scale training. Unless otherwise stated, we used single-scale testing for efficiency.
By adding GFL , SyncBN , and iBN , we designed three variants of UniverseNet around August 2020. UniverseNet-20.08d heavily uses DCN . UniverseNet-20.08 speeds up inference (and training) by the light use of DCN [14, 63]. UniverseNet-20.08s further speeds up inference using the ResNet-50-C  backbone.
Here, we present benchmark results on USB and comparison results with state-of-the-art methods on the three datasets. Thereafter, we analyze the characteristics of detectors by additional experiments. See Supplementary Material for details of the experimental settings and results, including the KITTI-style AP [21, 55] on WOD and the effects of COCO pre-training on M109s.
Our code is built on MMDetection v2. We used the COCO pre-trained models of the repository for existing methods (Faster R-CNN with FPN, Cascade R-CNN, RetinaNet, ATSS, and GFL). We trained all models with Stochastic Gradient Descent (SGD).
The default hyperparameters are listed in Table 5. Most values follow standard settings [10, 36, 69, 35]. We used some dataset-dependent values. For M109s, we roughly tuned the learning rates (LR) based on a preliminary experiment with the RetinaNet  baseline model. Test scales were determined within the standard USB protocol, considering the typical aspect ratio of the images in each dataset. The ranges for multi-scale training for COCO and M109s follow prior work . We used larger scales for WOD because the objects in WOD are especially small.
COCO models were fine-tuned from ImageNet pre-trained backbones. We trained the models for WOD and M109s from the corresponding COCO pre-trained models. We follow the learning rate schedules of MMDetection. We mainly used the 1× schedule (12 epochs). For comparison with state-of-the-art methods on COCO, we used the 2× schedule (24 epochs) for most models and the 20e schedule (20 epochs) for UniverseNet-20.08d due to overfitting with the 2× schedule. For comparison with state-of-the-art methods on WOD, we trained UniverseNet on the WOD full training set for 7 epochs. We used a learning rate of for 6 epochs and for the last epoch.
|Method||mCAP||AP50||AP75||APS||APM||APL||COCO||WOD||M109s|
|Faster R-CNN||45.9||68.2||49.1||15.2||38.9||62.5||37.4||34.5||65.8|
|Cascade R-CNN||48.1||68.5||51.5||15.6||41.3||65.9||40.3||36.4||67.6|
|ATSEPC [69, 63]||48.1||68.5||51.2||15.5||40.5||66.8||42.1||35.0||67.1|
We trained and evaluated methods on the USB. All methods follow the Standard USB 1.0 protocol using the default hyperparameters in Sec. 5.1. The results are shown in Table 6. UniverseNet-20.08 achieves the highest results on all datasets, resulting in 52.1% mCAP. In most cases, methods that work on COCO also work on the other datasets. Cascade R-CNN  and ATSS  achieve over 2% more mCAP than Faster R-CNN  and RetinaNet , respectively. In some cases, methods that work on COCO show small or negative effects on WOD and M109s. Thus, USB can impose a penalty on COCO-biased methods.
To compare the effectiveness of each method on each dataset, we show the correlation between mCAP and CAP on each dataset in Figure 3. SEPC  improves COCO CAP and deteriorates WOD CAP. Multi-stage detectors [47, 8] show relatively high CAP on WOD and relatively low CAP on COCO. Adding GFL  is especially effective on M109s (see improvements from ATSS to GFL and from UniverseNet to UniverseNet-20.08).
We also show detailed results on each dataset. Table 9 shows the COCO results. Since the effectiveness of existing methods has been verified on COCO, their improvements are steady. Table 9 shows the WOD results. Adding SEPC to ATSS decreases all metrics except for AP. We found that this reduction does not occur at large test scales in higher USB evaluation protocols (see Sec. 5.4). UniverseNet-20.08 shows worse results than UniverseNet in some metrics, probably due to the light use of DCN for fast inference (see Table 13). Table 9 shows the M109s results. Interestingly, improvements by ATSS are smaller than those on COCO and WOD due to the drop of face AP. We conjecture that this phenomenon comes from the domain differences discussed in Sec. 3.2 and prior work, although we should explore it in future work.
|Protocol||Method||Backbone||DCN||Epoch||Max test scale||TTA||FPS||AP||AP50||AP75||APS||APM||APL||Reference|
|Standard USB 1.0||Faster R-CNN [47, 35]||ResNet-101|| ||22||1333×800|| ||(14.2)||36.2||59.1||39.0||18.2||39.0||48.2||CVPR17|
|Standard USB 1.0||Cascade R-CNN||ResNet-101|| ||19||1312×800|| ||(11.9)||42.8||62.1||46.3||23.7||45.5||55.2||CVPR18|
|Standard USB 1.0||RetinaNet||ResNet-101|| ||18||1333×800|| ||(13.6)||39.1||59.1||42.3||21.8||42.7||50.2||ICCV17|
|Standard USB 1.0||FCOS||X-101 (64×4d)|| ||24||1333×800|| ||(8.9)||44.7||64.1||48.4||27.6||47.5||55.6||ICCV19|
|Standard USB 1.0||ATSS||X-101 (64×4d)||✓||24||1333×800|| ||10.6||47.7||66.5||51.9||29.7||50.8||59.4||CVPR20|
|Standard USB 1.0||FreeAnchor+SEPC||X-101 (64×4d)||✓||24||1333×800|| ||—||50.1||69.8||54.3||31.3||53.3||63.7||CVPR20|
|Standard USB 1.0||PAA||X-101 (64×4d)||✓||24||1333×800|| ||—||49.0||67.8||53.3||30.2||52.8||62.2||ECCV20|
|Standard USB 1.0||PAA||X-152 (32×8d)||✓||24||1333×800|| ||—||50.8||69.7||55.1||31.4||54.7||65.2||ECCV20|
|Standard USB 1.0||RepPoints v2||X-101 (64×4d)||✓||24||1333×800|| ||(3.8)||49.4||68.9||53.4||30.3||52.1||62.3||NeurIPS20|
|Standard USB 1.0||RelationNet++||X-101 (64×4d)||✓||20||1333×800|| ||10.3||50.3||69.0||55.0||32.8||55.0||65.8||NeurIPS20|
|Standard USB 1.0||GFL||ResNet-50|| ||24||1333×800|| ||37.2||43.1||62.0||46.8||26.0||46.7||52.3||NeurIPS20|
|Standard USB 1.0||GFL||ResNet-101|| ||24||1333×800|| ||29.5||45.0||63.7||48.9||27.2||48.8||54.5||NeurIPS20|
|Standard USB 1.0||GFL||ResNet-101||✓||24||1333×800|| ||22.8||47.3||66.3||51.4||28.0||51.1||59.2||NeurIPS20|
|Standard USB 1.0||GFL||X-101 (32×4d)||✓||24||1333×800|| ||15.4||48.2||67.4||52.6||29.2||51.7||60.2||NeurIPS20|
|Standard USB 1.0||UniverseNet-20.08s||ResNet-50-C||✓||24||1333×800|| ||31.6||47.4||66.0||51.4||28.3||50.8||59.5||(Ours)|
|Standard USB 1.0||UniverseNet-20.08||Res2Net-50-v1b||✓||24||1333×800|| ||24.9||48.8||67.5||53.0||30.1||52.3||61.1||(Ours)|
|Standard USB 1.0||UniverseNet-20.08d||Res2Net-101-v1b||✓||20||1333×800|| ||11.7||51.3||70.0||55.8||31.7||55.3||64.9||(Ours)|
|Large USB 1.0||UniverseNet-20.08d||Res2Net-101-v1b||✓||20||1493×896|| ||11.6||51.5||70.2||56.0||32.8||55.5||63.7||(Ours)|
|Large USB 1.0||UniverseNet-20.08d||Res2Net-101-v1b||✓||20||2000×1200||5||—||53.8||71.5||59.4||35.3||57.3||67.3||(Ours)|
|Huge USB 1.0||ATSS||X-101 (64×4d)||✓||24||3000×1800||13||—||50.7||68.9||56.3||33.2||52.9||62.4||CVPR20|
|Huge USB 1.0||PAA||X-101 (64×4d)||✓||24||3000×1800||13||—||51.4||69.7||57.0||34.0||53.8||64.0||ECCV20|
|Huge USB 1.0||PAA||X-152 (32×8d)||✓||24||3000×1800||13||—||53.5||71.6||59.1||36.0||56.3||66.9||ECCV20|
|Huge USB 1.0||RepPoints v2||X-101 (64×4d)||✓||24||3000×1800||13||—||52.1||70.1||57.5||34.5||54.6||63.6||NeurIPS20|
|Huge USB 1.0||RelationNet++||X-101 (64×4d)||✓||20||3000×1800||13||—||52.7||70.4||58.3||35.8||55.3||64.7||NeurIPS20|
|Huge USB 1.0||UniverseNet-20.08d||Res2Net-101-v1b||✓||20||3000×1800||13||—||54.1||71.6||59.9||35.8||57.2||67.4||(Ours)|
|Huge USB 2.0||TSD||SENet-154||✓||34||2000×1400||4||—||51.2||71.9||56.0||33.8||54.8||64.2||CVPR20|
|Huge USB 2.5||DetectoRS||X-101 (32×4d)||✓||40||2400×1600||5||—||54.7||73.5||60.1||37.4||57.3||66.4||arXiv20|
|Mini USB 3.0||EfficientDet-D0||EfficientNet-B0|| ||300||512×512|| ||98.0||33.8||52.2||35.8||12.0||38.3||51.2||CVPR20|
|Mini USB 3.1||YOLOv4||CSPDarknet-53|| ||273||512×512|| ||83||43.0||64.9||46.5||24.3||46.1||55.2||arXiv20|
|Standard USB 3.0||EfficientDet-D2||EfficientNet-B2|| ||300||768×768|| ||56.5||43.0||62.3||46.2||22.5||47.0||58.4||CVPR20|
|Standard USB 3.0||EfficientDet-D4||EfficientNet-B4|| ||300||1024×1024|| ||23.4||49.4||69.0||53.4||30.3||53.2||63.2||CVPR20|
|Standard USB 3.1||YOLOv4||CSPDarknet-53|| ||273||608×608|| ||62||43.5||65.7||47.3||26.7||46.7||53.3||arXiv20|
|Large USB 3.0||EfficientDet-D5||EfficientNet-B5|| ||300||1280×1280|| ||13.8||50.7||70.2||54.7||33.2||53.9||63.2||CVPR20|
|Large USB 3.0||EfficientDet-D6||EfficientNet-B6|| ||300||1280×1280|| ||10.8||51.7||71.2||56.0||34.1||55.2||64.1||CVPR20|
|Large USB 3.0||EfficientDet-D7||EfficientNet-B6|| ||300||1536×1536|| ||8.2||52.2||71.4||56.3||34.8||55.5||64.6||CVPR20|
Table 10: State-of-the-art methods on COCO test-dev. We classify methods by proposed protocols. X in the Backbone column denotes ResNeXt. See method papers for other backbones. TTA: Test-time augmentation including horizontal flip and multi-scale testing (numbers denote scales). FPS values without and with parentheses were measured on V100 with mixed precision and other environments, respectively. We measured the FPS of GFL models in our environment and estimated those of ATSS and RelationNet++ based on the measured values and [33, 13]. The settings of other methods are based on conference papers, their arXiv versions, and authors’ codes. The values shown in gray were estimated from descriptions in papers and codes. Some FPS values are from .
|Methods including multi-stage detector:|
|13||YOLO V4 ||1+||58.08|
|14||ATSS-Efficientnet [69, 58]||1+||56.99|
COCO. We show state-of-the-art methods on COCO test-dev (as of November 14, 2020) in Table 10. Our UniverseNet-20.08d achieves the highest AP (51.3%) in the Standard USB 1.0 protocol. Despite 12.5× fewer epochs, the speed-accuracy trade-offs of our models are comparable to those of EfficientDet (see also Figure 1). With 13-scale TTA, UniverseNet-20.08d achieves the highest AP (54.1%) in the Huge USB 1.0 protocol. Even with 5-scale TTA in the Large USB 1.0, it achieves 53.8% AP, which is higher than other methods in the USB 1.0 protocols.
WOD. For comparison with state-of-the-art methods on WOD, we submitted the detection results of UniverseNet to the Waymo Open Dataset Challenge 2020 2D detection, a competition held at a CVPR 2020 workshop. The primary metric is AP/L2, a KITTI-style AP evaluated with LEVEL_2 objects [55, 2]. We used multi-scale testing with soft-NMS . The shorter side pixels of test scales are
, including 8 pixels padding. These scales enable utilizing SEPC (see Sec. 5.4) and detecting small objects. Table 11 shows the top teams’ results. UniverseNet achieves 67.42% AP/L2 without multi-stage detectors, ensembles, expert models, or heavy backbones, unlike other top methods. RW-TSDet overwhelms other multi-stage detectors, whereas UniverseNet overwhelms other single-stage detectors. These two methods used light backbones and large test scales. Interestingly, the maximum test scales are the same (3360×2240). We conjecture that this is not a coincidence but a convergence caused by searching the accuracy saturation point.
Manga109-s. To the best of our knowledge, no prior work has reported detection results on the Manga109-s dataset (87 volumes). Although many settings differ, the state-of-the-art method on the full Manga109 dataset (109 volumes, non-public to commercial organizations) achieves 77.1–92.0% (mean: 84.2%) AP on ten test volumes . The mean AP of UniverseNet-20.08 on the 15test set (92.5%) is higher than those results.
Chaotic state of the art. As shown in Table 10, state-of-the-art detectors on the COCO benchmark were trained with various settings. Comparisons across different training epochs are especially difficult because long training does not decrease FPS, unlike large test scales. Nevertheless, the EfficientDet , YOLOv4 , and SpineNet  papers compare methods in their tables without specifying the difference in training epochs. The compatibility of the USB training protocols (Sec. 3.3) resolves this disorder. We hope that many papers report results with the protocols for inclusive, healthy, and sustainable development of detectors.
Ablation studies. We show the results of ablation studies for UniverseNets on COCO in Table 13. SEPC, Res2Net-v1b [19, 20], DCN, multi-scale training, GFL, SyncBN, and iBN improve AP. UniverseNet-20.08d is much more accurate (48.6% AP) than other models trained for 12 epochs using ResNet-50-level backbones (e.g., ATSS: 39.4% [69, 10], GFL: 40.2% [33, 10]). As shown in Table 13, UniverseNet-20.08 is 1.4× faster than UniverseNet-20.08d at the cost of a 1% AP drop. UniverseNet-20.08s, the variant with ResNet-50-C backbone in Table 13, shows a good speed-accuracy trade-off by achieving 45.8% AP and over 30 FPS.
Generalization. To evaluate the generalization ability, we show the results on another dataset outside USB. We trained UniverseNet on NightOwls, a dataset for person detection at night, from the WOD pre-trained model in Sec. 5.3. The top teams’ results of the NightOwls Detection Challenge 2020 are shown in Table 13. UniverseNet is more accurate than other methods, even without TTA, and should be faster than the runner-up method that uses larger test scales and a heavy model (Cascade R-CNN, ResNeXt-101, CBNet, Double-Head, DCN, and soft-NMS) [NightOwls_talks_CVPRW2020_anonymize].
Test scales. We show the results on WOD at different test scales in Figure 4. Single-stage detectors require larger test scales than multi-stage detectors to achieve peak performance, probably because they cannot extract features from precisely localized region proposals. Although ATSEPC shows lower AP than ATSS at the default test scale (1248×832 in Standard USB), it outperforms ATSS at larger test scales (e.g., 1920×1280 in Large USB). We conjecture that we should enlarge object scales in images to utilize SEPC because its DCN enlarges effective receptive fields. SEPC and DCN prefer large objects empirically (see Tables 13, 13, [63, 14]), and DCN cannot increase the sampling points for objects smaller than the kernel size in principle. By utilizing the characteristics of SEPC and multi-scale training, UniverseNets achieve the highest AP in a wide range of test scales.
We introduced USB, a benchmark for universal-scale object detection. To resolve unfair comparisons in existing benchmarks, we established USB training/evaluation protocols. Our UniverseNets achieved state-of-the-art results on USB and existing benchmarks. We found some weaknesses in the existing methods to be addressed in future research.
There are three limitations in this work. (1) USB depends on datasets with many instances. Reliable scale-wise metrics for small datasets should be considered. (2) We adopted single-stage detectors for UniverseNets and trained detectors in the USB 1.0 protocol. Although these settings are practical, it is worth exploring multi-stage detectors in higher protocols. (3) The architectures and results of UniverseNets are still biased toward COCO due to ablation studies and pre-training on COCO. Less biased and more universal detectors should be developed in future research.
The proposed USB protocols can be applied to other tasks with modifications. We believe that our work is an important step toward recognizing universal-scale objects by connecting various experimental settings.
Backbones and modules. The Inception module arranges 1×1, 3×3, and 5×5 convolutions to cover multi-scale regions. The residual block adds multi-scale features from shortcut connections and 3×3 convolutions. ResNet-C and ResNet-D replace the first layer of ResNet with the deep stem (three 3×3 convolutions). The Res2Net module stacks 3×3 convolutions hierarchically to represent multi-scale features. Res2Net-v1b adopts the deep stem with the Res2Net module. The deformable convolution module in Deformable Convolutional Networks (DCN) adjusts the receptive field adaptively by deforming the sampling locations of standard convolutions. These modules are mainly used in backbones.
Necks. To combine and enhance backbones’ representation, necks follow backbones. Feature Pyramid Networks (FPN) adopt a top-down path and lateral connections, like architectures for semantic segmentation. Scale-Equalizing Pyramid Convolution (SEPC) introduces pyramid convolution across feature maps with different resolutions and utilizes DCN to align the features.
Heads and training sample selection. Faster R-CNN spreads multi-scale anchors over a feature map. SSD spreads multi-scale anchors over multiple feature maps with different resolutions. ATSS eliminates the need for multi-scale anchors by dividing positive and negative samples according to object statistics across pyramid levels.
Multi-scale training and testing. Traditionally, the image pyramid is an essential technique to handle multi-scale objects. Although recent detectors can output multi-scale objects from a single-scale input, many works use multi-scale inputs to improve performance [47, 36, 69, 63]. In a popular implementation, multi-scale training randomly chooses a scale at each iteration for (training-time) data augmentation. Multi-scale testing infers multi-scale inputs and merges their outputs for Test-Time Augmentation (TTA). SNIP limits the range of object scales at each image scale during training and testing.
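The random scale selection used in multi-scale training can be illustrated with a minimal sketch. It is not taken from any particular library: the function names and the scale range in the usage note are hypothetical, and the sketch assumes that the long side is capped while the short side is sampled uniformly, with the aspect ratio kept when resizing.

```python
import random


def sample_train_scale(scale_range, rng):
    """Pick one (long_side, short_side) target uniformly at random.

    scale_range is ((long_min, short_min), (long_max, short_max)).
    This format is an assumption made for illustration.
    """
    (long_min, short_min), (long_max, short_max) = scale_range
    long_side = rng.randint(long_min, long_max)
    short_side = rng.randint(short_min, short_max)
    return long_side, short_side


def rescale_keep_ratio(width, height, target):
    """Return the image size that fits inside `target`, keeping aspect ratio."""
    target_long, target_short = max(target), min(target)
    factor = min(target_long / max(width, height),
                 target_short / min(width, height))
    return round(width * factor), round(height * factor)


# One draw per training iteration (illustrative range, not the paper's setting).
rng = random.Random(0)
target = sample_train_scale(((1333, 480), (1333, 960)), rng)
new_size = rescale_keep_ratio(640, 480, target)
```

Drawing a new target at every iteration exposes the detector to each object at several apparent sizes, which is the data-augmentation effect described above.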
|Bakuretsu! Kung Fu Girl||Romantic comedy|
|Eva Lady||Science fiction|
|Hinagiku Kenzan!||Love romance|
|Love Hina vol. 1||Romantic comedy|
|Momoyama Haikagura||Historical drama|
|Tennen Senshi G||Humor|
|Uchi no Nyan’s Diary||Animal|
|Unbalance Tokyo||Science fiction|
|Yamato no Hane||Sports|
|Yume no Kayoiji||Fantasy|
|Yumeiro Cooking||Love romance|
|Healing Planet||Science fiction|
|Love Hina vol. 14||Romantic comedy|
|68train set: All the other volumes|
The Manga109-s dataset (87 volumes) is a subset of the full Manga109 dataset (109 volumes). Unlike the full Manga109 dataset, the Manga109-s dataset can be used by commercial organizations. The dataset splits for the full Manga109 dataset used in prior work cannot be used for the Manga109-s dataset.
We defined the Manga109-s dataset splits shown in Table 14. Unlike the alphabetical-order splits used in the prior work, we selected the volumes carefully. The 15test set was selected to be well-balanced for reliable evaluation. Five volumes in the 15test set were selected from the 10 test volumes used in the prior work to enable partially direct comparison. All the authors of the 15test and 4val sets are different from those of the 68train set to evaluate generalizability.
There are 118,287 images in COCO train2017, 5,000 in COCO val2017, 79,735 in WOD f0train, 20,190 in WOD f0val, 6,760 in M109s 68train, 419 in M109s 4val, and 1,354 in M109s 15test.
|Method||Head||Neck||Backbone||Input||FPS||COCO (1 schedule)|
|ATSEPC [69, 63]||✓||✓||P, LC||25.0||42.1||59.9||45.5||24.6||46.1||55.0|
|UniverseNet-20.08 w/o SEPC ||✓||✓||✓||c5||✓||✓||26.7||45.8||64.6||50.0||27.6||50.4||59.7|
|UniverseNet-20.08 w/o Res2Net-v1b [19, 20]||✓||✓||✓||LC||✓||c5||✓||✓||32.8||44.7||62.8||48.4||27.1||48.8||59.5|
|UniverseNet-20.08 w/o DCN ||✓||✓||✓||✓||✓||✓||✓||27.8||45.9||64.5||49.8||28.9||49.9||59.0|
|UniverseNet-20.08 w/o iBN, SyncBN [63, 44]||✓||✓||✓||LC||✓||c5||✓||25.7||45.8||64.0||50.2||27.9||50.0||59.8|
|UniverseNet-20.08 w/o MStrain||✓||✓||✓||LC||✓||✓||c5||✓||24.8||45.9||64.5||49.6||27.4||50.5||60.1|
. PConv (Pyramid Convolution) and iBN (integrated Batch Normalization) are the components of SEPC. The DCN columns indicate where to apply DCN. “P”: The PConv modules in the combined head of SEPC . “LC”: The extra head of SEPC for localization and classification . “c3-c5”: conv3_x, conv4_x, and conv5_x layers in ResNet-style backbones . “c5”: conv5_x layers in ResNet-style backbones . ATSEPC: ATSS with SEPC (without iBN). MStrain: Multi-scale training. FPS: Frames per second on one V100 with mixed precision.
The rounding error of epochs between epoch- and iteration-based training can be ignored when calculating the maximum epochs. Small differences of eight pixels or less can be ignored when calculating the maximum resolution. For example, DSSD513 will be compared in Mini USB.
We show the detailed architectures of UniverseNets in Table 15.
Here, we show the details of experimental settings and results. See also the code to reproduce our settings including minor hyperparameters.
We follow the learning rate schedules of MMDetection, which are similar to those of Detectron. Specifically, the learning rates are reduced by a factor of 10 at two predefined epochs. The epochs for the first learning rate decay, the second decay, and the end of training are (8, 11, 12) for the 1× schedule, (16, 22, 24) for the 2× schedule, and (16, 19, 20) for the 20e schedule. The 20e schedule is a reasonable choice for avoiding overfitting caused by small learning rates.
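The step schedules can be written down as a small sketch. It assumes the milestone epochs of the standard MMDetection configurations, namely (8, 11, 12) for 1×, (16, 22, 24) for 2×, and (16, 19, 20) for 20e, counts epochs from zero, and ignores warmup. The function name and the base learning rate in the usage note are hypothetical.

```python
# (first decay, second decay, end of training), in epochs.
# Values assumed from the standard MMDetection schedule configs.
SCHEDULES = {
    "1x": (8, 11, 12),
    "2x": (16, 22, 24),
    "20e": (16, 19, 20),
}


def learning_rate(base_lr, epoch, schedule, gamma=0.1):
    """Learning rate used during the zero-based `epoch` (warmup ignored)."""
    first, second, end = SCHEDULES[schedule]
    if not 0 <= epoch < end:
        raise ValueError(f"epoch must be in [0, {end}) for schedule {schedule}")
    # Each milestone already passed multiplies the rate by gamma.
    drops = int(epoch >= first) + int(epoch >= second)
    return base_lr * gamma ** drops


# Example: rates over a 1x run with an illustrative base rate of 0.01.
rates_1x = [learning_rate(0.01, e, "1x") for e in range(12)]
```

With these milestones, the 20e schedule spends only one epoch at the smallest learning rate, compared with two epochs for the 2× schedule.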
We mainly used ImageNet pre-trained backbones that are standard in MMDetection. Some pre-trained Res2Net backbones not supported in MMDetection were downloaded from the Res2Net repository. We trained most models with mixed precision and 4 GPUs (× 4 images per GPU). All results on USB and all results of UniverseNets are single model results without ensemble.
For comparison with state-of-the-art methods with TTA on COCO, we used soft voting with 13-scale testing and horizontal flipping following the original implementation of ATSS. Specifically, shorter side pixels are (400, 500, 600, 640, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1800), while longer side pixels are 1.667× the shorter side. For the 13 test scales, target objects are limited to the corresponding 13 predefined ranges ((96, ∞), (96, ∞), (64, ∞), (64, ∞), (64, ∞), (0, ∞), (0, ∞), (0, ∞), (0, 256), (0, 256), (0, 192), (0, 192), (0, 96)), where each tuple denotes the minimum and maximum object scales. Each object scale is measured by $\sqrt{wh}$, where $w$ and $h$ denote the object’s width and height, respectively. We also evaluated 5-scale TTA because the above-mentioned ATSS-style TTA is slow. We picked (400, 600, 800, 1000, 1200) for shorter side pixels, and ((96, ∞), (64, ∞), (0, ∞), (0, ∞), (0, 256)) for object scale ranges.
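The per-scale filtering in the 5-scale TTA can be sketched as below. Two points are assumptions rather than statements from the text: the upper bounds that are missing in the extracted ranges are taken to be infinity, and the object scale is taken to be the square root of width times height. The boundary handling (inclusive lower bound, exclusive upper bound) and all function names are hypothetical choices, and the soft-voting merge step is omitted.

```python
import math

INF = math.inf

# (shorter side in pixels, (min object scale, max object scale)).
# Upper bounds written as INF are an assumption about the garbled ranges.
FIVE_SCALE_TTA = [
    (400, (96, INF)),
    (600, (64, INF)),
    (800, (0, INF)),
    (1000, (0, INF)),
    (1200, (0, 256)),
]


def object_scale(box):
    """Scale of an (x1, y1, x2, y2) box, assumed to be sqrt(width * height)."""
    x1, y1, x2, y2 = box
    return math.sqrt((x2 - x1) * (y2 - y1))


def keep_in_range(boxes, scale_range):
    """Keep boxes whose scale lies in [low, high) in original image coordinates."""
    low, high = scale_range
    return [box for box in boxes if low <= object_scale(box) < high]


# Detections from each test scale are filtered before being merged.
detections = [(0, 0, 50, 50), (0, 0, 100, 100), (0, 0, 300, 300)]
per_scale = {side: keep_in_range(detections, rng) for side, rng in FIVE_SCALE_TTA}
```

The design intent is that small test scales contribute only large objects and large test scales contribute only small or medium objects, since each scale is most reliable for the object sizes it renders at a moderate resolution.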
NightOwls is a dataset for person detection at night. It contains three categories (pedestrian, bicycle driver, and motorbike driver). In contrast to WOD, it is important to detect medium or large objects because the evaluation of NightOwls follows the reasonable setting, where small objects (less than 50 pixels tall) are ignored. We prevented the overfitting of the driver categories (bicycle driver and motorbike driver) in two ways. The first is to map the classifier layer of the WOD pre-trained model. We transferred the weights for cyclists learned on the richer WOD to those for the NightOwls driver categories. The second is early stopping. We trained the model for 2 epochs (4,554 iterations) without background images.
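The classifier-layer mapping can be illustrated with a simplified sketch. It treats the classifier as one weight row and one bias per class, which ignores the convolutional shape of a real detection head. The class orderings, the pedestrian-to-pedestrian mapping, and the function name are assumptions made for illustration; only the transfer of cyclist weights to the two driver categories comes from the description above.

```python
# Class orderings are assumed for illustration.
WOD_CLASSES = ["vehicle", "pedestrian", "cyclist"]
NIGHTOWLS_CLASSES = ["pedestrian", "bicycle driver", "motorbike driver"]

# Which WOD class initializes each NightOwls class.
SOURCE_CLASS = {
    "pedestrian": "pedestrian",      # assumed direct correspondence
    "bicycle driver": "cyclist",     # from the text: cyclist -> driver categories
    "motorbike driver": "cyclist",
}


def remap_classifier(weights, biases):
    """Build NightOwls classifier rows by copying rows of the WOD classifier."""
    new_weights, new_biases = [], []
    for target in NIGHTOWLS_CLASSES:
        src = WOD_CLASSES.index(SOURCE_CLASS[target])
        new_weights.append(list(weights[src]))  # copy so rows stay independent
        new_biases.append(biases[src])
    return new_weights, new_biases


# Toy weights: one row per WOD class.
wod_w = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
wod_b = [0.1, 0.2, 0.3]
no_w, no_b = remap_classifier(wod_w, wod_b)
```

Starting both driver categories from the cyclist weights gives them a useful initialization even though NightOwls provides few examples of them.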
We describe the results of ablation studies for UniverseNets on COCO in more detail. As shown in Table 12a, ATSEPC (ATSS with SEPC without iBN) outperforms ATSS by a large margin. The effectiveness of SEPC for ATSS is consistent with those for other detectors reported in the SEPC paper. As shown in Table 12b, UniverseNet further improves AP metrics by 5% by adopting Res2Net-v1b [19, 20], DCN, and multi-scale training. As shown in Table 12c, adopting GFL improves AP by 0.8%. There is room for improvement of AP in the Quality Focal Loss of GFL. As shown in Table 12d, UniverseNet-20.08d achieves 48.6% AP by making more use of BatchNorm (SyncBN and iBN). It is much more accurate than other models trained for 12 epochs using ResNet-50-level backbones (e.g., ATSS: 39.4% [69, 10], GFL: 40.2% [33, 10]). On the other hand, the inference is not so fast (less than 20 FPS) due to the heavy use of DCN. UniverseNet-20.08 speeds up inference by the light use of DCN [14, 63]. As shown in Table 12e, UniverseNet-20.08 is 1.4× faster than UniverseNet-20.08d at the cost of a 1% AP drop. To further verify the effectiveness of each technique, we conducted ablation from UniverseNet-20.08 shown in Table 12f. All techniques contribute to the high AP of UniverseNet-20.08. Ablating the Res2Net-v1b backbone (replacing Res2Net-50-v1b [19, 20] with ResNet-50-B) has the largest effect. Res2Net-v1b improves AP by 2.8% and increases the inference time by 1.3×. To further investigate the effectiveness of backbones, we trained variants of UniverseNet-20.08 as shown in Table 12g. Although the Res2Net module makes inference slower, the deep stem used in ResNet-50-C and Res2Net-50-v1b [19, 20] improves AP metrics with similar speeds. UniverseNet-20.08s (the variant using the ResNet-50-C backbone) shows a good speed-accuracy trade-off by achieving 45.8% AP and over 30 FPS.
To analyze differences by metrics, we evaluated the KITTI-style AP (KAP) on WOD. KAP is a metric used in benchmarks for autonomous driving [21, 55]. Using different IoU thresholds (0.7 for vehicles, and 0.5 for pedestrians and cyclists), KAP is calculated as $\mathrm{KAP} = (\mathrm{AP}_{0.7,\mathrm{veh.}} + \mathrm{AP}_{0.5,\mathrm{ped.}} + \mathrm{AP}_{0.5,\mathrm{cyc.}}) / 3$. The results of KAP are shown in Figure 5. For ease of comparison, we show again the results of CAP in Figure 5. GFL and Cascade R-CNN, which focus on localization quality, are less effective for KAP.
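The KAP computation can be sketched as follows, under the assumption that KAP is the unweighted mean of the three class-wise APs, each taken at its own IoU threshold. The function name, the table layout, and the AP values in the usage note are made up for illustration and are not results from the paper.

```python
# IoU threshold per class, as stated in the text.
KITTI_IOU = {"vehicle": 0.7, "pedestrian": 0.5, "cyclist": 0.5}


def kitti_style_ap(ap_table):
    """Mean of class-wise APs, each read at its class-specific IoU threshold.

    ap_table[class_name][iou_threshold] -> AP. This layout is hypothetical.
    The unweighted mean over classes is an assumption.
    """
    per_class = [ap_table[name][iou] for name, iou in KITTI_IOU.items()]
    return sum(per_class) / len(per_class)


# Made-up AP values, for illustration only.
example = {
    "vehicle": {0.5: 0.90, 0.7: 0.60},
    "pedestrian": {0.5: 0.75},
    "cyclist": {0.5: 0.45},
}
kap = kitti_style_ap(example)
```

Because each class is scored at a single, fairly loose IoU threshold, gains in precise localization contribute less to KAP than to a metric averaged over stricter thresholds.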
To verify the effects of COCO pre-training, we trained UniverseNet-20.08 on M109s from different pre-trained models. Table 16 shows the results. COCO pre-training improves all the metrics, especially body AP.
We also trained models with the eight methods on M109s from ImageNet pre-trained backbones. We halved the learning rates in Table 5 and doubled the warmup iterations (from 500 to 1,000) because the training of single-stage detectors without COCO pre-training or SyncBN is unstable. The CAP without COCO pre-training is 1.9% lower than that with COCO pre-training (Table 6) on average.