USB: Universal-Scale Object Detection Benchmark

by Yosuke Shinya, et al.

Benchmarks, such as COCO, play a crucial role in object detection. However, existing benchmarks are insufficient in scale variation, and their protocols are inadequate for fair comparison. In this paper, we introduce the Universal-Scale object detection Benchmark (USB). USB has variations in object scales and image domains by incorporating COCO with the recently proposed Waymo Open Dataset and Manga109-s dataset. To enable fair comparison, we propose USB protocols by defining multiple thresholds for training epochs and evaluation image resolutions. By analyzing methods on the proposed benchmark, we designed fast and accurate object detectors called UniverseNets, which surpassed all baselines on USB and achieved state-of-the-art results on existing benchmarks. Specifically, UniverseNets achieved 54.1% AP on COCO with 20 epochs of training, the top result among single-stage detectors on the Waymo Open Dataset Challenge 2020 2D detection, and the first place in the NightOwls Detection Challenge 2020 all objects track. The code is publicly available.




1 Introduction

Figure 1: Speed-accuracy trade-offs in the current standard COCO benchmark. Most works train models with standard settings (e.g., within 24 epochs), while some works train with abnormal settings (e.g., 300 epochs). To enable fair comparison, we propose USB protocols that urge the latter works to report results with standard settings. Additionally, we design UniverseNets that achieve state-of-the-art results with standard settings. We show some detection examples of UniverseNet-20.08 in the teaser figure.

Humans can detect various objects (see the teaser figure). One can detect close equipment in everyday scenes, far vehicles in traffic scenes, and texts and persons in manga (Japanese comics). If computers can automatically detect various objects, they will yield significant benefits to humans. For example, they will help impaired people and the elderly, save lives by autonomous driving, and provide safe entertainment during pandemics by automatic translation.

Researchers have pushed the limits of object detection systems by establishing datasets and benchmarks [38]. One of the most important milestones is Pascal VOC [17]. It has enabled considerable research on object detection, leading to the success of deep learning-based methods and successor datasets such as ImageNet [49] and COCO [37]. Currently, COCO serves as the standard dataset and benchmark for object detection because it has several advantages over Pascal VOC [17]. COCO contains more images, categories, and objects (especially small objects) in their natural context [37]. Using COCO, researchers can develop and evaluate methods for multi-scale object detection. However, the current object detection benchmarks, especially COCO, have the following three problems.

Problem 1: Variations in object scales and image domains remain limited. To realize human-level perception, computers must handle various object scales and image domains as humans can. Among various domains [62], the traffic and artificial domains have extensive scale variations (see Sec. 3). COCO is far from covering them. Nevertheless, the current computer vision community is overconfident in COCO results. For example, most studies on state-of-the-art methods in 2020 only report COCO results [69, 63, 31, 33, 12, 13] or those for bounding box object detection [59, 16, 5, 45]. Readers cannot assess whether these methods are specialized for COCO or generalizable to other datasets and domains.

Problem 2: Protocols for training and evaluation are not well established. There are standard experimental settings for the COCO benchmark [22, 10, 35, 36, 61, 69, 33]. Many works train detectors within 24 epochs using a learning rate of 0.01 or 0.02 and evaluate them on images within 1333×800. These settings are not obligations but non-binding agreements for fair comparison. Some works do not follow the standard settings for accurate and fast detectors (see Footnote 1). Their abnormal and scattered settings hinder the assessment of the most suitable method (see Figure 1). Furthermore, by “buying stronger results” [50], they build a barrier for those without considerable funds to develop and train detectors.

Footnote 1: YOLOv4 was trained for 273 epochs [5], DETR for 500 epochs [9], EfficientDet-D6 for 300 epochs [59], and EfficientDet-D7x for 600 epochs [60]. SpineNet uses a learning rate of 0.28 [16], and YOLOv4 uses a searched learning rate of 0.00261 [5]. EfficientDet finely changes the image resolution from 512×512 to 1536×1536 [59].

Problem 3: The analysis of methods for multi-scale object detection is insufficient. Numerous studies have proposed methods for multi-scale object detection [38, 47, 39, 35, 14]. In recent years, improvements for network components have made significant progress in COCO (e.g., Res2Net [19] for the backbone, SEPC [63] for the neck, and ATSS [69] for the head). These works have an insufficient analysis of combinability, effectiveness, and characteristics, especially on datasets other than COCO.

This study makes the following three contributions to resolve the problems.

Contribution 1: We introduce the Universal-Scale object detection Benchmark (USB) that consists of three datasets. In addition to COCO, we selected the Waymo Open Dataset [55] and Manga109-s [40, 3] to cover various object scales and image domains. They are the largest public datasets in their domains and enable reliable comparisons. We conducted experiments using eight methods and found weaknesses of existing COCO-biased methods.

Contribution 2: We established the USB protocols for fair training and evaluation for more inclusive object detection research. USB protocols enable fair, easy, and scalable comparisons by defining multiple thresholds for training epochs and evaluation image resolutions.

Contribution 3: We designed fast and accurate object detectors called UniverseNets by analyzing methods developed for multi-scale object detection. UniverseNets outperformed all baselines on USB and achieved state-of-the-art results on existing benchmarks. In particular, our finding on USB enables a 9.3 points higher score than YOLOv4 [5] on the Waymo Open Dataset Challenge 2020 2D detection.

2 Related Work

2.1 Object Detection Methods

Deep learning-based detectors dominate the recent progress in object detection [38]. They can be divided into [38, 10, 47] single-stage detectors without region proposal [46, 39, 36] and multi-stage (including two-stage) detectors with region proposal [47, 35, 8]. Our UniverseNets are single-stage detectors for efficiency [33, 59, 5, 28].

Detecting multi-scale objects is a fundamental challenge in object detection [38, 7]. Various components have been improved, including backbones and modules [56, 26, 27, 19, 14], necks [35, 63, 59], heads and training sample selection [47, 39, 69], and multi-scale training and testing [48, 53, 69] (see Supplementary Material for details).

Some recent or concurrent works [67, 5, 29] have combined multiple methods. Unlike these works, we analyzed various components developed for scale variations, without computation for neural architecture search [67], long training (273 epochs) [5], and multi-stage detectors [29].

Most prior studies have evaluated detectors on limited image domains. We demonstrate the superior performance of our UniverseNets across various object scales and image domains through the proposed benchmark.

2.2 Object Detection Benchmarks

There are numerous object detection benchmarks. For specific (category) object detection, recent benchmarks such as WIDER FACE [66] and TinyPerson [68] contain tiny objects. Although they are useful for evaluation for a specific category, many applications should detect multiple categories. For autonomous driving, KITTI [21] and Waymo Open Dataset [55] mainly evaluate three categories (car, pedestrian, and cyclist) in their leaderboards. For generic object detection, Pascal VOC [17] and COCO [37] include 20 and 80 categories, respectively. The number of categories has been further expanded by recent benchmarks, such as Open Images [32], Objects365 [51], and LVIS [24]. All the above datasets comprise photographs, whereas Clipart1k, Watercolor2k, Comic2k [30], and Manga109-s [40, 3] comprise artificial images.

Detectors evaluated on a specific dataset may perform worse on other datasets or domains. To address this issue, some benchmarks consist of multiple datasets. In the Robust Vision Challenge 2020 [1], detectors were evaluated on three datasets in the natural and traffic image domains. For universal-domain object detection, the Universal Object Detection Benchmark (UODB) [62] comprises 11 datasets in the natural, traffic, aerial, medical, and artificial image domains. Although it is suitable for evaluating detectors in various domains, variations in object scales are limited. Unlike UODB, our USB focuses on universal-scale object detection. Datasets in USB contain more instances, including tiny objects, than the datasets used in UODB.

As discussed in Sec. 1, the current benchmarks allow extremely unfair settings (e.g., 25× training epochs). We resolved this problem by establishing USB protocols for fair training and evaluation.

3 Benchmark Protocols of USB

Here, we present the principle, datasets, protocols, and metrics of USB. See Supplementary Material for additional information.

3.1 Principle

We focus on the Universal-Scale Object Detection (USOD) task that aims to detect various objects in terms of object scales and image domains. Unlike separate discussions for multi-scale object detection (Sec. 2) and universal (-domain) object detection [62], USOD does not ignore the relation between scales and domains (Sec. 3.2).

For various applications and users, benchmark protocols should cover from short to long training and from small to large test scales. On the other hand, they should not be scattered for meaningful benchmarks. To satisfy the conflicting requirements, we define multiple thresholds for training epochs and evaluation image resolutions. Furthermore, we urge rich participants to report results with standard training settings. This request enables fair comparison and allows many people to develop and compare object detectors.

3.2 Datasets

Figure 2: Object scale distributions. USB covers extensive scale variations quantitatively. The relative scale is the square root of the ratio of bounding box area to image area [68, 53, 29].

| Dataset | Domain | Color | Main sources of scale variation |
|---|---|---|---|
| COCO [37] | Natural | RGB | Categories, distance |
| WOD [55] | Traffic | RGB | Distance |
| Manga109-s [40, 3] | Artificial | Grayscale* | Viewpoints, page layouts |

Table 1: Characteristics of datasets. USB covers many qualitative variations related to scales and domains. *: Few RGB images.

| Benchmark | Dataset | Domain | Classes | Boxes | Images | B/I |
|---|---|---|---|---|---|---|
| USB (Ours) | COCO [37] | Natural | 80 | 897 k | 123 k | 7.3 |
| USB (Ours) | WOD [55] v1.2 f0 | Traffic | 3 | 1.0 M | 100 k | 10.0 |
| USB (Ours) | Manga109-s [40, 3] | Artificial | 4 | 401 k | 8.5 k | 47.0 |
| UODB [62] | COCO [37] val2014 | Natural | 80 | 292 k | 41 k | 7.2 |
| UODB [62] | KITTI [21] | Traffic | 3 | 35 k | 7.5 k | 4.7 |
| UODB [62] | Comic2k [30] | Artificial | 6 | 6.4 k | 2.0 k | 3.2 |

Table 2: Statistics of datasets in USB and counterpart datasets in UODB [62]. Values are based on publicly available annotations. B/I: Average number of boxes per image.

To establish USB, we selected COCO [37], the Waymo Open Dataset (WOD) [55], and Manga109-s (M109s) [40, 3]. WOD and M109s are the largest public datasets with many small objects in the traffic and artificial domains, respectively. Object scales in these domains vary significantly with distance and viewpoints, unlike those in the medical and aerial domains. USB covers extensive scale variations quantitatively (Figure 2) and qualitatively (Table 1). As shown in Table 2, these three datasets in USB contain more instances than their counterpart datasets in UODB [62] (COCO [37] val2014 subset, KITTI [21], and Comic2k [30]). USOD needs to evaluate detectors on datasets with many instances because more instances enable more reliable comparisons of scale-wise metrics.
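The relative scale used in Figure 2 follows directly from its definition (the square root of the ratio of bounding box area to image area). The snippet below is a minimal sketch; the function name `relative_scale` and its argument names are hypothetical and not taken from the authors' code.

```python
import math


def relative_scale(box_w, box_h, img_w, img_h):
    """Square root of (bounding box area / image area)."""
    if min(box_w, box_h, img_w, img_h) <= 0:
        raise ValueError("all sizes must be positive")
    return math.sqrt((box_w * box_h) / (img_w * img_h))


# A box that fills the whole image has relative scale 1.
full_image = relative_scale(640, 480, 640, 480)

# A 96x64 box in a 1920x1280 image covers 0.25% of the area,
# so its relative scale is sqrt(0.0025) = 0.05.
distant_object = relative_scale(96, 64, 1920, 1280)
```

Because the measure is normalized by image size, it allows comparing datasets whose images have very different resolutions.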

For the first dataset, we adopted the COCO dataset [37]. COCO contains natural images of everyday scenes collected from the Internet. Annotations for 80 categories are used in the benchmark. As shown in the teaser figure (left), object scales mainly depend on the categories and distance. Although COCO contains objects smaller than those of Pascal VOC [17], objects in everyday scenes (especially indoor scenes) are relatively large. Since COCO is the current standard dataset for multi-scale object detection, we adopted the same training split train2017 (also known as trainval35k) as the COCO benchmark to eliminate the need for retraining across benchmarks. We adopted the val2017 split (also known as minival) as the test set.

For the second dataset, we adopted the WOD, which is a large-scale, diverse dataset for autonomous driving [55] with many annotations for tiny objects (Figure 2). The images were recorded using five high-resolution cameras mounted on vehicles. As shown in the teaser figure (middle), object scales vary mainly with distance. The full data splits of the WOD are too large for benchmarking methods. Thus, we extracted 10% size subsets from the predefined training split (798 sequences) and validation split (202 sequences) [55]. Specifically, we extracted splits based on the ones place of the frame index (frames 0, 10, …, 190) in each sequence. We call the subsets f0train and f0val splits. Each sequence in the splits contains 20 frames (20 s, 1 Hz), and each frame contains five images for five cameras. We used three categories (vehicle, pedestrian, and cyclist) following the official ALL_NS setting [2] used in WOD competitions.
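The frame selection rule (keep frames whose index ends in 0) can be sketched as follows. This is an illustration only: the function name `select_f0_frames` is hypothetical, and the sequence length of 200 frames is an assumption inferred from the listed indices 0, 10, …, 190.

```python
def select_f0_frames(num_frames, step=10):
    """Return the frame indices whose ones place is 0 (0, 10, 20, ...)."""
    return [index for index in range(num_frames) if index % step == 0]


# Assumed sequence length: 200 frames, so the last kept index is 190.
kept = select_f0_frames(200)

# Each kept frame contributes one image per camera (five cameras).
images_per_sequence = len(kept) * 5

# Rough size of f0train plus f0val under the same assumption.
total_images = (798 + 202) * images_per_sequence
```

Under this assumption, the 1,000 sequences yield about 100,000 images, which is consistent with the 100 k images listed for WOD in Table 2.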

For the third dataset, we adopted M109s [40, 3]. M109s contains artificial images of manga (Japanese comics) and annotations for four categories (body, face, frame, and text). Many characteristics differ from those of natural images. Most images are grayscale. The objects are highly overlapped [43]. As shown in the teaser figure (right), object scales vary unrestrictedly with viewpoints and page layouts. Small objects differ greatly from downsampled versions of large objects because small objects are drawn with simple lines and points. For example, small faces look like a sign. This characteristic may ruin techniques developed mainly for natural images. Another challenge is ambiguity in annotations. Sometimes, a small-scale object is annotated, and sometimes, a similar scale object on another page is not annotated. Since annotating small objects is difficult and labor-intensive, this is an important and practical challenge. We carefully selected 68, 4, and 15 volumes for training, validation, and testing splits, and we call them the 68train, 4val, and 15test, respectively.

We selected the test splits from images with publicly available annotations to reduce the labor required for submissions. Participants should not fine-tune hyperparameters based on the test splits to prevent overfitting.

3.3 Training Protocols

| Protocol | Max epoch | AHPO | Compatibility | Example |
|---|---|---|---|---|
| USB 1.0 | 24 | No | | 2× schedule [22, 25] |
| USB 2.0 | 73 | No | USB 1.0 | 6× schedule [25] |
| USB 3.0 | 300 | No | USB 1.0, 2.0 | EfficientDet-D6 [59] |
| USB 3.1 | 300 | Yes | USB 1.0, 2.0, 3.0 | YOLOv4 [5] |
| Freestyle | | | | EfficientDet-D7x [60] |

Table 3: USB training protocols. For models trained with masks, 0.5 is added. AHPO: Aggressive hyperparameter optimization.

For fair training, we propose the USB training protocols shown in Table 3. By analogy with the backward compatibility of the Universal Serial Bus, USB training protocols emphasize compatibility between protocols. Importantly, participants should report results with not only higher protocols but also lower protocols. For example, when a participant trains a model for 150 epochs with standard hyperparameters, it corresponds to USB 3.0. The participant should also report the results of models trained for 24 and 73 epochs in a paper. The readers of the paper can judge whether the method is useful for standard training epochs.

The number of maximum epochs for USB 1.0 is 24, which is the most popular setting in COCO (see Table 10). We adopted 73 epochs for USB 2.0, where models trained from scratch can catch up with those trained for 24 epochs from ImageNet pre-trained models [25]. We adopted 300 epochs for USB 3.x such that YOLOv4 [5] and most EfficientDet models [60] correspond to this protocol. Models trained for more than 300 epochs are regarded as Freestyle. They are not suitable for benchmarking methods, although they may push the empirical limits of detectors [60, 9].

For models trained with mask annotations, 0.5 is added to their number of protocols. Results without mask annotations should be reported if possible for their algorithms.
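The rules of Table 3 can be sketched as a small lookup function. This is a minimal sketch under stated assumptions: the function name `usb_training_protocol` and its arguments are hypothetical, and it assumes that any aggressive hyperparameter optimization (AHPO) within 300 epochs maps to USB 3.1, as the YOLOv4 example suggests.

```python
def usb_training_protocol(epochs, ahpo=False, uses_masks=False):
    """Map a training setup to a USB training protocol name (sketch)."""
    if epochs <= 0:
        raise ValueError("epochs must be positive")
    if epochs > 300:
        # More than 300 epochs is regarded as Freestyle.
        return "Freestyle"
    if ahpo:
        version = 3.1  # assumption: AHPO implies USB 3.1
    elif epochs <= 24:
        version = 1.0
    elif epochs <= 73:
        version = 2.0
    else:
        version = 3.0
    if uses_masks:
        # 0.5 is added for models trained with mask annotations.
        version += 0.5
    return f"USB {version:.1f}"


print(usb_training_protocol(150))                   # USB 3.0
print(usb_training_protocol(273, ahpo=True))        # USB 3.1
print(usb_training_protocol(40, uses_masks=True))   # USB 2.5
```

A participant whose setup maps to USB 3.0 would, following the compatibility rule, also report results for models trained within 24 and 73 epochs.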

For ease of comparison, we limit the pre-training datasets to the three USB datasets and ImageNet (ILSVRC 1,000-class classification). Other datasets are welcome only when the results with and without additional datasets are reported. Participants should describe how to use the datasets. A possible way is to fine-tune the models on WOD and M109s from COCO pre-trained models. Another way is to train a single model jointly [62] on the three datasets.

In addition to long training schedules, hyperparameter optimization is resource-intensive. If authors of a paper fine-tune hyperparameters for their architecture, other people without sufficient computational resources cannot compare methods fairly. We recommend roughly tuning the minimum hyperparameters, such as batch sizes and learning rates (e.g., from a small set of candidate values). When participants optimize hyperparameters aggressively by manual fine-tuning or automatic algorithms, they should report both results with and without aggressive optimization.

3.4 Evaluation Protocols

| Protocol | Max reso. | Typical scale | Reference |
|---|---|---|---|
| Standard USB | 1,066,667 | 1333×800 | Popular in COCO [37, 10, 22] |
| Mini USB | 262,144 | 512×512 | Popular in VOC [17, 39] |
| Micro USB | 50,176 | 224×224 | Popular in ImageNet [49, 26] |
| Large USB | 2,457,600 | 1920×1280 | WOD front cameras [55] |
| Huge USB | 7,526,400 | 3360×2240 | WOD top methods ([29], ours) |

Table 4: USB evaluation protocols.

For fair evaluation, we propose the USB evaluation protocols shown in Table 4. By analogy with the size variations of the Universal Serial Bus connectors for various devices, USB evaluation protocols have variations in test image scales for various devices and applications.

The maximum resolution for Standard USB follows the popular test scale of 1333×800 in the COCO benchmark (see Table 10 and [10, 22]). For Mini USB, we limit the resolution based on 512×512. This resolution is popular in the Pascal VOC benchmark [17, 39], which contains small images and large objects. It is also popular in real-time detectors [59, 5]. We adopted a further small-scale 224×224 for Micro USB. This resolution is popular in ImageNet classification [49, 26]. Although small object detection is extremely difficult, it is suitable for low-power devices. Additionally, this protocol enables people to manage object detection tasks using one or few GPUs. To cover larger test scales than Standard USB, we define Large USB and Huge USB based on WOD resolutions. The maximum resolution of the Huge USB is determined by the top methods on WOD (see Sec. 5.3). Although larger inputs (regarded as Freestyle) may be preferable for accuracy, excessively large inputs significantly reduce the practicality of detectors.
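The thresholds in Table 4 can be applied mechanically to a test image size. The sketch below assumes that a test scale belongs to the smallest protocol whose maximum resolution (in pixels) covers it; the names `USB_EVAL_PROTOCOLS` and `usb_evaluation_protocol` are hypothetical.

```python
# (protocol name, maximum number of pixels), ordered from small to large.
USB_EVAL_PROTOCOLS = [
    ("Micro USB", 50_176),
    ("Mini USB", 262_144),
    ("Standard USB", 1_066_667),
    ("Large USB", 2_457_600),
    ("Huge USB", 7_526_400),
]


def usb_evaluation_protocol(width, height):
    """Return the smallest USB evaluation protocol covering width x height."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    pixels = width * height
    for name, max_pixels in USB_EVAL_PROTOCOLS:
        if pixels <= max_pixels:
            return name
    # Inputs larger than Huge USB are regarded as Freestyle.
    return "Freestyle"


print(usb_evaluation_protocol(1333, 800))   # Standard USB
print(usb_evaluation_protocol(1280, 1280))  # Large USB
print(usb_evaluation_protocol(3000, 1800))  # Huge USB
```

Under this reading, a 768×768 input falls under Standard USB because it exceeds the Mini USB limit of 262,144 pixels.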

In addition to test image scales, the presence and degree of Test-Time Augmentation (TTA) make large differences in accuracy and inference time. When using TTA, participants should report its details (including the number of scales of multi-scale testing) and results without TTA.

3.5 Evaluation Metrics

We mainly use the COCO metrics [37, 34] to evaluate the performance of detectors on each dataset. We provide data format converters for WOD and M109s in our GitHub repository (anonymized for review).

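We first describe the calculation of COCO metrics according to the official evaluation code [34]. True or false positives are judged by measuring the Intersection over Union (IoU) between predicted bounding boxes and ground truth bounding boxes [17]. For each category, the Average Precision (AP) is calculated as precision averaged over 101 recall thresholds $\{0, 0.01, \ldots, 1\}$. The COCO-style AP (CAP) for a dataset $d$ is calculated as

$$\mathrm{CAP}_d = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|C_d|} \sum_{c \in C_d} \mathrm{AP}_{t,c},$$

where $T = \{0.5, 0.55, \ldots, 0.95\}$ denotes the predefined 10 IoU thresholds, $C_d$ denotes categories in the dataset $d$, $|\cdot|$ denotes the cardinality of a set (e.g., $|C_d| = 80$ for COCO), and $\mathrm{AP}_{t,c}$ denotes AP for an IoU threshold $t$ and a category $c$. For detailed analysis, five additional AP metrics (averaged over categories) are evaluated. $\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$ denote AP at single IoU thresholds of $0.5$ and $0.75$, respectively. $\mathrm{AP}_S$, $\mathrm{AP}_M$, and $\mathrm{AP}_L$ are variants of CAP, where target objects are limited to small (area $\le 32^2$), medium ($32^2 \le$ area $\le 96^2$), and large ($96^2 \le$ area) objects, respectively. The area is measured using mask annotations for COCO and bounding box annotations for WOD and M109s.

As the primary metric for USB, we use the mean COCO-style AP (mCAP) averaged over all datasets $D$ as

$$\mathrm{mCAP} = \frac{1}{|D|} \sum_{d \in D} \mathrm{CAP}_d.$$

Since USB adopts the three datasets described in Sec. 3.2, $\mathrm{mCAP} = (\mathrm{CAP}_{\mathrm{COCO}} + \mathrm{CAP}_{\mathrm{WOD}} + \mathrm{CAP}_{\mathrm{M109s}}) / 3$. Similarly, we define five metrics $\mathrm{mAP}_{50}$, $\mathrm{mAP}_{75}$, $\mathrm{mAP}_S$, $\mathrm{mAP}_M$, and $\mathrm{mAP}_L$ from $\mathrm{AP}_{50}$, $\mathrm{AP}_{75}$, $\mathrm{AP}_S$, $\mathrm{AP}_M$, and $\mathrm{AP}_L$ by averaging them over the datasets. We plan to define finer scale-wise metrics for USOD in future work.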
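The two averaging steps (over IoU thresholds and categories for CAP, then over datasets for mCAP) can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that the per-threshold, per-category AP values have already been computed by an evaluation tool. The Faster R-CNN values are the per-dataset APs reported in Tables 7, 8, and 9.

```python
def coco_style_ap(ap_table):
    """Average AP over IoU thresholds and categories.

    ap_table[t][c] is the AP for IoU threshold index t and category index c.
    """
    per_threshold = [sum(row) / len(row) for row in ap_table]
    return sum(per_threshold) / len(per_threshold)


def mean_coco_style_ap(cap_per_dataset):
    """Average CAP over datasets (mCAP)."""
    return sum(cap_per_dataset.values()) / len(cap_per_dataset)


# Toy example with two IoU thresholds and two categories.
toy_cap = coco_style_ap([[0.5, 0.7], [0.3, 0.5]])

# CAP of Faster R-CNN on the three USB datasets.
faster_rcnn = {"COCO": 37.4, "WOD": 34.5, "M109s": 65.8}
faster_rcnn_mcap = mean_coco_style_ap(faster_rcnn)
print(round(faster_rcnn_mcap, 1))
```

Rounded to one decimal place, the result agrees with the 45.9 reported for Faster R-CNN in the first numeric column of the USB benchmark results.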

For ease of quantitative evaluation, we limit the number of detections per image to 100 across all categories, following the COCO benchmark [34]. For qualitative evaluation, participants may raise the limit to 300 (1% of images in the M109s 15test set contain more than 100 annotations).
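Limiting detections per image can be sketched as keeping the highest-scoring detections across all categories. This is a minimal sketch; the detection format (a list of dictionaries with a `score` key) and the function name `limit_detections` are assumptions for illustration.

```python
def limit_detections(detections, max_dets=100):
    """Keep the max_dets highest-scoring detections of one image.

    The limit applies across all categories, not per category.
    """
    ranked = sorted(detections, key=lambda det: det["score"], reverse=True)
    return ranked[:max_dets]


# Hypothetical image with 150 detections over four categories.
detections = [{"category_id": i % 4, "score": i / 1000} for i in range(150)]

for_quantitative_eval = limit_detections(detections)             # 100 kept
for_qualitative_eval = limit_detections(detections, max_dets=300)  # all 150 kept
```

Raising the limit matters mostly for crowded manga pages, where a single image can contain more than 100 annotated objects.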

4 UniverseNets

For fast and accurate detectors for USOD, we designed UniverseNets. Single-stage detectors were adopted for efficiency. See Supplementary Material for details of the methods and architectures used in UniverseNets.

As a baseline model, we used RetinaNet [36] implemented in MMDetection [10]. Specifically, the backbone is ResNet-50-B [27] (a variant of ResNet-50 [26], also known as the PyTorch style). The neck is FPN [35]. We used focal loss [36], single-scale training, and single-scale testing.

Built on the RetinaNet baseline, we designed UniverseNet by collecting human wisdom about multi-scale object detection as of May 2020. We used ATSS [69] and SEPC without iBN [63] (hereafter referred to as ATSEPC). The backbone is Res2Net-50-v1b [19, 20]. Deformable Convolutional Networks (DCN) [14] were adopted in the backbone and neck. We used multi-scale training. Unless otherwise stated, we used single-scale testing for efficiency.

By adding GFL [33], SyncBN [44], and iBN [63], we designed three variants of UniverseNet around August 2020. UniverseNet-20.08d heavily uses DCN [14]. UniverseNet-20.08 speeds up inference (and training) by the light use of DCN [14, 63]. UniverseNet-20.08s further speeds up inference using the ResNet-50-C [27] backbone.

5 Experiments

Here, we present benchmark results on USB and comparison results with state-of-the-art methods on the three datasets. Thereafter, we analyze the characteristics of detectors by additional experiments. See Supplementary Material for details of the experimental settings and results, including the KITTI-style AP [21, 55] on WOD and the effects of COCO pre-training on M109s.

5.1 Experimental Settings

| Hyperparameters | COCO | WOD | M109s |
|---|---|---|---|
| LR for multi-stage detectors | 0.02 | 0.02 | 0.16 |
| LR for single-stage detectors | 0.01 | 0.01 | 0.08 |
| Test scale | 1333×800 | 1248×832 | 1216×864 |
| Range for multi-scale training* | 480–960 | 640–1280 | 480–960 |

| Hyperparam. | Common |
|---|---|
| Epoch | 12 |
| Batch size | 16 |
| Momentum | 0.9 |
| Weight decay | |

Table 5: Default hyperparameters. *: Shorter side pixels.

Our code is built on MMDetection [10] v2. We used the COCO pre-trained models of the repository for existing methods (Faster R-CNN [47] with FPN [35], Cascade R-CNN [8], RetinaNet [36], ATSS [69], and GFL [33]). We trained all models with Stochastic Gradient Descent (SGD).

The default hyperparameters are listed in Table 5. Most values follow standard settings [10, 36, 69, 35]. We used some dataset-dependent values. For M109s, we roughly tuned the learning rates (LR) based on a preliminary experiment with the RetinaNet [36] baseline model. Test scales were determined within the standard USB protocol, considering the typical aspect ratio of the images in each dataset. The ranges for multi-scale training for COCO and M109s follow prior work [63]. We used larger scales for WOD because the objects in WOD are especially small.
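Multi-scale training with a shorter-side range can be sketched as below. This is not the MMDetection implementation: the function name `sample_train_scale` is hypothetical, the shorter side is assumed to be sampled uniformly from the range, and no cap on the longer side is applied.

```python
import random


def sample_train_scale(img_w, img_h, short_range=(480, 960), rng=random):
    """Sample a training size whose shorter side lies in short_range.

    The aspect ratio of the image is preserved (up to rounding).
    """
    short_side = rng.randint(short_range[0], short_range[1])
    factor = short_side / min(img_w, img_h)
    return round(img_w * factor), round(img_h * factor)


rng = random.Random(0)  # fixed seed for reproducibility
coco_like = sample_train_scale(640, 480, short_range=(480, 960), rng=rng)
wod_like = sample_train_scale(1920, 1280, short_range=(640, 1280), rng=rng)
```

The larger range for the second call mirrors the choice of larger training scales for WOD, where objects are especially small.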

COCO models were fine-tuned from ImageNet pre-trained backbones. We trained the models for WOD and M109s from the corresponding COCO pre-trained models. We follow the learning rate schedules of MMDetection [10]. We mainly used the 1× schedule (12 epochs). For comparison with state-of-the-art methods on COCO, we used the 2× schedule (24 epochs) for most models and the 20e schedule (20 epochs) for UniverseNet-20.08d due to overfitting with the 2× schedule. For comparison with state-of-the-art methods on WOD, we trained UniverseNet on the WOD full training set for 7 epochs. We used one learning rate for the first 6 epochs and another for the last epoch.
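A step learning rate schedule of this kind can be sketched as follows. The decay epochs and the decay factor of 0.1 are assumptions based on the commonly used MMDetection v2 schedule configurations rather than values stated in the text, warmup is omitted, and the names `SCHEDULES`, `step_lr`, and `lr_per_epoch` are hypothetical.

```python
# Assumed settings: schedule name -> (total epochs, epochs at which LR decays).
SCHEDULES = {
    "1x": (12, (8, 11)),
    "2x": (24, (16, 22)),
    "20e": (20, (16, 19)),
}


def step_lr(epoch, base_lr, decay_epochs, gamma=0.1):
    """Learning rate for a 0-indexed epoch under step decay."""
    num_decays = sum(1 for decay_epoch in decay_epochs if epoch >= decay_epoch)
    return base_lr * gamma ** num_decays


def lr_per_epoch(schedule, base_lr):
    """List of learning rates, one per epoch, for a named schedule."""
    total_epochs, decay_epochs = SCHEDULES[schedule]
    return [step_lr(epoch, base_lr, decay_epochs) for epoch in range(total_epochs)]


# Single-stage detector on COCO with the 1x schedule (base LR 0.01).
lrs = lr_per_epoch("1x", 0.01)
```

With these assumed settings, the 20e schedule ends four epochs earlier than the 2× schedule while decaying at the same first step.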

5.2 Benchmark Results on USB

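| Method | mCAP | mAP50 | mAP75 | mAPS | mAPM | mAPL | COCO CAP | WOD CAP | M109s CAP |
|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN [47] | 45.9 | 68.2 | 49.1 | 15.2 | 38.9 | 62.5 | 37.4 | 34.5 | 65.8 |
| Cascade R-CNN [8] | 48.1 | 68.5 | 51.5 | 15.6 | 41.3 | 65.9 | 40.3 | 36.4 | 67.6 |
| RetinaNet [36] | 44.8 | 66.0 | 47.4 | 12.9 | 37.3 | 62.6 | 36.5 | 32.5 | 65.3 |
| ATSS [69] | 47.1 | 68.0 | 50.2 | 15.5 | 39.5 | 64.7 | 39.4 | 35.4 | 66.5 |
| ATSEPC [69, 63] | 48.1 | 68.5 | 51.2 | 15.5 | 40.5 | 66.8 | 42.1 | 35.0 | 67.1 |
| GFL [33] | 47.7 | 68.3 | 50.6 | 15.8 | 39.9 | 65.8 | 40.2 | 35.7 | 67.3 |
| UniverseNet | 51.4 | 72.1 | 55.1 | 18.4 | 45.0 | 70.7 | 46.7 | 38.6 | 68.9 |
| UniverseNet-20.08 | 52.1 | 72.9 | 55.5 | 19.2 | 45.8 | 70.8 | 47.5 | 39.0 | 69.9 |

Table 6: Benchmark results on USB.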
Figure 3: Correlation between mCAP and CAP on each dataset.
| Method | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| Faster R-CNN [47] | 37.4 | 58.1 | 40.4 | 21.2 | 41.0 | 48.1 |
| Cascade R-CNN [8] | 40.3 | 58.6 | 44.0 | 22.5 | 43.8 | 52.9 |
| RetinaNet [36] | 36.5 | 55.4 | 39.1 | 20.4 | 40.3 | 48.1 |
| ATSS [69] | 39.4 | 57.6 | 42.8 | 23.6 | 42.9 | 50.3 |
| ATSEPC [69, 63] | 42.1 | 59.9 | 45.5 | 24.6 | 46.1 | 55.0 |
| GFL [33] | 40.2 | 58.4 | 43.3 | 23.3 | 44.0 | 52.2 |
| UniverseNet | 46.7 | 65.0 | 50.7 | 29.2 | 50.6 | 61.4 |
| UniverseNet-20.08 | 47.5 | 66.0 | 51.9 | 28.9 | 52.1 | 61.9 |

Table 7: Results on COCO minival.
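| Method | AP | AP50 | AP75 | APS | APM | APL | veh. | ped. | cyc. |
|---|---|---|---|---|---|---|---|---|---|
| Faster | 34.5 | 55.3 | 36.3 | 6.0 | 35.8 | 67.4 | 42.7 | 34.6 | 26.1 |
| Cascade | 36.4 | 56.3 | 38.6 | 6.5 | 38.1 | 70.6 | 44.5 | 36.3 | 28.5 |
| RetinaNet | 32.5 | 52.2 | 33.7 | 2.6 | 32.8 | 67.9 | 40.0 | 32.5 | 25.0 |
| ATSS | 35.4 | 56.2 | 37.0 | 6.1 | 36.6 | 69.8 | 43.6 | 35.6 | 27.0 |
| ATSEPC | 35.0 | 55.3 | 36.5 | 5.8 | 35.5 | 70.5 | 43.5 | 35.3 | 26.3 |
| GFL | 35.7 | 56.0 | 37.1 | 6.2 | 36.7 | 70.7 | 44.0 | 36.0 | 27.1 |
| Univ | 38.6 | 59.8 | 40.9 | 7.4 | 41.0 | 74.0 | 46.0 | 37.6 | 32.3 |
| Univ20.08 | 39.0 | 60.2 | 40.4 | 8.3 | 41.7 | 73.3 | 47.1 | 38.7 | 31.0 |

Table 8: Results on WOD f0val.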
| Method | AP | AP50 | AP75 | APS | APM | APL | body | face | frame | text |
|---|---|---|---|---|---|---|---|---|---|---|
| Faster | 65.8 | 91.1 | 70.6 | 18.4 | 39.9 | 72.1 | 58.3 | 47.5 | 90.1 | 67.1 |
| Cascade | 67.6 | 90.6 | 72.0 | 17.9 | 41.9 | 74.3 | 60.8 | 48.2 | 92.5 | 69.0 |
| RetinaNet | 65.3 | 90.5 | 69.5 | 15.7 | 38.9 | 71.9 | 58.3 | 46.3 | 88.8 | 67.7 |
| ATSS | 66.5 | 90.1 | 70.8 | 16.8 | 38.9 | 74.0 | 60.9 | 44.6 | 91.3 | 69.0 |
| ATSEPC | 67.1 | 90.2 | 71.5 | 16.2 | 39.8 | 74.9 | 62.3 | 44.6 | 92.1 | 69.4 |
| GFL | 67.3 | 90.6 | 71.5 | 17.9 | 38.9 | 74.4 | 61.7 | 45.7 | 92.2 | 69.4 |
| Univ | 68.9 | 91.4 | 73.7 | 18.7 | 43.4 | 76.6 | 65.8 | 46.6 | 93.0 | 70.3 |
| Univ20.08 | 69.9 | 92.5 | 74.3 | 20.5 | 43.6 | 77.1 | 66.6 | 48.0 | 93.7 | 71.2 |

Table 9: Results on Manga109-s 15test.

We trained and evaluated methods on the USB. All methods follow the Standard USB 1.0 protocol using the default hyperparameters in Sec. 5.1. The results are shown in Table 6. UniverseNet-20.08 achieves the highest results on all datasets, resulting in 52.1% mCAP. In most cases, methods that work on COCO also work on the other datasets. Cascade R-CNN [8] and ATSS [69] achieve over 2% more mCAP than Faster R-CNN [47] and RetinaNet [36], respectively. In some cases, methods that work on COCO show small or negative effects on WOD and M109s. Thus, USB can impose a penalty on COCO-biased methods.

To compare the effectiveness of each method on each dataset, we show the correlation between mCAP and CAP on each dataset in Figure 3. SEPC [63] improves COCO CAP and deteriorates WOD CAP. Multi-stage detectors [47, 8] show relatively high CAP on WOD and relatively low CAP on COCO. Adding GFL [33] is especially effective on M109s (see improvements from ATSS to GFL and from UniverseNet to UniverseNet-20.08).

We also show detailed results on each dataset. Table 7 shows the COCO results. Since the effectiveness of existing methods has been verified on COCO, their improvements are steady. Table 8 shows the WOD results. Adding SEPC [63] to ATSS [69] decreases all metrics except for APL. We found that this reduction does not occur at large test scales in higher USB evaluation protocols (see Sec. 5.4). UniverseNet-20.08 shows worse results than UniverseNet in some metrics, probably due to the light use of DCN for fast inference (see Table 13). Table 9 shows the M109s results. Interestingly, improvements by ATSS [69] are smaller than those on COCO and WOD due to the drop of face AP. We conjecture that this phenomenon comes from the domain differences discussed in Sec. 3.2 and prior work [43], although we should explore it in future work.

Protocol Method Backbone DCN Epoch Max test scale TTA FPS AP AP50 AP75 APS APM APL Reference
Standard USB 1.0 Faster R-CNN [47, 35] ResNet-101 22 1333×800 (14.2) 36.2 59.1 39.0 18.2 39.0 48.2 CVPR17
Standard USB 1.0 Cascade R-CNN [8] ResNet-101 19 1312×800 (11.9) 42.8 62.1 46.3 23.7 45.5 55.2 CVPR18
Standard USB 1.0 RetinaNet [36] ResNet-101 18 1333×800 (13.6) 39.1 59.1 42.3 21.8 42.7 50.2 ICCV17
Standard USB 1.0 FCOS [61] X-101 (64×4d) 24 1333×800 (8.9) 44.7 64.1 48.4 27.6 47.5 55.6 ICCV19
Standard USB 1.0 ATSS [69] X-101 (64×4d) 24 1333×800 10.6 47.7 66.5 51.9 29.7 50.8 59.4 CVPR20
Standard USB 1.0 FreeAnchor+SEPC [63] X-101 (64×4d) 24 1333×800 50.1 69.8 54.3 31.3 53.3 63.7 CVPR20
Standard USB 1.0 PAA [31] X-101 (64×4d) 24 1333×800 49.0 67.8 53.3 30.2 52.8 62.2 ECCV20
Standard USB 1.0 PAA [31] X-152 (32×8d) 24 1333×800 50.8 69.7 55.1 31.4 54.7 65.2 ECCV20
Standard USB 1.0 RepPoints v2 [12] X-101 (64×4d) 24 1333×800 (3.8) 49.4 68.9 53.4 30.3 52.1 62.3 NeurIPS20
Standard USB 1.0 RelationNet++ [13] X-101 (64×4d) 20 1333×800 10.3 50.3 69.0 55.0 32.8 55.0 65.8 NeurIPS20
Standard USB 1.0 GFL [33] ResNet-50 24 1333×800 37.2 43.1 62.0 46.8 26.0 46.7 52.3 NeurIPS20
Standard USB 1.0 GFL [33] ResNet-101 24 1333×800 29.5 45.0 63.7 48.9 27.2 48.8 54.5 NeurIPS20
Standard USB 1.0 GFL [33] ResNet-101 24 1333×800 22.8 47.3 66.3 51.4 28.0 51.1 59.2 NeurIPS20
Standard USB 1.0 GFL [33] X-101 (32×4d) 24 1333×800 15.4 48.2 67.4 52.6 29.2 51.7 60.2 NeurIPS20
Standard USB 1.0 UniverseNet-20.08s ResNet-50-C 24 1333×800 31.6 47.4 66.0 51.4 28.3 50.8 59.5 (Ours)
Standard USB 1.0 UniverseNet-20.08 Res2Net-50-v1b 24 1333×800 24.9 48.8 67.5 53.0 30.1 52.3 61.1 (Ours)
Standard USB 1.0 UniverseNet-20.08d Res2Net-101-v1b 20 1333×800 11.7 51.3 70.0 55.8 31.7 55.3 64.9 (Ours)
Large USB 1.0 UniverseNet-20.08d Res2Net-101-v1b 20 1493×896 11.6 51.5 70.2 56.0 32.8 55.5 63.7 (Ours)
Large USB 1.0 UniverseNet-20.08d Res2Net-101-v1b 20 2000×1200 5 53.8 71.5 59.4 35.3 57.3 67.3 (Ours)
Huge USB 1.0 ATSS [69] X-101 (64×4d) 24 3000×1800 13 50.7 68.9 56.3 33.2 52.9 62.4 CVPR20
Huge USB 1.0 PAA [31] X-101 (64×4d) 24 3000×1800 13 51.4 69.7 57.0 34.0 53.8 64.0 ECCV20
Huge USB 1.0 PAA [31] X-152 (32×8d) 24 3000×1800 13 53.5 71.6 59.1 36.0 56.3 66.9 ECCV20
Huge USB 1.0 RepPoints v2 [12] X-101 (64×4d) 24 3000×1800 13 52.1 70.1 57.5 34.5 54.6 63.6 NeurIPS20
Huge USB 1.0 RelationNet++ [13] X-101 (64×4d) 20 3000×1800 13 52.7 70.4 58.3 35.8 55.3 64.7 NeurIPS20
Huge USB 1.0 UniverseNet-20.08d Res2Net-101-v1b 20 3000×1800 13 54.1 71.6 59.9 35.8 57.2 67.4 (Ours)
Huge USB 2.0 TSD [54] SENet-154 34 2000×1400 4 51.2 71.9 56.0 33.8 54.8 64.2 CVPR20
Huge USB 2.5 DetectoRS [45] X-101 (32×4d) 40 2400×1600 5 54.7 73.5 60.1 37.4 57.3 66.4 arXiv20
Mini USB 3.0 EfficientDet-D0 [59] EfficientNet-B0 300 512×512 98.0 33.8 52.2 35.8 12.0 38.3 51.2 CVPR20
Mini USB 3.1 YOLOv4 [5] CSPDarknet-53 273 512×512 83 43.0 64.9 46.5 24.3 46.1 55.2 arXiv20
Standard USB 3.0 EfficientDet-D2 [59] EfficientNet-B2 300 768×768 56.5 43.0 62.3 46.2 22.5 47.0 58.4 CVPR20
Standard USB 3.0 EfficientDet-D4 [59] EfficientNet-B4 300 1024×1024 23.4 49.4 69.0 53.4 30.3 53.2 63.2 CVPR20
Standard USB 3.1 YOLOv4 [5] CSPDarknet-53 273 608×608 62 43.5 65.7 47.3 26.7 46.7 53.3 arXiv20
Large USB 3.0 EfficientDet-D5 [59] EfficientNet-B5 300 1280×1280 13.8 50.7 70.2 54.7 33.2 53.9 63.2 CVPR20
Large USB 3.0 EfficientDet-D6 [59] EfficientNet-B6 300 1280×1280 10.8 51.7 71.2 56.0 34.1 55.2 64.1 CVPR20
Large USB 3.0 EfficientDet-D7 [59] EfficientNet-B6 300 1536×1536 8.2 52.2 71.4 56.3 34.8 55.5 64.6 CVPR20
Freestyle RetinaNet+SpineNet [16] SpineNet-190 400 1280×1280 52.1 71.8 56.5 35.4 55.0 63.6 CVPR20
Freestyle EfficientDet-D7x [60] EfficientNet-B7 600 1536×1536 6.5 55.1 74.3 59.9 37.2 57.9 68.0 arXiv20
Table 10: State-of-the-art methods on COCO test-dev. We classify methods by the proposed protocols. X in the Backbone column denotes ResNeXt [64]. See the method papers for other backbones. TTA: Test-time augmentation including horizontal flip and multi-scale testing (numbers denote scales). FPS values without and with parentheses were measured on a V100 with mixed precision and in other environments, respectively. We measured the FPS of the GFL [33] models in our environment and estimated those of ATSS [69] and RelationNet++ [13] based on the measured values and [33, 13]. The settings of other methods are based on conference papers, their arXiv versions, and the authors’ code. The values shown in gray were estimated from descriptions in papers and code. Some FPS values are from [33].
Rank Method # Models (multi-stage, single-stage) AP/L2
Methods including multi-stage detector:
1 RW-TSDet [29] 6+ 74.43
2 HorizonDet [11] 4 8 70.28
3 SPNAS-Noah [65] 2 69.43
Single-stage detectors:
7 UniverseNet (Ours) 1 67.42
13 YOLO V4 [5] 1+ 58.08
14 ATSS-Efficientnet [69, 58] 1+ 56.99
Table 11: Waymo Open Dataset Challenge 2020 2D detection [2].
(a) AP improvements by SEPC without iBN [63].
(b) AP improvements by Res2Net-v1b [19, 20], DCN [14], and multi-scale training.
(c) AP improvements by GFL [33].
(d) AP improvements by SyncBN [44], iBN [63].
(e) Speeding up by the light use of DCN [14, 63].
(f) Ablation from UniverseNet-20.08. Replacing Res2Net-v1b backbone with ResNet-B [27] has the largest effects.
(g) UniverseNet-20.08 with different backbones.
Table 12: Ablation studies on COCO minival.
Table 13: NightOwls Detection Challenge 2020 all objects track. MR: Average Miss Rate (%) on test set under reasonable setting.
Figure 4: Test scales vs. CAP on WOD f0val.

5.3 Comparison with State-of-the-Art

COCO. We show state-of-the-art methods on COCO test-dev (as of November 14, 2020) in Table 10. Our UniverseNet-20.08d achieves the highest AP (51.3%) in the Standard USB 1.0 protocol. Despite 12.5× fewer epochs, the speed-accuracy trade-offs of our models are comparable to those of EfficientDet [59] (see also Figure 1). With 13-scale TTA, UniverseNet-20.08d achieves the highest AP (54.1%) in the Huge USB 1.0 protocol. Even with 5-scale TTA in the Large USB 1.0 protocol, it achieves 53.8% AP, which is higher than the other methods in the USB 1.0 protocols.
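As a small illustration of how this protocol-wise comparison can be read off Table 10, the sketch below encodes a handful of rows copied from the table, picks the highest-AP entry per protocol, and reproduces the 12.5× epoch ratio (300 EfficientDet epochs versus 24 epochs for UniverseNet-20.08). The tuple layout and the function names are illustrative assumptions, not part of the released code.

```python
# A few rows copied from Table 10: (protocol, method, epochs, AP).
ROWS = [
    ("Standard USB 1.0", "ATSS", 24, 47.7),
    ("Standard USB 1.0", "RelationNet++", 20, 50.3),
    ("Standard USB 1.0", "UniverseNet-20.08d", 20, 51.3),
    ("Large USB 1.0", "UniverseNet-20.08d (5-scale TTA)", 20, 53.8),
    ("Huge USB 1.0", "PAA X-152 (13-scale TTA)", 24, 53.5),
    ("Huge USB 1.0", "UniverseNet-20.08d (13-scale TTA)", 20, 54.1),
    ("Standard USB 3.0", "EfficientDet-D4", 300, 49.4),
]


def best_per_protocol(rows):
    """Return {protocol: (method, AP)} for the highest-AP row of each protocol."""
    best = {}
    for protocol, method, _epochs, ap in rows:
        if protocol not in best or ap > best[protocol][1]:
            best[protocol] = (method, ap)
    return best


def epoch_ratio(long_epochs, short_epochs):
    """How many times longer one training schedule is than another."""
    return long_epochs / short_epochs
```

Grouping by protocol before comparing is the point of the exercise: an AP gap between rows from different protocols mixes the effect of the method with the effect of the training budget.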

WOD. For comparison with state-of-the-art methods on WOD, we submitted the detection results of UniverseNet to the Waymo Open Dataset Challenge 2020 2D detection, a competition held at a CVPR 2020 workshop. The primary metric is AP/L2, a KITTI-style AP evaluated with LEVEL_2 objects [55, 2]. We used multi-scale testing with soft-NMS [6]. The shorter side pixels of test scales are , including 8 pixels of padding. These scales enable utilizing SEPC [63] (see Sec. 5.4) and detecting small objects. Table 11 shows the top teams’ results. UniverseNet achieves 67.42% AP/L2 without multi-stage detectors, ensembles, expert models, or heavy backbones, unlike other top methods. RW-TSDet [29] far outperforms other multi-stage detectors, whereas UniverseNet far outperforms other single-stage detectors. These two methods used light backbones and large test scales [4]. Interestingly, the maximum test scales are the same (3360×2240). We conjecture that this is not a coincidence but a convergence caused by searching for the accuracy saturation point.
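To make the soft-NMS step mentioned above concrete, here is a minimal pure-Python sketch of the Gaussian variant of soft-NMS [6], which decays the scores of overlapping boxes instead of discarding them. It is not the authors' implementation; the `sigma` and `score_thr` defaults are assumptions chosen for illustration, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian soft-NMS. Returns (box, score) pairs in selection order."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Select the box with the highest current score.
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best_idx)
        kept.append((best_box, best_score))
        # Decay the other scores according to their overlap with it.
        decayed = []
        for box, score in remaining:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_thr:
                decayed.append((box, score))
        remaining = decayed
    return kept
```

In multi-scale testing, the boxes predicted at every test scale would first be mapped back to the original image coordinates and concatenated per class before a call such as `soft_nms(all_boxes, all_scores)`.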

Manga109-s. To the best of our knowledge, no prior work has reported detection results on the Manga109-s dataset (87 volumes). Although many settings differ, the state-of-the-art method on the full Manga109 dataset (109 volumes, non-public to commercial organizations) achieves 77.1–92.0% (mean: 84.2%) AP on ten test volumes [43]. The mean AP of UniverseNet-20.08 on the 15test set (92.5%) is higher than those results.

5.4 Analyses and Discussions

Chaotic state of the art. As shown in Table 10, state-of-the-art detectors on the COCO benchmark were trained with various settings. Comparisons across different training epochs are especially difficult because long training does not decrease FPS, unlike large test scales. Nevertheless, the EfficientDet [59], YOLOv4 [5], and SpineNet [16] papers compare methods in their tables without specifying the difference in training epochs. The compatibility of the USB training protocols (Sec. 3.3) resolves this disorder. We hope that many papers report results with the protocols for inclusive, healthy, and sustainable development of detectors.

Ablation studies. We show the results of ablation studies for UniverseNets on COCO in Table 12. SEPC [63], Res2Net-v1b [19, 20], DCN [14], multi-scale training, GFL [33], SyncBN [44], and iBN [63] improve AP. UniverseNet-20.08d is much more accurate (48.6% AP) than other models trained for 12 epochs using ResNet-50-level backbones (e.g., ATSS: 39.4% [69, 10], GFL: 40.2% [33, 10]). As shown in Table 12e, UniverseNet-20.08 is 1.4× faster than UniverseNet-20.08d at the cost of about a 1% AP drop. UniverseNet-20.08s, the variant with the ResNet-50-C backbone in Table 12g, shows a good speed-accuracy trade-off by achieving 45.8% AP and over 30 FPS.

Generalization. To evaluate the generalization ability, we show results on another dataset outside USB. We trained UniverseNet on NightOwls [41], a dataset for person detection at night, starting from the WOD pre-trained model in Sec. 5.3. The top teams’ results of the NightOwls Detection Challenge 2020 are shown in Table 13. UniverseNet is more accurate than other methods, even without TTA, and should be faster than the runner-up method, which uses larger test scales and a heavy model (Cascade R-CNN, ResNeXt-101, CBNet, Double-Head, DCN, and soft-NMS) [42].

Test scales. We show the results on WOD at different test scales in Figure 4. Single-stage detectors require larger test scales than multi-stage detectors to achieve peak performance, probably because they cannot extract features from precisely localized region proposals. Although ATSEPC shows lower AP than ATSS at the default test scale (1248×832 in Standard USB), it outperforms ATSS at larger test scales (e.g., 1920×1280 in Large USB). We conjecture that we should enlarge object scales in images to utilize SEPC [63] because its DCN [14] enlarges effective receptive fields. SEPC and DCN prefer large objects empirically (see Tables 12a and 12b, and [63, 14]), and DCN [14] cannot, in principle, increase the sampling points for objects smaller than the kernel size. By utilizing the characteristics of SEPC and multi-scale training, UniverseNets achieve the highest AP in a wide range of test scales.
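The test scales quoted above are upper bounds on the resized image rather than fixed output sizes. A minimal sketch of one common keep-aspect-ratio resizing rule follows; whether the authors' pipeline rounds in exactly this way is an assumption, `keep_ratio_size` is a hypothetical helper name, and the 1920×1280 input size is an assumed example.

```python
def keep_ratio_size(width, height, max_long, max_short):
    """Largest (width, height) with the original aspect ratio within the limits.

    max_long bounds the longer image side and max_short the shorter side.
    """
    scale = min(max_long / max(width, height), max_short / min(width, height))
    # Round to the nearest integer pixel.
    return int(width * scale + 0.5), int(height * scale + 0.5)
```

For example, `keep_ratio_size(1920, 1280, 1248, 832)` shrinks a 1920×1280 image to 1248×832, while `keep_ratio_size(640, 480, 1333, 800)` enlarges a 640×480 image until its shorter side reaches 800 pixels.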

6 Conclusions

We introduced USB, a benchmark for universal-scale object detection. To resolve unfair comparisons in existing benchmarks, we established USB training/evaluation protocols. Our UniverseNets achieved state-of-the-art results on USB and existing benchmarks. We found some weaknesses in the existing methods to be addressed in future research.

There are three limitations in this work. (1) USB depends on datasets with many instances. Reliable scale-wise metrics for small datasets should be considered. (2) We adopted single-stage detectors for UniverseNets and trained detectors in the USB 1.0 protocol. Although these settings are practical, it is worth exploring multi-stage detectors in higher protocols. (3) The architectures and results of UniverseNets are still biased toward COCO due to ablation studies and pre-training on COCO. Less biased and more universal detectors should be developed in future research.

The proposed USB protocols can be applied to other tasks with modifications. We believe that our work is an important step toward recognizing universal-scale objects by connecting various experimental settings.


We are grateful to Dr. Hirokatsu Kataoka for helpful comments. We thank all contributors for the datasets and software libraries. The original image of the teaser figure (left) is “satellite office” by Taiyo FUJII (CC BY 2.0).


  • [1] Robust Vision Challenge 2020. Accessed on Nov. 8, 2020.
  • [2] Waymo Open Dataset 2D detection leaderboard. Accessed on June 18, 2020.
  • [3] Kiyoharu Aizawa, Azuma Fujimoto, Atsushi Otsubo, Toru Ogawa, Yusuke Matsui, Koki Tsubota, and Hikaru Ikuta. Building a manga dataset “Manga109” with annotations for multimedia applications. IEEE MultiMedia, 2020.
  • [4] Khalid Ashraf, Bichen Wu, Forrest N. Iandola, Matthew W. Moskewicz, and Kurt Keutzer. Shallow networks for high-accuracy road object-detection. arXiv:1606.01561, 2016.
  • [5] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934, 2020.
  • [6] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-NMS – improving object detection with one line of code. In ICCV, 2017.
  • [7] Zhaowei Cai. Towards Universal Object Detection. PhD thesis, UC San Diego, 2019.
  • [8] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
  • [9] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [10] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and benchmark. arXiv:1906.07155, 2019.
  • [11] Sijia Chen, Yu Wang, Li Huang, Runzhou Ge, Yihan Hu, Zhuangzhuang Ding, and Jie Liao. 2nd place solution for Waymo Open Dataset challenge – 2D object detection. arXiv:2006.15507, 2020.
  • [12] Yihong Chen, Zheng Zhang, Yue Cao, Liwei Wang, Stephen Lin, and Han Hu. RepPoints v2: Verification meets regression for object detection. In NeurIPS, 2020.
  • [13] Cheng Chi, Fangyun Wei, and Han Hu. RelationNet++: Bridging visual representations for object detection via transformer decoder. In NeurIPS, 2020.
  • [14] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
  • [15] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2012.
  • [16] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, and Xiaodan Song. SpineNet: Learning scale-permuted backbone for recognition and localization. In CVPR, 2020.
  • [17] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes challenge: A retrospective. IJCV, 2015.
  • [18] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD : Deconvolutional single shot detector. arXiv:1701.06659, 2017.
  • [19] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2Net: A new multi-scale backbone architecture. TPAMI, 2020.
  • [20] Shang-Hua Gao et al. Res2Net Pretrained Models. Accessed on May 20, 2020.
  • [21] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, 2012.
  • [22] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron, 2018.
  • [23] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
  • [24] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
  • [25] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. In ICCV, 2019.
  • [26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [27] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In CVPR, 2019.
  • [28] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
  • [29] Zehao Huang, Zehui Chen, Qiaofei Li, Hongkai Zhang, and Naiyan Wang. 1st place solutions of Waymo Open Dataset challenge 2020 – 2D object detection track. arXiv:2008.01365, 2020.
  • [30] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In CVPR, 2018.
  • [31] Kang Kim and Hee Seok Lee. Probabilistic anchor assignment with IoU prediction for object detection. In ECCV, 2020.
  • [32] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
  • [33] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized Focal Loss: Learning qualified and distributed bounding boxes for dense object detection. In NeurIPS, 2020.
  • [34] Tsung-Yi Lin, Piotr Dollár, et al. COCO API. Accessed on Nov. 8, 2020.
  • [35] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [36] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
  • [37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [38] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. IJCV, 2020.
  • [39] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • [40] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using Manga109 dataset. Multimedia Tools and Applications, 2017.
  • [41] Lukáš Neumann, Michelle Karg, Shanshan Zhang, Christian Scharfenberger, Eric Piegert, Sarah Mistr, Olga Prokofyeva, Robert Thiel, Andrea Vedaldi, Andrew Zisserman, and Bernt Schiele. NightOwls: A pedestrians at night dataset. In ACCV, 2018.
  • [42] Lukáš Neumann, Yosuke Shinya, and Zhenyu Xu. NightOwls detection challenge. Presentations at CVPR Workshop on Scalability in Autonomous Driving, 2020.
  • [43] Toru Ogawa, Atsushi Otsubo, Rei Narita, Yusuke Matsui, Toshihiko Yamasaki, and Kiyoharu Aizawa. Object detection for comics using Manga109 annotations. arXiv:1803.08670, 2018.
  • [44] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. MegDet: A large mini-batch object detector. In CVPR, 2018.
  • [45] Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv:2006.02334, 2020.
  • [46] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, real-time object detection. In CVPR, 2016.
  • [47] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [48] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based face detection. TPAMI, 1998.
  • [49] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  • [50] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. Communications of the ACM, 2020.
  • [51] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019.
  • [52] Yosuke Shinya, Edgar Simo-Serra, and Taiji Suzuki. Understanding the effects of pre-training for object detectors via eigenspectrum. In ICCV Workshop on Neural Architects, 2019.
  • [53] Bharat Singh and Larry S. Davis. An analysis of scale invariance in object detection – SNIP. In CVPR, 2018.
  • [54] Guanglu Song, Yu Liu, and Xiaogang Wang. Revisiting the sibling head in object detector. In CVPR, 2020.
  • [55] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo Open Dataset. In CVPR, 2020.
  • [56] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [57] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [58] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
  • [59] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020.
  • [60] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. arXiv:1911.09070v7, 2020.
  • [61] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
  • [62] Xudong Wang, Zhaowei Cai, Dashan Gao, and Nuno Vasconcelos. Towards universal object detection by domain attention. In CVPR, 2019.
  • [63] Xinjiang Wang, Shilong Zhang, Zhuoran Yu, Litong Feng, and Wayne Zhang. Scale-equalizing pyramid convolution for object detection. In CVPR, 2020.
  • [64] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [65] Hang Xu, Chenhan Jiang, Dapeng Feng, Chaoqiang Ye, Rui Sun, and Xiaodan Liang. SPNAS-Noah: Single Cascade-RCNN with backbone architecture adaption for Waymo 2D detection. Accessed on June 21, 2020.
  • [66] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. WIDER FACE: A face detection benchmark. In CVPR, 2016.
  • [67] Lewei Yao, Hang Xu, Wei Zhang, Xiaodan Liang, and Zhenguo Li. SM-NAS: Structural-to-modular neural architecture search for object detection. In AAAI, 2020.
  • [68] Xuehui Yu, Yuqi Gong, Nan Jiang, Qixiang Ye, and Zhenjun Han. Scale match for tiny person detection. In WACV, 2020.
  • [69] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, 2020.

Supplementary Material

Appendix A Details of Related Work

a.1 Components for Multi-Scale Object Detection

Backbones and modules. The Inception module [56] arranges 1×1, 3×3, and 5×5 convolutions to cover multi-scale regions. The residual block [26] adds multi-scale features from shortcut connections and convolutions. ResNet-C and ResNet-D [27] replace the first layer of ResNet with the deep stem (three 3×3 convolutions) [57]. The Res2Net module [19] stacks convolutions hierarchically to represent multi-scale features. Res2Net-v1b [20] adopts the deep stem with the Res2Net module. The deformable convolution module in Deformable Convolutional Networks (DCN) [14] adjusts the receptive field adaptively by deforming the sampling locations of standard convolutions. These modules are mainly used in backbones.

Necks. To combine and enhance the backbones’ representations, necks follow backbones. Feature Pyramid Networks (FPN) [35] adopt a top-down path and lateral connections, like architectures for semantic segmentation. Scale-Equalizing Pyramid Convolution (SEPC) [63] introduces pyramid convolution across feature maps with different resolutions and utilizes DCN to align the features.

Heads and training sample selection. Faster R-CNN [47] spreads multi-scale anchors over a feature map. SSD [39] spreads multi-scale anchors over multiple feature maps with different resolutions. ATSS [69] eliminates the need for multi-scale anchors by dividing positive and negative samples according to object statistics across pyramid levels.
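The statistics-based split used by ATSS [69] can be sketched for a single ground-truth box as follows. This is a simplified illustration rather than the reference implementation: it assumes the candidate anchors (the closest anchors from each pyramid level) have already been gathered, and the use of the sample standard deviation is an assumption of the sketch.

```python
from statistics import mean, stdev


def atss_select(candidate_ious, center_inside_gt):
    """Select positive candidates for one ground-truth box, ATSS-style.

    candidate_ious: IoUs between the ground truth and its candidate anchors.
    center_inside_gt: one bool per candidate, True if the anchor center lies
        inside the ground-truth box.
    Returns (threshold, indices of the positive candidates).
    """
    # Adaptive IoU threshold: mean plus standard deviation of candidate IoUs.
    threshold = mean(candidate_ious) + stdev(candidate_ious)
    positives = [
        i
        for i, (value, inside) in enumerate(zip(candidate_ious, center_inside_gt))
        if value >= threshold and inside
    ]
    return threshold, positives
```

Because the threshold is derived from each object's own candidate IoUs, no hand-tuned IoU cutoff or per-level anchor scale assignment is needed; candidates that are not selected are treated as negatives.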

Multi-scale training and testing. Traditionally, the image pyramid is an essential technique to handle multi-scale objects [48]. Although recent detectors can output multi-scale objects from a single-scale input, many works use multi-scale inputs to improve performance [47, 36, 69, 63]. In a popular implementation [10], multi-scale training randomly chooses a scale at each iteration for (training-time) data augmentation. Multi-scale testing infers multi-scale inputs and merges their outputs for Test-Time Augmentation (TTA). SNIP [53] limits the range of object scales at each image scale during training and testing.
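The per-iteration scale sampling described above can be sketched as follows. The shorter-side range and the longer-side limit used as defaults are illustrative assumptions, and `sample_train_scale` is a hypothetical helper rather than a function of any particular library.

```python
import random


def sample_train_scale(short_range=(480, 960), long_side=1333, rng=random):
    """Pick one training scale for the current iteration.

    The shorter-side limit is drawn uniformly from short_range (inclusive),
    while the longer-side limit stays fixed. Returns (long_side, short_side).
    """
    short_side = rng.randint(short_range[0], short_range[1])
    return long_side, short_side
```

Passing a seeded `random.Random` instance as `rng` makes the sampled sequence reproducible, which helps when comparing training runs.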

Appendix B Details of Protocols

b.1 Dataset Splits for Manga109-s

Volume Genre
15test set:
Aku-Ham Four-frame cartoons
Bakuretsu! Kung Fu Girl Romantic comedy
Doll Gun Battle
Eva Lady Science fiction
Hinagiku Kenzan! Love romance
Kyokugen Cyclone Sports
Love Hina vol. 1 Romantic comedy
Momoyama Haikagura Historical drama
Tennen Senshi G Humor
Uchi no Nyan’s Diary Animal
Unbalance Tokyo Science fiction
Yamato no Hane Sports
Youma Kourin Fantasy
Yume no Kayoiji Fantasy
Yumeiro Cooking Love romance
4val set:
Healing Planet Science fiction
Love Hina vol. 14 Romantic comedy
Seijinki Vulnus Battle
That’s! Izumiko Fantasy
68train set: All the other volumes
Table 14: Manga109-s dataset splits (87 volumes in total).

The Manga109-s dataset (87 volumes) is a subset of the full Manga109 dataset (109 volumes) [3]. Unlike the full Manga109 dataset, the Manga109-s dataset can be used by commercial organizations. The dataset splits for the full Manga109 dataset used in prior work [43] cannot be used for the Manga109-s dataset.

We defined the Manga109-s dataset splits shown in Table 14. Unlike alphabetical order splits used in the prior work [43], we selected the volumes carefully. The 15test set was selected to be well-balanced for reliable evaluation. Five volumes in the 15test set were selected from the 10 test volumes used in [43] to enable partially direct comparison. All the authors of the 15test and 4val set are different from those of the 68train set to evaluate generalizability.

b.2 Number of Images

There are 118,287 images in COCO train2017, 5,000 in COCO val2017, 79,735 in WOD f0train, 20,190 in WOD f0val, 6,760 in M109s 68train, 419 in M109s 4val, and 1,354 in M109s 15test.

Method Head Neck Backbone Input FPS COCO (1× schedule): AP AP50 AP75 APS APM APL
RetinaNet [36] 33.9 36.5 55.4 39.1 20.4 40.3 48.1
ATSS [69] 35.2 39.4 57.6 42.8 23.6 42.9 50.3
GFL [33] 37.2 40.2 58.4 43.3 23.3 44.0 52.2
ATSEPC [69, 63] P, LC 25.0 42.1 59.9 45.5 24.6 46.1 55.0
UniverseNet P, LC c3-c5 17.3 46.7 65.0 50.7 29.2 50.6 61.4
UniverseNet+GFL P, LC c3-c5 17.5 47.5 65.8 51.8 29.2 51.6 62.5
UniverseNet-20.08d P, LC c3-c5 17.3 48.6 67.1 52.7 30.1 53.0 63.8
UniverseNet-20.08 LC c5 24.9 47.5 66.0 51.9 28.9 52.1 61.9
UniverseNet-20.08 w/o SEPC [63] c5 26.7 45.8 64.6 50.0 27.6 50.4 59.7
UniverseNet-20.08 w/o Res2Net-v1b [19, 20] LC c5 32.8 44.7 62.8 48.4 27.1 48.8 59.5
UniverseNet-20.08 w/o DCN [14] 27.8 45.9 64.5 49.8 28.9 49.9 59.0
UniverseNet-20.08 w/o iBN, SyncBN [63, 44] LC c5 25.7 45.8 64.0 50.2 27.9 50.0 59.8
UniverseNet-20.08 w/o MStrain LC c5 24.8 45.9 64.5 49.6 27.4 50.5 60.1
Table 15: Architectures of UniverseNets with a summary of ablation studies on COCO minival. See Sec. D.4 for step-by-step improvements. All results are based on MMDetection [10] v2. The “Head” methods (ATSS and GFL) affect losses and training sample selection. Res2: Res2Net-v1b [19, 20]. PConv (Pyramid Convolution) and iBN (integrated Batch Normalization) are the components of SEPC [63]. The DCN columns indicate where to apply DCN. “P”: The PConv modules in the combined head of SEPC [63]. “LC”: The extra head of SEPC for localization and classification [63]. “c3-c5”: conv3_x, conv4_x, and conv5_x layers in ResNet-style backbones [26]. “c5”: conv5_x layers in ResNet-style backbones [26]. ATSEPC: ATSS with SEPC (without iBN). MStrain: Multi-scale training. FPS: Frames per second on one V100 with mixed precision.

b.3 Exceptions

The rounding error of epochs between epoch- and iteration-based training can be ignored when calculating the maximum epochs. Small differences of eight pixels or less can be ignored when calculating the maximum resolution. For example, DSSD513 [18] will be compared in Mini USB.
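A short worked example may help with both exceptions. The sketch below converts an iteration-based schedule into epochs and checks the eight-pixel tolerance. The 90,000-iteration, 16-images-per-iteration schedule is an assumed Detectron-style example, the 512-pixel limit for Mini USB is inferred from the 512×512 rows of Table 10, and the helper names are hypothetical.

```python
COCO_TRAIN2017_IMAGES = 118_287  # number of images stated in Sec. B.2


def iterations_to_epochs(iterations, batch_size, num_images):
    """Number of passes over the dataset made by iteration-based training."""
    return iterations * batch_size / num_images


def within_pixel_tolerance(side, limit, tolerance=8):
    """True if `side` exceeds `limit` by at most `tolerance` pixels."""
    return side - limit <= tolerance


# Assumed example: 90,000 iterations with 16 images per iteration.
epochs = iterations_to_epochs(90_000, 16, COCO_TRAIN2017_IMAGES)
```

Here `epochs` is about 12.17, and the fractional part is the kind of rounding error that the first exception allows one to ignore. Likewise, `within_pixel_tolerance(513, 512)` holds, which is why a 513-pixel input such as DSSD513 can still be compared in Mini USB.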

Appendix C Details of UniverseNets

We show the detailed architectures of UniverseNets in Table 15.

Appendix D Details of Experiments

Here, we show the details of experimental settings and results. See also the code to reproduce our settings including minor hyperparameters.

d.1 Common Settings

We follow the learning rate schedules of MMDetection [10], which are similar to those of Detectron [22]. Specifically, the learning rates are reduced by 10× at two predefined epochs. The epochs for the first learning rate decay, the second decay, and the end of training are (8, 11, 12) for the 1× schedule, (16, 22, 24) for the 2× schedule, and (16, 19, 20) for the 20e schedule. To avoid overfitting caused by small learning rates [52], the 20e schedule is reasonable.
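The step schedules can be written down compactly. The sketch below assumes the MMDetection-style milestones (8, 11, 12) for 1×, (16, 22, 24) for 2×, and (16, 19, 20) for 20e, zero-indexed epochs, and a decay factor of 0.1; warmup is omitted, and the base learning rate in the usage note is only an example value.

```python
# (first decay epoch, second decay epoch, total epochs) per schedule.
SCHEDULES = {
    "1x": (8, 11, 12),
    "2x": (16, 22, 24),
    "20e": (16, 19, 20),
}


def learning_rate(base_lr, epoch, schedule, gamma=0.1):
    """Learning rate used during the given zero-indexed epoch."""
    first_decay, second_decay, total_epochs = SCHEDULES[schedule]
    if not 0 <= epoch < total_epochs:
        raise ValueError(f"epoch must be in [0, {total_epochs})")
    # Count how many decay milestones have been reached so far.
    num_decays = int(epoch >= first_decay) + int(epoch >= second_decay)
    return base_lr * gamma ** num_decays
```

With `base_lr=0.01`, the 20e schedule keeps 0.01 for epochs 0 to 15, uses 0.001 for epochs 16 to 18, and spends only the final epoch at 0.0001, so less training time is spent at the smallest learning rate than in the 2× schedule.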

We mainly used ImageNet pre-trained backbones that are standard in MMDetection [10]. Some pre-trained Res2Net backbones not supported in MMDetection were downloaded from the Res2Net repository [20]. We trained most models with mixed precision and 4 GPUs (× 4 images per GPU). All results on USB and all results of UniverseNets are single-model results without ensembles.

d.2 Settings on COCO

For comparison with state-of-the-art methods with TTA on COCO, we used soft voting with 13-scale testing and horizontal flipping, following the original implementation of ATSS [69]. Specifically, the shorter side pixels are (400, 500, 600, 640, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1800), while the longer side pixels are 1.667× those values. For the 13 test scales, target objects are limited to the corresponding 13 predefined ranges ((96, ∞), (96, ∞), (64, ∞), (64, ∞), (64, ∞), (0, ∞), (0, ∞), (0, ∞), (0, 256), (0, 256), (0, 192), (0, 192), (0, 96)), where each tuple denotes the minimum and maximum object scales. Each object scale is measured by √(wh), where w and h denote the object’s width and height, respectively. We also evaluated 5-scale TTA because the above-mentioned ATSS-style TTA is slow. We picked (400, 600, 800, 1000, 1200) for the shorter side pixels and ((96, ∞), (64, ∞), (0, ∞), (0, ∞), (0, 256)) for the object scale ranges.
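The 13-scale setting can be expressed as a short sketch. It rests on several assumptions: the upper bounds elided in the ranges are treated as unbounded, the object scale is taken to be the square root of width times height, the 1.667 factor is approximated by 5/3 so that 800 maps to 1333 and 1800 to 3000 as in Table 10, the range bounds are inclusive, and boxes are `(x1, y1, x2, y2)` tuples. The helper names are hypothetical, and the soft-voting merge itself is not shown.

```python
import math

SHORT_SIDES = (400, 500, 600, 640, 700, 800, 900,
               1000, 1100, 1200, 1300, 1400, 1800)
INF = math.inf
SCALE_RANGES = ((96, INF), (96, INF), (64, INF), (64, INF), (64, INF),
                (0, INF), (0, INF), (0, INF), (0, 256), (0, 256),
                (0, 192), (0, 192), (0, 96))


def tta_scales(short_sides):
    """(longer side, shorter side) limits for every test scale."""
    return [(round(s * 5 / 3), s) for s in short_sides]


def object_scale(box):
    """Square root of the box area for a box given as (x1, y1, x2, y2)."""
    width, height = box[2] - box[0], box[3] - box[1]
    return math.sqrt(width * height)


def filter_by_scale(boxes, scale_range):
    """Keep only the boxes whose object scale lies inside scale_range."""
    low, high = scale_range
    return [box for box in boxes if low <= object_scale(box) <= high]
```

The pairing of scales and ranges follows a simple logic: small inputs (for example, a 400-pixel shorter side) keep only large objects, while the largest inputs keep only small objects, so each test scale contributes the detections it is likely to localize best.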

d.3 Settings on NightOwls

NightOwls [41] is a dataset for person detection at night. It contains three categories (pedestrian, bicycle driver, and motorbike driver). In contrast to WOD, it is important to detect medium or large objects because the evaluation of NightOwls follows the reasonable setting [15] where small objects (less than 50 pixels tall) are ignored. We prevented the overfitting of the driver categories (bicycle driver and motorbike driver) in two ways. The first is to map the classifier layer of the WOD pre-trained model. We transferred the weights for cyclists learned on the richer WOD to those for the NightOwls driver categories. The second is early stopping. We trained the model for 2 epochs (4,554 iterations) without background images.
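The classifier mapping can be illustrated with plain lists standing in for the weight tensor. The class orders, the category names, and the pedestrian-to-pedestrian mapping are assumptions made for this sketch; the text only states that the cyclist weights were transferred to the two driver categories.

```python
# Assumed class orders; the actual label orders may differ.
WOD_CLASSES = ("vehicle", "pedestrian", "cyclist")
NIGHTOWLS_CLASSES = ("pedestrian", "bicycledriver", "motorbikedriver")

# Which WOD class initializes each NightOwls class.
SOURCE_CLASS = {
    "pedestrian": "pedestrian",
    "bicycledriver": "cyclist",
    "motorbikedriver": "cyclist",
}


def map_classifier(weight, bias):
    """Build NightOwls classifier parameters from WOD ones.

    weight: one list of weights per WOD class; bias: one value per WOD class.
    """
    new_weight, new_bias = [], []
    for name in NIGHTOWLS_CLASSES:
        index = WOD_CLASSES.index(SOURCE_CLASS[name])
        new_weight.append(list(weight[index]))  # copy, so rows stay independent
        new_bias.append(bias[index])
    return new_weight, new_bias
```

Both driver categories start from the same cyclist weights and only diverge during fine-tuning, which is consistent with the short two-epoch schedule described above.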

d.4 Ablation Studies for UniverseNets

We describe the results of ablation studies for UniverseNets on COCO in more detail. As shown in Table 12a, ATSEPC (ATSS [69] with SEPC without iBN [63]) outperforms ATSS by a large margin. The effectiveness of SEPC for ATSS is consistent with that for other detectors reported in the SEPC paper [63]. As shown in Table 12b, UniverseNet further improves AP metrics by about 5% by adopting Res2Net-v1b [19, 20], DCN [14], and multi-scale training. As shown in Table 12c, adopting GFL [33] improves AP by 0.8%. There is room for improvement of AP in the Quality Focal Loss of GFL [33]. As shown in Table 12d, UniverseNet-20.08d achieves 48.6% AP by making more use of BatchNorm (SyncBN [44] and iBN [63]). It is much more accurate than other models trained for 12 epochs using ResNet-50-level backbones (e.g., ATSS: 39.4% [69, 10], GFL: 40.2% [33, 10]). On the other hand, the inference is not so fast (less than 20 FPS) due to the heavy use of DCN [14]. UniverseNet-20.08 speeds up inference by the light use of DCN [14, 63]. As shown in Table 12e, UniverseNet-20.08 is 1.4× faster than UniverseNet-20.08d at the cost of about a 1% AP drop. To further verify the effectiveness of each technique, we conducted ablation from UniverseNet-20.08, as shown in Table 12f. All techniques contribute to the high AP of UniverseNet-20.08. Ablating the Res2Net-v1b backbone (replacing Res2Net-50-v1b [19, 20] with ResNet-50-B [27]) has the largest effect. Res2Net-v1b improves AP by 2.8% and increases the inference time by 1.3×. To further investigate the effectiveness of backbones, we trained variants of UniverseNet-20.08, as shown in Table 12g. Although the Res2Net module [19] makes inference slower, the deep stem used in ResNet-50-C [27] and Res2Net-50-v1b [19, 20] improves AP metrics with similar speeds. UniverseNet-20.08s (the variant using the ResNet-50-C backbone) shows a good speed-accuracy trade-off by achieving 45.8% AP and over 30 FPS.

(a) COCO-style AP
(b) KITTI-style AP
Figure 5: Test scales vs. different AP metrics on WOD f0val.

d.5 Differences by Metrics

To analyze differences by metrics, we evaluated the KITTI-style AP (KAP) on WOD. KAP is a metric used in benchmarks for autonomous driving [21, 55]. It is calculated using different IoU thresholds (0.7 for vehicles, and 0.5 for pedestrians and cyclists). The results of KAP are shown in Figure 5b. For ease of comparison, we show the results of CAP again in Figure 5a. GFL [33] and Cascade R-CNN [8], which focus on localization quality, are less effective for KAP.
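For reference, one plausible way to write KAP is given below. It assumes that KAP is the unweighted mean of the three per-class APs at the class-specific IoU thresholds stated above; both this form and its notation are assumptions for illustration rather than the definition taken from the text.

```latex
\mathrm{KAP} = \frac{1}{3}\left(
  \mathrm{AP}_{0.7}^{\mathrm{vehicle}}
  + \mathrm{AP}_{0.5}^{\mathrm{pedestrian}}
  + \mathrm{AP}_{0.5}^{\mathrm{cyclist}}
\right)
```

Under this reading, a detector gains little KAP from localizing pedestrians or cyclists more tightly than IoU 0.5, which would be consistent with the observation that methods focused on localization quality are less effective for KAP.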

d.6 Effects of COCO Pre-Training

To verify the effects of COCO pre-training, we trained UniverseNet-20.08 on M109s from different pre-trained models. Table 16 shows the results. COCO pre-training improves all the metrics, especially body AP.

We also trained models with the eight methods on M109s from ImageNet pre-trained backbones. We halved the learning rates in Table 5 and doubled warmup iterations [23] (from 500 to 1,000) because the training of single-stage detectors without COCO pre-training or SyncBN [44] is unstable. The CAP without COCO pre-training is 1.9% lower than that with COCO pre-training (Table 6) on average.

Pre-training AP AP50 AP75 APS APM APL body face frame text
ImageNet 68.9 92.2 73.3 19.9 42.6 75.8 64.3 47.6 93.0 70.7
COCO 1× 69.9 92.5 74.3 20.5 43.6 77.1 66.6 48.0 93.7 71.2
COCO 2× 69.8 92.3 74.0 20.5 43.4 77.0 66.5 47.8 93.8 71.2
Table 16: UniverseNet-20.08 fine-tuned on Manga109-s 15test from different pre-trained models.