Open Images Detection Dataset V5 (OID) is currently the largest publicly available object detection dataset, containing millions of annotated images and bounding boxes. The diversity of images in a training dataset is the driving force behind the generalizability of machine learning models. Models successfully trained on OID would push the frontier of object detectors with the help of this data.
Since the number of images in OID is extremely large, training speed is critical. For faster training, we use Fast R-CNN instead of the more commonly used Faster R-CNN. Fast R-CNN omits time-consuming online RoI (Region of Interest) generation during training by pre-computing RoIs. We find that the selection of pre-computed RoIs plays an important role in achieving good accuracy. For instance, when the number of pre-computed RoIs used during training is small, the network overfits to those RoIs. By default, RoIs used to train Faster R-CNN have high variation, so this problem is unique to Fast R-CNN.
OID is a federated object detection dataset [5, 8]. This means that for each image, only a subset of categories is annotated. This is in contrast to exhaustively annotated datasets such as COCO. Federated annotation is a realistic approach to expanding the number of categories covered by a dataset: without sparsifying the annotated categories, the number of required annotations would explode as the total number of categories increases.
When training a detector on exhaustively annotated datasets like COCO, the loss functions assume that no object lies inside an unannotated region. For federated object detection datasets, however, this assumption may be violated, because some regions contain an object of an unverified category. We handle this problem by ignoring the loss for unverified categories.
In addition to the aforementioned uniqueness of OID, the dataset poses an unprecedented class imbalance for an object detection dataset. The rarest category, Pressure Cooker, is annotated in only a handful of images, while the most common category, Person, is annotated in orders of magnitude more. The ratio between the occurrences of the most common and the least common category is far larger than in COCO. Typically, such class imbalance can be handled by over-sampling images containing rare categories. However, over-sampling may degrade performance for common categories.
As a practical method to address this class imbalance, we train models exclusively on rare classes and ensemble them with the rest of the models, similar to our method in last year's competition.
To summarize our major contributions:
Fast R-CNN: We present the effectiveness of Fast R-CNN and propose methods to alleviate the performance penalty introduced by pre-computed RoIs.
Using Only Verified Categories: We find that it is helpful to ignore unverified categories during training.
Expert Models: We show the effectiveness of using expert models, especially for classes that rarely appear in the dataset.
2.1 Model architecture
Two-stage object detectors such as Faster R-CNN are known to achieve excellent accuracy, but their GPU usage efficiency during training is sub-optimal. This is because the RoIs used to train the R-CNN heads are determined between feature extraction and loss calculation; GPUs are thus forced to wait for the ground-truth assignment of RoIs. To make more efficient use of GPUs, we use Fast R-CNN, which pre-computes RoIs and assigns ground truths to RoIs in parallel with feature extraction.
For the instance segmentation track, we add a mask head that predicts the segmentation of an object given a region around it and its category. The categories for instance segmentation are a subset of the categories for detection. We use all the detection categories even for training the instance segmentation model, although only a subset of these have instance masks. This worked better than using only the instance segmentation categories in preliminary experiments.
2.2 Learning with a federated dataset
In federated object detection datasets [8, 5], for each image, categories are grouped into positively verified, negatively verified and unverified. For positively verified categories, annotations are exhaustively made for all objects of those categories. For negatively verified categories, the annotators have made sure that the image contains no objects of those categories. For unverified categories, the objects of those categories may or may not exist in the image.
During training, each RoI is assigned to one of the ground-truth boxes if any of them has a sufficiently large intersection with it. This assignment is used to calculate the classification loss and the localization loss. In this work, the classification loss is calculated as the sum of the sigmoid cross entropy losses over all proposals and categories:

$L_{\mathrm{cls}} = \sum_{i} \sum_{c} \mathrm{CE}\left(p_{i,c}, t_{i,c}\right),$

where $\mathrm{CE}$ is the sigmoid cross entropy, $p_{i,c}$ is the predicted probability that the $i$-th RoI contains an object of category $c$, and $t_{i,c} = 1$ and $t_{i,c} = 0$ when the $i$-th RoI is assigned or not assigned to the category $c$, respectively. Also, $t_{i,c}$ can be set to $-1$, which means that the classification loss for the category $c$ is ignored for the $i$-th RoI.
When the RoI is not assigned to any of the ground-truth boxes, we set $t_{i,c} = 0$ for positively and negatively verified categories and $t_{i,c} = -1$ for unverified categories. When the RoI is assigned to a ground-truth bounding box with the category $c^{*}$, we set $t_{i,c^{*}} = 1$ and $t_{i,c} = 0$ for all negatively and positively verified categories except for the category $c^{*}$. We set $t_{i,c} = -1$ for unverified categories. In practice, verified categories are expanded based on the category hierarchy.
2.3 Expert models
In OID, there is an extreme class imbalance, which makes it difficult for a model to learn rare categories. For instance, some classes are annotated in only a small number of images, while the most common class, Person, is annotated in orders of magnitude more. We use expert models fine-tuned from a model trained on the entire dataset, as done in our previous year's submission.
We select a subset of categories to which an expert model is trained based on one of the following criteria:
Occurrence ranking of categories in the detection subset. We group categories that are close in this ranking so that sampling imbalance does not occur among the categories in a subset.
Occurrence ranking in the instance segmentation subset.
When training an expert model, the annotations of categories that the expert is not responsible for are dropped. Images that do not contain an annotation of target categories are discarded. Also, the bias term of the classification layer is reinitialized so that the network only outputs target categories.
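The data preparation for one expert model can be sketched as below. This is a hypothetical sketch with an assumed (image_id, annotations) dataset layout, not the actual pipeline:

```python
def filter_for_expert(dataset, expert_categories):
    """Keep only annotations of the expert's target categories, and
    discard images that contain no annotation of those categories.

    dataset: iterable of (image_id, annotations), where each annotation
    is a dict with at least a "category" key (assumed layout).
    """
    expert_categories = set(expert_categories)
    filtered = []
    for image_id, annotations in dataset:
        kept = [a for a in annotations if a["category"] in expert_categories]
        if kept:  # images without any target annotation are dropped
            filtered.append((image_id, kept))
    return filtered
```

After filtering, fine-tuning starts from the full-dataset model with the classification-layer bias reinitialized, as described above.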
When aggregating predictions from multiple models, we apply non-maximum suppression to the predictions from each model and then apply suppression once again to the concatenation of all predictions. In the second suppression step, we group predictions that have a large enough intersection and are assigned to the same category. We compute a representative prediction from each group, and the collection of these representatives is the final prediction. Given a group, the bounding box of the group's representative is the bounding box of the most confident prediction in the group. The segmentation is calculated as the average of the segmentations in the group, weighted by confidence and spatial proximity.
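The representative computation for one group might look like the sketch below. For brevity it weights masks by confidence only, omitting the spatial-proximity term, and all names are assumptions:

```python
import numpy as np

def fuse_group(boxes, scores, masks):
    """Fuse one group of overlapping, same-category predictions.

    Box and score come from the most confident member; the mask is a
    confidence-weighted average of member masks, thresholded at 0.5.
    """
    scores = np.asarray(scores, dtype=np.float64)
    best = int(scores.argmax())
    weights = scores / scores.sum()
    # (n,) weights against (n, H, W) masks -> (H, W) weighted average
    avg_mask = np.tensordot(weights, np.asarray(masks, dtype=np.float64), axes=1)
    return boxes[best], float(scores[best]), avg_mask >= 0.5
```

A full implementation would first build the groups with an IoU test per category, then call such a fusion routine on each group.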
2.5 Pre-computed RoIs
The RoIs used by Faster R-CNN are conditioned on the model weights, so high variation of RoIs is achieved without any effort. This variation is lost when Fast R-CNN models share the same set of pre-computed RoIs, which comes at the cost of degraded performance. During training, the RoIs used to compute the head losses are sampled from a pool of RoIs. For training Fast R-CNN, this pool is pre-computed, and we find that the number of RoIs in the pool needs to be very high for the network not to overfit to them. In many published works using a variant of Faster R-CNN, the number of RoIs in the pool for each image is relatively small. We find that this is not large enough, so we prepare a much larger number of RoIs per image.
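The per-iteration sampling step can be sketched as follows; a minimal sketch in which the pool layout and sample size are placeholders:

```python
import numpy as np

def sample_rois(roi_pool, n_sample, rng):
    """Draw the RoIs used for one training iteration from a large
    pre-computed pool.  A small pool exposes the head to the same RoIs
    every epoch, so it overfits to them; a large pool restores some of
    the variation Faster R-CNN gets for free from online proposals.
    """
    idx = rng.choice(len(roi_pool), size=n_sample, replace=False)
    return roi_pool[idx]
```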
The selection of pre-computed RoIs is also important when ensembling models: the set of RoIs used by each model should differ, because predictions made with different sets of RoIs complement each other better.
As stated in the competition description page, annotated objects are larger than a minimum size (see https://storage.googleapis.com/openimages/web/factsfigures.html). Thus, predictions with small segmentations are unlikely to be counted as true positives during evaluation. We omitted predictions whose segmentation areas fall below a threshold number of pixels.
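This filtering step amounts to something like the sketch below, where min_area and the prediction dict layout are assumptions:

```python
import numpy as np

def drop_small_masks(predictions, min_area):
    """Remove predictions whose binary mask covers fewer than min_area
    pixels; such tiny objects are not annotated, so these predictions
    cannot become true positives."""
    return [p for p in predictions if int(p["mask"].sum()) >= min_area]
```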
The competition submission file size is limited to 5 GB. Our submission file sometimes exceeded this limit, especially after ensembling. We find that predictions for some categories occur much more frequently than the rest, and these excess predictions are likely to be less important for a class-averaged metric. We therefore drop predictions of frequently predicted categories to meet the file size limit.
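One way to implement this is a per-category cap that drops the least confident predictions of over-represented categories first; a hypothetical sketch, since the exact dropping rule is not specified here:

```python
from collections import Counter

def trim_frequent_categories(predictions, max_per_category):
    """Cap the number of predictions kept per category, dropping the
    least confident predictions of over-represented categories first."""
    kept, counts = [], Counter()
    for p in sorted(predictions, key=lambda p: -p["score"]):
        if counts[p["category"]] < max_per_category:
            kept.append(p)
            counts[p["category"]] += 1
    return kept
```

Because the evaluation metric averages over classes, trimming the tail of an over-predicted class costs little while recovering substantial file size.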
We used COCO, LVIS and Objects365 as external data. We use Feature Pyramid Networks for our experiments, with SENet as the feature extractor. The initial bias of the final classification layer is set to a large negative number to prevent unstable training in the beginning. We initialize the weights of the base extractor with those of an image classification network trained on the ImageNet classification task. We use stochastic gradient descent with momentum for optimization. We used a large GPU cluster; the best single model is trained with a large batch size, and we used multi-node batch normalization to make training stable. We trained for up to 24 epochs. The learning rate is scheduled by a cosine function $\eta_t = \frac{1}{2}\,\eta_0 \left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$, where $\eta_t$ and $\eta_0$ are the learning rate at step $t$ and the initial learning rate, and $T$ is the total number of training steps. We scale images during training so that the length of the smaller edge falls within a fixed range, and we randomly flip images horizontally to augment the training data. In addition, for training expert models, we used an augmentation policy found by AutoAugment. During inference, we did not use any test-time augmentation. We used non-maximum suppression with a fixed intersection-over-union threshold. We use Chainer [16, 1, 12] in our experiments.
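The cosine schedule can be written as a small function; a minimal sketch assuming the standard cosine annealing form that decays from the initial rate to zero:

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine learning-rate schedule: base_lr at step 0, 0 at the end."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

Compared to step decay, the smooth decay avoids abrupt loss spikes and needs no hand-tuned milestone epochs.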
3.1 Pre-computed RoIs
To study the best ensembling strategy of Fast R-CNN models, we conducted an ablative study of ensembles of predictions from two models with two sets of RoIs. The result is shown in Table 1. When ensembling predictions from two models using the same set of RoIs, the performance did not improve from the single model result. By using different sets of RoIs, the ensemble outperformed the single model.
|Model|RoI|Segmentation val mAP|
|A, B|1, 1| |
|A, B|1, 2| |
3.2 Expert models
Table 2 shows an ablative study of expert models with different numbers of assigned categories. This is the result of training expert models for the 50th to 99th rarest categories. When training multiple expert models for these categories, the categories are split into disjoint sets based on how frequently they appear in OID. For instance, when training two expert models, the first is responsible for the 50th to 74th rarest categories and the second for the 75th to 99th rarest categories. As seen in the table, when each expert is responsible for fewer categories, the performance for each category improves on average. Since the computational budget is limited, it is difficult to keep the number of categories assigned to each expert small. Thus, the number of categories each expert model is responsible for in our final submission varies.
|# of categories per expert|# of experts|detection val mAP|
3.3 Competition results
Our final submission consists of the predictions from the following models:
Two Fast R-CNN models trained on all detection categories. One of them is trained for 16 epochs and the other for 24 epochs.
Fast R-CNN expert models, each predicting a small number of categories on average.
For instance segmentation, all ensemble results exceeded the file size limit of 5 GB for the test set, so we needed to drop some predictions for the frequently predicted categories. As a result, when adding more models, the instance segmentation test results did not improve as much as the validation scores and the object detection results.
| |val|public test|private test|
|Full (16 epochs)| | | |
|Full (24 epochs)| | | |
|Ensemble of above two| | | |
|+ Expert Models| | | |
|+ Remove small masks| | | |
|+ Faster R-CNN models| | | |
| |val|public test|private test|
|Full (16 epochs)| | | |
|Full (24 epochs)| | | |
|Ensemble of above two| | | |
|+ Expert Models| | | |
|+ Faster R-CNN models| | | |
In this paper, we described the instance segmentation and object detection submissions to the Open Images Challenge 2019 by the team PFDet. Thanks to the fast research cycle enabled by efficient usage of large GPU clusters, we developed several techniques that led to 3rd and 4th place in the instance segmentation and the object detection tracks, respectively.
We thank K. Uenishi, R. Arai, T. Shiota and S. Omura for helping with our experiments.
[1] ChainerMN: scalable distributed deep learning framework. In LearningSys Workshop at NIPS.
[2] (2018) PFDet: 2nd place solution to Open Images Challenge 2018 object detection track. In ECCV Workshop.
[3] (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
[4] (2015) Fast R-CNN. In ICCV.
[5] (2019) LVIS: a dataset for large vocabulary instance segmentation. In CVPR.
[6] (2017) Mask R-CNN. In ICCV.
[7] (2018) Squeeze-and-excitation networks. In CVPR.
[8] (2018) The Open Images Dataset V4: unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982.
[9] (2017) Feature pyramid networks for object detection. In CVPR.
[10] (2014) Microsoft COCO: common objects in context. In ECCV.
[11] (2019) Sampling techniques for large-scale object detection from sparsely annotated objects. In CVPR.
[12] ChainerCV: a library for deep learning in computer vision. In ACM MM.
[13] (2016) Factors in finetuning deep model for object detection. In CVPR.
[14] (2018) MegDet: a large mini-batch object detector. In CVPR.
[15] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
[16] (2019) Chainer: a deep learning framework for accelerating the research cycle. In KDD.
[17] (2019) Learning data augmentation strategies for object detection. arXiv preprint.