Team PFDet's Methods for Open Images Challenge 2019

by   Yusuke Niitani, et al.

We present the instance segmentation and the object detection method used by team PFDet for Open Images Challenge 2019. We tackle a massive dataset size, huge class imbalance and federated annotations. Using this method, the team PFDet achieved 3rd and 4th place in the instance segmentation and the object detection track, respectively.


1 Introduction

Open Images Detection Dataset V5 (OID) [8] is currently the largest publicly available object detection dataset, including M annotated images with M bounding boxes. The diversity of images in training datasets is the driving force of the generalizability of machine learning models. Successfully trained models on OID would push the frontier of object detectors with the help of data.

Since the number of images in OID is extremely large, the speed of training is critical. For faster training, we use Fast R-CNN [4] instead of the more commonly used Faster R-CNN [15]. Fast R-CNN omits time-consuming online RoI (Region of Interest) generation during training by pre-computing RoIs. We find that the selection of pre-computed RoIs plays an important role in achieving good accuracy. For instance, when the number of pre-computed RoIs is small during training, a network overfits to those RoIs. By default, RoIs used to train Faster R-CNN have high variation, so the aforementioned problem is unique to Fast R-CNN.

OID is a federated object detection dataset [5, 8]. This means that for each image, only a subset of categories is annotated. This is in contrast to exhaustively annotated datasets such as COCO [10]. Federated annotation is a realistic approach to expand the number of categories covered by the dataset, since without sparsifying the number of annotated categories, the number of annotations required may explode as the total number of categories increases.

When training a detector on exhaustively annotated datasets like COCO, the loss functions assume that no object is inside an unannotated region. However, for federated object detection datasets, this assumption may be violated because some regions contain an object of an unverified category. We handle this problem by ignoring the loss for unverified categories.


In addition to the previously mentioned uniqueness of OID, the dataset poses an unprecedented class imbalance for an object detection dataset. The rarest category Pressure Cooker is annotated in only images, but the most common category Person is annotated in more than k images. The ratio of the occurrence of the most common and the least common category is times larger than in COCO [10]. Typically, this class imbalance can be handled by over-sampling images containing rare categories. However, over-sampling may degrade performance for common categories.

As a practical method to handle class imbalance, we train models exclusively on rare classes and ensemble them with the rest of the models, similar to our method in last year's competition [2].

To summarize our major contributions:

  • Fast R-CNN: We present the effectiveness of Fast R-CNN and propose methods to alleviate the performance penalty introduced by pre-computed RoIs.

  • Using Only Verified Categories: We find that it is helpful to ignore unverified categories during training.

  • Expert Models: We show the effectiveness of using expert models, especially for classes that rarely appear in the dataset.

2 Methods

2.1 Model architecture

Two-stage object detectors such as Faster R-CNN [15] are known to achieve excellent accuracy, but their GPU usage efficiency during training is sub-optimal. This is because RoIs used to train R-CNN heads are determined between feature extraction and loss calculation. Thus, GPUs are forced to wait for the ground-truth assignment of RoIs. To make more efficient use of GPUs, we use Fast R-CNN, which pre-computes RoIs and assigns ground truths to RoIs in parallel with feature extraction.

For the instance segmentation track, we add a mask head [6] that predicts a segmentation of an object given a region around it and its category. The categories for instance segmentation are a subset of the categories for detection. We use all the detection categories even for training the instance segmentation model, where only of these have instance masks. In preliminary experiments, this worked better than using only the instance segmentation categories.

2.2 Learning with a federated dataset

In federated object detection datasets [8, 5], for each image, categories are grouped into positively verified, negatively verified and unverified. For positively verified categories, annotations are exhaustively made for all objects of those categories. For negatively verified categories, the annotators have made sure that the image contains no objects of those categories. For unverified categories, the objects of those categories may or may not exist in the image.

During training, each RoI is assigned to one of the ground truth boxes if any of them has a sufficiently large intersection with it. This assignment is used to calculate the classification loss and the localization loss [15]. In this work, the classification loss is calculated as the sum of the sigmoid cross entropy losses over each proposal and each category:

$$L_{\mathrm{cls}} = -\sum_{i} \sum_{c} \mathbb{1}[y_{ic} \neq -1] \left( y_{ic} \log p_{ic} + (1 - y_{ic}) \log (1 - p_{ic}) \right),$$

where $p_{ic}$ is the predicted probability of the category $c$ for the $i$-th RoI, and $y_{ic} = 1$ and $y_{ic} = 0$ when the $i$-th RoI is assigned or not assigned to the category $c$, respectively. Also, $y_{ic}$ can be set to $-1$, which means that the classification loss for the category $c$ is ignored for the $i$-th RoI.

When the RoI is not assigned to any of the ground truth boxes, we set $y_{ic} = 0$ for positively and negatively verified categories and $y_{ic} = -1$ for unverified categories. When the RoI is assigned to a ground truth bounding box with the category $c'$, we set $y_{ic'} = 1$ and $y_{ic} = 0$ for all negatively and positively verified categories except for the category $c'$. We set $y_{ic} = -1$ for unverified categories. In practice, verified categories are expanded based on the category hierarchy.
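The label assignment and the ignored-loss rule described above can be sketched in NumPy as follows. This is a minimal illustration rather than the team's implementation: the function names, the `IGNORE = -1` sentinel, and the representation of verified categories as Python sets are all assumptions made for the example, and the expansion of verified categories along the category hierarchy is left out.

```python
import numpy as np

IGNORE = -1  # assumed sentinel meaning "no classification loss for this entry"


def build_targets(assigned_category, positive, negative, num_categories):
    """Per-category targets for a single RoI (hypothetical helper).

    assigned_category: index of the matched ground-truth category, or None.
    positive / negative: sets of positively / negatively verified categories.
    """
    # Unverified categories stay at IGNORE.
    y = np.full(num_categories, IGNORE, dtype=np.int64)
    verified = np.array(sorted(positive | negative), dtype=np.int64)
    y[verified] = 0  # verified categories count as negatives by default
    if assigned_category is not None:
        y[assigned_category] = 1
    return y


def sigmoid_ce_with_ignore(logits, targets):
    """Sum of sigmoid cross entropy over RoIs and categories, skipping IGNORE."""
    mask = targets != IGNORE
    t = np.where(mask, targets, 0).astype(np.float64)
    # Numerically stable form: max(x, 0) - x * t + log(1 + exp(-|x|)).
    per_entry = (np.maximum(logits, 0) - logits * t
                 + np.log1p(np.exp(-np.abs(logits))))
    return float((per_entry * mask).sum())


if __name__ == "__main__":
    pos, neg = {0, 2}, {1}
    targets = np.stack([build_targets(2, pos, neg, 5),
                        build_targets(None, pos, neg, 5)])
    print(targets)
    print(sigmoid_ce_with_ignore(np.zeros((2, 5)), targets))
```

Masking the per-entry loss, instead of treating unverified categories as negatives, keeps the network from being penalized for detecting objects that may be present but were never checked by annotators.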

2.3 Expert models

In OID, there is an extreme class imbalance, which makes it difficult for a model to learn rare categories. For instance, there are classes that are annotated in fewer than images, but the most common class Person is annotated in k images. We use expert models fine-tuned from a model trained on the entire dataset, as in our previous year's submission [2].

We select the subset of categories on which an expert model is trained based on one of the following criteria:

  • Occurrence ranking of categories in the detection subset. We group categories that are in a neighboring ranking so that sampling imbalance does not occur among the categories in a subset.

  • Occurrence ranking in the instance segmentation subset.

  • Semantic similarity of categories. We cluster categories based on the similarity of their embeddings computed by an ImageNet-pretrained feature extractor.


When training an expert model, the annotations of categories that the expert is not responsible for are dropped. Images that do not contain an annotation of target categories are discarded. Also, the bias term of the classification layer is reinitialized so that the network only outputs target categories.
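The dataset preparation for an expert model described above can be sketched as a simple filter. The annotation layout (a dict from image id to a list of `(category, box)` tuples) and the function name are assumptions made for illustration only. The reinitialization of the classification bias is not shown.

```python
def make_expert_dataset(annotations, target_categories):
    """Keep only target-category annotations and drop images left empty.

    annotations: dict mapping image id -> list of (category, box) tuples
                 (a hypothetical layout chosen for this sketch).
    """
    targets = set(target_categories)
    expert = {}
    for image_id, objects in annotations.items():
        # Drop annotations of categories the expert is not responsible for.
        kept = [(cat, box) for cat, box in objects if cat in targets]
        # Discard images that contain no annotation of a target category.
        if kept:
            expert[image_id] = kept
    return expert


if __name__ == "__main__":
    annotations = {
        "img_a": [("Person", (0, 0, 10, 10)), ("Pressure cooker", (2, 2, 6, 6))],
        "img_b": [("Person", (1, 1, 5, 5))],
    }
    print(make_expert_dataset(annotations, ["Pressure cooker"]))
```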

2.4 Ensembling

When aggregating predictions from multiple models, we apply non-maximum suppression to predictions from each model and apply suppression once again to concatenation of the predictions. In the second suppression step, we group predictions that have a large enough intersection and are assigned to the same category. We compute a representative prediction from each group, and the collection of the representatives is the final prediction. Given a group, the bounding box of the group’s representative is the bounding box of the most confident prediction in the group. The segmentation is calculated as the average of the segmentations in the group weighted by confidence and spatial proximity.
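The second suppression step can be sketched as below, assuming that per-model non-maximum suppression has already been applied and the surviving predictions have been concatenated. The function names, the `(y_min, x_min, y_max, x_max)` box layout, and the greedy grouping order are assumptions of this sketch. For brevity the masks are averaged with confidence weights only, so the spatial proximity term mentioned above is omitted.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes."""
    top_left = np.maximum(box[:2], boxes[:, :2])
    bottom_right = np.minimum(box[2:], boxes[:, 2:])
    inter = np.prod(np.clip(bottom_right - top_left, 0, None), axis=1)
    area = np.prod(box[2:] - box[:2])
    areas = np.prod(boxes[:, 2:] - boxes[:, :2], axis=1)
    return inter / (area + areas - inter)


def group_predictions(boxes, labels, scores, thresh):
    """Greedily group same-category predictions that overlap the leader."""
    unused = np.ones(len(boxes), dtype=bool)
    groups = []
    for i in np.argsort(-scores):  # most confident first
        if not unused[i]:
            continue
        overlap = iou_one_to_many(boxes[i], boxes) >= thresh
        members = unused & (labels == labels[i]) & overlap
        members[i] = True
        unused[members] = False
        groups.append(np.flatnonzero(members))
    return groups


def merge_group(idx, boxes, scores, masks):
    """Representative: box of the most confident member, averaged mask."""
    lead = idx[np.argmax(scores[idx])]
    weights = scores[idx] / scores[idx].sum()
    mask = np.tensordot(weights, masks[idx], axes=1)  # confidence-weighted
    return boxes[lead], scores[lead], mask


if __name__ == "__main__":
    boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
    labels = np.array([0, 0, 0])
    scores = np.array([0.9, 0.6, 0.8])
    masks = np.stack([np.ones((2, 2)), np.zeros((2, 2)), np.ones((2, 2))])
    for idx in group_predictions(boxes, labels, scores, thresh=0.5):
        print(merge_group(idx, boxes, scores, masks))
```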

2.5 Pre-computed RoIs

The RoIs used by Faster R-CNN are conditioned on the model weights. Thus, high variation of RoIs is achieved without any effort. This variation is lost when we use Fast R-CNN models with the same set of pre-computed RoIs, which comes at the cost of degraded performance. During training, RoIs that are used to compute head losses are sampled from a pool of RoIs. For training Fast R-CNN, the pool of RoIs is pre-computed, and we find that the number of RoIs in the pool needs to be very high in order for the network not to overfit to the RoIs. In many published works using a variant of Faster R-CNN [6], the number of RoIs in the pool for each image is up to . However, we find that this is not large enough, so we prepare up to RoIs per image.

Selection of pre-computed RoIs is also important when ensembling models. The set of RoIs used by each model should be different when ensembling. This is because predictions made with different sets of RoIs complement each other better.

2.6 Post-processing

As stated in the competition description page, the size of an annotated object is larger than or . Thus, any predictions with small segmentations are unlikely to be counted as true positives during evaluation. We omit predictions whose segmentation area is less than pixels.

The competition submission file size is limited to 5 GB. Our submission file sometimes exceeded this limit, especially after ensembling. We find that predictions for some categories occur much more frequently than the rest, and these are likely to be less important for a class-averaged metric. We drop predictions of frequently predicted categories to meet the file size limit.
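The two post-processing steps can be sketched as follows. The text does not specify how predictions of frequent categories are chosen for removal, so the policy here (repeatedly drop the lowest-scoring prediction of whichever category currently has the most predictions) is an assumption, as are the function names, the dict-based prediction format, and the use of a prediction count as a stand-in for file size.

```python
def drop_small_masks(predictions, min_area):
    """Remove predictions whose mask area is below min_area (in pixels)."""
    return [p for p in predictions if p["mask_area"] >= min_area]


def trim_frequent_categories(predictions, max_predictions):
    """Drop predictions of the most frequent categories until under budget.

    Assumed policy: remove the lowest-scoring prediction of the category
    that currently has the largest number of predictions.
    """
    by_category = {}
    for p in predictions:
        by_category.setdefault(p["category"], []).append(p)
    for preds in by_category.values():
        preds.sort(key=lambda p: p["score"], reverse=True)  # best first
    total = len(predictions)
    while total > max(max_predictions, 0):
        busiest = max(by_category, key=lambda c: len(by_category[c]))
        by_category[busiest].pop()  # lowest-scoring prediction of that category
        total -= 1
    return [p for preds in by_category.values() for p in preds]


if __name__ == "__main__":
    predictions = [
        {"category": "Person", "score": s, "mask_area": 500}
        for s in (0.9, 0.8, 0.2, 0.1)
    ] + [{"category": "Pressure cooker", "score": 0.5, "mask_area": 20}]
    print(trim_frequent_categories(predictions, max_predictions=3))
    print(drop_small_masks(predictions, min_area=100))
```

Trimming the most frequent categories first leaves rare categories untouched, which matters for a metric that averages over classes.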

3 Experiments

We used COCO [10], LVIS [5] and Objects365 as external data. We use Feature Pyramid Networks [9] for our experiments. The feature extractor is SENet [7]. The initial bias of the final classification layer is set to a large negative number to prevent unstable training in the beginning. We initialize the base extractor with the weights of an image classification network trained on the ImageNet classification task [3]. We use stochastic gradient descent with the momentum set to for optimization. The base learning rate is set to . We used up to GPUs. The best single model is trained with the batch size set to . We used multi-node batch normalization [14] to make training stable. We trained for epochs. The learning rate is scheduled by a cosine function , where and are the learning rate and the initial learning rate. We scale images during training so that the length of the smaller edge is between . Also, we randomly flip images horizontally to augment training data. In addition, for training expert models, we used an augmentation policy searched by AutoAugment [17]. During inference, we did not use any test-time augmentation. We used non-maximum suppression with the intersection over union threshold set to . We use Chainer [16, 1, 12] in our experiments.
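As an illustration of a cosine learning rate schedule, the sketch below uses the common half-cosine annealing form, in which the rate decays from the initial value to zero over training. This particular expression, the function name, and the argument names are assumptions made here and may differ from the schedule actually used.

```python
import math


def cosine_lr(iteration, total_iterations, base_lr):
    """Half-cosine decay from base_lr at the start to 0 at the end (assumed form)."""
    progress = min(max(iteration / total_iterations, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical values chosen only to show the shape of the schedule.
    for it in (0, 250, 500, 750, 1000):
        print(it, cosine_lr(it, total_iterations=1000, base_lr=0.1))
```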

3.1 Pre-computed RoIs

To study the best ensembling strategy of Fast R-CNN models, we conducted an ablative study of ensembles of predictions from two models with two sets of RoIs. The result is shown in Table 1. When ensembling predictions from two models using the same set of RoIs, the performance did not improve from the single model result. By using different sets of RoIs, the ensemble outperformed the single model.

Model RoI Segmentation val mAP
A 1
B 1
B 2
A, B 1, 1
A, B 1, 2
Table 1: Comparison of ensembling results with different sets of RoIs. The top three rows show the scores of single models with different weights and RoIs. The bottom two rows show the scores of ensembling predictions from two models. The last row shows the result when different sets of RoIs are used for the predictions of the two models.

3.2 Expert models

Table 2 shows an ablative study of expert models with different numbers of categories assigned. This is the result when training expert models for the 50th to 99th rarest categories. When training multiple expert models for these categories, the categories are split into disjoint sets. These splits are made based on how frequently the categories appear in OID. For instance, when training two expert models, the first expert model is responsible for the 50th to 74th rarest categories and the second model is responsible for the 75th to 99th rarest categories. As seen in the table, when each expert is responsible for a smaller number of categories, the performance for each category improves on average. Since the computational budget is limited, it is difficult to make the number of categories assigned to each expert small. Thus, the number of categories that each expert model is responsible for varies in our final submission.

# of categories per expert # of experts detection val mAP
50 1
25 2
10 5
Table 2: Ablative study on the number of categories assigned to expert models. The mean average precision of the baseline model is .
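The rank-based split used in this ablation can be sketched as below. The function name and the handling of remainders (earlier experts receive one extra category when the range does not divide evenly) are assumptions made for the example.

```python
def split_rank_range(first_rank, last_rank, num_experts):
    """Split rarity ranks [first_rank, last_rank] into contiguous disjoint chunks."""
    ranks = list(range(first_rank, last_rank + 1))
    size, extra = divmod(len(ranks), num_experts)
    chunks, start = [], 0
    for k in range(num_experts):
        # The first `extra` experts take one more category each.
        end = start + size + (1 if k < extra else 0)
        chunks.append(ranks[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    for num_experts in (1, 2, 5):
        chunks = split_rank_range(50, 99, num_experts)
        print(num_experts, [(c[0], c[-1]) for c in chunks])
```

With two experts, this reproduces the split from the text: ranks 50 to 74 for the first expert and 75 to 99 for the second.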

3.3 Competition results

Our final submission consists of the predictions from the following models:

  • Two Fast R-CNN models trained on detection categories. One of them is trained for 16 epochs and the other is trained for 24 epochs.

  • Fast R-CNN expert models. On average each expert predicts categories.

  • Faster R-CNN models. We used predictions from Faster R-CNN models trained in the preliminary experiments. Some of them are from last year's submission [11, 2].

The results for instance segmentation and object detection are shown in Table 3 and Table 4. We ranked 3rd and 4th place in the instance segmentation and the object detection tracks, respectively.

For instance segmentation, all ensemble results exceeded the file size limit of 5GB for the test set. Thus, we needed to drop some predictions for the frequently predicted categories. Therefore, by adding more models, the instance segmentation test results did not improve as much as the validation scores and the object detection results.

val public test private test
Full (16 epochs)
Full (24 epochs)
Ensemble of above two
+ Expert Models
+ Remove small masks
+ Faster R-CNN models
Table 3: Mean average precision on the instance segmentation track. We did not set a file size limit for the validation set, so the post-processing of removing small masks was not evaluated on the validation set. Some of the predictions of Faster R-CNN models are from last year’s competition, so we could not evaluate them on the validation set.

val public test private test
Full (16 epochs)
Full (24 epochs)
Ensemble of above two
+ Expert Models
+ Faster R-CNN models
Table 4: Mean average precision on the detection track. Some of the predictions of Faster R-CNN models are from last year’s competition, so we could not evaluate them on the validation set.

4 Conclusion

In this paper, we described the instance segmentation and the object detection submissions to Open Images Challenge 2019 by the team PFDet. Thanks to the fast research cycle enabled by efficient usage of large GPU clusters, we developed several techniques that led to 3rd and 4th place in the instance segmentation and the object detection track, respectively.


We thank K. Uenishi, R. Arai, T. Shiota and S. Omura for helping with our experiments.


  • [1] T. Akiba, K. Fukuda, and S. Suzuki (2017) ChainerMN: scalable distributed deep learning framework. In LearningSys workshop in NIPS.
  • [2] T. Akiba, T. Kerola, Y. Niitani, T. Ogawa, S. Sano, and S. Suzuki (2018) PFDet: 2nd place solution to Open Images Challenge 2018 object detection track. In ECCV Workshop.
  • [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
  • [4] R. Girshick (2015) Fast R-CNN. In ICCV.
  • [5] A. Gupta, P. Dollár, and R. Girshick (2019) LVIS: a dataset for large vocabulary instance segmentation. In CVPR.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In ICCV.
  • [7] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR.
  • [8] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari (2018) The Open Images Dataset V4: unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982.
  • [9] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In CVPR.
  • [10] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár (2014) Microsoft COCO: common objects in context. In ECCV.
  • [11] Y. Niitani, T. Akiba, T. Kerola, T. Ogawa, S. Sano, and S. Suzuki (2019) Sampling techniques for large-scale object detection from sparsely annotated objects. In CVPR.
  • [12] Y. Niitani, T. Ogawa, S. Saito, and M. Saito (2017) ChainerCV: a library for deep learning in computer vision. In ACM MM.
  • [13] W. Ouyang, X. Wang, C. Zhang, and X. Yang (2016) Factors in finetuning deep model for object detection. In CVPR.
  • [14] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun (2018) MegDet: a large mini-batch object detector. In CVPR.
  • [15] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
  • [16] S. Tokui, R. Okuta, T. Akiba, Y. Niitani, T. Ogawa, S. Saito, S. Suzuki, K. Uenishi, B. Vogel, and H. Yamazaki Vincent (2019) Chainer: a deep learning framework for accelerating the research cycle. In KDD.
  • [17] B. Zoph (2019) Learning data augmentation strategies for object detection. arXiv.