We used the OpenImages Challenge 2019 Object Detection dataset as the training data in most cases. The dataset is a subset of the OpenImages V5 dataset. It contains 1.74M images, 14.6M bounding boxes, and 500 categories organized into five levels. Since categories at different levels have a parent-child relationship, we expand each bounding box with its parent classes at the inference stage. The whole OpenImages V5 dataset, with image-level and segmentation labels, is used for weakly-supervised pretraining and label augmentation as mentioned in Sec. 5.9. We also use COCO and Object365 to train some expert models for the overlapping categories.
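The parent-class expansion at inference can be sketched as follows. This is only a minimal illustration: it assumes each category has at most one parent, and the `parent_of` mapping, the function names and the example labels are hypothetical rather than taken from our pipeline.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]
Detection = Tuple[str, float, Box]  # (label, score, box)


def ancestors(label: str, parent_of: Dict[str, str]) -> List[str]:
    """Return all ancestors of `label`, nearest parent first."""
    chain = []
    seen = {label}
    while label in parent_of:
        label = parent_of[label]
        if label in seen:  # guard against a malformed (cyclic) hierarchy
            break
        seen.add(label)
        chain.append(label)
    return chain


def expand_with_parents(dets: List[Detection],
                        parent_of: Dict[str, str]) -> List[Detection]:
    """Duplicate every detection once per ancestor class, keeping box and score."""
    out = []
    for label, score, box in dets:
        out.append((label, score, box))
        for parent in ancestors(label, parent_of):
            out.append((parent, score, box))
    return out
```

Keeping the child score for the parent copies is one simple choice; other score assignments are possible.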
2 Decoupling Head
Since the breakthrough in object detection performance achieved by the seminal R-CNN family [girshick2015region, girshick2015fast, 19] and the powerful FPN, further improvement on this task seems to be hindered by some concealed bottlenecks. Even though advanced algorithms bolstered by AutoML [7, Xu_2019_ICCV] have been explored, the performance gain is still limited to an easily accessible range. As the most obvious distinction from generic object classification, the specialized sibling head for both classification and localization comes into focus; it is widely used in most advanced detectors, including the single-stage family [liu2016ssd, 20, 9], the two-stage family [5, 11, 24, 15, 12] and the anchor-free family [law2018cornernet]. Since the two different tasks share almost the same parameters, a few works have become aware of the conflict between the two objective functions in the sibling head and try to find a trade-off.
IoU-Net [jiang2018acquisition] is the first to reveal this problem. The authors find that a feature which generates a good classification score always predicts a coarse bounding box. To handle this, they introduce an extra head to predict the IoU as the localization confidence, and then aggregate the localization confidence and the classification confidence into the final classification score. This approach does reduce the misalignment problem, but only as a compromise: the essential philosophy behind it is to relatively raise the confidence score of a tight bounding box and reduce the score of a bad one. The misalignment still exists at each spatial point. Along this direction, Double-Head R-CNN is proposed to disentangle the sibling head into two specific branches for classification and localization, respectively. Despite the elaborate design of each branch, it can be deemed to disentangle the information by adding a new branch, essentially reducing the parameters shared by the two tasks. Although satisfactory performance can be obtained by this detection head disentanglement, the conflict between the two tasks still remains, since the features fed into the two branches are produced by ROI pooling from the same proposal.
In our work, we observe that the spatial misalignment between the two objective functions in the sibling head can considerably hurt the training process, but this misalignment can be resolved by a very simple operator called Decoupling Head (DH), alternatively called TSD in [21]. For classification and regression, DH decouples them in the spatial dimension by generating two disentangled proposals, which are estimated from the shared proposal. This is inspired by the natural insight that, for one instance, the features in some salient area may carry rich information for classification, while those around the boundary may be good at bounding box regression. We give a brief description of the proposed Decoupling Head (DH) in this section. More details will be introduced in a separate paper on TSD. Extensive experiments demonstrate the advantages of DH compared with the original detection head in Faster RCNN.
2.1 Detailed description
As shown in Fig. 1, unlike the original detection head in Faster RCNN, DH separates classification and regression through an auto-learned pixel-wise offset and a global offset. The purpose of DH is to search for the optimal feature extraction for classification and regression, respectively. Furthermore, to facilitate the learning of DH, we propose the Controllable Margin Loss (CML) to propel the whole learning process.
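Let $F$ denote the output feature of ROI pooling. The learned offsets for classification and regression are generated by:
$$\Delta C = \mathcal{F}_c(F; \theta_c), \qquad \Delta R = \mathcal{F}_r(F; \theta_r),$$
where $\theta_c$ and $\theta_r$ are the parameters of the fully connected layers $\mathcal{F}_c$ and $\mathcal{F}_r$. Here $\Delta C \in \mathbb{R}^{k \times k \times 2}$ and $\Delta R \in \mathbb{R}^{1 \times 1 \times 2}$, where $k \times k$ is the number of bins in ROI pooling.
Classification. For classification, the output feature of DHPooling is defined as:
$$y_c(i, j) = \sum_{p \in \mathrm{bin}(i, j)} \frac{x(p_0 + p + \Delta C_{ij})}{n_{ij}},$$
where $x$ is the input feature map and $n_{ij}$ is the number of pixels in the $(i, j)$-bin pre-defined in ROI pooling. The top-left corner is denoted as $p_0$.
Regression. For regression, the output feature of DHPooling is defined as:
$$y_r(i, j) = \sum_{p \in \mathrm{bin}(i, j)} \frac{x(p_0 + p + \Delta R)}{n_{ij}}.$$
To facilitate the training, we propose CML to optimize the learning. For classification, the CML is defined as:
$$\mathcal{L}_{cls} = \left| s - s^{D} + m_c \right|_{+},$$
where $s$ is the classification score in the original detection head and $s^{D}$ is the classification score in DH.
$|\cdot|_{+}$ is the same as the ReLU function. We use $m_c$ to represent the pre-defined margin. Similarly, for regression the CML is written as:
$$\mathcal{L}_{reg} = \left| \mathrm{IoU} - \mathrm{IoU}^{D} + m_r \right|_{+},$$
where $\mathrm{IoU}$ and $\mathrm{IoU}^{D}$ are the IoU of the refined proposal according to the predicted regression in the original detection head and DH, respectively. $m_c$ and $m_r$ are set to 0.2 in our experiments.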
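The margin idea behind CML can be illustrated with a minimal sketch. It assumes the loss is a ReLU applied to "original-head value minus DH value plus margin", so that the loss vanishes once DH beats the original head by at least the margin; the function names are hypothetical and a real implementation would operate on batched tensors.

```python
def cml_classification(score_orig: float, score_dh: float, margin: float = 0.2) -> float:
    """Margin loss on classification scores (assumed form: ReLU(orig - dh + margin))."""
    return max(0.0, score_orig - score_dh + margin)


def cml_regression(iou_orig: float, iou_dh: float, margin: float = 0.2) -> float:
    """Margin loss on the IoU of the refined proposals (same assumed form)."""
    return max(0.0, iou_orig - iou_dh + margin)
```

For example, with a margin of 0.2, an original score of 0.5 and a DH score of 0.9 give zero loss, while a DH score of 0.6 is still penalized.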
More details and analysis will be presented in a separate article.
3 Adj-NMS
In the post-processing stage of object detection, NMS or soft-NMS is commonly used to filter invalid bounding boxes. However, in our experiments, we find that using soft-NMS directly degrades the performance. To improve the performance, we adopt Adj-NMS, which combines NMS and soft-NMS. Given the detected bounding boxes, we first filter the boxes with the NMS operator at a threshold of 0.5. We then apply the soft-NMS operator to re-weight the scores of the remaining boxes by:
$$w = e^{-\mathrm{IoU}^2 / \sigma},$$
where $w$ is the weight multiplied with the classification score and $\sigma$ is set to 0.5.
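The two-stage procedure can be sketched in plain Python as below. This is a sketch under stated assumptions, not our exact implementation: the soft-NMS stage is assumed to use the Gaussian decay exp(-IoU^2 / sigma) with sigma = 0.5, boxes are (x1, y1, x2, y2) tuples of a single class, and the function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def adj_nms(boxes, scores, nms_thr=0.5, sigma=0.5):
    """Hard NMS at `nms_thr`, then Gaussian soft-NMS on the surviving boxes."""
    # Stage 1: hard NMS removes boxes overlapping a higher-scored box too much.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_thr for j in kept):
            kept.append(i)
    # Stage 2: soft-NMS decays the scores of the survivors instead of removing them.
    remaining = [(boxes[i], scores[i]) for i in kept]
    out = []
    while remaining:
        top = max(range(len(remaining)), key=lambda t: remaining[t][1])
        top_box, top_score = remaining.pop(top)
        out.append((top_box, top_score))
        remaining = [(b, s * math.exp(-iou(top_box, b) ** 2 / sigma))
                     for b, s in remaining]
    return out
```

The function returns (box, score) pairs in the order in which the soft-NMS stage selects them.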
4 Model Ensemble
4.1 Naive Ensemble
For the model ensemble, we adopt the solution in PFDet and the commonly used voting strategy, where the bounding box location and confidence are voted on by the top k boxes. Given a bounding box and its top k boxes ($i \in [1, k]$) with the highest IoU, we first use the method in PFDet to re-weight the classification score of each model via its AP on the validation set. Then, the final classification score of the box is computed as:
The localization is computed as:
where is set to 4 in our experiments.
4.2 Auto Ensemble
We trained a total of 28 models with different architectures, heads, data splits, class sampling strategies, augmentation strategies and supervisions. We first use the naive model ensemble described above to aggregate detectors with similar settings, which reduces the number of detection results from 28 to 11. We then design and launch an auto ensemble method to merge them into one.
Search space. Consider each detection as a leaf node and each ensemble operator as a parent node; the model ensemble can then be formulated as a binary tree generation process. Every parent node is an aggregation of its children by a set of operations, and the root is the final detection. The search space includes the weight of the detection score (a global scale factor for all classes; note that before the ensemble, we first re-weight the box score of each class by the relative AP value as mentioned in Sec. 4.1), the box merging score, element dropout (using only the classification score or only the bounding box information of a model) and the NMS type (naive NMS, soft-NMS and Adj-NMS).
Search process. In the competition we adopt a two-stage searching process: first, we search the architecture of the binary tree with equal contribution for each child node; then we search the operators of parent nodes based on the fixed tree.
Result. Since such a large search space may lead to overfitting, we split the whole dataset (V5 train+val+test+challenge val) into three parts: 80% for training and 2 × 10% as validation sets for tuning the ensemble strategy. The validation sets are carefully mined to keep their distribution as similar to the whole dataset as possible. We only train models ID 17-28 under this data setting. The auto ensemble leads to 2.9%, 3.6% and 1.2%, 1.0% improvement on the two validation sets and 0.9% on the public leaderboard compared to the naive ensemble. We also observe an interesting result in the first stage: the detections with lower mAP tend to be located at deeper leaves. We will provide an enhanced one-stage searching method and more details in a separate article.
5 Bag of Tricks for Detector
The OpenImages dataset exhibits a long-tail distribution: the number of samples per category is not balanced, and data for some categories are scarce. Class-aware sampling is a widely used technique to handle the class imbalance problem. Images are sampled such that each of the 500 categories has an equal probability of having at least one instance in a sampled image. Table 1 shows the effectiveness of this sampling method. We use class-aware sampling in all the methods described below.
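A common way to realize class-aware sampling is to first draw a category uniformly and then draw an image containing it. The sketch below assumes this two-step scheme; the function names and the `image_labels` structure (image id mapped to its list of category labels) are hypothetical.

```python
import random
from collections import defaultdict


def build_class_index(image_labels):
    """Map each category to the list of image ids that contain it."""
    index = defaultdict(list)
    for image_id, labels in image_labels.items():
        for label in set(labels):
            index[label].append(image_id)
    return dict(index)


def class_aware_sample(class_index, num_samples, rng):
    """Draw a category uniformly at random, then an image of that category."""
    classes = sorted(class_index)
    samples = []
    for _ in range(num_samples):
        category = rng.choice(classes)
        samples.append(rng.choice(class_index[category]))
    return samples
```

With this scheme, an image of a rare category is drawn far more often than under uniform image sampling, at the cost of repeating the few images of that category.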
5.2 Decoupling Backbone
For models ID 25-28, we decouple the classification and regression from the stride 8 in the backbone. One branch focuses on the classification task where regression is given a lower weight and the other branch is the opposite.
5.3 Elaborate Augmentation
For models trained with 512 accelerators, we design a 'full class batch' and an elaborate augmentation. For the 'full class batch', we guarantee that there is at least one sample for each class. For the elaborate augmentation, we first randomly select a class and obtain one image containing it. We then apply a random rotation to this image (with a larger rotation variance for classes with a severely unbalanced aspect ratio, such as 'flashlight'). Furthermore, we randomly select a scale and crop the image so that it covers the bounding box of this class. To select the scale, we first compute the maximum image area allowed by the accelerator memory, and then randomly sample a scale between the minimum scale and this maximum. The scale sampling follows the distribution, over the whole training set, of the ratio between the longer side of a bounding box and the longer side of its image.
5.4 Expert Model
An expert model is a detector trained on a subset of the dataset to predict a subset of categories. The motivation is that it is hard for a general model to perform well on all classes, so we select some categories for expert models to handle specifically.
There are two important factors to consider: the selection of positive and negative categories, and the ratio between the positive and negative categories. Previous papers used predefined rules, such as selecting the categories with the fewest samples or the worst performance on the validation set. The drawback of these predefined rules is that they ignore the possibility of confusion between categories. For example, "Ski" and "Snowboard" are an easy-to-confuse category pair. If we only choose "Ski" data to train an expert model, it tends to treat the "Snowboard" instances in the validation set as "Ski", causing false positives.
Table 1: Effect of class-aware sampling.
| Method | mAP |
| Baseline (X50 FPN) | 58.88 |
| + Class Aware Sampling | 64.64 |
The definition of ”easy to confuse” can be derived from three different perspectives:
a) Hierarchy tag: The OpenImages dataset has hierarchy-tag relationships between different categories. A straightforward method is to select sub-classes under the same parent node to train the expert model.
b) Confusion matrix: If two categories are easily confused, they will cause many false positives, as reflected in the confusion matrix.
c) Visual similarity: The weights of the neural network can also be used to measure the distance between two classes. Prior work calculated the Euclidean distance between the features extracted by the last layer of ResNet-101 to define the visual similarity. We go further and consider the weights of the classification fully connected layer in the RCNN stage. The cosine of the angle between categories $i$ and $j$ is defined as:
$$\cos\theta_{ij} = \frac{w_i \cdot w_j}{\lVert w_i \rVert \, \lVert w_j \rVert},$$
where $w_i$ denotes the weights of category $i$ in the classification fully connected layer.
We verify that if the semantics of two categories are similar, the corresponding cosine is also close to 1.
We train our expert model in the following three steps:
1) Select the initial category , such as the lowest ten categories of validation mAP. Add images containing to the positive data subset .
2) Add the confused categories by using the cosine matrix. For each category who satisfy the requirement that , adding them to . equals 0.25 in our setting to ensure the ratio of positive and negative data is close to 1:3. Add images containing to the negative data subset .
3) Train a detector with the to predict categories .
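The selection of confused categories in step 2 can be sketched as follows. The sketch assumes that a category counts as confused when its cosine similarity with any initial category exceeds the threshold (0.25 here); the `weights` dictionary, which stands in for the rows of the classification fully connected layer, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def confused_categories(weights, positives, threshold=0.25):
    """Return categories whose classifier weights are close to any positive category."""
    negatives = []
    for name, w in weights.items():
        if name in positives:
            continue
        if any(cosine(w, weights[p]) > threshold for p in positives):
            negatives.append(name)
    return negatives
```

In practice the threshold would be tuned so that the resulting positive and negative subsets have the desired ratio.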
During the inference stage, each RoI will have a corresponding classification score with the shape of . If the background classification score is larger than all other foreground scores, then this RoI will not be sent to the bounding box regression step. This modification can reduce a lot of unnecessary false-positive cases.
5.5 Anchor Selecting
5.6 Cascade RCNN
Cascade RCNN is designed for high-quality object detection and can improve AP at high IoU thresholds, e.g., AP0.75. However, in this competition the evaluation criterion only considers AP0.5, so we modify the IoU threshold of each RCNN stage in Cascade RCNN and redesign the weight of each stage for the final result. We set the IoU thresholds to 0.5, 0.5, 0.6, 0.7 and the stage weights to 0.75, 1, 0.25, 0.25. This yields an increase of 0.7 mAP compared to the standard Cascade RCNN.
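One plausible reading of the stage weighting is a normalized weighted average of the per-stage classification scores. The sketch below assumes exactly that; the function name is hypothetical, and the actual combination rule in our code may differ in its normalization.

```python
def fuse_stage_scores(stage_scores, weights=(0.75, 1.0, 0.25, 0.25)):
    """Weighted average of per-stage class scores for one RoI.

    stage_scores: one list of class scores per cascade stage.
    """
    if len(stage_scores) != len(weights):
        raise ValueError("need exactly one score vector per stage weight")
    total = sum(weights)
    num_classes = len(stage_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, stage_scores)) / total
        for c in range(num_classes)
    ]
```

With these weights, the first two stages (both trained at IoU 0.5) dominate the final score, which matches the AP0.5 evaluation criterion.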
5.7 Weakly Supervised Training
There is a serious class imbalance issue in the OpenImage object detection dataset. Some classes have only a few images, which causes the model to perform poorly on these classes. We add some images that only have image-level annotations to improve the classification ability of our model. Specifically, we combine data with bounding-box-level annotations and data with image-level classification annotations to build a semi-supervised dataset, and integrate a fully-supervised detector (Faster RCNN) and a weakly-supervised detector (WSDDN) in an end-to-end manner. When encountering bounding-box-level data, we use it to train the fully-supervised detector and to constrain the weakly-supervised detector. When encountering image-level data, we use it to train the weakly-supervised detector, and mine pseudo ground truth from the weakly-supervised results to train the fully-supervised detector.
5.8 Relationships Between Categories
There are some special relationships between categories in the OpenImage dataset. For example, some classes always appear along with other classes, like Person and Guitar. In the training set, Person appears in 90.7% of the images that contain a guitar. So when a guitar bounding box is detected with high confidence and there is a person bounding box with a certain confidence, we can raise the confidence of the person bounding box. We denote the number of objects of category $i$ in the training set as $N_i$, and the number of objects of category $i$ co-occurring with category $j$ as $N_{ij}$. We can then get the conditional probability $p(i \mid j) = N_{ij} / N_j$. We assume that the maximum confidence over all proposals of category $i$ in an image should be no lower than the highest conditional probability $p(i \mid j)$ over the categories $j$ detected in that image.
In addition to the co-occurrence relationship, there are two special relationships, the surround relationship and the being-surrounded relationship, as shown in Fig. 3. A surround relationship means that bounding boxes of certain categories always surround bounding boxes of certain other categories. A being-surrounded relationship means that certain categories always appear inside the bounding boxes of certain other categories.
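The co-occurrence statistics can be estimated with a few lines of code. The sketch below counts at the image level (an image either contains a category or not), which is a simplifying assumption, and returns the fraction of images containing category j that also contain category i; the function name and the input format are hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probability(image_labels):
    """Return {(i, j): P(i present | j present)} estimated over images."""
    count = Counter()       # number of images containing each category
    pair_count = Counter()  # number of images containing both i and j
    for labels in image_labels:
        unique = set(labels)
        count.update(unique)
        pair_count.update(permutations(unique, 2))
    return {(i, j): n / count[j] for (i, j), n in pair_count.items()}
```

A pair such as (Person, Guitar) with a high value indicates that a confident Guitar detection is strong evidence for a Person in the same image.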
These special relationships between categories can be evidence to improve or reduce the confidence of certain bounding boxes, thereby improving detection performance.
5.9 Data Understanding
We find that there are many confusing class definitions in OpenImage, and some of them can be exploited to improve the accuracy. For example, ‘torch’ has various semantic meanings in the train and validation sets, which is beyond the algorithm’s ability to resolve. So we expand the training samples of these confusing classes both by mixing in some similar classes and by using extra images with only image-level labels. Here are some more examples: ‘torch’ and ‘flashlight’, ‘sword’ and ‘dagger’, ‘paper towel’ and ‘toilet paper’, ‘slow cooker’ and ‘pressure cooker’, ‘kitchen knife’ and ‘knife’.
Insufficient label. We also find that some classes, like ‘grape’, have too many group boxes and few instance boxes, so we use the bounding boxes of their segmentation labels to extend the detection labels. For some other classes, such as ‘pressure cooker’ and ‘torch’, we crawl the top-100 results from Google Images and directly feed them into the training pipeline without hand labelling. A good property of these 200 crawled images is that their backgrounds are clean enough, so we directly use [0,0,1,1] as their bounding boxes.
6 Implementation Details
The 28 final models are trained with PyTorch and TensorFlow, and all of the backbones are first pre-trained on the ImageNet dataset. The models are trained under different settings: 13 or 26 epochs with a batch size of 2N on N accelerators, where N ranges over [32, 512] for different models depending on the number of accelerators available. We warm up the learning rate from 0.001 to the target value and then decay it by a factor of 0.1 at epochs 9 and 11 (or 18 and 22 for the 2x setting). At the inference stage, for the validation set we directly generate the results, and for the challenge test set we adopt multi-scale testing with [600, 800, 1000, 1333, 1666, 2000]; the final parameters are generated by averaging the parameters of epochs [9,13] (or [19,26] for the 2x setting). The basic detection framework is FPN with Faster RCNN, and class-aware sampling is used for all of them.
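The warm-up and step-decay schedule can be written as a small function. This is a sketch only: the peak learning rate is left as a required argument because it depends on the model, and the linear warm-up shape, the one-epoch warm-up length and the function name are assumptions made for illustration.

```python
def learning_rate(epoch, peak_lr, warmup_epochs=1.0, start_lr=0.001,
                  milestones=(9, 11), gamma=0.1):
    """Linear warm-up from `start_lr` to `peak_lr`, then step decay at `milestones`.

    `epoch` may be fractional so the warm-up can be applied per iteration.
    """
    if epoch < warmup_epochs:
        return start_lr + (peak_lr - start_lr) * epoch / warmup_epochs
    drops = sum(1 for m in milestones if epoch >= m)
    return peak_lr * gamma ** drops
```

For the 2x setting, the same function would be called with `milestones=(18, 22)`.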
7 Results of Object Detection
7.1 Ablation Study on DH
| Model | DH | DCN | Validation Set | MT | PA | Public LB |
We first study the effectiveness of DH on the validation set and the challenge set with different backbones. Results are shown in Tab. 2. For ResNet50, we adopt anchors with scale 8 and aspect ratios [0.5, 1, 2]. For ResNeXt101 and SENet154, we adopt anchors with scales [8, 11, 14] and aspect ratios [0.1, 0.5, 1, 2, 4, 8]. Note that DH consistently improves the performance by 3-4%.
7.2 Ablation Study on Adj-NMS and Voting Ensemble
Results are shown in Tab. 3. The voting ensemble yields a 0.3 improvement. Note that the ensemble solution in PFDet combined with Adj-NMS brings further improvement. The 4 models are trained with a simple configuration, without bells and whistles.
7.3 Final results
Given all the successful explorations, we train multiple backbones with the best settings and designs mentioned above, including the ResNet family, SENet family, ResNeXt family, NASNet (the original NASNet cannot converge well in our experiments, so we use a modified version of it), NAS-FPN and the EfficientNet family (we modified some network parameters, such as the depth multiplier, to enable better convergence and a better trade-off between training time and performance). We summarize some of our recorded results and break down the final results achieved on the public leaderboard in Tab. 4. The first group of experts consists of SEResNet154 models trained on the 150, 27 and 40 classes with low AP on the validation set. The second group covers the 64 classes that exist in both the COCO dataset and the OpenImage dataset; for these, we directly adopt Mask RCNN with ResNet152 and Cascade RCNN with ResNet50, trained on the COCO dataset, as the 64-class expert models. The third group consists of expert models trained on the classes shared with Object365; there are 8 expert models in total for this. At the final re-weighting stage, we generate different weights for different models for the ensemble.
| Models | Public LB |
| Single Model (ID 1-16) | [58.596 - 60.5] |
| Single Model (ID 17-28) + 1 expert | [N/A - 63.596] |
| Naive ensemble ID 1-16 | 61.917 |
| Mix ensemble+voting ID1-28+3experts (V1) | 67.2 |
8 Instance Segmentation
We observe that most state-of-the-art instance segmentation methods of the last two years are based on Mask R-CNN, in which box regression and mask segmentation share almost the same features. We argue that this paradigm may not be the optimal solution. Instead, we revisit traditional object detection and semantic segmentation and formulate instance segmentation as these two problems in series. In this way, the segmentation performance can be further boosted by a more complex design of the segmentation branch. In the bounding box detection step, we directly use the results from the detection track. In the semantic segmentation step, we adopt the state-of-the-art HRNet as our backbone and train the network to distinguish foreground pixels from background pixels within one bounding box.
With the design described above, the detection result has a linear impact on the segmentation, so we spend most of our time improving the detection performance.
Base detection. As mentioned in our detection report, we train 28 global models, 3 expert models for the 150, 27 and 40 classes with the lowest AP on the validation set, 2 expert models on COCO and 8 expert models on Object365 for the overlapping classes. The backbones include the ResNet family, SENet family, ResNeXt family, NASNet, NAS-FPN and EfficientNet family. We design a decoupling head, a decoupling backbone, Adj-NMS, multiple ensemble strategies and a bag of tricks (sampling, elaborate augmentation, expert model strategies, anchor selection, modified Cascade RCNN, weakly supervised pre-training, category-relationship-based re-scoring, etc.).
Since the evaluation metric of the detection track is AP0.5, our detection methods may overfit to it and generate boxes that are not very tight. So we further train a cascade regressor head. Given the original detectors' results as proposals, we only train a regressor based on the features generated by ROI align.
8.2 Segmentation Module
We give a detailed description of the proposed semantic segmentation method. Due to the excellent and stable performance of the HRNet backbone on several tasks (e.g. image classification, object detection, semantic segmentation and keypoint detection), we choose HRNet as our backbone.
The architecture is illustrated in Figure 4. There are four stages, and the 2nd, 3rd and 4th stages are formed by repeating modularized multi-resolution blocks. A multi-resolution block consists of a multi-resolution group convolution and a multi-resolution convolution. The multi-resolution group convolution is a simple extension of the group convolution, which divides the input channels into several subsets of channels and performs a regular convolution over each subset over different spatial resolutions separately.
As shown in Fig. 5, in the final step, we rescale the low-resolution representations through bilinear upsampling to the high resolution, and concatenate the subsets of representations, resulting in the high-resolution representation, which we adopt for semantic segmentation.
8.2.3 Detailed description
The network starts from a stem that consists of two strided 3×3 convolutions decreasing the resolution to 1/4. The 1st stage contains 4 residual units, where each unit is formed by a bottleneck with width 64, and is followed by one 3×3 convolution reducing the width of the feature maps to C. The 2nd, 3rd and 4th stages contain 1, 4 and 3 multi-resolution blocks, respectively. The widths (numbers of channels) of the convolutions of the four resolutions are C, 2C, 4C and 8C, respectively. Each branch in the multi-resolution group convolution contains 4 residual units, and each unit contains two 3×3 convolutions in each resolution. For semantic segmentation, we mix the output representations from all four resolutions through a 1×1 convolution and produce a 15C-dimensional representation. Then, we pass the mixed representation at each position to a linear classifier with the softmax loss to predict the segmentation maps. Note that the segmentation maps are upsampled (4 times) to the input size by bilinear upsampling for both training and testing.
| Model | Training strategy | Test strategy | Public LB |
| HRNet2 | [256,600] | [256,600] with flip | 54.90 |
8.3 Implementation details for segmentation track
We follow a training protocol similar to [25, 26]. The data are cropped from the original image using the detected bounding boxes; the short side is then randomly scaled to between 200 and 300, and random horizontal flipping is applied. We use rotation augmentation settings similar to those in our detection solution. We adopt the SGD optimizer with a base learning rate of 0.08, a momentum of 0.9 and a weight decay of 0.0005. The poly learning rate policy with a power of 0.9 is used for decaying the learning rate. All models are trained for 200K iterations with a batch size of 6N, 8N or 16N on N GPUs, depending on the input resolution. N is set to 8, 32 or 80 in different experiments. SyncBN is used.
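For reference, the poly learning rate policy is commonly written as base_lr × (1 − iter / max_iter)^power. The sketch below uses this standard form with the values stated above as defaults; the function name is hypothetical.

```python
def poly_lr(iteration, base_lr=0.08, max_iter=200_000, power=0.9):
    """Poly learning-rate policy: base_lr * (1 - iteration / max_iter) ** power."""
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power
```

The rate starts at the base value and decays smoothly to zero at the last iteration.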
8.4 Model Ensemble
For the model ensemble, we apply different training strategies to obtain different models with the HRNet backbone. Four models are used for the ensemble: HRNet trained with a random scale from 200 to 300, HRNet trained with fixed scale [256,600], HRNet trained with fixed scale [112,112], and HRNet trained with [256,600] and tested with the flip mechanism. The bounding boxes are generated by the detection model trained on the 500-class OpenImage dataset. The mAP on public LB is . At the ensemble stage, given the mask maps generated by different models, we directly adopt a voting mechanism for each pixel to obtain the final results.
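Per-pixel voting over binary masks can be sketched as a strict majority vote. The tie-breaking rule (a tie is assigned to background) and the function name are assumptions made for this illustration; masks are nested lists of 0/1 values of identical size.

```python
def vote_masks(masks):
    """Strict-majority vote per pixel over a list of equally sized binary masks."""
    num_models = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [1 if 2 * sum(m[y][x] for m in masks) > num_models else 0
         for x in range(width)]
        for y in range(height)
    ]
```

With four models, as in our ensemble, a pixel is kept as foreground only if at least three models agree under this tie-breaking rule.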
8.5 Final Results of Segmentation
The results of different training strategies are shown in Tab. 5.
We appreciate the discussion with Kai Chen and Yi Zhang at the Multimedia Lab, CUHK. We also acknowledge the mmdetection team for the wonderful codebase.
- [1] (2018) PFDet: 2nd place solution to Open Images Challenge 2018 object detection track. arXiv preprint arXiv:1809.00778.
- [2] (2016) Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2846–2854.
- [3] (2018) Cascade R-CNN: delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162.
- [4] (2019) MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
- [5] (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.
- [6] (2018) Solution for large-scale hierarchical object detection datasets with incomplete annotation and data imbalance. arXiv preprint arXiv:1810.06208.
- [7] (2019) NAS-FPN: learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045.
- [8] (2019) (Website).
- [9] (2017) Scale-aware face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6186–6195.
- [10] (2018) The Open Images dataset V4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982.
- [11] (2019) Gradient harmonized single-stage detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8577–8584.
- [12] (2019) Zoom out-and-in network with map attention decision for region proposal and object detection. International Journal of Computer Vision 127 (3), pp. 225–238.
- [13] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
- [14] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- [15] (2017) Recurrent scale approximation for object detection in CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 571–579.
- [16] Object365. https://www.objects365.org/overview.html
- [17] (2016) Factors in finetuning deep model for object detection with long-tail distribution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 864–873.
- [18] (2017) Automatic differentiation in PyTorch.
- [19] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- [20] (2018) Beyond trade-off: accelerate FCN-based face detector with higher accuracy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7756–7764.
- [21] (2020) Revisiting the sibling head in object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [22] (2019) Rethinking classification and localization in R-CNN. arXiv preprint arXiv:1904.06493.
- [23] (2019) Detecting 11k classes: large scale object detection without fine-grained bounding boxes. arXiv preprint arXiv:1908.05217.
- [24] (2017) Crafting GBD-Net for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (9), pp. 2109–2123.
- [25] (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.
- [26] (2018) PSANet: point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 267–283.