Towards Fine-grained Large Object Segmentation: 1st Place Solution to 3D AI Challenge 2020 – Instance Segmentation Track

by   Zehui Chen, et al.

This technical report introduces the solution of Team 'FineGrainedSeg' for the Instance Segmentation track of the 3D AI Challenge 2020. To handle the extremely large objects in 3D-FUTURE, we adopt PointRend as our basic framework, which outputs more fine-grained masks than HTC and SOLOv2. Our final submission is an ensemble of 5 PointRend models, which achieves 1st place on both the validation and test leaderboards. The code is available at


Code Repositories


1st Place Solutions of 3D AI Challenge 2020(IJCAI-PRICAI 2020 Workshop) - Instance Segmentation Track

1 Introduction

Recently, many modern instance segmentation approaches have demonstrated outstanding performance on COCO and LVIS, such as HTC [Chen et al.2019a], SOLOv2 [Wang et al.2020], and PointRend [Kirillov et al.2020]. Most of these detectors focus on overall performance on public datasets like COCO, which contains much smaller instances than 3D-FUTURE, and pay less attention to large object segmentation. As illustrated in Figure 1, the size distribution of bounding boxes in 3D-FUTURE and COCO indicates that the former contains much larger objects while the latter is dominated by smaller instances. Thus, the prominent methods used on COCO, like MaskRCNN [He et al.2017] and HTC, may generate blurry contours for large instances. Their mask heads output segmentation from a small, fixed feature size (e.g., 28×28), which is insufficient to represent large objects. All of these observations motivate us to segment large instances in a fine-grained and high-quality manner.
SOLOv2 builds an efficient single-shot framework with strong performance and dynamically generates predictions with a much larger mask size (e.g., 1/4 scale of the input size) than HTC. PointRend iteratively renders the output mask over adaptively sampled uncertain points in a coarse-to-fine fashion, which is naturally suitable for generating smooth and fine-grained instance boundaries. Through extensive experiments on HTC, SOLOv2 and PointRend, we find that PointRend produces finer mask boundaries and outperforms the other methods by a large margin. Our step-by-step modifications to PointRend achieve state-of-the-art performance on the 3D-FUTURE dataset, yielding 79.17 mAP and 77.38 mAP on the validation and test sets respectively. The final submission is an ensemble of 5 PointRend models with slightly different settings, reaching 1st place in this competition.

Figure 1: Size distribution of bounding boxes in 3D-FUTURE and COCO. We randomly select 10,000 images for fair comparison. The x-axis denotes the square root of the area of a bounding box and the y-axis denotes the number of boxes within each corresponding interval.

2 Datasets

The 3D-FUTURE dataset is a recently released large-scale indoor dataset with 34 categories. Following the official splits, we adopt 12,144 images for training, 2,024 for validation and 6,072 for testing. From the size distribution of bounding boxes in 3D-FUTURE and COCO shown in Figure 1, the median object size (square root of the box area) of 3D-FUTURE is about 250, while it is roughly 50 for COCO, indicating that 3D-FUTURE contains many more large instances. (Footnote: following the official 3D-FUTURE setting, the area thresholds that separate small, medium and large objects differ from the 32² and 96² thresholds defined in COCO.) This distribution divergence motivates us to explore fine-grained large object segmentation methods like PointRend.
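
The size statistic behind Figure 1 can be sketched as follows. This is a minimal illustration, assuming boxes are given in COCO-style `[x, y, width, height]` format; the function names, the bin width and the toy boxes are hypothetical and are not part of the challenge toolkit.

```python
import math
from collections import Counter
from statistics import median


def sqrt_area(box):
    """Return sqrt(width * height) for a COCO-style [x, y, w, h] box (assumed format)."""
    _, _, w, h = box
    return math.sqrt(w * h)


def size_histogram(boxes, bin_width=50):
    """Count boxes per sqrt-area interval [k * bin_width, (k + 1) * bin_width)."""
    counts = Counter(int(sqrt_area(b) // bin_width) * bin_width for b in boxes)
    return dict(sorted(counts.items()))


# Toy boxes for illustration only; not taken from 3D-FUTURE or COCO.
boxes = [[0, 0, 40, 40], [10, 10, 100, 400], [5, 5, 300, 300], [0, 0, 250, 250]]
sizes = [sqrt_area(b) for b in boxes]
print(median(sizes))          # median object size of the toy set
print(size_histogram(boxes))  # number of boxes per interval
```

Running the same two functions over the annotations of each dataset would give the per-interval counts and the median size that the comparison above relies on.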

| Model | Backbone | APs | APm | APl | mAP |
|---|---|---|---|---|---|
| HTC | Res2Net | 49.72 | 67.15 | 80.13 | 74.58 |
| SOLOv2 | ResNeXt | 50.03 | 68.58 | 81.81 | 75.29 |
| PointRend | ResNeXt | 56.23 | 73.12 | 85.34 | 79.17 |

Table 1: Performance comparison among different methods on the validation set (trackA). Per-model settings cover DCN, GC block, MaskScoring, MS Test and SyncBN, where “DCN” means deformable convolutional network, “GC block” means global context block and “MS Test” means multi-scale testing.

| Model | Backbone | mAP |
|---|---|---|
| MaskRCNN | Res50 | 53.2 |
| PointRend | Res50 | 62.9 |
| PointRend | Res50 | 64.0 |
| PointRend | X101-64x4d | 69.4 |
| PointRend | X101-64x4d | 71.6 |
| PointRend | X101-64x4d | 74.3 |

Table 2: PointRend’s step-by-step performance on our own validation set (split from the original training set). Settings toggled across rows are Large Resolution, P6 Feature, DCN, MP Train, MP Test, MS Test and FP16. “MP Train” means more points training and “MP Test” means more points testing. “P6 Feature” indicates adding P6 to the default P2-P5 levels of FPN for both the coarse prediction head and the fine-grained point head. “FP16” means mixed precision training.

3 Methods

In this section, we describe our practice with three competitive segmentation methods: HTC, SOLOv2 and PointRend. We show the step-by-step modifications adopted on PointRend, which achieves better performance and outputs much smoother instance boundaries than the other methods.

3.1 Hybrid Task Cascade

HTC is known as a competitive method on COCO and Open Images. By enlarging the RoI size of the box and mask branches to 12 and 32 respectively for all three stages, we gain roughly 4 mAP over the default settings in the original paper. A mask scoring head [Huang et al.2019] adopted on the third stage gains another 2 mAP. Armed with DCN, GC block and SyncBN training, our HTC with a Res2Net101 backbone yields 74.58 mAP on the validation set, as shown in Table 1. However, the convolutional mask heads adopted in all stages bring non-negligible computation and memory costs, which constrain the mask resolution and further limit the segmentation quality for large instances.

3.2 SOLOv2

Due to the limited mask representation of HTC, we move on to SOLOv2, which uses a much larger mask to segment objects. It builds an efficient yet simple instance segmentation framework, outperforming other segmentation methods like TensorMask [Chen et al.2019c], CondInst [Tian et al.2020] and BlendMask [Chen et al.2020] on COCO. In SOLOv2, the unified mask feature branch is dynamically convolved with learned kernels, and the adaptively generated mask for each location benefits from the whole image view instead of cropped region proposals as in HTC. Using ResNeXt101-64x4d equipped with DCN and GC block, SOLOv2 achieves 75.29 mAP on the validation set (see Table 1). It is worth noting that other attempts, including NASFPN, data augmentation and Mask Scoring, bring little improvement in our experiments.
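
The dynamic convolution idea can be illustrated with a toy example. The sketch below assumes a 1×1 dynamic kernel (one weight per feature channel) followed by a sigmoid, and uses plain Python lists; `dynamic_mask` and the sample values are hypothetical, and this is not the SOLOv2 implementation.

```python
import math


def dynamic_mask(mask_feature, kernel):
    """Apply a predicted 1x1 kernel to a shared mask feature map.

    mask_feature: list of C channels, each an H x W list of floats.
    kernel: list of C weights predicted for one instance location.
    Returns an H x W map of foreground probabilities.
    """
    height, width = len(mask_feature[0]), len(mask_feature[0][0])
    out = [[0.0] * width for _ in range(height)]
    for weight, channel in zip(kernel, mask_feature):
        for y in range(height):
            for x in range(width):
                out[y][x] += weight * channel[y][x]
    # Sigmoid turns the per-pixel response into a probability.
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in out]


# Two-channel 2x2 feature map shared by all instances (toy values).
feature = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# Different kernels produce different masks from the same feature map.
mask_a = dynamic_mask(feature, [4.0, -4.0])
mask_b = dynamic_mask(feature, [-4.0, 4.0])
print([[round(p) for p in row] for row in mask_a])
print([[round(p) for p in row] for row in mask_b])
```

Because the feature map covers the whole image, each predicted kernel yields a full-image mask rather than a mask cropped to a region proposal.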

| Model | mAP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|
| PointRend | 77.21 | 90.09 | 82.88 | 47.30 | 71.98 | 81.90 |
| PointRend | 77.38 | 89.34 | 83.28 | 45.31 | 71.21 | 82.24 |
| PointRend | 77.32 | 89.79 | 83.24 | 45.78 | 72.25 | 81.70 |
| PointRend | 77.37 | 89.78 | 83.39 | 46.07 | 72.84 | 81.68 |

Table 3: PointRend’s performance on the testing set (trackB). Settings varied across the four models are Res2Net, ResNeXt, BFP, EnrichFeat, DCN and MS Test. “EnrichFeat” means enhancing the feature representation of the coarse mask head and point head by increasing the number of fully-connected layers or their hidden sizes. “BFP” means Balanced Feature Pyramid. Note that BFP and EnrichFeat bring little improvement; we conjecture that this is because our PointRend baseline already achieves promising performance (77.38 mAP).

3.3 PointRend

PointRend performs point-based segmentation at adaptively selected locations and generates high-quality instance masks. It produces smooth object boundaries with much finer details than previous two-stage detectors like MaskRCNN, which naturally benefits large object instances and complex scenes. Furthermore, compared to HTC’s mask head, PointRend’s lightweight segmentation head dramatically reduces both memory and computation costs, and thus enables larger input image resolutions during training and testing, which further improves the segmentation quality.
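
The idea of refining only adaptively selected locations can be sketched in a few lines. The example below is a simplified illustration, assuming uncertainty is measured as the distance of a foreground probability from 0.5; it omits the mask upsampling and feature interpolation of the real method, and `most_uncertain_points`, `refine` and the stand-in point head are hypothetical names.

```python
def most_uncertain_points(prob_map, num_points):
    """Return (row, col) of the num_points pixels whose probability is closest to 0.5."""
    scored = [
        (abs(p - 0.5), (r, c))
        for r, row in enumerate(prob_map)
        for c, p in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0])
    return [pos for _, pos in scored[:num_points]]


def refine(prob_map, points, point_head):
    """Re-predict only the selected points, leaving confident pixels untouched."""
    refined = [row[:] for row in prob_map]
    for r, c in points:
        refined[r][c] = point_head(r, c)
    return refined


# Toy coarse prediction: confident interior and background, uncertain diagonal.
coarse = [
    [0.95, 0.90, 0.55],
    [0.92, 0.48, 0.10],
    [0.60, 0.05, 0.02],
]
points = most_uncertain_points(coarse, num_points=3)
print(points)
# Stand-in point head (hypothetical): marks the upper-left triangle as foreground.
fine = refine(coarse, points, lambda r, c: 1.0 if r + c <= 2 else 0.0)
print(fine)
```

Only the three boundary pixels are re-evaluated, which is why the cost of the segmentation head grows with the number of selected points rather than with the full mask resolution.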
To fully understand which components contribute to PointRend’s performance, we construct our own validation set by randomly selecting 3000 images from the original training data for offline evaluation. We show the step-by-step improvements adopted on PointRend below.
Bells and Whistles. MaskRCNN-ResNet50 is used as the baseline and achieves 53.2 mAP. For PointRend, we follow the same setting as [Kirillov et al.2020], except that both coarse and fine-grained features are extracted from the P2-P5 levels of FPN rather than only from P2 as described in the paper. Surprisingly, PointRend yields 62.9 mAP and surpasses MaskRCNN by a remarkable margin of 9.7 mAP.
More Points Test. By increasing the number of subdivision points from the default 28 to 70 during inference, we gain another 1.1 mAP at no additional training cost.
Large Backbone. X101-64x4d [Xie et al.2017] is then used as a larger backbone, and it brings a gain of roughly 6 mAP over ResNet50.
DCN and More Points Train. We adopt more interpolated points during training by increasing the number of sampled points from the original 14 to 26 for the coarse prediction head, and from 14 to 24 for the fine-grained point head. Then, by adopting DCN [Dai et al.2017], we reach 71.6 mAP, which already outperforms HTC and SOLOv2 in our offline evaluation.
Large Resolution and P6 Feature. Thanks to PointRend’s lightweight segmentation head and lower memory consumption compared to HTC, the input resolution can be further increased from the range [800,1000] to [1200,1400] during multi-scale training. The P6 level of FPN is also added for both the coarse prediction head and the fine-grained point head, which finally yields 74.3 mAP on our split validation set. Other tricks we tried on PointRend give little improvement, including the MaskScoring head, GC Block and DoubleHead [Wu et al.2020].
In the following, we refer to the model in the last row (74.3 mAP) of Table 2 as the PointRend baseline. The baseline trained on the official training set finally reaches 79.17 and 77.38 mAP on the validation and testing sets respectively, as shown in Table 1 and Table 3. It surpasses SOLOv2 by a large margin: 6.2, 4.5 and 3.5 mAP for small, medium and large objects respectively on the validation set. We believe that PointRend’s iterative rendering process plays a key role in generating high-quality masks, especially fine-grained instance boundaries. Due to its superior performance, we choose only PointRend models as ensemble candidates for the final submission.

4 Final Submission

4.1 Implementation Details

We implement PointRend using MMDetection [Chen et al.2019b] and adopt the modifications and tricks mentioned in Section 3.3. Both X101-64x4d and Res2Net101 [Gao et al.2019] are used as our backbones, pretrained on ImageNet only. SGD with momentum 0.9 and weight decay 1e-4 is adopted. By default, the initial learning rate is set to 0.01 for Res2Net101 and 0.02 for X101-64x4d, and it is decayed by a factor of 0.1 at epoch 32. During training, the batch size is 8 (one image per GPU) and all BN statistics are frozen. Mixed precision training is used to reduce GPU memory usage. The input images are randomly resized, with the target size uniformly sampled from a predefined range. All models are trained for 44 epochs. For inference, images are resized and horizontal flip is used.
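
The training schedule described above can be sketched in plain Python. This is an illustrative sketch only: it assumes epochs are counted from 0 and that the decay applies from epoch 32 onward, and it borrows the [1200, 1400] multi-scale range from Section 3.3 as the sampling range; `learning_rate` and `sample_scale` are hypothetical helpers, not MMDetection functions.

```python
import random


def learning_rate(epoch, base_lr, decay_epoch=32, gamma=0.1):
    """Step schedule: base_lr before decay_epoch, base_lr * gamma from it onward."""
    return base_lr * gamma if epoch >= decay_epoch else base_lr


def sample_scale(low, high, rng=random):
    """Draw one training scale uniformly from the integer range [low, high]."""
    return rng.randint(low, high)


TOTAL_EPOCHS = 44
BASE_LR = 0.02  # X101-64x4d; 0.01 is used for Res2Net101

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for epoch in (0, 31, 32, TOTAL_EPOCHS - 1):
    lr = learning_rate(epoch, BASE_LR)
    scale = sample_scale(1200, 1400, rng)  # assumed range, see lead-in
    print(f"epoch {epoch}: lr={lr:.4f}, sampled scale={scale}")
```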

4.2 Test Performance

As shown in Table 3, all PointRend models achieve promising performance. Even without ensembling, our PointRend baseline, which yields 77.38 mAP, already achieves 1st place on the test leaderboard. Note that several attempts, like BFP [Pang et al.2019] and EnrichFeat, give no improvement over the PointRend baseline, but they still serve as final ensemble candidates. In addition to the models listed in Table 3, another PointRend model with a slightly different setting (stacking two BFP modules, and increasing the RoIAlign size from the original 7 to 10 for the bounding box branch) is trained and achieves 76.95 mAP on the testing set. In total, 5 models are used for the final ensemble.

4.3 Model Ensemble

Our final submission is an ensemble of 5 PointRend models. We compare two different ensemble strategies: one is Linear-Reweight [Huang et al.2020], and the other is a linear interpolation based on the models’ scores (Linear-Interpolation). Formally, given a list of model candidates and their scores, the Linear-Interpolation strategy reweights each model with a coefficient obtained by linear interpolation over the scores, where the two interpolation parameters are set to 0.6 and 1.0, respectively.
To optimize for AP, soft-NMS is adopted. As shown in Table 4, Linear-Interpolation is chosen as the final ensemble strategy; it boosts the best single model’s performance by about 1.6 mAP, slightly better than Linear-Reweight.
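
As a minimal sketch of score-based linear interpolation, the code below assumes a min-max form in which the lowest-scoring model receives weight 0.6 and the highest-scoring model receives 1.0. This form, the function name and the parameter names `w_low` and `w_high` are assumptions for illustration rather than the exact formula used for the submission; the scores are the single-model test mAP values reported in Table 3 and Section 4.2.

```python
def linear_interpolation_weights(scores, w_low=0.6, w_high=1.0):
    """Map each score linearly onto [w_low, w_high] (assumed min-max form)."""
    s_min, s_max = min(scores), max(scores)
    if s_max == s_min:
        # All models score the same, so they are weighted equally.
        return [w_high] * len(scores)
    span = s_max - s_min
    return [w_low + (w_high - w_low) * (s - s_min) / span for s in scores]


# Single-model test mAP of the five ensemble candidates.
scores = [77.21, 77.38, 77.32, 77.37, 76.95]
weights = linear_interpolation_weights(scores)
for s, w in zip(scores, weights):
    print(f"{s:.2f} -> {w:.3f}")
```

Under this assumed form, a higher-scoring model always contributes at least as much as a lower-scoring one, while even the weakest candidate keeps a weight of 0.6.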

| Method | mAP |
|---|---|
| Ensemble Candidates | 76.95 to 77.38 |
| Linear-Reweight | 78.92 |
| Linear-Interpolation | 79.03 |

Table 4: Final ensemble results on testing set (trackB).
Figure 2: Example of segmentation results on validation dataset from three best single models: (a)(d) HTC, (b)(e) SOLOv2 and (c)(f) PointRend. PointRend predicts masks with substantially finer details around object boundaries. All figures are best viewed digitally with zoom.

5 Visualization

As shown in Figure 2, we compare HTC, SOLOv2 and PointRend by visualizing their predictions. PointRend generates much finer and smoother segmentation boundaries than HTC and SOLOv2, and it also handles overlapping instances gracefully (see the top-left corner of Figure 2). Meanwhile, PointRend succeeds in classifying holes inside objects as background, while HTC and SOLOv2 may incorrectly predict them as foreground (see the bottom row of Figure 2). We attribute PointRend’s success to its iterative rendering process, which performs point-based segmentation at adaptively selected uncertain points and learns to output more fine-grained object contours.

6 Conclusion

In this work, we conduct extensive experiments with HTC, SOLOv2 and PointRend on the 3D-FUTURE dataset, among which PointRend achieves the best performance and generates smoother object boundaries. By focusing on coarse-to-fine segmentation of large objects, our final submission, an ensemble of 5 PointRend models, achieves 1st place in the 3D AI Challenge - Instance Segmentation Track.


  • [Chen et al.2019a] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
  • [Chen et al.2019b] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
  • [Chen et al.2019c] Xinlei Chen, Ross Girshick, Kaiming He, and Piotr Dollár. Tensormask: A foundation for dense object segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2061–2069, 2019.
  • [Chen et al.2020] Hao Chen, Kunyang Sun, Zhi Tian, Chunhua Shen, Yongming Huang, et al. BlendMask: Top-down meets bottom-up for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8573–8581, 2020.
  • [Dai et al.2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, et al. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
  • [Gao et al.2019] Shanghua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, et al. Res2net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [He et al.2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
  • [Huang et al.2019] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask Scoring R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6409–6418, 2019.
  • [Huang et al.2020] Zehao Huang, Zehui Chen, Qiaofei Li, Hongkai Zhang, and Naiyan Wang. 1st place solutions of waymo open dataset challenge 2020 – 2D object detection track, 2020.
  • [Kirillov et al.2020] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9799–9808, 2020.
  • [Pang et al.2019] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 821–830, 2019.
  • [Tian et al.2020] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. arXiv preprint arXiv:2003.05664, 2020.
  • [Wang et al.2020] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic, Faster and Stronger. arXiv preprint arXiv:2003.10152, 2020.
  • [Wu et al.2020] Yue Wu, Yinpeng Chen, Lu Yuan, Zicheng Liu, Lijuan Wang, et al. Rethinking classification and localization for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10186–10195, 2020.
  • [Xie et al.2017] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.