1st Place Solutions of 3D AI Challenge 2020 (IJCAI-PRICAI 2020 Workshop) - Instance Segmentation Track
This technical report presents the solutions of Team 'FineGrainedSeg' for the Instance Segmentation track of the 3D AI Challenge 2020. To handle the extremely large objects in 3D-FUTURE, we adopt PointRend as our basic framework, which outputs more fine-grained masks than HTC and SOLOv2. Our final submission is an ensemble of 5 PointRend models, which achieves 1st place on both the validation and test leaderboards. The code is available at https://github.com/zehuichen123/3DFuture_ins_seg.
Recently, many modern instance segmentation approaches demonstrate outstanding performance on COCO and LVIS, such as HTC
[Chen et al.2019a], SOLOv2 [Wang et al.2020], and PointRend [Kirillov et al.2020]. Most of these detectors focus on overall performance on public datasets such as COCO, which contains much smaller instances than 3D-FUTURE, and pay less attention to the segmentation of large objects. As illustrated in Figure 1, the size distributions of bounding boxes in 3D-FUTURE and COCO indicate that the former contains much larger objects while the latter is dominated by smaller instances. Thus, prominent methods tuned on COCO, such as MaskRCNN [He et al.2017] and HTC, may generate blurry contours for large instances: their mask heads predict segmentation from a small feature map (e.g., 28x28 in MaskRCNN), which is far too coarse to represent large objects. All of this motivates us to segment large instances in a fine-grained, high-quality manner.

3D-FUTURE is a recently released large-scale indoor dataset with 34 categories. Following the official splits, we use 12,144 images for training, 2,024 for validation and 6,072 for testing. From the size distributions of bounding boxes in 3D-FUTURE and COCO shown in Figure 1, the median object size in 3D-FUTURE is about 250 pixels versus roughly 50 for COCO, indicating that 3D-FUTURE contains far more large instances (following the 3D-FUTURE official setting, the area thresholds defining small, medium and large objects differ from the 32^2 and 96^2 thresholds defined for COCO). This distribution divergence motivates us to explore fine-grained large-object segmentation methods such as PointRend.
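As a side note, the size statistics behind Figure 1 are straightforward to reproduce from COCO-style annotation files; a minimal sketch is shown below (the file names are hypothetical, and we assume 3D-FUTURE annotations follow the COCO bbox convention).

```python
# Minimal sketch: compare object-size distributions (sqrt of bbox area),
# as in Figure 1, from COCO-style annotation JSON files.
import json
import numpy as np

def box_sizes(annotation_file):
    with open(annotation_file) as f:
        anns = json.load(f)["annotations"]
    # COCO bbox format is [x, y, width, height]; size = sqrt(w * h)
    return np.sqrt([a["bbox"][2] * a["bbox"][3] for a in anns])

for name, path in [("3D-FUTURE", "3dfuture_train.json"),   # hypothetical paths
                   ("COCO", "coco_train2017.json")]:
    sizes = box_sizes(path)
    print(f"{name}: median object size = {np.median(sizes):.1f} px")
```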
| Model | Backbone | DCN | GC block | MaskScoring | MS Test | SyncBN | APs | APm | APl | mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| HTC | Res2Net101 | ✓ | ✓ | ✓ | | ✓ | 49.72 | 67.15 | 80.13 | 74.58 |
| SOLOv2 | ResNeXt101-64x4d | ✓ | ✓ | | | | 50.03 | 68.58 | 81.81 | 75.29 |
| PointRend | ResNeXt101-64x4d | ✓ | | | | | 56.23 | 73.12 | 85.34 | 79.17 |

Table 1: Comparison of HTC, SOLOv2 and PointRend on the validation set. "DCN" means deformable convolutional network, "GC block" means global context block and "MS Test" means multi-scale testing.
| Model | Backbone | Large Resolution | P6 Feature | DCN | MP Train | MP Test | MS Test | FP16 | mAP |
|---|---|---|---|---|---|---|---|---|---|
| MaskRCNN | Res50 | | | | | | | | 53.2 |
| PointRend | Res50 | | | | | | | | 62.9 |
| PointRend | Res50 | | | | | ✓ | | | 64.0 |
| PointRend | X101-64x4d | | | | | ✓ | | | 69.4 |
| PointRend | X101-64x4d | | | ✓ | ✓ | ✓ | | | 71.6 |
| PointRend | X101-64x4d | ✓ | ✓ | ✓ | ✓ | ✓ | | | 74.3 |

Table 2: Step-by-step improvements of PointRend on our offline validation split. "MP Train" and "MP Test" denote using more sampled points during training and testing, respectively.
In this section, we describe our practice with three competitive segmentation methods: HTC, SOLOv2 and PointRend. We detail the step-by-step modifications adopted on PointRend, which achieves better performance and outputs much smoother instance boundaries than the other methods.
HTC is known as a competitive method on COCO and OpenImages. By enlarging the RoI sizes of the box and mask branches to 12 and 32, respectively, for all three stages, we gain roughly 4 mAP over the default settings in the original paper. A mask scoring head [Huang et al.2019] on the third stage gains another 2 mAP. Armed with DCN, the GC block and SyncBN training, our HTC with a Res2Net101 backbone yields 74.58 mAP on the validation set, as shown in Table 1. However, the convolutional mask heads adopted in all stages bring non-negligible computation and memory costs, which constrain the mask resolution and further limit the segmentation quality for large instances.
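For illustration, a sketch of how the enlarged RoI sizes could be expressed in an MMDetection-style HTC config is shown below; only the fields relevant to the change are included, and all other values simply follow the stock MMDetection defaults (this is a sketch, not our exact configuration).

```python
# Excerpt of an MMDetection-style HTC model config with enlarged RoI sizes.
# Everything not shown follows the stock HTC config.
model = dict(
    roi_head=dict(
        type='HybridTaskCascadeRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=12, sampling_ratio=0),  # default 7 -> 12
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        mask_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=32, sampling_ratio=0),  # default 14 -> 32
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        # each of the three cascade bbox heads should use roi_feat_size=12
        # (the full head definitions are omitted here for brevity)
    ))
```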
Due to HTC's limited mask representation, we move on to SOLOv2, which uses much larger masks to segment objects. It is an efficient yet simple instance segmentation framework that outperforms other segmentation methods such as TensorMask [Chen et al.2019c], CondInst [Tian et al.2020] and BlendMask [Chen et al.2020] on COCO. In SOLOv2, a unified mask feature branch is dynamically convolved with learned kernels, and the adaptively generated mask for each location benefits from the whole-image view instead of cropped region proposals as in HTC. Using ResNeXt101-64x4d equipped with DCN and the GC block, SOLOv2 achieves 75.29 mAP on the validation set (see Table 1). It is worth noting that other attempts, including NAS-FPN, data augmentation and mask scoring, bring little improvement in our experiments.
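To make the dynamic-convolution idea concrete, the toy PyTorch sketch below (a simplification for illustration, not the actual SOLOv2 code) shows how per-location predicted kernels act as 1x1 convolutions over a shared mask feature map.

```python
# Toy sketch of SOLOv2-style dynamic convolution: a kernel branch predicts
# one 1x1 conv kernel per grid location, which is convolved with a shared
# mask feature map to produce that location's instance mask.
import torch
import torch.nn.functional as F

E = 128                                   # mask feature channels (illustrative)
mask_feat = torch.randn(1, E, 200, 200)   # unified mask feature branch (B, E, H, W)
kernels = torch.randn(40 * 40, E)         # predicted 1x1 kernels, one per S x S grid cell

# Treat the predicted kernels as weights of a 1x1 convolution:
# the output has one mask channel per grid location.
masks = F.conv2d(mask_feat, kernels.view(-1, E, 1, 1)).sigmoid()
print(masks.shape)  # torch.Size([1, 1600, 200, 200])
```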
| Model | Res2Net | ResNeXt | BFP | EnrichFeat | DCN | MS Test | mAP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointRend | | | | | | | 77.21 | 90.09 | 82.88 | 47.30 | 71.98 | 81.90 |
| PointRend | | | | | | | 77.38 | 89.34 | 83.28 | 45.31 | 71.21 | 82.24 |
| PointRend | | | | | | | 77.32 | 89.79 | 83.24 | 45.78 | 72.25 | 81.70 |
| PointRend | | | | | | | 77.37 | 89.78 | 83.39 | 46.07 | 72.84 | 81.68 |

Table 3: PointRend models with different backbones and modules, used as ensemble candidates.
PointRend performs point-based segmentation at adaptively selected locations and generates high-quality instance masks. It produces smooth object boundaries with much finer detail than previous two-stage detectors such as MaskRCNN, which naturally benefits large instances and complex scenes. Furthermore, compared to HTC's mask head, PointRend's lightweight segmentation head dramatically reduces both memory and computation costs, thus enabling larger input resolutions during training and testing, which further improves segmentation quality.
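The adaptive point selection can be illustrated with a small PyTorch sketch (a simplified toy version following the PointRend paper's description, not the official implementation): for a binary mask, points whose logits are closest to zero are the most uncertain and are the ones refined.

```python
# Toy sketch of PointRend-style uncertain-point selection for a binary mask:
# uncertainty is highest where the logit is closest to 0 (probability ~0.5).
import torch

def most_uncertain_points(mask_logits, num_points):
    """mask_logits: (N, 1, H, W) coarse per-instance mask logits."""
    n, _, h, w = mask_logits.shape
    uncertainty = -mask_logits.abs().view(n, h * w)        # higher = less certain
    idx = uncertainty.topk(num_points, dim=1).indices      # flat indices of selected points
    ys = torch.div(idx, w, rounding_mode='floor')
    xs = idx % w
    return torch.stack([xs, ys], dim=-1)                   # (N, num_points, 2) pixel coords

coarse = torch.randn(2, 1, 28, 28)
points = most_uncertain_points(coarse, num_points=196)
print(points.shape)  # torch.Size([2, 196, 2])
```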
To fully understand which components contribute to PointRend's performance, we construct our own validation set by randomly selecting 3,000 images from the original training data for offline evaluation. Below we show the step-by-step improvements adopted on PointRend.
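A minimal sketch of such an offline split from a COCO-style training annotation file is shown below (the file names and random seed are hypothetical).

```python
# Sketch: split a COCO-style annotation file into offline train/val subsets.
import json
import random

with open('3dfuture_train.json') as f:        # hypothetical file name
    data = json.load(f)

random.seed(0)
val_ids = {img['id'] for img in random.sample(data['images'], 3000)}

def subset(keep_val):
    """Return a COCO-style dict containing only the selected (or remaining) images."""
    return {
        **data,
        'images': [im for im in data['images'] if (im['id'] in val_ids) == keep_val],
        'annotations': [a for a in data['annotations'] if (a['image_id'] in val_ids) == keep_val],
    }

for name, keep in [('offline_val.json', True), ('offline_train.json', False)]:
    with open(name, 'w') as f:
        json.dump(subset(keep), f)
```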
Bells and Whistles. MaskRCNN-ResNet50 is used as the baseline and achieves 53.2 mAP. For PointRend, we follow the same settings as [Kirillov et al.2020], except that both coarse and fine-grained features are extracted from the P2-P5 levels of the FPN rather than only P2 as described in the paper. Surprisingly, PointRend yields 62.9 mAP, surpassing MaskRCNN by a remarkable margin of 9.7 mAP.

More Points Test. Increasing the number of subdivision points from the default 28 to 70 during inference gains another 1.1 mAP at no training cost.

Large Backbone. X101-64x4d [Xie et al.2017] is then used as a larger backbone and boosts mAP by roughly 6 points over ResNet50.

DCN and More Points Train. We sample more interpolated points during training, increasing the number of sampled points from 14 to 26 for the coarse prediction head and from 14 to 24 for the fine-grained point head. Adding DCN [Dai et al.2017] brings us to 71.6 mAP, which already outperforms HTC and SOLOv2 in our offline evaluation.

Large Resolution and P6 Feature. Thanks to PointRend's lightweight segmentation head and lower memory consumption compared to HTC, the input resolution can be further increased from the range [800, 1000] to [1200, 1400] during multi-scale training. The P6 level of the FPN is also added for both the coarse prediction head and the fine-grained point head, which finally yields 74.3 mAP on our split validation set. Other tricks we tried on PointRend, including the MaskScoring head, GC block and DoubleHead [Wu et al.2020], give little improvement.
We implement PointRend using MMDetection [Chen et al.2019b] and adopt the modifications and tricks mentioned in Section 3.3. Both X101-64x4d and Res2Net101 [Gao et al.2019] are used as our backbones, pretrained on ImageNet only. SGD with momentum 0.9 and weight decay 1e-4 is adopted. The initial learning rate is set to 0.01 for Res2Net101 and 0.02 for X101-64x4d by default, and decayed by a factor of 0.1 at epoch 32. During training, the batch size is 8 (one image per GPU) and all BN statistics are frozen. Mixed-precision training is used to reduce GPU memory. The shorter side of input images is randomly resized within the range [1200, 1400] during multi-scale training. All models are trained for 44 epochs. For inference, images are resized to a large resolution and horizontal flip testing is used.

As shown in Table 3, all PointRend models achieve promising performance. Even without the ensemble, our PointRend baseline, which yields 77.38 mAP, already takes 1st place on the test leaderboard. Note that several attempts, such as BFP [Pang et al.2019] and EnrichFeat, give no improvement over the PointRend baseline, but they still serve as ensemble candidates. In addition to the models listed in Table 3, another PointRend with a slightly different setting (stacking two BFP modules and increasing the RoIAlign size of the bounding-box branch from 7 to 10) is trained and achieves 76.95 mAP on the testing set, giving 5 models for the final ensemble.
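For reference, the sketch below expresses the main training settings as an MMDetection-style config fragment; key names follow the stock PointRend config, the base file name is hypothetical, and interpreting the subdivision points "28 to 70" as point-grid sides (28x28 to 70x70) is our assumption.

```python
# Sketch of the key settings as an MMDetection-style config fragment.
_base_ = './point_rend_x101_64x4d_fpn_coco.py'  # hypothetical base config

model = dict(
    test_cfg=dict(rcnn=dict(
        subdivision_steps=5,
        subdivision_num_points=70 * 70)))        # default 28x28 -> 70x70 points per step

# Multi-scale training: shorter side sampled from [1200, 1400]
# (the long-side cap of 2000 is an assumption).
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=[(2000, 1200), (2000, 1400)],
         multiscale_mode='range', keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
data = dict(samples_per_gpu=1, train=dict(pipeline=train_pipeline))

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=1e-4)  # 0.01 for Res2Net101
lr_config = dict(policy='step', step=[32])       # decay by 0.1 at epoch 32
runner = dict(type='EpochBasedRunner', max_epochs=44)
fp16 = dict(loss_scale=512.)                     # mixed-precision training
```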
Our final submission is an ensemble of 5 PointRend models. We compare two ensemble strategies: Linear-Reweight [Huang et al.2020] and a linear interpolation based on model scores (Linear-Interpolation). Formally, given a list of model candidates $\{m_1, \dots, m_n\}$ and their scores $\{s_1, \dots, s_n\}$, the Linear-Interpolation strategy reweights each model $m_i$ with coefficient $w_i$:

$$w_i = a + \frac{s_i - \min_j s_j}{\max_j s_j - \min_j s_j}\,(b - a), \tag{1}$$

where $a$ and $b$ are set to 0.6 and 1.0, respectively.
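A small sketch of this reweighting, assuming the reconstruction of Eq. (1) above, is given below; the example scores are the candidate mAPs reported in this section.

```python
# Sketch of the Linear-Interpolation reweighting in Eq. (1):
# each model's weight is linearly interpolated between a and b by its score.
def interpolation_weights(scores, a=0.6, b=1.0):
    lo, hi = min(scores), max(scores)
    rng = max(hi - lo, 1e-12)          # guard against identical scores
    return [a + (b - a) * (s - lo) / rng for s in scores]

# e.g. mAPs of the five ensemble candidates (values from Table 3 and the text)
print(interpolation_weights([76.95, 77.21, 77.32, 77.37, 77.38]))
```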
To optimize for AP, soft-NMS is adopted. As shown in Table 4, Linear-Interpolation is chosen as the final ensemble strategy; it boosts the best single model's performance by 1.6 mAP and is slightly better than Linear-Reweight.
| Method | mAP |
|---|---|
| Ensemble Candidates | [76.95, 77.38] |
| Linear-Reweight | 78.92 |
| Linear-Interpolation | 79.03 |

Table 4: Comparison of ensemble strategies.
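For completeness, a minimal sketch of Gaussian soft-NMS (following Bodla et al., 2017; parameter values are illustrative, not our exact settings) is shown below.

```python
# Minimal sketch of Gaussian soft-NMS: overlapping boxes have their scores
# decayed instead of being discarded outright.
import numpy as np

def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """boxes: (N, 4) in [x1, y1, x2, y2]; returns kept indices and decayed scores."""
    scores = list(scores)
    idxs = list(range(len(scores)))
    keep = []
    while idxs:
        m = max(idxs, key=lambda i: scores[i])      # highest remaining score
        keep.append(m)
        idxs.remove(m)
        for i in idxs:
            iou = box_iou(boxes[m], boxes[i])
            scores[i] *= np.exp(-(iou ** 2) / sigma)  # Gaussian decay
        idxs = [i for i in idxs if scores[i] >= score_thr]
    return keep, scores
```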
As shown in Figure 2, we compare HTC, SOLOv2 and PointRend by visualizing their predictions. PointRend generates much finer and smoother segmentation boundaries than HTC and SOLOv2, and it also handles overlapping instances gracefully (see the top-left corner of Figure 2). Meanwhile, PointRend succeeds in labeling holes inside objects as background, whereas HTC and SOLOv2 may incorrectly predict them as foreground (see the bottom row of Figure 2). We attribute PointRend's success to its iterative rendering process, which performs point-based segmentation at adaptively selected uncertain points and learns to output more fine-grained object contours.
In this work, we conduct extensive experiments with HTC, SOLOv2 and PointRend on the 3D-FUTURE dataset, among which PointRend achieves the best performance and generates the smoothest object boundaries. By focusing on coarse-to-fine segmentation of large objects, our final submission, an ensemble of 5 PointRend models, achieves 1st place in the 3D AI Challenge - Instance Segmentation Track.
Kai Chen, Jiangmiao Pang, Jiaqi Wang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.