Instance segmentation is one of the fundamental but challenging tasks in computer vision. In general, instance segmentation can be split into two steps: object detection and pixel classification. Current instance segmentation methods therefore build directly on advances in object detection such as SSD, Faster R-CNN, and R-FCN. According to the type of detection architecture, instance segmentation methods can be divided into two categories: single-stage and two-stage.
The commonly used two-stage instance segmentation methods prioritize accuracy over speed. Because they use a cascade strategy, these methods are usually time-consuming. In addition, their dependence on feature localization makes them difficult to accelerate. Some recently proposed one-stage instance segmentation methods, e.g., YOLACT, partly solve these problems by dividing the instance segmentation task into two parallel subtasks: prototype mask generation and per-instance mask coefficient prediction. This is an effective way to gain speed over existing two-stage methods such as Mask R-CNN. However, to represent instances of different shapes in an image, all of the above methods require many anchors and a large amount of memory.
To handle this issue, we propose an instance segmentation method based on a one-stage anchor-free detection framework. It is inspired by efficient anchor-free detection methods such as FCOS and CenterNet [7, 8], which obtain a reasonable trade-off between speed and performance by eliminating the predefined set of anchor boxes. Based on FCOS, the proposed instance segmentation task is divided into two subtasks, similar to YOLACT. As shown in Fig. 1 (yellow box), one subtask, which predicts mask coefficients, is assembled into each head of the detector by combining the classification and regression branches. Because the mechanism is anchor-free, only one group of mask coefficients needs to be predicted for each sample, which reduces the total training memory footprint. The other subtask, which generates the prototype masks, is directly implemented as an FCN (green box). All of these subtasks are executed in parallel within a single-stage architecture to speed up the training phase. In addition, to enhance performance without any additional hyperparameters, we propose a center-aware ground truth scheme, which effectively preserves the center of each instance during training and achieves a stable improvement.
Our contributions can be summarized as follows: (1) We propose an instance segmentation method based on an anchor-free mechanism, which has clear advantages in speed and memory usage. (2) We propose a center-aware ground truth scheme, which effectively improves the performance of our framework on both detection and instance segmentation tasks.
II Related work
Two-Stage Instance Segmentation. Instance segmentation can be solved by bounding box detection followed by semantic segmentation within each box, an approach adopted by most existing two-stage methods. Based on Faster R-CNN, Mask R-CNN simply adds a mask branch to predict the mask of each instance. Mask Scoring R-CNN re-scores the mask confidence derived from the classification score by adding a mask-IoU branch, which makes the network predict the IoU between the mask and the ground truth. FCIS predicts a set of position-sensitive output channels that simultaneously address object classes, boxes, and masks. These state-of-the-art methods achieve satisfactory accuracy but are time-consuming.
Single-Stage Instance Segmentation. SPRNet has an encoder-decoder structure in which the classification, regression, and mask branches are processed in parallel. It generates each instance mask from a single pixel and then resizes the mask to fit the corresponding box to get the final instance-level prediction. In the decoding part, each pixel is used as an instance carrier to generate the instance mask, on which consecutive deconvolutions are applied to get the final predictions. YOLACT divides instance segmentation into two parallel subtasks: prototype mask generation and per-instance mask coefficient prediction. The generated prototypes are then combined linearly using the corresponding predicted coefficients and cropped with the predicted bounding box. TensorMask investigates the paradigm of dense sliding-window instance segmentation by using structured 4D tensors to represent masks over a spatial domain. All of the above methods use an anchor-based detection backbone, which requires a large memory footprint in the training phase. PolarMask formulates the instance segmentation problem as instance center classification and dense distance regression in polar coordinates. ExtremeNet uses keypoint detection to predict 8 extreme points of one instance and generates an octagon mask, which achieves relatively reasonable object mask prediction. It is an anchor-free method, but the octagon mask encoding might not depict the mask precisely. We propose a novel instance segmentation method that combines a single-stage anchor-free framework with a robust mask encoding method.
In this section, the proposed method is introduced in detail. The pipeline is shown in Fig. 1. In Section III-A, we explore the application of the anchor-free mechanism to the instance segmentation task. In Section III-B, we propose a center-aware ground truth to improve performance.
III-A Single-stage anchor-free instance segmentation
YOLACT  is a real-time instance segmentation method in which instance segmentation can be divided into two parallel subtasks: 1) mask coefficients prediction and 2) prototypes prediction. In this paper, we follow this parallel mechanism to accelerate the model.
Anchor-free for mask coefficients
Instance segmentation depends strongly on the accuracy of the bounding box. To obtain a high-quality bounding box for each instance, the proposed SAIS is based on FCOS, a one-stage anchor-free architecture that achieves state-of-the-art performance on object detection tasks. As shown in Fig. 1, each head has two branches: one regresses the 4 bounding box offsets, and the other predicts 1 center score and the class confidences. Different from FCOS, a new output layer is added to each head to predict mask coefficients for each sample (Fig. 1, yellow box). To extract rich semantic information, we first fuse the two branches (the classification branch and the regression branch) and then apply a convolutional layer with $k$ channels to predict the $k$ mask coefficients of each sample. In the proposed method, each sample has only $4 + 1 + c + k$ outputs, where $c$ is the number of classes, which is fewer network output variables than the commonly used anchor-based methods with $a$ anchor boxes per sample.
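To make the comparison of output sizes concrete, the following sketch counts the per-location outputs of the two head designs. It reads the anchor-free head as emitting 4 box offsets, 1 center score, the class confidences, and the mask coefficients, and it assumes that an anchor-based head emits 4 box offsets, class confidences, and mask coefficients for each anchor, as in YOLACT. The function names are hypothetical, and the values of 80 classes and 3 anchors are illustrative assumptions; 32 coefficients matches the protonet channel setting reported in the experiments.

```python
def outputs_anchor_free(num_classes: int, num_coeffs: int) -> int:
    """Per-location outputs of an anchor-free head (as read from the text)."""
    # 4 box offsets + 1 center score + class confidences + mask coefficients
    return 4 + 1 + num_classes + num_coeffs


def outputs_anchor_based(num_classes: int, num_coeffs: int, num_anchors: int) -> int:
    """Per-location outputs of an anchor-based head (assumed YOLACT-style)."""
    # Each anchor predicts 4 box offsets + class confidences + mask coefficients
    return num_anchors * (4 + num_classes + num_coeffs)


# Illustrative values: 80 classes and 3 anchors are assumptions, k = 32
c, k, a = 80, 32, 3
free = outputs_anchor_free(c, k)
based = outputs_anchor_based(c, k, a)
print(free, based)  # prints 117 348
```

Under these assumptions the anchor-free head predicts roughly a third as many values per location, and the gap widens as the number of anchors grows.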
Note that the prototype generation branch (protonet) predicts a set of prototype masks for the entire image. The protonet is implemented as an FCN whose last layer has the same number of channels as the mask coefficient prediction layer. The final instance masks are generated by combining the mask prototypes and mask coefficients. For each sample, the mask coefficients are produced by the heads on the FPN, while the mask prototypes are generated by the protonet and shared by all samples. As shown in Fig. 1 (blue box), the final instance mask of a sample is obtained by a single matrix multiplication followed by a sigmoid:
$$M = \sigma(P C^{T}) \qquad (1)$$
where $C$ is a $1 \times k$ matrix of mask coefficients and $P$ is an $h \times w \times k$ matrix of prototype masks, with $k$ the number of prototypes. The single-stage architecture is composed of fully convolutional layers, and all subtasks are executed in parallel, which yields high speed.
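The mask assembly step can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes the prototypes are stored as an array of shape (h, w, k) and the coefficients of one sample as a vector of length k. The function name `assemble_mask` and the 0.5 binarization threshold are assumptions introduced for the example.

```python
import numpy as np


def assemble_mask(prototypes: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Combine k prototype masks with k coefficients for one sample.

    prototypes: array of shape (h, w, k), shared by all samples.
    coeffs:     array of shape (k,), predicted for one sample.
    Returns per-pixel mask probabilities of shape (h, w).
    """
    logits = prototypes @ coeffs          # (h, w, k) @ (k,) -> (h, w)
    return 1.0 / (1.0 + np.exp(-logits))  # element-wise sigmoid


rng = np.random.default_rng(0)
h, w, k = 4, 5, 32                        # k = 32 prototypes, tiny map for the demo
P = rng.standard_normal((h, w, k))
C = rng.standard_normal(k)

mask = assemble_mask(P, C)
binary_mask = mask > 0.5                  # assumed threshold for a hard mask
```

Because the prototypes are shared, assembling the masks of many samples only requires stacking their coefficient vectors and repeating the same product.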
III-B Center-aware ground truth
The labels of all tasks are selected from the ground-truth map. If a location falls into multiple bounding boxes, it is considered an ambiguous sample. To give small objects sufficient consideration, a simple approach is to choose the bounding box with the minimal area as the regression label, as shown in Fig. 2 (top-right). One significant issue is that the center of a large object may be covered by a small object if the centers of the two objects are close enough. This can produce incorrect labels near the region where the real center is covered by the small object. As shown in Fig. 2 (red circle, top-right), the area in the red circle is the center of object 1, but the labels of object 2 are selected as its ground truth.
To address this issue, we propose a new center-aware method to select reasonable labels. In our approach, the center distribution of an object is treated as prior information, which ensures that the real center of each object is preserved during training. We then choose the bounding box with the minimal area as the regression label. Our method can be formally described as follows:
$$S = \operatorname{sort}(\{a_1, a_2, \dots, a_n\}) \qquad (2)$$
$$\text{center} = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}} \qquad (3)$$
where we have $n$ instances in a raw image, and $a_i$ is the area of the bounding box of the $i$-th instance. $\operatorname{sort}(\cdot)$ means that all instances are sorted by area from small to large, which ensures that small objects are considered first. We calculate the center distribution of each object by Equation (3), where $l$, $t$, $r$, and $b$ are the distances from the location to the four sides of the bounding box, as shown in Fig. 2 (bottom-left). Finally, as shown in Fig. 2 (bottom-right), we choose the object with the largest center score as the ground truth for each location. In this way, both the area and the center distribution are considered in the proposed method to achieve better performance.
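The difference between the minimal-area rule and the center-aware rule can be illustrated with a short sketch. It is one possible reading of the description, not the authors' code: boxes are assumed to be (x0, y0, x1, y1) tuples, the center score is computed as in FCOS centerness, candidates are visited from the smallest to the largest area, and ties are resolved in favor of the smaller box. All function names are hypothetical.

```python
import math


def center_score(x, y, box):
    """Center score of location (x, y) for a box; 0.0 if not strictly inside."""
    x0, y0, x1, y1 = box
    l, r, t, b = x - x0, x1 - x, y - y0, y1 - y
    if min(l, r, t, b) <= 0:
        return 0.0
    return math.sqrt(min(l, r) / max(l, r) * min(t, b) / max(t, b))


def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def assign_min_area(x, y, boxes):
    """Baseline rule: the smallest box containing the location wins."""
    inside = [i for i, bx in enumerate(boxes) if center_score(x, y, bx) > 0]
    return min(inside, key=lambda i: area(boxes[i])) if inside else None


def assign_center_aware(x, y, boxes):
    """Center-aware rule: the box with the largest center score wins."""
    order = sorted(range(len(boxes)), key=lambda i: area(boxes[i]))  # small first
    best, best_score = None, 0.0
    for i in order:
        score = center_score(x, y, boxes[i])
        if score > best_score:  # strict comparison keeps the smaller box on ties
            best, best_score = i, score
    return best


# Object 0 is large with center (50, 50); object 1 is small and covers that center.
boxes = [(0, 0, 100, 100), (45, 45, 75, 75)]
print(assign_min_area(50, 50, boxes))      # prints 1
print(assign_center_aware(50, 50, boxes))  # prints 0
print(assign_center_aware(60, 60, boxes))  # prints 1
```

At the center of the large object the baseline assigns the label of the small object, while the center-aware rule keeps the large object; at the center of the small object both rules agree.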
TABLE I
| Input scheme | Input size | Box AP | Mask AP |
|---|---|---|---|
| Keeping aspect ratio | (1333, 800) | 36.7 | 21.2 |
| Fixed size | (768, 768) | 35.9 | 28.2 |
We report results on the MS COCO instance segmentation task using the standard metrics for the task. We train on train2017 and evaluate on val2017 and test-dev. We implement our method in mmdetection.
In our experiments, the network is trained using stochastic gradient descent (SGD) for 12 epochs with a mini-batch of 16 images. The initial learning rate and momentum are 0.01 and 0.9, respectively. The learning rate is reduced by a factor of 10 at epochs 8 and 11. Specifically, the input image is resized to $768 \times 768$. The number of output channels of the protonet is set to 32. We initialize the backbone networks with weights pretrained on ImageNet.
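The step learning-rate schedule described above can be sketched as a small helper. This is an illustration under one assumption: epochs are counted from 0, and the rate drops at the start of epochs 8 and 11. The function name `learning_rate` and its parameter names are hypothetical and do not come from any library.

```python
def learning_rate(epoch: int, base_lr: float = 0.01,
                  milestones=(8, 11), gamma: float = 0.1) -> float:
    """Learning rate used during a given 0-indexed epoch (assumed convention)."""
    # Count how many milestones have been reached and decay once per milestone
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Full 12-epoch schedule: 0.01 for epochs 0-7, then two tenfold reductions
schedule = [learning_rate(e) for e in range(12)]
```

With this convention, 8 of the 12 epochs run at the initial rate, 3 at one tenth of it, and the final epoch at one hundredth.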
IV-A Ablation study
Fixed Input Size. As shown in TABLE I, we find an interesting phenomenon: fixing the input size improves mask prediction accuracy by 7.0 points (from 21.2 to 28.2) in comparison with keeping the aspect ratio, even though the fixed-size input is smaller. We argue that inputs with a fixed size make it easier for the model to represent instance-level semantic context.
Center Awareness. To evaluate the effectiveness of the proposed center-aware target, we apply our method to two different tasks: object detection and instance segmentation. FCOS is a state-of-the-art object detection method in which the offsets of the bounding box are predicted relative to the center position. The results, shown in TABLE II, reveal that the center-aware target achieves gains of 0.2% and 0.3% in terms of AP on the object detection and instance segmentation tasks, respectively. We argue that it is important for instance segmentation to predict masks from the center of the object.
Feature Fusion. To achieve competitive performance, we fuse the feature maps from the classification and regression branches to predict the mask coefficients without additional parameters. The results in TABLE III show that the fusion of the feature maps yields a performance gain, especially for small instances. This is reasonable, since the bounding box (regression branch) contributes extra information for mask coefficient prediction.
IV-B Comparison with the state of the art
In this part, we compare the performance of the proposed method with various state-of-the-art methods, including both two-stage and single-stage models, on the MS COCO dataset. The outputs of our method are visualized in Fig. 3.
The results show that, without bells and whistles, the proposed method achieves competitive performance in comparison with one-stage methods. With less than a quarter of the training epochs, and without data augmentation or an additional semantic loss, SAIS-768 outperforms YOLACT-550 with the same ResNet-50-FPN backbone and ExtremeNet with an Hourglass-104 backbone by 0.5% and 9.3% in AP, respectively. Because SAIS uses an anchor-free architecture, it has a smaller training memory footprint than all of these anchor-based methods. SAIS-640 with ResNet-50-FPN also achieves 29.2 FPS on a TITAN X GPU without the Fast NMS and light-weight head that are exploited in YOLACT. In particular, SAIS-768 achieves 25.4 FPS, which is faster than YOLACT-700 with the same ResNet-101-FPN backbone. This indicates that the anchor-free mechanism is superior to the anchor-based one in terms of speed. Compared with two-stage methods, SAIS-640 achieves higher FPS and a smaller memory footprint in the training phase. In summary, the proposed method, which combines an anchor-free framework with parallel instance segmentation subtasks, achieves competitive speed and accuracy.
In this paper, we propose a single-stage anchor-free instance segmentation method in which all tasks are implemented in parallel. To enhance performance, a center-aware ground truth is designed without any additional parameters. Our framework achieves competitive performance on the MS COCO dataset. In the future, we will focus on lightweight frameworks for instance segmentation, which is a promising direction for industrial applications.
- [1] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg, “SSD: Single shot multibox detector,” 2015. [Online]. Available: arXiv:1512.02325.
- [2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” 2015. [Online]. Available: arXiv:1506.01497.
- [3] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” 2016. [Online]. Available: arXiv:1605.06409.
- [4] D. Bolya, C. Zhou, F. Xiao, and Y. Lee, “YOLACT: Real-time instance segmentation,” 2019. [Online]. Available: arXiv:1904.02689.
- [5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” 2017. [Online]. Available: arXiv:1703.06870.
- [6] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” 2019. [Online]. Available: arXiv:1904.01355.
- [7] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “CenterNet: Keypoint triplets for object detection,” 2019. [Online]. Available: arXiv:1904.08189.
- [8] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” 2019. [Online]. Available: arXiv:1904.07850.
- [9] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask Scoring R-CNN,” Proc. IEEE CVPR, pp. 6409-6418, 2019.
- [10] J. Yu, J. Yao, J. Zhang, Z. Yu, and D. Tao, “SPRNet: Single pixel reconstruction for one-stage instance segmentation,” 2019. [Online]. Available: arXiv:1904.07426.
- [11] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional instance-aware semantic segmentation,” Proc. IEEE CVPR, Jun. 2017.
- [12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proc. IEEE CVPR, pp. 770-778, 2016.
- [13] X. Chen, R. Girshick, K. He, and P. Dollár, “TensorMask: A foundation for dense object segmentation,” 2019. [Online]. Available: arXiv:1903.12174.
- [14] E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo, “PolarMask: Single shot instance segmentation with polar representation,” 2019. [Online]. Available: arXiv:1909.13226.
- [15] X. Zhou, J. Zhuo, and P. Krähenbühl, “Bottom-up object detection by grouping extreme and center points,” Proc. IEEE CVPR, pp. 850-859, 2019.
- [16] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick, “Microsoft COCO: Common objects in context,” Proc. ECCV, 2014.
- [17] K. Chen, J. Wang, J. Pang, et al., “MMDetection: Open MMLab detection toolbox and benchmark,” 2019. [Online]. Available: arXiv:1906.07155.
- [18] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” Proc. IEEE CVPR, pp. 248-255, 2009.