
SAIS: Single-stage Anchor-free Instance Segmentation

In this paper, we propose a simple yet efficient instance segmentation approach based on the single-stage anchor-free detector, termed SAIS. In our approach, the instance segmentation task consists of two parallel subtasks which respectively predict the mask coefficients and the mask prototypes. Then, instance masks are generated by linearly combining the prototypes with the mask coefficients. To enhance the quality of the instance masks, the information from regression and classification is fused to predict the mask coefficients. In addition, a center-aware target is designed to preserve the center coordinate of each instance, which achieves a stable improvement in instance segmentation. Experiments on MS COCO show that SAIS achieves the performance of the existing state-of-the-art single-stage methods with a much less memory footpr





I Introduction

Instance segmentation is one of the general but challenging tasks in computer vision. In general, instance segmentation can be split into two steps: object detection and pixel classification. Current instance segmentation methods are therefore built directly on advances in object detection such as SSD [1], Faster R-CNN [2], and R-FCN [3]. According to the type of detection architecture, instance segmentation methods can be divided into two categories: single-stage instance segmentation and two-stage instance segmentation.

The commonly used two-stage instance segmentation methods prioritize accuracy over speed. Because they use a cascade strategy, these methods are usually time-consuming. In addition, their dependence on feature localization makes them difficult to accelerate. Some recently proposed one-stage instance segmentation methods, e.g. YOLACT [4], partly solve these problems by dividing the instance segmentation task into two parallel subtasks: prototype mask generation and per-instance mask coefficient prediction. This is an effective way to speed up existing two-stage methods like Mask R-CNN [5]. However, in order to represent instances of different shapes in an image, all of the above methods require many anchors and a large amount of memory.

To handle this issue, we propose an instance segmentation method based on the one-stage anchor-free detection framework. It is inspired by efficient anchor-free detection methods such as FCOS [6] and CenterNet [7, 8], which obtain a reasonable trade-off between speed and performance by eliminating the predefined set of anchor boxes. Based on FCOS, the proposed instance segmentation task is divided into two subtasks, similar to YOLACT. As shown in Fig. 1 (yellow box), one subtask, which predicts mask coefficients, is assembled into each head of the detector by combining the classification and regression branches. Because the mechanism is anchor-free, only one group of mask coefficients needs to be predicted for each sample, which reduces the total training memory footprint. The other subtask, which generates the prototype masks, is implemented directly as an FCN (green box). All of these tasks run in parallel on a single-stage architecture to speed up the training phase. In addition, to enhance the performance without any additional hyperparameters, we propose a center-aware ground truth scheme, which effectively preserves the center of each instance during training and achieves a stable improvement.

Our contributions can be summarized as follows: (1) We propose an instance segmentation method based on an anchor-free mechanism, which has clear advantages in speed and memory usage. (2) We propose a center-aware ground truth scheme, which effectively improves the performance of our framework on both detection and instance segmentation tasks.

II Related work

Fig. 1: The network architecture of our proposed method. The figure labels the feature maps of the backbone network, the feature levels used for the final prediction, and the height and width of each feature map. In the ProtoNet, arrows indicate layers, with the final layer differing from the others. Different lines indicate the down-sampling ratio of each feature level relative to the input image.

Two-Stage Instance Segmentation. Instance segmentation can be solved by bounding box detection followed by semantic segmentation within each box, an approach adopted by most existing two-stage methods. Based on Faster R-CNN [2], Mask R-CNN [5] simply adds a mask branch to predict the mask of each instance. Mask Scoring R-CNN [9] re-scores the mask confidence, otherwise taken from the classification score, by adding a mask-IoU branch that makes the network predict the IoU between the mask and the ground truth. FCIS [11] predicts a set of position-sensitive output channels which simultaneously address object classes, boxes, and masks. The above state-of-the-art methods achieve satisfactory performance but are time-consuming.

Single-Stage Instance Segmentation. SPRNet [10] has an encoder-decoder structure in which the classification, regression, and mask branches are processed in parallel. It generates each instance mask from a single pixel and then resizes the mask to fit the corresponding box to get the final instance-level prediction. In the decoding part, each pixel is used as an instance carrier to generate the instance mask, to which consecutive deconvolutions are applied to get the final predictions. YOLACT [4] divides instance segmentation into two parallel subtasks: prototype mask generation and per-instance mask coefficient prediction. The generated prototypes are then combined linearly using the corresponding predicted coefficients and cropped with the predicted bounding box. TensorMask [13] investigates the paradigm of dense sliding-window instance segmentation by using structured 4D tensors to represent masks over a spatial domain. All of the above methods use an anchor-based detection backbone, which requires a large memory footprint in the training phase. PolarMask [14] formulates the instance segmentation problem as instance center classification and dense distance regression in polar coordinates. ExtremeNet [15] uses keypoint detection to predict 8 extreme points of an instance and generates an octagon mask, which achieves a relatively reasonable object mask prediction. It is an anchor-free method, but the octagon mask encoding might not depict the mask precisely. We propose a novel instance segmentation method that combines the single-stage anchor-free framework with a robust mask encoding method.

III Method

In this section, the proposed method is introduced in detail. The pipeline is shown in Fig. 1. In Section III-A, we explore the application of the anchor-free mechanism to the instance segmentation task. In Section III-B, we propose a center-aware ground truth to improve the performance.

III-A Single-stage anchor-free instance segmentation

YOLACT [4] is a real-time instance segmentation method in which instance segmentation can be divided into two parallel subtasks: 1) mask coefficients prediction and 2) prototypes prediction. In this paper, we follow this parallel mechanism to accelerate the model.

Anchor-free for mask coefficients

Instance segmentation depends strongly on the accuracy of the bounding box. To obtain a high-quality bounding box for each instance, the proposed SAIS is based on FCOS [6], a one-stage anchor-free architecture that achieves state-of-the-art performance on object detection tasks. As shown in Fig. 1, each head has two branches: one regresses the 4 bounding-box values, and the other predicts 1 center score and the class confidences. Different from FCOS [6], a new output layer is added to each head to predict the mask coefficients for each sample (Fig. 1, yellow box). To extract rich semantic information, we first fuse the two branches (the classification branch and the regression branch) and then apply a convolutional layer whose number of channels equals the number of mask coefficients per sample. In the proposed method, each sample has only one set of outputs, which gives fewer network output variables than the commonly used anchor-based methods, where one set of outputs is predicted for every anchor box at each sample.
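The saving in output variables can be made concrete with a small count. The sketch below is only illustrative: the function name is hypothetical, the 80 classes correspond to the COCO categories, the 32 coefficients follow the protonet channel setting reported in the training details, and the value of 3 anchors per location is an assumed example for an anchor-based head, not a figure taken from this paper.

```python
def outputs_per_location(num_classes, num_coeffs, num_anchors=1, with_center=True):
    """Count the values predicted at one feature-map location.

    Each predicted box carries 4 regression values, the class confidences,
    the mask coefficients and, for the anchor-free head, 1 center score.
    """
    per_box = 4 + num_classes + num_coeffs + (1 if with_center else 0)
    return num_anchors * per_box


# Anchor-free head: one set of outputs per location.
anchor_free = outputs_per_location(num_classes=80, num_coeffs=32)

# Hypothetical anchor-based head with 3 anchors per location (assumption).
anchor_based = outputs_per_location(80, 32, num_anchors=3, with_center=False)

print(anchor_free, anchor_based)
```

Under these assumptions the anchor-free head predicts roughly a third of the values per location, which is the source of the smaller training memory footprint argued for above.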

Mask prediction

The prototype generation branch (protonet) predicts a set of prototype masks for the entire image. The protonet is implemented as an FCN whose last layer has the same number of channels as the mask coefficient prediction layer. The final instance masks are generated by combining the mask prototypes and the mask coefficients. For each sample, the mask coefficients are produced by the heads of the FPN, while the mask prototypes are generated by the protonet and shared by all samples. As shown in Fig. 1 (blue box), the final instance mask of a sample is obtained by a single matrix multiplication followed by a sigmoid:

$$M = \sigma\left(P C^{T}\right)$$

where $P$ is the matrix of prototype masks and $C$ is the matrix of mask coefficients. The single-stage architecture is composed of fully convolutional layers, and all subtasks are executed in parallel, which achieves a high speed.
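The mask assembly step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name, the array layout (height, width, channels), and the toy sizes are assumptions, and cropping with the predicted box is omitted.

```python
import numpy as np


def assemble_masks(prototypes, coeffs):
    """Linearly combine prototype masks with per-sample coefficients.

    prototypes: array of shape (H, W, K), shared by all samples.
    coeffs:     array of shape (N, K), one row per sample.
    Returns masks of shape (H, W, N) with values in [0, 1].
    """
    h, w, k = prototypes.shape
    # One matrix multiplication: (H*W, K) x (K, N) -> (H*W, N).
    logits = prototypes.reshape(h * w, k) @ coeffs.T
    masks = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return masks.reshape(h, w, -1)


rng = np.random.default_rng(0)
protos = rng.standard_normal((8, 8, 32))  # 32 prototypes, toy resolution
coeffs = rng.standard_normal((3, 32))     # 3 samples
masks = assemble_masks(protos, coeffs)
print(masks.shape)
```

Because the prototypes are computed once per image, the per-sample cost is a single row of the matrix product, which is why no RoI re-pooling is needed.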

III-B Center-aware ground truth

Fig. 2: The difference between the center-aware ground truth and the area-aware ground truth. In both ground truths, each location carries 4 properties (class, center score, bounding box, and instance mask). Different colors represent labels from different objects, and black denotes negative samples.

The labels of all tasks are selected from the ground-truth map. If a location falls into multiple bounding boxes, it is considered an ambiguous sample. To give enough consideration to small objects, a simple approach is to choose the bounding box with the minimal area as the regression label, as shown in Fig. 2 (top-right). One major issue is that the center of a large object may be covered by a small object if the centers of the two objects are close enough. This produces incorrect labels in the region where the real center of the large object is covered by the small object. As shown in Fig. 2 (red circle, top-right), the area in the red circle is the center of object 1, but the labels of object 2 are selected as its ground truth.

To address this issue, we propose a new center-aware method to select reasonable labels. In our approach, the center distribution of each object is used as prior information, which ensures that the real center of each object is preserved during training. We then choose the bounding box with the minimal area as the regression label. Our method can be formally described as follows:


where the raw image contains a number of instances, each with the area of its bounding box. All instances are sorted by area from small to large, which ensures that small objects are considered first. We calculate the center distribution of each object by Equation (3) from the distances between the location and the four sides of the bounding box, as shown in Fig. 2 (bottom-left). Finally, as shown in Fig. 2 (bottom-right), we choose the object with the largest center score as the ground truth for each location. The proposed method therefore considers the area and the center distribution simultaneously to achieve better performance.
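One plausible reading of this assignment rule is sketched below. Several choices are assumptions and not taken from the paper: the center score is computed with the FCOS-style centerness formula, locations are pixel centers, ties are resolved in favor of the smaller box through the small-to-large ordering, and the function names are hypothetical.

```python
import numpy as np


def centerness(xs, ys, box):
    """FCOS-style center score of points w.r.t. box (x0, y0, x1, y1); 0 outside."""
    x0, y0, x1, y1 = box
    l, t, r, b = xs - x0, ys - y0, x1 - xs, y1 - ys
    inside = (l > 0) & (t > 0) & (r > 0) & (b > 0)
    lr = np.clip(np.minimum(l, r), 0, None) / np.maximum(np.maximum(l, r), 1e-6)
    tb = np.clip(np.minimum(t, b), 0, None) / np.maximum(np.maximum(t, b), 1e-6)
    return np.where(inside, np.sqrt(lr * tb), 0.0)


def center_aware_labels(boxes, height, width):
    """Return, per location, the index of the assigned instance (-1 = negative)."""
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(areas, kind="stable")      # small objects first
    ys, xs = np.mgrid[0:height, 0:width] + 0.5    # pixel centers
    scores = np.stack([centerness(xs, ys, boxes[i]) for i in order])
    # argmax keeps the first maximum, so ties go to the smaller box.
    labels = order[scores.argmax(axis=0)]
    labels[scores.max(axis=0) == 0] = -1          # outside every box
    return labels


# A small box (index 1) overlapping the center of a large box (index 0).
boxes = [(0, 0, 16, 16), (6, 6, 12, 12)]
labels = center_aware_labels(boxes, height=20, width=20)
```

In this toy case the location at the center of the large box keeps the label of the large box, whereas a purely area-based rule would hand it to the small box.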

| Input | Size | AP (bbox) | AP (mask) |
|---|---|---|---|
| keeping aspect ratio | (1333, 800) | 36.7 | 21.2 |
| fixed size | (768, 768) | 35.9 | 28.2 |

TABLE I: Comparing the performance of inputs of different types.
| Method | w/ c | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| FCOS [6] | no | 36.7 | 55.5 | 39.3 | 21.9 | 40.5 | 48.0 |
| FCOS [6] | yes | 36.9 | 55.7 | 39.5 | 21.5 | 40.9 | 47.3 |
| SAIS | no | 28.2 | 47.8 | 29.0 | 9.3 | 30.6 | 44.9 |
| SAIS | yes | 28.5 | 48.4 | 29.3 | 10.2 | 30.7 | 45.1 |

TABLE II: Comparing the results on object detection and instance segmentation. 'w/ c' indicates whether the center-aware target is used. For object detection, the evaluated annotation type is bbox. For instance segmentation, the evaluated annotation type is mask. All methods use ResNet-50-FPN as the backbone.
| Fusion | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| w/o | 28.5 | 48.4 | 29.3 | 10.2 | 30.7 | 45.1 |
| w/ | 28.7 | 49.2 | 29.4 | 10.4 | 32.2 | 44.5 |

TABLE III: Feature fusion for mask coefficient prediction. Comparison of the performance with (w/) and without (w/o) summing the classification branch and the regression branch. Without summation, the mask coefficients are predicted from the classification branch only.
| Method | Backbone | Epochs | Aug | AP | AP50 | AP75 | APS | APM | APL | FPS | Mem (GB) | GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FCIS | ResNet-101-C5 | 12 | no | 29.5 | 51.5 | 30.2 | 8.0 | 31.0 | 49.7 | 6.6 | - | Xp |
| Mask R-CNN | ResNet-101-FPN | 12 | no | 37.5 | 60.2 | 40.0 | 19.8 | 41.2 | 51.4 | 8.6 | 5.7 | Xp |
| ExtremeNet | Hourglass-104 | 100 | yes | 18.9 | 44.5 | 13.7 | 10.4 | 20.4 | 28.3 | 4.1 | - | Xp |
| YOLACT-550 | ResNet-50-FPN | 48 | yes | 28.2 | 46.6 | 29.2 | 9.2 | 29.3 | 44.8 | 42.5 | 3.8 | Xp |
| SAIS-640 | ResNet-50-FPN | 12 | no | 27.6 | 47.2 | 28.3 | 9.4 | 30.5 | 44.0 | 29.2 | 1.8 | X |
| SAIS-768 | ResNet-50-FPN | 12 | no | 28.7 | 49.2 | 29.4 | 10.4 | 32.2 | 44.5 | 26.9 | 2.5 | X |
| YOLACT-700 | ResNet-101-FPN | 48 | yes | 31.2 | 50.6 | 32.8 | 12.1 | 33.3 | 47.1 | 23.6 | - | Xp |
| SAIS-768 | ResNet-101-FPN | 12 | no | 30.7 | 51.6 | 31.7 | 11.3 | 34.3 | 46.8 | 25.4 | 3.6 | X |
| SAIS-768 | ResNeXt-101-FPN | 12 | no | 32.5 | 55.8 | 33.6 | 13.8 | 35.5 | 50.3 | 18.2 | 6.6 | X |

TABLE IV: Comparison with the state of the art. Instance segmentation mask AP on COCO test-dev. The FPS of our model is reported on TITAN X GPUs. Better backbones bring expected gains: deeper networks do better, and ResNeXt improves on ResNet.

IV Experiments

We report results on MS COCO’s instance segmentation task [16] using the standard metrics for the task. We train on train2017 and evaluate on val2017 and test-dev. We implement our method on mmdetection [17].

Training details.

In our experiments, the network is trained using stochastic gradient descent (SGD) for 12 epochs with a mini-batch of 16 images. The initial learning rate and momentum are 0.01 and 0.9, respectively. The learning rate is reduced by a factor of 10 at epochs 8 and 11. The input image is resized to a fixed size (see TABLE I). The number of output channels of the protonet is set to 32. We initialize the backbone networks with weights pretrained on ImageNet.
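The step schedule described above can be written as a short function. This is a sketch under one assumption that the text does not settle: epochs are indexed from 0, so the reduced rate applies from epoch index 8 and again from epoch index 11. The function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Step schedule: scale base_lr by gamma once per milestone already reached."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** passed


# Learning rate used in each of the 12 training epochs.
schedule = [learning_rate(e) for e in range(12)]
print(schedule)
```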


IV-A Ablation study

Fixed Input Size. As shown in TABLE I, we find an interesting phenomenon: fixing the input size achieves a gain of 7% in mask prediction accuracy (28.2 vs. 21.2) compared with keeping the aspect ratio, even though the fixed input is smaller. We argue that inputs with a fixed size make it easier for the model to represent instance-level semantic context.

Center Awareness. To evaluate the effectiveness of the proposed center-aware target, we apply it to two different tasks: object detection and instance segmentation. FCOS [6] is a state-of-the-art object detection method in which the offsets of the bounding box are predicted relative to the center position. The results, shown in TABLE II, reveal that the center-aware target achieves gains of 0.2% and 0.3% in AP on the object detection and instance segmentation tasks, respectively. We argue that it is important for instance segmentation to predict the masks from the center of the object.

Feature Fusion. To achieve competitive performance, we fuse the feature maps from the classification and regression branches to predict the mask coefficients, without additional parameters. The results shown in TABLE III reveal that the fusion of the feature maps brings a performance gain, especially for small instances. This is reasonable, since the bounding box (regression branch) contributes extra information for mask coefficient prediction.
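The fusion by summation can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name and array layout are hypothetical, and the prediction layer is simplified to a 1x1 convolution, which is equivalent to applying the same linear map at every location; the kernel size actually used is not given in the text.

```python
import numpy as np


def predict_coefficients(cls_feat, reg_feat, weight, bias):
    """Predict mask coefficients from the summed branch features.

    cls_feat, reg_feat: arrays of shape (H, W, D) from the two branches.
    weight: (D, K) and bias: (K,) of the assumed 1x1 prediction layer.
    Returns coefficients of shape (H, W, K).
    """
    fused = cls_feat + reg_feat   # element-wise sum adds no parameters
    return fused @ weight + bias  # same linear map at every location


rng = np.random.default_rng(0)
cls_feat = rng.standard_normal((4, 4, 256))
reg_feat = rng.standard_normal((4, 4, 256))
weight = rng.standard_normal((256, 32)) * 0.01
bias = np.zeros(32)
coeffs = predict_coefficients(cls_feat, reg_feat, weight, bias)
print(coeffs.shape)
```

Passing zeros for `reg_feat` reproduces the setting without summation in TABLE III, where the coefficients are predicted from the classification branch only.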

IV-B Comparison with the state-of-the-art

In this part, we compare the performance of the proposed method with various state-of-the-art methods, including both two-stage and single-stage models, on the MS COCO dataset. The outputs of our method are visualized in Fig. 3.

The results show that, without bells and whistles, our proposed method achieves competitive performance in comparison with one-stage methods. With at most a quarter of the training epochs, and without data augmentation or the additional semantic loss [4], SAIS-768 outperforms YOLACT-550 with the same ResNet-50-FPN backbone and ExtremeNet with the Hourglass-104 backbone by 0.5% and 9.8% in AP, respectively. SAIS uses an anchor-free architecture, which gives a smaller training memory footprint than all of the anchor-based methods. SAIS-640 with ResNet-50-FPN also achieves 29.2 FPS on a TITAN X GPU without the Fast NMS [4] and light-weight head [4] that are exploited in YOLACT. In particular, SAIS-768 achieves 25.4 FPS, compared with 23.6 FPS for YOLACT-700 with the same ResNet-101-FPN backbone. This suggests that the anchor-free mechanism is superior to the anchor-based one in terms of speed. Compared with two-stage methods, SAIS-640 achieves higher FPS and a smaller memory footprint in the training phase. In summary, the proposed method, which fuses the anchor-free framework with parallel instance segmentation subtasks, achieves competitive performance in both speed and accuracy.

The qualitative results shown in Fig. 3 reveal that our method generates high-quality masks through the robust mask encoding method, without any re-pooling operation (RoI Pooling/Align [2, 5]) on the original features.

Fig. 3: Qualitative examples on the MS COCO test-dev. For each image, one color corresponds to one instance in that image.

V Conclusion

In this paper, we propose a single-stage anchor-free instance segmentation method in which all tasks are implemented in parallel. To enhance the performance, a center-aware ground truth is designed without any additional parameters. Our framework achieves competitive performance on the MS COCO dataset. In the future, we will focus on lightweight frameworks for instance segmentation, which is a promising direction for industrial applications.