1 Introduction
Instance segmentation is one of the fundamental tasks in computer vision, enabling numerous downstream vision applications. It is challenging because it requires predicting both the location and the semantic mask of each instance in an image. Intuitively, instance segmentation can therefore be solved by bounding box detection followed by semantic segmentation within each box, the paradigm adopted by two-stage methods such as Mask R-CNN [12]. Recent trends in the vision community have spent more effort on designing simpler pipelines of bounding box detectors [14, 18, 25, 26, 28] and subsequent instance-wise recognition tasks including instance segmentation [2, 4, 29], which is also the main focus of our work here. Thus, our aim is to design a conceptually simple mask prediction module that can be easily plugged into many off-the-shelf detectors, enabling instance segmentation.
Instance segmentation is usually solved by binary classification in a spatial layout surrounded by bounding boxes, shown in Figure 1(b). Such pixel-to-pixel correspondence prediction is expensive, especially in single-shot methods. Instead, we point out that masks can be recovered successfully and effectively if the contour is obtained. An intuitive way to locate contours is shown in Figure 1(c), which predicts the Cartesian coordinates of the points composing the contour. Here we term it Cartesian Representation. The second approach is Polar Representation, which uses the angle and the distance as the coordinates to locate points, shown in Figure 1(d).
In this work, we design an instance segmentation method based on the Polar Representation because of its inherent advantages: (1) the origin of the polar coordinate system can be seen as the center of the object; (2) starting from the origin, each point on the contour is determined by its distance and angle; (3) the angle is naturally directional, which makes it very convenient to connect the points into a whole contour. We note that the Cartesian Representation exhibits the first two properties similarly, but it lacks the advantage of the third.
We instantiate such an instance segmentation method using the recent object detector FCOS [25], mainly for its simplicity. Note that it is possible to use other detectors, such as RetinaNet [18] or YOLO [23], with minimal modification to our framework. Specifically, we propose PolarMask, formulating instance segmentation as instance center classification and dense distance regression in a polar coordinate system, shown in Figure 2. The model takes an input image and predicts the distance from a sampled positive location (a candidate for the instance center) to the instance contour at each angle and, after assembling, outputs the final mask. The overall pipeline of PolarMask is almost as simple and clean as that of FCOS, and it introduces negligible computational overhead. Simplicity and efficiency are the two key factors for single-shot instance segmentation, and PolarMask achieves both successfully.
Furthermore, PolarMask can be viewed as a generalization of FCOS; conversely, FCOS is a special case of PolarMask, since a bounding box can be viewed as the simplest mask with only four directions. Thus, we suggest using PolarMask over FCOS for instance recognition wherever mask annotation is available [5, 19].
In order to maximize the advantages of the Polar Representation, we propose Polar Centerness and the Polar IoU Loss to deal with sampling high-quality center examples and with optimizing the dense distance regression, respectively. They improve mask accuracy by about 15% relatively, showing considerable gains under stricter localization metrics. Without bells and whistles, PolarMask achieves 32.9% mask mAP with a single model and single-scale training/testing on the challenging COCO dataset [19].
The main contributions of this work are threefold:

We introduce a new method for instance segmentation, termed PolarMask, that models instance masks in the polar coordinate system, converting instance segmentation into two parallel tasks: instance center classification and dense distance regression. The main desirable characteristic of PolarMask is that it is simple and effective.

We propose the Polar IoU Loss and Polar Centerness, tailored for our framework. We show that the proposed Polar IoU Loss can largely ease the optimization and considerably improve accuracy, compared with standard losses such as the Smooth-L1 loss. In parallel, Polar Centerness improves the original idea of 'centerness' in FCOS, leading to a further performance boost.

For the first time, we demonstrate a much simpler and more flexible instance segmentation framework achieving competitive performance compared with more complex one-stage methods, which typically involve multi-scale training and longer training time. We hope that PolarMask can serve as a fundamental and strong baseline for single-shot instance segmentation.
2 Related Work
Two-Stage Instance Segmentation. Two-stage instance segmentation often formulates the task as a paradigm of 'detect then segment' [16, 12, 20, 15]: first detect bounding boxes, then perform segmentation within each box. The main idea of FCIS [16] is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. Mask R-CNN [12], built upon Faster R-CNN, simply adds an additional mask branch and uses RoIAlign in place of RoIPooling [11] for improved accuracy. Following Mask R-CNN, PANet [20] introduces bottom-up path augmentation, adaptive feature pooling, and fully-connected fusion to boost instance segmentation performance. Mask Scoring R-CNN [15] re-scores the mask confidence from the classification score by adding a mask-IoU branch, which makes the network predict the IoU between the mask and the ground truth.
In summary, the above methods typically consist of two steps: first detecting bounding boxes and then segmenting within each box. They can achieve state-of-the-art performance but are often slow.
One-Stage Instance Segmentation. Deep Watershed Transform [1] uses fully convolutional networks to predict the energy map of the whole image and uses the watershed algorithm to yield connected components corresponding to object instances. InstanceFCN [6] uses instance-sensitive score maps for generating proposals: it first produces a set of instance-sensitive score maps, then an assembling module generates object instances in a sliding window. The recent YOLACT [2] first generates a set of prototype masks, per-instance linear combination coefficients, and bounding boxes, then linearly combines the prototypes using the predicted coefficients and crops with the predicted bounding box. TensorMask [4] investigates the paradigm of dense sliding-window instance segmentation, using structured 4D tensors to represent masks over a spatial domain. ExtremeNet [29] uses keypoint detection to predict 8 extreme points of an instance and generates an octagon mask, achieving relatively reasonable mask predictions. The backbone of ExtremeNet is HourGlass [21], which is very heavy and often needs a longer training time; it also requires several post-processing steps, including grouping. In contrast, our method is simpler than ExtremeNet while achieving much better results.
Note that these methods do not model instances directly, and they can sometimes be hard to optimize (e.g., longer training time, more data augmentation, and extra labels). Our PolarMask directly models instance segmentation in a much simpler and more flexible way with two parallel branches: classifying each pixel of the mass center of an instance and regressing the dense distances of rays between the mass center and the contour. The most significant advantage of PolarMask is being simple and efficient compared with all the above methods. In our experiments, we have not adopted many training tricks, such as data augmentation and longer training time, since our goal is to design a conceptually simple and flexible mask prediction module.
3 Our Method
In this section, we first briefly introduce the overall architecture of the proposed PolarMask. Then, we reformulate instance segmentation with the proposed Polar Representation. Next, we introduce the novel concept of Polar Centerness to ease the procedure of choosing high-quality center samples. Finally, we introduce a new Polar IoU Loss to optimize the dense regression problem.
3.1 Architecture
PolarMask is a simple, unified network composed of a backbone network [13], a feature pyramid network [17], and two or three task-specific heads, depending on whether the bounding box branch is kept.¹ The settings of the backbone and feature pyramid network are the same as in FCOS [25]. While there exist many stronger candidates for those components, we align these settings with FCOS to show the simplicity and effectiveness of our instance modeling method.
¹It is optional to have the box prediction branch or not. As we show empirically, the box prediction branch has little impact on mask prediction.
3.2 Polar Mask Segmentation
In this section, we will describe how to model instances in the polar coordinate in detail.
Polar Representation Given an instance mask, we first sample a candidate center $(x_c, y_c)$ of the instance and the points located on its contour $(x_i, y_i)$, $i = 1, 2, \ldots, N$. Then, starting from the center, $n$ rays are emitted uniformly with the same angle interval $\Delta\theta$ (e.g., $n = 36$, $\Delta\theta = 10^{\circ}$), whose lengths are determined from the center to the contour.
Thus we model the instance mask in the polar coordinate system as one center and $n$ rays. Since the angle interval is predefined, only the length of each ray needs to be predicted. In this way, we formulate instance segmentation as instance center classification and dense distance regression in a polar coordinate system.
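To make this encoding concrete, the following is a minimal NumPy sketch (not the actual training implementation; the function name `polar_targets` and the angle-binning approximation are ours) that converts a binary mask into a mass center and $n$ ray-length targets, keeping the farthest point per angle as described below:

```python
import numpy as np

def polar_targets(mask, n_rays=36, eps=1e-6):
    """Encode a binary mask as (mass center, ray lengths) in polar form.

    Discrete approximation: every foreground pixel is binned by its angle
    w.r.t. the mass center; the ray length of a bin is the maximum distance
    seen in that bin (the "keep the farthest intersection" rule). Empty
    bins fall back to the minimum value eps.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                     # mass center
    dy, dx = ys - cy, xs - cx
    dist = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx) % (2 * np.pi)          # angles in [0, 2pi)
    bins = (angle / (2 * np.pi) * n_rays).astype(int) % n_rays
    rays = np.full(n_rays, eps)
    np.maximum.at(rays, bins, dist)                   # max distance per bin
    return (cx, cy), rays
```

For a star-convex shape this approximation is close to the exact ray casting; for highly non-convex contours it over-estimates some rays, which is exactly the upper-bound limitation discussed below.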
Mass Center There are many choices for the center of an instance, such as the box center or the mass center. Which is better depends on its effect on mask prediction performance. Here we verify the upper bounds of the box center and the mass center, and conclude that the mass center is more advantageous; details are in Figure 7. We explain that the mass center has a greater probability of falling inside the instance, compared with the box center. For some extreme cases, such as a donut, neither the mass center nor the box center lies inside the instance; we leave this for further research.
Center Samples A location is considered a center sample if it falls into the area around the mass center of any instance; otherwise it is a negative sample. We define the region for sampling positive pixels to be 1.5× the stride [25] of the feature map from the mass center to the left, top, right and bottom. Thus each instance has about 9–16 pixels near the mass center as center samples. This has two advantages: (1) Increasing the number of positive samples from 1 to 9–16 can largely avoid the imbalance between positive and negative samples. Nevertheless, focal loss [18] is still needed when training the classification branch. (2) The mass center may not be the best center sample of an instance. More candidate points make it possible to automatically find the best center of one instance. We discuss this in detail in Section 3.3.
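The sampling region above can be sketched as follows (an illustrative helper, assuming feature-map location centers at `(i + 0.5) * stride` in image coordinates; the function name is ours):

```python
import numpy as np

def center_sample_mask(cx, cy, stride, h, w, radius=1.5):
    """Boolean (h, w) mask of feature-map locations that count as
    positive center samples: within `radius` strides of the mass
    center (cx, cy) along both axes."""
    ys = (np.arange(h) + 0.5) * stride   # location centers, image coords
    xs = (np.arange(w) + 0.5) * stride
    gx, gy = np.meshgrid(xs, ys)
    r = radius * stride
    return (np.abs(gx - cx) <= r) & (np.abs(gy - cy) <= r)
```

Depending on where the mass center falls relative to the grid, a 1.5-stride box covers 3 or 4 locations per axis, i.e. 9–16 positives per instance, matching the count stated above.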
Distance Regression Given a center point $(x_c, y_c)$ and the intersection points located on the contour $(x_i, y_i)$, $i = 1, 2, \ldots, N$, the angle and the distance between the center point and each contour point can be computed easily, from which the required $n$ rays can be picked in most cases. However, there are some corner cases:

If one ray has multiple intersection points with the contour of an instance, we directly choose the one with the maximum length.

If one ray, starting from a center located outside the mask, has no intersection point with the contour of an instance at a certain angle, we set its regression target to a minimum value $\epsilon$ (e.g., $\epsilon = 10^{-6}$).

If the intersection point between a ray and the contour happens to lie at sub-pixel precision (i.e., its pixel coordinates are not integers), we can always use an interpolation method, such as linear interpolation, to estimate its regression target.
We argue that these corner cases are the main obstacle restricting the upper bound of the Polar Representation from reaching 100% AP. However, this should not be taken to mean that the Polar Representation is inferior to the non-parametric Pixel-wise Representation. The evidence is twofold. First, even the Pixel-wise Representation is far from the upper bound of 100% AP in practice, since some operations, such as down-sampling, are indispensable. Second, current performance is far from the upper bound regardless of whether the Pixel-wise or the Polar Representation is used. Therefore, research effort is better spent on improving the practical performance of models rather than the theoretical upper bound.
The training of the regression branch is non-trivial. First, the mask branch in PolarMask is actually a dense distance regression task, since every training example has $n$ rays (e.g., $n = 36$). This may cause an imbalance between the regression loss and the classification loss. Second, for one instance, its $n$ rays are relevant and should be trained as a whole, rather than being seen as a set of independent regression examples. Therefore, we put forward the Polar IoU Loss, discussed in detail in Section 3.4.
Mask Assembling During inference, the network outputs the classification scores and the centerness; we multiply the centerness with the classification scores to obtain the final confidence scores. We only assemble masks from at most 1k top-scoring predictions per FPN level, after thresholding the confidence scores at 0.05. The top predictions from all levels are merged and non-maximum suppression (NMS) with a threshold of 0.5 is applied to yield the final results. Here we introduce the mask assembling process and a fast NMS process.
Given a center sample $(x_c, y_c)$ and the ray lengths $d_1, d_2, \ldots, d_n$, we can calculate the position of each corresponding contour point with the following formulas:
$$x_i = \cos\theta_i \times d_i + x_c \quad (1)$$
$$y_i = \sin\theta_i \times d_i + y_c \quad (2)$$
Starting from $0^{\circ}$, the contour points are connected one by one, as shown in Figure 3, finally assembling a whole contour as well as the mask.
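Eq. (1)–(2) can be vectorized directly; the following sketch (function name ours) decodes a center and its predicted ray lengths back into an ordered contour:

```python
import numpy as np

def assemble_contour(cx, cy, rays):
    """Decode (center, ray lengths) into contour points via Eq. (1)-(2).

    Rays are assumed uniformly spaced starting from 0 degrees, so the
    returned points are already in connection order around the contour.
    """
    rays = np.asarray(rays, dtype=float)
    n = len(rays)
    theta = np.arange(n) * 2 * np.pi / n
    x = np.cos(theta) * rays + cx
    y = np.sin(theta) * rays + cy
    return np.stack([x, y], axis=1)      # (n, 2) contour points
```

Because the angles are predefined and sorted, no extra ordering step is needed before filling the polygon into a mask.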
We apply NMS to remove redundant masks. To speed up the process, we calculate the smallest bounding box of each mask and then apply NMS based on the IoU of these boxes. We verify that this simplified post-processing does not negatively affect the final mask performance.
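This fast NMS can be sketched as follows (an illustrative implementation, not the released code; the function name is ours): each contour is reduced to its smallest axis-aligned box, and standard greedy box NMS is applied.

```python
import numpy as np

def mask_nms_boxes(contours, scores, iou_thr=0.5):
    """Greedy NMS over the smallest axis-aligned boxes of each contour.

    contours: list of (n, 2) arrays of (x, y) contour points.
    Returns the indices of the kept masks, highest score first.
    """
    boxes = np.array([[c[:, 0].min(), c[:, 1].min(),
                       c[:, 0].max(), c[:, 1].max()] for c in contours])
    order = np.argsort(-np.asarray(scores))
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # intersection of box i with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-6)
        order = rest[iou <= iou_thr]
    return keep
```

Box IoU is only a proxy for mask IoU, but since the boxes are derived from the assembled contours it is tight enough for suppression in practice.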
3.3 Polar Centerness
Centerness [25] was introduced to suppress low-quality detections without introducing any hyper-parameters, and it has proven effective in object bounding box detection. However, directly transferring it to our system can be sub-optimal, since its centerness is designed for bounding boxes while we care about mask prediction.
Given the set of ray lengths $\{d_1, d_2, \ldots, d_n\}$ of one instance, where $\max(\{d_1, \ldots, d_n\})$ and $\min(\{d_1, \ldots, d_n\})$ are the maximum and minimum of the set, we propose Polar Centerness:
$$\text{Polar Centerness} = \sqrt{\frac{\min(\{d_1, \ldots, d_n\})}{\max(\{d_1, \ldots, d_n\})}} \quad (3)$$
Specifically, we add a single-layer branch, in parallel with the classification branch, to predict the Polar Centerness of a location, as shown in Figure 2. It is a simple yet effective strategy to re-weight the points: the closer $\min(\{d_1, \ldots, d_n\})$ and $\max(\{d_1, \ldots, d_n\})$ are, the higher the weight a point is assigned. Experiments show that Polar Centerness improves accuracy especially under stricter localization metrics, such as $\text{AP}_{75}$.
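Eq. (3) reduces to a one-liner at target-generation time (function name ours):

```python
import numpy as np

def polar_centerness(rays):
    """Eq. (3): sqrt of min/max ray length. Equals 1.0 only when all
    rays have the same length (a perfectly centered, circular contour);
    off-center samples of elongated objects get down-weighted."""
    rays = np.asarray(rays, dtype=float)
    return float(np.sqrt(rays.min() / rays.max()))
```

The square root softens the penalty so that moderately eccentric samples are not suppressed too aggressively.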
3.4 Polar IoU Loss
As discussed above, polar segmentation converts instance segmentation into a set of regression problems. In the field of object detection and segmentation, the Smooth-L1 loss [9] and the IoU loss [27] are two effective ways to supervise regression problems. However, the Smooth-L1 loss overlooks the correlation between samples of the same object, resulting in less accurate localization. The IoU loss, in contrast, considers the optimization as a whole and directly optimizes the metric of interest. Nevertheless, computing the IoU between the predicted mask and its ground truth is tricky and very difficult to parallelize. In this work, we derive an easy and effective algorithm to compute mask IoU based on the polar vector representation and achieve competitive performance, as shown in Figure 5.
We introduce the Polar IoU Loss starting from the definition of IoU, which is the ratio of the intersection area over the union area between the predicted mask and the ground truth. In the polar coordinate system, the mask IoU of an instance is calculated as follows:
$$\text{IoU} = \frac{\int_0^{2\pi} \frac{1}{2}\min(d, d^*)^2 \, d\theta}{\int_0^{2\pi} \frac{1}{2}\max(d, d^*)^2 \, d\theta} \quad (4)$$
where the regression target $d$ and the prediction $d^*$ are ray lengths at angle $\theta$. Then we transform it to the discrete form:²
$$\text{IoU} = \lim_{N \to \infty} \frac{\sum_{i=1}^{N} \frac{1}{2} (d_{\min}^{i})^2 \, \Delta\theta_i}{\sum_{i=1}^{N} \frac{1}{2} (d_{\max}^{i})^2 \, \Delta\theta_i} \quad (6)$$
When $N$ approaches infinity, the discrete form equals the continuous form. We assume that the rays are uniformly emitted, so $\Delta\theta_i = \frac{2\pi}{N}$, which further simplifies the expression. We empirically observe that discarding the power form has little impact on performance, so it is simplified into the following form:
$$\text{Polar IoU} = \frac{\sum_{i=1}^{n} d_{\min}^{i}}{\sum_{i=1}^{n} d_{\max}^{i}} \quad (7)$$
The Polar IoU Loss is the binary cross entropy (BCE) loss of the Polar IoU. Since the optimal IoU is always 1, the loss is simply the negative logarithm of the Polar IoU:
$$\text{Polar IoU Loss} = \log \frac{\sum_{i=1}^{n} d_{\max}^{i}}{\sum_{i=1}^{n} d_{\min}^{i}} \quad (8)$$
²For notation convenience, we define $d_{\min}^{i} = \min(d_i, d_i^*)$ and $d_{\max}^{i} = \max(d_i, d_i^*)$.
Our proposed Polar IoU Loss exhibits three advantageous properties: (1) It is differentiable, enabling back-propagation, and is very easy to parallelize, facilitating fast training. (2) It predicts the regression targets as a whole, improving overall performance by a large margin compared with the Smooth-L1 loss, as shown in our experiments. (3) As a bonus, the Polar IoU Loss automatically keeps the balance between the classification loss and the regression loss of dense distance prediction. We discuss this in detail in our experiments.
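Eq. (7)–(8) amount to two element-wise operations and two sums, which is why the loss parallelizes trivially. A minimal NumPy sketch (the trainable version would use an autodiff framework; the function name and the `eps` stabilizer are ours):

```python
import numpy as np

def polar_iou_loss(pred, target, eps=1e-6):
    """Polar IoU (Eq. 7) and its negative-log loss (Eq. 8).

    With the optimal IoU fixed at 1, the BCE loss collapses to
    -log(IoU). eps guards against division by zero and log(0).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    d_min = np.minimum(pred, target).sum()   # sum of d_min^i
    d_max = np.maximum(pred, target).sum()   # sum of d_max^i
    iou = d_min / (d_max + eps)
    return -np.log(iou + eps)
```

Note that the loss is scale-free in the number of rays: scaling all rays of both prediction and target leaves the IoU unchanged, which is one reason it needs no manual balancing against the classification loss.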
4 Experiments
We present results of instance segmentation on the challenging COCO benchmark [19]. Following common practice [12, 4], we train using the union of the 80K train images and a 35K subset of val images (trainval35k), and report ablations on the remaining 5K val images (minival). We also compare results on test-dev. We adopt the 1× training strategy [10, 3], with single-scale training and testing at an image short edge of 800, unless otherwise noted.
Training Details In the ablation study, ResNet-50-FPN [13, 17] is used as our backbone network and the same hyper-parameters as FCOS [25] are used. Specifically, our network is trained with stochastic gradient descent (SGD) for 90K iterations with an initial learning rate of 0.01 and a mini-batch of 16 images. The learning rate is reduced by a factor of 10 at iterations 60K and 80K. Weight decay and momentum are set to 0.0001 and 0.9, respectively. We initialize our backbone network with weights pre-trained on ImageNet [8]. The input images are resized to have their shorter side be 800 and their longer side less than or equal to 1333.
4.1 Ablation Study
Verification of Upper Bound The first concern about PolarMask is that it might not depict the mask precisely. In this section we show that this concern may be unnecessary. We verify the upper bound of PolarMask as the IoU between the predicted mask and the ground truth when all rays regress to distances equal to the ground truth. The verification results for different numbers of rays are shown in Figure 7. The IoU approaches nearly perfect (above 90%) as the number of rays increases, which shows that Polar Segmentation is able to model the mask very well. Therefore, the concern about the upper bound of PolarMask is unnecessary. Also, it is more reasonable to use the mass center than the bounding box center as the center of an instance, because the bounding box center is more likely to fall outside the instance.
Number of Rays The number of rays plays a fundamental role in the whole PolarMask system. As shown in Table 1(a) and Figure 7, more rays yield a higher upper bound and better AP. For example, 36 rays improve AP by 1.5% compared to 18 rays. However, too many rays (e.g., 72) saturate the performance, since 36 rays already depict the mask contours well and the number of rays is no longer the main factor constraining performance.
Polar IoU Loss vs. Smooth-L1 Loss We test both the Polar IoU Loss and the Smooth-L1 Loss in our architecture. We note that the regression loss of the Smooth-L1 Loss is significantly larger than the classification loss, since our architecture is a dense distance prediction task. To cope with the imbalance, we apply different weighting factors to the regression loss in the Smooth-L1 setting. The experimental results are shown in Table 1(b). Our Polar IoU Loss achieves 27.7% AP without balancing the regression and classification losses. In contrast, the best setting for the Smooth-L1 Loss achieves 25.1% AP, a gap of 2.6% AP, showing that the Polar IoU Loss is more effective than the Smooth-L1 Loss for training the regression of distances between the mass center and the contour.
We hypothesize that the gap comes from two factors. First, the Smooth-L1 Loss may need more hyper-parameter search to achieve better performance, which can be time-consuming compared to the Polar IoU Loss. Second, the Polar IoU Loss predicts all rays of one instance as a whole, which is superior to the Smooth-L1 Loss.
In Figure 6 we compare some results using the Smooth-L1 Loss and the Polar IoU Loss. The Smooth-L1 Loss exhibits systematic artifacts, suggesting that it lacks supervision at the level of the whole object. PolarMask shows smoother and more precise contours.
Polar Centerness vs. Cartesian Centerness The comparison experiments are shown in Table 1(c). Polar Centerness improves by 1.4% AP overall. In particular, $\text{AP}_{75}$ and $\text{AP}_{L}$ are raised considerably, by 2.3% and 2.6%, respectively.
We explain this as follows. On the one hand, low-quality masks have a more negative effect at high IoU thresholds. On the other hand, large instances are more likely to have a large difference between the maximum and minimum ray lengths, which is exactly the problem Polar Centerness is committed to solving.
Box Branch Most previous instance segmentation methods require a bounding box to locate the object area and then segment the pixels inside it. In contrast, PolarMask is capable of directly outputting the mask without a bounding box.
In this section, we test whether an additional bounding box can help improve the mask AP as follows: if a ray reaches outside the bounding box, it is cut off at the boundary. From Table 1(d), we can see that the bounding box makes little difference to mask prediction performance. Thus, we do not include the bounding box prediction head in PolarMask, for simplicity and faster speed.
Backbone Architecture Table 1(e) shows the results of PolarMask with different backbones. As expected, better features extracted by deeper and more advanced networks improve performance.
Speed vs. Accuracy Larger image sizes yield higher accuracy at slower inference speeds. Table 1(f) shows the speed/accuracy trade-off for different input image scales, defined by the shorter image side. The FPS is reported on a (now outdated) TITAN X GPU.
method | backbone | epochs | aug | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L
two-stage:
MNC [7] | ResNet-101-C4 | 12 | | 24.6 | 44.3 | 24.8 | 4.7 | 25.9 | 43.6
FCIS [16] | ResNet-101-C5-dilated | 12 | | 29.2 | 49.5 | – | 7.1 | 31.3 | 50.0
Mask R-CNN [12] | ResNeXt-101-FPN | 12 | | 37.1 | 60.0 | 39.4 | 16.9 | 39.9 | 53.5
one-stage:
ExtremeNet [29] | Hourglass-104 | 100 | ✓ | 18.9 | 44.5 | 13.7 | 10.4 | 20.4 | 28.3
TensorMask [4] | ResNet-101-FPN | 72 | ✓ | 37.1 | 59.3 | 39.4 | 17.1 | 39.1 | 51.6
YOLACT [2] | ResNet-101-FPN | 48 | ✓ | 31.2 | 50.6 | 32.8 | 12.1 | 33.3 | 47.1
PolarMask | ResNet-101-FPN | 12 | | 30.4 | 51.9 | 31.0 | 13.4 | 32.4 | 42.8
PolarMask | ResNeXt-101-FPN | 12 | | 32.9 | 55.4 | 33.8 | 15.5 | 35.1 | 46.3
4.2 Comparison to State-of-the-art
We evaluate PolarMask on the COCO dataset and compare results to state-of-the-art methods, including both one-stage and two-stage models, in Table 2. PolarMask outputs are visualized in Figure 8.
Without any bells and whistles, PolarMask achieves competitive performance with more complex one-stage methods. Since our aim is to design a conceptually simple and flexible mask prediction module, many improvement methods [24, 22], such as multi-scale training and longer training schedules, are beyond the scope of this work. We argue that the gap between YOLACT [2] and PolarMask comes from more training epochs and data augmentation; if these were applied to PolarMask, its performance could be readily improved. Likewise, the gap between TensorMask [4] and PolarMask arises from the tensor bipyramid and aligned representation; considering these components are time- and memory-consuming, we do not plug them into PolarMask.
5 Conclusion
PolarMask is a single-shot, anchor-free instance segmentation method with two parallel branches: classifying the mass center of each instance and regressing the dense lengths of rays between sampled locations around the mass center and the contour. Different from previous works that typically solve mask prediction as binary classification in a spatial layout, PolarMask puts forward the polar representation and transforms mask prediction into dense distance regression. PolarMask is designed to be almost as simple and clean as single-shot object detectors, introducing negligible computing overhead. We hope that the proposed PolarMask framework can serve as a fundamental and strong baseline for the single-shot instance segmentation task.
References

[1] (2017) Deep watershed transform for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5221–5229.
[2] (2019) YOLACT: real-time instance segmentation. arXiv preprint.
[3] (2019) MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
[4] (2019) TensorMask: a foundation for dense object segmentation. arXiv preprint arXiv:1903.12174.
[5] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
[6] (2016) Instance-sensitive fully convolutional networks. In European Conference on Computer Vision, pp. 534–549.
[7] (2016) Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3150–3158.
[8] (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[9] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
[10] (2018) Detectron. https://github.com/facebookresearch/detectron
[11] (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
[12] (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
[13] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] (2015) DenseBox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874.
[15] (2019) Mask Scoring R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6409–6418.
[16] (2017) Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2359–2367.
[17] (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[18] (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV).
[19] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[20] (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768.
[21] (2016) Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pp. 483–499.
[22] (2018) MegDet: a large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6181–6189.
[23] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[24] (2018) An analysis of scale invariance in object detection – SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3578–3587.
[25] (2019) FCOS: fully convolutional one-stage object detection. In Proceedings of the International Conference on Computer Vision (ICCV).
[26] (2019) RepPoints: point set representation for object detection. arXiv preprint arXiv:1904.11490.
[27] (2016) UnitBox: an advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, pp. 516–520.
[28] (2019) Objects as points. arXiv preprint arXiv:1904.07850.
[29] (2019) Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 850–859.