Contour Loss for Instance Segmentation via k-step Distance Transformation Image

02/22/2021 ∙ by Xiaolong Guo, et al. ∙ Beijing University of Chemical Technology

Instance segmentation aims to locate targets in an image and segment each target area at the pixel level, and it is one of the most important tasks in computer vision. Mask R-CNN is a classic instance segmentation method, but we find that its predicted masks are unclear and inaccurate near contours. To cope with this problem, we draw on the idea of contour matching based on the distance transformation image and propose a novel loss function, called contour loss. Contour loss is designed to specifically optimize the contour parts of the predicted masks and can thus yield more accurate instance segmentation. So that the proposed contour loss can be jointly trained under modern neural network frameworks, we design a differentiable k-step distance transformation image calculation module, which approximately computes truncated distance transformation images of the predicted mask and the corresponding ground-truth mask online. The proposed contour loss can be integrated into existing instance segmentation methods such as Mask R-CNN and combined with their original loss functions without modifying the inference network structures, and thus has strong versatility. Experimental results on COCO show that contour loss is effective and can further improve instance segmentation performance.




I Introduction

Instance segmentation is a basic problem in computer vision. Its main task is to acquire the location and the pixel-wise semantic information of each instance. Benefiting from the tremendous development of deep learning [1] in object detection and semantic segmentation, instance segmentation based on deep learning has progressed rapidly over a short period of time. However, due to the diversity of objects and the overlap between them, instance segmentation is still a challenging problem.

Taking classic instance segmentation method Mask R-CNN [2] as an example, although it can predict a general mask, contour of the mask (called predicted contour later) is neither clear nor accurate. As far as we know, it could be a fatal problem in some applications. For example, in vision-based robot grabbing, clear and accurate contour is essential to the quality of grab detection. Our goal in this work is to make the predicted mask and its ground-truth mask not only consistent on the whole, but as consistent as possible near the contour.

Recently, some studies suggest that introducing additional supervisory signals beyond RGB data may provide complementary clues that improve the performance of instance segmentation [3, 4, 5]. Those works respectively introduce depth, shape and key points as auxiliary information to perform instance segmentation efficiently. Although those methods achieve good performance from different perspectives, they usually require complicated pipelines or a large number of training parameters. Moreover, the segmentation results near the contour are still not ideal.

To improve instance segmentation accuracy near the contour, this paper borrows the classic idea of image distance transformation [6] and brings the distance transformation image (DTI) into instance segmentation to provide contour supervisory signals. Our method makes two main contributions. First, we propose a novel loss function named contour loss, based on DTI, to specifically optimize the contour part. In particular, we first calculate the predicted k-step DTI and the ground-truth k-step DTI for the predicted mask and the ground-truth mask, respectively. Then, we accumulate the coverage values of one contour image onto the k-step DTI of the other contour image, and the average of the two normalized coverage values is regarded as the difference measure between the predicted contour and the ground-truth contour. We define the contour difference calculated in this manner as contour loss, which can be integrated into existing instance segmentation algorithms such as Mask R-CNN without modifying their neural network structures. Second, we design a differentiable k-step DTI calculation module, which approximately computes truncated DTIs of the predicted mask and the ground-truth mask online. The proposed module can be jointly trained in modern neural network frameworks without adding any training parameters. To the best of our knowledge, this is the first analytic DTI module adapted to current neural network frameworks. Experiments on COCO [7] show that the proposed contour loss produces more accurate and clearer masks and can further improve instance segmentation performance.

The rest of this paper is organized as follows: Section II introduces related works. Section III details the proposed contour loss. Section IV illustrates experiments to verify contour loss. Section V draws main conclusions.

II Related Works

This section firstly introduces mainstream instance segmentation algorithms to date, and then summarizes separately those combining edge or boundary information where the differences between our method and theirs are analyzed.

II-A Mainstream Instance Segmentation Methods

According to the processing pipeline, current mainstream instance segmentation algorithms can be categorized into segmentation-based methods and detection-based methods.

Segmentation-based methods firstly perform semantic segmentation in the image and then produce instance masks based on semantic information combination. FCN [8] has achieved remarkable success in the field of semantic segmentation, and numerous researchers have tried to apply it to instance segmentation. Dai et al. proposed Instance-FCN [9]. This method firstly generated a set of instance sensitive score maps which were used to predict semantic information of different relative positions of the same instance, and then got object masks through assembly accordingly. Li et al. proposed FCIS [10]. The authors used the feature representation of inside/outside position sensitive score maps to solve the problem that the same pixel may have different semantics in different regions of interest, and determined object category while generating object’s mask. Pham et al. came up with Biseg [11]. They used semantic segmentation fractional graph and Li et al.’s inside/outside position sensitive score maps as prior information, regarded instance masks as posterior information, and deduced object masks from prior information using Bayesian model. Wang et al. proposed SOLO [12]. This algorithm divided the input image into S×S grids, used FPN [13] to distinguish objects of different scales, and tried to directly segment masks from the image.

Fig. 1: The overall sketch of the proposed contour loss.

Detection-based methods firstly rely on object detector to locate targets in the image, and then perform pixel-level classification within each target area. He et al. proposed Mask R-CNN in 2017 [2], which took full advantage of object detector to achieve high instance segmentation accuracy. Since then, detection-based instance segmentation methods represented by Mask R-CNN have gradually become the mainstream. The essence of Mask R-CNN is to add a feature alignment module and a mask branch to Faster R-CNN [14]. Inspired by previous works, Fu et al. developed RetinaMask [15] which was a real-time single-stage instance segmentation algorithm. PA-Net [16] established information flow between low-level features and high-level features, which further improved the instance segmentation precision. Mask Scoring R-CNN [17] depended on a Mask IoU branch to handle the problem of mismatch between mask quality and mask score. HTC [18] made full use of reciprocal relationship between detection task and segmentation task to integrate and learn complementary features of each stage. Combining rich context information between the mask branches in different stages, it greatly improved instance segmentation accuracy. In general, the performance of detection-based methods is better than that of segmentation-based methods. Thus, we choose to evaluate our proposed method on Mask R-CNN framework.

II-B Methods Combining Edge or Boundary Information

Recently, there are some attempts to incorporate edges or boundaries to facilitate instance segmentation.

Kang et al. [19] extended the edge of ground-truth mask to inside and outside by k pixels, and assigned pixel values of the extended parts empirically. This method was conducive to learning richer edge information and achieved a little performance improvement in both object detection and instance segmentation. However, the method is very sensitive to hyper-parameter k, which needs to be adjusted for different databases. Moreover, most values of k contribute to negative gains. Instead, our method is built upon classic image distance transformation technology and has a solid theoretical foundation. It is not sensitive to hyper-parameter k, and there is no hyper-parameter for loss fusion, which shows strong generality.

Zimmermann et al. [20] used the classical Sobel [21] operator to extract edge images of the predicted mask and the ground-truth mask, respectively. The error between the edge images was measured by mean square error (MSE) loss, which improved instance segmentation accuracy at the object’s edge. Beyond edge images, which only contain simple position information, we design a k-step DTI module to encode additional distance information, which can essentially be regarded as an active contour model and can learn the object’s contour better.

Hayder et al. [22] took DTI as a mask representation and predicted the DTI of the ground-truth mask through a complex neural network branch. This method relied on an explicit encode-decode module and special post-processing steps to produce objects’ masks. Although image distance transformation is good at describing the closeness of similar contour points, distance transformation values of regions far from the contour are easily affected by various disturbances. Therefore, the algorithm’s stability needs to be improved. In contrast, we design a truncated DTI module which is inferable and differentiable. By truncation, the algorithm pays more attention to the optimization of the contour points. Its inferable and differentiable characteristics allow our truncated DTI to be used as an evaluation metric. When applying truncated DTI, the inference network structure of the original algorithm can be inherited and preserved to produce more accurate masks, and no further post-processing steps are needed.

Cheng et al. [23] trained a new branch to predict the edges of masks to exploit edge information, directly increasing training parameters. Our method does not need to modify network structure of basic algorithm and does not increase training parameters, only optimizing existing parameters.

In short, the main difference from the above works is that we design an inferable and differentiable implementation of truncated DTI, which can generate new supervisory information online to specifically optimize object’s contour part. Another difference is that we propose contour loss on the foundation of the truncated DTI, which achieves better performances compared with existing methods.

III Method

In this section, we firstly introduce the overall sketch of the proposed contour loss for instance segmentation. Then, we demonstrate the procedure of computing k-step DTI, i.e. the truncated DTI, which is used for the computation of contour loss. In the end, we detail the mathematical definition as well as pseudocode of contour loss.

III-A Overall Architecture

Mask R-CNN is a general instance segmentation framework, but it takes no account of the segmentation quality near the contour. To overcome this drawback, we design a contour loss function on the foundation of k-step DTI and integrate it into Mask R-CNN to achieve joint training. The calculation process of contour loss is shown in Fig. 1. Contour loss does not change the original network structure and can also be applied to other instance segmentation frameworks.

As shown in Fig. 1, the calculation process of contour loss starts from the mask branch’s output of the present instance segmentation method. Firstly, according to the prediction of the classification branch, the predicted mask response is selected from the mask branch. Secondly, a simulated binarization operation is conducted on the selected mask response to approximately obtain the predicted mask. Thirdly, a fixed-parameter convolution layer with the Sobel operator as its convolution kernel is utilized to convolve the predicted mask and the ground-truth mask to get the predicted contour response and the ground-truth contour response, respectively. Finally, the image distance transformation operation is conducted on the predicted contour response and the ground-truth contour response to get the predicted k-step DTI and the ground-truth k-step DTI, respectively. The coverage values of one contour response image onto the k-step DTI of the other contour response image are accumulated. Contour loss is defined as the average of the two normalized coverage values. It can be jointly trained with the original mask loss to make the object mask more accurate and clearer near the contour. The various parts of the proposed contour loss are detailed as follows.

Fig. 2: The mathematical function curve for binarization (a) and the binarized results (b).
Fig. 3: Typical predicted mask responses (a) and ground-truth masks (b) are shown at the top row. Their corresponding contour responses and k-step DTIs are displayed at the middle row and bottom row, respectively.

Binarization of the predicted mask response. The predicted mask response is selected from the output of the mask branch. We utilize a differentiable mathematical function, formula (1), to approximately binarize it and obtain the predicted mask.

The function has two parameters, a slope and a threshold (binarization value), which we set to 20 and 0.5 by default. The purpose of using this function is to simulate the binarization operation in a differentiable manner, which corresponds to the step of obtaining the predicted binary masks at the inference stage. The curve of the function is shown in Fig. 2(a). Several pairs of predicted mask responses (first row) and their corresponding simulated binary images (second row) are shown in Fig. 2(b).
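The exact form of the binarization function is not reproduced in this text. A minimal sketch, assuming it is a steep logistic curve parameterized by the stated slope (20) and threshold (0.5); the name `soft_binarize` and the logistic form itself are assumptions made for illustration:

```python
import numpy as np

def soft_binarize(response, slope=20.0, threshold=0.5):
    """Smooth, differentiable stand-in for `response > threshold`.

    Assumption: a logistic curve; the paper only states that the function
    has a slope and a threshold.
    """
    response = np.asarray(response, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-slope * (response - threshold)))

mask_response = np.array([0.05, 0.45, 0.50, 0.55, 0.95])
soft_mask = soft_binarize(mask_response)                    # values in (0, 1)
hard_mask = (mask_response > 0.5).astype(np.float64)        # inference-time binarization
```

Responses far from the threshold are pushed close to 0 or 1, while responses near the threshold keep a non-zero gradient, which is what allows a loss computed on the soft mask to be backpropagated.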

Calculation of the contour response. We construct a fixed-parameter convolution layer with the Sobel operator as its convolution kernel and convolve the predicted mask and the ground-truth mask in both the x and y directions to obtain the predicted contour response and the ground-truth contour response, respectively.

The layer applies the standard convolution operation, and the absolute value of the directional responses is taken. Typical contour response images are shown in Fig. 3, where the first rows of Fig. 3(a) and Fig. 3(b) show the predicted mask responses and the ground-truth masks, respectively. The second rows show their corresponding contour responses.
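The contour-response step can be sketched as follows. The Sobel kernels are standard; how the two directional responses are combined is not recoverable from this text, so summing their absolute values is an assumption, as are the helper names `filter3x3` and `contour_response` and the edge-replicating padding:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Same-size 3x3 filtering with edge-replicated padding.

    This is cross-correlation; the sign flip relative to true convolution
    disappears once the absolute value is taken.
    """
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def contour_response(mask):
    mask = np.asarray(mask, dtype=np.float64)
    gx = filter3x3(mask, SOBEL_X)
    gy = filter3x3(mask, SOBEL_Y)
    return np.abs(gx) + np.abs(gy)  # assumption: |x-response| + |y-response|

mask = np.zeros((7, 7))
mask[2:5, 2:5] = 1.0                # a 3x3 square object
response = contour_response(mask)   # non-zero only around the square's boundary
```

Because the kernels are fixed, this layer adds no trainable parameters; it only turns a (soft) mask into a map that is large along the mask boundary and zero in flat regions.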

DTI of the contour response. The values of pixels in DTI that are far away from the contour may be unstable, thus can interfere with the optimization process. In order to make contour loss focus on optimizing the object’s contour parts, we use a threshold k to truncate the DTI of the contour response. In other words, pixel values of DTI exceeding k are set to k, and the resulting image is called k-step DTI. The last row of Fig. 3 shows the computed k-step DTIs. We can see that pixel values close to object’s contour are smaller (shown darker), while those far away from the contour are larger (shown brighter). The white areas indicate that their pixel values reach the truncated threshold k.

Note that the computation of k-step DTI must be differentiable; otherwise contour loss cannot be backpropagated during the network training phase. Therefore, we design an approximated differentiable module for the k-step DTI under modern neural network structures, which is referred to as the kSDT (k-Step Distance Transformation) algorithm (see Subsection III-B for the specific principle and implementation details). We apply kSDT to the predicted contour response and the ground-truth contour response to compute the predicted k-step DTI and the ground-truth k-step DTI, respectively.


Computation of the contour loss. Each pixel value of k-step DTI represents the distance between it and the closest point of the contour, which can be used to measure the difference between two contours. To compute contour loss, we firstly accumulate coverage values of one contour response image onto k-step DTI of the other contour response image. Then, we regard the average of the two normalized coverage values as the difference measure between the predicted contour response and the ground-truth contour response. During training stage, when the predicted contour response deviates from the ground-truth contour response, contour loss will optimize and correct the predicted mask response, making the predicted mask obtained at inference stage more accurate and clearer near the contour. Specific principle and implementation details of contour loss are shown in Subsection III-C.

Joint training with contour loss. Numerous studies have shown that multi-task learning performs better than single-task learning. Thus, we define a multi-task loss for each training batch which is expressed as follows.

$$L = L_{cls} + L_{box} + L_{mask} + L_{contour}$$

where the classification loss $L_{cls}$, the box regression loss $L_{box}$, and the mask loss $L_{mask}$ are the same as those in Mask R-CNN, and $L_{contour}$ is the proposed contour loss (see Subsection III-C).

Fig. 4: A binary image(a) and its DTI(b) shown as a heat map.
Fig. 5: The calculation flow chart of kSDT.

III-B k-Step DTI

Before presenting the approximated differentiable implementation of the proposed k-step DTI under modern neural networks, we first review the concept of image distance transformation. Image distance transformation is a classical technique in computer vision which has already been implemented in OpenCV, MATLAB and other common tools. DTI refers to the gray image obtained by applying the image distance transformation operation to an input binary image whose foreground pixel value is 1 and whose background pixel value is 0. Each pixel value of the DTI represents the distance between the pixel and the closest background pixel of the input binary image. Denote by $B(p)$ the pixel value of the input binary image at pixel $p$ and by $D(p)$ the pixel value of its DTI at pixel $p$. Naturally, $D(p)$ is 0 when $B(p)$ equals 0, and $D(p)$ is greater than 0 when $B(p)$ equals 1. In addition, $D(p)$ is small when $p$ is close to the background region of the binary image, while $D(p)$ is large when $p$ is far away from the background region. Fig. 4 shows a binary image (actually a binary mask of a car) and its DTI, which is shown as a heat map. The brighter the pixel on the heat map, the larger its pixel value, and vice versa.
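As an illustration of the classical (non-differentiable) distance transformation, the sketch below uses a two-pass chamfer scan in the spirit of [6]. The city-block metric is chosen here only for simplicity and is an assumption; the function name is hypothetical, and the input is assumed to contain at least one background pixel:

```python
def distance_transform_cityblock(binary):
    """City-block distance from every pixel to the nearest background (0) pixel."""
    h, w = len(binary), len(binary[0])
    inf = h + w  # larger than any city-block distance inside the grid
    dist = [[0 if binary[y][x] == 0 else inf for x in range(w)] for y in range(h)]
    # Forward pass: propagate distances from the top-left corner.
    for y in range(h):
        for x in range(w):
            if y > 0:
                dist[y][x] = min(dist[y][x], dist[y - 1][x] + 1)
            if x > 0:
                dist[y][x] = min(dist[y][x], dist[y][x - 1] + 1)
    # Backward pass: propagate distances from the bottom-right corner.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                dist[y][x] = min(dist[y][x], dist[y + 1][x] + 1)
            if x < w - 1:
                dist[y][x] = min(dist[y][x], dist[y][x + 1] + 1)
    return dist

binary = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
dti = distance_transform_cityblock(binary)
# Background pixels stay 0; foreground values grow towards the centre.
```

The sequential min-propagation in the two passes is exactly the kind of operation that has no direct counterpart as a differentiable network layer, which motivates the kSDT approximation.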

The calculation of k-step DTI differs slightly from the common DTI above. In an initial binary contour image, the points belonging to the contour are foreground and the rest are background. We first invert it, so that the points belonging to the contour become background and the rest become foreground. We then apply the image distance transformation operation to the inverted binary image and use a truncation threshold k to obtain the expected k-step DTI. The acquired k-step DTI describes the closest distance to the contour for each pixel, which can be effectively used to measure the difference between contours.
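The procedure above (invert the contour image, apply a distance transformation, truncate at k) can be sketched with a brute-force reference implementation. The chessboard metric and the function name are assumptions for illustration; the inversion is implicit, since the distance to the nearest contour pixel equals the distance to the background of the inverted image:

```python
import numpy as np

def k_step_dti_reference(contour, k):
    """Exact truncated DTI of a binary contour image (brute force, not differentiable)."""
    contour = np.asarray(contour)
    ys, xs = np.nonzero(contour)                 # coordinates of contour pixels
    if ys.size == 0:
        return np.full(contour.shape, float(k))  # no contour: everything is truncated
    grid_y, grid_x = np.indices(contour.shape)
    # Chessboard distance from every pixel to every contour pixel, then the minimum.
    dy = np.abs(grid_y[..., None] - ys)
    dx = np.abs(grid_x[..., None] - xs)
    dist = np.maximum(dy, dx).min(axis=-1)
    return np.minimum(dist, k).astype(np.float64)  # values above k are set to k

contour = np.zeros((7, 7), dtype=int)
contour[3, 1:6] = 1                              # a short horizontal contour segment
dti_k2 = k_step_dti_reference(contour, k=2)
```

Pixels on the contour get 0, their immediate neighbours get 1, and everything two or more pixels away is clamped to k = 2, so distant regions contribute a constant instead of a noisy distance value.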

However, the computation of the above k-step DTI has factors which are not differentiable, and there are no available modules under existing deep learning frameworks. To solve this problem, we propose an approximated differentiable implementation of the k-step DTI suitable for current neural networks, called kSDT. Fig. 5 shows the calculation flow chart of kSDT. The black solid arrow in Fig. 5 represents the data flow, and the black dashed arrow represents the output of k-step DTI.

The algorithm takes the contour response from Subsection III-A as its initial input, which is denoted as the 0-mask. By iteratively executing k groups of {dilation, addition} operations in formula (9) and formula (10), the k-step DTI with opposite value is obtained. The final k-step DTI is then calculated from it in formula (11).

Here the dilation is a one-step dilation operator and the addition is element-wise. The above calculation process is differentiable except for the dilation operator.

To make the whole process differentiable, we further design an approximated differentiable one-step dilation operator. Taking the computation of the 1-mask from the 0-mask as an example, the specific calculation process is shown in Fig. 6. The algorithm first constructs a fixed-parameter convolution layer with the smooth operator in formula (12) as its convolution kernel and convolves the 0-mask once. Then, formula (1) is utilized to approximately binarize the smoothed image to get the expected dilated image (1-mask). Here, we set the slope to 20 and the threshold to 0.1. By taking the dilated image as the input of the next stage, the dilation result of each stage (k-mask) can be obtained iteratively. Note that the input of the one-step dilation operator is not restricted to binary images, making the whole k-step DTI module (kSDT) compatible with continuous response maps. The calculation process of kSDT is summarized in Algorithm 1.
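Formulas (9) to (12) are not reproduced in this text, so the following is only one possible reading of kSDT, not the authors' implementation. Assumptions: the smoothing kernel is a 3x3 box sum, the binarization is a logistic curve, the accumulated image is the sum of the 0-mask to the (k-1)-mask, and the final DTI is k minus that sum. The threshold 0.5 used here (instead of the reported 0.1) is also an assumption, chosen so that this particular kernel keeps the toy example crisp:

```python
import numpy as np

def soft_binarize(x, slope, threshold):
    return 1.0 / (1.0 + np.exp(-slope * (x - threshold)))

def box_sum3x3(image):
    """Sum over each 3x3 neighbourhood (zero padding); a stand-in for formula (12)."""
    padded = np.pad(image, 1, mode="constant")
    h, w = image.shape
    return sum(padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3))

def soft_dilate(mask, slope=20.0, threshold=0.5):
    """Approximate one-step dilation: smooth, then softly binarize."""
    return soft_binarize(box_sum3x3(mask), slope, threshold)

def ksdt(contour_response, k):
    """Approximate k-step DTI: k minus the sum of the 0-mask ... (k-1)-mask."""
    # Assumption: responses are clipped to [0, 1] so the 0-mask behaves like a mask.
    mask = np.clip(np.asarray(contour_response, dtype=np.float64), 0.0, 1.0)
    accumulated = np.zeros_like(mask)
    for step in range(k):
        accumulated = accumulated + mask      # element-wise addition
        if step + 1 < k:
            mask = soft_dilate(mask)          # next dilated mask
    return k - accumulated

contour = np.zeros((7, 7))
contour[3, 1:6] = 1.0
dti = ksdt(contour, k=2)   # close to 0 on the contour, 1 next to it, 2 elsewhere
```

Every operation is a convolution, a sigmoid, an addition or a subtraction, so gradients can flow from the DTI back to the contour response. Under these assumptions the output agrees with an exact chessboard-distance DTI truncated at k up to a small smoothing error.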



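Algorithm 1 kSDT: starting from the 0-mask, a loop repeats the one-step dilation and the element-wise addition for each of the k steps and outputs the k-step DTI.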
Fig. 6: Schematic diagram of the designed differentiable one-step dilation operator.
Fig. 7: Typical training images with polygon mask annotations in COCO 2014.

III-C Contour Loss Function

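Contour loss can be measured by the distance between the predicted contour and the ground-truth contour. Mathematically, let $C_{gt}$ and $C_{pred}$ be the ground-truth contour and the predicted contour, respectively. Let $p$ be a contour point on $C_{pred}$; then the distance between $p$ and $C_{gt}$ is usually defined as the distance between $p$ and its closest ground-truth contour point:

$$d(p, C_{gt}) = \min_{q \in C_{gt}} \lVert p - q \rVert \qquad (13)$$

Let $D_{gt}$ and $D_{pred}$ be the DTIs of $C_{gt}$ and $C_{pred}$, respectively. According to the process of computing DTI, the distance between $p$ and $C_{gt}$ in formula (13) is equal to the coverage value on $D_{gt}$ by $p$:

$$d(p, C_{gt}) = D_{gt}(p) \qquad (14)$$

Therefore, the distance between the predicted contour and the ground-truth contour can be computed based on DTIs:

$$d(C_{pred}, C_{gt}) = \frac{1}{2}\left(\frac{1}{N_{pred}}\sum_{p \in C_{pred}} D_{gt}(p) + \frac{1}{N_{gt}}\sum_{q \in C_{gt}} D_{pred}(q)\right) \qquad (15)$$

where $N_{gt}$ and $N_{pred}$ are the numbers of contour points of $C_{gt}$ and $C_{pred}$, respectively.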
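The symmetric contour distance described above (average each contour's coverage on the other contour's DTI, then average the two results) can be checked on a toy example. The sketch assumes binary contour images with at least one contour pixel each and uses a brute-force Euclidean DTI; all function names are hypothetical:

```python
import numpy as np

def dti_of_contour(contour):
    """Euclidean distance from every pixel to the nearest contour pixel (brute force)."""
    ys, xs = np.nonzero(contour)
    grid_y, grid_x = np.indices(contour.shape)
    d2 = (grid_y[..., None] - ys) ** 2 + (grid_x[..., None] - xs) ** 2
    return np.sqrt(d2.min(axis=-1))

def contour_distance(contour_pred, contour_gt):
    contour_pred = np.asarray(contour_pred, dtype=bool)
    contour_gt = np.asarray(contour_gt, dtype=bool)
    dti_pred = dti_of_contour(contour_pred)
    dti_gt = dti_of_contour(contour_gt)
    pred_on_gt = dti_gt[contour_pred].mean()   # mean of D_gt over predicted contour points
    gt_on_pred = dti_pred[contour_gt].mean()   # mean of D_pred over ground-truth contour points
    return 0.5 * (pred_on_gt + gt_on_pred)

gt = np.zeros((6, 6), dtype=bool)
gt[2, 1:5] = True
pred = np.zeros((6, 6), dtype=bool)
pred[4, 1:5] = True                            # same segment, shifted down by two pixels

shifted = contour_distance(pred, gt)           # every point is 2 pixels from the other contour
identical = contour_distance(gt, gt)           # coinciding contours give zero distance
```

Looking up the other contour's DTI at each contour point replaces the explicit nearest-point search, which is why the distance can be computed with element-wise products once both DTIs are available.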

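In order to ensure differentiability, we design a continuous version of formula (15) to calculate contour loss, which employs the continuous contour responses $R_{pred}$ and $R_{gt}$ and their k-step DTIs $D^{k}_{pred}$ and $D^{k}_{gt}$:

$$L_c = \frac{1}{2}\left(\frac{GAP(R_{pred} \odot D^{k}_{gt})}{GAP(R_{pred}) + \epsilon} + \frac{GAP(R_{gt} \odot D^{k}_{pred})}{GAP(R_{gt}) + \epsilon}\right)$$

where $\odot$ represents the Hadamard product, $GAP$ represents global average pooling, and $\epsilon$ is a smooth term to avoid division by zero.

Assuming that a total of $N$ positive samples are obtained in a training batch, the final contour loss function can be expressed as:

$$L_{contour} = \frac{1}{N}\sum_{i=1}^{N} L_c^{(i)}$$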
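A minimal sketch of the continuous loss, assuming each normalized coverage value is the global average of "response times the other k-step DTI" divided by the global average of the response plus a small smooth term, and that the batch loss is the mean over positive samples. The function names and the value of `eps` are assumptions:

```python
import numpy as np

def contour_loss_single(resp_pred, dti_pred, resp_gt, dti_gt, eps=1e-6):
    """Average of the two normalized coverage values for one positive sample."""
    gap = np.mean  # global average pooling over the spatial map
    pred_on_gt = gap(resp_pred * dti_gt) / (gap(resp_pred) + eps)
    gt_on_pred = gap(resp_gt * dti_pred) / (gap(resp_gt) + eps)
    return 0.5 * (pred_on_gt + gt_on_pred)

def contour_loss_batch(samples, eps=1e-6):
    """Mean of the per-sample losses over all positive samples in a batch."""
    return float(np.mean([contour_loss_single(*s, eps=eps) for s in samples]))

# Toy data: contours are full rows, so the k-step DTI (k = 2) is the row offset clamped at 2.
rows = np.arange(6)[:, None] * np.ones((1, 6))
resp_gt = (rows == 2).astype(float)
dti_gt = np.minimum(np.abs(rows - 2), 2)
resp_pred = (rows == 4).astype(float)
dti_pred = np.minimum(np.abs(rows - 4), 2)

loss_far = contour_loss_single(resp_pred, dti_pred, resp_gt, dti_gt)  # contours 2 rows apart
loss_same = contour_loss_single(resp_gt, dti_gt, resp_gt, dti_gt)     # identical contours
```

With hard 0/1 responses this reduces to the discrete contour distance; with soft responses every pixel contributes in proportion to how strongly it responds as contour.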


Algorithm 2 provides the detailed calculation process of contour loss, covering the predicted mask response, the predicted mask and the ground-truth mask, the convolution operation with the Sobel kernel, and the resulting contour loss.

Algorithm 2 Calculation Process of Contour Loss. Input: images from COCO. An outer loop runs over the image-annotation pairs in COCO (steps 2 to 13), and an inner loop runs over the samples in a batch (steps 4 to 11).

IV Experiments

IV-A Dataset and Metrics

In order to verify the effectiveness and the generalization ability of the proposed contour loss, we conducted extensive experiments on COCO which is a widely used benchmark dataset for common objects instance segmentation. This dataset is challenging due to the large number of target categories and the wide ranges of object scales. Fig. 7 shows some typical training images with polygon mask annotations. We use COCO 2014 to do experiments which includes 82783 training images and 40504 validation images. We train models on the whole training set, and report results on the mini validation set which contains 5000 images. We use the standard metrics, i.e. COCO AP, to evaluate all models including: mAP, AP50, AP75, APs, APm, APl.

Method           k   mAP    AP50   AP75   APs    APm    APl
baseline         -   34.28  55.94  36.20  15.82  36.75  50.84
+ contour loss   1   34.39  55.88  36.46  15.81  36.98  51.26
+ contour loss   2   34.54  56.16  36.54  16.26  36.95  51.42
+ contour loss   3   34.39  56.10  36.19  15.74  36.74  51.18
+ contour loss   4   34.36  55.89  36.42  15.82  36.76  51.13
+ contour loss   5   34.36  56.02  36.37  15.99  36.79  50.99
+ contour loss   6   34.18  55.91  36.13  15.92  36.50  50.74
TABLE I: The impact of different values of k on mask accuracy (%).
Fig. 8: Visualization of the data of Table I. (a): mAP, AP50, AP75. (b): APs, APm, APl.

IV-B Implementation Details and Experimental Setup

We implement contour loss on top of Mask R-CNN for its efficiency and good performance. We use Res-50+FPN [24] as the backbone network by default. Our code is based on the open source project maskrcnn-benchmark [25]. We initialize the backbone network with the weights pre-trained on ImageNet [26]. We train the instance segmentation network for a total of 180K iterations. We set the initial learning rate to 0.01 and reduce it to 0.1 and 0.01 of its initial value after 120K and 160K iterations, respectively. We train all models on 4 NVIDIA 2080Ti GPUs using SGD with 8 images per mini-batch. Unless specified, the input image is resized so that its shorter side is 800 pixels and its longer side is at most 1333 pixels. Other hyper-parameters are kept consistent with the open source project. For larger backbones, we follow the linear scaling rule [27] to adjust the learning rate schedule when decreasing the mini-batch size.
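The step schedule and the linear scaling rule [27] described above can be sketched as follows. The helper names are hypothetical, and scaling only the learning rate (not the milestones) with the mini-batch size is an assumption about how the rule is applied:

```python
def learning_rate(iteration, base_lr=0.01, milestones=(120_000, 160_000), gamma=0.1):
    """Step schedule: multiply the rate by `gamma` at each milestone that has been reached."""
    drops = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** drops

def scaled_base_lr(batch_size, ref_lr=0.01, ref_batch_size=8):
    """Linear scaling rule: the learning rate is proportional to the mini-batch size."""
    return ref_lr * batch_size / ref_batch_size

lr_start = learning_rate(0)            # 0.01 for the first 120K iterations
lr_middle = learning_rate(130_000)     # 0.1 of the initial value
lr_final = learning_rate(170_000)      # 0.01 of the initial value
lr_small_batch = scaled_base_lr(4)     # half the batch size, half the base rate
```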

It’s noteworthy that contour loss usually plays an auxiliary role. In other words, we only enable contour loss after some iterations of the original algorithm. Specifically, we firstly train the original Mask R-CNN to 120K iterations (save as a checkpoint), then enable the contour loss (load the saved checkpoint), and continue to train it to 180K iterations. Naturally, our baseline is the Mask R-CNN trained from 120K (load the saved checkpoint) to 180K iterations without contour loss. On one hand, this can reduce the verification time of the proposed method. On the other hand, it may prevent instability of the mask branch caused by contour loss at early training stage.

Method               mAP    AP50   AP75   APs    APm    APl
baseline             34.28  55.94  36.20  15.82  36.75  50.84
+ MSE edge loss      34.33  55.87  36.39  15.79  36.87  51.02
+ MSE contour loss   34.42  55.95  36.55  15.88  37.01  51.00
+ contour loss       34.54  56.16  36.54  16.26  36.95  51.42

The highest value in each column is shown in bold, and the second highest value is underlined.

TABLE II: Performance comparison of different loss functions (%).
Method  Backbone     CL   mAP    AP50   AP75   APs    APm    APl
MR      Res-50+FPN   -    34.28  55.94  36.20  15.82  36.75  50.84
MR      Res-50+FPN   ✓    34.54  56.16  36.54  16.26  36.95  51.42
MR      Res-101+FPN  -    35.79  58.02  38.16  16.75  38.70  53.09
MR      Res-101+FPN  ✓    35.96  58.04  38.35  16.52  38.89  53.50
MR      Res-X-101    -    38.16  61.00  40.93  18.61  40.91  55.27
MR      Res-X-101    ✓    38.29  61.16  41.19  18.54  41.01  55.67
HTC*    Res-50+FPN   -    37.7   59.1   40.2   19.3   40.4   53.4
HTC*    Res-50+FPN   ✓    37.9   59.3   40.3   19.2   40.6   53.1

*Note that the implementation of HTC is based on MMDetection [28]. We train it on COCO 2017 train (115K images) and report results on COCO 2017 val (5K images).

TABLE III: Performance comparison of different instance segmentation algorithms (%). MR represents Mask R-CNN. CL represents contour loss.

IV-C Evaluation on Hyper-parameter k

An important parameter involved in the calculation of contour loss is k. We explored the impact of different values of k on mask accuracy, selecting 6 values for the experiments: 1, 2, 3, 4, 5, and 6. The results are summarized in Table I. From the table we can see that contour loss is not sensitive to k in the range of 1 to 5. Under the auxiliary supervision of contour loss, most of the evaluation metrics of the benchmark algorithm are improved to a certain extent. With contour loss, the model achieves maximum gains of 0.26% mAP, 0.22% AP50, 0.34% AP75, 0.44% APs, 0.2% APm, and 0.58% APl, respectively. We set k to 2 in the following experiments.
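The reported gains can be re-derived from the Table I values. The short check below (variable and function names are illustrative) computes the per-metric difference to the baseline and shows that the quoted gains coincide with the k = 2 row:

```python
METRICS = ("mAP", "AP50", "AP75", "APs", "APm", "APl")
BASELINE = (34.28, 55.94, 36.20, 15.82, 36.75, 50.84)
WITH_CONTOUR_LOSS = {  # rows of Table I, keyed by k
    1: (34.39, 55.88, 36.46, 15.81, 36.98, 51.26),
    2: (34.54, 56.16, 36.54, 16.26, 36.95, 51.42),
    3: (34.39, 56.10, 36.19, 15.74, 36.74, 51.18),
    4: (34.36, 55.89, 36.42, 15.82, 36.76, 51.13),
    5: (34.36, 56.02, 36.37, 15.99, 36.79, 50.99),
    6: (34.18, 55.91, 36.13, 15.92, 36.50, 50.74),
}

def gains(k):
    """Difference to the baseline for every metric, rounded to two decimals."""
    return {m: round(v - b, 2) for m, v, b in zip(METRICS, WITH_CONTOUR_LOSS[k], BASELINE)}

best_k = max(WITH_CONTOUR_LOSS, key=lambda k: WITH_CONTOUR_LOSS[k][0])  # k with the highest mAP
gains_at_best_k = gains(best_k)
```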

Fig. 9: Qualitative comparison between the benchmark algorithm (top row) and our method (bottom row).

Fig. 8 visualizes the data of Table I. The horizontal axis represents the value of k, and the vertical axis represents mask accuracy (represented by decimals). Each subplot represents the comparison result under one kind of metric. It can be seen from the figures that under the auxiliary supervision of contour loss, mask accuracy can be improved on most metrics.

IV-D Ablation Study

In order to further verify the performance of contour loss, we choose two alternatives to it for ablation experiments.

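  • MSE Edge Loss: We utilize MSE loss to calculate the distance between the predicted contour response $R_{pred}$ and the ground-truth contour response $R_{gt}$.

    $$L_{edge} = \frac{1}{N}\sum_{i=1}^{N} MSE\left(R_{pred}^{(i)}, R_{gt}^{(i)}\right)$$

    where $N$ is the total number of positive samples.

  • MSE Contour Loss: We utilize MSE loss to reduce the error between the predicted k-step DTI $D^{k}_{pred}$ and the ground-truth k-step DTI $D^{k}_{gt}$.

    $$L_{mse\text{-}contour} = \frac{1}{N}\sum_{i=1}^{N} MSE\left(D^{k,(i)}_{pred}, D^{k,(i)}_{gt}\right)$$

    where $N$ is the total number of positive samples.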
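Both ablation losses share the same structure and differ only in the maps they compare. A minimal sketch, assuming a per-sample mean squared error averaged over the positive samples; the function name is hypothetical:

```python
import numpy as np

def mse_loss_batch(pred_maps, gt_maps):
    """Mean over positive samples of the per-sample mean squared error."""
    per_sample = [
        np.mean((np.asarray(p, dtype=np.float64) - np.asarray(g, dtype=np.float64)) ** 2)
        for p, g in zip(pred_maps, gt_maps)
    ]
    return float(np.mean(per_sample))

pred_map = np.array([[0.0, 1.0], [1.0, 0.0]])
gt_map = np.array([[0.0, 1.0], [0.0, 0.0]])

one_sample = mse_loss_batch([pred_map], [gt_map])                  # one of four pixels differs
two_samples = mse_loss_batch([pred_map, gt_map], [gt_map, gt_map]) # second sample matches exactly
```

Passing contour responses gives the MSE edge loss, and passing k-step DTIs gives the MSE contour loss. Unlike contour loss, both compare the two maps pixel by pixel instead of reading one contour's coverage on the other contour's DTI.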

Experimental results are summarized in Table II. The last line represents the proposed contour loss. From the table we can observe that: (1) Compared with the benchmark algorithm, all of the three loss functions can improve the mask accuracy. (2) Contour loss is superior to the other two loss functions, and it can get the best mask accuracy.

IV-E Comparative Study

In order to verify the generalization ability of contour loss, we conduct comparative experiments on Mask R-CNN with different backbones and on HTC with the Res-50+FPN backbone. Experimental results are summarized in Table III. The proposed contour loss brings gains of 0.13% to 0.26% mAP, 0.16% to 0.22% AP50, 0.26% to 0.34% AP75, 0.44% APs, 0.1% to 0.2% APm and 0.4% to 0.58% APl on Mask R-CNN. HTC with contour loss achieves gains of 0.2% mAP, 0.2% AP50, 0.1% AP75 and 0.2% APm, respectively. Thus, contour loss is effective for different instance segmentation methods.

IV-F Qualitative Analysis

Fig. 9 shows the qualitative segmentation results of Mask R-CNN (top row) and “Mask R-CNN + Contour Loss” (bottom row). The results are based on backbone network of Res-50+FPN. For the convenience of comparison, we only show the contours of the predicted masks. By comparing the areas indicated by red dashed arrow, we can see that object masks segmented by our method have more accurate and clearer contours, which proves the effectiveness of the proposed method.

V Conclusions

In this paper, we introduce classic distance transformation image (DTI) into instance segmentation. We propose a contour loss function based on the designed differentiable k-step DTIs to specifically optimize the contour parts of the predicted masks. Contour loss can be effectively integrated into existing instance segmentation methods and combined with their original loss functions to gain more accurate and clearer masks. The proposed method does not need to modify the original network structure or increase more training parameters, thus has strong versatility. Experiments on COCO show that contour loss is effective and can further improve the performance of current instance segmentation methods. In future work, we will explore the possibility of applying contour loss to instance segmentation of unseen objects.


This work is partly supported by National Natural Science Foundation of China (Grant No. U19B2033, Grant No.62076020), and National Key R&D Program (Grant No. 2019YFF0301801).


  • [1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [2] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in ICCV, 2017.
  • [3] L. Ye, Z. Liu and Y. Wang, “Depth-aware object instance segmentation,” in ICIP, 2017.
  • [4] H. Y. Kim and B. R. Kang, “Instance segmentation and object detection with bounding shape masks,” arXiv preprint arXiv:1810.10327, 2018.
  • [5] X. Zhou, J. Zhuo, and P. Krahenbuhl, “Bottom-up object detection by grouping extreme and center points,” in CVPR, 2019.
  • [6] G. Borgefors, “Distance transformations in digital images,” Comput. Vis. Graphics Image Process., 1986.
  • [7] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
  • [8] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [9] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, “Instance-sensitive fully convolutional networks,” in ECCV, 2016.
  • [10] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional instance-aware semantic segmentation,” in CVPR, 2017.
  • [11] V. Q. Pham, S. Ito, and T. Kozakaya, “Biseg: Simultaneous instance segmentation and semantic segmentation with fully convolutional networks,” in BMVC, 2017.
  • [12] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, “Solo: Segmenting objects by locations,” in ECCV, 2020.
  • [13] T. Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in CVPR, 2017.
  • [14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., 2016.
  • [15] C. Y. Fu, M. Shvets, and A. C. Berg, “RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free,” arXiv preprint arXiv:1901.03353, 2019.
  • [16] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path Aggregation Network for Instance Segmentation,” in CVPR, 2018.
  • [17] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring r-cnn,” in CVPR, 2019.
  • [18] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy and D. Lin, “Hybrid task cascade for instance segmentation,” in CVPR, 2019.
  • [19] B. R. Kang, H. Lee, K. Park, H. Ryu, and H. Y. Kim, “BshapeNet: Object detection and instance segmentation with bounding shape masks,” Pattern Recognit. Lett., 2020.
  • [20] R. S. Zimmermann and J. N. Siems, “Faster training of Mask R-CNN by focusing on instance boundaries,” Comput. Vis. and Image Underst., 2019.
  • [21] J. Kittler, “On the accuracy of the Sobel edge detector,” Image and Vision Computing., 1983.
  • [22] Z. Hayder, X. He, M. Salzmann, “Boundary-aware Instance Segmentation,” in CVPR, 2017.
  • [23] T. Cheng, X. Wang, L. Huang, and W. Liu, “Boundary-preserving mask R-CNN,” in ECCV, 2020.
  • [24] K. He, X. Zhang, S. Ren, J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [25] Faster R-CNN and Mask R-CNN in PyTorch 1.0, 2019. [Online; accessed 2019-11-6].
  • [26] J. Deng, W. Dong, R. Socher, L. Li, K. Li and F. F. Li, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [27] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Q. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
  • [28] MMDetection, 2019. [Online; accessed 2020-12-16].