I Introduction
Instance segmentation is a fundamental problem in computer vision. Its main task is to acquire the location and the pixel-wise semantic information of each instance. Benefiting from the tremendous development of deep learning [1] in object detection and semantic segmentation, deep-learning-based instance segmentation has made rapid progress over a short period of time. However, due to the diversity of objects and the overlap between them, instance segmentation remains a challenging problem.
Taking the classic instance segmentation method Mask R-CNN [2] as an example, although it can predict a general mask, the contour of the mask (referred to as the predicted contour hereafter) is neither clear nor accurate. As far as we know, this can be a fatal problem in some applications. For example, in vision-based robot grasping, a clear and accurate contour is essential to the quality of grasp detection. Our goal in this work is to make the predicted mask and its ground-truth mask not only consistent on the whole, but also as consistent as possible near the contour.
Recently, some studies suggest that introducing additional supervisory signals beyond RGB data may provide new clues in a complementary mode to improve the performance of instance segmentation [3, 4, 5]. These works respectively introduce depth, shape, and key points as auxiliary information to perform instance segmentation efficiently. Although these methods achieve good performance from different perspectives, they usually require complicated pipelines or a large number of training parameters. Moreover, their segmentation results near the contour are still not ideal.
To improve instance segmentation accuracy near the contour, this paper borrows the classic idea of image distance transformation [6] and brings the distance transformation image (DTI) into instance segmentation to provide contour supervisory signals. Our method makes two main contributions. First, we propose a novel loss function named contour loss, based on the DTI, to specially optimize the contour part. In particular, we firstly calculate the predicted k-step DTI and the ground-truth k-step DTI for the predicted mask and the ground-truth mask, respectively. Then, we accumulate the coverage values of one contour image onto the k-step DTI of the other contour image, and the average of the two normalized coverage values is regarded as the difference measure between the predicted contour and the ground-truth contour. We define the contour difference calculated in this manner as contour loss, which can be integrated into existing instance segmentation algorithms such as Mask R-CNN without modifying their neural network structures. Second, we design a differentiable k-step DTI calculation module, which approximately computes truncated DTIs of the predicted mask and the ground-truth mask online. The proposed module can be jointly trained in modern neural network frameworks without adding training parameters. To the best of our knowledge, this is the first analytic DTI module adapted to current neural network frameworks. Experiments on COCO [7] show that the proposed contour loss is effective at producing more accurate and clearer masks, and can further improve instance segmentation performance.
II Related Works
This section firstly introduces the mainstream instance segmentation algorithms to date, and then separately summarizes those combining edge or boundary information, analyzing the differences between our method and theirs.
II-A Mainstream Instance Segmentation Methods
According to the processing pipeline, current mainstream instance segmentation algorithms can be categorized into segmentation-based methods and detection-based methods.
Segmentation-based methods firstly perform semantic segmentation on the image and then produce instance masks by combining semantic information. FCN [8] has achieved remarkable success in the field of semantic segmentation, and numerous researchers have tried to apply it to instance segmentation. Dai et al. proposed InstanceFCN [9]. This method firstly generated a set of instance-sensitive score maps which were used to predict the semantic information of different relative positions of the same instance, and then assembled them into object masks accordingly. Li et al. proposed FCIS [10]. The authors used the feature representation of inside/outside position-sensitive score maps to solve the problem that the same pixel may have different semantics in different regions of interest, and determined the object category while generating the object's mask. Pham et al. came up with Biseg [11]. They used semantic segmentation score maps and Li et al.'s inside/outside position-sensitive score maps as prior information, regarded instance masks as posterior information, and deduced object masks from the priors using a Bayesian model. Wang et al. proposed SOLO [12]. This algorithm divided the input image into S×S grids, used FPN [13] to distinguish objects of different scales, and tried to segment masks directly from the image.
Detection-based methods firstly rely on an object detector to locate targets in the image, and then perform pixel-level classification within each target area. He et al. proposed Mask R-CNN in 2017 [2], which took full advantage of the object detector to achieve high instance segmentation accuracy. Since then, detection-based instance segmentation methods represented by Mask R-CNN have gradually become the mainstream. The essence of Mask R-CNN is to add a feature alignment module and a mask branch to Faster R-CNN [14]. Inspired by previous works, Fu et al. developed RetinaMask [15], a real-time single-stage instance segmentation algorithm. PANet [16] established information flow between low-level features and high-level features, which further improved instance segmentation precision. Mask Scoring R-CNN [17] relied on a MaskIoU branch to handle the mismatch between mask quality and mask score. HTC [18] made full use of the reciprocal relationship between the detection task and the segmentation task to integrate and learn complementary features at each stage. By combining rich context information between the mask branches in different stages, it greatly improved instance segmentation accuracy. In general, the performance of detection-based methods is better than that of segmentation-based methods. Thus, we choose to evaluate our proposed method on the Mask R-CNN framework.
II-B Methods Combining Edge or Boundary Information
Recently, there have been some attempts to incorporate edges or boundaries to facilitate instance segmentation.
Kang et al. [19] extended the edge of the ground-truth mask inward and outward by k pixels, and assigned pixel values to the extended parts empirically. This method was conducive to learning richer edge information and achieved a modest performance improvement in both object detection and instance segmentation. However, the method is very sensitive to the hyperparameter k, which needs to be adjusted for different datasets. Moreover, most values of k lead to negative gains. In contrast, our method is built upon classic image distance transformation and has a solid theoretical foundation. It is not sensitive to the hyperparameter k, and there is no hyperparameter for loss fusion, which shows strong generality.
Zimmermann et al. [20] used the classical Sobel [21] operator to extract edge images of the predicted mask and the ground-truth mask, respectively. The error between the edge images was measured by a mean square error (MSE) loss, which improved the instance segmentation accuracy at the object's edge. Besides edge images, which only contain simple position information, we design a k-step DTI module to encode additional distance information, which can essentially be regarded as an active contour model and can learn the object's contour better.
Hayder et al. [22] took the DTI as a mask representation, and predicted the DTI of the ground-truth mask through a complex neural network branch. This method relied on an explicit encode-decode module and special post-processing steps to produce objects' masks. Although image distance transformation is good at describing the closeness of similar contour points, the distance transformation values of regions far from the contour are easily affected by various disturbances, so the algorithm's stability needs to be improved. Differently, we design a truncated DTI module which is inferable and differentiable. Through truncation, the algorithm pays more attention to optimizing the contour points. The inferable and differentiable characteristics allow our truncated DTI to serve as an evaluation measure during training. When applying the truncated DTI, the inference network structure of the original algorithm can be inherited and preserved to produce more accurate masks, and no further post-processing steps are needed.
Cheng et al. [23] trained a new branch to predict the edges of masks to exploit edge information, which directly increases the number of training parameters. Our method does not need to modify the network structure of the base algorithm and does not increase training parameters; it only optimizes the existing ones.
In short, the main difference from the above works is that we design an inferable and differentiable implementation of the truncated DTI, which can generate new supervisory information online to specifically optimize the object's contour part. Another difference is that we propose contour loss on the foundation of the truncated DTI, which achieves better performance than existing methods.
III Method
In this section, we firstly introduce the overall sketch of the proposed contour loss for instance segmentation. Then, we demonstrate the procedure for computing the k-step DTI, i.e., the truncated DTI, which is used for the computation of contour loss. Finally, we detail the mathematical definition as well as the pseudocode of contour loss.
III-A Overall Architecture
Mask R-CNN is a general instance segmentation framework, but it does not explicitly consider the segmentation quality near the contour. To overcome this drawback, we design a contour loss function on the foundation of the k-step DTI and integrate it into Mask R-CNN for joint training. The calculation process of contour loss is shown in Fig. 1. Contour loss does not change the original network structure and can also be applied to other instance segmentation frameworks.
As shown in Fig. 1, the calculation process of contour loss starts from the mask branch's output of the present instance segmentation method. Firstly, according to the prediction of the classification branch, the predicted mask response is selected from the mask branch. Secondly, a simulated binarization operation is conducted on the selected mask response to approximately obtain the predicted mask. Thirdly, a fixed-parameter convolution layer with the Sobel operator as its convolution kernel is used to convolve the predicted mask and the ground-truth mask, yielding the predicted contour response and the ground-truth contour response, respectively. Finally, the image distance transformation operation is applied to the two contour responses to obtain the predicted k-step DTI and the ground-truth k-step DTI. The coverage values of one contour response image onto the k-step DTI of the other contour response image are accumulated, and contour loss is defined as the average of the two normalized coverage values. It can be jointly trained with the original mask loss to make the object mask more accurate and clearer near the contour. The various parts of the proposed contour loss are detailed as follows.

Binarization of the predicted mask response. Denote $R$ as the predicted mask response selected from the output of the mask branch. We utilize a differentiable mathematical function to approximately binarize it and obtain the predicted mask $M_p$:
(1) $\sigma_{w,t}(x) = \dfrac{1}{1 + e^{-w(x - t)}}$

(2) $M_p = \sigma_{w,t}(R)$
where $w$ and $t$ represent the slope and the threshold (binarization value), respectively. We set $w$ to 20 and $t$ to 0.5 by default. The purpose of this function is to simulate the binarization operation in a differentiable manner, which corresponds to the step of obtaining the predicted binary masks at the inference stage. The curve of the function is shown in Fig. 2(a). Several pairs of predicted mask responses (first row) and their corresponding simulated binary images (second row) are shown in Fig. 2(b).
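As a concrete illustration, the simulated binarization can be sketched as a steep sigmoid in a few lines of NumPy (the function and variable names here are our own; the paper's actual implementation operates on network tensors):

```python
import numpy as np

def soft_binarize(response, slope=20.0, threshold=0.5):
    """Differentiable stand-in for hard thresholding: a steep sigmoid
    centred at `threshold`. Values well above the threshold map to ~1,
    values well below map to ~0, and gradients remain non-zero."""
    return 1.0 / (1.0 + np.exp(-slope * (response - threshold)))

# A low, a borderline, and a high mask response.
m = soft_binarize(np.array([0.1, 0.5, 0.9]))
```

With the default slope of 20, a response of 0.1 maps to roughly 0.0003 and a response of 0.9 to roughly 0.9997, so the output is close to a binary mask while staying differentiable.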
Calculation of the contour response. We construct a fixed-parameter convolution layer with the Sobel operator as its convolution kernel to convolve the predicted mask $M_p$ and the ground-truth mask $M_g$ in both the $x$ and $y$ directions to obtain the predicted contour response $C_p$ and the ground-truth contour response $C_g$, respectively:
(3) $C_x = \lvert M \ast S_x \rvert$

(4) $C_y = \lvert M \ast S_y \rvert$

(5) $C = C_x + C_y$
where $\ast$ is the standard convolution operation, $\lvert \cdot \rvert$ is the absolute value, $S_x$ and $S_y$ are the Sobel kernels, and $M$ stands for either the predicted mask or the ground-truth mask. Typical contour response images are shown in Fig. 3, where the first rows of Fig. 3(a) and Fig. 3(b) respectively show the predicted mask responses and the ground-truth masks. The second rows show their corresponding contour responses.
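The Sobel-based contour extraction can be sketched as follows (a minimal NumPy version with a naive convolution; the paper implements this as a fixed-parameter convolution layer, and the helper names are ours):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d(img, kernel):
    """Naive 'same' correlation with zero padding, stride 1."""
    k = kernel.shape[0] // 2
    padded = np.pad(img, k)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + 2 * k + 1, j:j + 2 * k + 1] * kernel)
    return out

def contour_response(mask):
    """|mask * Sx| + |mask * Sy|: non-zero only near the mask boundary."""
    return np.abs(conv2d(mask, SOBEL_X)) + np.abs(conv2d(mask, SOBEL_Y))

# A 3x3 square mask inside a 7x7 image.
mask = np.zeros((7, 7)); mask[2:5, 2:5] = 1.0
c = contour_response(mask)
```

Interior and far-away pixels give zero response, while pixels on the mask boundary give a positive response.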
DTI of the contour response. The values of pixels in a DTI that are far away from the contour may be unstable, and can thus interfere with the optimization process. In order to make contour loss focus on optimizing the object's contour parts, we use a threshold k to truncate the DTI of the contour response. In other words, pixel values of the DTI exceeding k are set to k, and the resulting image is called the k-step DTI. The last row of Fig. 3 shows the computed k-step DTIs. We can see that pixel values close to the object's contour are smaller (shown darker), while those far away from the contour are larger (shown brighter). The white areas indicate that their pixel values have reached the truncation threshold k.
Note that the computation of the k-step DTI must be differentiable, otherwise contour loss cannot be back-propagated during the network training phase. Therefore, we design an approximated differentiable module of the k-step DTI under modern neural network structures, which is referred to as the kSDT (k-Step Distance Transformation) algorithm (see Subsection III-B for the specific principle and implementation details). We apply kSDT to the predicted contour response $C_p$ and the ground-truth contour response $C_g$ to compute the predicted k-step DTI $D_p$ and the ground-truth k-step DTI $D_g$, respectively:

(6) $D_p = \mathrm{kSDT}(C_p)$

(7) $D_g = \mathrm{kSDT}(C_g)$
Computation of the contour loss. Each pixel value of a k-step DTI represents the distance between that pixel and the closest point of the contour, which can be used to measure the difference between two contours. To compute contour loss, we firstly accumulate the coverage values of one contour response image onto the k-step DTI of the other contour response image. Then, we regard the average of the two normalized coverage values as the difference measure between the predicted contour response and the ground-truth contour response. During the training stage, when the predicted contour response deviates from the ground-truth contour response, contour loss optimizes and corrects the predicted mask response, making the predicted mask obtained at the inference stage more accurate and clearer near the contour. The specific principle and implementation details of contour loss are given in Subsection III-C.
Joint training with contour loss. Numerous studies have shown that multi-task learning performs better than single-task learning. Thus, we define a multi-task loss for each training batch, which is expressed as follows:
(8) $L = L_{cls} + L_{box} + L_{mask} + L_{contour}$

where the classification loss $L_{cls}$, the box regression loss $L_{box}$, and the mask loss $L_{mask}$ are the same as those in Mask R-CNN, and $L_{contour}$ is the proposed contour loss (see Subsection III-C).
III-B k-Step DTI
Before presenting the approximated differentiable implementation of the proposed k-step DTI under modern neural networks, we first review the concept of image distance transformation. Image distance transformation is a classical technique in computer vision which has already been implemented in OpenCV, MATLAB, and other common tools. A DTI is the gray image obtained by applying the image distance transformation operation to an input binary image whose foreground pixel value is 1 and background pixel value is 0. Each pixel value of the DTI represents the distance between that pixel and the closest background pixel of the input binary image. Denote $I(p)$ as the pixel value of the input binary image at pixel $p$ and $D(p)$ as the pixel value of its DTI at $p$. Naturally, $D(p)$ is 0 when $I(p)$ equals 0, and $D(p)$ is greater than 0 when $I(p)$ equals 1. In addition, $D(p)$ is small when $p$ is close to the background region of the binary image, and large when $p$ is far away from the background region. Fig. 4 shows a binary image (actually a binary mask of a car) and its DTI, shown as a heat map. The brighter a pixel on the heat map, the larger its value, and vice versa.
Different from the common DTI above, the calculation of the k-step DTI needs some minor changes. Given an initial binary contour image, points belonging to the contour are foreground and the rest are background. We then invert it so that points belonging to the contour become background and the rest become foreground. Finally, we apply the image distance transformation operation to the inverted binary image and use a truncation threshold k to obtain the expected k-step DTI. The acquired k-step DTI describes the closest distance to the contour for each pixel, which can effectively be used to measure the difference between contours.
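Conceptually, the truncated transform can be computed exactly by brute force (a NumPy sketch for illustration only; the paper's differentiable module replaces this with iterated dilations, and the function name is ours):

```python
import numpy as np

def k_step_dti(contour, k):
    """Exact truncated distance transform: each pixel gets its Euclidean
    distance to the nearest contour pixel, clipped at k. O(H*W*P) brute
    force; assumes at least one contour pixel is present."""
    ys, xs = np.nonzero(contour)
    pts = np.stack([ys, xs], axis=1).astype(float)
    H, W = contour.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            d = np.sqrt(((pts - [i, j]) ** 2).sum(axis=1)).min()
            out[i, j] = min(d, k)
    return out

# Single contour point at the centre of a 5x5 image, truncated at k = 2.
contour = np.zeros((5, 5)); contour[2, 2] = 1
d = k_step_dti(contour, k=2)
```

Pixels on the contour read 0, their neighbours read their true distance, and everything farther than k saturates at k, matching the white saturated areas in Fig. 3.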
However, the computation of the above k-step DTI involves operations that are not differentiable, and there are no ready-made modules for it in existing deep learning frameworks. To solve this problem, we propose an approximated differentiable implementation of the k-step DTI suitable for current neural networks, called kSDT. Fig. 5 shows the calculation flow chart of kSDT, in which the black solid arrows represent the data flow and the black dashed arrow represents the output of the k-step DTI.
The algorithm takes the contour response in Subsection III-A as the initial input, denoted as $B_0$ (0-mask). By iteratively executing the groups of dilation and accumulation operations in formulas (9) and (10), the k-step DTI with opposite values, denoted as $S$, is obtained. The final k-step DTI $D$ is then calculated by formula (11).
(9) $B_i = \delta(B_{i-1}), \quad i = 1, \dots, k-1$

(10) $S = B_0 \oplus B_1 \oplus \cdots \oplus B_{k-1}$

(11) $D = k - S$

where $\delta$ represents a one-step dilation operator, and $\oplus$ represents element-wise addition. The above calculation process is differentiable except for the dilation operator $\delta$.
To make the whole process differentiable, we further design an approximated differentiable one-step dilation operator. Taking the computation of the 1-mask as an example, the specific calculation process is shown in Fig. 6. The algorithm firstly constructs a fixed-parameter convolution layer with the smoothing operator in formula (12) as its convolution kernel and convolves the input once. Then, formula (1) is utilized to approximately binarize the smoothed image to get the expected dilated image (1-mask). Here, we set $w$ to 20 and $t$ to 0.1. By taking the dilated image as the input of the next stage, the dilation result of each stage (k-mask) can be obtained iteratively. Note that the input of the one-step dilation operator is not restricted to binary images, making the whole k-step DTI module (kSDT) compatible with continuous response maps. The calculation process of kSDT is summarized in Algorithm 1.
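The iterative dilate-and-accumulate scheme can be sketched as follows. For clarity this sketch uses a hard 8-connected dilation (so the result is an exact truncated Chebyshev distance and easy to check); in the paper's differentiable module the hard maximum is replaced by the smoothing convolution plus the steep sigmoid of formula (1). All names are illustrative:

```python
import numpy as np

def dilate(mask):
    """Hard one-step 8-connected dilation via shifted maxima. The
    differentiable variant would smooth with a 3x3 kernel and apply a
    steep sigmoid instead of this hard max."""
    H, W = mask.shape
    p = np.pad(mask, 1)
    out = mask.copy()
    for i in range(3):
        for j in range(3):
            out = np.maximum(out, p[i:i + H, j:j + W])
    return out

def ksdt(contour, k):
    """Truncated distance transform of a binary contour image:
    accumulate the contour plus its first k-1 dilations, then invert,
    so contour pixels read 0 and pixels >= k steps away read k."""
    acc = contour.astype(float).copy()
    cur = contour.astype(float)
    for _ in range(k - 1):
        cur = dilate(cur)
        acc = acc + cur
    return k - acc

# Single contour point at the centre of a 7x7 image, k = 2.
contour = np.zeros((7, 7)); contour[3, 3] = 1.0
d = ksdt(contour, k=2)
```

A pixel at Chebyshev distance d from the contour is covered by exactly k − d of the accumulated masks (for d ≤ k), so subtracting the accumulation from k recovers the truncated distance.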
(12) $K = \dfrac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$
III-C Contour Loss Function
Contour loss can be measured by the distance between the predicted contour and the ground-truth contour. Mathematically, let $C_g$ and $C_p$ be the ground-truth contour and the predicted contour, respectively. Let $p$ represent a contour point on $C_p$; then the distance between $p$ and $C_g$ is usually defined as the distance between $p$ and its closest ground-truth contour point:

(13) $d(p, C_g) = \min_{q \in C_g} \lVert p - q \rVert$

Let $D_g$ and $D_p$ be the DTIs of $C_g$ and $C_p$, respectively. According to the process of computing a DTI, the distance between $p$ and $C_g$ in formula (13) is equal to the coverage value of $p$ on $D_g$:

(14) $d(p, C_g) = D_g(p)$

Therefore, the distance between the predicted contour and the ground-truth contour can be computed based on the DTIs:

(15) $d(C_p, C_g) = \dfrac{1}{2}\left( \dfrac{1}{N_p} \sum_{p \in C_p} D_g(p) + \dfrac{1}{N_g} \sum_{q \in C_g} D_p(q) \right)$

where $N_p$ and $N_g$ are the numbers of contour points on $C_p$ and $C_g$, respectively.
In order to ensure differentiability, we design a continuous version of formula (15) to calculate contour loss, which employs the continuous contour responses and their k-step DTIs:

(16) $L_c = \dfrac{1}{2}\left( \dfrac{\mathrm{GAP}(C_p \odot D_g)}{\mathrm{GAP}(C_p) + \epsilon} + \dfrac{\mathrm{GAP}(C_g \odot D_p)}{\mathrm{GAP}(C_g) + \epsilon} \right)$

where $\odot$ represents the Hadamard product, $\mathrm{GAP}(\cdot)$ represents global average pooling, and $\epsilon$ is a smoothing term to avoid division by zero.
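Under our reading of the loss, the per-instance computation can be sketched in NumPy as follows (global average pooling reduces to a plain mean over the map; the function names, toy DTIs, and exact normalization are our assumptions):

```python
import numpy as np

def contour_loss(c_pred, c_gt, d_pred, d_gt, eps=1e-6):
    """Symmetric coverage of each contour response on the other side's
    k-step DTI, normalized by the contour response mass. Zero when the
    two contours coincide, since contour pixels sit where the DTI is 0."""
    term_pg = (c_pred * d_gt).mean() / (c_pred.mean() + eps)
    term_gp = (c_gt * d_pred).mean() / (c_gt.mean() + eps)
    return 0.5 * (term_pg + term_gp)

# Toy ground truth: one contour pixel at (2, 2); its k-step DTI (k = 2,
# Chebyshev distance, built by hand) is 0 on the contour, 1 on the
# 8-neighbours, and saturates at 2 elsewhere.
c_gt = np.zeros((5, 5)); c_gt[2, 2] = 1.0
d_gt = np.full((5, 5), 2.0); d_gt[1:4, 1:4] = 1.0; d_gt[2, 2] = 0.0

# Prediction shifted one pixel to the right, with its own hand-built DTI.
c_pred = np.zeros((5, 5)); c_pred[2, 3] = 1.0
d_pred = np.full((5, 5), 2.0); d_pred[1:4, 2:5] = 1.0; d_pred[2, 3] = 0.0

loss_same = contour_loss(c_gt, c_gt, d_gt, d_gt)       # identical contours
loss_shift = contour_loss(c_pred, c_gt, d_pred, d_gt)  # shifted contour
```

Identical contours give zero loss, while a one-pixel shift yields a loss of about 1, i.e. the average distance between the two contours, which matches the interpretation of formula (15).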
Assuming that a total of $N$ positive samples are obtained in a training batch, the final contour loss function can be expressed as:

(17) $L_{contour} = \dfrac{1}{N} \sum_{n=1}^{N} L_c^{(n)}$
Algorithm 2 provides the detailed calculation process of contour loss, where $R$ represents the predicted mask response, $M_p$ and $M_g$ represent the predicted mask and the ground-truth mask respectively, $\mathrm{Sobel}(\cdot)$ represents the convolution operation with the Sobel kernel, and $L_{contour}$ represents contour loss.
IV Experiments
IV-A Dataset and Metrics
In order to verify the effectiveness and the generalization ability of the proposed contour loss, we conducted extensive experiments on COCO, a widely used benchmark dataset for common-object instance segmentation. This dataset is challenging due to its large number of target categories and wide range of object scales. Fig. 7 shows some typical training images with polygon mask annotations. We use COCO 2014, which includes 82783 training images and 40504 validation images. We train models on the whole training set and report results on the mini validation set, which contains 5000 images. We use the standard COCO AP metrics to evaluate all models: mAP, AP50, AP75, APs, APm, and APl.
Method | mAP | AP50 | AP75 | APs | APm | APl | k
baseline | 34.28 | 55.94 | 36.20 | 15.82 | 36.75 | 50.84 | -
+ contour loss | 34.39 | 55.88 | 36.46 | 15.81 | 36.98 | 51.26 | 1
+ contour loss | 34.54 | 56.16 | 36.54 | 16.26 | 36.95 | 51.42 | 2
+ contour loss | 34.39 | 56.10 | 36.19 | 15.74 | 36.74 | 51.18 | 3
+ contour loss | 34.36 | 55.89 | 36.42 | 15.82 | 36.76 | 51.13 | 4
+ contour loss | 34.36 | 56.02 | 36.37 | 15.99 | 36.79 | 50.99 | 5
+ contour loss | 34.18 | 55.91 | 36.13 | 15.92 | 36.50 | 50.74 | 6
IV-B Implementation Details and Experimental Setup
We implement contour loss on top of Mask R-CNN for its efficiency and good performance. We use Res50+FPN [24] as the backbone network by default. Our code is based on the open-source project maskrcnn-benchmark [25]. We initialize the backbone network with weights pretrained on ImageNet [26]. We train the instance segmentation network for a total of 180K iterations. We set the initial learning rate to 0.01 and reduce it by factors of 0.1 and 0.01 after 120K and 160K iterations, respectively. We train all models on 4 NVIDIA 2080Ti GPUs using SGD with 8 images per mini-batch. Unless specified, the input image is resized so that its shorter side has 800 pixels and its longer side is at most 1333 pixels. Other hyperparameters are kept consistent with the open-source project. For larger backbones, we follow the linear scaling rule [27] to adjust the learning rate schedule when decreasing the mini-batch size.

It is noteworthy that contour loss usually plays an auxiliary role. In other words, we only enable contour loss after some iterations of the original algorithm. Specifically, we firstly train the original Mask R-CNN to 120K iterations (saved as a checkpoint), then enable contour loss (loading the saved checkpoint), and continue training to 180K iterations. Naturally, our baseline is the Mask R-CNN trained from 120K iterations (loading the saved checkpoint) to 180K iterations without contour loss. On one hand, this reduces the verification time of the proposed method. On the other hand, it may prevent instability of the mask branch caused by contour loss at the early training stage.
Method | mAP | AP50 | AP75 | APs | APm | APl
baseline | 34.28 | 55.94 | 36.20 | 15.82 | 36.75 | 50.84
+ MSE edge loss | 34.33 | 55.87 | 36.39 | 15.79 | 36.87 | 51.02
+ MSE contour loss | 34.42 | 55.95 | 36.55 | 15.88 | 37.01 | 51.00
+ contour loss | 34.54 | 56.16 | 36.54 | 16.26 | 36.95 | 51.42
The highest value in each column is shown in bold, and the second highest value is underlined.
Method | Backbone | CL | mAP | AP50 | AP75 | APs | APm | APl
MR | Res50+FPN |  | 34.28 | 55.94 | 36.20 | 15.82 | 36.75 | 50.84
MR | Res50+FPN | ✓ | 34.54 | 56.16 | 36.54 | 16.26 | 36.95 | 51.42
MR | Res101+FPN |  | 35.79 | 58.02 | 38.16 | 16.75 | 38.70 | 53.09
MR | Res101+FPN | ✓ | 35.96 | 58.04 | 38.35 | 16.52 | 38.89 | 53.50
MR | ResX101 |  | 38.16 | 61.00 | 40.93 | 18.61 | 40.91 | 55.27
MR | ResX101 | ✓ | 38.29 | 61.16 | 41.19 | 18.54 | 41.01 | 55.67
HTC* | Res50+FPN |  | 37.7 | 59.1 | 40.2 | 19.3 | 40.4 | 53.4
HTC* | Res50+FPN | ✓ | 37.9 | 59.3 | 40.3 | 19.2 | 40.6 | 53.1
*Note that the implementation of HTC is based on mmdetection [28]. We train it on COCO 2017 train (115K images) and report results on COCO 2017 val (5K images).
IV-C Evaluation of Hyperparameter k
An important parameter in the calculation of contour loss is k. We explored the impact of different values of k on mask accuracy, selecting six values for the experiments: 1, 2, 3, 4, 5, and 6. The results are summarized in Table I. From the table we can see that contour loss is not sensitive to k in the range of 1 to 5. Under the auxiliary supervision of contour loss, most of the evaluation metrics of the baseline algorithm are improved to a certain extent. Contour loss achieves maximum gains of 0.26% mAP, 0.22% AP50, 0.34% AP75, 0.44% APs, 0.2% APm, and 0.58% APl, respectively. We set k to 2 in the following experiments.
Fig. 8 visualizes the data of Table I. The horizontal axis represents the value of k, and the vertical axis represents mask accuracy (expressed as decimals). Each subplot shows the comparison under one metric. It can be seen from the figures that, under the auxiliary supervision of contour loss, mask accuracy is improved on most metrics.
IV-D Ablation Study
In order to further verify the performance of contour loss, we choose two alternatives for ablation experiments.

MSE Edge Loss: We utilize the MSE loss to calculate the distance between the predicted contour response $C_p$ and the ground-truth contour response $C_g$:

(18) $L_{edge} = \dfrac{1}{N} \sum_{n=1}^{N} \mathrm{MSE}\big(C_p^{(n)}, C_g^{(n)}\big)$

where $N$ is the total number of positive samples.

MSE Contour Loss: We utilize the MSE loss to reduce the error between the predicted k-step DTI $D_p$ and the ground-truth k-step DTI $D_g$:

(19) $L_{dti} = \dfrac{1}{N} \sum_{n=1}^{N} \mathrm{MSE}\big(D_p^{(n)}, D_g^{(n)}\big)$

where $N$ is the total number of positive samples.
Experimental results are summarized in Table II, whose last line represents the proposed contour loss. From the table we can observe that: (1) compared with the baseline, all three loss functions improve mask accuracy; (2) contour loss is superior to the other two loss functions and achieves the best mask accuracy.
IV-E Comparative Study
In order to verify the generalization ability of contour loss, we conduct comparative experiments on Mask R-CNN with different backbones and on HTC with a Res50+FPN backbone. Experimental results are summarized in Table III. The proposed contour loss brings gains of 0.13%-0.26% mAP, 0.16%-0.22% AP50, 0.26%-0.34% AP75, up to 0.44% APs, 0.1%-0.2% APm, and 0.4%-0.58% APl on Mask R-CNN. HTC with contour loss achieves gains of 0.2% mAP, 0.2% AP50, 0.1% AP75, and 0.2% APm. Thus, contour loss is effective for different instance segmentation methods.
IV-F Qualitative Analysis
Fig. 9 shows the qualitative segmentation results of Mask R-CNN (top row) and "Mask R-CNN + Contour Loss" (bottom row), both with the Res50+FPN backbone. For the convenience of comparison, we only show the contours of the predicted masks. Comparing the areas indicated by the red dashed arrows, we can see that object masks segmented by our method have more accurate and clearer contours, which demonstrates the effectiveness of the proposed method.
V Conclusions
In this paper, we introduce the classic distance transformation image (DTI) into instance segmentation. We propose a contour loss function based on the designed differentiable k-step DTI to specifically optimize the contour parts of the predicted masks. Contour loss can be effectively integrated into existing instance segmentation methods and combined with their original loss functions to obtain more accurate and clearer masks. The proposed method neither modifies the original network structure nor adds training parameters, and thus has strong versatility. Experiments on COCO show that contour loss is effective and can further improve the performance of current instance segmentation methods. In future work, we will explore applying contour loss to instance segmentation of unseen objects.
Acknowledgment
This work is partly supported by the National Natural Science Foundation of China (Grant No. U19B2033, Grant No. 62076020) and the National Key R&D Program (Grant No. 2019YFF0301801).
References
[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[2] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in ICCV, 2017.
[3] L. Ye, Z. Liu, and Y. Wang, "Depth-aware object instance segmentation," in ICIP, 2017.
 [4] H. Y. Kim and B. R. Kang, “Instance segmentation and object detection with bounding shape masks,” arXiv preprint arXiv:1810.10327, 2018.
[5] X. Zhou, J. Zhuo, and P. Krähenbühl, "Bottom-up object detection by grouping extreme and center points," in CVPR, 2019.
 [6] G. Borgefors, “Distance transformations in digital images,” Comput. Vis. Graphics Image Process., 1986.
[7] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
 [8] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
[9] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, "Instance-sensitive fully convolutional networks," in ECCV, 2016.
[10] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, "Fully convolutional instance-aware semantic segmentation," in CVPR, 2017.
[11] V. Q. Pham, S. Ito, and T. Kozakaya, "Biseg: Simultaneous instance segmentation and semantic segmentation with fully convolutional networks," in BMVC, 2017.
[12] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, "SOLO: Segmenting objects by locations," in ECCV, 2020.
 [13] T. Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in CVPR, 2017.
[14] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., 2016.
[15] C. Y. Fu, M. Shvets, and A. C. Berg, "RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free," arXiv preprint arXiv:1901.03353, 2019.
 [16] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path Aggregation Network for Instance Segmentation,” in CVPR, 2018.
[17] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, "Mask Scoring R-CNN," in CVPR, 2019.
 [18] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy and D. Lin, “Hybrid task cascade for instance segmentation,” in CVPR, 2019.
 [19] B. R. Kang, H. Lee, K. Park, H. Ryu, and H. Y. Kim, “BshapeNet: Object detection and instance segmentation with bounding shape masks,” Pattern Recognit. Lett., 2020.
 [20] R. S. Zimmermann and J. N. Siems, “Faster training of Mask RCNN by focusing on instance boundaries,” Comput. Vis. and Image Underst., 2019.
 [21] J. Kittler, “On the accuracy of the Sobel edge detector,” Image and Vision Computing., 1983.
[22] Z. Hayder, X. He, and M. Salzmann, "Boundary-aware instance segmentation," in CVPR, 2017.
[23] T. Cheng, X. Wang, L. Huang, and W. Liu, "Boundary-preserving Mask R-CNN," in ECCV, 2020.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.

[25] maskrcnn-benchmark: Faster R-CNN and Mask R-CNN in PyTorch 1.0. https://github.com/facebookresearch/maskrcnn-benchmark, 2019. [Online; 2019116].
[26] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. F. Li, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[27] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Q. Jia, and K. He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[28] MMDetection. https://github.com/open-mmlab/mmdetection, 2019. [Online; accessed 2020-12-16].