ContourRender: Detecting Arbitrary Contour Shape For Instance Segmentation In One Pass

06/07/2021 ∙ by Tutian Tang, et al. ∙ Shanghai Jiao Tong University

Direct contour regression for instance segmentation is a challenging task. Previous works usually achieve it by learning to progressively refine the contour prediction or by adopting a shape representation with limited expressiveness. In this work, we argue that the difficulty in regressing the contour points in one pass is mainly due to the ambiguity that arises when discretizing a smooth contour into a polygon. To address the ambiguity, we propose a novel differentiable rendering-based approach named ContourRender. During training, it first predicts a contour generated by an invertible shape signature, and then optimizes the contour against the more stable silhouette by converting it to a contour mesh and rendering the mesh to a 2D map. This method significantly improves the contour quality without iterations or cascaded refinements. Moreover, as no optimization is needed during inference, the inference speed is not affected. Experiments show the proposed ContourRender outperforms all contour-based instance segmentation approaches on COCO, while staying competitive with the iteration-based state-of-the-art on Cityscapes. In addition, we specifically select a subset from COCO val2017 named COCO ContourHard-val to further demonstrate the contour quality improvements. Code, models, and the dataset split will be released.


1 Introduction

Instance segmentation is one of the fundamental tasks in computer vision. Aside from object detection, it also needs to predict the shape of the objects of interest, which is usually represented by masks He et al. (2017); Huang et al. (2019); Liu et al. (2018); Kuo et al. (2019); Chen et al. (2019a); Wang et al. (2020) or contours Xu et al. (2019); Xie et al. (2020); Peng et al. (2020); Castrejon et al. (2017); Acuna et al. (2018). While the former is widely adopted by mainstream research, contour-based approaches draw increasing attention, mostly due to their explicit control over the shape, which is beneficial to applications such as automatic annotation Castrejon et al. (2017); Acuna et al. (2018); Ling et al. (2019), autonomous driving Homayounfar et al. (2020); Wang et al. (2019); Peng et al. (2020), and medical image analysis Gur et al. (2019); Schmidt et al. (2018). Since a contour shape (polygon) is a closed curve consisting of several ordered points, predicting the contour shape is essentially regressing the point coordinates. As the coordinates can have a large numeric range and are intrinsically unstable (see Fig. 1), previous works usually regress them cumbersomely during inference with iterative optimizations Peng et al. (2020); Castrejon et al. (2017); Acuna et al. (2018); Cheng et al. (2019); Akbari et al. (2021); Hatamizadeh et al. (2020); Gur et al. (2019), cascaded optimizations Upschulte et al. (2021), or compromised representations Xu et al. (2019); Schmidt et al. (2018); Xie et al. (2020). In this work, we explore regressing a decent contour in one pass without limiting the shape representation.

Figure 1: The ambiguity of the contour point coordinates. Left: A continuous circle can be discretized in different ways. The inner areas of the discretized contours are similar but the coordinates are very different. Right: The angle-fixed polar coordinate representation (Xu et al. (2019); Xie et al. (2020); Schmidt et al. (2018)) is rather stable but compromises expressiveness on concave shapes.

Based on the observation from Fig. 1, it is natural to think that optimizing the inner content is more stable than directly regressing the point coordinates. To achieve this, we introduce two designs, namely the coordinate signature and the contour mesh. A coordinate signature is transformed from the point coordinates and introduced to facilitate the learning of coordinate information; thus it should be compact, robust, and invertible. Discrete cosine transform (DCT) coefficients [12] are adopted as the coordinate signature. Discussion about DCT coefficients and other choices of coordinate signatures can be found in Sec. 3.1. However, since the coordinate signature is only an interpretation of the coordinates and cannot address the problem of discretization ambiguity, we further convert the reconstructed coordinates to a contour mesh and render the contour mesh to a silhouette with an off-the-shelf differentiable renderer Kato et al. (2018). The scheme of contour mesh construction is discussed in Sec. 3.2. Since the differentiable renderer plays the central role, we name the proposed contour regression pipeline ContourRender. To show that ContourRender is independent of the object detector design, we equip it with both the popular one-stage detector FCOS Tian et al. (2019) and the two-stage Mask R-CNN He et al. (2017). The modification details are described in Sec. 3.3.1.

To quantitatively evaluate the proposed ContourRender, we conduct experiments on the challenging COCO Lin et al. (2014) and Cityscapes Cordts et al. (2016) datasets. To show the advantages on predicting complex contours, we select 358 images from COCO val2017 according to shape convexity Qi et al. (2019) to form the COCO ContourHard-val subset. Our approach outperforms all the contour-based methods. Since the differentiable renderer is not used in the inference phase, the computational overhead to produce the contour mainly originates from the coordinate signature prediction and the inverse transformation, which is negligible since it can be easily parallelized.

We summarize our contributions in two aspects. First, we propose a novel contour-based approach for instance segmentation named ContourRender. ContourRender achieves accurate contour prediction through the DCT coordinate signature and differentiable contour mesh rendering, without the need for iterative regression or compromises on the contour representation. Second, ContourRender adapted from either Mask R-CNN or FCOS outperforms previous contour-based baselines on COCO, Cityscapes, and our proposed challenging COCO ContourHard-val subset.

2 Related Works

Contour-based Instance Segmentation

For many applications, the contour representation is of particular interest due to its ability to explicitly describe the shape with points. Though we can certainly convert the object masks produced by mask-based instance segmentation frameworks He et al. (2017); Huang et al. (2019); Liu et al. (2018); Kuo et al. (2019); Chen et al. (2019a); Wang et al. (2020) to contours with post-processing Bradski (2000), it is still worthwhile to explore an end-to-end scheme to obtain the contour. Such attempts can be roughly divided into three categories based on their inference behaviors: one-pass approaches, iterative approaches, and cascaded approaches.

One-pass approach. Since the coordinate representation is hard to regress directly, ESE-Seg Xu et al. (2019) adopted a polar-coordinate representation and approximated the radial function with Chebyshev polynomials. Later, PolarMask Xie et al. (2020) took a similar design, with a polar IOU loss which directly optimizes the contour IOU. The polar-coordinate system enables sampling the contour points along a fixed angle sequence, which largely reduces the ambiguity of the Cartesian coordinate system (from 2 DoFs to 1 DoF). However, it also limits the expressive ability on concave shapes (Fig. 1). Another attempt is to regress from a pre-defined point set anchor, as in Wei et al. (2020). However, according to Appendix 6.6, the statistics of the offsets from the pre-defined point set anchor and from the origin point show no significant difference.

Iterative approach. A well-known iterative approach is the active contour Kass et al. (1988). Since Marcos et al. (2018) incorporated the active contour into an end-to-end learning framework, many improvements have been proposed Cheng et al. (2019); Akbari et al. (2021); Hatamizadeh et al. (2020); Peng et al. (2020); Gur et al. (2019). Among them, Gur et al. (2019) integrated the active contour with a differentiable renderer. The idea is similar to ours, but their approach requires iterations and the authors only apply it to rather simple shapes like buildings or road scenes. DeepSnake Peng et al. (2020) proposed a circular convolution operator adapted to the polygon representation; it achieves state-of-the-art performance among contour-based instance segmentation frameworks on several benchmarks. Aside from the active contour-based methods, by viewing the contour from different perspectives, RNNs (treating it as an ordered point sequence) Castrejon et al. (2017); Acuna et al. (2018) and GCNs (treating it as a point graph) Ling et al. (2019) have also been adopted to address the polygon regression problem. They all require iterative operations to achieve satisfactory results.

Cascaded approach. Unlike the iterative approaches that refine the prediction progressively with shared modules, a cascaded approach usually designs individual components for prediction refinement. In Upschulte et al. (2021), after the contour proposals are generated, the coarse prediction is improved by a local refinement process where the refinement offsets are learned by a dedicated component.

Differentiable Renderer

The rendering process can be viewed as a mapping from a mesh to an image. Many factors can be taken into consideration when making rendering differentiable, such as gradient approximation Loper and Black (2014); Kato et al. (2018); Genova et al. (2018); Kato and Harada (2019), rasterization Rhodin et al. (2015); Liu et al. (2019); Chen et al. (2019c), and ray-tracing Li et al. (2018); Zhang et al. (2019); Loubet et al. (2019); Zhang et al. (2020); Nimier-David et al. (2020). As the silhouette is all we need, rasterization and ray-tracing-based neural renderers are overly complicated for our purpose. Thus we adopt a simple renderer Kato et al. (2018), which obtains differentiable gradients by sampling in a smooth manner. On the other hand, point cloud-based neural renderers Roveri et al. (2018); Wiles et al. (2020); Yifan et al. (2019) seem more straightforward than meshes with respect to the contour points. However, in practice, we find that the point cloud renderers cannot bring the improvement that the mesh renderers do. We have two speculations on this observation: (1) when the point radius is too large, the coordinate information is blurred within the circles, which introduces another kind of coordinate ambiguity; (2) when the point radius is too small, the coordinate information is predominant, along with the coordinate ambiguity mentioned in Fig. 1. This motivates us to design the contour mesh (Sec. 3.2).

Figure 2: The overall pipeline of ContourRender. The feature map is fed into an MLP to obtain the signature coefficients, and then an inverse transformation is applied to obtain the predicted contour points. Next, the contour points are assembled into the contour mesh, and the mesh is put through the differentiable renderer to obtain a silhouette. All the operations in the ContourRender branch are fully differentiable so that it can be learned end-to-end. The contour mesh construction and the differentiable rendering process can be removed in the inference phase, thus the inference time is not affected.

3 Method

Given an image $I$, a general contour-based instance segmentation framework aims to predict all the object locations $\{b_i\}_{i=1}^{N}$ and the contour shapes $\{C_i\}_{i=1}^{N}$, where $N$ is the number of detected objects and each contour shape consists of $K$ points.

For contour regression, we first predict the discrete cosine transform (DCT) coefficients of the contour shape and obtain a coarse contour with the inverse DCT (iDCT) operation (Sec. 3.1). We then convert the coarse contour to a contour mesh and render the corresponding silhouette (Sec. 3.2).

For the object detection part, we suggest two choices, namely Mask R-CNN He et al. (2017) and FCOS Tian et al. (2019), and demonstrate how to modify them to work with ContourRender in Sec. 3.3. Note that ContourRender can be similarly extended to other object detectors.

If not specified, we set $K = 32$ for the main analysis and experiments. For how the point number affects the performance, please refer to Appendix 6.1. The overall pipeline is illustrated in Fig. 2.

3.1 Coordinate Signature

As mentioned earlier, the coordinate signature should be compact, robust, and invertible to fit into the neural network learning paradigm. There are many candidates, such as Cartesian coordinates, polar coordinates, dictionary coefficients of the Cartesian coordinates, DCT descriptor of the Cartesian coordinates, etc.

In the main paper, we compare the Cartesian coordinates and their dictionary and DCT descriptors to illustrate how we decide on the preferable one before training. The experiments reported in Sec. 4.5 confirm this choice. For the other signatures, please refer to Appendix 6.2.

3.1.1 Computation of the Signatures and the Invertibility

Given a contour shape $C$, the computation of the signatures and the inverse transform back to the reconstructed contour shape $\hat{C}$ are described as follows.

Coordinate Signature. Given a ground-truth bounding box, we take its center $c$ and scale (width and height) $w$, and compute the coordinate signature as $s_{\mathrm{coord}} = (C - c) / w$, i.e. the contour points are zero-centered and normalized by the box size. Inversely, during inference, the exact position and scale are taken from the predicted bounding box and the contour is recovered by $\hat{C} = \hat{s}_{\mathrm{coord}} \cdot w + c$. Note that all operations here are element-wise.
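As a concrete illustration, the following is a minimal NumPy sketch of this box-based normalization; the (K, 2) contour layout and the (x1, y1, x2, y2) box format are assumptions for illustration, not necessarily the paper's exact implementation.

```python
import numpy as np

def coord_signature(contour, box):
    """Zero-center the contour at the box center and normalize by the box size."""
    x1, y1, x2, y2 = box
    center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    scale = np.array([max(x2 - x1, 1e-6), max(y2 - y1, 1e-6)])
    return (contour - center) / scale              # element-wise, shape (K, 2)

def coord_signature_inverse(signature, box):
    """Recover contour points from a signature and a (predicted) box."""
    x1, y1, x2, y2 = box
    center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    scale = np.array([x2 - x1, y2 - y1])
    return signature * scale + center
```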

Dictionary Coordinate Signature. The dictionary signature $s_{\mathrm{dict}}$ is given by optimizing $\min_{s_{\mathrm{dict}}} \| s_{\mathrm{coord}} - D\, s_{\mathrm{dict}} \|$ with Orthogonal Matching Pursuit (OMP) Mallat and Zhang (1993), where $D$ contains the dictionary atoms learned following Mairal et al. (2009). To recover $\hat{C}$, it first converts $\hat{s}_{\mathrm{dict}}$ back to $\hat{s}_{\mathrm{coord}} = D\, \hat{s}_{\mathrm{dict}}$, and then converts $\hat{s}_{\mathrm{coord}}$ to $\hat{C}$ following the same procedure as the coordinate signature.
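A rough sketch of how such a dictionary signature could be computed with scikit-learn; the dictionary size, sparsity level, and stand-in data below are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Flattened coordinate signatures of many training contours, one per row.
# Stand-in random data: 1000 shapes, K = 32 points -> 64 values per shape.
X_train = np.random.randn(1000, 64)

# Learn dictionary atoms offline; encode with OMP at transform time
# (scikit-learn's dictionary learning references Mairal et al. (2009)).
dico = DictionaryLearning(n_components=64, transform_algorithm='omp',
                          transform_n_nonzero_coefs=16, random_state=0)
codes = dico.fit(X_train).transform(X_train)       # dictionary signatures

# Inverse: codes times atoms gives back coordinate signatures, which are
# then un-normalized with the predicted box as in the previous sketch.
recon = codes @ dico.components_
```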

DCT Coordinate Signature. The DCT signature is $s_{\mathrm{dct}} = \mathrm{DCT}(s_{\mathrm{coord}})$, where the DCT is applied 1-D along each axis [12]. Similarly, to obtain $\hat{C}$, it first converts $\hat{s}_{\mathrm{dct}}$ to $\hat{s}_{\mathrm{coord}} = \mathrm{iDCT}(\hat{s}_{\mathrm{dct}})$, and then converts $\hat{s}_{\mathrm{coord}}$ to $\hat{C}$ in the same way as the coordinate signature.
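A minimal sketch of the DCT signature and its inverse using SciPy; the DCT type and normalization are assumptions here, as the paper only specifies a 1-D DCT per axis.

```python
import numpy as np
from scipy.fft import dct, idct

def dct_signature(coord_sig):
    """1-D DCT along the point dimension, applied to x and y separately."""
    return dct(coord_sig, type=2, norm='ortho', axis=0)    # (K, 2) coefficients

def dct_signature_inverse(sig, keep=None):
    """Inverse DCT; optionally keep only the first `keep` coefficients per axis."""
    sig = np.asarray(sig, dtype=float).copy()
    if keep is not None:
        sig[keep:] = 0.0                                   # truncated, compact signature
    return idct(sig, type=2, norm='ortho', axis=0)
```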

It is noteworthy that some common choices of contour signatures Yang et al. (2012), such as the centroid distance function and the contour curvature, are not invertible.

3.1.2 The Representation Power and Compactness

Compactness means that the signature should be able to aggregate most information into a few coefficients with a constrained numeric range. Empirically, we find that a more compact signature is easier to learn. Unfortunately, we do not have a rigorous theory for this observation, but some discussions are made in Appendix 6.3. We conduct an offline analysis on the COCO val2017 dataset, which consists of 36781 instance shapes of 80 categories. We evaluate the three signatures on their reconstruction ability with respect to the valid coefficient length (number of non-zero coefficients), as shown in Fig. 3. The reconstruction error is defined as $1 - \mathrm{mIOU}$, where the mIOU is calculated by comparing the ground truth mask with the reconstructed mask obtained by filling the contour polygon and resizing to the original size. When reducing the valid coefficient length, we gradually set the coefficients to zero from the last elements to the first.
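The offline analysis can be approximately reproduced with the sketch below, which rasterizes a reconstructed contour with OpenCV and compares it against the ground truth mask; it assumes the 1 - mIOU error definition stated above.

```python
import cv2
import numpy as np

def reconstruction_iou(gt_mask, contour_points):
    """IoU between a ground-truth binary mask and the mask filled from a contour."""
    h, w = gt_mask.shape
    recon = np.zeros((h, w), dtype=np.uint8)
    pts = np.round(contour_points).astype(np.int32).reshape(-1, 1, 2)
    cv2.fillPoly(recon, [pts], 1)                      # rasterize the polygon
    inter = np.logical_and(gt_mask > 0, recon > 0).sum()
    union = np.logical_or(gt_mask > 0, recon > 0).sum()
    return inter / max(union, 1)

# Example: reconstruction error of a DCT signature truncated to 8 coefficients.
# coord = dct_signature_inverse(sig, keep=8); contour = coord_signature_inverse(coord, box)
# err = 1.0 - reconstruction_iou(gt_mask, contour)
```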

Figure 3: Left: Reconstruction error on COCO val2017 w.r.t. the valid coefficient length; Right: mIOU on COCO val2017 under different variances of Gaussian noise.

3.1.3 Sensitivity Analysis and Robustness

Since neural network regression inevitably carries noise, it is informative to test the sensitivity of the signatures. We randomly add Gaussian noise $\mathcal{N}(0, \sigma^2)$ to each coefficient element in the signatures, with $\sigma$ set to 0.1, 0.2, 0.3, 0.4, 0.5. We show how much mIOU loss the Gaussian noise causes for each signature. The results are shown in Fig. 3.
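A possible way to run this sensitivity test, reusing the inverse transforms and the IoU helper from the earlier sketches; the number of trials is an arbitrary choice for illustration.

```python
import numpy as np

def miou_under_noise(signatures, masks, boxes, sigma, trials=5, seed=0):
    """Mean IoU after adding N(0, sigma^2) noise to every signature coefficient."""
    rng = np.random.default_rng(seed)
    ious = []
    for _ in range(trials):
        for sig, mask, box in zip(signatures, masks, boxes):
            noisy = sig + rng.normal(0.0, sigma, size=sig.shape)
            contour = coord_signature_inverse(dct_signature_inverse(noisy), box)
            ious.append(reconstruction_iou(mask, contour))
    return float(np.mean(ious))
```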

3.1.4 On the Choice of the Coordinate Signature

We have compared the three coordinate signatures from three perspectives. All of them are invertible and can reconstruct the original contour shape. While the plain coordinate signature is the most robust to noise, it is not compact enough; thus the DCT signature is adopted. Admittedly, better coordinate signatures may exist; we leave this to future work.

3.1.5 Signature Loss

Given the signature $s$ obtained from the ground truth mask, we can directly supervise the prediction with:

$\mathcal{L}_{sig} = \mathrm{SmoothL1}(\hat{s}, s)$   (1)

where $\hat{s}$ is the predicted signature and $\mathrm{SmoothL1}(\cdot)$ is the smooth L1 loss.
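Eq. (1) corresponds to a one-liner in PyTorch; the sketch below assumes the predicted and ground-truth signatures are already tensors of the same shape.

```python
import torch
import torch.nn.functional as F

def signature_loss(pred_sig, gt_sig):
    """Eq. (1): smooth L1 between predicted and ground-truth signature coefficients."""
    return F.smooth_l1_loss(pred_sig, gt_sig)
```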

After learning with the signature loss, we can already obtain a coarse contour by converting the predicted signature $\hat{s}$ according to the procedures described in Sec. 3.1.1. But as mentioned earlier, signature learning cannot address the ambiguous nature of the coordinates. Thus we propose to further train on the more stable silhouette through differentiable rendering.

3.2 Contour Mesh

According to the discussion in Sec. 2 and the results in Appendix 6.4, we adopt NMR Kato et al. (2018) as the off-the-shelf renderer. In this case, we need to convert the contour into a contour-derived mesh (contour mesh for short) before rendering.

3.2.1 Construction of Contour Mesh

As the differentiable gradient from the neural renderer back-propagates to the vertices, i.e. the contour points, the construction process should avoid perturbing the contour points.

A mesh is composed of 3D vertices and triangle faces. The 3D vertices can be simply obtained by appending an all-zero z coordinate to the contour points, while the ways of constructing the faces need more discussion.

There are two intuitive approaches to construct the contour mesh, i.e. Internal Linking and External Linking. Internal linking means constructing the mesh faces based on the existing contour points, while external linking means we first find a set of the external points and link the contour points with them to form the faces. Both approaches can be conducted with different strategies. Here we give some intuitive thoughts and select one for the main experiments. The illustration of linking strategies is displayed in Fig. 4.

Internal Linking. For internal linking, each neighboring point pair moves forward counter-clockwise to find the first point that forms, together with the pair, a triangle with an area larger than a threshold. The same threshold is used for all the datasets.

External Linking. For external linking, an immediate idea is to run a clustering algorithm such as KMeans MacQueen et al. (1967) to obtain clustering centers, which are most likely to lie inside the contour, and then link the contour points to their corresponding clustering centers to form a contour mesh. This process is denoted as 'External Linking - KMeans'. However, as we would like the linking process to run on the fly, in practice we take another simple and faster approach. As the predicted contour is already zero-centered (because the supervised signature is obtained after zero-centering), we can simply shrink the contour by multiplying it with a scaling factor $s$ less than 1, and link the corresponding points between the original and shrunk contours to form the contour mesh. We use the same scaling factor for all the datasets; different choices of the scaling factor are discussed in Appendix 6.5. This process is named 'External Linking - Shrink'.
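A minimal NumPy sketch of the External Linking - Shrink construction, assuming a zero-centered (K, 2) contour and a shrink factor s; the face orientation and the exact construction in the paper may differ.

```python
import numpy as np

def contour_mesh_shrink(contour, s=0.0):
    """External Linking - Shrink: link a zero-centered contour to a shrunk copy.

    Returns vertices of shape (2K, 3) (z = 0) and triangle faces as an index
    array of shape (2K, 3); two triangles per contour edge.
    """
    K = contour.shape[0]
    outer = np.concatenate([contour, np.zeros((K, 1))], axis=1)
    inner = np.concatenate([contour * s, np.zeros((K, 1))], axis=1)
    vertices = np.concatenate([outer, inner], axis=0)

    faces = []
    for i in range(K):
        j = (i + 1) % K                      # next contour point, wrapping around
        faces.append([i, j, K + j])          # outer_i, outer_j, inner_j
        faces.append([i, K + j, K + i])      # outer_i, inner_j, inner_i
    return vertices, np.asarray(faces, dtype=np.int64)
```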

Figure 4: Different intuitive ways to construct different kinds of contour meshes: (a) Internal Linking; (b) External Linking - KMeans; (c) External Linking - Shrink.

3.2.2 Silhouette Loss

After we transform the contour points to a contour mesh, we can generate the corresponding silhouette with NMR Kato et al. (2018). The rendered silhouette is denoted as $\hat{S}$, while the ground truth contour is put through the same process to obtain $S$. In this way, the silhouette loss is defined by:

$\mathcal{L}_{sil} = \mathcal{L}_{mask}(\hat{S}, S)$   (2)

where $\mathcal{L}_{mask}$ is a mask-based loss. In practice, we adopt the Lovász-Softmax loss Berman et al. (2018), but we also discuss the common choices of MSE loss, BCE loss, and Dice loss in Sec. 4.5.
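The silhouette loss can be sketched as below, with `render_fn` standing in for the contour-mesh construction plus the differentiable renderer and `mask_loss_fn` for Lovász-Softmax (a simple soft-IoU surrogate is shown as a stand-in); both names are placeholders, not the paper's code.

```python
import torch

def soft_iou_loss(pred, target, eps=1e-6):
    """Simple differentiable IoU surrogate; a stand-in for Lovasz-Softmax."""
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return 1.0 - (inter + eps) / (union + eps)

def silhouette_loss(pred_contour, gt_contour, render_fn, mask_loss_fn=soft_iou_loss):
    """Eq. (2): render both contours to silhouettes and compare them.

    `render_fn` is a placeholder wrapping contour-mesh construction plus a
    differentiable renderer (e.g. NMR) that returns an (H, W) soft silhouette.
    """
    pred_sil = render_fn(pred_contour)       # gradients flow through the renderer
    with torch.no_grad():
        gt_sil = render_fn(gt_contour)       # ground truth needs no gradient
    return mask_loss_fn(pred_sil, gt_sil)
```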

3.3 ContourRender with Object Detectors

Figure 5: Head Modification on One-Stage detector FCOS and Two-stage Mask R-CNN.

3.3.1 ContourRender Branch

A generic ContourRender branch is designed as shown in Fig. 2. In this section, we describe how it can be integrated into existing object detectors. For detectors that only predict object locations, like FCOS, we show how to add the ContourRender branch to enable shape prediction. For detection frameworks that can already perform instance segmentation, like Mask R-CNN, we show how to replace the original mask branch with our ContourRender branch. The modifications are shown in Fig. 5.

Add Modification of FCOS. As FCOS only performs object detection, we add the ContourRender branch by taking the same input features as the box head.

Replace Modification of Mask R-CNN. As Mask R-CNN is already an instance segmentation framework, we replace the original mask branch with the proposed ContourRender branch. The original mask features after the RoIAlign operator are directly flattened and fed into the ContourRender branch.
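A rough sketch of what such a branch could look like for the Mask R-CNN case; the RoI feature size, hidden width, and layer count are illustrative assumptions rather than the paper's exact head.

```python
import torch
import torch.nn as nn

class ContourRenderHead(nn.Module):
    """MLP mapping flattened RoI features to per-axis signature coefficients."""

    def __init__(self, in_channels=256, roi_size=14, num_points=32, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * roi_size * roi_size, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_points * 2),   # K coefficients for x and for y
        )

    def forward(self, roi_feats):                # (N, C, H, W) from RoIAlign
        sig = self.mlp(roi_feats)                # (N, 2K)
        return sig.view(sig.shape[0], -1, 2)     # (N, K, 2) signature coefficients
```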

3.3.2 Overall Objective Functions

With the object detector, we denote the object detection objective function as $\mathcal{L}_{det}$; its details for each detector remain the same as in the original papers He et al. (2017); Tian et al. (2019). Therefore, the overall learning objective function for ContourRender is given as follows:

$\mathcal{L} = \mathcal{L}_{det} + \lambda_1 \mathcal{L}_{sig} + \lambda_2 \mathcal{L}_{sil}$   (3)

where $\lambda_1$ and $\lambda_2$ are the adjusting weights for the different loss terms, which are determined by cross-validation and discussed in Sec. 4.1.

4 Experiments

4.1 Implementation Details

We implement our method in two public frameworks, i.e. MMDetection Chen et al. (2019b) and Detectron2 Wu et al. (2019). The default backbone is ResNet-50 with ImageNet-pretrained weights. We train the network for 150K iterations with stochastic gradient descent (SGD), which finishes within 18 hours on 4 RTX Titan GPUs. The initial learning rate is 0.02 and the mini-batch size is 16. The learning rate is reduced to 0.001 and 0.0001 at 60K and 140K iterations, respectively. Weight decay and momentum are set to 0.0001 and 0.9, respectively. To stabilize the training process, we use an initial setting of $\lambda_1$ and $\lambda_2$ during the first 60K iterations; after that, $\lambda_1$ is switched to its final value until the end of training and $\lambda_2$ decays linearly to its final value by 70K iterations. The performance gap between the two frameworks is negligible.

4.2 Datasets

In this work, we will use three datasets to evaluate the performance of the ContourRender on Mask R-CNN and FCOS, namely COCO Lin et al. (2014), Cityscapes Cordts et al. (2016) and COCO ContourHard-val.

COCO

COCO is a large-scale dataset with 860001 shapes for training and 36781 shapes for validation. It has large shape varieties, including a fair amount of separated shapes. To deal with the separated shapes, we simply connect the separate contours by concatenating them into one larger contour.

COCO ContourHard-val

As most COCO object shapes are fairly round, like those in the bear, stop sign, and frisbee categories, even an imperfect contour representation can obtain fairly good performance. Thus, to better compare contour representations, we select a subset from COCO val2017 as the COCO ContourHard-val set. The selection is based on the convexity mentioned in Qi et al. (2019): only images containing object shapes with convexity smaller than 0.4 are selected for validation. As a result, 358 images are chosen, containing 5054 shapes of 77 categories. For per-category statistics, please refer to Appendix 6.7.
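A plausible selection script is sketched below, taking convexity as the ratio of a shape's area to its convex-hull area; whether this matches the exact metric of Qi et al. (2019) is an assumption.

```python
import cv2
import numpy as np

def convexity(mask):
    """Shape area divided by convex-hull area; 1.0 for perfectly convex shapes."""
    mask = (mask > 0).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return 1.0
    cnt = max(contours, key=cv2.contourArea)
    hull_area = cv2.contourArea(cv2.convexHull(cnt))
    return cv2.contourArea(cnt) / hull_area if hull_area > 0 else 1.0

# An image would enter ContourHard-val if any of its instance masks satisfies
# convexity(mask) < 0.4.
```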

Cityscapes

Cityscapes is a popular road-scene dataset with 2975 images (54060 shapes) for training and 500 images (10415 shapes) for validation. It includes 8 categories, mostly people and vehicles on the road.

4.3 Results on COCO

Comparison with Other Baselines

We compare our method with both mask-based and contour-based methods in Table 1. Our method is competitive with mask-based methods; in particular, it only drops 1 AP when the mask branch of Mask R-CNN is replaced with ContourRender. Considering that a contour is rather limited in expressing separated or hollowed shapes, a small performance loss is expected.

With regard to contour-based methods: compared with the one-pass baseline PolarMask Xie et al. (2020) in the same setting with FCOS Tian et al. (2019) as the base detector, our method outperforms it by 1.5 AP on COCO val2017 and by 2.6 AP on COCO ContourHard-val (Table 2). Besides, our method also beats the iterative approach DeepSnake Peng et al. (2020), even though the latter uses iterative optimization to refine the contour while ours does not.

Method Backbone AP AP@50 AP@75 AP@S AP@M AP@L
Mask-based
Mask R-CNN He et al. (2017) Res50FPN 33.6 55.2 35.3 - - -
Mask R-CNN He et al. (2017) Res101FPN 35.7 58.0 37.8 15.5 38.1 52.4
PANet Liu et al. (2018) Res50FPN 36.6 58.0 39.3 16.3 38.1 53.1
SOLOv2 Wang et al. (2020) Res50FPN 38.8 59.9 41.7 16.5 41.7 56.2
Contour-based
DeepSnake Peng et al. (2020) DLA-34 30.3 - - - - -
ESE-Seg Xu et al. (2019) DarkNet-53 21.6 48.7 22.4 - - -
PointSetNet Wei et al. (2020) RX101FPN-DCN 34.6 60.1 34.9 - - -
PolarMask Xie et al. (2020) Res50FPN 29.1 49.5 29.7 12.6 31.8 42.3
PolarMask Xie et al. (2020) Res101FPN 30.4 51.9 31.0 13.4 32.4 42.8
PolarMask Xie et al. (2020) RX101FPN-DCN 36.2 59.4 37.7 17.8 37.7 51.5
Ours (F) Res50FPN 30.6 51.6 30.8 14.7 31.5 40.3
Ours (F) Res101FPN 31.7 53.4 32.2 15.6 33.8 45.4
Ours (F) RX101FPN-DCN 37.9 61.1 38.2 20.3 39.6 52.7
Ours (M) Res50FPN 32.6 54.6 33.8 15.9 34.6 46.7
Ours (M) Res101FPN 34.8 56.8 36.4 18.1 36.8 49.2
Ours (M) RX101FPN-DCN 41.0 62.8 42.5 23.9 43.4 54.1
Table 1: AP metrics on COCO val2017. Backbone: Res50FPN means ResNet-50 He et al. (2016) with the FPN structure Lin et al. (2017); Res101FPN follows analogously; RX means ResNeXt Xie et al. (2017); DCN is the deformable convolution Dai et al. (2017); DarkNet is introduced in Redmon and Farhadi (2018); DLA is used in Peng et al. (2020). Ours (F) means the base detector is FCOS, and Ours (M) means it is adapted from Mask R-CNN.
Comparison with Other Baselines On COCO ContourHard-val

Since this validation set is built specifically for contour comparison, we only compare contour-based methods on it. As shown in Table 2, our method beats the baseline methods by a large margin.

Method Backbone AP AP@50 AP@75
ESE-Seg* Res50FPN 22.3 41.6 22.7
PolarMask Res50FPN 24.3 43.2 24.5
Ours (F) Res50FPN 26.9 45.6 27.1
Ours (M) Res50FPN 27.8 48.3 28.3
Table 2: AP metrics on COCO ContourHard-val. *: We change the base detector of ESE-Seg to FCOS to make a fair comparison.
Method Backbone AP
PolygonRNN++ Acuna et al. (2018) Res50M 22.8
DeepSnake* Peng et al. (2020) DLA-34 28.2
DeepSnake Peng et al. (2020) DLA-34 31.7
Ours (M) Res50FPN 28.8
Table 3: AP metrics on the Cityscapes test split. *: DeepSnake without the cascaded detection used to handle fragmented instances, matching our setting. In this case, even though it performs iterative inference, ours is better. Res50M: a modified ResNet-50 structure.
Figure 6: The qualitative results on COCO test-dev.

4.4 Results on Cityscapes

Since most contour-based approaches report the mIOU metric on Cityscapes, we follow the same training/testing split and evaluation protocol (Table 4). We notice that a few methods Peng et al. (2020); Acuna et al. (2018) report the AP metric on Cityscapes; to compare with them, we also report the AP metric in Table 3. It is interesting to find that though almost all these contour-based methods are iterative approaches, our one-pass approach can outperform most of them.

Method mIOU person rider car truck bus train mcycle bicycle
PolygonRNN Castrejon et al. (2017) 61.4 64.0 60.6 71.2 68.0 69.5 53.7 52.1 52.1
PolygonRNN++ Acuna et al. (2018) 70.2 70.8 68.5 78.0 77.9 79.6 62.8 61.7 62.3
SplineGCN Ling et al. (2019) 72.1 72.5 70.6 80.2 79.1 81.7 65.9 62 64.8
Gur et al. Gur et al. (2019) 75.1 75.0 72.0 82.0 79.6 83.0 74.5 66.5 68.1
Ours (M) 73.5 73.4 69.5 79.8 79.1 83.6 73.1 64.4 65.2
Table 4: Performance (mIOU) on Cityscapes. The baseline methods are all iterative approaches, but our method can outperform most of them.

4.5 Ablative Study

The ablative studies are carried out on the COCO dataset with ContourRender-Mask R-CNN (Res50FPN). More ablative studies can be found in the supplementary materials.

The necessity of the signature learning. If we remove the signature learning process, that is, set $\lambda_1 = 0$ from beginning to end and train with the silhouette loss alone, we find that $\mathcal{L}_{sil}$ will not converge at all.

Performance of different signatures With the Cartesian coordinate signature, we obtain 24.5 AP; With the dictionary signature, we obtain 26.1 AP; With the DCT signature, we obtain 28.1 AP.

Performance of different losses for the silhouette loss. Lovász-Softmax Berman et al. (2018) has proved effective as a surrogate IOU loss, so we adopt it in the ContourRender pipeline. If we substitute it with the common MSE loss, BCE loss, or Dice loss, the AP drops by 4.1, 6.7, and 3.5, respectively.

5 Limitations and Future Works

In this work, we propose a novel contour-based instance segmentation pipeline which enables the learning framework to produce decent contours without iterative or cascaded designs, and without compromising the shape representation as in previous works like ESE-Seg Xu et al. (2019) or PolarMask Xie et al. (2020). Nevertheless, our method still suffers from several drawbacks: (1) the contour representation cannot handle separated or hollowed contours at this moment; (2) the overall performance largely depends on the bounding box detection, which provides the center and the exact scale of the contour; (3) the choices of the coordinate signature and contour mesh are determined by intuition, and may be improved in later works.

References

  • [1] D. Acuna, H. Ling, A. Kar, and S. Fidler (2018) Efficient interactive annotation of segmentation datasets with polygon-rnn++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 859–868. Cited by: §1, §2, §4.4, Table 3, Table 4.
  • [2] P. Akbari, A. Ziaei, and H. Azarnoush (2021) Deep active contours using locally controlled distance vector flow. arXiv preprint arXiv:2105.08447. Cited by: §1, §2.
  • [3] M. Berman, A. R. Triki, and M. B. Blaschko (2018) The lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4413–4421. Cited by: §3.2.2, §4.5.
  • [4] G. Bradski (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §2.
  • [5] L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler (2017) Annotating object instances with a polygon-rnn. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5230–5238. Cited by: §1, §2, Table 4.
  • [6] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. (2019) Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4974–4983. Cited by: §1, §2.
  • [7] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.1.
  • [8] W. Chen, J. Gao, H. Ling, E. J. Smith, J. Lehtinen, A. Jacobson, and S. Fidler (2019) Learning to predict 3d objects with an interpolation-based differentiable renderer. arXiv preprint arXiv:1908.01210. Cited by: §2.
  • [9] D. Cheng, R. Liao, S. Fidler, and R. Urtasun (2019) Darnet: deep active ray network for building segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7431–7439. Cited by: §1, §2.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §1, §4.2.
  • [11] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: Table 1.
  • [12] (2006) Discrete cosine transform (dct). In Encyclopedia of Multimedia, B. Furht (Ed.), pp. 203–205. External Links: ISBN 978-0-387-30038-2, Document, Link Cited by: §1, §3.1.1.
  • [13] K. Genova, F. Cole, A. Maschinot, A. Sarna, D. Vlasic, and W. T. Freeman (2018) Unsupervised training for 3d morphable model regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8377–8386. Cited by: §2.
  • [14] S. Gur, T. Shaharabany, and L. Wolf (2019) End to end trainable active contours via differentiable rendering. arXiv preprint arXiv:1912.00367. Cited by: §1, §2, Table 4, §6.6.
  • [15] A. Hatamizadeh, D. Sengupta, and D. Terzopoulos (2020) End-to-end trainable deep active contour models for automated image segmentation: delineating buildings in aerial imagery. In European Conference on Computer Vision, pp. 730–746. Cited by: §1, §2.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), Cited by: §1, §1, §2, §3.3.2, §3, Table 1.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 1.
  • [18] N. Homayounfar, Y. Xiong, J. Liang, W. Ma, and R. Urtasun (2020) LevelSet r-cnn: a deep variational method for instance segmentation. In European Conference on Computer Vision, pp. 555–571. Cited by: §1.
  • [19] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang (2019-06) Mask scoring r-cnn. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [20] M. Kass, A. Witkin, and D. Terzopoulos (1988) Snakes: active contour models. International journal of computer vision 1 (4), pp. 321–331. Cited by: §2.
  • [21] H. Kato and T. Harada (2019) Learning view priors for single-view 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9778–9787. Cited by: §2.
  • [22] H. Kato, Y. Ushiku, and T. Harada (2018) Neural 3d mesh renderer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3907–3916. Cited by: §1, §2, §3.2.2, §3.2, §6.4.
  • [23] W. Kuo, A. Angelova, J. Malik, and T. Lin (2019-10) ShapeMask: learning to segment novel objects by refining shape priors. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [24] T. Li, M. Aittala, F. Durand, and J. Lehtinen (2018) Differentiable monte carlo ray tracing through edge sampling. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–11. Cited by: §2.
  • [25] T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017-07) Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1.
  • [26] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 740–755. External Links: ISBN 978-3-319-10602-1 Cited by: §1, §4.2.
  • [27] H. Ling, J. Gao, A. Kar, W. Chen, and S. Fidler (2019) Fast interactive object annotation with curve-gcn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5257–5266. Cited by: §1, §2, Table 4.
  • [28] S. Liu, T. Li, W. Chen, and H. Li (2019) Soft rasterizer: a differentiable renderer for image-based 3d reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7708–7717. Cited by: §2, §6.4, §6.4.
  • [29] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Table 1.
  • [30] M. M. Loper and M. J. Black (2014) OpenDR: an approximate differentiable renderer. In European Conference on Computer Vision, pp. 154–169. Cited by: §2, §6.4, §6.4.
  • [31] G. Loubet, N. Holzschuch, and W. Jakob (2019) Reparameterizing discontinuous integrands for differentiable rendering. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–14. Cited by: §2.
  • [32] J. MacQueen et al. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, pp. 281–297. Cited by: §3.2.1.
  • [33] J. Mairal, F. Bach, J. Ponce, and G. Sapiro (2009) Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning, pp. 689–696. Cited by: §3.1.1.
  • [34] S. G. Mallat and Z. Zhang (1993) Matching pursuits with time-frequency dictionaries. IEEE Transactions on signal processing 41 (12), pp. 3397–3415. Cited by: §3.1.1.
  • [35] D. Marcos, D. Tuia, B. Kellenberger, L. Zhang, M. Bai, R. Liao, and R. Urtasun (2018) Learning deep structured active contours end-to-end. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8877–8885. Cited by: §2.
  • [36] M. Nimier-David, S. Speierer, B. Ruiz, and W. Jakob (2020) Radiative backpropagation: an adjoint method for lightning-fast differentiable rendering. ACM Transactions on Graphics (TOG) 39 (4), pp. 146–1. Cited by: §2.
  • [37] S. Peng, W. Jiang, H. Pi, X. Li, H. Bao, and X. Zhou (2020) Deep snake for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8533–8542. Cited by: §1, §2, §4.3, §4.4, Table 1, Table 3.
  • [38] L. Qi, L. Jiang, S. Liu, X. Shen, and J. Jia (2019) Amodal instance segmentation with kins dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3014–3023. Cited by: §1, §4.2.
  • [39] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: Table 1.
  • [40] H. Rhodin, N. Robertini, C. Richardt, H. Seidel, and C. Theobalt (2015) A versatile scene model with differentiable visibility applied to generative pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 765–773. Cited by: §2.
  • [41] R. Roveri, A. C. Öztireli, I. Pandele, and M. Gross (2018) Pointpronets: consolidation of point clouds with convolutional neural networks. In Computer Graphics Forum, Vol. 37, pp. 87–99. Cited by: §2.
  • [42] U. Schmidt, M. Weigert, C. Broaddus, and G. Myers (2018) Cell detection with star-convex polygons. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 265–273. Cited by: Figure 1, §1.
  • [43] Z. Tian, C. Shen, H. Chen, and T. He (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636. Cited by: §1, §3.3.2, §3, §4.3.
  • [44] E. Upschulte, S. Harmeling, K. Amunts, and T. Dickscheid (2021) Contour proposal networks for biomedical instance segmentation. arXiv preprint arXiv:2104.03393. Cited by: §1, §2.
  • [45] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen (2020) SOLOv2: dynamic and fast instance segmentation. Advances in Neural Information Processing Systems. Cited by: §1, §2, Table 1.
  • [46] Z. Wang, D. Acuna, H. Ling, A. Kar, and S. Fidler (2019) Object instance annotation with deep extreme level set evolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7500–7508. Cited by: §1.
  • [47] F. Wei, X. Sun, H. Li, J. Wang, and S. Lin (2020) Point-set anchors for object detection, instance segmentation and pose estimation. In European Conference on Computer Vision, pp. 527–544. Cited by: §2, Table 1, §6.6.
  • [48] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson (2020) Synsin: end-to-end view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7467–7477. Cited by: §2, §6.4, §6.4.
  • [49] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: §4.1.
  • [50] E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo (2020) Polarmask: single shot instance segmentation with polar representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12193–12202. Cited by: Figure 1, §1, §2, §4.3, Table 1, §5.
  • [51] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: Table 1.
  • [52] W. Xu, H. Wang, F. Qi, and C. Lu (2019-10) Explicit shape encoding for real-time instance segmentation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Figure 1, §1, §2, Table 1, §5.
  • [53] M. Yang, K. Kpalma, and J. Ronsin (2012) Shape-based invariant feature extraction for object recognition. In Advances in Reasoning-Based Image Processing Intelligent Systems, pp. 255–314. Cited by: §3.1.1.
  • [54] W. Yifan, F. Serena, S. Wu, C. Öztireli, and O. Sorkine-Hornung (2019) Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–14. Cited by: §2, §6.4, §6.4.
  • [55] C. Zhang, B. Miller, K. Yan, I. Gkioulekas, and S. Zhao (2020) Path-space differentiable rendering. ACM Trans. Graph.(Proc. SIGGRAPH) 39 (6), pp. 143. Cited by: §2.
  • [56] C. Zhang, L. Wu, C. Zheng, I. Gkioulekas, R. Ramamoorthi, and S. Zhao (2019) A differential theory of radiative transfer. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–16. Cited by: §2.

6 Appendix


6.1 How Contour Point Number Affects Performance?

As the sampled point number grows, the reconstruction error is consistently reduced; but since the contour cannot represent fragmented shapes for now, the lower bound of the reconstruction error on COCO val2017 approaches 0.03, as shown in Fig. 7.

Figure 7: The Reconstructed Error on COCO val2017 of full valid length with different sampled point numbers.

However, the online experiments with ContourRender Mask R-CNN (Res50FPN) saturate earlier with respect to the point number. As shown in Table 5, when the sampled point number is larger than 32, the performance shows no significant improvement. The shape signatures in these experiments are DCT signatures.

# Sampled Points AP AP50 AP@75
16 27.6 49.9 28.7
24 30.4 52.6 31.2
32 32.6 54.6 33.8
40 32.7 55.1 33.2
Table 5: ContourRender Mask R-CNN (Res50FPN) performance on COCO val2017 with respect to different sampled point numbers.

6.2 Comparison with Other Signatures

In this section, we specifically compare with DCT signatures converted from the polar coordinate system. First, we think the angle-fixed sampling of polar coordinates can largely damage the shape representation for concave shapes. Though the effect is not that significant on the COCO dataset, it is not a natural representation after all. Thus, here the polar coordinate representation keeps both coordinates free (2 DoFs), and we convert it to DCT signatures in the same way as the Cartesian coordinates. However, we find such a representation is quite sensitive to noise, especially noise on the angle element, as shown in Fig. 8.

Figure 8: mIOU decrease on COCO val2017 with different levels of Gaussian noise on the radius and angle elements, respectively. For contrast, the mIOU decrease of the Cartesian coordinate representation is also displayed.

6.3 Why Compactness Matters?

The compactness in our context is similar to the concept of sparsity, but as some signatures, such as the Cartesian coordinate signature, do not have a natural notion of sparsity, we use a different term. When we check the sparsity of the DCT and dictionary coefficients, it is surprising to find that even when only a small number of valid coefficients is kept, the online experiments with the neural network can maintain fairly good performance. We believe this is a good property for a shape signature, so that the neural network can learn to regress a small number of coefficients and still achieve good results.

6.4 Comparison with Other Differentiable Renderers

We have tested our pipeline with other differentiable renderers, such as [54, 48, 30, 28].

For the point-based renderer [48], during training, when the point radius is set to 1 the validation score keeps dropping till 0. A similar phenomenon appears with another point-based renderer [54].

For the mesh-based renderers, we also tried OpenDR [30] and Soft Rasterizer [28]. Their behaviors are consistent with NMR [22], since rendering a contour mesh into a silhouette is a rather simple task.

6.5 How Scaling Factor of External Linking - Shrink Affects the Performance

As the External Linking - Shrink scheme shrinks the contour by a scaling factor $s \in [0, 1)$, the rendered silhouette is the region between the original and the shrunk contour. When $s$ is close to 0, the resulting silhouette is close to a full mask, while when $s$ is close to 1, the resulting silhouette is close to a thin contour. In practice, we find a large $s$ can make the predicted silhouette and the ground truth silhouette hard to match: even though the shapes are similar to each other, the IOU will still be very low. Thus, we compare different $s$ in Table 6 and decide to simply set $s = 0$.

s AP AP@50 AP@75
0 32.6 54.6 33.8
0.3 32.8 54.5 33.9
0.6 31.5 52.7 32.3
0.9 30.1 50.3 30.9
0.95 0 0 0
Table 6: ContourRender Mask R-CNN (Res50FPN) performance on COCO val2017 with respect to different scaling factors.

On the other hand, the shrinking scheme can be trivially extended to a dilating scheme with a scaling factor larger than 1.

6.6 Do Offsets from Prior Templates Significantly Different From Coordinate Itself?

Previous works like active contour-based approaches and other template-based methods [47] propose to regress a shape from a prior template. This may be useful for iterative schemes; however, we find it is not significantly different from direct coordinate regression on COCO val2017. Given evenly sampled contour points, we compare the average offsets (mean-x, mean-y) and offset variances (var-x, var-y) on the x and y axes under different offset schemes: "offset to the origin point" (i.e. the coordinate itself), "offset from an outer box" [47], and "offset from a circle" [14]. The online training with ContourRender Mask R-CNN (Res50FPN) also confirms this judgment, as shown in Table 7.

Offset From mean-x mean-y var-x var-y AP AP@50 AP@75
Origin -0.0017 0.0036 0.2139 0.2736 32.6 54.6 33.8
Outer Box 0.0017 0.0036 0.2437 1.2098 31.8 53.3 32.6
Circle 0.0369 0.0071 0.7272 0.7801 31.7 53.4 32.6
Table 7: Performance on COCO val2017 with respect to different offset schemes.

6.7 Statistics of COCO ContourHard-val

In this section, we give some statistics about the selected COCO ContourHard-val set. As mentioned in the main paper, 358 images are chosen, containing 5054 shapes of 77 categories. In Table 8, we list the instance count for each category and, in brackets, the decrease from the original COCO val2017.

category #instances category #instances category #instances
person 1500 (-9277) bicycle 34 (-280) car 193 (-1725)
motorcycle 23 (-344) airplane 27 (-116) bus 13 (-270)
train 11 (-179) truck 37 (-377) boat 95 (-329)
traffic light 28 (-606) fire hydrant 2 (-99) stop sign 1 (-74)
parking meter 7 (-53) bench 61 (-350) bird 143 (-284)
cat 9 (-193) dog 15 (-203) horse 29 (-243)
sheep 71 (-283) cow 32 (-340) elephant 8 (-244)
bear 0 (-71) zebra 15 (-251) giraffe 10 (-222)
backpack 90 (-281) umbrella 84 (-323) handbag 118 (-422)
tie 30 (-222) suitcase 34 (-265) frisbee 7 (-108)
skis 76 (-165) snowboard 14 (-55) sports ball 9 (-251)
kite 160 (-167) baseball bat 10 (-135) baseball glove 9 (-139)
skateboard 39 (-140) surfboard 28 (-239) tennis racket 17 (-208)
bottle 99 (-914) wine glass 47 (-294) cup 156 (-739)
fork 42 (-173) knife 64 (-261) spoon 56 (-197)
bowl 109 (-514) banana 20 (-350) apple 26 (-210)
sandwich 18 (-159) orange 39 (-246) broccoli 12 (-300)
carrot 39 (-326) hot dog 18 (-107) pizza 39 (-245)
donut 44 (-284) cake 24 (-286) chair 412 (-1359)
couch 27 (-234) potted plant 78 (-264) bed 5 (-158)
dining table 153 (-542) toilet 1 (-178) tv 25 (-263)
laptop 20 (-211) mouse 8 (-98) remote 29 (-254)
keyboard 13 (-140) cell phone 26 (-236) microwave 5 (-50)
oven 14 (-129) toaster 0 (-9) sink 13 (-212)
refrigerator 13 (-113) book 211 (-918) clock 16 (-251)
vase 30 (-244) scissors 6 (-30) teddy bear 2 (-188)
hair drier 0 (-11) toothbrush 1 (-56) total 5054 (-31281)
Table 8: Number of instances per category in COCO ContourHard-val, and (in brackets) the number of instances removed from the original COCO val2017 dataset because they are too convex.

6.8 Broader Impact

Who may benefit from this research?

Instance segmentation may benefit various applications, such as autonomous driving and robot manipulation/navigation, especially since our proposed method can support real-time feedback when a proper base object detector is adopted.

Who may be put at disadvantage from this research?

We cannot think of a case where someone would be put at a disadvantage specifically because of instance segmentation methods. Perhaps if a person on the sidewalk is mis-classified or missed by the detector, they could be put at risk by an autonomous vehicle.

What are the consequences of failure of the system?

There are three kinds of failures of the system, namely mis-classification, wrong localization, and bad shape prediction.

Whether the task/method leverages biases in the data?

We do not specifically leverage biases in the data, as the data and the metric are standard for this problem.