Instance segmentation is one of the fundamental tasks in computer vision. Beyond object detection, it also needs to predict the shape of the objects of interest, which is usually represented by masks He et al. (2017); Huang et al. (2019); Liu et al. (2018); Kuo et al. (2019); Chen et al. (2019a); Wang et al. (2020) or contours Xu et al. (2019); Xie et al. (2020); Peng et al. (2020); Castrejon et al. (2017); Acuna et al. (2018). While the former is widely adopted by mainstream research, contour-based approaches are drawing increasing attention, mostly because they offer explicit control over the shape, which benefits applications such as automatic annotation Castrejon et al. (2017); Acuna et al. (2018); Ling et al. (2019), autonomous driving Homayounfar et al. (2020); Wang et al. (2019); Peng et al. (2020), and medical image analysis Gur et al. (2019); Schmidt et al. (2018). Since a contour shape (polygon) is a closed curve consisting of several ordered points, predicting the contour shape is essentially regressing the point coordinates. As the coordinates can have a large numeric range and are intrinsically unstable (see Fig. 1), previous works usually regress them cumbersomely during inference with iterative optimizations Peng et al. (2020); Castrejon et al. (2017); Acuna et al. (2018); Cheng et al. (2019); Akbari et al. (2021); Hatamizadeh et al. (2020); Gur et al. (2019), cascaded optimizations Upschulte et al. (2021), or compromised representations Xu et al. (2019); Schmidt et al. (2018); Xie et al. (2020). In this work, we explore regressing a decent contour in one pass without limiting the shape representation.
Based on the observation from Fig. 1, it is natural to think that optimizing the inner content is more stable than directly regressing the point coordinates. To achieve this, we introduce two designs, namely the coordinate signature and the contour mesh. A coordinate signature is transformed from the point coordinates and introduced to facilitate the learning of coordinate information; thus it should be compact, robust, and invertible. Discrete cosine transform (DCT) 12 coefficients are adopted as the coordinate signature. A discussion of DCT coefficients and other choices of coordinate signature can be found in Sec. 3.1. However, since the coordinate signature is only an interpretation of the coordinates and cannot address the problem of discretization ambiguity, we further convert the reconstructed coordinates to a contour mesh and render the contour mesh to a silhouette with an off-the-shelf differentiable renderer Kato et al. (2018). The scheme of contour mesh construction is discussed in Sec. 3.2. Since the differentiable renderer plays the central role, we name the proposed contour regression pipeline ContourRender. To show that ContourRender is independent of the object detector design, we equip it with the popular one-stage detector FCOS Tian et al. (2019) and the two-stage Mask R-CNN He et al. (2017). The modification details are described in Sec. 3.3.1.
To quantitatively evaluate the proposed ContourRender, we conduct experiments on the challenging COCO Lin et al. (2014) and Cityscapes Cordts et al. (2016) datasets. To show the advantages in predicting complex contours, we select 358 images according to the shape convexity Qi et al. (2019) from the COCO val2017 dataset to form a COCO ContourHard-val subset. Our approach outperforms all the contour-based methods. Since the differentiable renderer is not used in the inference phase, the computational overhead of producing the contour mainly originates from the coordinate signature prediction and the inverse transformation, which is negligible since it can be easily parallelized.
Our contributions are twofold. First, we propose a novel contour-based approach for instance segmentation named ContourRender. ContourRender achieves accurate contour prediction through the DCT coordinate signature and differentiable contour mesh rendering, without the need for iterative regression or a compromise on the contour representation. Second, ContourRender adapted from either Mask R-CNN or FCOS outperforms previous contour-based baselines on COCO, Cityscapes, and our proposed challenging COCO ContourHard-val subset.
2 Related Works
Contour-based Instance Segmentation
For many applications, the contour representation is of particular interest due to its ability to explicitly describe the shape with points. Though we can certainly convert the object masks produced by mask-based instance segmentation frameworks He et al. (2017); Huang et al. (2019); Liu et al. (2018); Kuo et al. (2019); Chen et al. (2019a); Wang et al. (2020) to contours with post-processing Bradski (2000), it is still worthwhile to explore an end-to-end scheme to obtain the contour. Such attempts can be roughly divided into three categories based on their inference behaviors: one-pass approaches, iterative approaches, and cascaded approaches.
One-pass approach. Since the coordinate representation is hard to regress directly, ESE-Seg Xu et al. (2019) adopted a polar-coordinate representation and approximated the function with Chebyshev polynomials. Later, PolarMask Xie et al. (2020) took a similar design, with a polar IOU loss which directly optimizes the contour IOU. The polar-coordinate system enables sampling the contour points along a fixed angle sequence, which can largely reduce the ambiguity of the Cartesian coordinate system (from 2 DoFs to 1 DoF). However, it also limits the ability to express concave shapes (Fig. 1). Another attempt is to regress from a pre-defined point set anchor, as in Wei et al. (2020). However, according to Appendix 6.6, the statistics of the offset ranges from the pre-defined point set anchor and from the origin point show no significant difference.
Iterative approach. A well-known iterative approach is the active contour Kass et al. (1988). Since Marcos et al. (2018) incorporated the active contour into an end-to-end learning framework, many improvements have been proposed Cheng et al. (2019); Akbari et al. (2021); Hatamizadeh et al. (2020); Peng et al. (2020); Gur et al. (2019). Among them, Gur et al. (2019) integrated the active contour with a differentiable renderer. The idea is similar to ours, but their approach requires iterations and the authors only apply it to rather simple shapes like buildings or road scenes. DeepSnake Peng et al. (2020) proposed a circle convolution operator to adapt to the polygon representation. It achieves state-of-the-art performance among contour-based instance segmentation frameworks on several benchmarks. Aside from the active contour-based methods, by regarding the contour from different perspectives, RNNs (contour as an ordered point sequence) Castrejon et al. (2017); Acuna et al. (2018) and GCNs (contour as a point graph) Ling et al. (2019) are also adopted to address the polygon regression problem. They all require iterative operations to achieve satisfactory results.
Cascaded approach. Unlike the iterative approaches that refine the prediction progressively with the shared modules, the cascaded approach usually designs individual components for prediction refinement. In Upschulte et al. (2021), after the contour proposals are generated, the coarse prediction will be improved by a local refinement process where the refined offsets are learned individually.
The rendering process can be viewed as a mapping from a mesh to an image. Many factors can be taken into consideration when making the rendering differentiable, such as gradient approximation Loper and Black (2014); Kato et al. (2018); Genova et al. (2018); Kato and Harada (2019), rasterization Rhodin et al. (2015); Liu et al. (2019); Chen et al. (2019c), and ray-tracing Li et al. (2018); Zhang et al. (2019); Loubet et al. (2019); Zhang et al. (2020); Nimier-David et al. (2020). As the silhouette is all we need, rasterization and ray-tracing-based neural renderers are unnecessarily complicated. Thus we adopt a simple renderer Kato et al. (2018), which obtains the gradient by sampling in a smooth manner. On the other hand, point cloud-based neural renderers Roveri et al. (2018); Wiles et al. (2020); Yifan et al. (2019) seem more straightforward than mesh renderers with respect to the contour points. However, in practice, we find that the point cloud renderer does not bring the improvement that the mesh renderers do. We have two speculations on this observation: (1) when the point radius is too large, the coordinate information is blurred in the circles, which introduces another kind of coordinate ambiguity; (2) when the point radius is too small, the coordinate information is predominant, and so is the coordinate ambiguity mentioned in Fig. 1. This motivates us to design the contour mesh (Sec. 3.2).
Given an image , a general contour-based instance segmentation framework aims to predict all the locations and the contour shapes , where is the number of the detected objects, and is the number of the points for each contour shape.
For contour regression, we first predict the discrete cosine transform (DCT) coefficients of the contour shape and obtain a coarse contour with the inverse DCT (iDCT) operation (Sec. 3.1). We then convert the coarse contour to a contour mesh and render the corresponding silhouette (Sec. 3.2).
For the object detection part, we suggest two choices, namely Mask R-CNN He et al. (2017) and FCOS Tian et al. (2019), and demonstrate how to modify them to work with ContourRender in Sec. 3.3. Note that ContourRender can be similarly extended to other object detectors.
3.1 Coordinate Signature
As mentioned earlier, the coordinate signature should be compact, robust, and invertible to fit into the neural network learning paradigm. There are many candidates, such as Cartesian coordinates, polar coordinates, dictionary coefficients of the Cartesian coordinates, DCT descriptor of the Cartesian coordinates, etc.
In the main paper, we compare the Cartesian coordinates and their dictionary and DCT descriptors to illustrate how we decide on the most preferable one before training. The experiments reported in Sec. 4.5 confirm this choice. For the other signatures, please refer to Appendix 6.2.
3.1.1 Computation of the Signatures and the Invertibility
Given a contour shape , the computation of the signatures and the inverse transform to reconstructed contour shape are described as the following.
Coordinate Signature Given a ground-truth bounding box , we have , , and , where stands for the center and stands for the scale. Inversely, during inference, given the predicted bounding box, the exact position and scale can be obtained by . Note that all operations here are element-wise.
Dictionary Coordinate Signature are given by optimizing with Orthogonal Matching Pursuit (OMP) Mallat and Zhang (1993) which follows Mairal et al. (2009), where is the learned dictionary atoms. To recover , it first converts back to by , and then convert to following the same procedure as coordinate signature.
DCT Coordinate Signature , the applied DCT is 1-D for each axis 12. Similarly, to obtain , it first converts , and then convert to in the same way with coordinate signature.
It is noteworthy that some common choices Yang et al. (2012) of contour signatures, such as the centroid distance function and the contour curvature, are not invertible.
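The DCT coordinate signature and its inverse can be illustrated with a small pure-Python sketch. It assumes the contour is centered at the bounding-box center and divided by the box width and height, and it uses an orthonormal 1-D DCT-II per axis; the exact normalization used in the paper is not recoverable from the text above, and the function names (`encode_contour`, `decode_contour`) are hypothetical.

```python
import math

def dct_ortho(x):
    """Orthonormal 1-D DCT-II of a sequence of floats."""
    n = len(x)
    coeffs = []
    for k in range(n):
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * acc)
    return coeffs

def idct_ortho(c):
    """Inverse of dct_ortho (orthonormal DCT-III)."""
    n = len(c)
    values = []
    for i in range(n):
        acc = c[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            acc += c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
        values.append(acc)
    return values

def encode_contour(contour, box):
    """Normalize (x, y) points by the box (assumed scheme), then DCT each axis."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    w, h = x_max - x_min, y_max - y_min
    xs = [(x - cx) / w for x, _ in contour]
    ys = [(y - cy) / h for _, y in contour]
    return dct_ortho(xs), dct_ortho(ys)

def decode_contour(signature, box):
    """Inverse DCT per axis, then undo the box normalization."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    w, h = x_max - x_min, y_max - y_min
    xs = idct_ortho(signature[0])
    ys = idct_ortho(signature[1])
    return [(x * w + cx, y * h + cy) for x, y in zip(xs, ys)]

# Toy example: a 5-point contour and its tight bounding box.
contour = [(10.0, 10.0), (30.0, 12.0), (35.0, 30.0), (20.0, 40.0), (8.0, 25.0)]
box = (8.0, 10.0, 35.0, 40.0)
signature = encode_contour(contour, box)
recovered = decode_contour(signature, box)
```

With all coefficients kept, the round trip recovers the input points up to floating-point error, which is the invertibility property required of a signature. A practical implementation would use a vectorized DCT routine rather than these explicit loops.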
3.1.2 The Representation Power and Compactness
Compactness means that the signatures should be able to aggregate most information to a few coefficients with a constrained numeric range. Empirically, we find that a more compact signature is easier to learn. Unfortunately, we do not have rigid theory for this observation, but some discussions are made in Appendix 6.3. We conduct an offline analysis on the COCO val2017 dataset, which consists of 36781 instance shapes of 80 categories. We evaluate three signatures on the reconstruction ability concerning the valid coefficient length (number of non-zero coefficients), as shown in Fig. 3. The reconstruction error is defined as . is calculated by comparing the ground truth mask and the reconstructed mask by filling the contour polygon and resizing to the original size. When reducing the valid coefficient length, we gradually set the coefficients to zero from the last elements to the first.
3.1.3 Sensitivity Analysis and Robustness
As neural networks are known to regress with noise, it is informative to test the sensitivity of the signatures. We randomly add Gaussian noise to each coefficient element in the signatures. The noise level is set to 0.1, 0.2, 0.3, 0.4, 0.5. We show how much mIOU loss the Gaussian noise causes on different signatures. The results are shown in Fig. 3.
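The two offline analyses (coefficient truncation in Sec. 3.1.2 and noise injection in this section) can be mimicked on a toy signal with the sketch below. It is only an illustration under stated assumptions: it uses a synthetic 1-D coordinate sequence instead of COCO contours, treats the noise level as a standard deviation, and reports an RMS coordinate error as a stand-in for the mask-IOU-based error used in the paper. All helper names are hypothetical.

```python
import math
import random

def dct_ortho(x):
    """Orthonormal 1-D DCT-II."""
    n = len(x)
    return [
        (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def idct_ortho(c):
    """Inverse of dct_ortho."""
    n = len(c)
    return [
        c[0] * math.sqrt(1.0 / n)
        + sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
              for k in range(1, n))
        for i in range(n)
    ]

def rms(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

def truncate(coeffs, keep):
    """Zero the coefficients from the last element backwards, keeping `keep`."""
    return coeffs[:keep] + [0.0] * (len(coeffs) - keep)

def add_noise(coeffs, sigma, seed=0):
    """Add zero-mean Gaussian noise to every coefficient (sigma = assumed std)."""
    rng = random.Random(seed)
    return [c + rng.gauss(0.0, sigma) for c in coeffs]

# Synthetic x-coordinates of a smooth closed curve sampled at 32 points.
n = 32
signal = [math.cos(2 * math.pi * i / n) + 0.3 * math.cos(6 * math.pi * i / n)
          for i in range(n)]
coeffs = dct_ortho(signal)

# Compactness: error as a function of the valid coefficient length.
trunc_err = {keep: rms(signal, idct_ortho(truncate(coeffs, keep)))
             for keep in (4, 8, 16, 32)}

# Robustness: error as a function of the injected noise level.
noise_err = {sigma: rms(signal, idct_ortho(add_noise(coeffs, sigma)))
             for sigma in (0.1, 0.2, 0.3, 0.4, 0.5)}
```

Because the transform is orthonormal, the truncation error cannot increase as more coefficients are kept, so a curve like the one in Fig. 3 is monotone for the DCT signature in this coordinate-space proxy.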
3.1.4 On the Choice of the Coordinate Signature
We have compared three coordinate signatures from three perspectives. All of them are invertible and can reconstruct the original contour shape. While the Cartesian coordinate signature is the most resistant to noise, it is not compact enough. Thus the DCT signature is adopted. Admittedly, better coordinate signatures may exist; we leave this for future work.
3.1.5 Signature Loss
Given the signature obtained from the ground truth mask, we can directly supervise the signature with:
where the is the predicted signature. is the smooth L1 loss.
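As a reminder of what the smooth L1 loss computes on a signature vector, here is a minimal pure-Python sketch. The transition point `beta=1.0` and the mean reduction are assumptions (common defaults), not values stated in the text.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss between two equal-length coefficient lists.

    Quadratic for |diff| < beta, linear otherwise. `beta` and the mean
    reduction are illustrative assumptions.
    """
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)

# Example: a predicted signature versus its ground-truth signature.
loss = smooth_l1([0.5, -0.2, 2.0], [0.0, -0.2, 0.0])
```

The quadratic region keeps gradients small near the target, while the linear region limits the influence of large coefficient errors.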
After learning on the signature loss, we can already obtain a coarse contour, by converting the predicted according to the procedures described in Sec. 3.1.1. But as mentioned earlier, signature learning cannot address the ambiguous nature of the coordinates. Thus we propose to train on a more stable silhouette through differentiable rendering.
3.2 Contour Mesh
According to the discussion in Sec. 2 and the results from Appendix 6.4, we adopt NMR Kato et al. (2018) as the off-the-shelf renderer. In this case, we need to convert the contour into a contour-derived mesh (contour mesh for short) before rendering.
3.2.1 Construction of Contour Mesh
As the gradient from the neural renderer back-propagates to the vertices, i.e. the contour points, the construction process should avoid altering the contour points.
A mesh is composed of 3D vertices and triangle faces. The 3D vertices can be simply obtained by appending an all-zero z-axis to the contour points, while the ways of constructing the faces need to be discussed.
There are two intuitive approaches to construct the contour mesh, i.e. Internal Linking and External Linking. Internal linking means constructing the mesh faces based on the existing contour points, while external linking means we first find a set of the external points and link the contour points with them to form the faces. Both approaches can be conducted with different strategies. Here we give some intuitive thoughts and select one for the main experiments. The illustration of linking strategies is displayed in Fig. 4.
Internal Linking. For internal linking, each neighboring point pair will forward counter-clockwise to find the first point which can make these three points a triangle with an area larger than a threshold . We set for all the datasets.
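One possible reading of the internal linking rule is sketched below. It assumes the contour points are stored in counter-clockwise order, so that "moving forward" means increasing the index cyclically; the threshold value `min_area` is a hypothetical parameter, since the value used in the paper is not given above.

```python
def triangle_area(a, b, c):
    """Unsigned area of the triangle with 2-D vertices a, b, c."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def internal_linking(points, min_area):
    """For each neighboring pair (i, i+1), link the first later point that
    forms a triangle with area above `min_area` (hypothetical threshold)."""
    n = len(points)
    faces = []
    for i in range(n):
        j = (i + 1) % n
        for step in range(2, n):
            k = (i + step) % n
            if triangle_area(points[i], points[j], points[k]) > min_area:
                faces.append((i, j, k))
                break  # only the first qualifying point is linked
    return faces

# A contour with three collinear points along its bottom edge.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
faces = internal_linking(points, min_area=0.1)
```

In this example the pair (0, 1) skips the collinear point 2, whose triangle would have zero area, and links to point 3 instead, which is the degenerate-face case the area threshold is meant to avoid.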
External Linking. For external linking, an instant idea is to perform the clustering algorithm such as KMeans MacQueen and others (1967) to obtain the clustering centers. The clustering centers are most likely to be inside the contour. And then we can link the contour points to their corresponding clustering centers to form a contour mesh. This process is denoted as ’External Linking-KMeans’. However, as we would like to have the linking process run on the fly, in practice, we take another simple and faster approach. As the contour predicted is already zero-centered (because the supervised signature is obtained after zero-centered), we can simply shrink the contour by multiplying a scaling factor less than 1, and link the corresponding points between original and shrunk contours to form the contour mesh. We set for all the datasets. Different choices of the scaling factor are discussed in Appendix 6.5. This process is named as ’External Linking-Shrink’.
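The External Linking-Shrink idea can be sketched as follows. The sketch assumes a zero-centered 2-D contour, appends a zero z-coordinate to form 3-D vertices, and splits each quad between the original and shrunk contours into two triangles. The scaling factor `scale=0.9` is a placeholder, not the value used in the paper, and the particular way of splitting each quad is an assumption.

```python
def contour_mesh_shrink(points, scale=0.9):
    """Build (vertices, faces) by linking a contour to its shrunk copy.

    points: zero-centered (x, y) contour points in order.
    scale:  hypothetical shrinking factor, must be less than 1.
    """
    n = len(points)
    outer = [(x, y, 0.0) for x, y in points]                  # original contour
    inner = [(x * scale, y * scale, 0.0) for x, y in points]  # shrunk contour
    vertices = outer + inner
    faces = []
    for i in range(n):
        j = (i + 1) % n
        # Quad (outer i, outer j, inner j, inner i) split into two triangles.
        faces.append((i, j, n + i))
        faces.append((j, n + j, n + i))
    return vertices, faces

# A zero-centered square contour.
square = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
vertices, faces = contour_mesh_shrink(square, scale=0.5)
```

Every original contour point remains a mesh vertex, so gradients from the renderer reach the predicted coordinates directly, and the construction needs no clustering step at training time.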
3.2.2 Silhouette Loss
After we transform the contour points to contour mesh, we can generate the corresponding silhouette by NMR Kato et al. (2018). The silhouette is denoted as , while we put the ground truth contour forward the same process and result in . In this way, the silhouette loss is defined by:
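Sec. 4.5 states that Lovász-Softmax Berman et al. (2018) serves as the loss on the silhouettes. The sketch below shows the binary (foreground only) form of that loss on flattened pixel lists in pure Python. It is only meant to illustrate the computation: the rendering step is not shown, and real training would need a tensor-based, differentiable implementation. The function names are hypothetical.

```python
def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss, given ground-truth
    labels (0 or 1) sorted by decreasing prediction error."""
    total = sum(gt_sorted)
    grads = []
    seen_fg = 0.0   # foreground pixels visited so far
    seen_bg = 0.0   # background pixels visited so far
    previous = 0.0
    for g in gt_sorted:
        seen_fg += g
        seen_bg += 1 - g
        jaccard = 1.0 - (total - seen_fg) / (total + seen_bg)
        grads.append(jaccard - previous)
        previous = jaccard
    return grads

def lovasz_silhouette_loss(pred, gt):
    """pred: per-pixel silhouette values in [0, 1]; gt: per-pixel 0/1 labels."""
    errors = [abs(g - p) for p, g in zip(pred, gt)]
    order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    grads = lovasz_grad([gt[i] for i in order])
    return sum(errors[i] * w for i, w in zip(order, grads))

# Flattened 2x2 silhouettes: rendered prediction versus ground truth.
pred = [0.9, 0.8, 0.1, 0.4]
gt = [1, 1, 0, 0]
loss = lovasz_silhouette_loss(pred, gt)
```

The loss is zero for a perfect silhouette and reaches one when every pixel is wrong, so it behaves as a surrogate for one minus the IOU.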
3.3 ContourRender with Object Detectors
3.3.1 ContourRender Branch
A generic ContourRender branch is designed as shown in Fig. 2. In this section, we describe how it can be integrated into existing object detectors. For detectors that can only predict the object location, like FCOS, we show how to add the ContourRender branch to enable shape prediction. For object detection frameworks that can already perform instance segmentation, like Mask R-CNN, we show how to replace the original mask branch with our ContourRender branch. The modifications are shown in Fig. 5.
Add Modification of FCOS As the FCOS can only perform object detections, we add the ContourRender branch by taking the same input feature with the box head.
Replace Modification of Mask R-CNN As the Mask R-CNN is already an instance segmentation framework, we replace the original mask branch to the proposed ContourRender branch. The original mask features with the shape of after the RoiAlign operators are directly flattened and fed into the ContourRender branch.
3.3.2 Overall Objective Functions
With the object detector, we denote the object detection objective functions as , the details of for each detector remain the same, and can be referred to the original papers He et al. (2017); Tian et al. (2019). Therefore, the overall learning objective function for ContourRender is given as follows:
where and are the adjusting weights for different loss, which are determined by cross-validation and discussed in Sec. 4.1.
4.1 Implementation Details
The default backbone is ResNet-50 with ImageNet-pretrained weights. We train the network for 150K iterations with stochastic gradient descent (SGD), which finishes within 18 hours on 4 RTX Titan GPUs. The initial learning rate is 0.02 and the mini-batch size is 16. The learning rate is reduced to 0.001 and 0.0001 at 60k and 140k, respectively. Weight decay and momentum are set to 0.0001 and 0.9, respectively. To stabilize the training process, we set and during the first 60K iterations. Then, we set till the end and make decay linearly to till 70K. The performance gap between the two frameworks is negligible.
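The step learning-rate schedule described above can be written as a small function. The values are taken from the text; treating each reduction as taking effect exactly at the stated iteration is an assumption, and the function name is hypothetical.

```python
def learning_rate(iteration):
    """Step schedule: 0.02 initially, 0.001 from 60K, 0.0001 from 140K."""
    if iteration < 60_000:
        return 0.02
    if iteration < 140_000:
        return 0.001
    return 0.0001

# Learning rate at a few points of the 150K-iteration run.
schedule = {it: learning_rate(it) for it in (0, 59_999, 60_000, 140_000, 149_999)}
```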
In this work, we will use three datasets to evaluate the performance of the ContourRender on Mask R-CNN and FCOS, namely COCO Lin et al. (2014), Cityscapes Cordts et al. (2016) and COCO ContourHard-val.
COCO is a large-scale dataset with 860001 shapes for training and 36781 shapes for validation. It has large shape varieties including a fair amount of separated shapes. To deal with the separated shapes, we just connect the separate contours by stacking them to form a larger contour.
As most COCO object shapes are fairly round, such as the shapes of bears, stop signs, and frisbees, even an imperfect contour representation can obtain fairly good performance. Thus, to better compare contour representations, we select a subset from COCO val2017 as the COCO ContourHard-val set. The selection is based on the convexity mentioned in Qi et al. (2019). Only images with an object shape of convexity smaller than 0.4 are selected for validation. As a result, 358 images are chosen, which contain 5054 shapes of 77 categories. For the statistics of each category, please refer to Appendix 6.7.
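A convexity-based filter of this kind can be sketched as follows. The sketch assumes convexity is the ratio between the polygon area and the area of its convex hull, which is a common definition but should be checked against Qi et al. (2019); how several shapes in one image are aggregated is not specified here, and the function names are hypothetical.

```python
def polygon_area(pts):
    """Unsigned polygon area via the shoelace formula."""
    n = len(pts)
    acc = 0.0
    for i in range(n):
        x0, y0 = pts[i]
        x1, y1 = pts[(i + 1) % n]
        acc += x0 * y1 - x1 * y0
    return abs(acc) / 2.0

def convex_hull(pts):
    """Convex hull (Andrew's monotone chain), counter-clockwise order."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def convexity(contour):
    """Assumed definition: area(contour) / area(convex hull of contour)."""
    hull_area = polygon_area(convex_hull(contour))
    return polygon_area(contour) / hull_area if hull_area > 0 else 0.0

# An L-shaped contour is mildly concave; a thin cross is strongly concave.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
cross_shape = [(4, 0), (5, 0), (5, 4), (9, 4), (9, 5), (5, 5),
               (5, 9), (4, 9), (4, 5), (0, 5), (0, 4), (4, 4)]
is_hard = convexity(cross_shape) < 0.4  # threshold taken from the text
```

Under this definition a convex shape scores 1, and thin or branching shapes score low, which matches the intent of selecting contours that are hard to represent.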
Cityscapes is a popular road-scene dataset with 2975 images with 54060 shapes for training, and 500 images with 10415 shapes for validation. It includes 8 categories mostly concerning people and vehicles on the road.
4.3 Results on COCO
Comparison with Other Baselines
We compare our method with both mask-based and contour-based methods. Our method is competitive with the mask-based methods; in particular, it drops only 1 AP when the mask branch of Mask R-CNN is replaced with ContourRender. Considering that a contour is rather limited in expressing separated or hollowed shapes, a small performance loss is expected.
With regard to contour-based methods, compared with the one-pass baseline PolarMask Xie et al. (2020) in the same setting with FCOS Tian et al. (2019) as the base detector, our method is better by 1.5 AP on COCO val2017 and by 2.6 AP on COCO ContourHard-val (Table 3). Besides, our method also beats the iterative approach DeepSnake Peng et al. (2020), even though the latter uses iterative optimization to refine the contour while ours does not.
| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| Mask R-CNN He et al. (2017) | Res50FPN | 33.6 | 55.2 | 35.3 | - | - | - |
| Mask R-CNN He et al. (2017) | Res101FPN | 35.7 | 58.0 | 37.8 | 15.5 | 38.1 | 52.4 |
| PANet Liu et al. (2018) | Res50FPN | 36.6 | 58.0 | 39.3 | 16.3 | 38.1 | 53.1 |
| SOLOv2 Wang et al. (2020) | Res50FPN | 38.8 | 59.9 | 41.7 | 16.5 | 41.7 | 56.2 |
| DeepSnake Peng et al. (2020) | DLA-34 | 30.3 | - | - | - | - | - |
| ESE-Seg Xu et al. (2019) | DarkNet-53 | 21.6 | 48.7 | 22.4 | - | - | - |
| PointSetNet Wei et al. (2020) | RX101FPN-DCN | 34.6 | 60.1 | 34.9 | - | - | - |
| PolarMask Xie et al. (2020) | Res50FPN | 29.1 | 49.5 | 29.7 | 12.6 | 31.8 | 42.3 |
| PolarMask Xie et al. (2020) | Res101FPN | 30.4 | 51.9 | 31.0 | 13.4 | 32.4 | 42.8 |
| PolarMask Xie et al. (2020) | RX101FPN-DCN | 36.2 | 59.4 | 37.7 | 17.8 | 37.7 | 51.5 |
Comparison with Other Baselines On COCO ContourHard-val
Since the validation dataset is built specifically for contour comparison, we only compare the contour-based methods on this validation set. As shown in Table 3, our method beats the baseline methods by a large margin.
| Method | Backbone | AP |
|---|---|---|
| PolygonRNN++ Acuna et al. (2018) | Res50M | 22.8 |
| DeepSnake* Peng et al. (2020) | DLA-34 | 28.2 |
| DeepSnake Peng et al. (2020) | DLA-34 | 31.7 |
4.4 Results on Cityscapes
Since most contour-based approaches report the mIOU metric on Cityscapes, we follow the same training/testing and evaluation protocol. A few methods Peng et al. (2020); Acuna et al. (2018) report the AP metric on Cityscapes; to compare with them, we also report the AP metric on Cityscapes in Table 3. Interestingly, although almost all these contour-based methods are iterative approaches, our one-pass approach outperforms most of them.
|PolygonRNN Castrejon et al. (2017)||61.4||64.0||60.6||71.2||68.0||69.5||53.7||52.1||52.1|
|PolygonRNN++ Acuna et al. (2018)||70.2||70.8||68.5||78.0||77.9||79.6||62.8||61.7||62.3|
|SplineGCN Ling et al. (2019)||72.1||72.5||70.6||80.2||79.1||81.7||65.9||62||64.8|
|Gur et al. Gur et al. (2019)||75.1||75.0||72.0||82.0||79.6||83.0||74.5||66.5||68.1|
4.5 Ablative Study
The ablative studies are carried out on the COCO dataset with ContourRender-MaskRCNN (Res50FPN). More ablative studies can be found in the supplementary materials.
The necessity of the signature learning If we remove the signature learning process, that is, set and from beginning to end, we find that will not converge at all.
Performance of different signatures With the Cartesian coordinate signature, we obtain 24.5 AP; with the dictionary signature, we obtain 26.1 AP; with the DCT signature, we obtain 28.1 AP.
Performance of different silhouette losses Lovasz Softmax Berman et al. (2018) has proved effective as an IOU loss, so we adopt it in the ContourRender pipeline. If we substitute it with the common MSE loss, BCE loss, or Dice loss, the AP drops by 4.1, 6.7, and 3.5, respectively.
5 Limitations and Future Works
In this work, we propose a novel contour-based instance segmentation pipeline, which enables the learning framework to produce decent contours without an iterative or cascaded design, and the shape representation is not compromised as in previous works like ESE-Seg Xu et al. (2019) or PolarMask Xie et al. (2020). Nevertheless, our method still suffers from several drawbacks: 1. The contour representation cannot handle separated or hollowed contours at this moment. 2. The overall performance is largely dependent on the bounding box detection, which provides the position and the exact scale of the contour. 3. The choice of the coordinate signature and contour mesh is determined by intuition, which may be improved in later works.
-  (2018) Efficient interactive annotation of segmentation datasets with polygon-rnn++. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 859–868. Cited by: §1, §2, §4.4, Table 3, Table 4.
-  (2021) Deep active contours using locally controlled distance vector flow. arXiv preprint arXiv:2105.08447. Cited by: §1, §2.
-  (2018) The lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4413–4421. Cited by: §3.2.2, §4.5.
-  (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §2.
-  (2017) Annotating object instances with a polygon-rnn. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5230–5238. Cited by: §1, §2, Table 4.
-  (2019) Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4974–4983. Cited by: §1, §2.
-  (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.1.
-  (2019) Learning to predict 3d objects with an interpolation-based differentiable renderer. arXiv preprint arXiv:1908.01210. Cited by: §2.
-  (2019) Darnet: deep active ray network for building segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7431–7439. Cited by: §1, §2.
-  (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §1, §4.2.
-  (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: Table 1.
-  (2006) Discrete cosine transform (dct). In Encyclopedia of Multimedia, B. Furht (Ed.), pp. 203–205. External Links: Cited by: §1, §3.1.1.
-  (2018) Unsupervised training for 3d morphable model regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8377–8386. Cited by: §2.
-  (2019) End to end trainable active contours via differentiable rendering. arXiv preprint arXiv:1912.00367. Cited by: §1, §2, Table 4, §6.6.
-  (2020) End-to-end trainable deep active contour models for automated image segmentation: delineating buildings in aerial imagery. In European Conference on Computer Vision, pp. 730–746. Cited by: §1, §2.
-  (2017) Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), Cited by: §1, §1, §2, §3.3.2, §3, Table 1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 1.
-  (2020) LevelSet r-cnn: a deep variational method for instance segmentation. In European Conference on Computer Vision, pp. 555–571. Cited by: §1.
-  (2019-06) Mask scoring r-cnn. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
-  (1988) Snakes: active contour models. International journal of computer vision 1 (4), pp. 321–331. Cited by: §2.
-  (2019) Learning view priors for single-view 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9778–9787. Cited by: §2.
-  (2018) Neural 3d mesh renderer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3907–3916. Cited by: §1, §2, §3.2.2, §3.2, §6.4.
-  (2019-10) ShapeMask: learning to segment novel objects by refining shape priors. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
-  (2018) Differentiable monte carlo ray tracing through edge sampling. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–11. Cited by: §2.
-  (2017-07) Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1.
-  (2014) Microsoft coco: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Cham, pp. 740–755. External Links: Cited by: §1, §4.2.
-  (2019) Fast interactive object annotation with curve-gcn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5257–5266. Cited by: §1, §2, Table 4.
-  (2019) Soft rasterizer: a differentiable renderer for image-based 3d reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7708–7717. Cited by: §2, §6.4, §6.4.
-  (2018) Path aggregation network for instance segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Table 1.
-  (2014) OpenDR: an approximate differentiable renderer. In European Conference on Computer Vision, pp. 154–169. Cited by: §2, §6.4, §6.4.
-  (2019) Reparameterizing discontinuous integrands for differentiable rendering. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–14. Cited by: §2.
-  (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, pp. 281–297. Cited by: §3.2.1.
-  (2009) Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning, pp. 689–696. Cited by: §3.1.1.
-  (1993) Matching pursuits with time-frequency dictionaries. IEEE Transactions on signal processing 41 (12), pp. 3397–3415. Cited by: §3.1.1.
-  (2018) Learning deep structured active contours end-to-end. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8877–8885. Cited by: §2.
-  (2020) Radiative backpropagation: an adjoint method for lightning-fast differentiable rendering. ACM Transactions on Graphics (TOG) 39 (4), pp. 146–1. Cited by: §2.
-  (2020) Deep snake for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8533–8542. Cited by: §1, §2, §4.3, §4.4, Table 1, Table 3.
-  (2019) Amodal instance segmentation with kins dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3014–3023. Cited by: §1, §4.2.
-  (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: Table 1.
-  (2015) A versatile scene model with differentiable visibility applied to generative pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 765–773. Cited by: §2.
-  (2018) Pointpronets: consolidation of point clouds with convolutional neural networks. In Computer Graphics Forum, Vol. 37, pp. 87–99. Cited by: §2.
-  (2018) Cell detection with star-convex polygons. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 265–273. Cited by: Figure 1, §1.
-  (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636. Cited by: §1, §3.3.2, §3, §4.3.
-  (2021) Contour proposal networks for biomedical instance segmentation. arXiv preprint arXiv:2104.03393. Cited by: §1, §2.
-  (2020) SOLOv2: dynamic and fast instance segmentation. Advances in Neural Information Processing Systems. Cited by: §1, §2, Table 1.
-  (2019) Object instance annotation with deep extreme level set evolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7500–7508. Cited by: §1.
-  (2020) Point-set anchors for object detection, instance segmentation and pose estimation. In European Conference on Computer Vision, pp. 527–544. Cited by: §2, Table 1, §6.6.
-  (2020) Synsin: end-to-end view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7467–7477. Cited by: §2, §6.4, §6.4.
-  (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: §4.1.
-  (2020) Polarmask: single shot instance segmentation with polar representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12193–12202. Cited by: Figure 1, §1, §2, §4.3, Table 1, §5.
-  (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500. Cited by: Table 1.
-  (2019-10) Explicit shape encoding for real-time instance segmentation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Figure 1, §1, §2, Table 1, §5.
Shape-based invariant feature extraction for object recognition. In Advances in Reasoning-Based Image Processing Intelligent Systems, pp. 255–314. Cited by: §3.1.1.
-  (2019) Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–14. Cited by: §2, §6.4, §6.4.
-  (2020) Path-space differentiable rendering. ACM Trans. Graph.(Proc. SIGGRAPH) 39 (6), pp. 143. Cited by: §2.
-  (2019) A differential theory of radiative transfer. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–16. Cited by: §2.
6.1 How Does the Number of Contour Points Affect Performance?
As the number of sampled points grows, the reconstruction error decreases consistently. However, since a single contour cannot yet represent fragmented shapes, the reconstruction error on COCO val2017 does not reach zero but levels off at around 0.03, as shown in Fig. 7.
The online experiments with ContourRender Mask R-CNN (Res50FPN), however, saturate earlier. As shown in Table 5, when the number of sampled points is larger than 32, the performance shows no significant improvement. The shape signatures in these experiments are DCT signatures.
|# Sampled Points||AP||AP50||AP75|
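To make the relation between the number of sampled points and the reconstruction error concrete, the following is a minimal sketch rather than our evaluation code. It assumes even sampling by arc length and uses the relative area difference as a stand-in error measure; the function names `resample_contour` and `area_error` are hypothetical, and the toy contour is a dense circle rather than a COCO shape.

```python
import math

def polygon_area(points):
    """Absolute area of a closed polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0

def resample_contour(points, num_samples):
    """Sample num_samples points evenly by arc length along a closed polygon."""
    n = len(points)
    # Edge lengths, including the closing edge back to the first vertex.
    lengths = [math.dist(points[i], points[(i + 1) % n]) for i in range(n)]
    step = sum(lengths) / num_samples
    samples = []
    edge = 0
    start = 0.0  # arc length at the start of the current edge
    for k in range(num_samples):
        target = k * step
        # Advance to the edge that contains the target arc length.
        while edge < n - 1 and start + lengths[edge] < target:
            start += lengths[edge]
            edge += 1
        t = (target - start) / lengths[edge] if lengths[edge] > 0 else 0.0
        x0, y0 = points[edge]
        x1, y1 = points[(edge + 1) % n]
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples

def area_error(points, num_samples):
    """Relative area difference between a contour and its resampled version
    (an assumed proxy, not the error measure used in the paper)."""
    original = polygon_area(points)
    approx = polygon_area(resample_contour(points, num_samples))
    return abs(original - approx) / original

# Toy contour: a circle described by 360 vertices.
dense_circle = [(math.cos(2 * math.pi * i / 360), math.sin(2 * math.pi * i / 360))
                for i in range(360)]
errors = {n: area_error(dense_circle, n) for n in (8, 16, 32, 64)}
```

On this toy shape the error shrinks monotonically as the number of samples grows. A single contour still cannot cover fragmented shapes, which is why the error on real data levels off instead of vanishing.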
6.2 Comparison with Other Signatures
In this section, we specifically compare with DCT signatures converted from polar coordinates. First, we believe that polar coordinates sampled at fixed angles can substantially damage the shape representation concerning convex shapes. Although the effect is not that significant on the COCO dataset, it is not a natural representation. We therefore use a polar coordinate representation with 2 DOFs per point and convert it to DCT signatures in the same way as for Cartesian coordinates. However, we find that this representation is quite sensitive to noise, especially the element, as shown in Fig. 8.
6.3 Why Does Compactness Matter?
Compactness in our context is similar to the concept of sparsity. However, since some signatures, such as the Cartesian coordinate signature, are not sparse by nature, we choose to use a different concept. When we examine the sparsity of the DCT and dictionary coefficients, we are surprised to find that even when only a certain number of valid coefficients are kept, the online experiments with the neural network maintain fairly good performance. We believe this is a desirable property for a shape signature, since the neural network then only needs to regress a small number of coefficients to achieve good results.
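The effect of keeping only a few coefficients can be illustrated with a small sketch. It assumes that the signature is obtained by applying an orthonormal 1-D DCT-II separately to the x and y coordinate sequences and keeping the leading coefficients; this layout, the helper names (`contour_signature`, `reconstruct`, `rms_error`) and the toy contour are illustrative assumptions, not a description of our implementation.

```python
import math

def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        s = sum(signal[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs

def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(coeffs[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        signal.append(s)
    return signal

def contour_signature(points, keep):
    """Keep the first `keep` DCT coefficients of the x and y sequences."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return dct(xs)[:keep], dct(ys)[:keep]

def reconstruct(signature, num_points):
    """Zero-pad the truncated signature and invert it back to coordinates."""
    cx, cy = signature
    xs = idct(list(cx) + [0.0] * (num_points - len(cx)))
    ys = idct(list(cy) + [0.0] * (num_points - len(cy)))
    return list(zip(xs, ys))

def rms_error(a, b):
    """Root-mean-square point distance between two equally sampled contours."""
    sq = sum((xa - xb) ** 2 + (ya - yb) ** 2 for (xa, ya), (xb, yb) in zip(a, b))
    return math.sqrt(sq / len(a))

# Toy contour: an ellipse with a small ripple, sampled at 64 points.
N = 64
contour = [(100 + 40 * math.cos(2 * math.pi * i / N) + 3 * math.cos(10 * math.pi * i / N),
            60 + 20 * math.sin(2 * math.pi * i / N)) for i in range(N)]
errors = {k: rms_error(contour, reconstruct(contour_signature(contour, k), N))
          for k in (4, 16, 64)}
```

Because the transform is orthonormal, the reconstruction error can only decrease as more coefficients are kept, and keeping all of them recovers the contour up to floating-point precision.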
6.4 Comparison with Other Differentiable Renderers
6.5 How Does the Scaling Factor of External Linking - Shrink Affect Performance?
Since the external linking scheme shrinks the contour, the resulting silhouette depends on the scaling factor: at one extreme it is close to a full mask, while at the other it is close to a thin contour. In practice, we find that a large scaling factor makes the predicted silhouette and the ground-truth silhouette hard to match: even when the two shapes are similar, the IoU can still be very low. We therefore compare different scaling factors in Table 6 and simply fix the value based on that comparison.
On the other hand, the shrinking scheme can be trivially extended to a dilating scheme by changing the scaling factor.
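The dependence of the silhouette on the scaling factor can be sketched as follows. The sketch assumes that the linked contour is a uniform scaling of the original contour about its vertex mean by a hypothetical factor `s`, and that the silhouette is the band between the two contours; the actual construction in the main paper may differ, and `ring_fraction` is an illustrative name.

```python
def polygon_area(points):
    """Absolute area of a closed polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0

def scale_about_center(points, s):
    """Scale a contour about the mean of its vertices by factor s (assumed scheme)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(cx + s * (x - cx), cy + s * (y - cy)) for x, y in points]

def ring_fraction(points, s):
    """Area of the band between the contour and its scaled copy,
    relative to the area enclosed by the original contour."""
    outer = polygon_area(points)
    inner = polygon_area(scale_about_center(points, s))
    return abs(outer - inner) / outer

square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
fractions = {s: ring_fraction(square, s) for s in (0.0, 0.5, 0.9, 1.5)}
```

Under this assumption the band covers a fraction |1 - s^2| of the original area. A factor near 0 gives almost the full mask and a factor near 1 gives a thin band, so a small misalignment between two thin bands already drives the IoU down. A factor above 1 corresponds to dilation in this sketch.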
6.6 Are Offsets from Prior Templates Significantly Different from the Coordinates Themselves?
Previous works, such as active contour-based approaches and other template-based methods, propose to regress a shape from a prior template. This may be useful for iterative schemes; however, we find that it is not significantly different from direct coordinate regression on COCO val2017. Given evenly sampled contour points, we compare the average offset (mean_x, mean_y) and the offset variance (var_x, var_y) on the x and y axes under different offset schemes: "offset to original points" (i.e., the coordinate itself), "offset from an outer box", and "offset from a circle". Online training with ContourRender Mask R-CNN (Res50FPN) also confirms this observation, as shown in Table 7.
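The per-axis statistics above can be computed as in the following sketch. It covers only the "coordinate itself" and "offset from a circle" schemes; the pairing of contour points with template points by index, the circle radius, and the helper names (`offsets`, `axis_stats`, `circle_template`) are assumptions made for illustration. The toy ellipse only shows how the numbers are obtained and says nothing about the COCO statistics.

```python
import math

def mean_var(values):
    """Mean and (population) variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

def offsets(contour, template):
    """Point-wise offsets from template points to contour points."""
    return [(x - tx, y - ty) for (x, y), (tx, ty) in zip(contour, template)]

def axis_stats(pairs):
    """Mean and variance of the offsets on the x and y axes."""
    mean_x, var_x = mean_var([p[0] for p in pairs])
    mean_y, var_y = mean_var([p[1] for p in pairs])
    return {"mean_x": mean_x, "mean_y": mean_y, "var_x": var_x, "var_y": var_y}

def circle_template(center, radius, n):
    """n points evenly spaced on a circle (hypothetical prior template)."""
    cx, cy = center
    return [(cx + radius * math.cos(2 * math.pi * i / n),
             cy + radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

# Toy contour: an ellipse sampled at 32 points.
N = 32
contour = [(100 + 40 * math.cos(2 * math.pi * i / N),
            60 + 20 * math.sin(2 * math.pi * i / N)) for i in range(N)]

# "Offset to original points": the template is the origin, so the offset
# is the coordinate itself.
coord_stats = axis_stats(offsets(contour, [(0.0, 0.0)] * N))

# "Offset from a circle": the template is a circle around the shape center.
circle_stats = axis_stats(offsets(contour, circle_template((100.0, 60.0), 30.0, N)))
```

An "offset from an outer box" scheme could be evaluated in the same way by replacing the template points with points on the bounding box.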
6.7 Statistics of COCO ContourHard-val
In this section, we give some statistics of the selected COCO ContourHard-val set. As mentioned in the main paper, 358 images are chosen, which contain 5054 shapes from 77 categories. In Table 8, we list the instance count for each category and, in brackets, the decrease from the original COCO val2017.
|person||1500 (-9277)||bicycle||34 (-280)||car||193 (-1725)|
|motorcycle||23 (-344)||airplane||27 (-116)||bus||13 (-270)|
|train||11 (-179)||truck||37 (-377)||boat||95 (-329)|
|traffic light||28 (-606)||fire hydrant||2 (-99)||stop sign||1 (-74)|
|parking meter||7 (-53)||bench||61 (-350)||bird||143 (-284)|
|cat||9 (-193)||dog||15 (-203)||horse||29 (-243)|
|sheep||71 (-283)||cow||32 (-340)||elephant||8 (-244)|
|bear||0 (-71)||zebra||15 (-251)||giraffe||10 (-222)|
|backpack||90 (-281)||umbrella||84 (-323)||handbag||118 (-422)|
|tie||30 (-222)||suitcase||34 (-265)||frisbee||7 (-108)|
|skis||76 (-165)||snowboard||14 (-55)||sports ball||9 (-251)|
|kite||160 (-167)||baseball bat||10 (-135)||baseball glove||9 (-139)|
|skateboard||39 (-140)||surfboard||28 (-239)||tennis racket||17 (-208)|
|bottle||99 (-914)||wine glass||47 (-294)||cup||156 (-739)|
|fork||42 (-173)||knife||64 (-261)||spoon||56 (-197)|
|bowl||109 (-514)||banana||20 (-350)||apple||26 (-210)|
|sandwich||18 (-159)||orange||39 (-246)||broccoli||12 (-300)|
|carrot||39 (-326)||hot dog||18 (-107)||pizza||39 (-245)|
|donut||44 (-284)||cake||24 (-286)||chair||412 (-1359)|
|couch||27 (-234)||potted plant||78 (-264)||bed||5 (-158)|
|dining table||153 (-542)||toilet||1 (-178)||tv||25 (-263)|
|laptop||20 (-211)||mouse||8 (-98)||remote||29 (-254)|
|keyboard||13 (-140)||cell phone||26 (-236)||microwave||5 (-50)|
|oven||14 (-129)||toaster||0 (-9)||sink||13 (-212)|
|refrigerator||13 (-113)||book||211 (-918)||clock||16 (-251)|
|vase||30 (-244)||scissors||6 (-30)||teddy bear||2 (-188)|
|hair drier||0 (-11)||toothbrush||1 (-56)||total||5054 (-31281)|
6.8 Broader Impact
Who may benefit from this research?
Instance segmentation may benefit various applications, such as autonomous driving and robot manipulation/navigation, especially since our proposed method can support real-time feedback when a proper base object detector is adopted.
Who may be put at a disadvantage from this research?
We cannot think of a case in which someone would be put at a disadvantage specifically because of instance segmentation methods. However, if a person on the sidewalk is misclassified or missed by the detector, they could be at risk from an autonomous vehicle.
What are the consequences of failure of the system?
There are three kinds of system failure: misclassification, wrong localization, and bad shape prediction.
Does the task/method leverage biases in the data?
We do not specifically leverage biases in the data, as the data and the metrics are standard for this problem.