Deep Snake for Real-Time Instance Segmentation

01/06/2020 · Sida Peng et al.

This paper introduces a novel contour-based approach named deep snake for real-time instance segmentation. Unlike some recent methods that directly regress the coordinates of the object boundary points from an image, deep snake uses a neural network to iteratively deform an initial contour to the object boundary, which implements the classic idea of snake algorithms with a learning-based approach. For structured feature learning on the contour, we propose to use circular convolution in deep snake, which better exploits the cycle-graph structure of a contour compared with the generic graph convolution. Based on deep snake, we develop a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation, which can handle errors in initial object localization. Experiments show that the proposed approach achieves state-of-the-art performances on the Cityscapes, Kins and Sbd datasets while being efficient for real-time instance segmentation, running at 32.3 fps for 512×512 images on a 1080 Ti GPU. The code will be available at https://github.com/zju3dv/snake/.


1 Introduction

Instance segmentation is the cornerstone of many computer vision tasks, such as video analysis, autonomous driving, and robotic grasping, which require both accuracy and efficiency. Most state-of-the-art instance segmentation methods [17, 25, 4, 18] perform pixel-wise segmentation within a bounding box given by an object detector [33], which makes them sensitive to inaccurate bounding boxes. Moreover, representing an object shape as dense binary pixels generally results in costly post-processing.

An alternative shape representation is the object contour, which is composed of a sequence of vertices along the object silhouette. In contrast to the pixel-based representation, a contour is not limited to a bounding box and has fewer parameters. Such contour-based representation has long been used in image segmentation since the seminal work by Kass et al. [20], which is well known as the snake algorithm or active contour model. Given an initial contour, the snake algorithm iteratively deforms it to the object boundary by optimizing an energy functional defined with low-level image features, such as image intensity or gradient. While many variants [5, 6, 14] have been developed in the literature, these methods tend to find local optimal solutions, as the objective functions are handcrafted and the optimization is usually nonlinear.

Figure 1: The basic idea of deep snake. Given an initial contour, image features are extracted at each vertex (a). Since the contour is a cycle graph, circular convolution is applied for feature learning on the contour (b). The blue, yellow and green nodes denote the input features, the kernel of circular convolution, and the output features, respectively. Finally, offsets are regressed at each vertex to deform the contour to the object boundary (c).

Some recent learning-based segmentation methods [19, 38] also represent objects as contours and try to directly regress the coordinates of object boundary points from an RGB image. Although such methods are much faster, they do not perform as well as pixel-based methods. Instead, Ling et al. [23] adopt the deformation pipeline of traditional snake algorithms and train a neural network to evolve an initial contour to the object boundary. Given a contour with image features, it regards the input contour as a graph and uses a graph convolutional network (GCN) to predict vertex-wise offsets between contour points and the target boundary points. It achieves competitive accuracy compared with pixel-based methods while being much faster. However, the method proposed in [23] is designed to help annotation and lacks a complete pipeline for automatic instance segmentation. Moreover, treating the contour as a general graph with a generic GCN does not fully exploit the special topology of a contour.

In this paper, we propose a learning-based snake algorithm, named deep snake, for real-time instance segmentation. Inspired by previous methods [20, 23], deep snake takes an initial contour as input and deforms it by regressing vertex-wise offsets. Our innovation is in introducing the circular convolution for efficient feature learning on a contour, as illustrated in Figure 1. We observe that the contour is a cycle graph that consists of a sequence of vertices connected in a closed cycle. Since every vertex has the same degree equal to two, we can apply the standard 1D convolution on the vertex features. Considering that the contour is periodic, deep snake introduces the circular convolution, which indicates that an aperiodic function (1D kernel) is convolved in the standard way with a periodic function (features defined on the contour). The kernel of circular convolution encodes not only the feature of each vertex but also the relationship among neighboring vertices. In contrast, the generic GCN performs pooling to aggregate information from neighboring vertices. The kernel function in our circular convolution amounts to a learnable aggregation function, which is more expressive and results in better performance than using a generic GCN, as demonstrated by our experimental results in Section 5.2.

Based on deep snake, we develop a pipeline for instance segmentation. Given an initial contour, deep snake can iteratively deform it to the object boundary and obtain the object shape. The remaining question is how to initialize a contour, whose importance has been demonstrated in classic snake algorithms. Inspired by [27, 41], we propose to generate an octagon formed by object extreme points as the initial contour, which generally encloses the object tightly. Specifically, we add deep snake to a detection model. The detected box first gives a diamond contour by connecting four points centered at its borders. Then deep snake takes the diamond as input and outputs offsets that point from four vertices to four extreme points, which are used to construct an octagon following [41]. Finally, deep snake deforms the octagon contour to the object boundary.

Our approach exhibits state-of-the-art performances on the Cityscapes [7], Kins [32] and Sbd [15] datasets, while being efficient for real-time instance segmentation, running at 32.3 fps for 512×512 images on a GTX 1080 Ti GPU. There are two reasons why the learning-based snake is fast while being accurate. First, our approach can deal with errors in object localization and thus allows a light detector. Second, an object contour has fewer parameters than the pixel-based representation and does not require costly post-processing, such as mask upsampling.

In summary, this work has the following contributions:

  • We propose a learning-based snake algorithm for real-time instance segmentation, which deforms an initial contour to the object boundary and introduces the circular convolution for feature learning on the contour.

  • We propose a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. Both stages can deal with errors in the initial object localization.

  • We demonstrate state-of-the-art performances of our approach on the Cityscapes, Kins and Sbd datasets. For 512×512 images, our algorithm runs at 32.3 fps, which is efficient for real-time instance segmentation.

2 Related work

Pixel-based methods.

Most methods [8, 22, 17, 25] perform instance segmentation on the pixel level within a region proposal, which works particularly well with standard CNNs. A representative instantiation is Mask R-CNN [17]. It first detects objects and then uses a mask predictor to segment instances within the proposed boxes. To better exploit the spatial information inside the box, PANet [25] fuses mask predictions from fully-connected layers and convolutional layers. Such proposal-based approaches achieve state-of-the-art performance. One limitation of these methods is that they cannot resolve errors in localization, such as too small or shifted boxes. In contrast, our approach deforms the detected boxes to the object boundaries, so the spatial extension of object shapes will not be limited.

There exist some pixel-based methods [2, 29, 26, 11, 39] that are free of region proposals. In these methods, every pixel produces auxiliary information, and a clustering algorithm then groups pixels into object instances based on this information. Both the auxiliary information and the grouping algorithms vary across methods. [2] predicts a boundary-aware energy for each pixel and uses the watershed transform algorithm for grouping. [29] differentiates instances by learning instance-level embeddings. [26, 11] consider the input image as a graph and regress pixel affinities, which are then processed by a graph merge algorithm. Since the mask is composed of dense pixels, the post-clustering algorithms tend to be time-consuming.

Contour-based methods.

In these methods, the object shape comprises a sequence of vertices along the object boundary. Traditional snake algorithms [20, 5, 6, 14] first introduced the contour-based representation for image segmentation. They deform an initial contour to the object boundary by optimizing a handcrafted energy with respect to the contour coordinates. To improve the robustness of these methods, [28] proposed to learn the energy function in a data-driven manner. Instead of iteratively optimizing the contour, some recent learning-based methods [19, 38] try to regress the coordinates of contour points directly from an RGB image, which is much faster. However, their accuracy is not competitive with state-of-the-art pixel-based methods.

In the field of semi-automatic annotation, [3, 1, 23] have performed contour labeling with networks other than standard CNNs. [3, 1] predict the contour points sequentially using a recurrent neural network. To avoid sequential inference, [23] follows the pipeline of snake algorithms and uses a graph convolutional network to predict vertex-wise offsets for contour deformation. This strategy significantly improves the annotation speed while being as accurate as pixel-based methods. However, [23] lacks a pipeline for instance segmentation and does not fully exploit the special topology of a contour. Instead of treating the contour as a general graph, deep snake leverages the cycle-graph topology and introduces the circular convolution for efficient feature learning on a contour.

3 Proposed approach

Inspired by [20, 23], we perform object segmentation by deforming an initial contour to the object boundary. Specifically, deep snake takes as input a contour together with image features from a CNN backbone and predicts per-vertex offsets pointing to the object boundary. To fully exploit the contour topology, we introduce the circular convolution for efficient feature learning on the contour, which helps deep snake learn the deformation. Based on deep snake, a pipeline is developed for instance segmentation.

3.1 Learning-based snake algorithm

Given an initial contour, traditional snake algorithms treat the coordinates of the vertices as a set of variables and optimize an energy functional with respect to these variables. By designing proper image forces at the contour coordinates, active contour models can optimize the contour to the object boundary. However, since the energy functional is typically nonconvex and handcrafted based on low-level image features, the deformation process tends to find local optimal solutions.

In contrast, deep snake directly learns to evolve the contour from data in an end-to-end manner. Given a contour with $N$ vertices $\{\mathbf{x}_i \,|\, i = 1, \dots, N\}$, we first construct a feature vector for each vertex. The input feature $f_i$ for a vertex $\mathbf{x}_i$ is a concatenation of learning-based features and the vertex coordinate: $[F(\mathbf{x}_i); \mathbf{x}'_i]$, where $F$ is the feature map and $\mathbf{x}'_i$ is a translation-invariant version of vertex $\mathbf{x}_i$. The feature map $F$ is obtained by applying a CNN backbone to the input image, which deep snake shares with the detector in our instance segmentation model. The image feature $F(\mathbf{x}_i)$ is computed using bilinear interpolation of the features at the vertex coordinate $\mathbf{x}_i$. The appended vertex coordinate is used to model the spatial relationship among contour vertices. Since the deformation should not be affected by the absolute location of the contour, we compute the translation-invariant coordinate $\mathbf{x}'_i$ by subtracting the minimum value along the $x$ and $y$ axes over all vertices, respectively.
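To make this concrete, below is a minimal PyTorch sketch of the per-vertex input feature construction; the helper name `vertex_features`, the batch layout, and the assumption that vertex coordinates are given in the feature map's pixel space are ours, not the paper's.

```python
import torch
import torch.nn.functional as nnf

def vertex_features(feature_map, contour, h, w):
    """feature_map: (1, C, H, W); contour: (N, 2) vertex coords as (x, y)
    in the feature map's pixel space. Returns (N, C + 2) input features."""
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    grid = contour.clone()
    grid[:, 0] = grid[:, 0] / (w - 1) * 2 - 1
    grid[:, 1] = grid[:, 1] / (h - 1) * 2 - 1
    grid = grid.view(1, 1, -1, 2)                       # (1, 1, N, 2)
    # Bilinear interpolation of image features at the vertex coordinates.
    sampled = nnf.grid_sample(feature_map, grid, align_corners=True)
    sampled = sampled.squeeze(0).squeeze(1).t()         # (N, C)
    # Translation-invariant coordinates: subtract the per-axis minimum.
    rel = contour - contour.min(dim=0).values
    return torch.cat([sampled, rel], dim=1)             # (N, C + 2)
```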

Figure 2: Circular Convolution. The blue nodes are the input features defined on a contour, the yellow nodes represent the kernel function, and the green nodes are the output features. The highlighted green node is the inner product between the kernel function and the highlighted blue nodes, which is the same as the standard convolution. The output features of circular convolution have the same length as the input features.

Given the input features defined on a contour, deep snake introduces the circular convolution for feature learning, as illustrated in Figure 2. In general, the features of contour vertices can be treated as a 1-D discrete signal and processed by the standard convolution, but this breaks the topology of the contour. Therefore, we treat the features on the contour as a periodic signal defined as:

$$(f_N)_i \triangleq \sum_{j=-\infty}^{\infty} f_{i - jN}, \tag{1}$$

and propose to encode the periodic features by the circular convolution defined as:

$$(f_N * k)_i = \sum_{j=-r}^{r} (f_N)_{i+j} \, k_j, \tag{2}$$

where $k: [-r, r] \to \mathbb{R}^D$ is a learnable kernel function and the operator $*$ is the standard convolution.

Similar to the standard convolution, we can construct a network layer based on the circular convolution for feature learning, which is easy to integrate into modern network architectures. After the feature learning, deep snake applies three 1×1 convolution layers to the output features of each vertex and predicts vertex-wise offsets between contour points and the target points, which are used to deform the contour. In all experiments, the kernel size of the circular convolution is fixed to nine.
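A layer implementing Eq. (2) can be sketched in PyTorch by circularly padding the vertex sequence before a standard 1D convolution; the class name `CircConv` and the argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CircConv(nn.Module):
    """Circular convolution on contour features, a sketch of Eq. (2).
    Features are laid out as (batch, channels, N vertices); wrapping the
    sequence before a standard Conv1d makes the convolution periodic."""

    def __init__(self, in_dim, out_dim, n_adj=4):
        super().__init__()
        self.n_adj = n_adj                  # r in Eq. (2); kernel size 2r+1 = 9
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=2 * n_adj + 1)

    def forward(self, x):
        # Wrap the last r vertices to the front and the first r to the back,
        # so the index i + j in Eq. (2) is effectively taken modulo N.
        x = torch.cat([x[..., -self.n_adj:], x, x[..., :self.n_adj]], dim=-1)
        return self.conv(x)
```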

As discussed in the introduction, the proposed circular convolution better exploits the circular structure of the contour than the generic graph convolution. We will show the experimental comparison in Section 5.2. An alternative method is to use standard CNNs to regress a pixel-wise vector field from the input image to guide the evolution of the initial contour [34, 30, 37]. We argue that an important advantage of deep snake over the standard CNNs is the object-level structured prediction, i.e., the offset prediction at a vertex depends on other vertices of the same contour. Therefore, it is more reasonable for deep snake to predict an offset for a vertex located in the background and far from the object, which is very common in an initial contour. Standard CNNs have difficulty in outputting meaningful offsets in this case, since it is ambiguous to decide which object a background pixel belongs to.

Figure 3: Proposed contour-based model for instance segmentation. (a) Deep snake consists of three parts: a backbone, a fusion block, and a prediction head. It takes a contour as input and outputs vertex-wise offsets to deform the contour. (b) Based on deep snake, we propose a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. The box proposed by the detector gives a diamond contour, whose four vertices are then deformed to object extreme points by deep snake. An octagon is constructed based on the extreme points. Taking the octagon as the initial contour, deep snake iteratively deforms it to the object boundary.

Network architecture.

Figure 3(a) shows the detailed schematic. Following ideas from [31, 36, 21], deep snake consists of three parts: a backbone, a fusion block, and a prediction head. The backbone is comprised of 8 “CirConv-Bn-ReLU” layers and uses residual skip connections for all layers, where “CirConv” means circular convolution. The fusion block aims to fuse the information across all contour points at multiple scales. It concatenates features from all layers in the backbone and forwards them through a 1×1 convolution layer followed by max pooling. The fused feature is then concatenated with the feature of each vertex. The prediction head applies three 1×1 convolution layers to the vertex features and outputs vertex-wise offsets.
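Putting the three parts together, a sketch of the architecture could look as follows, reusing the `CircConv` module above; the channel widths (128-d backbone, 256-d fusion) and the 66-d input (e.g., 64 image-feature channels plus 2 coordinates) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """One "CirConv-Bn-ReLU" backbone layer built on the CircConv sketch."""
    def __init__(self, dim):
        super().__init__()
        self.conv = CircConv(dim, dim)
        self.norm = nn.BatchNorm1d(dim)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.norm(self.conv(x)))

class DeepSnake(nn.Module):
    def __init__(self, feat_dim=66, dim=128, num_layers=8):
        super().__init__()
        self.head_in = CircConv(feat_dim, dim)
        self.blocks = nn.ModuleList([BasicBlock(dim) for _ in range(num_layers)])
        # Fusion block: concatenate features of all layers, then 1x1 conv
        # and global max pooling across the contour points.
        self.fuse = nn.Conv1d(dim * (num_layers + 1), 256, kernel_size=1)
        # Prediction head: three 1x1 convolutions regressing 2D offsets.
        self.pred = nn.Sequential(
            nn.Conv1d(dim * (num_layers + 1) + 256, 256, 1), nn.ReLU(inplace=True),
            nn.Conv1d(256, 64, 1), nn.ReLU(inplace=True),
            nn.Conv1d(64, 2, 1))

    def forward(self, x):                    # x: (B, feat_dim, N)
        states = [self.head_in(x)]
        for block in self.blocks:            # residual skip connections
            states.append(block(states[-1]) + states[-1])
        multi = torch.cat(states, dim=1)     # multi-scale features, (B, dim*(L+1), N)
        global_feat = self.fuse(multi).max(dim=2, keepdim=True).values
        global_feat = global_feat.expand(-1, -1, multi.size(2))
        return self.pred(torch.cat([multi, global_feat], dim=1))  # (B, 2, N)
```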

3.2 Deep snake for instance segmentation

Figure 3(b) overviews the proposed pipeline for instance segmentation. We add deep snake to an object detection model. The detector first produces object boxes that are used to construct diamond contours. Then deep snake deforms the diamond vertices to object extreme points, which are used to construct octagon contours. Finally, our approach takes octagons as initial contours and performs iterative contour deformation to obtain the object shape.

Initial contour proposal.

Most active contour models require precise initial contours. Since the octagon proposed in [41] generally tightly encloses the object, we choose it as the initial contour, as shown in Figure 3(b). This octagon is formed by four extreme points, which are the top, leftmost, bottom, and rightmost pixels of an object, respectively, denoted by $\{\mathbf{x}^{ex}_i \,|\, i = 1, 2, 3, 4\}$. Given a detected object box, we extract four points centered at the top, left, bottom, and right box borders, denoted by $\{\mathbf{x}^{bb}_i \,|\, i = 1, 2, 3, 4\}$, and then connect them to obtain a diamond contour. Deep snake takes this contour as input and outputs offsets that point from each vertex $\mathbf{x}^{bb}_i$ to the extreme point $\mathbf{x}^{ex}_i$, namely $\mathbf{x}^{ex}_i - \mathbf{x}^{bb}_i$. In practice, to take in more context information, the diamond contour is uniformly upsampled to 40 points, and deep snake correspondingly outputs 40 offsets. The loss function only considers the offsets at the four vertices $\mathbf{x}^{bb}_i$.

We construct the octagon by generating four lines based on the extreme points and connecting their endpoints. Specifically, the four extreme points form a new object box. For each extreme point, a line extends from it along the corresponding box border in both directions to 1/4 of the border length, and the line is truncated if it meets a box corner. The endpoints of the four lines are then connected to form the octagon.
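The construction can be sketched as follows, assuming extreme points given in (top, left, bottom, right) order as (x, y) pixel coordinates; the traversal order of the eight endpoints is one reasonable choice, not prescribed by the paper.

```python
import numpy as np

def octagon_from_extremes(ex):
    """ex: (4, 2) array of (top, left, bottom, right) extreme points (x, y).
    Returns the 8 octagon vertices described above, traversed clockwise."""
    top, left, bottom, right = ex
    x_min, x_max = left[0], right[0]
    y_min, y_max = top[1], bottom[1]
    w, h = x_max - x_min, y_max - y_min
    clip = lambda v, lo, hi: np.clip(v, lo, hi)   # truncate lines at box corners
    # Extend a line from each extreme point along its box border in both
    # directions to 1/4 of the border length, then connect the endpoints.
    pts = [(clip(top[0] - w / 4, x_min, x_max), y_min),
           (clip(top[0] + w / 4, x_min, x_max), y_min),
           (x_max, clip(right[1] - h / 4, y_min, y_max)),
           (x_max, clip(right[1] + h / 4, y_min, y_max)),
           (clip(bottom[0] + w / 4, x_min, x_max), y_max),
           (clip(bottom[0] - w / 4, x_min, x_max), y_max),
           (x_min, clip(left[1] + h / 4, y_min, y_max)),
           (x_min, clip(left[1] - h / 4, y_min, y_max))]
    return np.array(pts, dtype=np.float32)
```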

Contour deformation.

We first uniformly sample $N$ points along the octagon contour, starting from the top extreme point $\mathbf{x}^{ex}_1$. Similarly, the ground-truth contour is generated by uniformly sampling $N$ vertices along the object boundary, with its first vertex assigned as the one nearest to $\mathbf{x}^{ex}_1$. Deep snake takes the initial contour as input and outputs $N$ offsets that point from each vertex to the target boundary point. We set $N$ to 128 in all experiments, which can uniformly cover most object shapes.

However, regressing the offsets in one pass is challenging, especially for vertices far away from the object. Inspired by [20, 23, 35], we deal with this problem in an iterative optimization fashion. Specifically, our approach first predicts offsets based on the current contour and then deforms the contour by adding the offsets to its vertex coordinates. The deformed contour can be used for the next deformation or directly output as the object shape. In experiments, the number of inference iterations is set to 3 unless otherwise stated.
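With the pieces above, the iterative inference can be sketched in a few lines; `DeepSnake` and `vertex_features` refer to the earlier sketches.

```python
import torch

@torch.no_grad()
def deform(snake, feature_map, contour, h, w, iterations=3):
    """Iteratively deform a contour of N vertices to the object boundary."""
    for _ in range(iterations):
        feats = vertex_features(feature_map, contour, h, w)  # (N, C + 2)
        offsets = snake(feats.t().unsqueeze(0))              # (1, 2, N)
        contour = contour + offsets.squeeze(0).t()           # vertex-wise addition
    return contour
```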

Note that the contour is an alternative representation for the spatial extension of an object. By deforming the initial contour to the object boundary, our approach could resolve the localization errors from the detector.

Figure 4: Given an object box, we perform RoIAlign to obtain the feature map and use a detector to detect the component boxes.

Handling multi-component objects.

Due to occlusions, many instances comprise more than one connected component. However, a contour can only outline one connected component per bounding box. To overcome this problem, we propose to detect the object components within the object box. Specifically, using the detected box, our approach performs RoIAlign [17] to extract a feature map and adds a detector branch on the feature map to produce the component boxes. Figure 4 shows the basic idea. The subsequent segmentation pipeline remains the same. Our approach obtains the final object shape by merging component contours from the same object box.
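A sketch of this step using torchvision's `roi_align`; the `component_head` module stands in for the class-agnostic CenterNet branch described in Section 4 and is hypothetical, as is the 28×28 RoI size.

```python
import torch
from torchvision.ops import roi_align

def detect_components(feature_map, boxes, component_head, output_size=(28, 28)):
    """feature_map: (1, C, H, W); boxes: (K, 4) detected object boxes given
    as (x1, y1, x2, y2) in feature-map coordinates."""
    # roi_align expects (K, 5) rois whose first column is the batch index.
    rois = torch.cat([boxes.new_zeros(len(boxes), 1), boxes], dim=1)
    roi_feats = roi_align(feature_map, rois, output_size)  # (K, C, 28, 28)
    # The component head predicts component boxes within each object box.
    return component_head(roi_feats)
```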

4 Implementation details

Training strategy.

For the training of deep snake, we use the smooth $\ell_1$ loss proposed in [13] to learn the two deformation processes. The loss function for extreme point prediction is defined as

$$L_{ex} = \frac{1}{4} \sum_{i=1}^{4} \ell_1\!\left(\tilde{\mathbf{x}}^{ex}_i - \mathbf{x}^{ex}_i\right), \tag{3}$$

where $\tilde{\mathbf{x}}^{ex}_i$ is the predicted extreme point. The loss function for iterative contour deformation is defined as

$$L_{iter} = \frac{1}{N} \sum_{i=1}^{N} \ell_1\!\left(\tilde{\mathbf{x}}_i - \mathbf{x}^{gt}_i\right), \tag{4}$$

where $\tilde{\mathbf{x}}_i$ is the deformed contour point and $\mathbf{x}^{gt}_i$ is the ground-truth boundary point. For the detection part, we adopt the same loss function as the original detection model. The training details change with datasets and will be described in Section 5.3.
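Both losses map directly onto PyTorch's built-in smooth L1; a minimal sketch (with `reduction='mean'`, the average also runs over the two coordinates, which matches Eqs. (3) and (4) up to a constant factor):

```python
import torch.nn.functional as nnf

def extreme_point_loss(pred_ex, gt_ex):
    """Eq. (3): smooth l1 between the 4 predicted and ground-truth extreme
    points, each given as a (4, 2) tensor of (x, y) coordinates."""
    return nnf.smooth_l1_loss(pred_ex, gt_ex)

def deformation_loss(pred_contour, gt_contour):
    """Eq. (4): smooth l1 between the N deformed vertices and the
    ground-truth boundary points, each given as an (N, 2) tensor."""
    return nnf.smooth_l1_loss(pred_contour, gt_contour)
```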

Detector.

We adopt CenterNet [40] as the detector for all experiments. CenterNet reformulates the detection task as a keypoint detection problem and achieves an impressive trade-off between speed and accuracy. For the object box detector, we adopt the same setting as [40], which outputs class-specific boxes. For the component box detector, a class-agnostic CenterNet is adopted. Specifically, given an $H \times W \times C$ feature map, the class-agnostic CenterNet outputs an $H \times W \times 1$ tensor representing the component center and an $H \times W \times 2$ tensor representing the box size.
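A minimal sketch of such a class-agnostic head; the module name and its internals are assumptions that only mirror the output tensor shapes described above.

```python
import torch.nn as nn

class ComponentHead(nn.Module):
    """Given a (K, C, H, W) RoI feature map, output a (K, 1, H, W) center
    heatmap and a (K, 2, H, W) map of component box sizes."""
    def __init__(self, c):
        super().__init__()
        self.center = nn.Conv2d(c, 1, kernel_size=1)  # component center heatmap
        self.size = nn.Conv2d(c, 2, kernel_size=1)    # box width and height
    def forward(self, x):
        return self.center(x).sigmoid(), self.size(x)
```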

5 Experiments

We compare our approach with the state-of-the-art methods on the Cityscapes [7], Kins [32] and Sbd [15] datasets. Comprehensive ablation studies are conducted to analyze the importance of the proposed components.

5.1 Datasets and Metrics

Cityscapes [7]

is a widely used benchmark for urban scene instance segmentation. It contains 2,975 training, 500 validation and 1,525 testing images with high quality annotations. Besides, it has 20k images with coarse annotations. This dataset is challenging due to the crowded scenes and the wide range of object scales. The performance is evaluated in terms of the average precision (AP) metric averaged over 8 semantic classes of the dataset. We report our results on the validation and test sets.

Kins [32]

was recently created by additionally annotating the Kitti [12] dataset with instance-level semantic annotation. This dataset is used for amodal instance segmentation, which is a variant of instance segmentation and aims to recover complete instance shapes even under occlusion. Kins consists of 7,474 training images and 7,517 testing images. Following its setting, we evaluate our approach on 7 object categories in terms of the AP metric.

Sbd [15]

re-annotates images from the Pascal Voc [9] dataset with instance-level boundaries and has the same 20 object categories. The reason that we do not directly perform experiments on Pascal Voc is that its annotations contain holes, which is not suitable for contour-based methods. The Sbd dataset is split into 5,623 training images and 5,732 testing images. We report our results in terms of the 2010 Voc AP_vol [16], AP_50 and AP_70 metrics, where AP_vol is the average of AP over 9 IoU thresholds from 0.1 to 0.9.

5.2 Ablation studies

We conduct ablation studies on the Sbd dataset with the consideration that it has 20 semantic categories and could fully evaluate the ability to deform various object contours. The three proposed components are evaluated, including our network architecture, initial contour proposal, and circular convolution. In these experiments, the detector and deep snake are trained end-to-end for 160 epochs with multi-scale data augmentation. The learning rate starts from 1e-4 and decays by half at 80 and 120 epochs. Table 1 summarizes the results of the ablation studies.

                        AP_vol  AP_50  AP_70
Baseline                50.9    58.8   43.5
+ Architecture          52.3    59.7   46.0
+ Initial proposal      53.6    61.1   47.6
+ Circular convolution  54.4    62.1   48.3
Table 1: Ablation studies on the Sbd val set. The baseline is a direct combination of Curve-gcn [23] and CenterNet [40]. The second model keeps the graph convolution and replaces the network architecture with our proposed one, which yields a 1.4 AP improvement. Then we add the initial contour proposal before contour deformation, which improves AP by 1.3. The fourth row shows that replacing graph convolution with circular convolution further yields a 0.8 AP improvement.
                      Iter. 1  Iter. 2  Iter. 3
Graph convolution     50.2     51.5     53.6
Circular convolution  50.6     54.2     54.4
Table 2: Comparison between graph and circular convolution on the Sbd val set. The results are in terms of the AP_vol metric. Graph and circular convolution denote the convolution operator used in the network, and the columns show the results for different numbers of inference iterations. Circular convolution outperforms graph convolution across all inference iterations. Furthermore, circular convolution with two iterations outperforms graph convolution with three iterations by 0.6 AP, indicating a stronger deforming ability.

The row “Baseline” lists the result of a direct combination of Curve-gcn [23] with CenterNet [40]. Specifically, the detector produces object boxes, each of which gives an ellipse around the object. The ellipses are then deformed towards the object boundaries through Graph-ResNet. Note that the baseline represents the contour as a graph and uses a graph convolutional network for contour deformation.

To validate our designed network, the model in the second row keeps the convolution operator as graph convolution and replaces Graph-ResNet with our proposed architecture, which yields 1.4 AP improvement. The main difference between the two networks is that our architecture appends a global fusion block before the prediction head.

When exploring the influence of the contour initialization, we add the initial contour proposal before the contour deformation. Instead of directly using the ellipse, the proposal step generates an octagon initialization by predicting four object extreme points, which not only resolves the detection errors but also encloses the object more tightly. The comparison between the second and the third row shows a 1.3 improvement in terms of AP.

Figure 5: Comparison between graph convolution (top) and circular convolution (bottom) on Sbd. The result of circular convolution with two iterations is visually better than that of graph convolution with three iterations.

Finally, the graph convolution is replaced with the circular convolution, which achieves a 0.8 AP improvement. To fully validate the importance of circular convolution, we further compare models with different convolution operators and different inference iterations, as shown in Table 2. Circular convolution outperforms graph convolution across all inference iterations, and circular convolution with two iterations outperforms graph convolution with three iterations by 0.6 AP. Figure 5 shows qualitative results of graph and circular convolution on Sbd, where circular convolution gives a sharper boundary. The quantitative and qualitative results indicate that models with the circular convolution have a stronger ability to deform contours.

5.3 Comparison with the state-of-the-art methods

Figure 6: Qualitative results on Cityscapes test and Kins test sets. The first two rows show the results on Cityscapes, and the last row lists the results on Kins. Note that the results on Kins are for amodal instance segmentation.
                   training data   AP [val]  AP    AP_50  person  rider  car   truck  bus   train  mcycle  bicycle
SGN [24]           fine + coarse   29.2      25.0  44.9   21.8    20.1   39.4  24.8   33.2  30.8   17.7    12.4
PolygonRNN++ [1]   fine            -         25.5  45.5   29.4    21.8   48.3  21.1   32.3  23.7   13.6    13.6
Mask R-CNN [17]    fine            31.5      26.2  49.9   30.5    23.7   46.9  22.8   32.2  18.6   19.1    16.0
GMIS [26]          fine + coarse   -         27.6  49.6   29.3    24.1   42.7  25.4   37.2  32.9   17.6    11.9
Spatial [29]       fine            -         27.6  50.9   34.5    26.1   52.4  21.7   31.2  16.4   20.1    18.9
PANet [25]         fine            36.5      31.8  57.1   36.8    30.4   54.8  27.0   36.3  25.5   22.6    20.8
Deep snake         fine            37.4      31.7  58.4   37.2    27.0   56.0  29.5   40.5  28.2   19.0    16.4
Table 3: Results on Cityscapes val (“AP [val]” column) and test (remaining columns) sets. Our approach achieves state-of-the-art performance, outperforming PANet [25] by 0.9 AP on the val set and 1.3 AP_50 on the test set. According to the timing result in [29], our approach is approximately 5 times faster than PANet.

Performance on Cityscapes.

Since fragmented instances are very common in Cityscapes, we adopt the proposed strategy to handle multi-component objects. Our network is trained with multi-scale data augmentation and tested at a single resolution of 1216×2432. No testing tricks are used. The detector is first trained alone for 140 epochs, with the learning rate starting from 1e-4 and dropping by half at 80 and 120 epochs. Then the detection and snake branches are trained end-to-end for 200 epochs, with the learning rate starting from 1e-4 and dropping by half at 80, 120 and 150 epochs. We choose the model that performs best on the validation set.

Table 3 compares our results with other state-of-the-art methods on the Cityscapes validation and test sets. All methods are tested without tricks. Using only the fine annotations, our approach achieves state-of-the-art performances on both validation and test sets. We outperform PANet by 0.9 AP on the validation set and 1.3 AP_50 on the test set. According to the approximate timing result in [29], PANet runs at less than 1.0 fps. In contrast, our model runs at 4.6 fps for 1216×2432 images on a 1080 Ti GPU, which is about 5 times faster. Our approach achieves 28.2 AP on the test set when the strategy of handling multi-component objects is not adopted. Visual results are shown in Figure 6.

                 detection  amodal seg  inmodal seg
MNC [8]          20.9       18.5        16.1
FCIS [22]        25.6       23.5        20.8
ORCNN [10]       30.9       29.0        26.4
Mask R-CNN [17]  31.1       29.2        -
Mask R-CNN [17]  31.3       29.3        26.6
PANet [25]       32.3       30.4        27.6
Deep snake       32.8       31.3        -
Table 4: Results on the Kins test set in terms of the AP metric. The amodal bounding box is used as the ground truth in the detection task. “-” means no such output in the corresponding method.

Performance on Kins.

As a dataset for amodal instance segmentation, objects in the Kins dataset are all annotated as single connected components, so the strategy of handling multi-component objects is not adopted. We train the detector and snake branches end-to-end for 150 epochs. The learning rate starts from 1e-4 and decays with 0.5 and 0.1 at 80 and 120 epochs, respectively. We perform multi-scale training and test the model at a single resolution of 768×2496.

Table 4 shows the comparison with [8, 22, 10, 17, 25] on the Kins dataset in terms of the AP metric. Kins [32] indicates that tackling both amodal and inmodal segmentation simultaneously can improve the performance, as shown in the fourth and fifth rows of Table 4. Our approach learns only the amodal segmentation task and achieves the best performance across all methods. We find that the snake branch can improve the detection performance. When CenterNet is trained alone, it obtains 30.5 AP on detection. When trained with the snake branch, its detection performance improves by 2.3 AP. For 768×2496 images on the Kins dataset, our approach runs at 7.6 fps on a 1080 Ti GPU. Figure 6 shows some qualitative results on Kins.

Figure 7: Qualitative results on Sbd val set. Our approach handles errors in object localization in most cases. For example, in the first image, although the detected boxes do not fully cover the boys, our approach recovers the complete object shapes. Zoom in for details.
             AP_vol  AP_50  AP_70
STS [19]     29.0    30.0   6.5
ESE-50 [38]  32.6    39.1   10.5
ESE-20 [38]  35.3    40.7   12.1
Deep snake   54.4    62.1   48.3
Table 5: Results on the Sbd val set. Our approach outperforms other contour-based methods by a large margin. The improvement increases with the IoU threshold: 21.4 AP_50 and 36.2 AP_70.

Performance on Sbd.

Most objects on the Sbd dataset are connected as a single component, so we do not handle fragmented instances. For multi-component objects, our approach detects their components separately instead of detecting the whole object. We train the detection and snake branches end-to-end for 150 epochs with multi-scale data augmentation. The learning rate starts from 1e-4 and drops by half at 80 and 120 epochs. The network is tested at a single scale of 512×512.

In Table 5, we compare with other contour-based methods [19, 38] on the Sbd dataset in terms of the Voc AP metrics. [19, 38] predict the object contours by regressing shape vectors. STS [19] defines the object contour as a radial vector from the object center, and ESE [38] approximates the object contour with 20 and 50 coefficients of the Chebyshev polynomial. In contrast, our approach deforms an initial contour to the object boundary. We outperform these methods by a large margin of at least 19.1 AP_vol. Notably, our approach yields 21.4 AP_50 and 36.2 AP_70 improvements, showing that the improvement increases with the IoU threshold. This indicates that our algorithm better outlines object boundaries. For 512×512 images on the Sbd dataset, our approach runs at 32.3 fps on a 1080 Ti. Some qualitative results are illustrated in Figure 7.

5.4 Running time

method     MNC  FCIS  MS   STS   ESE   OURS
time (ms)  360  160   180  27    26    31
fps        2.8  6.3   5.6  37.0  38.5  32.3
Table 6: Running time on the Pascal Voc dataset. “MS” represents Mask R-CNN [17], and “OURS” represents our approach. The last three methods are contour-based methods.

Table 6 compares our approach with other methods [8, 22, 17, 19, 38] in terms of running time on the Pascal Voc dataset. Since the Sbd dataset shares images with Pascal Voc and has the same semantic categories, the running time on the Sbd dataset is technically the same as the one on Pascal Voc. We obtain the running time of other methods on Pascal Voc from [38].

For 512×512 images on the Sbd dataset, our algorithm runs at 32.3 fps on a desktop with an Intel i7 3.7 GHz CPU and a GTX 1080 Ti GPU, which is efficient for real-time instance segmentation. Specifically, CenterNet takes 18.4 ms, the initial contour proposal takes 3.1 ms, and each iteration of contour deformation takes 3.3 ms. Since our approach outputs the object boundary directly, no post-processing like mask upsampling is required. If the strategy of handling fragmented instances is adopted, the detector additionally takes 3.6 ms.

6 Conclusion

We introduced a new contour-based model for real-time instance segmentation. Inspired by traditional snake algorithms, our approach deforms an initial contour to the object boundary and obtains the object shape. To this end, we proposed a learning-based snake algorithm, named deep snake, which introduces the circular convolution for efficient feature learning on the contour and regresses vertex-wise offsets for the contour deformation. Based on deep snake, we developed a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. We showed that this pipeline achieves better performance than directly regressing the coordinates of object boundary points. We also showed that the circular convolution learns the structural information of the contour more effectively than the graph convolution. To overcome the limitation that a contour can only outline one connected component, we proposed to detect the object components within the object box and demonstrated the effectiveness of this strategy on Cityscapes. The proposed model achieved state-of-the-art results on the Cityscapes, Kins and Sbd datasets while running in real time.

References

  • [1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR, 2018.
  • [2] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
  • [3] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In CVPR, 2017.
  • [4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
  • [5] Laurent D Cohen. On active contour models and balloons. CVGIP: Image understanding, 53(2):211–218, 1991.
  • [6] Timothy F Cootes, Christopher J Taylor, David H Cooper, and Jim Graham. Active shape models-their training and application. CVIU, 61(1):38–59, 1995.
  • [7] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [8] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
  • [9] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
  • [10] Patrick Follmann, Rebecca König, Philipp Härtinger, Michael Klostermann, and Tobias Böttger. Learning to see the invisible: End-to-end trainable amodal instance segmentation. In WACV, 2019.
  • [11] Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. Ssap: Single-shot instance segmentation with affinity pyramid. In ICCV, 2019.
  • [12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 32(11):1231–1237, 2013.
  • [13] Ross Girshick. Fast r-cnn. In ICCV, 2015.
  • [14] Steve R Gunn and Mark S Nixon. A robust snake implementation; a dual active contour. PAMI, 19(1):63–68, 1997.
  • [15] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.
  • [16] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In ECCV, 2014.
  • [17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
  • [18] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In CVPR, 2019.
  • [19] Saumya Jetley, Michael Sapienza, Stuart Golodetz, and Philip HS Torr. Straight to shapes: Real-time detection of encoded shapes. In CVPR, 2017.
  • [20] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321–331, 1988.
  • [21] Guohao Li, Matthias Müller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In ICCV, 2019.
  • [22] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
  • [23] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with curve-gcn. In CVPR, 2019.
  • [24] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn: Sequential grouping networks for instance segmentation. In ICCV, 2017.
  • [25] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
  • [26] Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, and Yan Lu. Affinity derivation and graph merge for instance segmentation. In ECCV, 2018.
  • [27] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. In CVPR, 2018.
  • [28] Diego Marcos, Devis Tuia, Benjamin Kellenberger, Lisa Zhang, Min Bai, Renjie Liao, and Raquel Urtasun. Learning deep structured active contours end-to-end. In CVPR, 2018.
  • [29] Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, 2019.
  • [30] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In CVPR, 2019.
  • [31] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
  • [32] Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Amodal instance segmentation with kins dataset. In CVPR, 2019.
  • [33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [34] Christian Rupprecht, Elizabeth Huaroc, Maximilian Baust, and Nassir Navab. Deep active contours. arXiv preprint arXiv:1607.05074, 2016.
  • [35] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018.
  • [36] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. TOG, 2018.
  • [37] Zian Wang, David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Object instance annotation with deep extreme level set evolution. In CVPR, 2019.
  • [38] Wenqiang Xu, Haiyang Wang, Fubo Qi, and Cewu Lu. Explicit shape encoding for real-time instance segmentation. In ICCV, 2019.
  • [39] Ze Yang, Yinghao Xu, Han Xue, Zheng Zhang, Raquel Urtasun, Liwei Wang, Stephen Lin, and Han Hu. Dense reppoints: Representing visual objects with dense point sets. arXiv preprint arXiv:1912.11473, 2019.
  • [40] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [41] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.