Code for "Deep Snake for Real-Time Instance Segmentation" CVPR 2020 oral
This paper introduces a novel contour-based approach named deep snake for real-time instance segmentation. Unlike some recent methods that directly regress the coordinates of the object boundary points from an image, deep snake uses a neural network to iteratively deform an initial contour to the object boundary, which implements the classic idea of snake algorithms with a learning-based approach. For structured feature learning on the contour, we propose to use circular convolution in deep snake, which better exploits the cycle-graph structure of a contour compared against generic graph convolution. Based on deep snake, we develop a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation, which can handle errors in initial object localization. Experiments show that the proposed approach achieves state-of-the-art performances on the Cityscapes, Kins and Sbd datasets while being efficient for real-time instance segmentation, 32.3 fps for 512×512 images on a 1080Ti GPU. The code will be available at https://github.com/zju3dv/snake/.
Instance segmentation is the cornerstone of many computer vision tasks, such as video analysis, autonomous driving, and robotic grasping, which require both accuracy and efficiency. Most of the state-of-the-art instance segmentation methods [17, 25, 4, 18] perform pixel-wise segmentation within a bounding box given by an object detector, which makes them sensitive to inaccurate bounding boxes. Moreover, representing an object shape as dense binary pixels generally results in costly post-processing.
An alternative shape representation is the object contour, which is composed of a sequence of vertices along the object silhouette. In contrast to the pixel-based representation, a contour is not limited to a bounding box and has fewer parameters. Such a contour-based representation has long been used in image segmentation since the seminal work by Kass et al., which is well known as the snake algorithm or active contour model. Given an initial contour, the snake algorithm iteratively deforms it to the object boundary by optimizing an energy functional defined with low-level image features, such as image intensity or gradient. While many variants [5, 6, 14] have been developed in the literature, these methods tend to find locally optimal solutions because the objective functions are handcrafted and the optimization is usually nonlinear.
Some recent learning-based segmentation methods [19, 38] also represent objects as contours and try to directly regress the coordinates of object boundary points from an RGB image. Although such methods are much faster, they do not perform as well as pixel-based methods. Instead, Ling et al. adopt the deformation pipeline of traditional snake algorithms and train a neural network to evolve an initial contour to the object boundary. Given a contour with image features, their method regards the input contour as a graph and uses a graph convolutional network (GCN) to predict vertex-wise offsets between contour points and the target boundary points. It achieves competitive accuracy compared with pixel-based methods while being much faster. However, their method is designed to help annotation and lacks a complete pipeline for automatic instance segmentation. Moreover, treating the contour as a general graph with a generic GCN does not fully exploit the special topology of a contour.
In this paper, we propose a learning-based snake algorithm, named deep snake, for real-time instance segmentation. Inspired by previous methods [20, 23], deep snake takes an initial contour as input and deforms it by regressing vertex-wise offsets. Our innovation is in introducing the circular convolution for efficient feature learning on a contour, as illustrated in Figure 1. We observe that the contour is a cycle graph that consists of a sequence of vertices connected in a closed cycle. Since every vertex has the same degree equal to two, we can apply the standard 1D convolution on the vertex features. Considering that the contour is periodic, deep snake introduces the circular convolution, which indicates that an aperiodic function (1D kernel) is convolved in the standard way with a periodic function (features defined on the contour). The kernel of circular convolution encodes not only the feature of each vertex but also the relationship among neighboring vertices. In contrast, the generic GCN performs pooling to aggregate information from neighboring vertices. The kernel function in our circular convolution amounts to a learnable aggregation function, which is more expressive and results in better performance than using a generic GCN, as demonstrated by our experimental results in Section 5.2.
Based on deep snake, we develop a pipeline for instance segmentation. Given an initial contour, deep snake can iteratively deform it to the object boundary and obtain the object shape. The remaining question is how to initialize a contour, whose importance has been demonstrated in classic snake algorithms. Inspired by [27, 41], we propose to generate an octagon formed by object extreme points as the initial contour, which generally encloses the object tightly. Specifically, we add deep snake to a detection model. The detected box first gives a diamond contour by connecting four points centered at its borders. Then deep snake takes the diamond as input and outputs offsets that point from the four vertices to the four extreme points, which are used to construct an octagon. Finally, deep snake deforms the octagon contour to the object boundary.
Our approach exhibits state-of-the-art performance on the Cityscapes, Kins and Sbd datasets, while being efficient for real-time instance segmentation: 32.3 fps for 512×512 images on a GTX 1080 Ti GPU. There are two reasons why the learning-based snake is fast while being accurate. First, our approach can deal with errors in the object localization and thus allows a light detector. Second, the object contour has fewer parameters than the pixel-based representation and does not require costly post-processing, such as mask upsampling.
In summary, this work has the following contributions:
We propose a learning-based snake algorithm for real-time instance segmentation, which deforms an initial contour to the object boundary and introduces the circular convolution for feature learning on the contour.
We propose a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. Both stages can deal with errors in the initial object localization.
We demonstrate state-of-the-art performance of our approach on the Cityscapes, Kins and Sbd datasets. For 512×512 images, our algorithm runs at 32.3 fps, which is efficient for real-time instance segmentation.
Most methods [8, 22, 17, 25] perform instance segmentation on the pixel level within a region proposal, which works particularly well with standard CNNs. A representative instantiation is Mask R-CNN. It first detects objects and then uses a mask predictor to segment instances within the proposed boxes. To better exploit the spatial information inside the box, PANet fuses mask predictions from fully-connected layers and convolutional layers. Such proposal-based approaches achieve state-of-the-art performance. One limitation of these methods is that they cannot resolve errors in localization, such as boxes that are too small or shifted. In contrast, our approach deforms the detected boxes to the object boundaries, so the spatial extension of object shapes is not limited.
There exist some pixel-based methods [2, 29, 26, 11, 39] that are free of region proposals. In these methods, every pixel produces auxiliary information, and then a clustering algorithm groups pixels into object instances based on this information. Both the auxiliary information and the grouping algorithm vary across methods. One approach predicts the boundary-aware energy for each pixel and uses the watershed transform algorithm for grouping. Another differentiates instances by learning instance-level embeddings. [26, 11] consider the input image as a graph and regress pixel affinities, which are then processed by a graph merge algorithm. Since the mask is composed of dense pixels, the post-clustering algorithms tend to be time-consuming.
In contour-based methods, the object shape comprises a sequence of vertices along the object boundary. Traditional snake algorithms [20, 5, 6, 14] first introduced the contour-based representation for image segmentation. They deform an initial contour to the object boundary by optimizing a handcrafted energy with respect to the contour coordinates. To improve the robustness of these methods, subsequent work proposed to learn the energy function in a data-driven manner. Instead of iteratively optimizing the contour, some recent learning-based methods [19, 38] try to regress the coordinates of contour points from an RGB image, which is much faster. However, they are not as accurate as state-of-the-art pixel-based methods.
Some annotation methods predict the contour points sequentially using a recurrent neural network. To avoid sequential inference, Ling et al. follow the pipeline of snake algorithms and use a graph convolutional network to predict vertex-wise offsets for contour deformation. This strategy significantly improves the annotation speed while being as accurate as pixel-based methods. However, their method lacks a pipeline for instance segmentation and does not fully exploit the special topology of a contour. Instead of treating the contour as a general graph, deep snake leverages the cycle graph topology and introduces the circular convolution for efficient feature learning on a contour.
Inspired by [20, 23], we perform object segmentation by deforming an initial contour to the object boundary. Specifically, deep snake takes a contour as input based on image features from a CNN backbone and predicts per-vertex offsets pointing to the object boundary. To fully exploit the contour topology, we introduce the circular convolution for efficient feature learning on the contour, which helps deep snake learn the deformation. Based on deep snake, a pipeline is developed for instance segmentation.
Given an initial contour, traditional snake algorithms treat the coordinates of the vertices as a set of variables and optimize an energy functional with respect to these variables. By designing proper image forces at the contour coordinates, active contour models can optimize the contour to the object boundary. However, since the energy functional is typically nonconvex and handcrafted based on low-level image features, the deformation process tends to converge to locally optimal solutions.
In contrast, deep snake directly learns to evolve the contour from data in an end-to-end manner. Given a contour with $N$ vertices $\{\mathbf{x}_i \mid i = 1, \dots, N\}$, we first construct feature vectors for each vertex. The input feature $f_i$ for a vertex $\mathbf{x}_i$ is a concatenation of learning-based features and the vertex coordinate: $[F(\mathbf{x}_i); \mathbf{x}_i']$, where $F$ denotes the feature maps and $\mathbf{x}_i'$ is a translation-invariant version of vertex $\mathbf{x}_i$. The feature maps $F$ are obtained by applying a CNN backbone to the input image, which deep snake shares with the detector in our instance segmentation model. The image feature $F(\mathbf{x}_i)$ is computed using bilinear interpolation of the features at the vertex coordinate $\mathbf{x}_i$. The appended vertex coordinate is used to model the spatial relationship among contour vertices. Since the deformation should not be affected by the absolute location of the contour, we compute the translation-invariant coordinate $\mathbf{x}_i'$ by subtracting the minimum value along the $x$ and $y$ axes over all vertices, respectively.
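The feature construction described above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the function names `bilinear_sample` and `vertex_features`, the nested-list layout `feature_map[row][col][channel]`, and the assumption that vertex coordinates are already expressed in feature-map pixels are all choices made for this example.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate feature_map[row][col] (a list of channels) at (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp the query so that all four neighbours lie inside the map (assumption).
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    channels = len(feature_map[0][0])
    return [
        (1 - wx) * (1 - wy) * feature_map[y0][x0][c]
        + wx * (1 - wy) * feature_map[y0][x1][c]
        + (1 - wx) * wy * feature_map[y1][x0][c]
        + wx * wy * feature_map[y1][x1][c]
        for c in range(channels)
    ]


def vertex_features(feature_map, contour):
    """Concatenate the sampled image feature with translation-invariant coordinates."""
    min_x = min(x for x, _ in contour)
    min_y = min(y for _, y in contour)
    return [
        bilinear_sample(feature_map, x, y) + [x - min_x, y - min_y]
        for x, y in contour
    ]


if __name__ == "__main__":
    fmap = [[[0.0], [1.0]], [[2.0], [3.0]]]  # 2 x 2 map with one channel
    print(vertex_features(fmap, [(0.5, 0.5), (1.0, 0.0)]))
```

Subtracting the per-axis minimum means that shifting the whole contour by a constant leaves the coordinate part of the feature unchanged, which is the translation invariance the text asks for.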
Given the input features $\{f_i\}$ defined on a contour, deep snake introduces the circular convolution for feature learning, as illustrated in Figure 2. In general, the features of contour vertices can be treated as a 1-D discrete signal $f$ and processed by the standard convolution. But this breaks the topology of the contour. Therefore, we treat the features on the contour as a periodic signal defined as:

$$(f_N)_i \triangleq \sum_{j=-\infty}^{\infty} f_{i - jN},$$

and propose to encode the periodic features by the circular convolution defined as:

$$(f_N * k)_i = \sum_{j=-r}^{r} (f_N)_{i+j} \, k_j,$$

where $k: [-r, r] \to \mathbb{R}^D$ is a learnable kernel function and the operator $*$ is the standard convolution.
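To make the wrap-around behaviour concrete, the following is a minimal single-channel sketch of a circular convolution: each output vertex is a weighted sum over a window of neighbouring vertices, and indices past either end of the sequence wrap around the closed contour. The function name `circular_conv` and the scalar features are simplifications assumed for this example; a real layer would use one weight matrix per kernel offset over multi-channel features.

```python
def circular_conv(features, kernel):
    """Circular convolution of scalar vertex features with a kernel of odd length.

    features: list of N values, one per contour vertex.
    kernel:   list of 2r + 1 weights for the offsets -r, ..., r.
    """
    n = len(features)
    r = len(kernel) // 2
    output = []
    for i in range(n):
        total = 0.0
        for j in range(-r, r + 1):
            # The modulo makes the signal periodic, so vertex 0 sees vertex N - 1.
            total += features[(i + j) % n] * kernel[j + r]
        output.append(total)
    return output


if __name__ == "__main__":
    print(circular_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]))
```

With a standard zero-padded 1D convolution, the first and last vertices would see fewer neighbours than the others, even though they are adjacent on the contour. The modulo indexing removes that artificial seam.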
Similar to the standard convolution, we can construct a network layer based on the circular convolution for feature learning, which is easy to integrate into a modern network architecture. After the feature learning, deep snake applies three 1×1 convolution layers to the output features for each vertex and predicts vertex-wise offsets between contour points and the target points, which are used to deform the contour. In all experiments, the kernel size of circular convolution is fixed to nine.
As discussed in the introduction, the proposed circular convolution better exploits the circular structure of the contour than the generic graph convolution. We will show the experimental comparison in Section 5.2. An alternative method is to use standard CNNs to regress a pixel-wise vector field from the input image to guide the evolution of the initial contour [34, 30, 37]. We argue that an important advantage of deep snake over the standard CNNs is the object-level structured prediction, i.e., the offset prediction at a vertex depends on other vertices of the same contour. Therefore, it is more reasonable for deep snake to predict an offset for a vertex located in the background and far from the object, which is very common in an initial contour. Standard CNNs have difficulty in outputting meaningful offsets in this case, since it is ambiguous to decide which object a background pixel belongs to.
Deep snake consists of three parts: a backbone, a fusion block, and a prediction head. The backbone comprises 8 “CirConv-Bn-ReLU” layers and uses residual skip connections for all layers, where “CirConv” means circular convolution. The fusion block aims to fuse the information across all contour points at multiple scales. It concatenates features from all layers in the backbone and forwards them through a 1×1 convolution layer followed by max pooling. The fused feature is then concatenated with the feature of each vertex. The prediction head applies three 1×1 convolution layers to the vertex features and outputs vertex-wise offsets.
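The data flow of the fusion block can be illustrated with a small sketch. It is only one reading of the description above: the names `pointwise_linear` and `fusion_block` are hypothetical, the learned 1×1 convolution is represented by a plain weight matrix passed in by the caller, and the choice to append the pooled feature to the concatenated per-vertex features (rather than to some other per-vertex feature) is an assumption.

```python
def pointwise_linear(vertex_feats, weight):
    """A 1x1 convolution: the same linear map weight[out][in] applied to every vertex."""
    return [
        [sum(w * v for w, v in zip(row, feat)) for row in weight]
        for feat in vertex_feats
    ]


def fusion_block(layer_outputs, weight):
    """Fuse per-layer vertex features into per-vertex features with global context.

    layer_outputs: list over backbone layers, each a list of N per-vertex feature lists.
    weight:        matrix of the 1x1 convolution applied to the concatenated features.
    """
    num_vertices = len(layer_outputs[0])
    # Concatenate the features of every backbone layer, vertex by vertex.
    concat = [
        sum((layer[i] for layer in layer_outputs), [])
        for i in range(num_vertices)
    ]
    projected = pointwise_linear(concat, weight)
    # Max pooling over all vertices yields one global feature for the contour.
    global_feat = [max(column) for column in zip(*projected)]
    # Assumption: the global feature is appended to each vertex's concatenated feature.
    return [feat + global_feat for feat in concat]


if __name__ == "__main__":
    layers = [[[1.0], [2.0]], [[3.0], [0.0]]]  # 2 layers, 2 vertices, 1 channel
    print(fusion_block(layers, [[1.0, 1.0]]))
```

Because the pooled feature is identical for every vertex, each vertex's offset prediction can depend on the whole contour and not only on its local neighbourhood.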
Figure 3(b) overviews the proposed pipeline for instance segmentation. We add deep snake to an object detection model. The detector first produces object boxes that are used to construct diamond contours. Then deep snake deforms the diamond vertices to object extreme points, which are used to construct octagon contours. Finally, our approach takes octagons as initial contours and performs iterative contour deformation to obtain the object shape.
Most active contour models require precise initial contours. Since the octagon proposed in prior work generally encloses the object tightly, we choose it as the initial contour, as shown in Figure 3(b). This octagon is formed by four extreme points, which are the top, leftmost, bottom, and rightmost pixels of an object, denoted by $\{\mathbf{x}_i^{ex} \mid i = 1, 2, 3, 4\}$. Given a detected object box, we extract four points centered at the top, left, bottom, and right box borders, denoted by $\{\mathbf{x}_i^{bb} \mid i = 1, 2, 3, 4\}$, and then connect them to get a diamond contour. Deep snake takes this contour as input and outputs four offsets that point from each vertex $\mathbf{x}_i^{bb}$ to the extreme point $\mathbf{x}_i^{ex}$, namely $\mathbf{x}_i^{ex} - \mathbf{x}_i^{bb}$. In practice, to take in more context information, the diamond contour is uniformly upsampled to 40 points, and deep snake correspondingly outputs 40 offsets. The loss function only considers the offsets at $\mathbf{x}_i^{bb}$.
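The diamond construction and its upsampling to 40 points can be sketched as follows. The helper names `box_to_diamond` and `upsample_closed_polygon` are hypothetical, and the box is assumed to be given as `(x_min, y_min, x_max, y_max)` in image coordinates with y growing downwards. Since the four diamond edges have equal length, placing 10 points on each edge gives a uniform 40-point contour.

```python
def box_to_diamond(x_min, y_min, x_max, y_max):
    """Return the border midpoints of a box in the order top, left, bottom, right."""
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    return [(cx, y_min), (x_min, cy), (cx, y_max), (x_max, cy)]


def upsample_closed_polygon(vertices, per_edge):
    """Place per_edge evenly spaced points on each edge, keeping every original vertex."""
    points = []
    n = len(vertices)
    for i in range(n):
        (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
        for k in range(per_edge):
            t = k / per_edge
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


if __name__ == "__main__":
    diamond = box_to_diamond(0.0, 0.0, 4.0, 2.0)
    contour = upsample_closed_polygon(diamond, per_edge=10)
    # The four diamond vertices sit at indices 0, 10, 20 and 30 of the 40 points.
    print(len(contour), [contour[i] for i in (0, 10, 20, 30)])
```

In this layout the four original diamond vertices are the only positions where an extreme-point loss would be evaluated; the other 36 points exist to give the network more context.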
We construct the octagon by generating four lines based on the extreme points and connecting their endpoints. Specifically, the four extreme points form a new object box. For each extreme point, a line extends from it along the corresponding box border in both directions to 1/4 of the border length. The line is truncated if it meets a box corner. Then the endpoints of the four lines are connected to form the octagon.
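A small sketch of the octagon construction is given below. It rests on one interpretive assumption that the text leaves open: the segment through each extreme point has a total length of 1/4 of the border, so it extends 1/8 of the border length in each direction. The function name `octagon_from_extreme_points` and the vertex ordering are choices made for this example.

```python
def octagon_from_extreme_points(top, left, bottom, right):
    """Build the 8 octagon vertices from four extreme points given as (x, y) tuples.

    Assumption: each segment spans 1/4 of its border in total (1/8 per direction)
    and is truncated at the corners of the box formed by the extreme points.
    """
    x_min, x_max = left[0], right[0]
    y_min, y_max = top[1], bottom[1]
    dx = (x_max - x_min) / 8.0
    dy = (y_max - y_min) / 8.0
    return [
        (max(top[0] - dx, x_min), y_min),     # top segment, left end
        (x_min, max(left[1] - dy, y_min)),    # left segment, upper end
        (x_min, min(left[1] + dy, y_max)),    # left segment, lower end
        (max(bottom[0] - dx, x_min), y_max),  # bottom segment, left end
        (min(bottom[0] + dx, x_max), y_max),  # bottom segment, right end
        (x_max, min(right[1] + dy, y_max)),   # right segment, lower end
        (x_max, max(right[1] - dy, y_min)),   # right segment, upper end
        (min(top[0] + dx, x_max), y_min),     # top segment, right end
    ]


if __name__ == "__main__":
    print(octagon_from_extreme_points((4.0, 0.0), (0.0, 4.0), (4.0, 8.0), (8.0, 4.0)))
```

The `max` and `min` calls implement the truncation: when an extreme point lies close to a box corner, the segment stops at the corner instead of leaving the box.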
We first uniformly sample $N$ points along the edges of the octagon contour and let the sequence start from the top extreme point $\mathbf{x}_1^{ex}$. Similarly, the ground-truth contour is generated by uniformly sampling $N$ vertices along the object boundary and assigning its first vertex as the one nearest to $\mathbf{x}_1^{ex}$. Deep snake takes the initial contour as input and outputs $N$ offsets that point from each vertex to the target boundary point. The value of $N$ is fixed in all experiments and chosen so that the sampled vertices can uniformly cover most object shapes.
However, regressing the offsets in one pass is challenging, especially for vertices far away from the object. Inspired by [20, 23, 35], we deal with this problem in an iterative optimization fashion. Specifically, our approach first predicts offsets based on the current contour and then deforms this contour by adding the offsets vertex-wise to its vertex coordinates. The deformed contour can be used for the next deformation or directly output as the object shape. In experiments, the number of inference iterations is set to 3 unless otherwise stated.
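The two sampling steps above can be sketched as follows: points are placed at equal arc-length spacing along a closed polygon, and the ground-truth vertex order is rotated so that it starts at the vertex nearest to a given anchor point. The names `sample_closed_contour` and `align_start` are hypothetical, and the sketch assumes both contours already share the same orientation (for example, both clockwise).

```python
import math


def sample_closed_contour(vertices, num_points):
    """Place num_points at equal arc-length spacing along a closed polygon.

    The first sample coincides with vertices[0].
    """
    n = len(vertices)
    edges = [(vertices[i], vertices[(i + 1) % n]) for i in range(n)]
    lengths = [math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in edges]
    perimeter = sum(lengths)
    samples, edge_idx, walked = [], 0, 0.0
    for k in range(num_points):
        target = perimeter * k / num_points
        # Advance to the edge that contains the target arc length.
        while edge_idx < n - 1 and walked + lengths[edge_idx] < target:
            walked += lengths[edge_idx]
            edge_idx += 1
        (x0, y0), (x1, y1) = edges[edge_idx]
        length = lengths[edge_idx]
        t = (target - walked) / length if length > 0 else 0.0
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples


def align_start(contour, anchor):
    """Rotate the vertex order so that the vertex nearest to anchor comes first."""
    start = min(
        range(len(contour)),
        key=lambda i: math.hypot(contour[i][0] - anchor[0], contour[i][1] - anchor[1]),
    )
    return contour[start:] + contour[:start]


if __name__ == "__main__":
    square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
    sampled = sample_closed_contour(square, 8)
    print(align_start(sampled, (4.1, 0.2)))
```

Aligning the starting vertex matters because the offsets are regressed per index: vertex $i$ of the initial contour is supervised by vertex $i$ of the ground-truth contour.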
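The iterative refinement loop can be written in a few lines. In this sketch `predict_offsets` stands in for the trained network and is an assumption of the example; here it is replaced by a toy predictor that moves every vertex halfway towards a known target, only to show how the error shrinks over iterations.

```python
def deform_contour(contour, predict_offsets, num_iters=3):
    """Repeatedly predict per-vertex offsets and add them to the contour."""
    for _ in range(num_iters):
        offsets = predict_offsets(contour)
        contour = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(contour, offsets)]
    return contour


def make_halfway_predictor(target):
    """Toy stand-in for the network: offsets cover half the distance to the target."""
    def predict(contour):
        return [
            (0.5 * (tx - x), 0.5 * (ty - y))
            for (x, y), (tx, ty) in zip(contour, target)
        ]
    return predict


if __name__ == "__main__":
    target = [(1.0, 1.0), (3.0, 1.0)]
    initial = [(0.0, 0.0), (4.0, 0.0)]
    print(deform_contour(initial, make_halfway_predictor(target), num_iters=3))
```

Each pass starts from the previously deformed contour, so a vertex that begins far from the object only needs to make partial progress per iteration.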
Note that the contour is an alternative representation for the spatial extension of an object. By deforming the initial contour to the object boundary, our approach could resolve the localization errors from the detector.
Due to occlusions, many instances comprise more than one connected component. However, a contour can only outline one connected component per bounding box. To overcome this problem, we propose to detect the object components within the object box. Specifically, using the detected box, our approach performs RoIAlign to extract a feature map and adds a detector branch on the feature map to produce the component boxes. Figure 4 shows the basic idea. The subsequent segmentation pipeline remains the same. Our approach obtains the final object shape by merging component contours from the same object box.
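For the training of deep snake, we use the smooth $\ell_1$ loss to learn the two deformation processes. The loss function for extreme point prediction is defined as

$$L_{ex} = \frac{1}{4} \sum_{i=1}^{4} \ell_1\left(\tilde{\mathbf{x}}_i^{ex} - \mathbf{x}_i^{ex}\right),$$

where $\tilde{\mathbf{x}}_i^{ex}$ is the predicted extreme point and $\mathbf{x}_i^{ex}$ is the ground-truth extreme point. The loss function for iterative contour deformation is defined as

$$L_{iter} = \frac{1}{N} \sum_{i=1}^{N} \ell_1\left(\tilde{\mathbf{x}}_i - \mathbf{x}_i^{gt}\right),$$

where $\tilde{\mathbf{x}}_i$ is the deformed contour point and $\mathbf{x}_i^{gt}$ is the ground-truth boundary point. For the detection part, we adopt the same loss function as the original detection model. The training details change with datasets and are described in Section 5.3.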
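Both training losses have the same form: a smooth L1 penalty on the difference between predicted and ground-truth points, averaged over the points (four extreme points in one case, all contour vertices in the other). The sketch below uses the common smooth L1 definition with a threshold `beta`; the default `beta=1.0` and the choice to sum the x and y terms of each point are assumptions of this example, not details taken from the text.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth L1 penalty: quadratic near zero, linear for large differences."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def contour_loss(predicted, target):
    """Mean smooth L1 loss over points; x and y terms are summed per point (assumption)."""
    total = sum(
        smooth_l1(px - tx) + smooth_l1(py - ty)
        for (px, py), (tx, ty) in zip(predicted, target)
    )
    return total / len(predicted)


if __name__ == "__main__":
    predicted = [(0.0, 0.0), (2.0, 0.0)]
    ground_truth = [(0.5, 0.0), (0.0, 0.0)]
    print(contour_loss(predicted, ground_truth))
```

The same `contour_loss` function covers both cases: pass the four predicted and ground-truth extreme points for the proposal stage, or the deformed and ground-truth contour vertices for the deformation stage.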
We adopt CenterNet as the detector for all experiments. CenterNet reformulates the detection task as a keypoint detection problem and achieves an impressive trade-off between speed and accuracy. For the object box detector, we adopt the same setting as the original CenterNet, which outputs class-specific boxes. For the component box detector, a class-agnostic CenterNet is adopted. Specifically, given an $H \times W \times C$ feature map, the class-agnostic CenterNet outputs an $H \times W \times 1$ tensor representing the component center and an $H \times W \times 2$ tensor representing the box size.
We compare our approach with the state-of-the-art methods on the Cityscapes, Kins and Sbd datasets. Comprehensive ablation studies are conducted to analyze the importance of the proposed components in our approach.
Cityscapes is a widely used benchmark for urban scene instance segmentation. It provides training, validation (500 images) and testing splits with high-quality annotations. Besides, it has 20k images with coarse annotations. This dataset is challenging due to the crowded scenes and the wide range in object scale. The performance is evaluated in terms of the average precision (AP) metric averaged over 8 semantic classes of the dataset. We report our results on the validation and test sets.
Kins was recently created by additionally annotating the Kitti dataset with instance-level semantic annotations. This dataset is used for amodal instance segmentation, which is a variant of instance segmentation and aims to recover complete instance shapes even under occlusion. Kins is split into training images and testing images. Following its setting, we evaluate our approach on 7 object categories in terms of the AP metric.
Sbd re-annotates images from the Pascal Voc dataset with instance-level boundaries and has the same 20 object categories. The reason that we do not directly perform experiments on Pascal Voc is that its annotations contain holes, which are not suitable for contour-based methods. The Sbd dataset is split into training images and testing images. We report our results in terms of the 2010 Voc AP$_{vol}$, AP$_{50}$ and AP$_{70}$ metrics. AP$_{vol}$ is the average of AP over 9 thresholds from 0.1 to 0.9.
We conduct ablation studies on the Sbd dataset because it has 20 semantic categories and can thoroughly evaluate the ability to deform various object contours. The three proposed components are evaluated, including our network architecture, initial contour proposal, and circular convolution. In these experiments, the detector and deep snake are trained end-to-end for 160 epochs with multi-scale data augmentation. The learning rate decays by a factor of 0.5 at 80 and 120 epochs. Table 1 summarizes the results of ablation studies.
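Table 1: Ablation studies on the Sbd dataset.
|model||AP$_{vol}$||AP$_{50}$||AP$_{70}$|
|+ Initial proposal||53.6||61.1||47.6|
|+ Circular convolution||54.4||62.1||48.3|

Table 2: Results of models with different convolution operators and different inference iterations.
|Iter. 1||Iter. 2||Iter. 3|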
The row “Baseline” lists the result of a direct combination of Curve-gcn with CenterNet. Specifically, the detector produces object boxes, which give ellipses around objects. Then the ellipses are deformed into object boundaries through Graph-ResNet. Note that the baseline represents the contour as a graph and uses a graph convolution network for contour deformation.
To validate our designed network, the model in the second row keeps the convolution operator as graph convolution and replaces Graph-ResNet with our proposed architecture, which yields 1.4 AP improvement. The main difference between the two networks is that our architecture appends a global fusion block before the prediction head.
When exploring the influence of the contour initialization, we add the initial contour proposal before the contour deformation. Instead of directly using the ellipse, the proposal step generates an octagon initialization by predicting four object extreme points, which not only resolves the detection errors but also encloses the object more tightly. The comparison between the second and the third row shows a 1.3 AP improvement.
Finally, the graph convolution is replaced with the circular convolution, which achieves a 0.8 AP improvement. To fully validate the importance of circular convolution, we further compare models with different convolution operators and different inference iterations, as shown in Table 2. Circular convolution outperforms graph convolution across all inference iterations. Moreover, circular convolution with two iterations outperforms graph convolution with three iterations by 0.6 AP. Figure 5 shows qualitative results of graph and circular convolution on Sbd, where circular convolution gives a sharper boundary. The quantitative and qualitative results indicate that models with the circular convolution have a stronger ability to deform contours.
Table 3: Results on the Cityscapes validation and test sets.
|method||training data||AP [val]||AP||AP$_{50}$||person||rider||car||truck||bus||train||mcycle||bicycle|
|SGN||fine + coarse||29.2||25.0||44.9||21.8||20.1||39.4||24.8||33.2||30.8||17.7||12.4|
|Mask R-CNN||fine||31.5||26.2||49.9||30.5||23.7||46.9||22.8||32.2||18.6||19.1||16.0|
|GMIS||fine + coarse||-||27.6||49.6||29.3||24.1||42.7||25.4||37.2||32.9||17.6||11.9|
Since fragmented instances are very common in Cityscapes, we adopt the proposed strategy to handle multi-component objects. Our network is trained with multi-scale data augmentation and tested at a single resolution. No testing tricks are used. The detector is first trained alone for 140 epochs, with the learning rate dropping by half at 80 and 120 epochs. Then the detection and snake branches are trained end-to-end for 200 epochs, with the learning rate dropping by half at 80, 120 and 150 epochs. We choose the model that performs best on the validation set.
Table 3 compares our results with other state-of-the-art methods on the Cityscapes validation and test sets. All methods are tested without tricks. Using only the fine annotations, our approach achieves state-of-the-art performance on both validation and test sets. We outperform PANet by 0.9 AP on the validation set and 1.3 AP on the test set. According to approximate timing results reported in prior work, PANet runs at less than 1.0 fps. In contrast, our model runs at 4.6 fps on a 1080 Ti GPU at the test resolution, which is about 5 times faster. Our approach achieves 28.2 AP on the test set when the strategy of handling multi-component objects is not adopted. Visual results are shown in Figure 6.
Table 4: Results on the Kins dataset in terms of the AP metric.
|method||detection||amodal seg||inmodal seg|
|Mask R-CNN||31.1||29.2||-|
|Mask R-CNN||31.3||29.3||26.6|
As a dataset for amodal instance segmentation, objects in the Kins dataset are all connected as a single component, so the strategy of handling multi-component objects is not adopted. We train the detector and snake end-to-end for 150 epochs. The learning rate decays by factors of 0.5 and 0.1 at 80 and 120 epochs, respectively. We perform multi-scale training and test the model at a single resolution.
Table 4 shows the comparison with [8, 22, 10, 17, 25] on the Kins dataset in terms of the AP metric. Kins indicates that tackling both amodal and inmodal segmentation simultaneously can improve the performance, as shown in the fourth and the fifth row of Table 4. Our approach learns only the amodal segmentation task and achieves the best performance across all methods. We find that the snake branch can improve the detection performance. When CenterNet is trained alone, it obtains 30.5 AP on detection. When trained with the snake branch, its performance improves by 2.3 AP. On the Kins dataset, our approach runs at 7.6 fps on a 1080 Ti GPU. Figure 7 shows some qualitative results on Kins.
Most objects on the Sbd dataset are connected as a single component, so we do not handle fragmented instances. For multi-component objects, our approach detects their components separately instead of detecting the whole object. We train the detection and snake branches end-to-end for 150 epochs with multi-scale data augmentation. The learning rate drops by half at 80 and 120 epochs. The network is tested at a single scale of 512×512.
In Table 5, we compare with other contour-based methods [19, 38] on the Sbd dataset in terms of the Voc AP metrics. These methods predict the object contours by regressing shape vectors. STS defines the object contour as a radial vector from the object center, and ESE approximates the object contour with coefficients of Chebyshev polynomials. In contrast, our approach deforms an initial contour to the object boundary. We outperform these methods by a large margin of at least 19.1 AP$_{vol}$. Note that our approach yields 21.4 AP$_{50}$ and 36.2 AP$_{70}$ improvements, demonstrating that the improvement increases with the IoU threshold. This indicates that our algorithm better outlines object boundaries. For 512×512 images on the Sbd dataset, our approach runs at 32.3 fps on a 1080 Ti. Some qualitative results are illustrated in Figure 7.
Table 6 compares our approach with other methods [8, 22, 17, 19, 38] in terms of running time on the Pascal Voc dataset. Since the Sbd dataset shares images with Pascal Voc and has the same semantic categories, the running time on the Sbd dataset is effectively the same as the one on Pascal Voc. We take the running times of other methods on Pascal Voc from the literature.
For 512×512 images on the Sbd dataset, our algorithm runs at 32.3 fps on a desktop with an Intel i7 3.7GHz and a GTX 1080 Ti GPU, which is efficient for real-time instance segmentation. Specifically, CenterNet takes 18.4 ms, the initial contour proposal takes 3.1 ms, and each iteration of contour deformation takes 3.3 ms. Since our approach outputs the object boundary, no post-processing like upsampling is required. If the strategy of handling fragmented instances is adopted, the detector additionally takes 3.6 ms.
We introduced a new contour-based model for real-time instance segmentation. Inspired by traditional snake algorithms, our approach deforms an initial contour to the object boundary and obtains the object shape. To this end, we proposed a learning-based snake algorithm, named deep snake, which introduces the circular convolution for efficient feature learning on the contour and regresses vertex-wise offsets for the contour deformation. Based on deep snake, we developed a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. We showed that this pipeline outperforms direct regression of the coordinates of the object boundary points. We also showed that the circular convolution learns the structural information of the contour more effectively than the graph convolution. To overcome the limitation that a contour can only outline one connected component, we proposed to detect the object components within the object box and demonstrated the effectiveness of this strategy on Cityscapes. The proposed model achieved state-of-the-art results on the Cityscapes, Kins and Sbd datasets with real-time performance.
The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
PVNet: Pixel-wise voting network for 6DoF pose estimation. In CVPR, 2019.
PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.