Deep Variational Instance Segmentation

by Jialin Yuan, et al.
Oregon State University

Instance segmentation, which seeks to obtain both class and instance labels for each pixel in the input image, is a challenging task in computer vision. State-of-the-art algorithms often employ two separate stages, the first generating object proposals and the second recognizing and refining the boundaries. Further, proposals are usually based on detectors such as Faster R-CNN, which exhaustively search for boxes in the entire image. In this paper, we propose a novel algorithm that directly utilizes a fully convolutional network (FCN) to predict instance labels. Specifically, we propose a variational relaxation of instance segmentation as minimizing an optimization functional for a piecewise-constant segmentation problem, which can be used to train an FCN end-to-end. It extends the classical Mumford-Shah variational segmentation problem to handle the permutation-invariant labels in the ground truth of instance segmentation. Experiments on PASCAL VOC 2012, the Semantic Boundaries Dataset (SBD), and the MSCOCO 2017 dataset show that the proposed approach efficiently tackles the instance segmentation task. The source code and trained models will be released with the paper.




1 Introduction

Recent years have witnessed rapid development in semantic segmentation Long et al. (2015); Noh et al. (2015); Chen et al. (2016); Jaderberg et al. (2015), i.e., classifying pixels into different object categories such as car or person. However, in order to fully understand a scene, we need to identify different object instances, which may have the same semantic label. This task, called semantic instance segmentation Everingham et al. (2010); Hariharan et al. (2011); Lin et al. (2014), is much more challenging, because (1) different instances may have similar appearances if they belong to the same category; (2) the number of instances is often unknown during prediction; and (3) labels of the instances are permutation-invariant, i.e., randomly permuting instance labels in the training set ground truth should not change the learning outcome (Fig. 1).

For such permutation-invariant instance labels, one cannot directly train the model using conventional objectives such as the cross-entropy (CE) loss. One popular strategy is to combine detection and segmentation into a two-stage approach: one network generates object proposals, while another classifies and refines each proposal Hariharan et al. (2014); Li et al. (2016); Romera-Paredes and Torr (2016); Ren and Zemel (2016); Dai et al. (2016a); He et al. (2017); Liu et al. (2018); Chen et al. (2018); Uhrig et al. (2018). To ensure all instances are segmented, these methods often need to generate a large number of proposals per image, and many are based on a sliding-window approach that amounts to an exhaustive search on a low-resolution image with anchor boxes. These proposals are verified with a classifier, and a smaller but still significant number are sent to the second stage for classification and refinement. To improve efficiency, alternative approaches that do not explicitly generate object proposals were developed. Most of these methods learn to predict an instance-agnostic feature for each pixel, and then use heuristic post-processing procedures to segment each instance Zhang et al. (2015, 2016); Uhrig et al. (2016); Bai and Urtasun (2017); Kirillov et al. (2016); Liu et al. (2017).


Figure 1: (a) An example from PASCAL VOC Everingham et al. (2010) with 8 bottles. (b) Ground truth; labels of the bottles can be either 1 to 8 or 8 to 1. (c) Our approach solves a variational relaxation of the problem and predicts real-valued labels on the image (best viewed in color).

We note that the goal of instance segmentation is to generate piecewise-constant predictions on each pixel that match a given ground truth. This resonates with the classic and elegant variational principle introduced to computer vision almost three decades ago. Such variational methods, originating from the Mumford-Shah model Mumford and Shah (1989), parse an image into meaningful sub-regions by finding a piecewise-smooth approximation. These approaches were traditionally limited to simple problems such as image restoration and active contours, mainly because of the difficulty, at that time, of estimating nonlinear functions from an image. However, they are inherently appealing in a deep network setting, since variational objectives such as the Mumford-Shah functional work with real-valued inputs and outputs and are naturally differentiable.

We believe such variational approaches could be very powerful when combined with deep learning, since they enable us to solve deep learning problems that are difficult for conventional objective functions such as cross-entropy. On the other hand, parametrizing variational approaches with a deep network enables them to model complex functions originating from an image, and allows them to generalize to testing images. In this paper, we propose deep variational instance segmentation (DVIS): a fully convolutional network (FCN) that directly predicts instance labels as a 2-dimensional piecewise-constant function, with each constant sub-region corresponding to a different instance. A novel variational objective is proposed to accommodate the permutation-invariant nature of the ground truth in instance segmentation, which leads to end-to-end training of the network.

With this proposed approach, we directly predict instances from a top-down FCN viewpoint, without the need to generate bounding box proposals using search protocols. Our approach outperforms other one-stage instance segmentation methods on the PASCAL VOC dataset Everingham et al. (2010); Hariharan et al. (2011) and the MS-COCO dataset Lin et al. (2014), especially under stricter metrics that count only segments with high overlap with the ground truth as positive. This makes us believe it is a potentially interesting framework to pursue. The source code and trained models will be released with the paper.

Figure 2: The proposed deep variational instance segmentation (DVIS): an FCN is trained to directly output real-valued instance labels, using a novel variational framework that combines a binary loss function, a permutation-invariant loss function, and regularization terms. During inference, we discretize the predicted instance map into several instances. After classification and verification, we output the final segmentation with both semantic and instance labels (best viewed in color).

2 Related Work

Instance segmentation identifies every single instance at the pixel level. Two-stage approaches break the task into two cascaded sub-tasks: the first generates region proposals, e.g., with a region proposal network (RPN) Ren et al. (2015), and another network segments, scores, and refines each proposal. This two-stage architecture solves the counting problem by adopting non-maximum suppression (NMS) Ren et al. (2015); Redmon and Farhadi (2018); Dai et al. (2016b); Liu et al. (2016); He et al. (2017); Huang et al. (2019) or determinantal point processes (DPP) Lee et al. (2016); Azadi et al. (2017) to remove overlapping detections. Besides RPN, Uijlings et al. (2013) uses selective search to generate proposals, and Pont-Tuset et al. (2017a) uses a network to generate region proposals in the form of binary masks. However, such a two-stage process is inherently slow, as many proposals with various sizes and aspect ratios need to be generated and scored, which can be unacceptable in realistic application scenarios where real-time performance is required. In more recent work, Liu et al. (2018); Chen et al. (2018); Uhrig et al. (2018) integrate instance-agnostic features into the second stage of the two-stage architecture; the global context information encoded in these features helps refine the final segmentation.

We focus our literature review on one-stage methods that are directly relevant to our work. Some proposal-free approaches focus on exploring instance-agnostic features and learning them with an FCN: Bai and Urtasun (2017); Romera-Paredes and Torr (2016); Ren and Zemel (2016) predict the energy of the watershed transform, Uhrig et al. (2016) predicts the direction from each pixel to the object center, Kirillov et al. (2016) predicts an instance-level boundary score, and Liu et al. (2017) attempts to locate instance segment breakpoints to separate each instance. However, these approaches do not directly generate an instance prediction and hence resort to a significant amount of heuristic post-processing, such as template matching Uhrig et al. (2016), MultiCut Kirillov et al. (2016), conditional random fields Arnab and Torr (2017), or recurrent neural networks Romera-Paredes and Torr (2016); Ren and Zemel (2016).

Kong and Fowlkes (2018); Fathi et al. (2017) are one-stage approaches based on metric learning. Kong and Fowlkes (2018) learns to map pixels to a multi-dimensional embedding space using a pairwise associative loss; Fathi et al. (2017) formulates it directly as metric learning. The network is trained to force pixels from the same instance to be close to each other and pixels from different instances to be far apart in the learned feature space. These approaches do not employ a binary term as ours does. Hence, in the embedding space they generate, the background ("stuff" categories such as water, grass, etc.) is no different from "yet another instance", and the separation between foreground and background is usually weak. As a result, these methods require more post-processing and depend on semantic segmentation to distinguish background from foreground. Our foreground/background binary term directly suppresses the output on background pixels and produces a cleaner instance map.

Recently, Bolya et al. (2019a, b) proposed a new architecture for one-stage instance segmentation and obtained state-of-the-art results. They use a network to learn mask prototypes from the input image and combine these prototypes to generate the final mask for each detected instance. However, they still search with anchor boxes of different scales and shapes and hence generate significantly more proposals than we do.

3 Deep Variational Instance Segmentation

3.1 The Mumford-Shah Model

The Mumford-Shah model is an energy-based model introduced in 1989 Mumford and Shah (1989) for image segmentation. It relaxes the task to a continuous energy minimization problem that computes the optimal piecewise-smooth approximation of a given image. Let $g$ denote an observed image on a bounded domain $\Omega$ to be segmented. We define $f$, an approximation of $g$, and $\Gamma$, the set of edges delineating the boundaries of different objects. The Mumford-Shah functional is:

$$E(f, \Gamma) = \int_{\Omega} (f - g)^2 \, dx + \mu \int_{\Omega \setminus \Gamma} |\nabla f|^2 \, dx + \nu \, |\Gamma| \quad (1)$$

where $\mu, \nu$ are non-negative parameters, $\Omega \setminus \Gamma$ is the set of non-edge pixels, and $|\Gamma|$ is the length of $\Gamma$ (in the discrete setting, the number of pixels in $\Gamma$). Minimizing the above functional seeks a piecewise-smooth function $f$ (ideally constant inside each segment) that may be non-smooth on the edges/boundaries. The first term drives $f$ to be close to $g$. The second term imposes a smoothness prior inside each segment and protects against under-segmentation. The last term encourages shorter object contours to avoid over-segmentation. By adjusting the parameters $\mu$ and $\nu$, it can optimally segment the given image.
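To make the functional concrete, here is a minimal NumPy sketch of a discretized Mumford-Shah energy on a 2-D image. The finite-difference discretization, the edge-mask convention, and the function name are illustrative choices, not the exact discretization used in the literature:

```python
import numpy as np

def mumford_shah_energy(f, g, edges, mu=1.0, nu=1.0):
    """Discrete sketch of the Mumford-Shah functional for a 2-D image.

    f     : piecewise-smooth approximation (H x W array)
    g     : observed image (H x W array)
    edges : boolean H x W mask, True where a pixel lies on an edge
    """
    # Data term: f should stay close to the observed image g.
    data = np.sum((f - g) ** 2)
    # Smoothness term: penalize finite-difference gradients off the edge set.
    dy = np.diff(f, axis=0) ** 2              # vertical differences
    dx = np.diff(f, axis=1) ** 2              # horizontal differences
    off_edge_y = ~(edges[1:, :] | edges[:-1, :])
    off_edge_x = ~(edges[:, 1:] | edges[:, :-1])
    smooth = np.sum(dy * off_edge_y) + np.sum(dx * off_edge_x)
    # Length term: approximate boundary length by the edge-pixel count.
    length = np.sum(edges)
    return data + mu * smooth + nu * length
```

For a perfectly piecewise-constant `f` whose jumps all lie on `edges`, only the length term contributes, which is exactly the behavior the text describes.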

The Mumford-Shah functional is well-regarded as a solid variational model and has been analyzed extensively Chan et al. (2006); Grady and Alvino (2008); Pock et al. (2009); Vese and Chan (2002); Xu et al. (2011); Strekalovskiy and Cremers (2014). It appropriately regularizes the length of object boundaries while being capable of modeling multiple objects within the same image. However, because the first term usually only enforces the approximation to be close to the input image, it was traditionally only utilized in superpixel segmentation and active contours Vese and Chan (2002); Morar et al. (2012).

From unsupervised to supervised setting. We note the similarity between the unsupervised Mumford-Shah model and the supervised instance segmentation problem. Both optimize for a piecewise-constant function, where each piece corresponds to one object instance and the number of pieces present in the image is unknown. Both enforce constancy within each piece, and a short boundary length would also be an ideal prior for instance segmentation, although to our knowledge no previous approach incorporates it. The second term in the Mumford-Shah model is a common pairwise term that enforces piecewise constancy, similar to those used in metric-learning-based instance segmentation methods Fathi et al. (2017); Kong and Fowlkes (2018). Previous work Xu et al. (2011); Strekalovskiy and Cremers (2014) has shown that the second and third terms can be combined into a robust loss on the pairwise term (see Sec. 3.3 for more details).

The main difficulty in extending this variational approach to instance segmentation lies in the matching potential: a simple MSE or CE loss does not suffice because of the permutation-invariance of the ground truth labels. However, one ground truth label does remain the same throughout the whole dataset: the background label. Thus, a new variational formulation is needed. In the next subsection we propose a novel variational formulation that solves the instance segmentation problem.

3.2 Deep Variational Instance Segmentation

As discussed above, we relax supervised instance segmentation to a continuous energy minimization problem. We first note that the ground truth label in instance segmentation usually has two distinct aspects: 1) when the label of a pixel is 0, the pixel is background; 2) when the label of a pixel is larger than 0, the label is permutation-invariant, i.e., one can switch the labels of different objects without affecting their actual meaning. Hence, when defining a variational functional for instance segmentation, both of these aspects need to be considered.

We define a variational functional for instance segmentation as:

$$E(f) = \int_{\Omega} L_B\big(f(x), b(x)\big) \, dx + \iint L_{PI}\big(f(x_1) - f(x_2), \delta(x_1, x_2)\big) \, dx_1 \, dx_2 + \mu \int_{\Omega \setminus \Gamma} |\nabla f|^2 \, dx + \nu \, |\Gamma| + \lambda \int_{\Omega} \big(f(x) - [f(x)]\big)^2 \, dx \quad (2)$$

where $f$ denotes the continuous-valued label map predicted by our network, an FCN with parameters $W$, and $[\cdot]$ is the operation rounding to the nearest integer. $L_B$ compares the instance label with the binarized ground truth label $b(x)$ that indicates object/background, and $L_{PI}$ denotes the permutation-invariant loss function, which compares the difference between two pixel labels with $\delta(x_1, x_2)$, an indicator of whether the ground truth labels at these pixels are different. Using $\delta(x_1, x_2)$ to compare labels allows us to define a permutation-invariant loss function, since the exact values of the ground truth labels no longer play a role in the loss. The smoothness and minimal edge length terms are the same as in Mumford-Shah. We additionally incorporate a quantization term, which drives the output label values to be closer to integers.

Training on this variational functional enables us to learn from a training set with instance-level ground truth and generalize to unseen testing images, improving over traditional variational segmentation, which has no learning capability. Note that for our permutation-invariant loss, we in principle integrate over all non-boundary pixel pairs within the image, instead of only over a small neighborhood as in traditional conditional random field approaches. This is because instance segmentation is an inherently non-local problem: due to occlusion, the same instance can be separated into several pieces in 2D that may be very far from each other, so local consistency alone is not enough. Empirically, we have also found that if we only enforce local consistency, small smooth changes in the predicted instance labels can add up to a significant amount and change the instance label within a single instance.

In practice, we discretize the pixel-wise integrals over all pixels, and the pairwise integral over sampled pixel pairs. Either stratified sampling or random sampling of pixel pairs can be used. In stratified sampling, we sample all the immediate neighbors in the 4-neighborhood of a pixel and reduce the sampling density for pixel pairs that are further apart. In random sampling, we randomly select pixel pairs across the whole image for computing the pairwise integral. We have found that stratified sampling is efficient at smaller resolutions, whereas random sampling is more efficient at very large resolutions.
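The two sampling schemes can be sketched as follows; the particular mixture of 4-neighborhood pairs and long-range random pairs, and the function name, are illustrative rather than the exact sampling pattern used in the experiments:

```python
import numpy as np

def sample_pixel_pairs(h, w, n_random=1000, rng=None):
    """Sample pixel pairs to discretize the pairwise integral.

    Returns two (N, 2) arrays of (row, col) coordinates forming pairs.
    Combines stratified sampling (all horizontal/vertical neighbor
    pairs) with long-range random pairs.
    """
    rng = np.random.default_rng(rng)
    rows, cols = np.mgrid[0:h, 0:w]
    # Stratified part: every horizontal neighbor pair (x, x + (0,1)).
    right = np.stack([rows[:, :-1].ravel(), cols[:, :-1].ravel()], axis=1)
    right2 = right + np.array([0, 1])
    # ... and every vertical neighbor pair (x, x + (1,0)).
    down = np.stack([rows[:-1, :].ravel(), cols[:-1, :].ravel()], axis=1)
    down2 = down + np.array([1, 0])
    # Random part: uniformly chosen pairs anywhere in the image,
    # capturing the non-local consistency argued for above.
    a = np.stack([rng.integers(0, h, n_random),
                  rng.integers(0, w, n_random)], axis=1)
    b = np.stack([rng.integers(0, h, n_random),
                  rng.integers(0, w, n_random)], axis=1)
    p = np.concatenate([right, down, a], axis=0)
    q = np.concatenate([right2, down2, b], axis=0)
    return p, q
```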

Also note that there is a significant difference between variational approaches such as ours and conditional random field (CRF) approaches, although both employ matching (unary) and regularization (pairwise) terms. In CRFs, the labels come from a discrete set, while in variational approaches the labels are relaxed to be continuous. It is difficult for a CNN to simulate the full CRF inference process; one would have to resort to a recurrent network Zheng et al. (2015), increasing the complexity of the model. In contrast, our variational formulation, eq. (2), only requires an FCN to handle images with an undetermined number of objects, since it predicts labels as continuous real-valued numbers.

3.3 Loss Functions

As a variational approach, our output values are continuous; hence, our loss functions are closer to regression losses. We mostly utilize variants of the robust Huber loss, $\ell_\delta(a) = \frac{1}{2}a^2$ if $|a| \le \delta$ and $\ell_\delta(a) = \delta\big(|a| - \frac{1}{2}\delta\big)$ otherwise, with the threshold $\delta$ fixed throughout this work.
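As a concrete reference, a minimal NumPy sketch of the Huber loss; the default threshold of 1.0 is an assumption for illustration:

```python
import numpy as np

def huber(a, delta=1.0):
    """Robust Huber loss: quadratic near zero, linear in the tails.

    delta is the transition threshold; 1.0 is an assumed default.
    """
    a = np.abs(a)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))
```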

Binary Loss: Our first loss seeks to separate labeled instances from "stuff" classes such as road, water, sky, etc., which do not contain individual instances and are usually labeled as background in instance segmentation tasks. Thus, $L_B$ drives the prediction to be non-positive on background pixels and sufficiently positive on foreground pixels. Let $b(x) = 0$ on the background pixels and $b(x) = 1$ on the foreground pixels; the absolute loss is computed as:

$$L_B\big(f(x), b(x)\big) = \big(1 - b(x)\big) \, \sigma\big(f(x)\big) + b(x) \, \sigma\big(\alpha - f(x)\big)$$

where $\sigma(\cdot)$ is the commonly used ReLU activation function and $\alpha$ is a parameter of the loss function that separates foreground from background. With this loss, on foreground pixels the loss is $0$ whenever $f(x) \ge \alpha$, which accommodates foreground objects taking different values; on background pixels, the loss is $0$ once $f(x) \le 0$. The value of $\alpha$ is fixed in our experiments.
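A minimal NumPy sketch of a binary term with the behavior described above; the function name `binary_loss` and the default margin `alpha=1.0` are assumptions for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def binary_loss(f, is_fg, alpha=1.0):
    """Sketch of the foreground/background binary term.

    f      : predicted real-valued instance labels
    is_fg  : 1 on foreground pixels, 0 on background pixels
    alpha  : margin separating foreground from background (assumed value)
    Zero loss once background outputs are <= 0 and foreground outputs
    are >= alpha, matching the behavior described in the text.
    """
    f = np.asarray(f, dtype=float)
    bg = (1 - is_fg) * relu(f)        # push background down to <= 0
    fg = is_fg * relu(alpha - f)      # push foreground up to >= alpha
    return np.mean(bg + fg)
```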

Permutation-Invariant Loss: We use $L_{PI}$ to enforce similarity between ground truth instance labels and predicted instance labels, taking into account that the ground truth labels are permutation-invariant. Let $x_1$ and $x_2$ be two pixels from a neighborhood with ground truth labels $g(x_1)$ and $g(x_2)$, respectively; the relative loss is computed as:

$$L_{PI}\big(f(x_1) - f(x_2), \delta_{12}\big) = \big(1 - \delta_{12}\big) \, \ell\big(f(x_1) - f(x_2)\big) + \delta_{12} \, \sigma\big(m - |f(x_1) - f(x_2)|\big)$$

where $\delta_{12}$ indicates whether $g(x_1) \ne g(x_2)$, $\ell$ is the Huber loss, $\sigma$ is the ReLU function, and $m$ is a parameter used to adjust the margin between predicted labels from different instances; we fix $m$ in practice. Hence, if the two pixels belong to different instances, there is no loss once the difference between their predicted labels exceeds $m$. On the other hand, if the two pixels belong to the same instance, the loss is $0$ only when their predicted labels are the same.
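The pairwise term described above can be sketched as follows. The use of a Huber penalty for same-instance pairs and a hinge for different-instance pairs follows the description, while the margin value 2.0 and the function name are assumptions:

```python
import numpy as np

def pi_loss(f1, f2, different, margin=2.0, delta=1.0):
    """Sketch of the permutation-invariant pairwise term.

    f1, f2    : predicted labels at the two pixels of each pair
    different : 1 if the pair's ground-truth instance labels differ
    margin    : required label gap between instances (assumed value)
    Same-instance pairs pay a Huber penalty on any label difference;
    different-instance pairs pay a hinge penalty until their labels
    are at least `margin` apart.
    """
    d = np.abs(np.asarray(f1, dtype=float) - np.asarray(f2, dtype=float))
    hub = np.where(d <= delta, 0.5 * d ** 2, delta * (d - 0.5 * delta))
    same = (1 - different) * hub                     # pull same-instance pairs together
    diff = different * np.maximum(margin - d, 0.0)   # push different instances apart
    return np.mean(same + diff)
```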

Regularization: Mumford-Shah regularization is helpful for obtaining sharper boundaries. We have noticed that without such regularization, the predicted label map tends to change smoothly at object boundaries, creating intermediate values that do not belong to any object and that make post-processing more difficult. There has been a significant amount of work on optimizing the Mumford-Shah term. We follow Strekalovskiy and Cremers (2014) and discretize Mumford-Shah as a robust loss on pairwise label differences:

$$R(f) = \sum_{(x_1, x_2)} \min\big(\mu \, |f(x_1) - f(x_2)|^2, \, \nu\big)$$

which is equivalent to the original Mumford-Shah formulation: the quadratic part corresponds to the smoothness term and the truncation at $\nu$ to the edge-length term. Strekalovskiy and Cremers (2014) then solve this formulation with a primal-dual algorithm, but in our case we do not need to solve the optimization problem exactly, since optimization is never exact with a deep network anyway. Hence we simply use a quasi-convex robust loss function, the Cauchy loss:

$$R(f) = \sum_{(x_1, x_2)} \log\Big(1 + \frac{|f(x_1) - f(x_2)|^2}{c^2}\Big)$$
Note that one way to approach proper Mumford-Shah regularization would be to gradually anneal the loss towards a Welsch loss function as in Barron (2019); we did not do this because the difference is very minor.
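Both robust pairwise losses discussed in this subsection can be sketched in a few lines; the scale constants and function names are illustrative:

```python
import numpy as np

def truncated_quadratic(d, mu=1.0, nu=1.0):
    """Robust pairwise term from the discretized Mumford-Shah model:
    quadratic smoothness capped at the edge penalty nu."""
    return np.minimum(mu * np.asarray(d, dtype=float) ** 2, nu)

def cauchy(d, c=1.0):
    """Quasi-convex Cauchy loss used in place of the exact truncated
    quadratic; c is an assumed scale parameter."""
    return np.log1p((np.asarray(d, dtype=float) / c) ** 2)
```

For small label differences both behave quadratically; for large differences the truncated quadratic saturates at `nu` while the Cauchy loss grows only logarithmically, which is what makes it a smooth, quasi-convex surrogate.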

Finally, the quantization term minimizes the distance between the output label and its nearest integer. It helps to create sufficient margin between different label values, making post-processing easier.

In summary, we relax supervised instance segmentation to a variational minimization problem. With our formulation, the proposed variational problem can be tackled by training an FCN to optimize these loss functions and output a real-valued approximation of the instance segmentation labels. By directly optimizing for instance segmentation, our approach can assign different labels to different objects while capturing multiple scattered parts, e.g., of an occluded sofa, as a single object (Fig. 2).

4 Implementation Details

FCN for Instance Segmentation: An encoder-decoder FCN is adapted to solve instance segmentation with our variational loss. We employ ResNet-50 and ResNet-101 with output stride 8 as our base networks; their output is then upsampled by 2 using a decoder similar to the upsampling branch in FPN Lin et al. (2017) to generate higher-resolution output. The last layer of the FCN outputs the real-valued label map as a single channel, which is then used to compute our variational loss, eq. (2), for backpropagation. We remove negative label outputs by adding a ReLU activation on the FCN output. Note that we did not employ multiple output heads as in FPN.

Training: We scale the input image to 513×513 for PASCAL, and for COCO so that the minimum edge equals 700 (preserving the height-to-width ratio). The window size for computing the relative loss is set to 128 throughout all experiments, except in the ablation study of this parameter in the supplementary material. We initialize the backbone network with weights pre-trained for semantic segmentation on PASCAL, and with weights pre-trained for object detection on COCO.

Permutation-Invariant Loss: Given an input image of size $H \times W$ and an FCN with downsampling factor $s$, the output size is $(H/s) \times (W/s)$, so the number of possible pixel pairs is on the order of $(HW/s^2)^2$, a huge number. In our model, with the binary loss separating background and foreground, it suffices to consider only the pixel pairs located on instances, which reduces the number of pairs that need to be computed. We then utilize stratified sampling to select pairs for the permutation-invariant loss: given a pixel and the window size, we sample all pixels inside a small center area and select the remaining pixels with a dilation rate $r$, similar to dilated convolutions Chen et al. (2016).

Discretization to instance segmentation: After obtaining the real-valued instance labels, we apply the mean-shift segmentation algorithm to them with two different bandwidths, producing two label maps. Because the inter-instance margin is fixed, a bandwidth close to the margin works well to separate objects the network believes are different, while a smaller bandwidth helps to segment the objects when the network has not learned to separate the instances well enough. These two bandwidths prove sufficient to generate all instance segments, which are then verified in the next stage.
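Since mean shift on a 1-D label map is simple, a toy version illustrates the discretization step. The implementation below (flat kernel, naive mode grouping) and its bandwidth value are illustrative stand-ins for a full mean-shift library:

```python
import numpy as np

def mean_shift_1d(values, bandwidth, n_iter=50):
    """Toy 1-D mean shift to discretize real-valued instance labels.

    Each predicted label repeatedly moves to the mean of all labels
    within `bandwidth`; converged points that end up close together
    are assigned the same instance id.
    """
    data = np.asarray(values, dtype=float)
    v = data.copy()
    for _ in range(n_iter):
        # Flat-kernel mean shift: move each mode to the mean of the
        # original data points within the bandwidth.
        dist = np.abs(v[:, None] - data[None, :])
        mask = dist <= bandwidth
        v = (mask * data[None, :]).sum(axis=1) / mask.sum(axis=1)
    # Group converged modes into instance ids by sorting them.
    order = np.argsort(v)
    ids = np.zeros(len(v), dtype=int)
    cid = 0
    for prev, cur in zip(order[:-1], order[1:]):
        if v[cur] - v[prev] > bandwidth / 2:
            cid += 1
        ids[cur] = cid
    return ids
```

Labels that converge to the same mode form one instance; running this with two bandwidths, as described above, yields the two candidate label maps.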

Classification and Verification: We utilize a classification network to verify the segments. It first extracts CNN features from the bounding box of each predicted instance from the FCN with ROIAlign He et al. (2017), and concatenates them with the predicted binary mask for the instance. We then run a small convolutional network that classifies each predicted instance into the pre-defined semantic categories. In addition, we have an IoU head Huang et al. (2019) that predicts the intersection-over-union between the predicted instance and the ground truth instance that best matches it, using a Huber regression loss. Finally, we reject false-positive instances by thresholding on the weighted sum of the predicted semantic classification confidence and the predicted IoU. Note that we verify only about 15 segments per image on average, significantly fewer than previous approaches (Table 6), so the overhead of this stage is very small (Table 5). Hence, we believe this classification step does not change the fact that our method is one-stage; after all, all one-stage methods have post-processing steps, which sometimes take longer than ours.

5 Experiments

We evaluate the proposed approach for instance segmentation on the challenging PASCAL VOC dataset Everingham et al. (2010) on the val split and the SBD split Hariharan et al. (2011), as well as on the COCO dataset Lin et al. (2014).

5.1 Datasets

PASCAL VOC 2012 consists of 20 object classes and one background class. It has been the benchmark challenge for segmentation over the years. The original dataset contains 1,464, 1,449, and 1,456 images for training, validation, and testing, respectively. It is augmented with extra annotations from Hariharan et al. (2011), resulting in 10,582 training images. The metric we use on PASCAL is average precision (AP) with pixel intersection-over-union (IoU) thresholds at 0.5, 0.6, 0.7, 0.8, and 0.9, averaged across the 20 object classes. As there is no ground truth for the testing set, we use the val set for testing.

PASCAL SBD is a different split on the PASCAL VOC dataset. In order to compare with Li et al. (2016); Bolya et al. (2019a), we train a separate model on SBD’s training set and evaluate on its 5,732 validation images.

COCO is a very challenging dataset for instance segmentation and object detection. It has 115,000 training images and 5,000 validation images; 20,000 images from the 2017 split are used as test-dev. There are 80 instance classes for the instance segmentation and object detection challenges, and there are more objects per image than in PASCAL VOC. We train our model on the train 2017 subset and run prediction on the val 2017 and test-dev 2017 subsets, respectively. We adopt the public cocoapi to report the performance metrics AP, AP50, AP75, APS, APM, and APL.

5.2 Comparison to the state-of-the-art

Results on PASCAL VOC and SBD are shown in Table 1 and Table 2, respectively. Our approach significantly outperforms the one-stage instance segmentation algorithms SGN, DIN, and Embedding Liu et al. (2017); Arnab and Torr (2017); Kong and Fowlkes (2018) at all mAP thresholds; the latter two are state-of-the-art metric learning approaches. On the SBD dataset we also significantly outperform a well-regarded proposal-based approach, FCIS Li et al. (2016) (Table 2). The very recent YOLACT Bolya et al. (2019a) achieves slightly better results than ours on mAP at 0.5 IoU, but our approach is significantly better at 0.7 IoU, which requires more precise segmentation of each object. We note that 0.5 IoU is a rather low standard for segmentation, since there can still be a significant amount of segmentation error at this threshold. Our better performance at higher thresholds shows that our variational approach segments objects more precisely, especially objects of non-rectangular shapes. Some one-stage approaches such as DWT take each connected component as an instance, so they do not work well for the many PASCAL VOC objects that are separated into several parts by occlusion. We significantly outperform SGN, which is known to be superior to DWT.

Method backbone architecture
0.5 0.6 0.7 0.8 0.9
DIN Arnab and Torr (2017) PSPNet (Resnet-101) two-stage 61.7 55.5 48.6 39.5 25.1 57.5
SGN Liu et al. (2017) PSPNet (Resnet-101) one-stage 61.4 55.9 49.9 42.1 26.9 47.2
Embedding Kong and Fowlkes (2018) DeepLab-v3 one-stage 64.5 - - - - -
DVIS DeepLab-v3 one-stage 70.3 68.0 60.2 50.6 33.7 56.6
Table 1: Results on the PASCAL VOC 2012 val set.
Method backbone architecture
0.5 0.6 0.7 0.8 0.9
DIN Arnab and Torr (2017) PSPNet (Resnet-101) two-stage 62.0 - 44.8 - - 55.4
FCIS Li et al. (2016) Resnet-101-C5 two-stage 65.7 - 52.1 - - -
YOLACT Bolya et al. (2019a) Resnet-50-FPN one-stage 72.3 - 56.2 - - -
DVIS DeepLab-v3 one-stage 70.5 68.5 62.9 55.2 34.5 58.3
Table 2: Results on the PASCAL SBD val set.

Results on COCO are shown in Table 3 and Table 4. With a one-stage algorithm, we obtain performance very close to the two-stage Mask R-CNN, trailing mainly on small objects. We outperform the state-of-the-art one-stage method YOLACT on AP under multiple settings on both the val 2017 and test-dev 2017 sets. YOLACT-700 results are only available on test-dev, hence we compare with YOLACT-550 on val. The authors have a more recent improvement, YOLACT++, which uses deformable convolutions; this is orthogonal to our contributions and could be applied in our case to further improve performance. Qualitative results and more comparisons are shown in the supplementary material.

Method backbone architecture AP AP50 AP75 APS APM APL
PANet Liu et al. (2018) Resnet-101-FPN two-stage 37.6 59.1 40.6 20.3 41.3 53.8
Mask R-CNN Chen et al. (2019) Resnet-101-FPN two-stage 36.5 58.1 39.1 18.4 40.2 50.4
YOLACT-550 Bolya et al. (2019a) Resnet-50-FPN one-stage 30.0 - - - - -
DVIS-700 Resnet-50-FCN one-stage 32.6 53.4 35.0 13.1 34.8 48.1
DVIS-700 Resnet-101-FCN one-stage 35.3 57.3 37.2 14.6 38.2 50.7
Table 3: Results on COCO's val 2017 set.
Method backbone architecture AP AP50 AP75 APS APM APL
PANet Liu et al. (2018) Resnet-50-FPN two-stage 36.6 58.0 39.3 16.3 38.1 53.1
FCIS Li et al. (2016) Resnet-101-C5 two-stage 29.5 51.5 30.2 8.0 31.0 49.7
Mask R-CNN He et al. (2017) Resnet-101-FPN two-stage 35.7 58.0 37.8 15.5 38.1 52.4
YOLACT-700 Bolya et al. (2019a) Resnet-101-FPN one-stage 31.2 50.6 32.8 12.1 33.3 47.1
DVIS-700 Resnet-50-FCN one-stage 30.3 48.6 33.0 11.0 33.2 46.1
DVIS-700 Resnet-101-FCN one-stage 32.7 52.2 34.5 12.3 36.4 48.2
Table 4: Results on COCO's test-dev 2017 set.

5.3 Ablation study

Inference cost: We report the total number of floating-point operations (FLOPs) needed to compute instance segmentation with our approach compared with the state-of-the-art on the COCO val 2017 set. Table 5 shows that our model requires significantly less computation than YOLACT Bolya et al. (2019a), the state-of-the-art in inference speed, because we have far fewer segments to process (see also the next paragraph and Table 6). We also present a breakdown of DVIS timings, showing that the majority of our computation is within the FCN itself; beyond the network, the mean-shift grouping and the classification module together add only about 2.5% extra FLOPs.

Method backbone image size
550 700
YOLACT Bolya et al. (2019a) Resnet-50-FPN 61.59 G 98.89 G
YOLACT Bolya et al. (2019a) Resnet-101-FPN 86.05 G 137.70 G
DVIS Resnet-50-FCN 38.49 G 60.94 G
DVIS Resnet-101-FCN 66.24 G 106.35 G
Breakdown of post-processing on DVIS (ResNet-101):
Mean Shift Grouping - 94.79 M 124.42 M
Classification Module Resnet-101-FCN 1.54 G 2.44 G
Table 5: Number of FLOPs on the COCO val 2017 set.
Method No. of candidates
FCIS Li et al. (2016) 2,000
PANet Liu et al. (2018) 1,000
Mask R-CNN He et al. (2017) 1,000
YOLACT Bolya et al. (2019a) 200
DVIS @ COCO 14.83
Table 6: Number of candidates inputted to post-processing.

Number of Candidates in Post-Processing: We compare the average number of candidates produced by our discretization process with previous one- and two-stage instance segmentation algorithms in Table 6. The two-stage algorithms Li et al. (2016); Liu et al. (2018) send over 1,000 proposals to their second stage, and the one-stage YOLACT Bolya et al. (2019a) selects the top 200 proposals for post-processing. Meanwhile, we send only about 15 segments per image on average (14.83 on COCO) to the classification module, further illustrating that our one-stage FCN network has already precisely located the instances, thanks to the variational framework.
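The mean shift grouping step can be illustrated with a toy sketch. The snippet below is an illustration under our own assumptions (function names, the Gaussian kernel, and the bandwidth are ours, not the paper's implementation details): it runs 1-D mean shift on a handful of predicted real-valued instance labels and merges the converged modes into discrete instance ids.

```python
import numpy as np

def mean_shift_1d(values, bandwidth=0.5, n_iters=30, tol=1e-4):
    """Simple 1-D mean shift: each point moves toward the weighted mean
    of the data under a Gaussian kernel until convergence, then points
    whose modes lie within one bandwidth are merged into one cluster."""
    modes = values.astype(np.float64).copy()
    for _ in range(n_iters):
        diff = modes[:, None] - values[None, :]
        w = np.exp(-0.5 * (diff / bandwidth) ** 2)
        new_modes = (w * values[None, :]).sum(1) / w.sum(1)
        if np.abs(new_modes - modes).max() < tol:
            modes = new_modes
            break
        modes = new_modes
    # assign cluster ids by scanning the sorted modes for large gaps
    order = np.argsort(modes)
    labels = np.empty(len(values), dtype=int)
    cluster, prev = 0, None
    for i in order:
        if prev is not None and modes[i] - prev > bandwidth:
            cluster += 1
        labels[i] = cluster
        prev = modes[i]
    return labels

# toy "predicted instance labels": two groups around 1.0 and 3.0
vals = np.array([0.9, 1.0, 1.1, 2.9, 3.0, 3.1])
labels = mean_shift_1d(vals)
```

In practice the number of resulting clusters per image is what Table 6 counts; with only a handful of clusters, the classification module has very little work left to do.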

6 Conclusion

In this paper we proposed deep variational instance segmentation (DVIS), which relaxes instance segmentation into a variational problem with a novel objective that includes a permutation-invariant component. This variational objective leads to an end-to-end training framework in which an FCN directly predicts continuous instance labels on the image. At inference time, we discretize the predicted continuous labels and use a small CNN to categorize the resulting segments into semantic categories, as well as to reject false positives. Experiments have shown that the proposed approach improves over the state of the art in one-stage instance segmentation, especially at higher overlap thresholds. This performance shows that our model is effective at capturing the global shape information of objects and segmenting objects with higher precision. In the future, we will further explore variants of the top-down instance segmentation paradigm built on the proposed approach, especially for small objects.

Broader Impact Statement

Instance segmentation is an important component of object recognition and is expected to be deployed in many real-life computer vision applications. Our algorithm significantly reduces the amount of computation required to obtain good instance segmentation performance, and hence would significantly lower the total carbon footprint of deployed instance segmentation systems. We do not believe we create social or ethical concerns beyond those inherent to instance segmentation itself. However, there are inherent concerns about object detection algorithms, including instance segmentation, being misused in systems that recover personal identities without individual consent. This is beyond the scope of the paper, since we are only concerned with broad object categories (person, tree, car, bus, etc.) rather than the individual identities of the objects. Our labels are permutation-invariant, i.e., the model assigns an arbitrary real-valued number to each instance it predicts; due to this randomness, the labels do not reveal individual identities per se. A possible drawback is that one could feed instance segmentation results into another algorithm to identify personal identities; however, that too is beyond the scope of this paper.


  • A. Arnab and P. H. Torr (2017) Pixelwise instance segmentation with a dynamically instantiated network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 441–450.
  • S. Azadi, J. Feng, and T. Darrell (2017) Learning detection with diverse proposals. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 7369–7377.
  • M. Bai and R. Urtasun (2017) Deep watershed transform for instance segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2858–2866.
  • J. T. Barron (2019) A general and adaptive robust loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4331–4339.
  • D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee (2019a) YOLACT: real-time instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9157–9166.
  • D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee (2019b) YOLACT++: better real-time instance segmentation. arXiv preprint arXiv:1912.06218.
  • T. F. Chan, S. Esedoglu, and M. Nikolova (2006) Algorithms for finding global minimizers of image segmentation and denoising models. SIAM Journal on Applied Mathematics 66 (5), pp. 1632–1648.
  • K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, et al. (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
  • L. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam (2018) MaskLab: instance segmentation by refining object detection with semantic and direction features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4013–4022.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2016) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915.
  • G. Csurka, D. Larlus, F. Perronnin, and F. Meylan (2013) What is a good evaluation measure for semantic segmentation? In BMVC, Vol. 27.
  • J. Dai, K. He, and J. Sun (2016a) Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3150–3158.
  • J. Dai, Y. Li, K. He, and J. Sun (2016b) R-FCN: object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pp. 379–387.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
  • A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy (2017) Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277.
  • L. Grady and C. Alvino (2008) Reformulating and optimizing the Mumford-Shah functional on a graph—a faster, lower energy solution. In European Conference on Computer Vision, pp. 248–261.
  • B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998.
  • B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik (2014) Simultaneous detection and segmentation. In European Conference on Computer Vision, pp. 297–312.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. arXiv preprint arXiv:1703.06870.
  • Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang (2019) Mask scoring R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6409–6418.
  • M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025.
  • A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother (2016) InstanceCut: from edges to instances with multicut. arXiv preprint arXiv:1611.08272.
  • S. Kong and C. C. Fowlkes (2018) Recurrent pixel embedding for instance grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9018–9028.
  • D. Lee, G. Cha, M. Yang, and S. Oh (2016) Individualness and determinantal point processes for pedestrian detection. In European Conference on Computer Vision, pp. 330–346.
  • Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2016) Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • S. Liu, J. Jia, S. Fidler, and R. Urtasun (2017) SGN: sequential grouping networks for instance segmentation. In The IEEE International Conference on Computer Vision (ICCV).
  • S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
  • A. Morar, F. Moldoveanu, and E. Gröller (2012) Image segmentation based on active contours without edges. In 2012 IEEE 8th International Conference on Intelligent Computer Communication and Processing, pp. 213–220.
  • D. Mumford and J. Shah (1989) Optimal approximations by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics 42 (5), pp. 577–685.
  • H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. In International Conference on Computer Vision (ICCV).
  • T. Pock, D. Cremers, H. Bischof, and A. Chambolle (2009) An algorithm for minimizing the Mumford-Shah functional. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 1133–1140.
  • J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik (2017a) Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (1), pp. 128–140.
  • J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool (2017b) The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675.
  • J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
  • M. Ren and R. S. Zemel (2016) End-to-end instance segmentation and counting with recurrent attention. arXiv preprint arXiv:1605.09410.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  • B. Romera-Paredes and P. H. S. Torr (2016) Recurrent instance segmentation. In European Conference on Computer Vision, pp. 312–329.
  • E. Strekalovskiy and D. Cremers (2014) Real-time minimization of the piecewise smooth Mumford-Shah functional. In European Conference on Computer Vision, pp. 127–141.
  • J. Uhrig, M. Cordts, U. Franke, and T. Brox (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In German Conference on Pattern Recognition, pp. 14–25.
  • J. Uhrig, E. Rehder, B. Fröhlich, U. Franke, and T. Brox (2018) Box2Pix: single-shot instance segmentation by assigning pixels to object boxes. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 292–299.
  • J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders (2013) Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171.
  • L. A. Vese and T. F. Chan (2002) A multiphase level set framework for image segmentation using the Mumford and Shah model. International Journal of Computer Vision 50 (3), pp. 271–293.
  • L. Xu, C. Lu, Y. Xu, and J. Jia (2011) Image smoothing via L0 gradient minimization. In Proceedings of the 2011 SIGGRAPH Asia Conference, pp. 1–12.
  • Z. Zhang, S. Fidler, and R. Urtasun (2016) Instance-level segmentation for autonomous driving with deep densely connected MRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 669–677.
  • Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun (2015) Monocular object instance segmentation and depth ordering with CNNs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2614–2622.
  • S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr (2015) Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1529–1537.

1 How many labels can DVIS predict?

In Section 5.3 of the paper, we report the average number of candidates in post-processing, which is much smaller than in RPN Ren et al. (2015)-based methods Bolya et al. (2019a); Li et al. (2016); He et al. (2017); Liu et al. (2018). This raises an interesting question: how many distinct objects can our framework predict? With multiple objects in the scene, the network has to be able to "see" all the objects in order to assign them different values. Fig. 3 shows the number of candidate segments inputted to post-processing on the PASCAL VOC and MS-COCO datasets; the number of candidates is usually only slightly higher than the number of objects. This shows that DVIS detects enough objects in each image without generating an overabundance of candidate segments.
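The two-threshold discretization mentioned in Fig. 3's caption can be sketched as follows. This is a minimal illustration under our own assumptions (the offsets, the rounding rule, and treating level 0 as background are ours): the continuous label map is rounded under two threshold offsets, and the connected components of each non-background level set form the candidate pool.

```python
import numpy as np
from collections import deque

def connected_components(binary):
    """4-connected components of a boolean map via BFS flood fill."""
    H, W = binary.shape
    labels = -np.ones((H, W), dtype=int)
    cur = 0
    for sy in range(H):
        for sx in range(W):
            if binary[sy, sx] and labels[sy, sx] < 0:
                q = deque([(sy, sx)])
                labels[sy, sx] = cur
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x),
                                   (y, x - 1), (y, x + 1)):
                        if (0 <= ny < H and 0 <= nx < W
                                and binary[ny, nx] and labels[ny, nx] < 0):
                            labels[ny, nx] = cur
                            q.append((ny, nx))
                cur += 1
    return labels, cur

def candidate_segments(label_map, offsets=(0.0, 0.5)):
    """Discretize a continuous instance-label map by rounding under two
    threshold offsets and collecting the connected components of every
    positive level set (a sketch of the two-threshold idea)."""
    segments = []
    for off in offsets:
        levels = np.round(label_map - off)
        for v in np.unique(levels):
            if v <= 0:  # treat level 0 as background (assumption)
                continue
            comp, n = connected_components(levels == v)
            segments += [(comp == i) for i in range(n)]
    return segments
```

Using two offset grids means an object whose labels sit between two integer levels under one offset is still captured cleanly under the other, which is why more objects can be detected than the maximal prediction value alone would suggest.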

Figure 3: Number of objects DVIS predicted vs. number of objects in the image on PASCAL VOC (left column) and COCO (right column). From top to bottom: histogram of the number of ground-truth objects in the dataset, and the number of discretized instances over the number of GT objects. Note that by using two sets of thresholds we are capable of detecting more objects than the maximal prediction value, and the number of candidate segments is only slightly more than the number of objects in the images.

2 Window size for computing relative loss

We show an ablation study to verify that it is indeed necessary for the permutation-invariant loss to compare pixel labels with a large spatial displacement. The ablation study is done on the PASCAL VOC dataset. We compare results where we limit the permutation-invariant loss to pixel pairs that are close by, with ranges of 8, 16, 32, 64, and 128 pixels tested respectively. Table 7 shows that a large window size significantly improves our performance.

Window size | mAP at IoU 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | avg.
range 8 | 63.98 | 57.74 | 50.54 | 36.48 | 14.23 | 44.59
range 16 | 63.38 | 57.55 | 49.72 | 37.49 | 14.09 | 44.45
range 32 | 65.4 | 59.7 | 51.4 | 39.8 | 15.7 | 46.4
range 64 | 68.21 | 62.82 | 56.73 | 49.34 | 33.5 | 54.1
range 128 | 70.3 | 68.0 | 60.2 | 50.6 | 33.7 | 56.6
Table 7: Results on the PASCAL VOC val. set for different window sizes used in the permutation-invariant loss
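The windowed pairwise loss studied above can be sketched in a few lines. The snippet below is a simplified stand-in, not the paper's exact loss (the hinge with a fixed margin, the uniform pair weighting, and the Chebyshev window are our assumptions): predicted values of pixel pairs within a window are pulled together when the pixels share a ground-truth instance id and pushed at least a margin apart otherwise.

```python
import numpy as np

def pairwise_pi_loss(pred, gt, window, margin=2.0):
    """Permutation-invariant pairwise loss on a real-valued label map.

    For every pixel pair within `window` (Chebyshev distance), penalize
    |pred_p - pred_q| when p and q share a ground-truth instance id, and
    a hinge max(0, margin - |pred_p - pred_q|) when they do not.
    """
    H, W = pred.shape
    total, count = 0.0, 0
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            if dy == 0 and dx == 0:
                continue
            # aligned views of pixel p and its neighbor q = p + (dy, dx)
            ys_p = slice(max(0, -dy), min(H, H - dy))
            xs_p = slice(max(0, -dx), min(W, W - dx))
            ys_q = slice(max(0, dy), min(H, H + dy))
            xs_q = slice(max(0, dx), min(W, W + dx))
            d = np.abs(pred[ys_p, xs_p] - pred[ys_q, xs_q])
            same = gt[ys_p, xs_p] == gt[ys_q, xs_q]
            total += np.where(same, d, np.maximum(0.0, margin - d)).sum()
            count += d.size
    return total / max(count, 1)
```

With a small window, two pixels belonging to distant instances of the same class are never compared, so nothing forces their labels apart; enlarging the window restores exactly those long-range constraints, which is consistent with the gains in Table 7.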

3 Regularization and Quantization

Since the Mumford-Shah regularization term and the quantization term mostly improve the boundaries, their impact on the interior of the object is relatively small. Unfortunately, the commonly used IoU metric focuses almost exclusively on the interior and ignores small differences on the boundaries. Hence, to illustrate the effect of the Mumford-Shah regularization, we compute the F1-measure, a semantic contour-based score from Csurka et al. (2013).

The score matches the boundary pixels of each predicted object against those of the corresponding ground-truth object of the same class, counting a boundary pixel as correct if it lies within a distance error tolerance of the other boundary; the resulting precision and recall are combined into an F1 score and averaged over the objects in each image, the supported categories, and the images. From Table 8, the model trained with the Mumford-Shah regularization is about 2% better than the model without it at a distance error tolerance of 1 pixel, which shows that it significantly improves performance near the boundary. The model trained with the additional quantization term performs on par at the tightest tolerance and scores higher at larger tolerances, since this term increases the margin between different instances so the detected instances are better shaped. Fig. 4 shows some visual examples: the predicted instance map is smoother, both inside the instances and on the background, instance boundaries are sharper with the Mumford-Shah regularization, and different instances are better separated when the quantization term is added.

Distance error tolerance (px) | 1 | 5 | 10
w/o MS regularization | 21.6 | 59.1 | 69.6
w/ MS regularization | 23.5 | 59.6 | 69.9
w/ quantization and MS regularization | 23.3 | 60.2 | 71.7
Table 8: Semantic contour F1-score on the PASCAL VOC val. set
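A boundary F1 of this kind can be sketched as follows. This is a minimal illustration in the spirit of Csurka et al. (2013), not the exact evaluation protocol (the 4-neighbour boundary definition and brute-force distance computation are simplifications we chose): a boundary pixel counts as matched if some boundary pixel of the other mask lies within the tolerance.

```python
import numpy as np

def boundary_pixels(mask):
    """Coordinates of mask pixels with at least one non-mask 4-neighbour."""
    m = mask.astype(bool)
    pad = np.pad(m, 1)
    interior = (pad[:-2, 1:-1] & pad[2:, 1:-1]
                & pad[1:-1, :-2] & pad[1:-1, 2:])
    return np.argwhere(m & ~interior)

def boundary_f1(pred_mask, gt_mask, tol=1.0):
    """Contour F1: precision = fraction of predicted boundary pixels
    within `tol` of the GT boundary, recall = the converse."""
    pb, gb = boundary_pixels(pred_mask), boundary_pixels(gt_mask)
    if len(pb) == 0 or len(gb) == 0:
        return 0.0
    d = np.linalg.norm(pb[:, None, :] - gb[None, :, :], axis=2)
    precision = (d.min(axis=1) <= tol).mean()
    recall = (d.min(axis=0) <= tol).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Raising `tol` makes the score forgiving of boundary misplacement, which is why the quantization variant in Table 8 pulls ahead mainly at the looser tolerances.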

RGB image | without MS reg. | with MS reg. | with quantization and MS reg.

Figure 4: Predicted instance maps from models trained without and with the Mumford-Shah regularization. With the regularization, the map is smoother inside the instances and on the background, and there is less noise along instance boundaries.

4 Influence of the IoU head

We run an ablation study to identify how the classification confidence and the predicted IoU affect the results. The final score is a weighted sum of the classification confidence and the predicted IoU, with a weight α ∈ [0, 1] on the IoU term. Fig. 5 shows that mAP at higher IoU thresholds improves as α increases, which means the predicted IoU helps detect more objects of higher quality.
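A linear fusion of this kind can be sketched in a few lines. The snippet below is our own illustrative assumption of how such a sweep could look (the linear form and the example values are not from the paper): with a higher weight on the IoU head, a well-shaped but less confidently classified segment is ranked above a confidently classified but poorly shaped one.

```python
import numpy as np

def fused_score(cls_conf, pred_iou, alpha):
    """Hypothetical linear fusion of classification confidence and the
    IoU head's prediction; alpha=0 ignores the IoU head entirely."""
    return (1.0 - alpha) * cls_conf + alpha * pred_iou

# Two candidate segments: one confidently classified but poorly shaped,
# one less confident but with a high predicted mask IoU.
cls = np.array([0.95, 0.70])
iou = np.array([0.40, 0.90])
rank_cls_only = np.argmax(fused_score(cls, iou, alpha=0.0))
rank_with_iou = np.argmax(fused_score(cls, iou, alpha=0.8))
```

Sweeping alpha reproduces the trade-off in Fig. 5: classification confidence alone favors easy, coarse detections, while weighting the predicted IoU promotes segments that overlap the object more tightly.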

Figure 5: Ablation study on how the IoU score affects instance segmentation on the PASCAL VOC val. set.

5 Predict instance map on unseen categories

Because our DVIS method learns to segment instances directly from instance-level ground truth, it can recognize 'objectness' in unseen categories by relating them to seen ones. We test this by running the model trained on the PASCAL VOC train set on images containing unseen categories from the DAVIS challenge Pont-Tuset et al. (2017b). Examples are shown in Fig. 6, which shows that DVIS can recognize 'objectness' and segment the instances.

RGB image                          GT                    Predicted instance map

Figure 6: Predicted instance map on unseen categories from DAVIS challenge Pont-Tuset et al. (2017b).

6 Qualitative Results on PASCAL VOC

We show some more qualitative results on the PASCAL VOC dataset in Fig.7.

7 Qualitative Results on COCO

We show some more qualitative results on the MS-COCO dataset in Fig. 8 and Fig. 9. We also show some failure cases in Fig. 10, where our method fails to predict a good instance map when the scene becomes too crowded.

Note that part of the reason the algorithm fails on these crowded scenes may be the way COCO is labeled. As can be seen in Fig. 10, only some of the persons in the scene are labeled. We hypothesize this confuses our algorithm more than anchor-based algorithms, since our permutation-invariant loss looks globally at all pixel pairs, whereas anchor-box-based methods only analyze locally within each box. It would be interesting to run the algorithm on a dataset where instances are more consistently labeled.

Figure 7: Examples from the PASCAL VOC 2012 val subset. From left to right: image, ground truth, predicted instance map, and final instance segmentation from DVIS (best viewed in color).

RGB image               GT             Predicted instance map          Final seg.

Figure 8: This figure shows qualitative results on COCO val2017 set, part(1)

RGB image               GT             Predicted instance map          Final seg.

Figure 9: This figure shows qualitative results on COCO val2017 set, part (2)

RGB image               GT             Predicted instance map          Final seg.

Figure 10: Examples of inaccurate predicted instance maps with crowded objects on the COCO val2017 set