Learning Panoptic Segmentation from Instance Contours

10/16/2020 · Sumanth Chennupati et al. · University of Maryland, University of Michigan, Valeo

Panoptic segmentation aims to provide an understanding of background (stuff) and instances of objects (things) at a pixel level. It combines the separate tasks of semantic segmentation (pixel-level classification) and instance segmentation into a single unified scene understanding task. Typically, panoptic segmentation is derived by combining semantic and instance segmentation tasks that are learned separately or jointly (multi-task networks). In general, instance segmentation networks are built by adding a foreground mask estimation layer on top of object detectors, or by using instance clustering methods that assign each pixel to an instance center. In this work, we present a fully convolutional neural network that learns instance segmentation from semantic segmentation and instance contours (boundaries of things). Instance contours, together with semantic segmentation, yield a boundary-aware semantic segmentation of things; connected component labeling on these results produces instance segmentation. We merge the semantic and instance segmentation results to output panoptic segmentation. We evaluate our proposed method on the Cityscapes dataset and present qualitative and quantitative results along with several ablation studies.


I Introduction

Panoptic segmentation [1, 2] offers a complete understanding of a scene by providing joint semantic and instance level predictions for background and objects at a pixel level. It is usually achieved by combining outputs from semantic segmentation and instance segmentation. Examples where panoptic segmentation offers a clear advantage over standalone semantic or instance segmentation include collective knowledge of distinct objects and drivable area around a self-driving car [3, 4], semantic and instance level details of cancerous cells in digital pathology [5], and understanding of the background and distinct individuals in a frame to enhance smartphone photography. Multi-task learning networks [6, 7, 8, 9] that jointly perform semantic and instance segmentation [1, 3] have accelerated progress in panoptic segmentation, in terms of both accuracy and computational efficiency, compared to traditional methods that naively fuse predictions from independent semantic and instance segmentation networks [2].

Fig. 1: (a) Semantic segmentation, (b) Instance contour segmentation, (c) Instance center regression and (d) Instance segmentation

Instance segmentation is typically achieved in one of two major ways: 1) foreground mask estimation of objects detected by an object detection model [1, 10, 11, 12], or 2) clustering-based instance assignment [13, 14]. Recently, single-stage instance segmentation methods have also been developed [15, 16]. These approaches use fully convolutional networks so that they can be trained in an end-to-end fashion.

Semantic segmentation is a mature task that is well explored in the literature relative to panoptic segmentation. We observe that panoptic segmentation can be obtained from semantic segmentation by additionally estimating instance-separating contours. Naively, the instance-separating contours could be an additional class in the segmentation task; in practice, it is difficult to get good performance for this class. This is illustrated in Figure 1, where semantic segmentation (a) and instance contour segmentation (b) together contain all the information needed to obtain panoptic segmentation. The minimal contours needed are those that separate two instances of the same object. However, these contours do not carry sufficient information to be learned on their own, so we use the entire instance contours.

In this work, we present a multi-task learning network, shown in Figure 2, that learns semantic segmentation, instance contours and center regression. The instance contours, together with semantic segmentation, let us derive instance segmentation and eventually panoptic segmentation. Our instance contour segmentation network is a binary segmentation network that predicts instance boundaries, including those between objects of the same category. Compared to semantic edge detection networks [17, 18], our instance contour estimation does not ignore boundaries between instances of the same category. We refine low-quality instances in our instance segmentation output using the center regression results: we split large instances or merge small ones using per-pixel 2-d offsets to an instance center predicted by the center regression head.

Fig. 2: We present a network that learns panoptic segmentation from semantic segmentation and instance contours (boundaries of things). We use a shared convolutional neural network to predict semantic segmentation, instance contours and center regression. Instance contours along with semantic segmentation yield a boundary-aware semantic segmentation of things. Connected component labeling on these results produces instance segmentation and eventually panoptic segmentation.

We hope that our idea encourages a new direction in panoptic segmentation research that ultimately leads to learning instance-separating contours within the segmentation task itself. The main contributions of this paper include:

  1. A novel method to learn panoptic segmentation and instance segmentation from semantic segmentation and instance contours.

  2. An instance contour segmentation network that learns boundaries between objects of the same semantic category.

II Related Work

Scene understanding [19] has witnessed tremendous progress over the past decade with the introduction of Convolutional Neural Networks [20, 21, 22], which aided the development of semantic segmentation (pixel-wise classification) and instance segmentation (pixel-level recognition of distinct objects). Panoptic segmentation [2], a joint semantic and instance segmentation, provides complete scene understanding by assigning each pixel a category and an instance. Separately, semantic edge detection [17] has been widely used to learn boundaries between semantic classes.

II-A Semantic Segmentation

A few years ago, semantic segmentation [23] was considered a challenging problem. Fully convolutional networks (FCNs) [24] made the development of accurate and efficient solutions possible.

Several enhancements pushed the performance of semantic segmentation higher by improving the encoder and decoder of FCNs. Dilated residual convolutions [25], feature pyramid networks [1, 26] and spatial pyramid pooling [27] are examples of encoder improvements, while U-Net [28] and densely connected CRFs [25, 29] are examples of decoder improvements. We use a combination of feature pyramid networks and the light-weight asymmetric decoder presented by Kirillov et al. [1] to learn semantic segmentation.

II-B Instance Segmentation

In instance segmentation, an instance id is assigned to every pixel of every known object within an image. Two-stage methods like Mask R-CNN [12] involve proposal generation from object detection followed by mask generation using a foreground/background binary segmentation network. These methods dominate the state of the art in instance segmentation but incur relatively high computational cost. Using YOLO [30], SSD [31] or other lightweight object detectors instead of Faster R-CNN [32] may seem promising, but they still carry the unavoidable additional compute of generating object proposals followed by mask generation.

Other approaches to instance segmentation range from clustering of instance embeddings [33] to prediction of instance centers using offset regression [13, 14]. These methods are logically straightforward but lag behind in accuracy and computational efficiency. Their major drawback is the use of compute-intensive clustering methods such as OPTICS [34] and DBSCAN [35]. In contrast, we derive instance segmentation from semantic segmentation using instance contours (boundaries of things).

Fig. 3: Proposed model architecture with a CNN backbone. Multi-scale features from the backbone are fed to a feature pyramid network and then to an upsampling neck followed by a prediction head. Our network has three heads for the semantic segmentation, instance contour segmentation and center regression tasks. Separate necks can be used for different heads/tasks as needed.

II-C Semantic Edge Detection

Semantic edge detection (SED) [17, 36] differs from edge detection [37] by predicting only edges that lie on semantic class boundaries. In SED, edges/boundaries that separate segments of one category from another are predicted, whereas in edge detection every edge is detected based on image gradients. Holistically-nested edge detection (HED) [38] was one of the first CNN-based edge detection methods. Later, several methods were proposed to address different challenges in edge detection, including prediction of crisp boundaries [18, 39] and the selection and supervision of intermediate feature maps [40, 41]. It is important to note that these methods ignore boundaries between instances of objects that belong to the same semantic category.

Deep Snake [42] recently proposed predicting instance contours by learning contours on top of object detection, replacing foreground mask estimation with contours to derive instance segmentation. Our instance contour segmentation, in contrast, is a single-stage method that directly estimates contours using a binary segmentation network.

II-D Panoptic Segmentation

Panoptic segmentation [2] combines semantic segmentation and instance segmentation to provide a class category and instance id for every pixel within an image. Recent works [1, 14, 3] use a shared backbone and predict panoptic segmentation by fusing the outputs of semantic and instance segmentation branches. Almost every work so far uses an FCN-based semantic segmentation branch, with variations such as dilated convolutions [14] or feature pyramid networks [1]. The choice of instance segmentation branch varies more widely, as discussed in Section II-B.

A major challenge in generating panoptic segmentation output is merging conflicting outputs from the semantic and instance branches. For example, semantic segmentation may predict that a pixel belongs to the car class while the instance segmentation branch predicts the same pixel as person. Several methods [3] were proposed to handle such conflicts in a learned fashion. Our method instead derives instance segmentation from semantic segmentation using instance contours; therefore, it does not require a conflict resolution policy like other existing methods.

III Proposed Method

Our proposed method is a multi-task neural network with several shared convolution layers and multiple output heads that predict semantic segmentation, instance contours and center regression. As shown in Figure 3, a common ResNet [22] backbone outputs multi-scale feature maps that are processed by a top-down feature pyramid network [26]. Feature maps from different levels are upsampled to a common scale through a series of 1×1 convolutions and combined before making output predictions. We refer to the upsampling stages as necks and to the prediction layers as heads.

Outputs from instance contour and semantic segmentation branches are combined to generate instance segmentation. We refine instance segmentation output using center regression results. Later, we simply merge semantic and instance segmentation outputs to generate panoptic segmentation.

III-A Model Architecture

We begin by introducing our shared backbone, which outputs multi-scale feature maps as shown in Figure 3. Our backbone uses ResNet [22] as the encoder, producing feature maps at scales {1/4, 1/8, 1/16, 1/32} w.r.t. the input image. Our pyramid is built using a Feature Pyramid Network (FPN) [26], which consumes the backbone feature maps (scales 1/4 to 1/32) in a top-down fashion and outputs feature maps with 256 channels while maintaining their input scale. The pyramid feature maps are then passed through a series of 1×1 convolutions and upsampled to 1/4 scale using 2-d bilinear interpolation in the neck layers, as proposed in [1]. These layers have 128 channels at each level. We sum the feature maps from different levels and pass them to the prediction heads. Our semantic segmentation head contains a 1×1 convolution layer with k filters (to output k maps for k classes) followed by 4× upsampling. We apply a softmax activation followed by an argmax over the k output maps to derive the full-resolution semantic segmentation output. Our instance contour estimation head is similar to the semantic segmentation head except that it has one output feature map and a sigmoid activation instead of a softmax. Our center regression head has two output channels that predict offsets from the instance center along the x and y axes, and has no special activation function. A minimal sketch of these heads is given below.
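The following is a minimal PyTorch sketch of the three heads, assuming the 128-channel, 1/4-scale fused feature map described above; the class count, module names and exact composition are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHeads(nn.Module):
    """Illustrative sketch of the three output heads described above.

    Assumes `feats` is the 128-channel, 1/4-scale feature map obtained by
    summing the upsampled neck outputs; `num_classes` is k.
    """

    def __init__(self, in_channels=128, num_classes=19):
        super().__init__()
        # Semantic head: 1x1 conv with k filters, one map per class.
        self.semantic = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        # Contour head: a single output map; sigmoid applied at the output.
        self.contour = nn.Conv2d(in_channels, 1, kernel_size=1)
        # Center regression head: 2 channels for (dy, dx) offsets, no activation.
        self.center = nn.Conv2d(in_channels, 2, kernel_size=1)

    def forward(self, feats):
        # 4x bilinear upsampling restores full input resolution.
        sem = F.interpolate(self.semantic(feats), scale_factor=4,
                            mode="bilinear", align_corners=False)
        con = F.interpolate(self.contour(feats), scale_factor=4,
                            mode="bilinear", align_corners=False)
        off = F.interpolate(self.center(feats), scale_factor=4,
                            mode="bilinear", align_corners=False)
        # Inference-time outputs: softmax+argmax class map, contour
        # probability, raw offsets (training would use the logits).
        return sem.softmax(dim=1).argmax(dim=1), torch.sigmoid(con), off
```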

III-B Loss Functions

We now define the explicit loss functions for the semantic segmentation and instance contour branches. We chose the cross-entropy loss for semantic segmentation. In Equation 1, $L_{seg}$ is the segmentation loss over $k$ classes for all $N$ pixels in the image, where $p_{i,c}$ and $y_{i,c}$ are the prediction probability and the ground truth for pixel $i$ and class $c$:

$$L_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{k} y_{i,c}\,\log(p_{i,c}) \qquad (1)$$

For instance contours, we chose the weighted multi-label binary cross-entropy loss [17] shown in Equation 2, where $\beta$ is the ratio of non-edge pixels to total pixels in the image, and $p_i$ and $y_i$ are the predicted contour probability and the ground truth for pixel $i$:

$$L_{wBCE} = -\frac{1}{N}\sum_{i=1}^{N}\big[\beta\, y_i \log(p_i) + (1-\beta)(1-y_i)\log(1-p_i)\big] \qquad (2)$$

We add a Huber loss term ($\delta = 0.3$) on the residual $a_i = p_i - y_i$:

$$L_{huber} = \frac{1}{N}\sum_{i=1}^{N}\begin{cases}\frac{1}{2}a_i^2, & |a_i| \le \delta\\ \delta\left(|a_i| - \frac{1}{2}\delta\right), & \text{otherwise}\end{cases} \qquad (3)$$

and an NMS loss term [18] to the contour loss to predict thin and crisp boundaries:

$$L_{nms} = -\frac{1}{|B|}\sum_{i \in B}\log h(i) \qquad (4)$$

where $B$ is the set of ground-truth boundary pixels and $h(i)$ is the normalized softmax response along the boundary normal at pixel $i$.
Fig. 4: Illustrative flow diagram of the proposed algorithm, which learns panoptic segmentation from semantic segmentation and instance contours

We compute the softmax response $h(i)$ along the normal direction of boundary pixels as described in [18]. For center regression, we use the Huber loss to compute the error between $\hat{o}_i$, the predicted offsets, and $o_i$, the ground truth offsets, with $\delta = 1$. Our total loss function is a weighted combination of the semantic loss, the contour losses and the center regression loss:

$$L_{total} = \lambda_1 L_{seg} + \lambda_2 L_{contour} + \lambda_3 L_{center} \qquad (5)$$

where $L_{contour}$ is defined as:

$$L_{contour} = L_{wBCE} + L_{huber} + L_{nms} \qquad (6)$$

We chose $\lambda_1$, $\lambda_2$ and $\lambda_3$ as 1, 50 and 0.1 for our experiments. A compact sketch of this combined loss is shown below.
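To make the weighting concrete, here is a minimal PyTorch sketch of the combined loss under our reading of Equations 1–6. The NMS term is omitted since its exact form follows [18]; all function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(sem_logits, sem_gt, contour_logits, contour_gt,
               offsets_pred, offsets_gt,
               w_seg=1.0, w_contour=50.0, w_center=0.1):
    """Sketch of the weighted multi-task loss (Eq. 5); the NMS term of
    [18] is left out for brevity."""
    # Cross-entropy over k classes for semantic segmentation (Eq. 1).
    l_seg = F.cross_entropy(sem_logits, sem_gt)

    # Weighted binary cross-entropy for contours (Eq. 2): beta is the
    # ratio of non-edge pixels, so the sparse edge pixels are up-weighted.
    beta = 1.0 - contour_gt.float().mean()
    weight = torch.where(contour_gt > 0, beta, 1.0 - beta)
    l_wbce = F.binary_cross_entropy_with_logits(
        contour_logits, contour_gt.float(), weight=weight)

    # Huber term on contour probabilities with delta = 0.3 (Eq. 3).
    l_huber = F.huber_loss(torch.sigmoid(contour_logits),
                           contour_gt.float(), delta=0.3)

    # Huber loss on the predicted 2-d center offsets with delta = 1.
    l_center = F.huber_loss(offsets_pred, offsets_gt, delta=1.0)

    # Weighted combination (Eq. 5/6), lambdas = 1, 50, 0.1.
    return w_seg * l_seg + w_contour * (l_wbce + l_huber) + w_center * l_center
```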

III-C Instance Segmentation

Our instance segmentation is derived from semantic segmentation, unlike other instance segmentation methods, as shown in Figure 4. As a first step, we generate a binary mask by selecting the instance ("thing") classes in the semantic segmentation, which we refer to as the instance class mask. We subtract the instance contours (from the instance contour segmentation head) from the instance class mask to derive a boundary-aware instance class mask. Using connected component labeling [43], we derive unique instances from this boundary-aware mask. We then map the semantic segmentation output to the generated instances: we assign the most frequent label found inside an instance as its category, and average the softmax predictions over the instance's area to generate its confidence. The sketch below illustrates this derivation.
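A minimal sketch of this step, assuming full-resolution per-pixel class ids and contour probabilities; the threshold value and helper names are illustrative.

```python
import numpy as np
from scipy import ndimage

def instances_from_contours(semantic, contour_prob, thing_ids,
                            contour_thresh=0.5):
    """Sketch of Sec. III-C: remove predicted contour pixels from the
    'thing' mask, then label the remaining connected components.

    semantic:     (H, W) int array of predicted class ids
    contour_prob: (H, W) float array from the contour head (post-sigmoid)
    thing_ids:    ids of instance ('thing') classes
    """
    # Instance class mask: pixels belonging to any thing class.
    thing_mask = np.isin(semantic, thing_ids)
    # Boundary-aware mask: subtract thresholded contours.
    separated = thing_mask & (contour_prob < contour_thresh)
    # Connected component labeling yields one id per instance.
    instance_ids, num = ndimage.label(separated)
    # Assign each instance its most frequent semantic label.
    categories = {}
    for i in range(1, num + 1):
        labels = semantic[instance_ids == i]
        categories[i] = int(np.bincount(labels).argmax())
    return instance_ids, categories
```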

III-D Refining Instance Segmentation

We refine the instance segmentation output using the center regression results. Our refinement consists of two stages: split and merge. We first estimate the centroids voted for by the center regression head. Within each instance, we cluster the centroid predictions using DBSCAN and split the instance if distinct centroids are found; two centroids are declared distinct if they are at least 20 pixels apart (eps). Our clustering stage does not incur the large computational cost of other methods [33, 13, 14], since we cluster within individual instances, which are much smaller than the entire image.

After instances are split, we estimate the mean centroid of every instance from the offsets predicted by the center regression head. If two mean centroids are closer than 20 pixels in Euclidean distance, we merge the corresponding instances. Finally, we remove all instances whose area is below a minimum area threshold and reassign their pixels to the instances whose centroids are closest to the centroids derived from the predicted offsets. A sketch of the split stage follows.
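A minimal sketch of the split stage: eps = 20 comes from the text, while `min_samples` and the array layout are illustrative choices. The merge stage would then compare mean voted centroids of instances and fuse any pair closer than 20 pixels.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_instance(instance_mask, offsets, eps=20, min_samples=50):
    """Sketch of the split stage (Sec. III-D): cluster the centers voted
    by per-pixel offsets inside ONE instance; if DBSCAN finds more than
    one centroid cluster at least `eps` pixels apart, the instance is
    split along the cluster assignment.

    instance_mask: (H, W) bool mask of a single instance
    offsets:       (2, H, W) predicted (dy, dx) offsets to the center
    """
    ys, xs = np.nonzero(instance_mask)
    # Each pixel votes for a center location: position + predicted offset.
    centers = np.stack([ys + offsets[0, ys, xs],
                        xs + offsets[1, ys, xs]], axis=1)
    # One cluster label per pixel; -1 marks DBSCAN noise.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)
    return (ys, xs), labels
```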

III-E Panoptic Segmentation

Panoptic segmentation is now obtained by simply merging the outputs of semantic segmentation and instance segmentation. As discussed in Section II-D, we do not need conflict resolution, since our instance segmentation is a byproduct of our semantic segmentation; we therefore never have conflicting predictions. A minimal sketch of this merge is shown below.
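A sketch of the merge, assuming the common Cityscapes-style panoptic id encoding (category id × 1000 + instance id), which the paper itself does not specify.

```python
import numpy as np

def merge_panoptic(semantic, instance_ids, categories):
    """Sketch of Sec. III-E: stuff pixels keep their semantic id; thing
    pixels get category_id * 1000 + instance_id (the usual Cityscapes
    panoptic encoding, assumed here). No conflict resolution is needed
    because the instances were carved out of the semantic map itself."""
    panoptic = semantic.astype(np.int64).copy()
    for inst_id, cat_id in categories.items():
        panoptic[instance_ids == inst_id] = cat_id * 1000 + inst_id
    return panoptic
```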

IV Experiments, Results and Discussion

Fig. 5: Qualitative results on the Cityscapes [44] dataset, obtained with a ResNet-50 [22] encoder using the separate-neck architecture, the wBCE + Huber loss combination, and split-and-merge refinement with min instance area = 300 pixels. Instance contour ground truth is generated with dilation rate = 2. From left to right: semantic segmentation, instance contour segmentation, center regression, instance segmentation.

In this section, we demonstrate the performance of our proposed method for panoptic segmentation on the Cityscapes [44] dataset. We also report the semantic and instance segmentation results from which the panoptic segmentation output is generated.

IV-A Experimental Setup

Cityscapes [44] is an automotive scene understanding dataset with 2975/500 train/val images at 1024×2048 resolution. It contains labels for semantic, instance and panoptic segmentation. We derive labels for our instance contour task by applying a contour detection algorithm to the instance ground truth masks, then dilating the resulting contours to thicken them; these serve as ground truth for the instance contour segmentation task. Cityscapes has 19 semantic categories, 8 of which are provided with instance masks. A sketch of this label-generation step is shown below.
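A sketch of the label generation; interpreting "dilation rate" as the number of 3×3 dilation iterations is our assumption, as is the use of a morphological gradient for contour extraction.

```python
import cv2
import numpy as np

def contour_ground_truth(instance_mask, dilation_rate=2):
    """Sketch of the instance-contour label generation described above:
    extract each instance's boundary, then thicken it by dilation."""
    contours = np.zeros(instance_mask.shape, dtype=np.uint8)
    kernel = np.ones((3, 3), np.uint8)
    for inst_id in np.unique(instance_mask):
        if inst_id == 0:          # skip background / stuff pixels
            continue
        mask = (instance_mask == inst_id).astype(np.uint8)
        # Boundary = mask minus its erosion (morphological gradient).
        eroded = cv2.erode(mask, kernel)
        contours |= (mask - eroded)
    # Thicken contours; dilation_rate = 2 is used in our experiments.
    return cv2.dilate(contours, kernel, iterations=dilation_rate)
```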

We train our network on full-resolution images with a batch size of 4. We use Group Normalization [45], which is effective at small batch sizes. We use an SGD optimizer with learning rate = , momentum = 0.9 and weight decay = . We initialize our ResNet encoders with pre-trained ImageNet [46] weights and train our networks for 48,000 iterations.

We measure the performance of semantic segmentation using mean intersection over union (mIoU), instance segmentation using mean average precision (mAP), and panoptic segmentation using the panoptic quality (PQ) [2], segmentation quality (SQ) and recognition quality (RQ) metrics.
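For reference, panoptic quality as defined in [2] decomposes into segmentation quality and recognition quality:

$$PQ = \frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|} = \underbrace{\frac{\sum_{(p,g) \in TP} \mathrm{IoU}(p,g)}{|TP|}}_{SQ} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{RQ}$$

where TP, FP and FN are the matched, unmatched predicted and unmatched ground-truth segments, respectively; a predicted and a ground-truth segment match when their IoU exceeds 0.5.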

Contour Loss         AP    PQ    PQTh  SQTh  RQTh
wBCE                 16.0  43.9  25.0  72.6  33.3
wBCE + Huber         24.3  47.8  33.2  76.3  42.9
wBCE + NMS           18.9  44.6  26.1  74.3  35.3
wBCE + Huber + NMS   23.3  46.7  32.4  76.1  42.1
TABLE I: Instance and panoptic segmentation results for different loss functions used as the instance contour loss. wBCE = weighted multi-label binary cross-entropy, AP = average precision, PQ = panoptic quality. PQTh, SQTh and RQTh denote the panoptic, segmentation and recognition quality of instance objects (things).

IV-B Ablation Experiments

IV-B1 Instance Contour Segmentation Loss Function

As mentioned before, we aim to predict thin and crisp instance contours. We study the loss functions discussed in Section III-B by evaluating instance and panoptic segmentation performance, as shown in Table I. We used a ResNet-50 encoder as our backbone and separate heads with a common neck, as discussed in Section III-A.

Dilation Rate AP PQ PQTh SQTh RQTh
1 24.1 46.0 30.5 73.1 40.6
2 24.3 47.8 33.2 76.3 42.9
3 22.6 46.6 32.0 75.6 41.7
TABLE II: Performance of instance and panoptic segmentation when different dilation rates were used to generate ground truth instance contours. Increasing the dilation rate increases the thickness of the ground truth instance contours.

We observed that adding the Huber and NMS loss terms improved instance and panoptic segmentation performance. The weighted binary cross-entropy combined with the Huber loss is the best combination we found, and we use it for the rest of the experiments in this paper. The qualitative results in Figure 5 demonstrate that the generated contours are thin and crisp with this combination.

Fig. 6: Panoptic segmentation results on the Cityscapes [44] dataset, obtained with a ResNet-50 [22] encoder using the separate-neck architecture, the wBCE + Huber loss combination, and split-and-merge refinement with min instance area = 300 pixels. Instance contour ground truth is generated with dilation rate = 2.

IV-B2 Instance Contour Ground Truth Dilation Rate

We generate our ground truth instance contours by applying a contour detection algorithm to the instance masks provided for objects in the Cityscapes dataset. The number of edge pixels is much lower than the number of non-edge pixels in this contour segmentation problem. We can alleviate this class imbalance using appropriate loss functions, as discussed in Section III-B, or by dilating the contours to increase their thickness. In Table II, we evaluate instance and panoptic segmentation performance for different dilation rates.

We observed that when an appropriate loss combination is used, the dilation rate does not have a significant impact on performance; however, increasing the dilation rate from 2 to 3 decreases it. We use a dilation rate of 2 to generate our ground truth contours for all other experiments.

Refinement      AP    PQ    PQTh  SQTh  RQTh
None            24.0  47.1  33.0  75.6  42.7
Split           24.2  47.7  33.1  76.1  42.9
Split + Merge   24.3  47.8  33.2  76.3  42.9
TABLE III: Instance and panoptic segmentation performance before and after refinement using the offsets predicted by the center regression head.
min Instance Area (pixels)   AP    PQ    PQTh  SQTh  RQTh
1                            10.0  40.6  17.6  75.5  23.1
100                          21.3  46.4  31.4  75.7  40.8
300                          24.3  47.8  33.2  76.3  42.9
500                          23.6  47.0  32.7  75.5  42.4
TABLE IV: Impact of the minimum instance area threshold during refinement of instance segmentation.

IV-B3 Refining Instance Segmentation

As discussed in Section III-D, we refine our instance segmentation output using the center regression results. We evaluate the effects of the split and merge components of our refinement process in Table III, and the effect of the minimum instance area threshold in Table IV.

Neck Backbone mIoU PQSt AP PQTh PQ
Shared ResNet-50 67.5 57.4 24.3 33.2 47.8
Separate ResNet-50 69.6 58.6 25.0 34.0 48.3
Shared ResNet-101 68.4 58.5 24.7 33.4 48.1
Separate ResNet-101 68.7 59.3 24.9 33.2 48.4
TABLE V: Performance of semantic, instance and panoptic segmentation using different network architecture choices.

We observed that refining the instance segmentation with the offsets predicted by center regression marginally improves instance segmentation performance. The refinement is nevertheless critical in cases where a broken contour misses the boundary between two instances, which would otherwise be wrongly predicted as a single instance. Similarly, occlusion by a pole or another thin object can mislead connected component labeling into interpreting the resulting contours as separate instances. The qualitative results in Figure 5 suggest that the offsets predicted by the center regression head are accurate for nearby objects but less accurate for objects farther away.

We also observed that choosing an appropriate minimum instance area threshold is critical to the performance of our proposed method. A suitable threshold removes spurious instances generated by artifacts in contour estimation; such artifacts can result from false contours around car mirrors, convex hulls, occlusion, etc.

IV-B4 Network Ablation

We experimented with different network architecture choices as discussed in Section III-A. We studied the impact of a shared neck versus separate necks for upsampling and adding features from the common feature pyramid network, and how the depth of the ResNet encoder affects performance, using ResNet-50 and ResNet-101 encoders (Table V). We observed that deeper ResNet encoders and separate necks yield better performance.

Method mIoU PQSt AP PQTh PQ
Two-stage object detection
Mask R-CNN [12] - - 31.5 - -
Weakly Supervised [11] 71.6 52.9 24.3 39.6 47.3
Panoptic-FPN [1] 74.5 62.4 32.2 51.3 57.7
Instance Clustering
Kendall et al [13] 78.5 - 21.6 - -
Panoptic-DeepLab [14] 78.2 - 32.7 - 60.3
Single-stage object detection
Poly YOLO [16]* - - 8.7 - -
Others
Uhrig et al. [47] 64.3 - 8.9 - -
Deep Watershed [48] - - 19.4 - -
SGN [49] - - 29.2 - -
Ours [ResNet-50] 69.6 58.6 25.0 34.0 48.3
Ours [ResNet-101] 68.7 59.3 24.9 33.2 48.4
TABLE VI: Comparison with other state-of-the-art methods on the Cityscapes val [44] dataset. *Poly-YOLO [16] is evaluated on a resized input image of size 416×832.

IV-C State-of-the-Art Comparison

In Table VI, we compare our proposed method against other semantic, instance and panoptic segmentation methods.

IV-C1 Comparison with Two-Stage Methods

As discussed in Section II-B, two-stage object detection methods [1, 12, 50, 11] dominate the state of the art in instance and panoptic segmentation. However, they incur additional compute cost in generating object detections followed by foreground masks. Mask R-CNN [12], for example, runs instance segmentation at 5–6 fps on a 1024×1024 image on a high-end GPU such as the Nvidia Titan X. When the semantic segmentation task is executed in parallel with instance segmentation to compute panoptic segmentation, the runtime of Mask R-CNN [12] declines further, making two-stage object detection based methods unsuitable for real-time applications. Our proposed method with a ResNet-50 encoder outputs panoptic segmentation at 3 fps on a mid-range Nvidia GTX 1080 GPU for a 1024×2048 image. We expect higher frame rates once our connected component labeling function is optimized for the GPU, as opposed to its current CPU-based implementation.

IV-C2 Comparison with Instance Clustering

Kendall et al. [13] was one of the early works to use multi-task learning to simultaneously learn semantic and instance segmentation. Panoptic-DeepLab [14] recently proposed a strong baseline for center regression based methods by exploiting dual Atrous Spatial Pyramid Pooling (ASPP) modules. We believe that using an ASPP module in our network would improve our semantic segmentation and, in turn, our instance and panoptic segmentation results; however, ASPP modules are computationally expensive compared to feature pyramid networks [1].

IV-C3 Comparison with Single-Stage Object Detection and Others

Poly-YOLO [16] reported 22 fps on a 416×832 image with an AP score of 8.7. Other methods like Deep Watershed [48] and SGN [49] incur substantial computational complexity in their instance assignment techniques. Our method is lightweight compared to object detection and instance clustering based methods, and performs better than other single-stage methods.

V Conclusion

In this paper, we presented a new approach to panoptic segmentation using instance contours. Our method is one of the first in which instance segmentation is generated as a byproduct of a semantic segmentation network. We evaluated our semantic, instance and panoptic segmentation results on the Cityscapes dataset and presented several ablation studies that clarify the impact of our architecture and training choices. We believe that our proposed method opens a new direction in instance and panoptic segmentation research and serves as a baseline for contour-based methods.

References

  • [1] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6399–6408, 2019.
  • [2] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9404–9413, 2019.
  • [3] Andra Petrovai and Sergiu Nedevschi. Multi-task network for panoptic segmentation in automated driving. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 2394–2401. IEEE, 2019.
  • [4] Daan de Geus, Panagiotis Meletis, and Gijs Dubbelman. Single network panoptic segmentation for street scene understanding. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 709–715. IEEE, 2019.
  • [5] Donghao Zhang, Yang Song, Dongnan Liu, Haozhe Jia, Siqi Liu, Yong Xia, Heng Huang, and Weidong Cai. Panoptic segmentation with an end-to-end cell r-cnn for pathology image analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 237–244. Springer, 2018.
  • [6] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
  • [7] Ganesh Sistu, Isabelle Leang, Sumanth Chennupati, Senthil Yogamani, Ciarán Hughes, Stefan Milz, and Samir Rawashdeh. Neurall: Towards a unified visual perception model for automated driving. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 796–803. IEEE, 2019.
  • [8] Sumanth Chennupati, Ganesh Sistu, Senthil Yogamani, and Samir A Rawashdeh. Multinet++: Multi-stream feature aggregation and geometric loss strategy for multi-task learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [9] Sumanth Chennupati, Ganesh Sistu., Senthil Yogamani., and Samir A Rawashdeh. Auxnet: Auxiliary tasks enhanced semantic segmentation for automated driving. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP,, pages 645–652. INSTICC, SciTePress, 2019.
  • [10] L. Porzi, S. R. Bulò, A. Colovic, and P. Kontschieder. Seamless scene segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8269–8278, 2019.
  • [11] Qizhu Li, Anurag Arnab, and Philip HS Torr. Weakly-and semi-supervised panoptic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 102–118, 2018.
  • [12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [13] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
  • [14] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. arXiv preprint arXiv:1911.10194, 2019.
  • [15] Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, and Ping Luo. Polarmask: Single shot instance segmentation with polar representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [16] Petr Hurtik, Vojtech Molek, Jan Hula, Marek Vajgl, Pavel Vlasanek, and Tomas Nejezchleba. Poly-yolo: higher speed, more precise detection and instance segmentation for yolov3. arXiv preprint arXiv:2005.13243, 2020.
  • [17] Zhiding Yu, Chen Feng, Ming-Yu Liu, and Srikumar Ramalingam. Casenet: Deep category-aware semantic edge detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5964–5973, 2017.
  • [18] David Acuna, Amlan Kar, and Sanja Fidler. Devil is in the edges: Learning semantic boundaries from noisy annotations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [19] Derek Hoiem, James Hays, Jianxiong Xiao, and Aditya Khosla. Guest editorial: Scene understanding. International Journal of Computer Vision, 112(2):131–132, 2015.
  • [20] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, contour and grouping in computer vision, pages 319–345. Springer, 1999.
  • [21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [23] Mennatullah Siam, Sara Elkerdawy, Martin Jagersand, and Senthil Yogamani. Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. In 2017 IEEE 20th international conference on intelligent transportation systems (ITSC), pages 1–8. IEEE, 2017.
  • [24] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
  • [25] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015.
  • [28] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
  • [29] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1529–1537, 2015.
  • [30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [31] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [33] Xiaodan Liang, Liang Lin, Yunchao Wei, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Proposal-free network for instance-level object segmentation. IEEE transactions on pattern analysis and machine intelligence, 40(12):2978–2991, 2017.
  • [34] Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. Optics: ordering points to identify the clustering structure. ACM Sigmod record, 28(2):49–60, 1999.
  • [35] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231, 1996.
  • [36] Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. Object contour detection with a fully convolutional encoder-decoder network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 193–202, 2016.
  • [37] Lucas J Van Vliet, Ian T Young, and Guus L Beckers. A nonlinear laplace operator as edge detector in noisy images. Computer vision, graphics, and image processing, 45(2):167–195, 1989.
  • [38] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
  • [39] Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu. Learning to predict crisp boundaries. In The European Conference on Computer Vision (ECCV), September 2018.
  • [40] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3000–3009, 2017.
  • [41] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4380–4389, 2015.
  • [42] Sida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, and Xiaowei Zhou. Deep snake for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [43] Hanan Samet and Markku Tamminen. Efficient component labeling of images of arbitrary dimension represented by linear bintrees. IEEE transactions on pattern analysis and machine intelligence, 10(4):579–586, 1988.
  • [44] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [45] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • [46] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [47] Jonas Uhrig, Marius Cordts, Uwe Franke, and Thomas Brox. Pixel-level encoding and depth layering for instance-level semantic labeling. In German Conference on Pattern Recognition, pages 14–25. Springer, 2016.
  • [48] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5221–5229, 2017.
  • [49] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn: Sequential grouping networks for instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3496–3504, 2017.
  • [50] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang. Attention-guided unified network for panoptic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7019–7028, 2019.