Deep convolutional neural networks (DCNNs) have proven to be an effective solution to pattern recognition problems in computer vision. To achieve high-quality results, DCNNs usually require extensive amounts of training data. In particular, modern datasets have annotated millions of images covering thousands of object categories with image tags or bounding boxes (bboxes), which has dramatically improved classification and detection performance. In contrast, the common datasets have orders of magnitude fewer pixel-level mask annotations, since acquiring mask annotations requires much more human effort (about 15 times more than box annotations). This limits the performance of instance and semantic segmentation models, which usually benefit from more pixel-level annotated training data.
In the past few years, weakly- and semi-supervised learning methods have been explored to mitigate this problem by utilizing the available large-scale weak annotations, such as image-level labels and bboxes. They often convert the weak annotations to pixel-level supervision with unsupervised approaches and train fully convolutional networks (FCNs) by iteratively inferring and refining segmentation masks. However, this is in effect a label denoising or data augmentation process and does not reduce the segmentation models' requirements for pixel-level training data. In addition, the weak information is not utilized effectively or accurately, because the converted mask labels are usually noisy or incomplete. This limits how much the large number of available weak annotations can contribute to the segmentation task.
In this work, we aim to achieve high-quality instance and semantic segmentation results using a small set of pixel-level mask annotations and a large set of box annotations, as shown in Fig. 1 and Fig. 2. Our architecture, named DASNet, consists of three modules: detection, attention, and segmentation. The detection module recognizes and localizes all objects of each class with bboxes. Given the products of detection, the attention module generates multi-scale class-specific features, which are used as the input to the segmentation module. The segmentation module, trained with pixel-level annotations, outputs binary segmentation masks. To achieve instance segmentation, the position-sensitive score map technique is adapted to DASNet.
Our method reduces the pixel-level training data requirements for instance and semantic segmentation in two ways. First, the class-agnostic segmentation strategy dramatically simplifies the pixel-level supervised learning task, which only has to recover object shapes, and all classes share the same segmentation model. Therefore, the segmentation model can easily converge to a good local minimum even when trained with a small number of pixel-level annotations. Second, state-of-the-art object detection models are deployed to achieve high-quality object recognition and coarse localization for segmentation. Instead of being converted to latent pixel-level labels, the available large-scale bboxes are utilized effectively and accurately by training the detection model. In this context, the key problem is how to leverage the high-level detection products to facilitate the low-level mask recovery. We solve this problem with an attention mechanism.
To the best of our knowledge, this is the first study to explore detection models for semi-supervised instance and semantic segmentation. Experimental results on the PASCAL VOC 2012 dataset show that our architecture substantially outperforms existing weakly- and semi-supervised techniques, especially when using a small number of mask annotations.
2 Related Work
Since acquiring pixel-level masks is expensive and time-consuming annotation work, researchers have recently paid more attention to developing techniques that achieve high-quality segmentation results when training examples with mask annotations are limited or missing. These techniques can be roughly classified as weakly- and semi-supervised learning. We discuss each in turn.
Weakly-supervised learning: To avoid the constraints of expensive pixel-level annotations, weakly-supervised approaches train the segmentation models with only weak annotations, including image-level labels, points, scribbles, and bboxes. All of these methods are dedicated to converting the weak annotations to latent pixel-level supervision and training FCNs by iteratively inferring and refining segmentation masks. However, without guidance from any strong annotations, the latent supervision is either incomplete or noisy. Therefore, the segmentation performance of these techniques is still inferior.
Semi-supervised learning: In semi-supervised learning of semantic segmentation, the assumption is that a large number of weak annotations and a small number of strong (pixel-level) annotations are available, which is usually satisfied in practice. Various types of weak annotations, such as image-level labels, scribbles, and bboxes, have been explored in the semi-supervised setting. Intuitively, , ,  augment the weak annotations with the small number of strong annotations for training the FCNs as in the weakly-supervised setting and achieve better performance than their weakly-supervised counterparts. In contrast,  decouples semantic segmentation into two sub-tasks, classification and class-agnostic segmentation, supervised with image-level labels and pixel-level masks respectively. This approach shows impressive performance even with a very small number of strong annotations.
Note that the common pre-training strategy (e.g., pre-training on ImageNet for the classification task) in the fully supervised setting can also be regarded as a way of alleviating the requirements for strong annotations. However, its main role is to promote the convergence of the segmentation models.
In this work, we assume that a large number of box annotations and a small number of mask annotations are available. Our method is most related to , , . The box annotations in ,  are converted to mask labels for training FCNs using unsupervised methods and a priori knowledge. In contrast, we utilize box annotations efficiently and accurately by training the detection module. Motivated by , we also decouple semantic segmentation into sub-tasks. However, the class-specific features generated from the classification component in  are usually sparse and noisy, since the classification network tends to focus only on small discriminative parts (e.g., the head of an animal). In this work, by exploiting the detection model, the attention module produces class-specific features that cover multiple complete objects at multiple scales.
3.1 Architecture Overview
Our proposed network consists of three modules: detection, attention, and segmentation, as shown in Fig. 3. Given an image, the detection module first detects objects of all classes with bboxes. Then the attention module generates class-specific multi-scale features by cropping the multi-layer features of the detection module with the detected boxes. Finally, the segmentation module integrates the feature maps of all scales hierarchically and recovers the binary masks for each class separately. Moreover, we demonstrate that DASNet can achieve instance segmentation by computing position-sensitive score maps. Next, we detail each module as well as the training and inference processes.
3.2 Detection Module
Trained with large amounts of bbox annotations, DCNN-based object detectors have achieved significant advances in recent years. In this paper, we explore the SSD models. One of the main properties of SSD is that multiple feature maps with different resolutions are combined to handle objects of various sizes, which has become a common technique. We do not see any obstacle to integrating other state-of-the-art detection models into DASNet. We briefly review the SSD approach below.
The SSD architecture is a single feed-forward convolutional network, which consists of a base network and multiple small convolutional predictors. Given an image, the base network generates multi-scale feature maps, denoted as , , where is the number of feature maps at different scales. On each feature map, a convolutional predictor is applied to produce a fixed-size collection of boxes and the corresponding object class scores for the instances present in those boxes. All scored boxes are further processed by a non-maximum suppression algorithm to produce the final detections, denoted as , where and are the coordinates and class label of the detected box . Refer to  for more details.
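The non-maximum suppression step mentioned above can be illustrated with a minimal greedy sketch in NumPy. This is a generic version of the algorithm, not the SSD implementation; the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box layout, and the threshold value are assumptions made for illustration.

```python
import numpy as np

def iou(box, boxes):
    """Jaccard overlap between one box and an (N, 4) array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS for one class; returns indices of kept boxes, best score first.

    The threshold value is an illustrative assumption.
    """
    order = np.argsort(-scores)          # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept box too much.
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

In practice the procedure is applied per class, so boxes of different classes never suppress each other.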
The detection module in DASNet therefore produces two kinds of products: multi-scale feature maps and detection results . In the following, we describe how these products are used to reduce the pixel-level training data requirements for instance and semantic segmentation.
3.3 Attention Module
The correspondence property of convolutional features is one of the main reasons that DCNNs can be successfully applied to localization tasks. Specifically, the convolutional features of a particular layer can be regarded as a two-dimensional grid of feature vectors, or a feature map. Correspondence means that if feature vectors in a particular feature map have similar values (e.g., as measured by inner product), their corresponding receptive field regions in the original image have similar appearances, and vice versa. Thus the vector values and their position information (i.e., ) represent “what” and “where”, respectively.
Motivated by this property, we design a simple yet effective attention mechanism to generate class-specific multi-scale features from the products of the detection module. Formally, given the target class and the detection products and , a set of class-specific boxes is selected first. Then the class-specific multi-scale features are computed by cropping the feature map at each scale with the class-specific boxes, as in equation (1):
where , , and , represent the feature vector coordinates and the size of the feature map, respectively. The box coordinates , , , are normalized by the image size. In short, the class-specific feature maps are obtained by setting to zero the feature vectors of the detection module that lie outside all class-specific boxes. The backward pass of the attention operation has a similar form.
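The zeroing operation described above can be sketched for a single scale as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `class_specific_features`, the `(channels, height, width)` layout, the `(xmin, ymin, xmax, ymax)` box order, and the use of cell centres to decide whether a feature vector lies inside a box are all hypothetical choices.

```python
import numpy as np

def class_specific_features(feature_map, boxes, labels, target_class):
    """Zero out feature vectors that lie outside every box of the target class.

    feature_map: (C, H, W) array from one scale of the detector.
    boxes: (N, 4) array of (xmin, ymin, xmax, ymax), normalized to [0, 1].
    labels: length-N sequence of class labels for the boxes.
    """
    _, height, width = feature_map.shape
    # Normalized centre of each feature cell (an assumed convention).
    xs = (np.arange(width) + 0.5) / width
    ys = (np.arange(height) + 0.5) / height
    mask = np.zeros((height, width), dtype=bool)
    for (xmin, ymin, xmax, ymax), label in zip(boxes, labels):
        if label != target_class:
            continue
        inside_x = (xs >= xmin) & (xs <= xmax)
        inside_y = (ys >= ymin) & (ys <= ymax)
        mask |= inside_y[:, None] & inside_x[None, :]
    # The output keeps the input's spatial size; only unrelated signals are zeroed.
    return feature_map * mask

features = np.ones((2, 4, 4))
boxes = np.array([[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0]])
labels = [1, 2]
attended = class_specific_features(features, boxes, labels, target_class=1)
```

Because the forward pass is an element-wise product with a binary mask, the gradient with respect to the input is the incoming gradient multiplied by the same mask, which matches the statement that the backward pass has a similar form.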
Significant differences between ROI (region-of-interest) pooling and our attention mechanism should be noted, although they share similar forms. ROI pooling takes as input a feature map and a bbox and outputs a small fixed-size feature map by pooling the feature vectors inside the single box region. Since all features outside the box are removed, the global spatial information is lost in ROI pooling. In contrast, our attention module generates a class-specific feature map of the same size as its input by zeroing out unrelated signals only. Therefore, the spatial information of all class-specific object instances is preserved. This information is necessary for recovering high-resolution object masks with the deconvolutional segmentation network discussed next.
Three properties of the multi-scale class-specific feature maps facilitate the pixel-level supervised learning task. (1) The object class of the feature maps has been identified in advance. (2) The object instances have been coarsely localized in high-level feature space by suppressing unrelated signals. (3) Multi-layer feature maps contain rich information for capturing objects of various sizes by leveraging the detection results. The following section describes how these multi-scale class-specific feature maps are used to generate binary pixel-level masks.
3.4 Segmentation Module
Given a set of multi-scale class-specific feature maps, , generated by the attention module, the objective of the segmentation module is to produce binary shape masks of class-specific object instances. In the semantic segmentation task, this is a pixel-wise binary classification problem that infers whether each pixel belongs to the given class or not. Instance segmentation additionally needs to determine which particular instance each pixel belongs to, as shown in Fig. 2. In this section, we discuss the semantic segmentation case; instance segmentation is introduced in the next section.
The segmentation module is a single deconvolution network, which has been successfully applied to the semantic segmentation task. Similar to , we adapt the deconvolution network proposed in  as our segmentation module. As shown in Fig. 3, the segmentation module hierarchically merges the multi-scale class-specific feature maps in a top-down manner and outputs a segmentation mask of the same size as the input image. Specifically, to merge class-specific feature maps and ( has the smallest size), is fed into a series of deconvolution and/or unpooling layers to generate an upsampled feature map of the same size as (including height, width, and channel number). Then and are concatenated along the channel dimension. In turn, the concatenated feature map is merged with the next lower feature map in the same way. This process is repeated until all class-specific feature maps are merged and the segmentation mask is generated.
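The top-down merging order can be sketched as below. This is only a shape-level illustration: nearest-neighbour upsampling stands in for the learned deconvolution and unpooling layers, so unlike the network described above it does not adjust the channel number. The function names `upsample_nearest` and `merge_top_down` are hypothetical.

```python
import numpy as np

def upsample_nearest(x, out_h, out_w):
    """Nearest-neighbour upsampling of a (C, H, W) map (stand-in for deconvolution)."""
    _, h, w = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[:, rows[:, None], cols[None, :]]

def merge_top_down(feature_maps):
    """Merge class-specific maps ordered from largest to smallest spatial size."""
    merged = feature_maps[-1]                      # start from the smallest map
    for lower in reversed(feature_maps[:-1]):
        _, h, w = lower.shape
        upsampled = upsample_nearest(merged, h, w)
        # Concatenate along the channel dimension, then continue downward.
        merged = np.concatenate([lower, upsampled], axis=0)
    return merged

maps = [np.zeros((4, 8, 8)), np.ones((6, 4, 4)), np.full((8, 2, 2), 2.0)]
merged = merge_top_down(maps)
```

After the last merge the result has the spatial size of the largest feature map; in the real network further layers would bring it to the input image resolution and produce the mask.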
For the semantic segmentation task, the segmentation module produces a two-channel class-specific segmentation map, in which the two channels represent foreground and background respectively. The class-specific segmentation loss is the pixel-wise softmax loss over the two classes (foreground and background).
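A minimal sketch of such a pixel-wise two-class softmax loss is given below. It assumes channel 0 is foreground and channel 1 is background, and it takes a `valid` mask so that some pixels can be excluded from the loss (as is done for pixels outside the class-specific bboxes during training). The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def binary_segmentation_loss(scores, target, valid):
    """Mean pixel-wise softmax cross-entropy over valid pixels.

    scores: (2, H, W) raw scores, channel 0 = foreground, 1 = background (assumed order).
    target: (H, W) integer array holding the correct channel index per pixel.
    valid:  (H, W) boolean array; False pixels are ignored.
    """
    # Numerically stable log-softmax over the channel axis.
    shifted = scores - scores.max(axis=0, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    picked = np.take_along_axis(log_prob, target[None, :, :], axis=0)[0]
    n_valid = valid.sum()
    if n_valid == 0:
        return 0.0
    return float(-(picked * valid).sum() / n_valid)

uniform = np.zeros((2, 2, 2))
target = np.zeros((2, 2), dtype=int)
all_valid = np.ones((2, 2), dtype=bool)
loss_uniform = binary_segmentation_loss(uniform, target, all_valid)
```

With uninformative scores the loss equals log 2, the entropy of a fair two-way choice, and it approaches zero as the correct channel dominates.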
Benefiting from properties (1) and (2) of the class-specific feature maps described above, the pixel-level supervised learning task of the segmentation module is dramatically simplified to determining whether each pixel within the region of the class-specific boxes belongs to the given class. Therefore, the segmentation accuracy remains competitive even when training with a very small number of pixel-level annotated training samples (e.g., 5 to 10 annotations per class), as demonstrated in Section 4.2. Moreover, property (3) allows the segmentation module to capture objects of various scales (see Fig. 1).
3.5 Instance Segmentation
In this section, we show how the position-sensitive score map technique is adapted to DASNet to achieve instance segmentation. We begin by introducing the original position-sensitive score map approach. From the top convolution features (), position-sensitive score maps are produced, where C is the number of object classes (+1 for background), k is the size of the relative position grid, and 2 represents the two groups (inside and outside). Given a ROI generated by the region proposal network, its pixel-wise score maps are produced by assembling (copy-pasting) its cells from the corresponding score maps. For each pixel in a ROI, there are two tasks: 1) detection: whether it belongs to an object bbox at a relative position or not; 2) segmentation: whether it is inside an object instance’s boundary or not. The detection score of the whole ROI is obtained via average pooling over all pixels’ likelihoods, which are the maximum of their inside and outside scores. The segmentation score (in probabilities) is the union of pixel-wise inside/outside softmax values. Note that the detection and segmentation scores are computed for each category. Thus, for each ROI, a softmax detection loss over categories, a softmax segmentation loss within the class-specific bbox area on the foreground mask of the ground-truth category only, and a bbox regression loss are applied for training.
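The assembling (copy-paste) step and the two per-ROI scores can be sketched for a single class as follows. This is a simplified illustration, not the original implementation: the `(2, k*k, H, W)` layout with the inside group first, the integer pixel ROI with exclusive end coordinates, and the function names are all assumptions.

```python
import numpy as np

def assemble_roi_scores(score_maps, roi, k):
    """Assemble inside/outside score maps for one ROI from k*k position-sensitive maps.

    score_maps: (2, k*k, H, W); group 0 = inside, group 1 = outside (assumed layout).
    roi: (x1, y1, x2, y2) in pixels, end coordinates exclusive.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    out = np.empty((2, roi_h, roi_w), dtype=score_maps.dtype)
    # Grid cell index of every ROI row and column in the k x k relative-position grid.
    row_cell = np.arange(roi_h) * k // roi_h
    col_cell = np.arange(roi_w) * k // roi_w
    for i in range(k):
        for j in range(k):
            rows = np.nonzero(row_cell == i)[0]
            cols = np.nonzero(col_cell == j)[0]
            if rows.size == 0 or cols.size == 0:
                continue
            # Cell (i, j) of the ROI is copied from score map number i*k + j.
            out[:, rows[:, None], cols[None, :]] = score_maps[
                :, i * k + j, (y1 + rows)[:, None], (x1 + cols)[None, :]
            ]
    return out

def roi_scores(assembled):
    """Detection score (mean of pixel-wise max) and inside probability (softmax)."""
    inside, outside = assembled
    detection = float(np.maximum(inside, outside).mean())
    shifted = assembled - assembled.max(axis=0, keepdims=True)
    prob_inside = np.exp(shifted[0]) / np.exp(shifted).sum(axis=0)
    return detection, prob_inside

k = 2
score_maps = np.zeros((2, k * k, 4, 4))
for group in range(2):
    for cell in range(k * k):
        score_maps[group, cell] = cell + 10 * group   # constant map per cell, easy to trace
assembled = assemble_roi_scores(score_maps, (0, 0, 4, 4), k)
detection, prob_inside = roi_scores(assembled)
```

Because each grid cell reads from a different score map, shifting the ROI changes which map a given pixel is read from, which is what makes the maps sensitive to relative position.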
To integrate the position-sensitive score map approach into DASNet, some modifications are necessary. First, the number of position-sensitive score maps is reduced to ( by default in the following experiments). This is because the binary class-specific segmentation of DASNet assumes that the object instances of a particular class have already been detected by the detection module, so the segmentation module only needs to segment inside/outside masks within the detected bbox. Second, the position-sensitive score maps are produced from the top convolutional features () of a deconvolution network, which have high resolution, instead of from the feature map of a truncated network with much () lower resolution. The effectiveness of the technique has not been demonstrated in this setting before. Third, in the training stage, each bbox has two equally weighted loss terms: a sigmoid cross-entropy loss for instance score regression and a softmax segmentation loss over the foreground mask of the ground-truth instance only. Fourth, in the testing stage, the segmentation module only outputs the detected instance mask. Note that the instance score regression loss cannot be removed; without learning from negative instance samples, the learned score maps would not be position-sensitive. Thus, the training samples for learning position-sensitive score maps must be organized carefully, as discussed next.
3.6 Training and Inference
Stage-wise training vs. joint training: For stage-wise training, we first train the detection module with bbox annotations, and then train the segmentation module with mask annotations while freezing the parameters of the detection module. For joint training, since the attention module allows gradients from the segmentation module to propagate back to the detection module, the detection and segmentation modules are trained with the bbox and mask annotations simultaneously. Fine-tuning is another option, in which the models obtained from stage-wise training are fine-tuned jointly. However, we have not observed improvements from either joint training or fine-tuning so far. Therefore, all experiments in this paper are conducted with stage-wise training.
Detection: The training and inference processes of the detection module are the same as in .
Semantic segmentation: Both mask and bbox annotations are needed to train the segmentation module (bbox annotations can easily be obtained from mask annotations). At the training stage, the ground-truth class-specific bboxes are fed to the attention module to generate multi-scale class-specific features, which are the input to the segmentation module. The ground-truth binary masks corresponding to the class-specific bboxes are used to compute the segmentation loss, in which the pixels outside the class-specific bboxes are ignored. At inference, the final semantic segmentation mask is obtained via a max operation over all class-specific score maps.
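The max operation at inference can be sketched as below. How background pixels are decided is not spelled out in the text, so the 0.5 threshold and the use of label 0 for background are assumptions, as is the function name `fuse_class_maps`.

```python
import numpy as np

def fuse_class_maps(fg_probs, class_ids, bg_threshold=0.5):
    """Combine per-class foreground probability maps into one label map.

    fg_probs: (K, H, W) foreground probabilities, one map per detected class.
    class_ids: length-K sequence with the class label of each map.
    Pixels whose best foreground probability is below bg_threshold become
    background (label 0); this rule is an illustrative assumption.
    """
    best = fg_probs.argmax(axis=0)               # index of the winning class map
    labels = np.asarray(class_ids)[best]         # fancy indexing returns a new array
    labels[fg_probs.max(axis=0) < bg_threshold] = 0
    return labels

fg_probs = np.array([
    [[0.9, 0.2, 0.6]],    # map for class 7
    [[0.1, 0.3, 0.8]],    # map for class 12
])
label_map = fuse_class_maps(fg_probs, class_ids=[7, 12])
```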
Instance segmentation: To train the segmentation module for instance segmentation, instance-aware semantic segmentation masks and bbox annotations are required. In this setting, the bbox annotations are used in two ways: 1) The ground-truth class-specific bboxes are used to generate multi-scale class-specific features. 2) For each ground-truth bbox, positive and negative instance bboxes ( in our experiments) are sampled for training the position-sensitive score maps. Specifically, the sampled bboxes that match one of the ground-truth bboxes (Jaccard overlap larger than 0.5) are treated as positives, and those that do not match any ground-truth bbox are treated as negatives. The instance score is set to 1 for positives and 0 for negatives, and is used to compute the instance score regression loss.
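The matching rule above can be written down directly. The sketch below computes the Jaccard overlap of two boxes and assigns the instance score; the function names and the `(x1, y1, x2, y2)` box layout are assumptions, and how candidate boxes are sampled around each ground-truth box is not specified in the text, so it is left out.

```python
def jaccard(a, b):
    """Jaccard overlap (intersection over union) of two boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def instance_scores(sampled_boxes, gt_boxes, threshold=0.5):
    """Return 1 for sampled boxes matching any ground-truth box, else 0."""
    return [
        1 if any(jaccard(s, g) > threshold for g in gt_boxes) else 0
        for s in sampled_boxes
    ]

gt = [(0, 0, 10, 10)]
sampled = [(0, 0, 10, 10), (1, 1, 11, 11), (5, 5, 15, 15), (20, 20, 30, 30)]
scores = instance_scores(sampled, gt)
```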
At inference, the detection module first detects object instances of all classes. Then each detected bbox is forwarded to the attention module to generate instance-specific features covering a particular instance. Finally, all instance masks are segmented separately.
4.1 Implementation Details
Dataset: In our experiments, we focus on the 20 PASCAL classes. We employ the PASCAL VOC and MS COCO datasets for training and the PASCAL VOC 2012 dataset for testing. The PASCAL dataset is extended with 10528 pixel-level annotated images in . To simulate a semi-supervised learning scenario, we construct a heterogeneously annotated training set, in which all images are labeled with bbox annotations and a fraction of the images have pixel-level annotations. We vary the numbers of bbox and mask annotations in the training set to demonstrate the effectiveness of our semi-supervised framework. To compare with existing weakly-, semi-, and fully-supervised learning methods, we split the mask annotations into (10528 images in ), (1464 images in the VOC 2012 training dataset), (25, 10, 5 images per class randomly selected from ), and split the bbox annotations into (16551 images from the VOC2007 trainval and VOC2012 trainval datasets), (144790 images from , VOC2007 test and MS COCO datasets). All images are resized to during training and testing. (Footnote 1: On , the images are resized to for training the detection module, since we use the released SSD model from https://github.com/weiliu89/caffe/tree/ssd.)
Data Augmentation: We use common strategies to augment training samples, including expanding, cropping, adding noise, and horizontal flipping as in . In addition, the number of class-specific bboxes is randomly set, which is similar to the combinatorial cropping proposed in .
Our proposed method is implemented on top of the Caffe library. To compare with existing methods, we use the VGG 16-layer net as our backbone. We use standard stochastic gradient descent (SGD) with momentum for optimization. The base learning rate is 0.01 for semantic segmentation and 0.02 for instance segmentation, and it is divided by 10 after 15K/30K iterations; the batch size is 32 and the momentum is 0.9. The source code will be made available.
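The step schedule can be made concrete with a small helper. The reading used here is an assumption: both milestones (15K and 30K iterations) apply within one training run, and the rate drops at the milestone iteration itself. The function name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(iteration, base_lr=0.01, milestones=(15000, 30000), gamma=0.1):
    """Step schedule: multiply the base rate by gamma once per milestone reached.

    Treating 15K and 30K as two milestones of a single run is an assumption.
    """
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed

semantic_lrs = [learning_rate(it) for it in (0, 15000, 30000)]
instance_lrs = [learning_rate(it, base_lr=0.02) for it in (0, 15000, 30000)]
```

Under this reading the semantic segmentation rate goes from 0.01 to 0.001 to 0.0001, and the instance segmentation rate from 0.02 to 0.002 to 0.0002.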
|M G+ ||-||-||67.5|
|M G+ ||-||66.9|
4.2 Semantic Segmentation
In this section, we evaluate the performance of DASNet on the semantic segmentation task on the VOC 2012 test set via the evaluation server. Segmentation accuracy is measured by Intersection over Union (IoU) between the ground-truth and predicted segmentation. Table 1 compares quantitative results for various supervision levels.
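For reference, the IoU measure can be sketched for a single pair of label maps as below. This is a simplified per-image version: the benchmark server accumulates statistics over the whole test set, which this sketch does not reproduce. The function name `mean_iou` and the ignore label value of 255 are assumptions.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred, gt: integer label maps of the same shape.
    Pixels equal to ignore_label in gt are excluded (an assumed convention).
    """
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = (p | g).sum()
        if union == 0:
            continue                      # class absent from both maps
        ious.append((p & g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, gt, num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.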
Trained with a small number of pixel-level mask annotations, DASNet achieves substantially better performance, without any post-processing, than other weakly-, semi-, and fully-supervised methods. In particular, when five training examples per class with mask annotations are used, the accuracy of the semi-supervised method DecoupledNet drops by 11.9 percentage points (from 66.6% to 54.7%), while the accuracy of DASNet drops by only 4.4 points (from 69.2% to 64.8%), compared with using the full mask annotations. The results show that DASNet can significantly reduce the pixel-level training data requirements for semantic segmentation, although we use bbox-level annotations, which are stronger than image-level labels. Compared with semi-supervised methods that use the same type of training examples, DASNet also requires far fewer pixel-level annotations to achieve similar results. Moreover, DASNet trained with 200 strong annotations and 16.5K bbox annotations can obtain higher accuracy than some fully supervised methods trained with 10K strong annotations.
Fig. 1 presents some qualitative semantic segmentation results produced by DASNet on the VOC 2012 test set. Even when trained with only an extremely small number (5-10 examples per class) of pixel-level annotated samples, DASNet can segment multiple objects of various sizes.
4.3 Instance Segmentation
Following , we evaluate the performance of DASNet on the instance segmentation task with mAP at IoU thresholds of 0.5 and 0.75. Table 2 shows the instance segmentation results. As the number of pixel-level training samples is reduced from 10K to 1.4K, the instance segmentation accuracy drops by only 1.4%, which is consistent with the results on the semantic segmentation task. Fig. 2 shows several examples of instance segmentation from the VOC 2012 test dataset.
By exploiting the detection model, DASNet significantly improves the performance of semi-supervised instance and semantic segmentation compared with existing methods. The proposed detection-attention-segmentation framework is in effect a coarse-to-fine localization process. Thus the large-scale available bbox annotations can improve segmentation results accurately and efficiently by training the detection model. However, this also raises a front-end error problem: detection errors directly lead to segmentation errors, as shown in Fig. 4. We have three suggestions for addressing this problem in future work. First, integrate a state-of-the-art detection model into the DASNet framework to reduce the detection error. Second, obtain the class-specific feature maps at pixel-level resolution instead of the bbox-level resolution used in this work. Third, explore joint learning to further improve the performance of both the detection and segmentation modules. These suggestions will be pursued in our future research, and more experiments should be conducted to further demonstrate the effectiveness of the proposed DASNet.
This paper introduces DASNet, a semi-supervised instance and semantic segmentation framework that reduces the need for pixel-level mask annotations by leveraging large-scale bbox annotations. The key idea is to exploit detection models to simplify the pixel-level supervised learning task, which reduces the pixel-level training data requirements of the segmentation model. The attention module is a key component that exploits the products of detection to facilitate learning class-agnostic segmentation through multi-scale class-specific feature maps. In addition, the position-sensitive score map technique is adapted to DASNet for instance segmentation. Experimental results show that our method substantially reduces the requirements for pixel-level annotations compared with existing semi-supervised instance and semantic segmentation methods.
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: Imagenet: A large-scale hierarchical image database. In: CVPR. (2009) 248–255
-  Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Li, L.J., Li, L.J., Shamma, D.A.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123(1) (2017) 32–73
-  Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova, A., Rom, H., Uijlings, J., Popov, S., Veit, A., Belongie, S., Gomes, V., Gupta, A., Sun, C., Chechik, G., Cai, D., Feng, Z., Narayanan, D., Murphy, K.: Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages (2017)
-  Everingham, M., Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. IJCV 88(2) (2010) 303–338
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. Volume 8693. (2014) 740–755
-  Pathak, D., Shelhamer, E., Long, J., Darrell, T.: Fully Convolutional Multi-Class Multiple Instance Learning. In: ICLR. (2014)
-  Pinheiro, P.O., Collobert, R.: From Image-level to Pixel-level Labeling with Convolutional Networks. In: ICCV. (2015) 1713–1721
-  Pathak, D., Krahenbuhl, P., Darrell, T.: Constrained convolutional neural networks for weakly supervised segmentation. In: CVPR. (2016) 1796–1804
-  Saleh, F., Aliakbarian, M.S., Alvarez, J.M.: Built-in foreground / background prior for weakly-supervised semantic segmentation. In: ECCV. (2016) 413–432
-  Papandreou, G., Chen, L.C., Murphy, K.P., Yuille, A.L.: Weakly-and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation. In: ICCV. (2015) 1742–1750
-  Lu, Z., Fu, Z., Xiang, T., Han, P., Wang, L., Gao, X.: Learning from weak and noisy labels for semantic segmentation. TPAMI 39 (2017) 1–1
-  Wei, Y., Liang, X., Chen, Y., Jie, Z., Xiao, Y., Zhao, Y., Yan, S.: Learning to segment with image-level annotations. Pattern Recognition 59 (2015) 234–244
-  Wei, Y., Liang, X., Chen, Y., Shen, X., Cheng, M.M., Zhao, Y., Yan, S.: STC: A Simple to Complex Framework for Weakly-supervised Semantic Segmentation. TPAMI pp(99) (2016) 1–1
-  Kwak, S., Hong, S., Han, B.: Weakly supervised semantic segmentation using superpixel pooling. In: AAAI Conference on Artificial Intelligence. (2017)
-  Chaudhry, A., Dokania, P.K., Torr, P.H.S.: Discovering Class-Specific Pixels for Weakly-Supervised Semantic Segmentation. In: BMVC. (2017)
-  Roy, A., Todorovic, S.: Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In: CVPR. (2017) 7282–7291
-  Durand, T., Mordan, T., Thome, N., Cord, M.: Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In: CVPR. (2017) 5957–5966
-  Oh, S.J., Benenson, R., Khoreva, A., Akata, Z., Fritz, M., Schiele, B.: Exploiting saliency for object segmentation from image level labels. In: CVPR. (2017)
-  Dai, J., He, K., Sun, J.: Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: ICCV. (2015) 1635–1643
-  Khoreva, A., Benenson, R., Hosang, J., Hein, M., Schiele, B.: Simple does it: Weakly supervised instance and semantic segmentation. In: CVPR. (July 2017) 1665–1674
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015) 3431–3440
-  Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR. (2015)
-  Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y.: Fully convolutional instance-aware semantic segmentation. In: CVPR. (2016)
-  Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L.: What’s the Point: Semantic Segmentation with Point Supervision. In: ECCV. (2016) 549–565
-  Xu, J., Schwing, A.G., Urtasun, R.: Learning to Segment Under Various Forms of Weak Supervision. In: CVPR. (2015) 3781–3790
-  Lin, D., Dai, J., Jia, J., He, K., Sun, J.: Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In: CVPR. (2016) 3159–3167
-  Hong, S., Noh, H., Han, B.: Decoupled deep neural network for semi-supervised semantic segmentation. In: NIPS. (2015)
-  Uijlings, J.R.R., Sande, K.E.A.V.D., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. IJCV 104(2) (2013) 154–171
-  Pont-Tuset, J., Gool, L.V.: Boosting object proposals: From pascal to coco. In: ICCV. (Dec 2015) 1546–1554
-  Rother, C., Kolmogorov, V., Blake, A.: ”grabcut”: interactive foreground extraction using iterated graph cuts. In: ACM SIGGRAPH. (2004) 309–314
-  Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: ECCV. (2016) 21–37
-  Long, J., Zhang, N., Darrell, T.: Do convnets learn correspondence? NIPS 2 (2014) 1601–1609
-  Girshick, R.: Fast r-cnn. In: ICCV. (2015) 1440–1448
-  Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV. (2015) 1520–1528
-  Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for scene segmentation. TPAMI pp(99) (2017) 1–1
-  Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ArXiv e-prints (February 2018)
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. TPAMI 39(6) (2017) 1137
-  Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV. (Nov 2011) 991–998
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2014)