Interactive Full Image Segmentation

12/05/2018 ∙ by Eirikur Agustsson, et al. ∙ Google

We address the task of interactive full image annotation, where the goal is to produce accurate segmentations for all object and stuff regions in an image. To this end we propose an interactive, scribble-based annotation framework which operates on the whole image to produce segmentations for all regions. This enables the annotator to focus on the largest errors made by the machine across the whole image, and to share corrections across nearby regions. Furthermore, we adapt Mask-RCNN into a fast interactive segmentation framework and introduce a new instance-aware loss measured at the pixel-level in the full image canvas, which lets predictions for nearby regions properly compete. Finally, we compare to interactive single object segmentation on the COCO panoptic dataset. We demonstrate that, at a budget of four extreme clicks and four corrective scribbles per region, our interactive full image segmentation approach leads to a 5% IoU gain, reaching 90% IoU.




1 Introduction

This paper addresses the task of interactive full image segmentation, where the goal is to obtain accurate segmentations for all object and stuff regions in the image. Full image annotations are important for many applications such as self-driving cars, navigation assistance for the blind, and automatic image captioning. However, creating such datasets requires large amounts of human labor. For example, annotating a single image took 1.5 hours for Cityscapes [17]. For COCO+stuff [11, 32], annotating one image took 19 minutes (80 seconds per object [32] plus 3 minutes for stuff regions [11]), which totals 39k hours for the 123k images. So there is a clear need for faster annotation tools.

This paper proposes an efficient interactive framework for full image segmentation, illustrated in Fig. 1. Given an image, an annotator first marks extreme points [37] on all object and stuff regions. These provide a tight bounding box with four boundary points for each region, and can be efficiently collected (7s per region [37]). Next, the machine predicts an initial segmentation for the full image based on these extreme points. Afterwards we present the whole image with the predicted segmentation to the annotator and iterate between (A) the annotator providing scribbles on the errors of the current segmentation, and (B) the machine updating the predicted segmentation accordingly.
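Since each region's four extreme points directly imply its tight bounding box, the conversion the framework relies on is trivial; a minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def box_from_extreme_points(points):
    """Tight bounding box (x_min, y_min, x_max, y_max) implied by the four
    extreme points (left-, right-, top- and bottom-most) of a region.
    Each point is an (x, y) pair clicked by the annotator."""
    pts = np.asarray(points)          # shape (4, 2)
    x_min, y_min = pts.min(axis=0)    # left-most x, top-most y
    x_max, y_max = pts.max(axis=0)    # right-most x, bottom-most y
    return int(x_min), int(y_min), int(x_max), int(y_max)
```

Because the extreme points lie on the object boundary, the resulting box is tight by construction, which is what makes extreme-point annotation cheaper than drawing boxes directly.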

While all previous interactive segmentation methods address the task of single object segmentation [3, 8, 18, 16, 20, 23, 28, 29, 30, 36, 34, 35, 43, 52], our approach of full image segmentation provides several advantages: (I) In interactive segmentation typically the annotator focuses on the biggest errors made by the machine. In single object segmentation this means the biggest errors for that one object. In full image segmentation this extends the annotator focus beyond a single object to the biggest errors made in the whole image. Intuitively this should lead to a better trade-off between annotation time and quality. (II) In single object segmentation, each annotator correction only affects the machine predictions for a single region. In our approach instead, a single correction specifies the extension of one region and the shrinkage of neighboring regions (Sec. 3.2 and Fig. 2). At the same time, regions compete for space in the image canvas (Sec. 3.1). Hence there are two ways in which a single correction affects several region predictions. (III) Single object segmentation annotates only object regions, while our method also annotates stuff regions, which capture important classes such as pavement or river. (IV) When segmenting regions individually, many pixels along the boundaries of objects will be assigned to either multiple instances, resulting in contradictory labels, or to none, leading to holes. By letting nearby regions compete directly on a single image canvas, we avoid both problems.

Instead of using click corrections [23, 28, 29, 30, 34, 35, 52], in our scenario we use scribble corrections [3, 8, 43] as the more natural choice. In single object segmentation, any correction simply flips a binary label (from object to non-object or vice versa). In our scenario such a correction is ambiguous since we deal with multiple regions. Therefore, the annotator first has to select the region to be extended, before making the correction itself. This selection, followed by the corrective movement, defines a complete path (a scribble, Fig. 3), which intuitively provides more information than just a click.

To the best of our knowledge, all deep interactive single object segmentation methods [23, 28, 29, 30, 34, 35, 52] are based on Fully Convolutional Network architectures [14, 33, 42]. Annotations are split into positive and negative corrections, indicating what should be added to the object segmentation and what should be removed. These are added as two extra input channels to the original RGB image. The network then outputs a binary image indicating which pixels are on the object (segmentation). In our case, since the number of regions differs per image, it is not clear how to adapt FCNs to multi-region input and instance-aware output. Instead, we base our approach on Mask-RCNN [21], which we adapt as follows. For each object and stuff region we use its bounding box to do Region-of-Interest (RoI) aligning, after which we add annotation corrections to the resulting feature map. This solves the problem of having a varying number of regions per image. As another advantage, processing a correction at inference time for FCNs [14, 33, 42] would require a forward pass through the whole network. Instead, in our case we only need to re-run the mask prediction branch of Mask-RCNN, which is considerably faster. Additionally, we propose a new loss. In Mask-RCNN [21] the loss is calculated for each mask individually. Instead, we project the mask predictions back on pixels in the common image canvas. Then we define a loss which is instance-aware yet lets predictions properly compete.

To summarize, our contributions are as follows: (1) We propose an interactive, scribble-based annotation framework which operates on the whole image to produce segmentations for all object and stuff regions. Importantly, our framework enables the annotator to focus on the largest errors made by the machine across the whole image. Moreover, even one provided correction can improve the segmentation quality of multiple regions. (2) We adapt Mask-RCNN [21] into a fast interactive segmentation framework and introduce a new instance-aware loss measured at the pixel-level in the common image canvas, which lets predictions for nearby regions properly compete. (3) We compare interactive full image segmentation to interactive single object segmentation on the COCO panoptic dataset [11, 25, 32]. We demonstrate that, at a budget of four extreme clicks and four corrective scribbles per region, our improvements bring a 5% IoU gain, reaching 90% IoU.

Figure 1: Our proposed region-based model for interactive full image segmentation (see Sec. 3.1 for details). We start from Mask-RCNN [21], but use user-provided boxes (from extreme points) instead of a box proposal network for RoI cropping, and concatenate the RoI features with annotator-provided corrective scribbles. Instead of predicting binary masks for each region separately, we project all region predictions into the common image canvas, where they compete for space. The network is trained end-to-end with a novel pixel-wise loss for the full image segmentation task (see Sec. 3.3).

2 Related Work

Semantic segmentation from weakly labeled data. Many works address semantic segmentation by training from weakly labeled data, such as image-level labels [26, 40, 50], point-clicks [5, 6, 13, 49], boxes [24, 35, 37] and scribbles [31, 51]. Boxes can be efficiently annotated using extreme points [37], which can also be used as an extra signal for generating segmentations [35, 37]. This is related to our work, as our method starts from extreme points for each region. However, the above methods operate on annotations collected before any machine processing. Our work instead addresses the interactive scenario, where the annotator iteratively provides corrective annotations for the current machine segmentation.

Interactive object segmentation. Interactive object segmentation is a long-standing research topic. Most classical approaches [3, 4, 8, 43, 18, 16, 20, 36] formulate object segmentation as energy minimization on a regular graph defined over pixels, with unary potentials capturing low-level appearance properties and pairwise or higher-order potentials encouraging regular segmentation outputs.

Starting from Xu et al. [52], recent methods address interactive object segmentation with deep neural networks [23, 28, 29, 30, 34, 35, 52]. These works build on fully convolutional architectures such as FCNs [33] or Deeplab [14]. They input the RGB image plus two extra channels for object and non-object corrections, and output a binary mask.

In [15] they perform interactive object segmentation in video. They use Deeplab [14] to create a pixel-wise embedding space. Annotator corrections are used to create a nearest-neighbor classifier on top of this embedding, enabling quick updates of the object predictions.

Finally, Polygon-RNN [1, 12] is an interesting alternative approach. Instead of predicting a mask, it uses a recurrent neural network to predict polygon vertices. Corrections made by the annotator are used by the machine to refine its vertex predictions.

Interactive full image segmentation. Recently, [2] proposed Fluid Annotation, which also addresses the task of full image annotation. Our work shares the spirit of focusing annotator effort on the biggest errors made by the machine across the whole image. However, [2] uses Mask-RCNN [21] to create a large pool of fixed segments and then provides an efficient interface for the annotator to rapidly select which of these should form the final segmentation. In contrast, in our work all segments are created from the initial extreme points and all become part of the final annotation. Our method then enables correcting the shape of segments to precisely match object boundaries.

Other works on interactive annotation. In [44] they combine a segmentation network with a language module to allow a human to correct the segmentation by typing feedback in natural language, such as “there are no clouds visible in this image”. The work of [38] annotates bounding boxes using only human verification, while [27] trained agents to determine whether it is more efficient to verify or draw a bounding box. The avant-garde work of [45] had a machine dispatching many labeling questions to annotators, including whether an object class is present, box verification, box drawing, and finding missing instances of a particular class in the image. In [47] they estimate the informativeness of having an image label, a box, or a segmentation for an image, which they use to guide an active learning scheme. Finally, several works tackle fine-grained classification through attributes interactively provided by annotators [9, 39, 7, 48].

3 Our interactive segmentation model

This section describes the model which we use to predict a segmentation from extreme points and scribble corrections. We first discuss the model architecture (Sec. 3.1). We then describe how we feed annotations to the model (extreme points and scribble corrections, Sec. 3.2). Finally, we describe model training with our new loss function (Sec. 3.3).


3.1 Model architecture

Our model is based on Mask-RCNN [21]. In Mask-RCNN inference is done as follows: (1) An input image is passed through a deep neural network backbone such as ResNet [22], producing a feature map Z. (2) A specialized network module (RPN [41]) predicts box proposals based on Z. (3) These box proposals are used to crop out Region-of-Interest (RoI) features from Z with an RoI cropping layer (RoI-align [21]). (4) Then each RoI feature is fed into three separate network modules which predict a class label, refined box coordinates, and a segmentation mask.

Fig. 1 illustrates how we adapt Mask-RCNN [21] for interactive full image segmentation. In particular, our network takes three types of inputs: (1) an image of size H × W; (2) N annotation maps (for extreme points and scribble corrections, Sec. 3.2); and (3) N boxes determined by the extreme points provided by annotators. Here N is the number of regions that we want to segment, which is determined by the annotator and may vary per image.

As in Mask-RCNN, an image is fed into our backbone architecture (ResNet [22]) to produce a feature map Z of size H/s × W/s × C, where C is the number of feature channels and s is a spatial reduction factor. Both C and s are determined by the choice of backbone architecture.

In contrast to Mask-RCNN, we already have the boxes, so we do not need a box proposal module. Instead, we use each box b_k directly to crop out an RoI feature Z_k from the feature map Z. All features Z_k have the same fixed spatial size (dependent only on the RoI cropping layer). We concatenate to this the corresponding annotation map, which is described in Sec. 3.2, and obtain a feature map with two extra channels.


For each such feature map, our network predicts a logit map y_k which represents the prediction of a single mask. While Mask-RCNN stops at such mask predictions and processes them with a sigmoid to obtain binary masks, we want to have predictions influence each other. Therefore we use the boxes to re-project the logit predictions of all masks back into the original image resolution, which results in N full-resolution prediction maps. We concatenate these prediction maps into a single tensor Y of size H × W × N. For each pixel i, we then obtain region probabilities p(i) of dimension N by applying a softmax to the logits,

p_k(i) = exp(Y_k(i)) / Σ_{j=1}^{N} exp(Y_j(i)),    (1)

where p_k(i) denotes the probability that pixel i is assigned to region k. This makes multiple nearby regions compete for space in the common image canvas.
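The projection-and-competition step can be sketched as follows. This is our own simplification with hypothetical names: the real model resizes fixed-size RoI logits back to each box, whereas here we assume each logit map already matches its box shape, and we fill uncovered pixels with a large negative logit so those regions get ~0 probability there:

```python
import numpy as np

def canvas_softmax(logit_maps, boxes, H, W):
    """Paste each region's logit map (predicted inside its box) back onto
    the full H x W canvas, then softmax across regions at every pixel.
    logit_maps: list of N 2-D arrays; boxes: list of (y0, x0, y1, x1).
    Returns an (H, W, N) array of region probabilities."""
    N = len(logit_maps)
    Y = np.full((H, W, N), -1e9)          # "not my region" default logit
    for k, ((y0, x0, y1, x1), m) in enumerate(zip(boxes, logit_maps)):
        Y[y0:y1, x0:x1, k] = m
    Y -= Y.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(Y)
    return e / e.sum(axis=-1, keepdims=True)
```

Because the softmax is taken per pixel over all N regions, raising one region's logits necessarily lowers the probability of its neighbors, which is exactly the competition the loss in Sec. 3.3 exploits.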

3.2 Incorporating annotations

Figure 2: We illustrate how we combine all nearby annotations into two region specific annotation maps. The colored regions denote the current predicted segmentation while the black boundaries depict the true object boundaries. For the red region, the extreme points and the single positive scribble are combined into a single positive binary channel, whereas all scribbles from other regions are collected into a single negative binary channel.

Our model in Fig. 1 concatenates the RoI features Z_k with an annotation map A_k. We now describe how we create A_k. First, for each region k we create a positive annotation map a_k^+ which is of the same size as the image. We choose the annotation map to be binary and we create it by pasting all extreme points and corrective scribbles for region k onto it. Extreme points are represented by a circle which is 6 pixels in diameter. Scribbles are 3 pixels wide.

For each region k, we collapse all annotations which do not belong to it into a single negative annotation map a_k^-. Then, we concatenate the positive and negative annotation maps into a two-channel annotation map

A_k = [a_k^+, a_k^-],    (2)

which is illustrated in Fig. 2. Finally, we apply RoI-align [21] to A_k using box b_k to obtain the desired cropped annotation map.

The way we construct enables the sharing of all annotation information across multiple object and stuff regions in the image. The negative annotations for one region are formed by collecting the positive annotations of all other regions. In contrast, in single object segmentation works [3, 8, 18, 16, 20, 23, 28, 29, 30, 36, 34, 35, 43, 52] both positive and negative annotations are made only on the target object and they are never shared, so they only have an effect on that one object.
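The construction of the two-channel map can be sketched as follows (an illustrative simplification with our own names; we assume the extreme-point discs and scribbles have already been rasterized into one boolean mask per region):

```python
import numpy as np

def annotation_map(region_idx, scribble_masks):
    """Two-channel binary annotation map for one region: channel 0 holds
    that region's own (positive) annotations, channel 1 collects the
    annotations of every other region as negatives.
    scribble_masks: list of N boolean H x W arrays."""
    pos = scribble_masks[region_idx]
    neg = np.zeros_like(pos)
    for j, m in enumerate(scribble_masks):
        if j != region_idx:
            neg |= m          # union of all other regions' annotations
    return np.stack([pos, neg], axis=-1)   # H x W x 2
```

This is what makes a single scribble benefit several regions at once: one positive stroke automatically becomes a negative constraint for every neighbor.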

3.3 Training

Training data. As training data, we have ground-truth masks for all object and stuff regions in all images. We represent the (non-overlapping) ground-truth masks of an image with region indices. This results in a map T of dimension H × W, which assigns each pixel i to a region T(i) ∈ {1, …, N}.

Pixel-wise loss. Standard Mask-RCNN is trained with Binary Cross Entropy (BCE) losses for each mask prediction separately. This means that there is no direct interaction between adjacent masks, and they might even overlap. Instead, we propose a novel instance-aware loss which lets predictions compete for space in the original image canvas.

In particular, as described in Sec. 3.1 we project all region-specific logits into a single image-level logit tensor Y, which is softmaxed into region assignment probabilities p of size H × W × N.

As described above, the ground-truth segmentation is represented by the map T with values in {1, …, N}, which specifies for each pixel its region index. Since we simulate the extreme points from the ground-truth masks, there is a direct correspondence between the region assignment probabilities p and T. Thus, we can train our network end-to-end for the Categorical Cross Entropy (CCE) loss on the region assignments:

L = − Σ_i log p_{T(i)}(i).    (3)
We note that while the CCE loss is commonly used in fully convolutional networks for semantic segmentation [14, 33, 42], we instead use it in an architecture based on Mask-RCNN [21]. Furthermore, usually the loss is defined over a fixed number of classes [14, 33, 42], whereas we define it over the number of regions N, which may vary per image.

The loss in (3) is computed over the pixels in the full-resolution common image canvas. Consequently, larger regions have a greater impact on the loss. However, in our experiments we measure Intersection-over-Union (IoU) between ground-truth masks and predictions, which considers all regions equally, independent of their size.

Therefore we weigh the terms in (3) as follows. For each pixel we find the smallest box b_k which contains it, and reweigh the loss for that pixel by the inverse of the size of b_k. This causes each region to contribute to the loss approximately equally.
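A minimal NumPy sketch of this weighted pixel-wise loss (our own illustrative implementation; the paper computes it inside the end-to-end network, and the final normalization by the total weight is an assumption):

```python
import numpy as np

def weighted_pixel_cce(probs, gt, boxes):
    """Categorical cross-entropy over the full canvas, with each pixel's
    term weighted by the inverse area of the smallest annotator box
    containing it, so small regions are not drowned out by large ones.
    probs: (H, W, N) region probabilities; gt: (H, W) region indices;
    boxes: list of N boxes (y0, x0, y1, x1)."""
    H, W, N = probs.shape
    smallest = np.full((H, W), np.inf)     # area of smallest covering box
    for (y0, x0, y1, x1) in boxes:
        area = float((y1 - y0) * (x1 - x0))
        view = smallest[y0:y1, x0:x1]
        np.minimum(view, area, out=view)
    w = 1.0 / smallest                     # pixels in no box get weight 0
    ii, jj = np.mgrid[0:H, 0:W]
    nll = -np.log(probs[ii, jj, gt] + 1e-12)   # per-pixel CCE term
    return (w * nll).sum() / w.sum()
```

With this weighting, a region whose box covers k pixels contributes roughly k · (1/k) = 1 unit of weight, so each region influences the loss about equally, matching the per-region IoU metric used for evaluation.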

Our loss shares similarities with [10]. They use Fast-RCNN [19] with selective search regions [46] and generate a class prediction vector for each region. Then they project this vector back into the image canvas using its corresponding region, resolving conflicts with a max operator. In our work instead, we project a full logit map back into the image (Fig. 1). Furthermore, while in [10] the number of logit channels is equal to the number of classes, in our work it depends on the number of regions N, which may vary per image.

3.4 Implementation details

The original implementation of Mask-RCNN [21] creates, for each RoI feature, mask predictions for all classes that it is trained on. At inference time, it uses the predicted class to select the corresponding predicted mask. Since we build on Mask-RCNN, we also do this in our framework for convenience of implementation. During training we use the class labels to train class-specific mask prediction logits. During inference, for each region we use the class label predicted by Mask-RCNN to select which mask logits we use for that region. Hence at inference time we have implicit class labels. However, class labels are never exposed to the annotator and are irrelevant for this paper.

4 Simulating annotations

Like other interactive segmentation works [1, 12, 23, 28, 29, 30, 34, 35, 52], we simulate annotations in our experiments.

Extreme points. To simulate the extreme points that the annotator provides at the beginning, we use the code provided by [35].

Scribble corrections. To simulate scribble corrections during the interactive segmentation process, we first need to select an error region. Error regions are defined as a connected group of pixels of a ground-truth region which has been wrongly assigned to a different region (Fig. 3). We assess the importance of an error region by measuring how much segmentation quality (IoU) would improve if it was completely corrected. We use this to create annotator corrections on the most important error regions (the exact way depends on the particular experiment, details in Sec. 5).
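The importance measure can be sketched as the IoU gain obtained by merging the error component into the prediction (an illustrative reading of the criterion; helper names are ours):

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-Union between two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def error_importance(pred, gt, error):
    """Estimated IoU gain for one error region: how much the region's IoU
    would improve if the error component (ground-truth pixels currently
    assigned elsewhere) were fully corrected, i.e. added to pred."""
    return mask_iou(pred | error, gt) - mask_iou(pred, gt)
```

Ranking error regions by this quantity lets the simulated annotator (and, in the free-allocation setting, the budget) target the corrections that pay off most.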

Figure 3: The colored regions mark the machine predictions, whereas the solid black boundary marks the ground-truth for those regions. To simulate a corrective scribble, we first sample an initial point to indicate which region we want to expand. We do this by sampling a point (in yellow) on the red prediction, close to the error region (striped). Two additional points (orange) are sampled uniformly from the error region, and the scribble is formed as a trajectory through the three points.

To correct an error, we need a scribble that starts inside the ground-truth region and extends into the error region. We simulate such scribbles with a three-step process, illustrated in Fig. 3: (1) first we randomly sample a point on the border of the error region that touches the ground-truth region (yellow point in Fig. 3); (2) then we sample two more points uniformly inside the error region (orange points in Fig. 3); (3) finally we construct a scribble as a smooth trajectory through these three points (using a Bezier curve). We repeat this process ten times, and keep the longest scribble that lies exclusively inside the ground-truth region (while all simulated points are on the ground-truth, the curve could cover parts outside it).
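A sketch of the curve construction and the acceptance test (our simplification: a quadratic Bezier bends toward its middle control point rather than passing through it exactly; a curve that should pass through a middle point m can use the control point 2m − (p0 + p2)/2):

```python
import numpy as np

def quad_bezier(p0, p1, p2, n=50):
    """Quadratic Bezier curve: starts at p0, ends at p2, bends toward p1.
    Returns an (n, 2) array of (row, col) points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 2 * np.asarray(p0, float)
            + 2 * (1 - t) * t * np.asarray(p1, float)
            + t ** 2 * np.asarray(p2, float))

def curve_inside(curve, mask):
    """True iff every (rounded) curve point falls on a True pixel of the
    boolean mask -- the acceptance test for a simulated scribble."""
    ij = np.round(curve).astype(int)
    if (ij < 0).any() or (ij[:, 0] >= mask.shape[0]).any() \
            or (ij[:, 1] >= mask.shape[1]).any():
        return False
    return bool(mask[ij[:, 0], ij[:, 1]].all())
```

Repeating the sampling and keeping the longest curve that passes `curve_inside` mirrors the ten-trial rejection step described above.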

5 Results

We use Mask-RCNN as basic segmentation framework instead of Fully Convolutional architectures [14, 33, 42] commonly used in single object segmentation works [23, 28, 29, 30, 34, 35, 52]. Therefore we first demonstrate in Sec. 5.1 that this is a valid choice by comparing to DEXTR [35] in the non-interactive setting where we generate masks starting from extreme points [37]. In Sec 5.2 we move to the full image segmentation task and demonstrate improvements resulting from sharing extreme points across regions and from our new pixel-wise loss. Finally, in Sec. 5.3 we show results on interactive full image segmentation.

5.1 Single object segmentation

DEXTR. In DEXTR [35] they predict object masks from four extreme points [37]. DEXTR is based on Deeplab-v2 [14], using a ResNet-101 [22] backbone architecture and a Pyramid Scene Parsing network [53] as prediction head. As input they crop a bounding box out of the RGB image based on the extreme points provided by the annotator. The locations of the extreme points are Gaussian blurred and fed as a heatmap to the network, concatenated to the cropped RGB input. The DEXTR segmentation model obtained state-of-the-art results on this task [35].

Details of our model. We compare DEXTR to a single object segmentation variant of our model, which we call our single region model. This version uses the original Mask-RCNN loss, computed individually per mask, and does not share annotations across regions. For a fair comparison to DEXTR, here we also use a ResNet-101 [22] backbone, which due to memory constraints limits the resolution of our RoI features and predicted masks. Moreover, we use their released code to generate simulated extreme point annotations. In contrast to subsequent experiments, for this experiment we also use the same Gaussian-blurred heatmaps as annotation inputs to our model as used in [35].

Dataset. We follow the experimental setup of [35] on the COCO dataset [32], which has 80 object classes. Models are trained on the 2014 training set and evaluated on the 2017 validation set (previously called 2014 minival). We measure performance in terms of Intersection-over-Union averaged over all instances.

Method IoU
DEXTR [35] 82.1
DEXTR (released model) 81.9
Our single region model 81.6
Table 1: Performance on COCO (objects only). The accuracy of our single region model is comparable to DEXTR [35].
X-points not shared X-points shared
Mask-wise loss 75.8 76.0
Pixel-wise loss 78.4 79.1
Table 2: Performance on the COCO Panoptic validation set when predicting masks from extreme points (X-points). We vary the loss and whether extreme points are shared across regions. The top-left entry corresponds to our single region model, the bottom-right entry corresponds to our full image model.

Results. Tab. 1 reports the original results from DEXTR [35], our reproduction using their publicly released model, and results of our single region model. Their publicly released model and our model deliver very similar results (81.9 and 81.6 IoU). This demonstrates that Mask-RCNN-style models are competitive with the commonly used FCN-style models for this task, and can yield state-of-the-art performance.

5.2 Full image segmentation

Experimental setup. We now move to the task of full image segmentation. Given extreme points for each object and stuff region, we predict a full image segmentation. We demonstrate the benefits of using our pixel-wise loss (Sec. 3.3) and of sharing extreme points across regions (i.e. extreme points for one region are used as negative information for nearby regions, Sec. 3.2).

Details of our model. In preliminary experiments we found that the resolution of the RoI features was limiting accuracy when feeding annotations into the segmentation head. Therefore we increased the resolution of both the RoI features and the predicted mask, and switched to ResNet-50 [22] due to memory constraints. Importantly, in all experiments from now on our model uses the two-channel annotation maps described in Sec. 3.2.

Dataset. We perform our experiments on the COCO panoptic challenge dataset [11, 25, 32], which has 80 object classes and 53 stuff classes. Since the final goal is to efficiently annotate data, we train on only 12.5% of the 2017 training set (15k images). We evaluate on the 2017 validation set and measure IoU averaged over all object and stuff regions in all images.

Results. As Tab. 2 shows, our single region model, which uses a mask-wise loss and does not share extreme points across regions, yields 75.8 IoU. When sharing extreme points across regions, we only gain +0.2 IoU. In contrast, when only switching to our pixel-wise loss, results improve by +2.6 IoU. Sharing extreme points is more beneficial in combination with our new loss, yielding an additional improvement of +0.7 IoU. Overall this model with both improvements achieves 79.1 IoU, +3.3 higher than the single region model. We call it our full image model.

5.3 Interactive full image segmentation

We now move to our final system for interactive full image segmentation. We start from the segmentations from extreme points made by our single region and full image models from Sec. 5.2. Then we iterate between: (A) adding annotator corrections, and (B) updating the machine segmentations accordingly.

Dataset and training. As before, we experiment on the COCO panoptic challenge dataset and report results on the 2017 validation set. Since during iterations our models consume scribble corrections in addition to extreme points, we train two new interactive models which we call our single region scribble model and our full image scribble model. These are architecturally equivalent to their counterparts in Sec. 5.2 (which input only extreme points), but are trained differently. To create training data for our interactive models, for each one we apply its non-interactive counterpart to another 12.5% of the 2017 training set. We generate simulated corrective scribbles as described in Sec. 4 and train each model on the combined annotations of extreme points and scribbles (Sec. 3.2). We keep these models fixed throughout all iterations of interactive segmentation.

Interactive scribble corrections. For each iteration of scribble corrections, we use a budget of 1 scribble per region. For our single region scribble model, we use exactly one scribble per region. Instead, for our full image scribble model we explore two correction allocation strategies. The first allocates exactly one scribble to each region. The second is the more interesting strategy where the annotator gets a budget of 1 scribble per region on average, but can freely allocate these scribbles to the regions. This way the annotator can focus efforts on the biggest errors in the image. This results in an unequal division of scribble corrections over the regions.
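The free-allocation strategy can be read as a greedy loop that always spends the next scribble on the largest remaining error anywhere in the image; a sketch under that assumption (the paper does not spell out the exact mechanism, and the gain estimates are supplied by the caller):

```python
import heapq

def allocate_scribbles(error_gains, budget):
    """Greedy free allocation: error_gains[k] lists, largest first, the
    estimated IoU gains for region k's error regions. Spend `budget`
    scribbles one at a time on the currently largest remaining error
    anywhere in the image. Returns scribble counts per region."""
    heap = [(-gains[0], k, 0) for k, gains in enumerate(error_gains) if gains]
    heapq.heapify(heap)
    counts = [0] * len(error_gains)
    for _ in range(budget):
        if not heap:
            break
        _, k, idx = heapq.heappop(heap)     # biggest error overall
        counts[k] += 1
        if idx + 1 < len(error_gains[k]):   # expose region k's next error
            heapq.heappush(heap, (-error_gains[k][idx + 1], k, idx + 1))
    return counts
```

Under such a scheme, regions whose predictions are already accurate receive no scribbles at all, while badly segmented regions may receive several, which is exactly the unequal division described above.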

Figure 4: We simulate corrective scribbles, starting with the output of the scribbles model on extreme points only.

Figure 5: We show example results obtained by our system using the full image scribble model with a total budget of one scribble per region for each step. Columns: input image with annotator-provided extreme points; machine predictions from extreme points; annotator-provided corrective scribbles (1 scribble/region); machine predictions from 1 scribble/region and extreme points; final result from 9 scribbles/region and extreme points; ground-truth. The first four columns show the improvements obtained from the first step, while the last two compare the final result after 9 steps with the ground-truth segmentation.

Results. Fig. 4 shows annotation cost (number of corrective scribbles per region) versus annotation quality. The two starting points at zero scribbles are equivalent to the top-left and bottom-right entries of Tab. 2 since they are made using the same (non-interactive) models, starting from extreme points only.

We first compare single region scribble to full image scribble while using the same scribble allocation strategy: exactly one scribble to each region. In contrast to its counterpart in Sec. 5.2, here the full image scribble model also shares scribble corrections across regions (in addition to sharing the extreme points). Fig. 4 shows that for both models accuracy rapidly improves with more scribble corrections. However, the full image scribble model always has a better trade-off between annotation effort and segmentation quality: to reach 85% IoU it takes 4 scribbles per region for the single region scribble model but only 2 scribbles for our full image scribble model. Similarly, to reach 88% IoU it takes 8 scribbles and 4 scribbles respectively. This demonstrates the benefits of sharing corrections across regions.

We now compare single region scribble to full image scribble, while using the more advanced allocation strategy of freely allocating scribbles, under the overall budget of one scribble per region on average. Fig. 4 demonstrates that the advantages of using full image scribble with this allocation strategy are even more pronounced: at a budget of four extreme clicks and four corrective scribbles per region, full image scribble delivers a 5% IoU gain over single region scribble, reaching 90% IoU. This demonstrates the benefits of considering the whole image when allocating scribbles, so as to focus effort on the largest errors across the whole image.

To get a feeling of how annotation progresses over iterations, Fig. 5 shows various qualitative examples. Notice how in the first example, the corrective scribble on the bear on the left creates a negative scribble for the rock, which in turn improves the segmentation of the bear on the right, demonstrating the benefit of shared annotations and competition between regions.

We conclude that for interactive full image segmentation, our new pixel-level loss, sharing annotations across regions, and focusing annotator effort on the biggest errors in the image all result in significant improvements to the trade-off between annotation cost and quality, finally leading to an IoU of 90% using just four extreme points and four corrective scribbles per region.

6 Conclusion

We propose an interactive, scribble-based annotation framework which operates on the whole image to produce segmentations for all object and stuff regions. By presenting the whole image plus predicted segmentations, the annotator can focus on the largest errors made by the machine across the whole image, while we propose to share annotator corrections across multiple regions. We adapt Mask-RCNN [21] into a fast interactive segmentation framework and introduce a new instance-aware loss measured at the pixel-level in the full image canvas, which lets predictions for nearby regions properly compete. On single object prediction on the COCO dataset [32] we demonstrate that our Mask-RCNN based framework is competitive with the commonly used FCN-style models for this task [23, 28, 29, 30, 34, 35, 52]. More importantly, on the COCO panoptic challenge dataset [11, 25, 32] we show that our interactive full image segmentation system significantly outperforms its single region counterpart: at a budget of four extreme clicks and four corrective scribbles per region, our improvements yield +5% IoU, reaching 90% IoU.


  • [1] D. Acuna, H. Ling, A. Kar, and S. Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR, 2018.
  • [2] M. Andriluka, J. R. R. Uijlings, and V. Ferrari. Fluid annotation: A human-machine collaboration interface for full image annotation. In ACM Multimedia, 2018.
  • [3] X. Bai and G. Sapiro. Geodesic matting: A framework for fast interactive image and video segmentation and matting. IJCV, 2009.
  • [4] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. Interactively co-segmentating topically related images with intelligent scribble guidance. IJCV, 2011.
  • [5] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What’s the point: Semantic segmentation with point supervision. In ECCV, 2016.
  • [6] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. In CVPR, 2015.
  • [7] A. Biswas and D. Parikh. Simultaneous active learning of classifiers & attributes via relative feedback. In CVPR, 2013.
  • [8] Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, 2001.
  • [9] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. In ECCV, 2010.
  • [10] H. Caesar, J. Uijlings, and V. Ferrari. Region-based semantic segmentation with end-to-end training. In ECCV, 2016.
  • [11] H. Caesar, J. Uijlings, and V. Ferrari. COCO-stuff: Thing and stuff classes in context. In CVPR, 2018.
  • [12] L. Castrejón, K. Kundu, R. Urtasun, and S. Fidler. Annotating object instances with a polygon-rnn. In CVPR, 2017.
  • [13] D.-J. Chen, J.-T. Chien, H.-T. Chen, and L.-W. Chang. Tap and shoot segmentation. In AAAI, 2018.
  • [14] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. on PAMI, 2017.
  • [15] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In CVPR, 2018.
  • [16] M.-M. Cheng, V. A. Prisacariu, S. Zheng, P. H. S. Torr, and C. Rother. Densecut: Densely connected crfs for realtime grabcut. Computer Graphics Forum, 2015.
  • [17] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [18] A. Criminisi, T. Sharp, C. Rother, and P. Perez. Geodesic image and video editing. In ACM Transactions on Graphics, 2010.
  • [19] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [20] V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zisserman. Geodesic star convexity for interactive image segmentation. In CVPR, 2010.
  • [21] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [23] Y. Hu, A. Soltoggio, R. Lock, and S. Carter. A fully convolutional two-stream fusion network for interactive image segmentation. Neural Networks, 2019.
  • [24] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, 2017.
  • [25] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation. ArXiv, 2018.
  • [26] A. Kolesnikov and C. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, 2016.
  • [27] K. Konyushkova, J. Uijlings, C. Lampert, and V. Ferrari. Learning intelligent dialogs for bounding box annotation. In CVPR, 2018.
  • [28] H. Le, L. Mai, B. Price, S. Cohen, H. Jin, and F. Liu. Interactive boundary prediction for object selection. In ECCV, 2018.
  • [29] Z. Li, Q. Chen, and V. Koltun. Interactive image segmentation with latent diversity. In CVPR, 2018.
  • [30] J. Liew, Y. Wei, W. Xiong, S.-H. Ong, and J. Feng. Regional interactive image segmentation networks. In ICCV, 2017.
  • [31] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.
  • [32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [33] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [34] S. Mahadevan, P. Voigtlaender, and B. Leibe. Iteratively trained interactive segmentation. In BMVC, 2018.
  • [35] K.-K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool. Deep extreme cut: From extreme points to object segmentation. In CVPR, 2018.
  • [36] N. S. Nagaraja, F. R. Schmidt, and T. Brox. Video segmentation with just a few strokes. In ICCV, 2015.
  • [37] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari. Extreme clicking for efficient object annotation. In ICCV, 2017.
  • [38] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari. We don’t need no bounding-boxes: Training object class detectors using only human verification. In CVPR, 2016.
  • [39] A. Parkash and D. Parikh. Attributes for classifier feedback. In ECCV, 2012.
  • [40] D. Pathak, P. Krähenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
  • [41] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [42] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [43] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.
  • [44] C. Rupprecht, I. Laina, N. Navab, G. D. Hager, and F. Tombari. Guide me: Interacting with deep networks. In CVPR, 2018.
  • [45] O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both worlds: human-machine collaboration for object annotation. In CVPR, 2015.
  • [46] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. IJCV, 2013.
  • [47] S. Vijayanarasimhan and K. Grauman. What’s it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In CVPR, 2009.
  • [48] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie. Similarity comparisons for interactive fine-grained categorization. In CVPR, 2014.
  • [49] T. Wang, B. Han, and J. Collomosse. Touchcut: Fast image and video segmentation using single-touch interaction. CVIU, 2014.
  • [50] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang. Revisiting dilated convolution: A simple approach for weakly- and semi-supervised semantic segmentation. In CVPR, 2018.
  • [51] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In CVPR, 2015.
  • [52] N. Xu, B. Price, S. Cohen, J. Yang, and T. Huang. Deep interactive object selection. In CVPR, 2016.
  • [53] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.