Multi-Stage Fusion for One-Click Segmentation

by Soumajit Majumder, et al.

Segmenting objects of interest in an image is an essential building block of applications such as photo-editing and image analysis. Under interactive settings, one should achieve good segmentations while minimizing user input. Current deep learning-based interactive segmentation approaches use early fusion and incorporate user cues at the image input layer. Since segmentation CNNs have many layers, early fusion may weaken the influence of user interactions on the final prediction results. As such, we propose a new multi-stage guidance framework for interactive segmentation. By incorporating user cues at different stages of the network, we allow user interactions to impact the final segmentation output in a more direct way. Our proposed framework has a negligible increase in parameter count compared to early-fusion frameworks. We perform extensive experimentation on the standard interactive instance segmentation and one-click segmentation benchmarks and report state-of-the-art performance.




1 Introduction

The widespread availability of smartphones has made taking photos easier than ever. In a typical image capturing scenario, the user taps the device touchscreen to focus on the object of interest. This tap directly locates the object in the scene and can be leveraged for segmentation. Generated segmentations are implicit, but are applicable for downstream photo applications, such as simulated ‘bokeh’ or other special-effects filters such as background blur (see Fig. 1). In this work, we tackle “tap-and-shoot segmentation” [tapnshoot], a special case of interactive instance segmentation.

Interactive segmentation leverages inputs such as clicks, scribbles, or bounding boxes to help segment objects from the background down to the pixel level. Two key differences distinguish tap-and-shoot segmentation from standard interactive segmentation. First, tap-and-shoot uses only “positive” clicks marking foreground, as we assume that the user clicks (only) on the object of interest during the capture process. Standard interactive segmentation uses both positive and negative clicks [ifcn, itis, majumder19] to respectively indicate the object of interest versus background or non-relevant foreground objects. Secondly, tap-and-shoot has a strong focus on maximizing the mean intersection over union (mIoU) with a single click because the target application is casual photography. In contrast, standard interactive segmentation tries to achieve some threshold mIoU (e.g. 85%) while minimizing the total number of clicks.

This second distinction is subtle but critical for designing and learning tap-and-shoot segmentation frameworks. Our finding is that existing approaches fare poorly with only one or two clicks – they are simply not trained to maximize performance under such settings. To make the most of the first (few) click(s), we hypothesize that user cues’ guidance should be fused into the network at multiple locations rather than via early fusion alone. Just as gradients vanish towards the initial layers during back-propagation, input signals also diminish as they make a forward pass through the network. The many layers of deep CNNs further exacerbate this effect [twostream, park2019]. A late fusion would allow the user interaction to have a direct and more pronounced effect on the final segmentation mask. To this end, we propose an interactive segmentation framework with multi-stage fusion and demonstrate its advantages over the common early fusion frameworks and other alternatives. Specifically, we propose a light-weight fusion block that encodes the user click transformation and allows a shorter connection from user inputs to the final segmentation layer.

Most similar in spirit to our framework are [twostream] and [guidedprop]. These two works also propose alternatives to early fusion but are extremely parameter heavy. For example, [twostream] uses two dedicated VGG [vgg] networks to extract features from the image and the user interactions separately before fusing them into a final instance segmentation mask (see Fig. 2(c)). [guidedprop] uses a single stream but applies a simple late fusion of element-wise multiplication on the feature maps (see Fig. 2(b)). It therefore has separate ‘positive’ and ‘negative’ feature maps and the number of weights for the following layer increases by a factor of 2. For VGG, this doubles the parameters of the ensuing ‘fc6’ layer from to million. Compared to [guidedprop], our last-stage fusion approach is light-weight and uses less than more trainable parameters.

Our contributions are summarized as follows:

  • We propose a novel one-click interactive segmentation framework that fuses user guidance at different network stages.

  • We demonstrate that multi-stage fusion is highly beneficial for propagating guidance and increasing the mIoU since it allows user interaction to have a more direct impact on the final segmentation.

  • Comprehensive experiments on six benchmarks show that our approach significantly outperforms existing state-of-the-art for both tap-and-shoot and standard interactive instance segmentation.

Figure 1: Motivation. We consider the popular special-effect filter used in mobile photography - background blur. Here the user intends to blur the rest of the image barring the dog. In most existing interactive segmentation approaches [ifcn, itis, majumder19], the user click (here placed on the dog) is leveraged only at the input layer and its influence diminishes through the layers. This can result in unsatisfactory image effects, e.g., portions of the dog’s elbow and ear are wrongly classified as background and are mistakenly blurred (shown in enlarged red boxes). Our proposed multi-stage fusion allows the user click to have a more direct effect, leading to improvement in segmentation quality (shown in enlarged green boxes).

2 Related Works

As an essential building block of image/video editing applications, interactive segmentation dates back decades [scissors]. The latest methods [itis, ifcn, majumder19, twostream, guidedprop] integrate deep architectures such as FCN-8s [fcn] or DeepLab [deeplabv3, deeplabv2]. Most of these approaches integrate user cues in the input stage. The clicks are transformed into ‘guidance’ maps and appended to the three-channel colour image input before being passed through a CNN [itis, ifcn, majumder19].

Early Interactive Instance Segmentation approaches used graph-cuts [graphcuts, grabcut], geodesics, or a combination [geodesic]. These methods’ performance is limited as they separate the foreground and background based on low-level colour and texture features. Consequently, for scenes where foreground and background are similar in appearance, or where lighting and contrast are low, more labelling effort is required from the users to achieve good segmentations [ifcn]. Recently, deep convolutional neural networks [fcn, deeplabv3] have been incorporated into interactive segmentation frameworks. Initially, [ifcn] used Euclidean distance-based guidance maps to represent user-provided clicks; these maps are passed along with the input RGB image through a fully convolutional network.

Figure 2: (a) Existing interactive instance segmentation and “tap-and-shoot” segmentation techniques concatenate user-provided cues as extra guidance map(s) (for ‘positive’ and ‘negative’ clicks) with the RGB image and pass everything through a segmentation network. (b-c) Other alternative approaches are extremely parameter heavy. (b) The work of [twostream] uses two dedicated VGG [vgg] networks for extracting features from image and user interactions separately. (c) The work of [guidedprop] performs late fusion via element-wise multiplication on the feature maps which requires an additional million parameters. (d) We leverage user guidance at the input (early fusion) and via late fusion. Our multi-stage fusion reduces the layers of abstraction and allows user interactions to have a more direct impact on the final output.

Subsequent works made extensions with newer CNN architectures [itis], iterative training procedures [itis] and structure-aware guidance maps [majumder19]. These works share a structural similarity: the guidance maps are concatenated with the RGB image as additional channels at the first (input) layer. We refer to this form of structure as early fusion (see Fig. 2(a)). Architecture-wise, early fusion is simple and easy to train; however, user inputs’ influence gets diminished through the layers.

Tap-and-Shoot Segmentation was introduced by [tapnshoot], and refers to the one-click interactive setting. One assumes that during image capture, the user taps the touchscreen (once) on the foreground object of interest, from which one can directly segment the object of interest.  [tapnshoot] uses early fusion; it transforms the user tap into a guidance map via two shortest-path minimizations and then concatenates the map to the input image. The authors validate only on simple datasets such as ECSSD [ecssd] and MSRA10K [msra], where the images contain a single dominant foreground object. As we show later in our benchmarks (see Table 1), these datasets are so simplistic that properly trained networks with no user input can also generate high-quality segmentation masks which are comparable or even surpass the results reported by [tapnshoot].

Feature Fusion in Deep Architectures is an efficient way to leverage complementary information, either from different modalities [temporal], or different levels of abstraction [latematting]. Element-wise multiplication [guidedprop] and addition [twostream, nuclei] are two common operations applied for fusing multiple channels. Other strategies include ‘skip’ connections [fcn], where features from earlier layers are concatenated with the features extracted from the deeper layers. Recently, a few interactive instance segmentation works have begun exploring outside of the early-fusion paradigm to integrate user guidance [twostream, guidedprop]. However, these approaches are heavy in their computational footprint, as they increase the number of parameters to be learned on the order of hundreds of millions [guidedprop]. Dilution of input information is commonplace in deep CNNs as the input gets processed by several blocks of convolution [park2019]. Feature fusion helps preserve input information by reducing the layers of abstraction between the user interaction and the segmentation output.

3 Proposed Method

3.1 Overview

We follow the conventional paradigm of [ifcn, itis, majumder19] in which ‘positive’ and ‘negative’ user clicks are transformed into ‘guidance’ maps of the same size as the input image. Unlike [ifcn, itis, majumder19], we work within the one-click setting. The user provides a single ‘positive’ click on the object of interest; this click is then encoded into a single-channel guidance map (see Sec. 3.3). We then feed the 3-channel RGB image input and the guidance map as an additional channel into a fully convolutional network (FCN). Fig. 3(a) shows an overview of our pipeline. Typically these FCNs are fine-tuned versions of semantic segmentation networks such as FCN-8s [fcn] or DeepLab [deeplabv2].
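The early-fusion input described above can be illustrated with a minimal sketch. It assumes the image and the guidance map are plain nested lists; the function name `early_fusion_input` is hypothetical and is not taken from the paper.

```python
def early_fusion_input(rgb, guidance):
    """Append a single-channel guidance map to an H x W x 3 image.

    rgb:      nested list, rgb[y][x] = [r, g, b]
    guidance: nested list, guidance[y][x] = float
    Returns an H x W x 4 nested list with channels (R, G, B, guidance).
    """
    same_height = len(rgb) == len(guidance)
    same_width = all(len(ri) == len(rg) for ri, rg in zip(rgb, guidance))
    if not (same_height and same_width):
        raise ValueError("image and guidance map must share height and width")
    # Concatenate along the channel axis, pixel by pixel.
    return [
        [list(pixel) + [g] for pixel, g in zip(row_rgb, row_g)]
        for row_rgb, row_g in zip(rgb, guidance)
    ]
```

The resulting four-channel tensor is what an early-fusion network receives at its first layer; in practice this would be a tensor concatenation in the deep learning framework of choice.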

For our base segmentation network, we use DeepLab-v2 [deeplabv2]; it consists of a ResNet-101 [resnet] feature extraction backbone and a Pyramid Scene Parsing (PSP) module [psp] acting as the prediction head. Upon receiving the input of size , the ResNet-101 backbone generates feature maps of dimension (Fig. 3(a)).

3.2 Multi-stage fusion

Figure 3: (a) Overview of our pipeline. Given an image and a ‘positive’ user click (shown as a green circle), we transform the click into a Gaussian guidance map, which is concatenated with the 3-channel image input and fed to our segmentation network. For ease of visualization, inverted values for the Gaussian guidance map are shown in the image. The output is the segmentation mask of the selected object. (b) SE-ResNet block. (c) Residual block.

The fusion module consists of Squeeze-and-Excitation residual blocks (SE-ResNet) [senet]. Proposed in [senet], SE-ResNet blocks have been shown to be effective for a variety of vision tasks such as image classification on ImageNet and object detection on MS COCO [mscoco]. SE-ResNet blocks incur minimal additional computational overhead as they consist of two convolutional layers, two inexpensive fully connected layers and a channel-wise scaling operation.

Each SE-ResNet block consists of a residual block, a squeeze operation which produces a channel descriptor by aggregating feature maps across their spatial dimensions, a dimensionality-reduction layer (with reduction ratio r) and an excitation operation which captures the channel interdependencies. The individual components of the SE-ResNet block are shown in Fig. 3(b). The residual block consists of two convolutions, batch normalization, and a ReLU non-linearity (Fig. 3(c)). We fix the number of filter banks to be for each of the convolutions. The reduction ratio r is kept as 16 [senet]. The input to the fusion block is a feature map which is obtained by processing the input with a convolution operation with stride 2, batch normalization, a ReLU non-linearity and a max-pooling operation with stride 2 (Init block, Fig. 3(a)). The final SE-ResNet block downsamples to generate a feature map. This is concatenated with the feature map obtained from the feature extraction backbone to obtain the fused feature map.
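To make the squeeze, excitation and channel-wise scaling steps concrete, here is a minimal pure-Python sketch. It covers only the channel re-weighting part of a Squeeze-and-Excitation block (not the residual convolutions or the skip connection), and all function names and the weight layout are assumptions made for illustration.

```python
import math


def squeeze(feature_maps):
    """Global average pooling: one scalar descriptor per channel.

    feature_maps[c][y][x] -> list of C floats.
    """
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def dense(vector, weights, biases):
    """Fully connected layer with weights[out][in] and biases[out]."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def excite(descriptor, w1, b1, w2, b2):
    """Bottleneck MLP (C -> C/r -> C): ReLU, then sigmoid gates in (0, 1)."""
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, w2, b2)]


def se_scale(feature_maps, w1, b1, w2, b2):
    """Re-weight every channel by its learned gate."""
    gates = excite(squeeze(feature_maps), w1, b1, w2, b2)
    return [
        [[gate * v for v in row] for row in channel]
        for channel, gate in zip(feature_maps, gates)
    ]
```

With C channels and reduction ratio r, `w1` has shape (C/r) x C and `w2` has shape C x (C/r), which is why the two fully connected layers add few parameters when r is 16.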

On top of these feature maps, PSP performs pooling operations at different grid scales on the feature maps to gather the global contextual prior, leading to feature maps of dimensions . The multi-scale feature pooling of PSP [psp] enables the network to capture objects occurring at different image scales. Pixel-wise foreground-background classification is performed on these down-sampled feature maps. The network outputs a probability map representing whether a pixel belongs to the object of interest or not. Bi-linear interpolation is performed to up-sample the predicted probability map to have the same dimensions as the original input image.

3.3 Transforming user click

In interactive approaches, the pixel values of the guidance map are defined as a function of each pixel's distance on the image grid to the points of user interaction (Eqn. 1). This includes Euclidean [ifcn, twostream] and Gaussian guidance maps [itis]. For each pixel position on the image grid, the pair of distance-based guidance maps for positive and negative clicks is computed by taking, over all clicks of the respective type, the minimum of a distance function between the pixel and the click.


For Euclidean guidance maps [ifcn], the function is the Euclidean distance. For Gaussian guidance maps, the ‘min’ is replaced by a ‘max’ operator. A more recent approach advocated taking image structures such as super-pixels and region-based object proposals into consideration to generate guidance maps [majumder19]. To generate the guidance maps, we use Gaussian transformations [itis] as they offer a favourable trade-off between simplicity and performance. We initialize an image-sized all-zero channel and place a Gaussian with a standard deviation of pixels at the user click location. Note that we do not use ‘negative’ clicks in our framework.
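The two click transformations discussed in this section can be sketched as follows. This is an illustrative implementation under stated assumptions: clicks are given as (row, column) tuples, the function names are hypothetical, and the default `sigma=10.0` is a placeholder, since the standard deviation used in the paper is not given in the text above.

```python
import math


def euclidean_guidance(height, width, clicks):
    """Each pixel stores the distance to its nearest click ('min' over clicks)."""
    return [
        [min(math.hypot(y - cy, x - cx) for cy, cx in clicks) for x in range(width)]
        for y in range(height)
    ]


def gaussian_guidance(height, width, clicks, sigma=10.0):
    """Each pixel stores the strongest Gaussian response ('max' over clicks).

    sigma is a hypothetical default, not the value used in the paper.
    """
    two_sigma_sq = 2.0 * sigma ** 2
    return [
        [
            max(
                math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / two_sigma_sq)
                for cy, cx in clicks
            )
            for x in range(width)
        ]
        for y in range(height)
    ]
```

In the one-click setting, `clicks` holds a single positive click, so the Gaussian map peaks at 1.0 on the clicked pixel and decays smoothly with distance from it.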

3.4 Implementation Details

Network Optimization. We train the network to minimize the class-balanced binary cross-entropy loss: the standard binary cross-entropy (BCE) between the label and the prediction at each pixel location, weighted by the inverse normalized frequency of that pixel's label within the mini-batch and taken over all pixels in the image. We optimize using mini-batch SGD with Nesterov momentum (with default value of ) and a batch size of 5. The learning rate is fixed at across all epochs and weight decay is . For the ResNet-101 backbone, we initialize the network weights from a model pre-trained on ImageNet [imagenet]. During training, we first update the early-fusion skeleton for - epochs. Next we freeze the weights of the early-fusion model and train the late-fusion weights for - epochs. Finally, we train the joint network for another epochs.
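A class-balanced binary cross-entropy of the kind described above might look like the following sketch. Two points are assumptions rather than details confirmed by the text: the inverse label frequencies are normalised so that the two class weights sum to 1, and the weighted per-pixel losses are averaged over the number of pixels. Labels are flat lists of 0/1 integers and predictions are foreground probabilities; all names are hypothetical.

```python
import math


def class_weights(labels):
    """Inverse label frequencies, normalised to sum to 1 (assumed normalisation)."""
    n = len(labels)
    n_fg = sum(labels)
    n_bg = n - n_fg
    if n_fg == 0 or n_bg == 0:
        # Only one class present: fall back to equal weights.
        return {0: 0.5, 1: 0.5}
    inv_bg, inv_fg = n / n_bg, n / n_fg
    total = inv_bg + inv_fg
    return {0: inv_bg / total, 1: inv_fg / total}


def bce(label, prob, eps=1e-7):
    """Standard binary cross-entropy for one pixel, clamped for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def class_balanced_bce(labels, probs):
    """Average of per-pixel BCE, each weighted by its label's class weight."""
    weights = class_weights(labels)
    weighted = [weights[y] * bce(y, p) for y, p in zip(labels, probs)]
    return sum(weighted) / len(labels)
```

The weighting matters because foreground objects often cover a small fraction of the image; without it, a network could score a low loss by predicting background everywhere.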

Simulating user clicks. Manually collecting user interactions is an expensive and arduous process [benenson19]. In a similar vein as [tapnshoot] and other interactive segmentation frameworks [majumder19, ifcn, itis], we simulate user interactions to train and evaluate our method. During training, we use the ground truth masks of the object instances from the MSRA10K dataset. To initialize, we take the center of mass of the ground truth mask as our user click location; we then jitter the click location by pixels randomly. The clicked pixel location is constrained to the confines of the object ground truth mask. The random perturbation introduces variation in the training data and also allows better approximation of true user interactions.
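The click simulation described above can be sketched as below. The jitter magnitude (`max_jitter=5`), the rejection-sampling loop and the nearest-pixel fallback are illustrative assumptions; the text does not specify the jitter range or how the click is kept inside the mask. Masks are nested lists of 0/1 values and clicks are (row, column) tuples.

```python
import random


def foreground_pixels(mask):
    """List the (row, col) coordinates of all non-zero mask pixels."""
    return [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def center_of_mass(coords):
    """Mean row and column of the given pixel coordinates."""
    cy = sum(y for y, _ in coords) / len(coords)
    cx = sum(x for _, x in coords) / len(coords)
    return cy, cx


def simulate_click(mask, max_jitter=5, rng=random, max_tries=50):
    """Jitter the mask's center of mass, keeping the click inside the mask."""
    coords = foreground_pixels(mask)
    if not coords:
        raise ValueError("mask has no foreground pixels")
    cy, cx = center_of_mass(coords)
    for _ in range(max_tries):
        y = round(cy) + rng.randint(-max_jitter, max_jitter)
        x = round(cx) + rng.randint(-max_jitter, max_jitter)
        inside = 0 <= y < len(mask) and 0 <= x < len(mask[0])
        if inside and mask[y][x]:
            return y, x
    # Fallback: the foreground pixel closest to the center of mass.
    return min(coords, key=lambda p: (p[0] - cy) ** 2 + (p[1] - cx) ** 2)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled clicks reproducible, which is convenient when comparing models on the same simulated interactions.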

4 Experimental Validation

4.1 Datasets

We evaluate on six publicly available datasets commonly used to benchmark interactive image segmentation [tapnshoot, ifcn, itis, majumder19]: MSRA10K [msra], ECSSD [ecssd], GrabCut [grabcut], Berkeley [berkeley], PASCAL VOC 2012 [pascal] and MS COCO [mscoco]. We use the mean intersection over union (mIoU) of the foreground w.r.t. the ground truth object mask across all instances to evaluate the segmentation accuracy, as per existing works [fcn, ifcn, tapnshoot, itis, majumder19].
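For reference, the foreground IoU and its mean over instances can be computed as in the sketch below. Masks are assumed to be binary nested lists of equal size, the function names are hypothetical, and returning 1.0 when both masks are empty is a convention chosen here, not one stated in the text.

```python
def iou(pred, gt):
    """Intersection over union of the foreground in two binary masks."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Convention (an assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs, one per instance."""
    scores = [iou(pred, gt) for pred, gt in pairs]
    return sum(scores) / len(scores)
```

Reported mIoU values such as those in the tables correspond to `mean_iou` expressed as a percentage over all evaluated object instances.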

MSRA10K has natural images; the images are characterized by variety in the foreground objects whilst the background is relatively homogeneous. Extended complex scene saliency dataset (ECSSD) is a dataset of natural images with structurally complex backgrounds. GrabCut is a dataset consisting of images with typically a distinct foreground object. It is a popular dataset for benchmarking interactive instance segmentation algorithms. Berkeley dataset consists of natural images. PASCAL VOC 2012 consists of training and validation images across different object classes; many images contain multiple objects. MS COCO is a challenging large-scale image segmentation dataset with different object categories, of which are common with the PASCAL VOC categories.

4.2 Tap-and-Shoot Segmentation

Method            res   GrabCut [grabcut]   Berkeley [berkeley]   ECSSD [ecssd]   MSRA-10K [msra]
TNS [tapnshoot]   256   72.3 / 79.0         55.7 / 67.0           70.3 / 76.0     81.1 / 85.0
vgg-baseline      256   73.5 / 77.4         58.2 / 63.2           71.2 / 72.3     83.4 / 86.2
vgg-early         256   76.2 / 80.1         62.8 / 65.3           74.8 / 76.5     87.1 / 87.5
resnet-baseline   256   81.6 / 83.0         68.5 / 68.2           80.2 / 82.0     86.4 / 86.9
resnet-early      256   83.3 / 84.3         75.0 / 75.3           82.0 / 83.6     88.6 / 89.6
resnet-multi      256   84.1 / 85.7         75.1 / 78.4           81.9 / 85.2     91.5 / 92.1
resnet-baseline   512   76.1 / 79.0         65.5 / 68.3           79.9 / 82.6     87.0 / 87.9
resnet-early      512   82.9 / 84.5         76.2 / 78.1           85.6 / 85.7     91.5 / 91.4
resnet-multi      512   83.1 / 86.2         80.1 / 81.3           86.8 / 87.1     92.5 / 93.1
Table 1: Ablation Study: Tap-and-Shoot Segmentation. ‘res’ refers to the image resolution used during training. Each cell reports the average mIoU of the segmentation results after training for 16K iterations / after training convergence. The -baseline models receive a 3-channel RGB image as input without the guidance map.

Following [tapnshoot], we use MSRA10K [msra] for training and partition the dataset into three non-overlapping subsets of , and images as our training, validation and test set. We report the mIoU after training for 16K iterations and again after network convergence (at 43k iterations for us, vs. 260k iterations in [tapnshoot]) in Table 1. During training, we resize the images to pixels. This choice of resolution is driven primarily by matching the resolution to that of the training images for the ResNet-101 backbone [resnet].

The -baseline models are trained using only the 3-channel RGB image and the instance ground truth mask without any user click transformations. The -early models use Gaussian guidance maps [itis]; the network input is a 3-channel RGB image and the Gaussian encoding of the user’s tap on the object of interest (Fig. 2(a)). The -multi models refer to the multi-stage fusion models with Gaussian encoding of user clicks. Note that we do not train a late-fusion model; standalone late-fusion models show inferior performance compared to their early-fusion counterparts [guidedprop].

From Table 1, we observe that our trained network mostly converges within 16K iterations. For simplistic datasets such as MSRA10K and ECSSD, the vgg-baseline without user click transformation compares favourably with the approach of [tapnshoot] at the same training resolution of . resnet-baseline models trained with images significantly outperform [tapnshoot], reporting absolute mIoU gains of up to across the datasets. Based on this result alone, we conclude that one-click (and standard) interactive segmentation approaches should be benchmarked on more challenging datasets. Examples include PASCAL VOC 2012 and MS COCO, which feature cluttered scenes, multiple objects, occlusions and challenging lighting conditions (see Table 3).

Furthermore, with only the Gaussian transformation and ResNet-101 backbone trained on , we are able to achieve mIoU increase in the range of - across datasets at convergence w.r.t [tapnshoot]. Having the multi-stage fusion offers us absolute mIoU gains of - w.r.t the early fusion variant (resnet-early vs. resnet-multi when trained with images). Additionally, our resnet models require significantly less memory; MB (stored as -bit/-byte floating point numbers) instead of the MB required for the segmentation network of [tapnshoot].

4.3 Interactive image segmentation

Approaches in the literature [ifcn, itis, majumder19, twostream] are typically evaluated by (1) the average number of clicks needed to reach the desired level of segmentation ( mIoU for PASCAL VOC , MS COCO, mIoU for the less challenging GrabCut and Berkeley) and (2) the average mIoU vs. the number of clicks.

Figure 4: Examples of guidance maps. Given a click (shown as a green circle) on the object of interest, existing approaches transform it into a guidance map and use it as an additional input channel. For ease of visualization, inverted values for the disk guidance map and the Gaussian guidance map are shown in the image.

The first criterion is primarily geared towards annotation tasks [itis, majumder19] where high-quality segments are desired for each instance in the scene; the fewer the number of clicks, the lower the annotation effort. In this work, we are concerned primarily with achieving high-quality segments for the object of interest given only a single click. Accordingly, given a single user click, we report the average mIoU across all instances for the GrabCut, Berkeley and the PASCAL VOC val dataset. For MS COCO object instances, following [ifcn], we split the dataset into the PASCAL VOC categories and the additional categories, and randomly sample images per category for evaluation. We also report the average mIoU across the sampled MS COCO instances [twostream].

For training [ifcn, majumder19, twostream], we use the ground truth masks of object instances from PASCAL VOC  [pascal] train set with additional masks from Semantic Boundaries Dataset (SBD) [sbd] resulting in images. Note that unlike [itis], we do not use the training instances from MS COCO.

Ablation Study. We perform extensive ablation studies to thoroughly analyze the effectiveness of the individual components of our one-click segmentation framework. First, to validate our choice of guidance maps, we consider the user click transformations commonly used in existing interactive segmentation algorithms - Euclidean distance maps [ifcn, twostream], Gaussian distance maps [itis] and disk [benenson19]. Fig. 4 shows examples of such guidance maps. For each kind of guidance map, we train separate networks to understand the impact of different user click transformations. For evaluation, we report the average mIoU over all instances in the dataset, given a single click (see Table 2). Next, we study the impact of our proposed late-fusion module (denoted by -multi in Table 2); we observe an average mIoU improvement of around across different datasets.

Guidance map        GrabCut   Berkeley   VOC12    COCO-20   COCO-60
Euclidean [ifcn]    82.6      82.7       75.1     63.2      46.8
Disk [benenson19]   84.5      81.3       74.5     65.3      51.5
Gaussian [itis]     84.0      82.9       78.1     64.2      49.8
Gaussian-multi      86.2()    84.0()     80.8()   64.5()    52.3()
Table 2: User Click Transformation. The best results are indicated in bold. COCO-20 and COCO-60 refer to the instances from the MS COCO categories that do and do not overlap with the PASCAL VOC categories, respectively.

One-click segmentation. We compare the segmentation performance of our method with existing interactive instance segmentation approaches (see Table 3). The approaches are grouped separately into different categories - pre-deep learning approaches, deep learning-based interactive instance segmentation approaches and tap-and-shoot segmentation approaches. From Table 3, we observe that our approach outperforms the classical interactive segmentation works by a significant margin, reporting absolute improvement in average mIoU. We also outperform existing state-of-the-art interactive instance segmentation approaches [majumder19, itis] by a considerable margin (). Additionally, we report an absolute mIoU improvement of and on GrabCut and Berkeley over the tap-and-shoot segmentation framework of [tapnshoot]. We show qualitative results to demonstrate the effectiveness of our proposed algorithm (see Fig. 5). The resulting segmentations demonstrate that our approach is highly effective for the one-click segmentation paradigm.

5 User Study

Across existing state-of-the-art interactive frameworks [ifcn, itis, majumder19], user clicks are simulated following the protocols established in [ifcn, itis]. For our user study, we consult participants uninitiated to the task of interactive segmentation. We prepare a toy dataset with object instances from the MSRA10K [msra] dataset. We presented the image with the segmentation mask for the target object overlaid on the image and asked the users to provide their click.

Method                 Network                  GrabCut   Berkeley   VOC12    COCO-20   COCO-60
GC [graphcuts]         -                        41.7      33.8       27.7     -         8.9
GM [geodesicmatting]   -                        23.7      24.5       23.8     -         22.1
GD [geodesic]          -                        48.8      36.1       31.0     -         25.2
iFCN [ifcn]            FCN-8s [fcn]             62.9      61.3       53.6     42.9
ITIS [itis]            DeepLabv3+ [deeplabv3]   82.1      -          71.0     -         -
CAG [majumder19]       FCN-8s [fcn]             83.2      -          74.0     -         -
TS [twostream]         FCN-8s [fcn]             77.7      74.5       62.3     42.5      42.5
TNS [tapnshoot]        FCN-8s [fcn]             79.0      67.0       -        -         -
Ours-best              DeepLabv2 [deeplabv2]    86.2()    84.0()     80.8()   64.5()    52.3()
Table 3: Average mIoU given a single click. The approaches are grouped separately into different categories - pre-deep learning approaches, deep learning-based interactive instance segmentation approaches and tap-and-shoot segmentation approaches respectively. For GC[graphcuts], GM[geodesicmatting], GD[geodesic], and iFCN[ifcn] we make use of the values provided by the authors of iFCN[ifcn]. The mIoU improvement (in %) over existing state-of-the-art approaches is indicated using .

During training the object selection stage, we applied random perturbations of pixels to the center of mass of the object instance to obtain the final user click. Our user study found that participants placed clicks at a mean distance of pixels from the center of the mask with a standard deviation of pixels. This result validates our assumption that users are more likely to click in the vicinity of the object’s center-of-mass. It also supports our click sampling scheme for generating training instances when training the object selection stage. On average, we observed that users took seconds with a standard deviation of 0.8 seconds to position their click.

6 Conclusion

In this work, we propose a one-click segmentation framework that produces high-quality segmentation masks. We validated our design choices through detailed ablation studies; we observed that having a multi-stage module improves the segmentation framework and gives the network an edge over its early-fusion variants. Via experiments, we observed that for the single click scenario, our proposed approach significantly outperforms existing state-of-the-art approaches - including the more complicated interactive instance segmentation models using state-of-the-art segmentation models [deeplabv3].

Figure 5: Qualitative Results. Incorporating the user clicks at different stages of the network leads to an improvement in the quality of masks generated (second row) w.r.t the early-fusion variants (first row). Click locations are shown in green circles. The extreme right column shows a scenario where both the networks failed to generate a satisfactory mask.

However, we observe existing tap-and-shoot segmentation frameworks [tapnshoot], including our proposed framework, are limited by their inability to learn from negative clicks [ifcn, majumder19, itis]. One major drawback of such a training scenario is that the network does not have a notion of corrective clicking; if the generated segmentation mask extends beyond the object boundaries, it cannot rectify this mistake. Clicking on locations outside the object can mitigate this effect, though this then deviates from tap-and-shoot interaction.

Acknowledgment. This work was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0001] and NUS Startup Grant R-252-000-A40-133.