1 Introduction
The widespread availability of smartphones has made taking photos easier than ever. In a typical image capturing scenario, the user taps the device touchscreen to focus on the object of interest. This tap directly locates the object in the scene and can be leveraged for segmentation. The generated segmentations are implicit, yet they are directly applicable to downstream photo applications, such as simulated ‘bokeh’ or other special-effects filters like background blur (see Fig. 1). In this work, we tackle “tap-and-shoot segmentation” [tapnshoot], a special case of interactive instance segmentation.
Interactive segmentation leverages inputs such as clicks, scribbles, or bounding boxes to help segment objects from the background down to the pixel level. Two key differences distinguish tap-and-shoot segmentation from standard interactive segmentation. First, tap-and-shoot uses only “positive” clicks marking foreground, as we assume that the user clicks (only) on the object of interest during the capture process. Standard interactive segmentation uses both positive and negative clicks [ifcn, itis, majumder19] to indicate, respectively, the object of interest versus the background or non-relevant foreground objects. Second, tap-and-shoot has a strong focus on maximizing the mean intersection over union (mIoU) with a single click because the target application is casual photography. In contrast, standard interactive segmentation tries to achieve some threshold mIoU (e.g. 85%) while minimizing the total number of clicks.
This second distinction is subtle but critical for designing and learning tap-and-shoot segmentation frameworks. Our finding is that existing approaches fare poorly with only one or two clicks – they are simply not trained to maximize performance under such settings. To make the most of the first (few) click(s), we hypothesize that the user's guidance should be fused into the network at multiple locations rather than only via early fusion. Just as gradients vanish towards the initial layers during back-propagation, input signals also diminish as they propagate forward through the network. The many layers of deep CNNs further exacerbate this effect [twostream, park2019]. A late fusion would allow the user interaction to have a direct and more pronounced effect on the final segmentation mask. To this end, we propose an interactive segmentation framework with multi-stage fusion and demonstrate its advantages over the common early-fusion frameworks and other alternatives. Specifically, we propose a light-weight fusion block that encodes the user click transformation and allows a shorter connection from the user inputs to the final segmentation layer.
Most similar in spirit to our framework are [twostream] and [guidedprop]. These two works also propose alternatives to early fusion but are extremely parameter-heavy. For example, [twostream] uses two dedicated VGG [vgg] networks to extract features from the image and the user interactions separately before fusing them into a final instance segmentation mask (see Fig. 2(c)). [guidedprop] uses a single stream but applies a simple late fusion of element-wise multiplication on the feature maps (see Fig. 2(b)). It therefore has separate ‘positive’ and ‘negative’ feature maps, and the number of weights for the following layer increases by a factor of . For VGG, this doubles the parameters of the ensuing ‘fc6’ layer from to million. Compared to [guidedprop], our last-stage fusion approach is light-weight and uses less than more trainable parameters.
Our contributions are summarized as follows:
- We propose a novel one-click interactive segmentation framework that fuses user guidance at different network stages.
- We demonstrate that multi-stage fusion is highly beneficial for propagating guidance and increasing the mIoU, since it allows user interaction to have a more direct impact on the final segmentation.
- Comprehensive experiments on six benchmarks show that our approach significantly outperforms the existing state-of-the-art for both tap-and-shoot and standard interactive instance segmentation.

Figure 1: With early fusion, the user click (here placed on the dog) is leveraged only at the input layer and its influence diminishes through the layers. This can result in unsatisfactory image effects, e.g. portions of the dog’s elbow and ear are wrongly classified as background and are mistakenly blurred (shown in enlarged red boxes). Our proposed multi-stage fusion allows the user click to have a more direct effect, leading to improved segmentation quality (shown in enlarged green boxes).
2 Related Works
As an essential building block of image/video editing applications, interactive segmentation dates back decades [scissors]. The latest methods [itis, ifcn, majumder19, twostream, guidedprop] integrate deep architectures such as FCN-8s [fcn] or DeepLab [deeplabv3, deeplabv2]. Most of these approaches integrate user cues at the input stage: the clicks are transformed into ‘guidance’ maps and appended to the three-channel colour image before being passed through a CNN [itis, ifcn, majumder19].
Early Interactive Instance Segmentation approaches used graph-cuts [graphcuts, grabcut], geodesics, or a combination of the two [geodesic]. These methods’ performance is limited as they separate the foreground and background based on low-level colour and texture features. Consequently, for scenes where foreground and background are similar in appearance, or where lighting and contrast are poor, more labelling effort is required from the users to achieve good segmentations [ifcn]. Recently, deep convolutional neural networks [fcn, deeplabv3] have been incorporated into interactive segmentation frameworks. Initially, [ifcn] used Euclidean distance-based guidance maps to represent user-provided clicks; these maps are passed along with the input RGB image through a fully convolutional network.
Subsequent works made extensions with newer CNN architectures [itis], iterative training procedures [itis] and structure-aware guidance maps [majumder19]. These works share a structural similarity: the guidance maps are concatenated with the RGB image as additional channels at the first (input) layer. We refer to this form of structure as early fusion (see Fig. 2(a)). Architecture-wise, early fusion is simple and easy to train; however, the influence of the user inputs diminishes through the layers.
Tap-and-Shoot Segmentation was introduced by [tapnshoot] and refers to the one-click interactive setting. One assumes that during image capture, the user taps the touchscreen (once) on the foreground object of interest, from which one can directly segment that object. [tapnshoot] uses early fusion; it transforms the user tap into a guidance map via two shortest-path minimizations and then concatenates the map to the input image. The authors validate only on simple datasets such as ECSSD [ecssd] and MSRA10K [msra], where the images contain a single dominant foreground object. As we show later in our benchmarks (see Table 1), these datasets are so simplistic that properly trained networks with no user input can also generate high-quality segmentation masks that are comparable to or even surpass the results reported by [tapnshoot].
Feature Fusion in Deep Architectures is an efficient way to leverage complementary information, either from different modalities [temporal] or from different levels of abstraction [latematting]. Element-wise multiplication [guidedprop] and addition [twostream, nuclei] are two common operations applied for fusing multiple channels. Other strategies include ‘skip’ connections [fcn], where features from earlier layers are concatenated with features extracted from deeper layers. Recently, a few interactive instance segmentation works have begun exploring alternatives to the early-fusion paradigm for integrating user guidance [twostream, guidedprop]. However, these approaches have a heavy computational footprint, as they increase the number of parameters to be learned by on the order of hundreds of millions [guidedprop]. Dilution of input information is commonplace in deep CNNs, as the input is processed through several blocks of convolutions [park2019]. Feature fusion helps preserve input information by reducing the layers of abstraction between the user interaction and the segmentation output.
3 Proposed Method
3.1 Overview
We follow the conventional paradigm of [ifcn, itis, majumder19] in which ‘positive’ and ‘negative’ user clicks are transformed into ‘guidance’ maps of the same size as the input image. Unlike [ifcn, itis, majumder19], we work within the one-click setting. The user provides a single ‘positive’ click on the object of interest; this click is then encoded into a single-channel guidance map (see Sec. 3.3). We then feed the three-channel RGB image, with the guidance map as an additional channel, into a fully convolutional network. Fig. 3(a) shows an overview of our pipeline. Typically these FCNs are fine-tuned versions of semantic segmentation networks such as FCN-8s [fcn] or DeepLab [deeplabv2].
For our base segmentation network, we use DeepLab-v2 [deeplabv2]; it consists of a ResNet-101 [resnet] feature extraction backbone and a Pyramid Scene Parsing (PSP) module [psp] acting as the prediction head. Upon receiving the input of size , the ResNet-101 backbone generates feature maps of dimension (Fig. 3(a)).
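To make the data flow of Fig. 3(a) concrete, the sketch below outlines the forward pass in PyTorch-style pseudocode. The names `backbone`, `fusion_block` and `psp_head` are hypothetical stand-ins for the ResNet-101 feature extractor, the fusion module of Sec. 3.2 and the PSP prediction head; exact tensor shapes follow the configuration described in the text.

```python
import torch
import torch.nn.functional as F

def segment(image, guidance, backbone, fusion_block, psp_head):
    """image: (B, 3, H, W) RGB tensor; guidance: (B, 1, H, W) click encoding."""
    x = torch.cat([image, guidance], dim=1)      # early-fusion input: 4 channels
    deep_feats = backbone(x)                     # low-resolution backbone features
    guide_feats = fusion_block(x)                # shorter path from the user click
    # assumes both feature maps share the same spatial resolution
    fused = torch.cat([deep_feats, guide_feats], dim=1)
    logits = psp_head(fused)                     # pixel-wise fg/bg prediction
    probs = torch.sigmoid(logits)
    # up-sample the prediction back to the input resolution
    return F.interpolate(probs, size=image.shape[-2:], mode='bilinear',
                         align_corners=False)
```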
3.2 Multi-stage fusion

The fusion module consists of Squeeze-and-Excitation residual blocks (SE-ResNet) [senet]. Proposed in [senet], SE-ResNet blocks have been shown to be effective for a variety of vision tasks such as image classification on ImageNet [imagenet] and object detection on MS COCO [mscoco]. SE-ResNet blocks incur minimal additional computational overhead, as they consist of two convolutional layers, two inexpensive fully connected layers and a channel-wise scaling operation. Each SE-ResNet block consists of a residual block, a squeeze operation which produces a channel descriptor by aggregating feature maps across their spatial dimensions, a dimensionality-reduction layer (by reduction ratio r) and an excitation operation which captures the channel interdependencies. The individual components of the SE-ResNet block are shown in Fig. 3(b). The residual block consists of two convolutions, batch normalization, and a ReLU non-linearity (Fig. 3(c)). We fix the number of filter banks to be for each of the convolutions. The reduction ratio r is kept at 16 [senet]. The input to the fusion block is a feature map obtained by processing the input with a convolution with stride 2, batch normalization, a ReLU non-linearity and a max-pooling operation with stride 2 (Init block, Fig. 3(a)). The final SE-ResNet block downsamples its input to generate a feature map, which is concatenated with the feature map obtained from the feature extraction backbone. On top of these concatenated feature maps, the PSP module performs pooling operations at different grid scales to gather a global contextual prior, leading to feature maps of dimensions . The multi-scale feature pooling of PSP [psp] enables the network to capture objects occurring at different image scales. Pixel-wise foreground-background classification is performed on these down-sampled feature maps. The network outputs a probability map representing whether each pixel belongs to the object of interest or not. Bi-linear interpolation is performed to up-sample the predicted probability map to the same dimensions as the original input image.
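As an illustration of the fusion module's building block, the following is a minimal PyTorch sketch of a generic SE-ResNet block with reduction ratio 16. The channel count is left as a parameter, since the exact filter-bank sizes follow the configuration above; this is a sketch rather than our exact implementation.

```python
import torch
import torch.nn as nn

class SEResNetBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # residual branch: two 3x3 convolutions with batch norm and ReLU
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # squeeze: global average pooling over the spatial dimensions
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # excitation: two small FC layers capturing channel interdependencies
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        res = self.residual(x)
        b, c, _, _ = res.shape
        w = self.excite(self.squeeze(res).view(b, c)).view(b, c, 1, 1)
        return torch.relu(x + res * w)   # channel-wise re-scaling + skip connection
```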
3.3 Transforming user click
In interactive approaches, the pixel values of the guidance map are defined as a function of their distance on the image grid to the points of user interaction (Eqn. 1). This includes Euclidean [ifcn, twostream] and Gaussian guidance maps [itis]. For each pixel position $p$ on the image grid, the pair of distance-based guidance maps for positive clicks ($\mathcal{S}^{+}$) and negative clicks ($\mathcal{S}^{-}$) can be computed as

$\mathcal{G}^{t}(p) = \min_{q \in \mathcal{S}^{t}} f(p, q), \quad t \in \{+, -\}.$   (1)

For Euclidean guidance maps [ifcn], the function $f$ is the Euclidean distance. For Gaussian guidance maps, the ‘min’ is replaced by a ‘max’ operator. A more recent approach advocated taking image structures such as super-pixels and region-based object proposals into consideration to generate guidance maps [majumder19]. To generate our guidance maps, we use the Gaussian transformation [itis], as it offers a favourable trade-off between simplicity and performance. We initialize an image-sized all-zero channel and place a Gaussian with a standard deviation of pixels at the user click location. Note that we do not use ‘negative’ clicks in our framework.
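A minimal NumPy sketch of this Gaussian click encoding is given below; the standard deviation `sigma` is an illustrative placeholder rather than the value used in our experiments.

```python
import numpy as np

def gaussian_guidance(height, width, click_yx, sigma=10.0):
    """Place an unnormalized Gaussian bump at the (y, x) click location."""
    yy, xx = np.mgrid[0:height, 0:width]
    cy, cx = click_yx
    dist_sq = (yy - cy) ** 2 + (xx - cx) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2)).astype(np.float32)
```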
3.4 Implementation Details
Network Optimization. We train the network to minimize the class-balanced binary cross-entropy loss,

$\mathcal{L} = \frac{1}{N} \sum_{j=1}^{N} w_{y_j}\,\mathrm{BCE}(y_j, \hat{y}_j),$   (2)

where $N$ is the number of pixels in the image and $\mathrm{BCE}(y_j, \hat{y}_j)$ is the standard cross-entropy loss between the label $y_j$ and the prediction $\hat{y}_j$ at pixel location $j$, given by

$\mathrm{BCE}(y_j, \hat{y}_j) = -\left[\, y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \,\right].$   (3)

$w_{y_j}$ is the inverse normalized frequency of labels $y_j \in \{0, 1\}$ within the mini-batch.
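For reference, a possible PyTorch realization of the loss in Eqns. (2)-(3) is sketched below; the exact weighting scheme is an assumption based on the description of $w_{y_j}$ as the inverse normalized label frequency within the mini-batch.

```python
import torch
import torch.nn.functional as F

def class_balanced_bce(pred, target, eps=1e-6):
    """pred: predicted foreground probabilities in [0, 1]; target: binary mask."""
    target = target.float()
    pos_freq = target.mean().clamp(min=eps)        # fraction of foreground pixels
    neg_freq = (1.0 - pos_freq).clamp(min=eps)     # fraction of background pixels
    # inverse normalized frequency: the rarer class receives the larger weight
    weights = torch.where(target > 0.5, 1.0 / pos_freq, 1.0 / neg_freq)
    bce = F.binary_cross_entropy(pred, target, reduction='none')
    return (weights * bce).mean()
```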
We optimize using mini-batch SGD with Nesterov momentum (with the default value of ) and a batch size of 5. The learning rate is fixed at across all epochs and the weight decay is . For the ResNet-101 backbone, we initialize the network weights from a model pre-trained on ImageNet [imagenet]. During training, we first update the early-fusion skeleton for - epochs. Next, we freeze the weights of the early-fusion model and train the late-fusion weights for - epochs. Finally, we train the joint network for another epochs.
Simulating user clicks. Manually collecting user interactions is an expensive and arduous process [benenson19]. In a similar vein to [tapnshoot] and other interactive segmentation frameworks [majumder19, ifcn, itis], we simulate user interactions to train and evaluate our method. During training, we use the ground truth masks of the object instances from the MSRA10K dataset. To initialize, we take the center of mass of the ground truth mask as our user click location; we then randomly jitter the click location by pixels. The clicked pixel location is constrained to the confines of the object's ground truth mask. The random perturbation introduces variation in the training data and also allows a better approximation of true user interactions.
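The click simulation can be sketched as follows in NumPy; the jitter magnitude `max_jitter` is illustrative (the actual range is given above), and snapping to the nearest foreground pixel is one possible way to keep the click within the object mask.

```python
import numpy as np

def simulate_click(mask, max_jitter=30, rng=np.random):
    """Sample a click near the center of mass of a binary ground-truth mask."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                    # center of mass
    cy += rng.randint(-max_jitter, max_jitter + 1)   # random perturbation
    cx += rng.randint(-max_jitter, max_jitter + 1)
    # constrain the click to the object: snap to the nearest foreground pixel
    d = (ys - cy) ** 2 + (xs - cx) ** 2
    return int(ys[d.argmin()]), int(xs[d.argmin()])
```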
4 Experimental Validation
4.1 Datasets
We evaluate on six publicly available datasets commonly used to benchmark interactive image segmentation [tapnshoot, ifcn, itis, majumder19]: MSRA10K [msra], ECSSD [ecssd], GrabCut [grabcut], Berkeley [berkeley], PASCAL VOC 2012 [pascal] and MS COCO [mscoco]. Following existing works [fcn, ifcn, tapnshoot, itis, majumder19], we use the mean intersection over union (mIoU) of the foreground w.r.t. the ground truth object mask, averaged across all instances, to evaluate segmentation accuracy.
MSRA10K has natural images; the images are characterized by variety in the foreground objects whilst the background is relatively homogeneous. The extended complex scene saliency dataset (ECSSD) is a dataset of natural images with structurally complex backgrounds. GrabCut is a dataset consisting of images with typically a distinct foreground object; it is a popular dataset for benchmarking interactive instance segmentation algorithms. The Berkeley dataset consists of natural images. PASCAL VOC 2012 consists of training and validation images across different object classes; many images contain multiple objects. MS COCO is a challenging large-scale image segmentation dataset with different object categories, of which are common with the PASCAL VOC categories.
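For completeness, a minimal sketch of the per-instance IoU underlying the mIoU metric above (NumPy, binary masks assumed):

```python
import numpy as np

def instance_iou(pred_mask, gt_mask):
    """Intersection over union of the predicted foreground vs. the ground truth."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # both masks empty
    return np.logical_and(pred, gt).sum() / union

# mIoU is the mean of instance_iou over all object instances in a dataset
```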
4.2 Tap-and-Shoot Segmentation
Table 1: Tap-and-shoot segmentation results. Each cell reports mIoU after 16K training iterations / at convergence.

Method | res | GrabCut [grabcut] | Berkeley [berkeley] | ECSSD [ecssd] | MSRA10K [msra]
---|---|---|---|---|---
TNS [tapnshoot] | 256 | 72.3 / 79.0 | 55.7 / 67.0 | 70.3 / 76.0 | 81.1 / 85.0
vgg-baseline | 256 | 73.5 / 77.4 | 58.2 / 63.2 | 71.2 / 72.3 | 83.4 / 86.2
vgg-early | 256 | 76.2 / 80.1 | 62.8 / 65.3 | 74.8 / 76.5 | 87.1 / 87.5
resnet-baseline | 256 | 81.6 / 83.0 | 68.5 / 68.2 | 80.2 / 82.0 | 86.4 / 86.9
resnet-early | 256 | 83.3 / 84.3 | 75.0 / 75.3 | 82.0 / 83.6 | 88.6 / 89.6
resnet-multi | 256 | 84.1 / 85.7 | 75.1 / 78.4 | 81.9 / 85.2 | 91.5 / 92.1
resnet-baseline | 512 | 76.1 / 79.0 | 65.5 / 68.3 | 79.9 / 82.6 | 87.0 / 87.9
resnet-early | 512 | 82.9 / 84.5 | 76.2 / 78.1 | 85.6 / 85.7 | 91.5 / 91.4
resnet-multi | 512 | 83.1 / 86.2 | 80.1 / 81.3 | 86.8 / 87.1 | 92.5 / 93.1
Following [tapnshoot], we use MSRA10K [msra] for training and partition the dataset into three non-overlapping subsets of , and images as our training, validation and test sets. In Table 1, we report the mIoU after training for 16K iterations and again after network convergence (at 43K iterations for us, vs. 260K iterations in [tapnshoot]). During training, we resize the images to pixels; this choice of resolution is driven primarily by matching the resolution of the training images for the ResNet-101 backbone [resnet].
The -baseline models are trained using only the three-channel RGB image and the instance ground truth mask, without any user click transformation. The -early models use Gaussian guidance maps [itis]; the network input is the three-channel RGB image plus a Gaussian encoding of the user’s tap on the object of interest (Fig. 2(a)). The -multi models refer to the multi-stage fusion models with Gaussian encoding of user clicks. Note that we do not train a late-fusion-only model; standalone late-fusion models show inferior performance compared to their early-fusion counterparts [guidedprop].
From Table 1, we observe that our trained network mostly converges within 16K iterations. For simplistic datasets such as MSRA10K and ECSSD, the vgg-baseline without any user click transformation compares favourably with the approach of [tapnshoot] at the same training resolution of . The resnet-baseline models trained with images significantly outperform [tapnshoot], reporting absolute mIoU gains of to across the datasets. Based on this result alone, we conclude that one-click (and standard) interactive segmentation approaches should be benchmarked on more challenging datasets. Examples include PASCAL VOC 2012 and MS COCO, which feature cluttered scenes, multiple objects, occlusions and challenging lighting conditions (see Table 3).
Furthermore, with only the Gaussian transformation and a ResNet-101 backbone trained on images, we achieve mIoU increases in the range of - across datasets at convergence w.r.t. [tapnshoot]. Adding the multi-stage fusion offers absolute mIoU gains of - w.r.t. the early-fusion variant (resnet-early vs. resnet-multi when trained with images). Additionally, our resnet models require significantly less memory: MB (stored as -bit/-byte floating-point numbers) instead of the MB required for the segmentation network of [tapnshoot].
4.3 Interactive image segmentation
Approaches in the literature [ifcn, itis, majumder19, twostream] are typically evaluated by (1) the average number of clicks needed to reach a desired level of segmentation quality ( mIoU for PASCAL VOC and MS COCO, mIoU for the less challenging GrabCut and Berkeley) and (2) the average mIoU vs. the number of clicks.

The first criterion is primarily geared towards annotation tasks [itis, majumder19], where high-quality segments are desired for each instance in the scene; the fewer the clicks, the lower the annotation effort. In this work, we are concerned primarily with achieving high-quality segments for the object of interest given only a single click. Accordingly, given a single user click, we report the average mIoU across all instances for the GrabCut, Berkeley and PASCAL VOC val datasets. For MS COCO object instances, following [ifcn], we split the dataset into the PASCAL VOC categories and the additional categories, and randomly sample images per category for evaluation. We also report the average mIoU across the sampled MS COCO instances [twostream].
For training, as in [ifcn, majumder19, twostream], we use the ground truth masks of object instances from the PASCAL VOC [pascal] train set with additional masks from the Semantic Boundaries Dataset (SBD) [sbd], resulting in images. Note that, unlike [itis], we do not use training instances from MS COCO.
Ablation Study. We perform extensive ablation studies to analyze the effectiveness of the individual components of our one-click segmentation framework. First, to validate our choice of guidance maps, we consider the user click transformations commonly used in existing interactive segmentation algorithms: Euclidean distance maps [ifcn, twostream], Gaussian distance maps [itis] and disk maps [benenson19]. Fig. 4 shows examples of such guidance maps. For each kind of guidance map, we train a separate network to understand the impact of the different user click transformations. For evaluation, we report the average mIoU over all instances in the dataset, given a single click (see Table 2). Next, we study the impact of our proposed late-fusion module (denoted by -multi in Table 2); we observe an average mIoU improvement of around across the different datasets.
Table 2: Ablation over user click transformations: average mIoU given a single click.

Guidance map | GrabCut | Berkeley | VOC12 | COCO-20 | COCO-60
---|---|---|---|---|---
Euclidean [ifcn] | 82.6 | 82.7 | 75.1 | 63.2 | 46.8
Disk [benenson19] | 84.5 | 81.3 | 74.5 | 65.3 | 51.5
Gaussian [itis] | 84.0 | 82.9 | 78.1 | 64.2 | 49.8
Gaussian-multi | 86.2() | 84.0() | 80.8() | 64.5() | 52.3()
One-click segmentation. We compare the segmentation performance of our method with existing interactive instance segmentation approaches (see Table 3). The approaches are grouped into different categories: pre-deep-learning approaches, deep learning-based interactive instance segmentation approaches and tap-and-shoot segmentation approaches. From Table 3, we observe that our approach outperforms the classical interactive segmentation works by a significant margin, reporting an absolute improvement of in average mIoU. We also outperform existing state-of-the-art interactive instance segmentation approaches [majumder19, itis] by a considerable margin (). Additionally, we report absolute mIoU improvements of and on GrabCut and Berkeley over the tap-and-shoot segmentation framework of [tapnshoot]. We show qualitative results to demonstrate the effectiveness of our proposed algorithm (see Fig. 5). The resulting segmentations demonstrate that our approach is highly effective in the one-click segmentation paradigm.
5 User Study
Across existing state-of-the-art interactive frameworks [ifcn, itis, majumder19], user clicks are simulated following the protocols established in [ifcn, itis]. For our user study, we consulted participants uninitiated to the task of interactive segmentation. We prepared a toy dataset with object instances from the MSRA10K [msra] dataset. We presented each image with the segmentation mask of the target object overlaid on it and asked the users to provide their click.
Table 3: Comparison with existing approaches: average mIoU given a single click.

Method | Network | GrabCut | Berkeley | VOC12 | COCO-20 | COCO-60
---|---|---|---|---|---|---
GC [graphcuts] | - | 41.7 | 33.8 | 27.7 | - | 8.9
GM [geodesicmatting] | - | 23.7 | 24.5 | 23.8 | - | 22.1
GD [geodesic] | - | 48.8 | 36.1 | 31.0 | - | 25.2
iFCN [ifcn] | FCN-8s [fcn] | 62.9 | 61.3 | 53.6 | 42.9 |
ITIS [itis] | DeepLabv3+ [deeplabv3] | 82.1 | - | 71.0 | - | -
CAG [majumder19] | FCN-8s [fcn] | 83.2 | - | 74.0 | - | -
TS [twostream] | FCN-8s [fcn] | 77.7 | 74.5 | 62.3 | 42.5 | 42.5
TNS [tapnshoot] | FCN-8s [fcn] | 79.0 | 67.0 | - | - | -
Ours-best | DeepLabv2 [deeplabv2] | 86.2() | 84.0() | 80.8() | 64.5() | 52.3()
When training the object selection stage, we applied random perturbations of pixels to the center of mass of the object instance to obtain the final user click. Our user study found that participants placed clicks at a mean distance of pixels from the center of the mask, with a standard deviation of pixels. This result validates our assumption that users are more likely to click in the vicinity of the object’s center of mass, and it supports our click-sampling scheme for generating training instances. On average, users took seconds, with a standard deviation of 0.8 seconds, to position their click.
6 Conclusion
In this work, we propose a one-click segmentation framework that produces high-quality segmentation masks. We validated our design choices through detailed ablation studies and observed that the multi-stage fusion module improves the segmentation framework, giving the network an edge over its early-fusion variants. Through experiments, we observed that in the single-click scenario our proposed approach significantly outperforms existing state-of-the-art approaches, including more complicated interactive instance segmentation models built on state-of-the-art segmentation networks [deeplabv3].

However, we observe that existing tap-and-shoot segmentation frameworks [tapnshoot], including our proposed framework, are limited by their inability to learn from negative clicks [ifcn, majumder19, itis]. One major drawback of such a training scenario is that the network has no notion of corrective clicking; if the generated segmentation mask extends beyond the object boundaries, it cannot rectify this mistake. Clicking on locations outside the object could mitigate this effect, though doing so deviates from tap-and-shoot interaction.
Acknowledgment. This work was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0001] and NUS Startup Grant R-252-000-A40-133.