Interactive Object Segmentation with Dynamic Click Transform

by Chun-Tse Lin, et al.

In interactive segmentation, users initially click on the target object to segment the main body and then provide corrections on mislabeled regions to iteratively refine the segmentation mask. Most existing methods transform these user-provided clicks into interaction maps and concatenate them with the image as the input tensor. Typically, the interaction maps are determined by measuring the distance of each pixel to the clicked points, ignoring the relation between clicks and mislabeled regions. We propose a Dynamic Click Transform Network (DCT-Net), consisting of Spatial-DCT and Feature-DCT, to better represent user interactions. Spatial-DCT transforms each user-provided click with an individual diffusion distance according to the target scale, and Feature-DCT normalizes the extracted feature map to a specific distribution predicted from the clicked points. We demonstrate the effectiveness of the proposed method and achieve favorable performance compared to the state of the art on three standard benchmark datasets.




1 Introduction

Interactive segmentation, also known as interactive object selection, aims to segment the object of interest and refine the segmentation mask with a human in the loop. The segmentation results are useful for many applications, such as video editing, medical image analysis, and especially human-machine collaborative annotation. Because the demand for fine-grained image annotations has increased dramatically with the development of data-driven deep learning methods, an efficient interactive segmentation method is needed to alleviate the burden of manually labeling each pixel in an image.

Among the interactive segmentation scenarios, user interactions are usually given through bounding boxes [18], clicks [21, 8, 9, 10], or scribbles [5, 6]. A box-based interface lets the user indicate the target by drawing a bounding box that covers the entire object. However, some background pixels are included at the same time, which makes the user's intention imprecise. In contrast, a click-based interface provides a precise location but no information about the object or region size. Although scribbles provide precise and rich information, drawing them places a much heavier burden on the user than clicking points.

Given those user inputs, classical approaches [2, 18, 1, 5] formulate the segmentation process as a graph-based optimization problem. Inspired by the success of fully convolutional neural networks (FCNs) [12] on semantic segmentation, Xu et al. [21] first proposed a deep learning-based interactive segmentation algorithm. They compute two additional distance maps representing the positive and negative clicks from the user and concatenate them with the input image to generate the desired foreground mask with an FCN model. Most later works follow this strategy but transform the user clicks into Euclidean distance maps [21, 10], Gaussian maps [9, 13], or multiple guidance maps [14]. To make further use of the user clicks, BRS [8] and f-BRS [19] proposed a backpropagating refinement scheme that adjusts the original input click maps by forcing the interaction points to have the correct predicted labels. However, these methods regard all clicks as equally important and transform them with an identical function, discarding the relation between the clicks and the target object. Moreover, the backpropagation-based methods need to iteratively minimize a predefined energy function through backpropagation, which adds computation and increases the inference time.

Figure 1: Overview of the Dynamic Click Transform Network. Spatial-DCT dynamically encodes user interactions into distance maps, and Feature-DCT scales and shifts the original feature for better prediction.

In this paper, we start by adopting a click-based interaction first proposed in [17], called Click-and-Drag. It adds a drag action to each click, which imposes almost no extra burden on users. This interaction scheme combines the advantages of clicks and bounding boxes: it gives a precise location and also carries object scale information. Then, we propose a Dynamic Click Transform Network (DCT-Net), which contains two components, a Spatial Dynamic Click Transform (Spatial-DCT) and a Feature Dynamic Click Transform (Feature-DCT). This network takes both spatial geometry and feature distribution into consideration to make good use of the click-and-drag interaction. Spatial-DCT transforms each user click into a 2D map by applying an individual Gaussian mask that is dynamically determined by the object scale. Compared to the identical transform used in most previous works, our approach is more robust to objects of different scales. Feature-DCT further uses the user clicks in the feature domain by refining the whole feature distribution of the input image according to the feature at the clicked position. With this operation, the feature changes dynamically at each interaction, which helps the network focus on the mislabeled parts. The main contributions of this paper are:

  • We adopt a Click-and-Drag interaction, which can take advantage of both click and bounding box.

  • We propose a Spatial Dynamic Click Transform to encode both the object scale and the size of the region to be refined into the distance maps.

  • We propose a Feature Dynamic Click Transform to aggregate all clicked features and adjust the whole image feature to distinguish pixels belonging to the object of interest.

2 Proposed Method

The architecture overview is illustrated in Fig. 1. The proposed Dynamic Click Transform Network (DCT-Net) is based on an encoder-decoder architecture with spatial pyramid pooling (SPP). We transform the user clicks not only in the spatial domain but also in the high-dimensional feature domain by using the proposed Spatial-DCT and Feature-DCT, respectively. In the Spatial-DCT, we encode each click with a Gaussian mask that has an individual diffusion radius, determined from the target region size. The Feature-DCT is performed by scaling and shifting the feature extracted from the input image, resulting in a better distribution for separating the target object.

2.1 Spatial Dynamic Click Transform

Most previous works, which use either a Euclidean distance transform or a Gaussian transform, regard all clicks as equally important. A sequence of user interactions consists of a positive click set and a negative click set, and the clicks are encoded into two distance maps, one for the positive clicks and one for the negative clicks. More formally, the value of each map at a given pixel location is obtained by applying a transform function, either the Euclidean distance or a Gaussian map, to that location and the clicks in the corresponding set. These distance maps only localize the user clicks and ignore the target object scale or the mislabeled region size, which can directly impact network performance. Instead, we take the relation between each click and the target object or mislabeled region into consideration and encode this information into the distance maps, so that the transform function takes an additional per-click argument.
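As an illustration only, the identical and the dynamic transform could be written as follows for the Gaussian case. All symbols are assumed here and are not the paper's own notation: $p$ is a pixel, $S^{+}$ and $S^{-}$ are the positive and negative click sets, $\sigma$ is a width shared by all clicks, and $r_c$ is the diffusion distance attached to click $c$. Combining several clicks by a pixel-wise maximum and using $r_c$ directly as the Gaussian width are likewise assumptions.

```latex
% Assumed notation (see the paragraph above); a sketch, not the authors' formula.
\begin{align}
  D^{\pm}(p) &= \max_{c \in S^{\pm}} f(p, c),
  & f(p, c) &= \exp\!\left(-\frac{\lVert p - c \rVert^{2}}{2\sigma^{2}}\right)
  && \text{(identical transform)} \\
  D^{\pm}(p) &= \max_{c \in S^{\pm}} f(p, c, r_c),
  & f(p, c, r_c) &= \exp\!\left(-\frac{\lVert p - c \rVert^{2}}{2 r_c^{2}}\right)
  && \text{(dynamic transform)}
\end{align}
```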


We take a Gaussian function as our transform function and dynamically change its diffusion distance through an extra per-click variable derived from the user clicks. The simplest and most precise way to get a proper value for this variable is to obtain it from the user interaction. As illustrated in Fig. 2, we adopt an interaction interface called Click-and-Drag. The user is asked to click on the center of the largest incorrect region and then drag outward until reaching the nearest boundary. The distance between the click position and the release position is recorded as the diffusion distance of this click. With the user-given drag, we can clearly understand the user's intention and know the size of the mislabeled region. In our experiments, to compare fairly with other click-based methods, we also propose an Auto-Drag-Head, a lightweight neural network that automatically predicts the diffusion distance. Even with this predicted value, we obtain better results than with an identical transform.
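The per-click Gaussian encoding can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gaussian_click_map`, the use of the drag length directly as the Gaussian standard deviation, and the pixel-wise maximum used to merge several clicks are all hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple

# (row, col, diffusion_distance); the third field plays the role of the drag
# length (or of the value predicted by an Auto-Drag-Head). Hypothetical layout.
Click = Tuple[int, int, float]


def gaussian_click_map(height: int, width: int,
                       clicks: Sequence[Click]) -> List[List[float]]:
    """Encode clicks as one 2D map, each click with its own Gaussian width."""
    out = [[0.0] * width for _ in range(height)]
    for row, col, radius in clicks:
        sigma = max(radius, 1e-6)  # assumption: diffusion distance acts as sigma
        for r in range(height):
            for c in range(width):
                d2 = (r - row) ** 2 + (c - col) ** 2
                value = math.exp(-d2 / (2.0 * sigma * sigma))
                # assumption: overlapping clicks are merged by a pixel-wise max
                if value > out[r][c]:
                    out[r][c] = value
    return out


# One map per click polarity, to be concatenated with the image channels.
positive_map = gaussian_click_map(5, 5, [(2, 2, 1.0)])
negative_map = gaussian_click_map(5, 5, [])
```

A larger drag yields a wider response around the click, which is how the map can carry the size of the region the user wants to correct.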

Figure 2: Click and Drag scheme.

2.2 Feature Dynamic Click Transform

From the user input clicks, we can gather more information in addition to the spatial correlation. In the Feature-DCT, we utilize the features extracted from the input image at the user-clicked positions, which are rarely used in existing methods. First, we gather the feature at the clicked position and feed it into a fully connected network to output a mean and a variance for each channel. Second, the original feature map is scaled and shifted by the predicted means and variances and then fed into the segmentation head. When a new click arrives, we apply a feature aggregation strategy to take all click points into account. The aggregation is done by vector sum if a positive click is given; otherwise, vector rejection is applied for a negative click. Given a new user correction, the aggregated feature of the current interaction is thus computed from the aggregated feature of the previous interaction and the feature extracted from the input image at the clicked position.

Fig. 3 illustrates the Feature-DCT for the input image feature. In more detail, we extract the features at the click position from three different layers and concatenate them. We then aggregate this multi-level feature using the strategy described above. Finally, the fully connected network predicts three sets of means and variances for applying instance normalization (IN) [20] to the features of a U-Net. In this way, the correction level that each click focuses on and the difference between clicks at each interaction are used efficiently. The Feature-DCT thus refines not only the internal region but also the area near boundaries.
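The aggregation rule and the scale-and-shift step can be sketched as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the direction of the rejection (removing from the running aggregate its component along the negative click's feature) is an assumption, and `gamma` and `beta` stand in for the per-channel values that the fully connected network would predict.

```python
import math
from typing import List, Sequence

Vector = List[float]


def vector_rejection(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Component of a orthogonal to b: a - (a.b / b.b) * b."""
    denom = sum(x * x for x in b)
    if denom == 0:
        return list(a)
    scale = sum(x * y for x, y in zip(a, b)) / denom
    return [x - scale * y for x, y in zip(a, b)]


def aggregate(previous: Sequence[float], click_feature: Sequence[float],
              is_positive: bool) -> Vector:
    """Vector sum for a positive click, vector rejection for a negative one."""
    if is_positive:
        return [p + c for p, c in zip(previous, click_feature)]
    return vector_rejection(previous, click_feature)  # assumed direction


def modulate(channels: Sequence[Sequence[float]], gamma: Sequence[float],
             beta: Sequence[float], eps: float = 1e-5) -> List[Vector]:
    """Instance-normalize each channel, then scale by gamma and shift by beta."""
    out = []
    for values, g, b in zip(channels, gamma, beta):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([g * (v - mean) * inv_std + b for v in values])
    return out


agg = aggregate([1.0, 1.0], [1.0, 0.0], is_positive=True)   # [2.0, 1.0]
agg = aggregate(agg, [1.0, 0.0], is_positive=False)         # [0.0, 1.0]
```

In the usage lines, the negative click removes the direction it shares with the running aggregate, so the remaining vector only keeps what the positive clicks contributed outside that direction.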

Figure 3: Feature Dynamic Click Transform.

2.3 Interactive Training

For the Dynamic Click Transform Network to learn the relation between user corrections and the predicted segmentation, we train our network click by click, similar to [4]. Starting from a single click on the object pixel farthest from the object boundary, a sequence of interactions is given according to the output mask. The loss is computed and the weights are updated at each interaction. Since collecting annotations from humans during training is impractical, we simulate them from the ground truth segmentation mask and the mask predicted by the network. For the first click, we compute the minimum distance to the object boundary for each pixel on the target object. We then pick the point farthest from the boundary and take its distance as the diffusion distance for Spatial-DCT. After the initial segmentation mask is predicted, we generate the subsequent clicks with respect to the previous prediction of the network. Each click is sampled in the largest mislabeled region, at the pixel whose Euclidean distance from the region boundary is larger than that of any other pixel within this region. The sampled click is considered a positive click if the corresponding pixel lies on the object and a negative click otherwise.
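The click simulation described above can be sketched as follows. It is a simplified stand-in, not the training code: masks are nested lists of 0/1, regions are 4-connected, and the distance to the region boundary is a breadth-first city-block distance used here in place of the Euclidean distance transform. The names `largest_error_region` and `simulate_click` are hypothetical.

```python
from collections import deque
from typing import List, Optional, Tuple

Mask = List[List[int]]
Pixel = Tuple[int, int]
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def largest_error_region(gt: Mask, pred: Mask) -> List[Pixel]:
    """Largest 4-connected set of mislabeled pixels sharing one ground truth label."""
    h, w = len(gt), len(gt[0])
    seen = [[False] * w for _ in range(h)]
    best: List[Pixel] = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or gt[r][c] == pred[r][c]:
                continue
            label, region, queue = gt[r][c], [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in STEPS:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and gt[ny][nx] != pred[ny][nx] and gt[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best


def simulate_click(gt: Mask, pred: Mask) -> Optional[Tuple[Pixel, int, bool]]:
    """Return (click, diffusion distance, is_positive), or None if pred equals gt."""
    region = largest_error_region(gt, pred)
    if not region:
        return None
    inside = set(region)
    dist = {}
    queue = deque()
    # Pixels touching anything outside the region sit at distance 1.
    for y, x in region:
        if any((y + dy, x + dx) not in inside for dy, dx in STEPS):
            dist[(y, x)] = 1
            queue.append((y, x))
    # Breadth-first growth inward gives a city-block distance to the boundary.
    while queue:
        y, x = queue.popleft()
        for dy, dx in STEPS:
            nxt = (y + dy, x + dx)
            if nxt in inside and nxt not in dist:
                dist[nxt] = dist[(y, x)] + 1
                queue.append(nxt)
    click = max(region, key=lambda p: dist[p])
    return click, dist[click], bool(gt[click[0]][click[1]])
```

Calling `simulate_click` with an all-zero prediction reproduces the first-click rule, since the whole object is then the single mislabeled region and the returned distance can serve as the initial diffusion distance.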

3 Experiments

3.1 Experimental Settings

We evaluate our proposed method on three publicly available datasets: GrabCut [18], Berkeley [15] and DAVIS [16]. GrabCut contains 50 images and provides a single object mask for each image; pixels in a thin band around the object boundary are not valid. Berkeley consists of 100 object masks on 96 images and represents some challenges encountered in interactive segmentation. DAVIS contains 50 videos with high-quality ground truth masks. To evaluate interactive segmentation algorithms, we use the same 354 individual frames sampled from videos as [8].

For evaluation, we use the same click generation strategy as in previous works and use a robot user to simulate clicks. After each interaction, we calculate the intersection over union (IoU) between the predicted mask and the ground truth mask and plot the mean intersection over union (mIoU) score against the number of clicks. We also report the mean number of clicks (mNoC), which is the average number of clicks required to reach a target IoU threshold. We set the IoU threshold to 90% and limit the maximum number of clicks to 20 for each sample, consistent with previous works.
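The two metrics can be made concrete with a short sketch. The function names are hypothetical, and counting a sample that never reaches the threshold as the maximum number of clicks is an assumption about the convention, although it is the common one for this benchmark protocol.

```python
from typing import List, Sequence


def iou(pred: Sequence[Sequence[int]], gt: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks given as nested 0/1 lists."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 1.0


def number_of_clicks(ious: Sequence[float], threshold: float = 0.90,
                     max_clicks: int = 20) -> int:
    """First click index (1-based) whose IoU reaches the threshold."""
    for k, value in enumerate(ious[:max_clicks], start=1):
        if value >= threshold:
            return k
    return max_clicks  # assumption: failures count as max_clicks


def mean_noc(per_sample_ious: Sequence[Sequence[float]], threshold: float = 0.90,
             max_clicks: int = 20) -> float:
    """Average number of clicks over all samples of a dataset."""
    counts: List[int] = [number_of_clicks(s, threshold, max_clicks)
                         for s in per_sample_ious]
    return sum(counts) / len(counts)
```

Here each inner sequence holds the IoU values of one sample after click 1, 2, 3, and so on.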

Method                              Interaction      GrabCut              Berkeley             DAVIS
                                                     NoC @ 90%   AuC      NoC @ 90%   AuC      NoC @ 90%   AuC
Baseline                            Click and Drag   4.4         0.904    5.73        0.901    9.13        0.821
Baseline+Spatial-DCT                Click and Drag   2.56        0.967    3.68        0.943    7.58        0.880
Baseline+Spatial-DCT+Feature-DCT    Click and Drag   1.70        0.979    2.97        0.952    5.92        0.907
Baseline+Spatial-DCT+Feature-DCT    Click            2.68        0.961    4.08        0.940    7.00        0.889

Table 1: Ablation studies of proposed methods.

3.2 Implementation Details

We formulate the training task as a binary segmentation problem and use the binary cross-entropy loss. We train all models with an iterative training strategy similar to that in [4] on the 8498 images of SBD [7], with a batch size of 8. The input images are randomly resized by a factor between 0.75 and 1.25 of the original size and then randomly cropped to a fixed size. We further augment the training samples with horizontal flipping and color jitter. We take ResNet50 pre-trained on ImageNet [3] as the backbone. For optimization, we use Adam; the learning rate is reduced by a factor of 0.1 every 10 epochs, and training completes after 20 epochs.
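The step schedule stated above (a factor of 0.1 every 10 epochs over 20 epochs) can be written as a small helper. The base learning rate used in the usage line is a placeholder chosen only for illustration, and `step_lr` is a hypothetical name.

```python
def step_lr(epoch: int, base_lr: float, step_size: int = 10,
            gamma: float = 0.1) -> float:
    """Learning rate at a 0-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Placeholder base rate; only the decay pattern comes from the text.
schedule = [step_lr(epoch, base_lr=1e-3) for epoch in range(20)]
```

With 20 epochs this gives two phases: the base rate for epochs 0 to 9 and one tenth of it for epochs 10 to 19.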

Method                  GrabCut      Berkeley     DAVIS
                        NoC @ 90%    NoC @ 90%    NoC @ 90%
Graph cut [2]           11.10        14.33        17.41
Random walker [5]       12.30        14.02        18.31
Geodesic matting [1]    12.44        15.96        19.50
ESC [6]                 9.20         12.11        17.70
GSC [6]                 9.12         12.57        17.52
DOS [21]                6.04         8.65         12.58
RIS-Net [10]            5.00         6.03         -
IIS-LD [9]              4.79         -            9.57
CMG [14]                3.58         5.60         -
BRS [8]                 3.60         5.08         8.24
f-BRS [19]              2.98         4.34         7.81
FCA-Net [11]            2.14         4.19         7.90
DCT-Net                 2.68         4.08         7.00
DCT-Net*                1.70         2.97         5.92

Table 2: Comparison of the mean number of clicks (mNoC) on different datasets [18, 15, 16]. * indicates the use of the click and drag scheme to get the diffusion radius.

3.3 Results

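Ablation study. In Tab. 1, we analyze the effectiveness of each component of the proposed method. We take the basic segmentation network with a Euclidean distance transform as our baseline and then add the proposed components one at a time. Under Click-and-Drag, adding Spatial-DCT reduces NoC @ 90% from 4.4 to 2.56 on GrabCut, from 5.73 to 3.68 on Berkeley, and from 9.13 to 7.58 on DAVIS; adding Feature-DCT reduces it further to 1.70, 2.97, and 5.92, respectively. Each component therefore improves the interactive segmentation model.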

Comparison to the state of the art. We compare our results with existing state-of-the-art methods on three standard benchmark datasets: GrabCut [18], Berkeley [15], and DAVIS [16]. Tab. 2 shows the average number of clicks required to reach the 90% IoU threshold, denoted NoC @ 90%. With click input only (diffusion distance predicted by the Auto-Drag-Head), our model requires 2.68 clicks on GrabCut and 4.08 clicks on Berkeley. Under the Click-and-Drag scheme, it reaches the same threshold in only 1.70 and 2.97 clicks, while the existing methods need more than 2 and 4 clicks, respectively. On DAVIS, we reach the 90% IoU threshold in 7.00 clicks with click input only and in 5.92 clicks under Click-and-Drag; the latter is a relative reduction of about 24% with respect to the best existing result (7.81 clicks, f-BRS [19]). With Click-and-Drag, our method requires the fewest clicks on all three datasets. With the Auto-Drag-Head, it requires the fewest clicks on Berkeley and DAVIS, while FCA-Net [11] needs fewer clicks on GrabCut (2.14 vs. 2.68).
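Relative gains on DAVIS can be checked directly from the NoC @ 90% values in Tab. 2. The helper name `relative_reduction` is hypothetical, and taking f-BRS (7.81 clicks) as the reference is a choice made here because it is the lowest DAVIS value among the compared methods.

```python
def relative_reduction(reference: float, ours: float) -> float:
    """Fraction of clicks saved with respect to a reference method."""
    return (reference - ours) / reference


# NoC @ 90% on DAVIS, copied from Tab. 2.
F_BRS, DCT_NET_CLICK, DCT_NET_DRAG = 7.81, 7.00, 5.92

print(f"click only:     {relative_reduction(F_BRS, DCT_NET_CLICK):.1%}")
print(f"click and drag: {relative_reduction(F_BRS, DCT_NET_DRAG):.1%}")
```

Under these inputs the click-only variant saves roughly a tenth of the clicks and the Click-and-Drag variant roughly a quarter.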

Figure 4: Qualitative comparison between our baseline and full model for the first 5 clicks (columns, left to right: 1 click to 5 clicks). Green points are positive clicks, red points are negative clicks, and objects are overlaid with the mask in dark red.

4 Conclusion

In this paper, we improve interactive object segmentation with an algorithm that reaches a good balance in human-machine collaboration. Specifically, we propose the Dynamic Click Transform Network (DCT-Net), which consists of a Spatial-DCT and a Feature-DCT. The Spatial-DCT applies an individual diffusion distance to each click, and the Feature-DCT aggregates the features at the clicked positions to adjust the distribution of the original feature map in a single forward pass. The experiments demonstrate the effectiveness of the proposed method, which compares favorably with the state of the art on three standard interactive segmentation benchmarks.

Acknowledgement This research was supported in part by the Ministry of Science and Technology of Taiwan (MOST 110-2218-E-002-025-), National Taiwan University (NTU-108L104039), Intel Corporation, Delta Electronics and Compal Electronics.


References

  • [1] X. Bai and G. Sapiro (2007) A geodesic framework for fast interactive image and video segmentation and matting. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1–8.
  • [2] Y. Y. Boykov and M. Jolly (2001) Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proceedings of IEEE International Conference on Computer Vision (ICCV), Vol. 1, pp. 105–112.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
  • [4] M. Forte, B. Price, S. Cohen, N. Xu, and F. Pitié (2020) Getting to 99% accuracy in interactive segmentation. arXiv preprint arXiv:2003.07932.
  • [5] L. Grady (2006) Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28 (11), pp. 1768–1783.
  • [6] V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zisserman (2010) Geodesic star convexity for interactive image segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3129–3136.
  • [7] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 991–998.
  • [8] W. Jang and C. Kim (2019) Interactive image segmentation via backpropagating refinement scheme. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5297–5306.
  • [9] Z. Li, Q. Chen, and V. Koltun (2018) Interactive image segmentation with latent diversity. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 577–585.
  • [10] J. Liew, Y. Wei, W. Xiong, S. Ong, and J. Feng (2017) Regional interactive image segmentation networks. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 2746–2754.
  • [11] Z. Lin, Z. Zhang, L. Chen, M. Cheng, and S. Lu (2020) Interactive image segmentation with first click attention. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13339–13348.
  • [12] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
  • [13] S. Mahadevan, P. Voigtlaender, and B. Leibe (2018) Iteratively trained interactive segmentation. In Proceedings of British Machine Vision Conference (BMVC).
  • [14] S. Majumder and A. Yao (2019) Content-aware multi-level guidance for interactive instance segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11602–11611.
  • [15] K. McGuinness and N. E. O’Connor (2010) A comparative evaluation of interactive segmentation algorithms. Pattern Recognition 43 (2), pp. 434–444.
  • [16] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 724–732.
  • [17] J. Pont-Tuset, M. A. Farré, and A. Smolic (2015) Semi-automatic video object segmentation by advanced manipulation of segmentation hierarchies. In 2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI), pp. 1–6.
  • [18] C. Rother, V. Kolmogorov, and A. Blake (2004) “GrabCut”: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG) 23 (3), pp. 309–314.
  • [19] K. Sofiiuk, I. Petrov, O. Barinova, and A. Konushin (2020) f-BRS: rethinking backpropagating refinement for interactive segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8623–8632.
  • [20] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
  • [21] N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang (2016) Deep interactive object selection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 373–381.