Interactive segmentation, also known as interactive object selection, aims to segment the object of interest and refine the segmentation mask with a human in the loop. The segmentation results are useful for many applications, such as video editing, medical image analysis, and especially human-machine collaborative annotation. Because the demand for fine-grained image annotations has increased dramatically with the development of data-driven deep learning methods, efficient interactive segmentation is needed to alleviate the burden of manually labeling each pixel in an image.
Among interactive segmentation scenarios, user interactions are usually given through bounding boxes, clicks [21, 8, 9, 10], or scribbles [5, 6]. A box-based interface lets the user indicate the target by drawing a bounding box, capturing the entire object's extent; however, some background pixels are included at the same time, making the user intention imprecise. In contrast, a click-based interface provides a precise location but lacks information about the object or region size. Although scribbles provide precise and rich information, drawing them burdens the user far more than clicking points. Given these user inputs, classical approaches [2, 18, 1, 5]
formulate the segmentation process as a graph-based optimization problem. Inspired by the success of fully convolutional networks (FCNs) on semantic segmentation, Xu et al. first proposed a deep learning-based interactive segmentation algorithm. They compute two additional distance maps representing the positive and negative clicks from the user and concatenate them with the input image to generate the desired foreground mask with an FCN model. Most later works follow this strategy but transform the user clicks into Euclidean distance maps [21, 10], Gaussian maps [9, 13], or multiple guidance maps. To make further use of user clicks, BRS and f-BRS
proposed a back-propagating refinement scheme that adjusts the original input click maps by forcing the interaction points to have the correct predicted labels. However, these methods regard all clicks as of equal importance and transform them with an identical function, discarding the relation between the clicks and the target object. Moreover, the back-propagation-based methods must additionally minimize a predefined energy function through iterative backpropagation, which incurs extra computation and increases the inference time.
In this paper, we start by adopting a click-based interaction first proposed in , called Click-and-Drag. It adds a drag action to each click, which places almost no extra burden on users. This novel interaction scheme combines the advantages of clicks and bounding boxes, obtaining a precise location together with object scale information. We then propose a Dynamic Click Transform Network (DCT-Net), which contains two components: a Spatial Dynamic Click Transform (Spatial-DCT) and a Feature Dynamic Click Transform (Feature-DCT). The network takes both spatial geometry and feature distribution into consideration to make good use of the click-and-drag interaction. Spatial-DCT transforms each user click into 2D maps by applying an individual Gaussian mask whose radius is dynamically determined by the object scale. Compared to the identical transform used in most previous works, our approach is more robust to objects at different scales. Feature-DCT further exploits the user clicks in the feature domain by refining the feature distribution of the whole input image according to the feature at the clicked position. With this operation, the features change dynamically at each interaction, helping the network focus on mislabeled parts. The main contributions of this paper are:
We adopt a Click-and-Drag interaction, which combines the advantages of clicks and bounding boxes.
We propose a Spatial Dynamic Click Transform to encode both the object scale and the region to refine into the distance maps.
We propose a Feature Dynamic Click Transform to aggregate all clicked features and adjust the whole image feature to distinguish pixels belonging to the object of interest.
2 Proposed Method
The architecture overview is illustrated in Fig. 1. The proposed Dynamic Click Transform Network (DCT-Net) is based on an encoder-decoder architecture with spatial pyramid pooling (SPP). We transform the user clicks not only in the spatial domain but also in the high-dimensional feature domain, using the proposed Spatial-DCT and Feature-DCT, respectively. In the Spatial-DCT, we encode each click with a Gaussian mask whose diffusion radius is determined individually from the target region size. The Feature-DCT scales and shifts the features extracted from the input image, resulting in a better distribution for separating the target object.
2.1 Spatial Dynamic Click Transform
Most previous works, which use either a Euclidean distance transform or a Gaussian transform, regard all clicks as of equal importance. Given a sequence of user interactions including a positive click set $\mathcal{S}^1$ and a negative click set $\mathcal{S}^0$, the clicks are encoded into two distance maps $D^1$ and $D^0$ for the positive and negative clicks, respectively. More formally, the value at pixel location $p$ can be computed as
$$D^t(p) = \min_{c \in \mathcal{S}^t} \phi(\lVert p - c \rVert_2), \quad t \in \{0, 1\},$$
where $\phi$ is either the identity (yielding a Euclidean distance map) or a Gaussian with a fixed radius (with the min replaced by a max). These distance maps only localize the user clicks and ignore the target object scale or the mislabeled region size, both of which directly impact network performance. Instead, we take the relation between each click and the target object or mislabeled region into consideration and encode this information into the distance maps. Our transform can be written as
$$D^t(p) = \max_{c \in \mathcal{S}^t} \exp\!\left(-\frac{\lVert p - c \rVert_2^2}{2\sigma_c^2}\right),$$
where the diffusion distance $\sigma_c$ is determined individually for each click $c$.
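As a concrete illustration, the identical Gaussian transform criticized above can be sketched as follows; the function name and the fixed radius are our own illustrative choices, not from the paper:

```python
import numpy as np

def encode_clicks(shape, clicks, sigma=10.0):
    """Encode a set of (row, col) clicks as a single 2D Gaussian
    guidance map.  Every click shares the same fixed sigma -- the
    'identical transform' used by most prior click-based methods."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    guidance = np.zeros((h, w), dtype=np.float32)
    for cy, cx in clicks:
        d2 = (ys - cy) ** 2 + (xs - cx) ** 2
        # Combine per-click Gaussians with an element-wise max.
        guidance = np.maximum(guidance, np.exp(-d2 / (2.0 * sigma ** 2)))
    return guidance
```

Because sigma is constant, the map carries no information about the scale of the object or the mislabeled region, which is the limitation Spatial-DCT addresses.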
We take a Gaussian function as our transform and dynamically change the diffusion distance of each click according to the user interaction. The simplest and most precise way to set the diffusion distance is to obtain it directly from the user. As illustrated in Fig. 2, we adopt a novel interaction interface called Click-and-Drag: the user clicks on the center of the largest incorrect region and then drags outward until reaching the nearest boundary. The distance between the clicked and released positions is recorded as the diffusion distance of that click. With the user-given drag, we can clearly understand the user's intention and know the size of the mislabeled region. In our experiments, to compare fairly against other click-based methods, we also propose an Auto-Drag-Head, a lightweight neural network that automatically predicts the diffusion distance. Even with this predicted value, we achieve better results than with an identical transform.
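A minimal sketch of the Spatial-DCT encoding under these definitions, assuming each click is given as a (row, col, drag length) triple (the names and interface are ours):

```python
import numpy as np

def spatial_dct(shape, clicks_with_radius):
    """Spatial-DCT sketch: each click (cy, cx, r) gets its own
    Gaussian whose diffusion distance r comes from the user's drag
    (or from the Auto-Drag-Head).  Per-click maps are merged with
    an element-wise max."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    guidance = np.zeros((h, w), dtype=np.float32)
    for cy, cx, r in clicks_with_radius:
        sigma = max(float(r), 1.0)   # the drag length sets the spread
        d2 = (ys - cy) ** 2 + (xs - cx) ** 2
        guidance = np.maximum(guidance, np.exp(-d2 / (2.0 * sigma ** 2)))
    return guidance
```

A short drag thus produces a tight peak for a small correction, while a long drag spreads the guidance over a large mislabeled region.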
2.2 Feature Dynamic Click Transform
From the user input clicks, we can gather more information in addition to the spatial correlation. In the Feature-DCT, we utilize the features extracted from the input image at the user-clicked positions, which are rarely used in most existing methods. First, we gather the feature at the clicked position and feed it into a fully connected network to output a set of means and variances for each channel. Second, the original feature map is scaled and shifted by the predicted means and variances, then fed into the segmentation head. When a new click arrives, we apply a feature aggregation strategy to take all clicked points into account. The aggregation is done by vector addition if a positive click is given; otherwise, vector rejection is applied for a negative click. Given a user correction clicked at position $p_i$ at the $i$-th interaction, the aggregated feature can be defined as
$$f^{(i)}_{\mathrm{agg}} = \begin{cases} f^{(i-1)}_{\mathrm{agg}} + f(p_i), & \text{if the click is positive,} \\[4pt] f^{(i-1)}_{\mathrm{agg}} - \dfrac{f^{(i-1)}_{\mathrm{agg}} \cdot f(p_i)}{\lVert f(p_i) \rVert^2}\, f(p_i), & \text{if the click is negative,} \end{cases}$$
where $f^{(i)}_{\mathrm{agg}}$ is the aggregated feature at the $i$-th interaction and $f(p_i)$ is the feature extracted from the input image at the clicked position $p_i$. Fig. 3 illustrates the Feature-DCT for the input image feature. In more detail, we extract features at three different layers at the click position and concatenate them; this multi-level feature is then aggregated by the strategy described above. Finally, the fully connected network predicts three sets of means and variances used to apply instance normalization (IN)  to the features of a U-Net. In this way, the correction level each click focuses on and the differences between clicks at each interaction are used efficiently. The Feature-DCT thus refines not only the internal region but also the area near boundaries.
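The aggregation rule and the instance-normalization-style modulation can be sketched as below; the function names, the NumPy setting, and the per-channel gamma/beta interface are our assumptions, not the paper's implementation:

```python
import numpy as np

def aggregate(f_agg, f_click, positive):
    """Feature aggregation sketch: vector addition for a positive
    click; vector rejection for a negative click (remove the
    component of the running feature along the clicked feature)."""
    if positive:
        return f_agg + f_click
    proj = (f_agg @ f_click) / (f_click @ f_click + 1e-8) * f_click
    return f_agg - proj

def scale_shift(feat, gamma, beta, eps=1e-5):
    """Instance-norm-style modulation: normalize each channel of a
    (C, H, W) feature map, then apply the predicted per-channel
    scale gamma and shift beta."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    var = feat.var(axis=(1, 2), keepdims=True)
    normed = (feat - mu) / np.sqrt(var + eps)
    return gamma[:, None, None] * normed + beta[:, None, None]
```

After rejecting a negative click's feature, the aggregated vector is orthogonal to it, so the modulation no longer emphasizes the mislabeled background.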
2.3 Interactive Training
For the Dynamic Click Transform Network to learn the relation between user corrections and the predicted segmentation, we train our network click by click, similar to . Starting from a single click on the pixel farthest from the object boundary, a sequence of interactions is given according to the output mask; the loss is computed and the weights are updated at each interaction. Since user annotations are impractical to obtain from humans during training, we simulate them from the ground-truth segmentation mask and the network's predicted mask. For the first click, we compute the minimum distance to the object boundary for each pixel on the target object, pick the pixel farthest from the boundary, and take the corresponding distance as the diffusion distance for the Spatial-DCT. After the initial segmentation mask is predicted, we generate the subsequent clicks with respect to the previous prediction of the network. A click is sampled on the largest mislabeled region at the pixel whose Euclidean distance from the region boundary is larger than that of any other pixel within the region. The sampled click is considered positive if the corresponding pixel lies on the object and negative otherwise.
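The click simulation step can be sketched as follows; this brute-force version (our own illustrative code, not the paper's) picks the interior pixel of a mislabeled region farthest from any pixel outside it and returns that distance as the diffusion radius:

```python
import numpy as np

def simulate_click(error_mask):
    """Simulate the next click on a boolean mislabeled-region mask:
    return the pixel whose Euclidean distance to the nearest pixel
    outside the region is largest, together with that distance.
    Brute force for small masks; a real implementation would use a
    distance transform."""
    fg = np.argwhere(error_mask)
    bg = np.argwhere(~error_mask)
    best, best_d = None, -1.0
    for p in fg:
        # Distance from this interior pixel to the region boundary.
        d = np.sqrt(((bg - p) ** 2).sum(axis=1)).min()
        if d > best_d:
            best, best_d = tuple(p), float(d)
    return best, best_d
```

During training the same routine, applied to the ground-truth mask, yields the first click and its diffusion distance; applied to the XOR of prediction and ground truth, it yields each subsequent correction.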
3.1 Experimental Settings
We evaluate our proposed method on three publicly available datasets: GrabCut , Berkeley , and DAVIS . GrabCut contains 50 images with a single object mask per image; pixels in a thin band around each object boundary are excluded from evaluation. Berkeley consists of 100 object masks on 96 images and represents several challenges encountered in interactive segmentation. DAVIS contains 50 videos with high-quality ground-truth masks; to evaluate interactive segmentation algorithms, we use the same 354 individual frames sampled from the videos as in .
For evaluation, we use the same click generation strategy as previous works and use a robot to simulate user clicks. After each interaction, we calculate the intersection over union (IoU) between the predicted mask and the ground-truth mask and plot the mean IoU (mIoU) against the number of clicks. We also adopt the mean number of clicks (mNoC), the average number of clicks required to achieve a target IoU threshold. We set the IoU threshold to 90% and limit the maximum number of clicks to 20 per sample, consistent with previous works.
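The NoC metric described above can be sketched as:

```python
def noc(ious_per_click, threshold=0.90, max_clicks=20):
    """Number of clicks needed to reach the IoU threshold for one
    sample; if it is never reached within max_clicks, the sample
    counts as max_clicks (the NoC @ 90% convention)."""
    for k, iou in enumerate(ious_per_click[:max_clicks], start=1):
        if iou >= threshold:
            return k
    return max_clicks

def mean_noc(all_ious, threshold=0.90, max_clicks=20):
    """Average NoC over a dataset of per-click IoU sequences."""
    return sum(noc(s, threshold, max_clicks) for s in all_ious) / len(all_ious)
```

Capping unsolved samples at 20 clicks means mNoC penalizes failures without letting a single hard image dominate the average.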
| Model | GrabCut NoC @ 90% | GrabCut AuC | Berkeley NoC @ 90% | Berkeley AuC | DAVIS NoC @ 90% | DAVIS AuC |
|---|---|---|---|---|---|---|
| Baseline + Click and Drag | 4.4 | 0.904 | 5.73 | 0.901 | 9.13 | 0.821 |
3.2 Implementation Details
We formulate the training task as a binary segmentation problem and use the binary cross-entropy loss. We train all models with an iterative training strategy similar to that in  on the 8498 images of SBD , with a batch size of 8. The input images are randomly resized between 0.75 and 1.25 of the original size and then randomly cropped to a fixed size. We further augment the training samples with horizontal flipping and color jitter. We take ResNet-50 pre-trained on ImageNet as the backbone. For optimization, we use Adam; the learning rate is reduced by a factor of 0.1 every 10 epochs, and training completes after 20 epochs.
| Method | GrabCut NoC @ 90% | Berkeley NoC @ 90% | DAVIS NoC @ 90% |
|---|---|---|---|
| Graph cut | 11.10 | 14.33 | 17.41 |
| Random walker | 12.30 | 14.02 | 18.31 |
| Geodesic matting | 12.44 | 15.96 | 19.50 |
Ablation study. In Tab. 1, we analyze the effectiveness of each component of our proposed method. We take the basic segmentation network with the Euclidean distance transform as our baseline model and then gradually add the proposed components. Overall, the proposed components are highly beneficial for the interactive segmentation model.
Comparison to the state of the art. We compare our results with existing state-of-the-art methods on three standard benchmark datasets: GrabCut , Berkeley , and DAVIS . Tab. 2 shows the average number of clicks required to reach the 90% IoU threshold, denoted NoC @ 90%. Our model requires 2.68 clicks on GrabCut and 4.08 clicks on Berkeley when using click input only. Under the Click-and-Drag scheme, it reaches the same threshold in only 1.98 and 2.68 clicks, respectively, while existing methods need more than 2 and 4 clicks. On DAVIS, we reach the 90% IoU threshold with fewer than 7 clicks, a relative improvement of 20%. Our method requires the fewest clicks on all datasets, whether the diffusion distance is predicted by the Auto-Drag-Head or given through the Click-and-Drag scheme.
In this paper, we improve interactive object segmentation with a novel algorithm that strikes a good balance in human-machine collaboration. Specifically, we propose the Dynamic Click Transform Network (DCT-Net), which consists of a Spatial-DCT and a Feature-DCT: the former applies anisotropic diffusion to individual clicks, and the latter aggregates the corresponding features to adjust the distribution of the original feature map in a single forward pass. The conducted experiments demonstrate the effectiveness of our proposed method and show state-of-the-art performance on three standard interactive segmentation benchmarks.
Acknowledgement This research was supported in part by the Ministry of Science and Technology of Taiwan (MOST 110-2218-E-002-025-), National Taiwan University (NTU-108L104039), Intel Corporation, Delta Electronics and Compal Electronics.
-  (2007) A geodesic framework for fast interactive image and video segmentation and matting. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1–8. Cited by: §1, Table 2.
-  (2001) Interactive graph cuts for optimal boundary & region segmentation of objects in nd images. In Proceedings of IEEE International Conference on Computer Vision (ICCV), Vol. 1, pp. 105–112. Cited by: §1, Table 2.
-  (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §3.2.
-  (2020) Getting to 99% accuracy in interactive segmentation. arXiv preprint arXiv:2003.07932. Cited by: §2.3, §3.2.
-  (2006) Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28 (11), pp. 1768–1783. Cited by: §1, Table 2.
-  (2010) Geodesic star convexity for interactive image segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3129–3136. Cited by: §1, Table 2.
-  (2011) Semantic contours from inverse detectors. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 991–998. Cited by: §3.2.
-  (2019) Interactive image segmentation via backpropagating refinement scheme. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5297–5306. Cited by: §1, §3.1, Table 2.
-  (2018) Interactive image segmentation with latent diversity. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 577–585. Cited by: §1, Table 2.
-  (2017) Regional interactive image segmentation networks. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 2746–2754. Cited by: §1, Table 2.
-  (2020) Interactive image segmentation with first click attention. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13339–13348. Cited by: Table 2.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: §1.
-  (2018) Iteratively trained interactive segmentation. In Proceedings of British Machine Vision Conference (BMVC), Cited by: §1.
-  (2019) Content-aware multi-level guidance for interactive instance segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11602–11611. Cited by: §1, Table 2.
-  (2010) A comparative evaluation of interactive segmentation algorithms. Pattern Recognition 43 (2), pp. 434–444. Cited by: §3.1, §3.3, Table 2.
-  (2016) A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 724–732. Cited by: §3.1, §3.3, Table 2.
-  (2015) Semi-automatic video object segmentation by advanced manipulation of segmentation hierarchies. In 2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI), pp. 1–6. Cited by: §1.
-  (2004) "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG) 23 (3), pp. 309–314. Cited by: §1, §3.1, §3.3, Table 2.
-  (2020) f-BRS: rethinking backpropagating refinement for interactive segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8623–8632. Cited by: §1, Table 2.
-  (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §2.2.
-  (2016) Deep interactive object selection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 373–381. Cited by: §1, Table 2.