As deep learning gains popularity, the need for large amounts of annotated images has never been greater. Annotating images is a tedious and time-consuming task, especially in the field of image segmentation, where human annotators have to draw complex polygons around all sorts of objects. Interactive image segmentation aims to reduce the workload required to extract objects or regions from images. It relies on sparse user interactions, such as clicks or scribbles, to produce dense binary masks that precisely encompass the desired regions. It is often an iterative process in which users interact with the algorithm both to initialize and to adjust the generated segmentation masks. To be considered effective, an interactive segmentation algorithm must comply with three main requirements: i) meet high-quality standards, i.e., 80–90% intersection over union (IoU); ii) be less time-consuming than manual segmentation; and iii) be robust to variations in user interactions. It is also usually expected to be robust to domain or category shift.
Although most approaches rely on positive and negative clicks [26, 10, 13, 14], recent studies have shown that extreme clicks [18, 15] can be used effectively to give scale information and to precisely indicate points that belong to the object, thus removing ambiguities. In particular, the scale information allows the original image to be cropped, yielding a higher input resolution that significantly increases performance compared with using the full image. However, such an interaction requires users to click at exact object locations, a time-consuming task that is prone to inattention mistakes, and to click 4 times regardless of the object complexity, which can be insufficient at times and superfluous at others. Our method aims to extend extreme clicking toward generic contour clicking by removing its two main constraints: the fixed number of clicks and the need for specific click locations. The key contributions of our work can be summarized as follows:
We design a novel interactive segmentation pipeline suitable for consuming unconstrained contour clicks, hence relaxing the strong cognitive load of extreme points while maintaining the benefits of enclosure-based interactions such as higher resolution and scale information.
We show how such a pipeline can be trained to perform satisfactory segmentation from as few as two unstructured contour points.
We conduct an extensive study of the main enclosure-based interaction types with human annotators.
The resulting model is able to perform segmentation in real time directly in a web browser (53 ms for 128×128 instances through WebGL) with no need for a dedicated GPU. We believe this constitutes a more realistic solution compared with standard deep learning frameworks that require standalone graphics cards.
II Related work
User interaction types
User guidance often translates into scribble interactions on both foreground (positive) and background (negative) pixels, or a rough drawing around the target. This information is then fed to a heuristic algorithm to produce a segmentation. The arrival of deep neural networks able to extract higher-level features enabled sparser interactions such as simple clicks. While most approaches use positive and negative clicks [26, 10, 13, 14], recent studies have shown that extreme clicks [18, 15] can be used effectively to give scale information and to precisely indicate points that belong to the object, thus delivering valuable information. As users must click on the left-most, right-most, top, and bottom points of the object they want to segment, the interaction is also more consistent and reliably reproducible. Unlike positive and negative clicks, extreme points have not been extended to an iterative refinement training scheme.
In addition to the interaction's effectiveness, the mechanism behind the automatic simulation of extreme clicks is a confounding factor for both training and evaluation. Maninis et al. observe, in the case of extreme clicks, a decrease of up to 5% in mean IoU between evaluations with simulated and real clicks. We briefly present commonly used simulation strategies that mimic human behavior. Foreground clicks are usually constrained to cover the central area by using a margin from the object boundary or by applying k-medoids, whereas negative clicks are placed either peripheral to the object or on negative objects [26, 12, 1, 10, 11, 14]. Stricter interaction policies such as bounding boxes and extreme points simply include noise by perturbing the corners of the perfect coordinates up to a certain pixel amount [15, 2, 21] or scale percentage.
Embedding User Interactions
User interactions being sparse, they require effective pre-processing to be fully perceived and exploited by the segmentation network. A popular pre-processing step consists of encoding the interactions into a 2D image that can be fed to the convolutional network. Clicks are usually turned into Euclidean [26, 12, 7, 10, 8] or Gaussian [1, 10, 15, 13, 4] distance maps. The authors of [1, 13, 15, 2] observed that Gaussians yield better results than distance transforms. Three other transforms led to an improvement over Gaussians: binary disks, superpixels, and multi-focal ellipses, but no comparison between them was provided.
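For illustration, a Gaussian click encoding of this kind can be sketched as follows; the function name and the `sigma` bandwidth are illustrative choices here, not values taken from any of the cited works:

```python
import numpy as np

def gaussian_click_map(clicks, height, width, sigma=10.0):
    """Encode sparse clicks as a single-channel Gaussian heatmap.

    Each click (row, col) contributes a 2D Gaussian; overlapping
    Gaussians are merged with a pixel-wise maximum so that every
    click keeps a peak value of 1.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (r, c) in clicks:
        g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)
    return heatmap
```

The resulting map can be concatenated to the RGB crop as a fourth input channel, as done in the pipeline described later.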
In 2016, Xu et al. proposed the first interactive segmentation pipeline relying on a Fully Convolutional Network (FCN) encoder-decoder that takes as input the concatenation of the RGB image and the embedded user interaction. Most modern approaches to interactive segmentation follow this lead [13, 1, 15, 2, 14, 12, 11, 21]. While the majority use the whole image as input [13, 1, 14, 12, 11, 4, 8], recent architectures based on object enclosure [15, 24, 21, 2, 27, 22] feed image crops to the FCN to achieve a speed-up and preserve object details. One approach takes the whole image as input and exploits the predicted mask boundaries to obtain a crop of the image, which is subsequently fed into a refinement model. Instead of using image patches, Liew et al. crop the feature maps around the input clicks to infer local predictions, which are reassembled afterwards.
In order to learn deep features for images and interaction maps individually, the authors of [7, 4] use two separate encoder streams: one for the image and another for the interactions, at the cost of a heavier model.
Compared with negative and positive clicks, extreme clicks have the advantage of being less ambiguous and make it possible to reduce the search space by extracting an RoI around the object. However, such methods require users to click at exact object locations, which is more constraining than positive and negative clicks. Moreover, they require users to click at least 4 times regardless of the object complexity. To address these two main limitations, we propose a novel interactive segmentation approach that exploits unconstrained contour clicks, starting from as few as 2. This increased flexibility enables the unified approach to generate masks with different precision levels. We demonstrate that our method is able to deliver high-quality results with fewer clicks than the current state of the art in interactive segmentation.
Our network builds upon both approaches based on extreme clicks and those based on positive-negative clicks. Exploiting the contour clicks representing the target object enables us to crop the original image and benefit from a higher resolution. As in previous enclosure-based approaches, the crop is concatenated with its corresponding click heatmap and then fed to a binary segmentation network (Figure 2). However, unlike extreme-click methods and similar to positive-negative methods, we choose to investigate a much broader range of click numbers with unconstrained locations in an iterative fashion. This flexibility speeds up the interaction process even further and adapts well to both coarse and fine objects. Indeed, we observed that, in most cases, two unconstrained user clicks provide enough information for a model to predict an accurate segmentation mask (Table I). In some cases, complex objects or situations can leave two clicks insufficient for the model to correctly segment the object (Figure 6). In light of this observation, we propose an iterative approach in which correction clicks are added until a satisfactory segmentation mask is predicted by the model. The number of clicks therefore fits the complexity of the setup, speeding up the annotation process.
From a user’s perspective, our interactive segmentation pipeline can be summarized as follows: first the user clicks on two locations of the object contour, then they can add additional contour clicks to correct or refine the mask (Figure 2).
Simulating user interactions
Simulating user interactions is a challenge in the field of interactive segmentation. We propose a novel online iterative training scheme during which our model is trained with a combination of three strategies to simulate human contour clicks. When asking 5 annotators to draw a few clicks on object contours on the Berkeley dataset (100 instances), we observe that they instinctively distribute them to best represent the targeted object's breadth. In particular, for two contour clicks, we measure that the ratio between the distance of the clicked pair of points and the distance of the furthest ground-truth pair of points is approximately distributed as a normal random variable with mean 1 and standard deviation 0.03 (Figure 3). The interaction can therefore be simulated by selecting pairs whose distance follows this distribution, both during training and testing.
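This pair-selection scheme can be sketched as follows; `simulate_pair_clicks` is a hypothetical helper name, and the brute-force pair search is one possible way of picking a contour pair matching the sampled distance ratio:

```python
import numpy as np

def simulate_pair_clicks(contour, rng=None):
    """Pick two contour points whose separation, relative to the
    largest pairwise distance on the contour, follows N(1, 0.03),
    matching the distribution measured on real annotators.
    """
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(contour, dtype=np.float64)
    # Pairwise distances between all contour points.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # Sample a target separation from N(1, 0.03) times the max distance.
    target = rng.normal(1.0, 0.03) * d.max()
    # Pick the pair whose distance best matches the sampled target,
    # excluding the diagonal (a point paired with itself).
    penalty = np.eye(len(pts)) * 1e9
    i, j = np.unravel_index(np.argmin(np.abs(d - target) + penalty), d.shape)
    return pts[i], pts[j]
```

The same routine can serve both training-time simulation and evaluation, as the text suggests.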
We describe here the other two simulation strategies for additional clicks, as illustrated in Figure 4. Let $\mathcal{C}$ be the set of ground-truth contour pixels.
Geometric strategy: gradually refines salient regions of the target. We denote by $\mathcal{P}_k$ the contour pixels resulting from the conversion of the first $k$ clicks into polygon boundaries. The click $c_{k+1}$ is then obtained sequentially as the furthest ground-truth pixel from $\mathcal{P}_k$, so as to mold the clicks to the shape of the target:

$$c_{k+1} = \operatorname*{arg\,max}_{p \in \mathcal{C}} \; \min_{q \in \mathcal{P}_k} \lVert p - q \rVert_2$$
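A minimal sketch of the geometric strategy, with hypothetical helper names and a simple point-sampling approximation of the polygon boundary:

```python
import numpy as np

def _densify_polygon(vertices, samples_per_edge=50):
    """Sample points along each edge of the closed polygon."""
    v = np.asarray(vertices, dtype=np.float64)
    pts = []
    for a, b in zip(v, np.roll(v, -1, axis=0)):
        t = np.linspace(0.0, 1.0, samples_per_edge, endpoint=False)[:, None]
        pts.append(a + t * (b - a))
    return np.concatenate(pts)

def next_geometric_click(gt_contour, clicks):
    """Geometric strategy: return the ground-truth contour pixel
    farthest from the polygon formed by the current clicks."""
    gt = np.asarray(gt_contour, dtype=np.float64)
    poly = _densify_polygon(clicks)
    # Distance from every GT contour pixel to its nearest polygon point.
    d = np.linalg.norm(gt[:, None, :] - poly[None, :, :], axis=-1).min(axis=1)
    return gt[np.argmax(d)]
```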
Corrective strategy: relies upon the prediction of the interactive segmentation network from the first $k$ clicks. We denote by $\hat{\mathcal{P}}_k$ its corresponding contour pixels. The click $c_{k+1}$ is defined as the furthest ground-truth pixel from the prediction:

$$c_{k+1} = \operatorname*{arg\,max}_{p \in \mathcal{C}} \; \min_{q \in \hat{\mathcal{P}}_k} \lVert p - q \rVert_2$$
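The corrective strategy is a sketch away from the geometric one: only the reference contour changes, from the click polygon to the predicted mask's contour. The helper name is hypothetical:

```python
import numpy as np

def next_corrective_click(gt_contour, pred_contour):
    """Corrective strategy: return the ground-truth contour pixel
    farthest from the predicted mask's contour pixels."""
    gt = np.asarray(gt_contour, dtype=np.float64)
    pred = np.asarray(pred_contour, dtype=np.float64)
    # Distance from every GT contour pixel to the nearest predicted one.
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1).min(axis=1)
    return gt[np.argmax(d)]
```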
Batches of new clicks can be added at once by partitioning the erroneous areas and applying the strategy to each blob.
The first strategy aims to best represent the targeted object by gradually refining salient regions. The second aims to simulate human corrections of the network's errors by selecting the contour clicks furthest from the prediction contours. The corrective strategy applies natively to multi-region objects or objects with holes, as it is based on Euclidean distance from contours regardless of the contour hierarchy. The extension of the geometric strategy to multi-region objects is defined using a coarse-to-fine policy that prioritizes exterior hull coverage and subsequently interior regions, as shown in Figure 4.
Region of interest
As in previous enclosure-based approaches, we feed the network a crop of the original image to benefit from a higher resolution. While extracting a region of interest is straightforward when the provided clicks give a good approximation of the shape of the targeted object, it can be more difficult in the case of very sparse contour clicks (under four). As described previously, our human experiment showed an innate distribution of the clicks to best represent the breadth of the targeted object. To ensure full enclosure of the targeted object, we therefore extract the RoI by solving the smallest-circle problem, relying on Welzl's algorithm. The solution corresponds to the circle whose diameter joins the two clicks, and to the circumscribed circle for three points (Figure 2). To account for fluctuations in user interactions, guarantee object enclosure, and include context information, we expand the circle's diameter by a factor of 1.4 (Figure 5). The crop generated by this cut-off expansion ratio incurs a negligible mean loss (1%) on more than 20K instances in the SBD train set, using simulated click pairs at a 0.95 distance ratio.
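A sketch of the RoI extraction for the two- and three-click cases described above. Function names are hypothetical; for three points we use the circumscribed circle directly, as in the text, rather than the general Welzl recursion:

```python
import numpy as np

def enclosing_circle(points):
    """Circle enclosing 2 or 3 clicks: for two points, the circle
    whose diameter joins them; for three, the circumscribed circle
    (falling back to the farthest pair when nearly collinear)."""
    pts = np.asarray(points, dtype=np.float64)
    if len(pts) == 2:
        center = pts.mean(axis=0)
        return center, np.linalg.norm(pts[0] - pts[1]) / 2.0
    (ax, ay), (bx, by), (cx, cy) = pts
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-9:  # nearly collinear: use the two farthest points
        dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
        i, j = np.unravel_index(np.argmax(dists), dists.shape)
        return enclosing_circle(pts[[i, j]])
    # Standard circumcenter formula for a triangle.
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    center = np.array([ux, uy])
    return center, np.linalg.norm(center - pts[0])

def roi_from_clicks(points, expansion=1.4):
    """Square crop centered on the enclosing circle, with the
    diameter expanded by the 1.4 cut-off ratio from the text."""
    center, radius = enclosing_circle(points)
    half = radius * expansion
    (x0, y0), (x1, y1) = center - half, center + half
    return x0, y0, x1, y1
```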
An iterative training scheme
The training first consists of a warm-up phase with two contour clicks as input. These clicks are simulated geometrically as described in the previous section. During a second stage, we aim to cover a wider range of click numbers and randomly add further geometric contour clicks to each sample. Experimentally, we observe that this range of clicks allows for precise segmentation masks (Figure 7).
| Method | Training data | SBD | GrabCut | Berkeley | COCO MVal |
| --- | --- | --- | --- | --- | --- |
| VOS-Wild* (2017) | SBD Full | - | 3.8 | - | - |
| DEXTR* (2018) | SBD Full | - | 4.00 | 4+ (89.4%) | 4+ (80.1%) |
| CAMLGIIS (2019) | SBD Full | - | 3.58 | 5.60 | - |
| ITIS (2018) | SBD Train + VOC12 | - | 5.60 | - | - |
| IIS-LD (2018) | SBD Train | 7.41 | 4.79 | - | 7.86 |
| BRS (2019) | SBD Train | 6.59 | 3.60 | 5.08 | - |
| f-BRS-101 (2020) | SBD Train | 4.81 | 2.72 | 4.57 | - |
| GAIS (2020) | SBD Train + synt. | 3.90 | 2.54 | 3.53 | - |
| iFCN (2016) | VOC12 | 9.22 | 6.08 | 8.65 | 9.07 |
| RIS-Net (2017) | VOC12 | - | 5.00 | 6.03 | - |
| FCTSFN (2019) | VOC12 | - | 3.76 | 6.49 | 9.62 |
| MultiSeg** (2019) | SBD+VOC | - | 2.30 | 4.00 | - |
| FAIRS (2020) | VOC12 | 4.0 | 3.0 | 4.0 | - |
| UCP-Net* (Ours) | SBD Train | 2.73 | 2.76 | 2.70 | 2.00 |
We evaluate our model across five publicly available segmentation datasets. To compare our model with other segmentation methods, we use the mean number of clicks necessary to reach the typically used 85–90% IoU threshold, known as the Number of Clicks metric (NoC@x%). Forte et al. argue that this widely used metric fails to characterize the ability of models to progress over a wider range of clicks, which is particularly useful for applications with high-quality requirements such as image editing. They recommend additionally reporting accuracy scores across a range of clicks. We use the SBD dataset to train the proposed model. It includes 8,498 training images and 2,857 test images, corresponding respectively to 20,164 and 6,671 instances.
To simulate user clicks during evaluation, we first generate 2 clicks on the target object and then apply the corrective strategy to refine the prediction. To compute the NoC@x% metric, the refinement stops once the targeted IoU threshold is reached. To compute the mIoU for progressive clicks, the refinement is limited to a fixed maximum number of clicks.
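The NoC evaluation loop can be sketched as follows, with `model` and `corrective_click` as hypothetical stand-ins for the segmentation network and the corrective simulation strategy:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def noc_at(model, corrective_click, image, gt_mask, clicks,
           threshold=0.85, max_clicks=20):
    """Number-of-Clicks metric: add corrective clicks until the
    predicted mask reaches `threshold` IoU, capped at `max_clicks`.

    `model(image, clicks)` returns a binary mask;
    `corrective_click(pred, gt_mask)` returns the next click.
    """
    clicks = list(clicks)
    while True:
        pred = model(image, clicks)
        if iou(pred, gt_mask) >= threshold or len(clicks) >= max_clicks:
            return len(clicks)
        clicks.append(corrective_click(pred, gt_mask))
```

The `max_clicks` cap is a common evaluation convention and an assumption here, not a value stated in the text.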
IV-B Comparison of user interactions
To compare contour clicks with traditional enclosure interactions, we conducted an experiment with five human annotators. Annotators had to label the 100 images of the Berkeley dataset using bounding boxes, extreme clicks, and three or two free contour clicks.
| Interaction type | NoC | Average time (s) | Median time (s) |
| --- | --- | --- | --- |
| Free contour clicks | 3 | 5.51 | 5.42 |
| Free contour clicks | 2 | 3.78 | 4.36 |
| Interaction type | NoC | Simulated clicks (IoU, %) | Real user clicks (IoU, %) |
| --- | --- | --- | --- |
| Extreme clicks* | 4 | 89.1 | 87.7 |
| Free contour clicks | 2 | 87.0 | 86.3 |
The user guidelines for each interaction type are shown in Figure 10. To both designate the object of interest in the image and prevent users from anticipating their cursor position, a miniature of the image with the ground-truth mask of the targeted object is briefly displayed for two seconds. The miniature is then replaced with the full-resolution image, on which the user can draw a box or click on contour points. Figure 11 gives an overview of the annotation interface. While click precision was not mentioned in the guidelines, we calculated the accuracy with respect to the ground-truth box to ensure fairness in the time comparison between extreme points and bounding boxes, and found no significant difference in standard deviation (3.2% vs 3.6%).
Results are shown in Table II. Two user clicks proved to be almost three times faster than extreme clicks, while also being significantly faster than simple bounding boxes. Note that this finding contradicts the results of Papadopoulos et al., who observed that extreme points (7.2 s) were significantly faster than bounding boxes (34.5 s).
IV-C Implementation details
Architecture and hyper-parameters
Our model relies on an EfficientNet backbone, which has become the backbone of choice for many deep learning tasks and sits at the top of the ImageNet classification leaderboards. After pre-training on ImageNet, we train on SBD train (20,172 instance images; 8,498 images) and use SBD val for validation (6,671 instance images). Simulated user clicks are represented as Gaussian distance functions and fed to the model as a fourth image channel. Unless specified otherwise, the results given in this report were obtained by evaluating on SBD val. We use the Dice coefficient as our loss function, as our experiments showed that it yields a slightly higher mean IoU than binary cross-entropy alone or binary cross-entropy combined with the Dice coefficient. We use a learning rate of 1e-5, reduced to 1e-6 when the loss has not improved for 15 epochs. Training stops after 7 epochs without improvement of the loss function. We use a batch size of 12, which gave better results than batch sizes of 8 and 16. We resize images to 256×256, which yields a better IoU than resizing to 128×128 or 512×512. We set dropout to 0.5.
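The Dice loss mentioned above can be sketched framework-agnostically as follows; the `eps` smoothing term is a common stabilizer and an assumption here, not a reported hyper-parameter:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|).

    `pred` holds per-pixel probabilities, `target` the binary
    ground-truth mask; `eps` keeps empty masks well-defined.
    """
    pred = np.asarray(pred, dtype=np.float64).reshape(-1)
    target = np.asarray(target, dtype=np.float64).reshape(-1)
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```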
Table I provides a comparison of UCP-Net against previous interactive segmentation methods. We reach standard benchmark IoUs with fewer clicks on SBD and Berkeley while getting close to the state of the art on GrabCut. Moreover, we conducted an ablation study to evaluate the accuracy gain between extreme and unconstrained contour points. Following DEXTR's training protocol with 4 extreme points using our architecture, we observe that two unconstrained contour clicks allow for a similar accuracy (-1.4%) while being more than twice as fast (Tables II, III). We also observe experimentally a robustness to click-location variation (Figure 8), which further validates unconstrained clicks as a flexible and cognitively easy option for interactive segmentation. Qualitatively, our approach appears robust to object shape variation, occlusion, and dense scenes (Figure 9).
Figure 7 compares the ability of our model and other methods to improve segmentation masks as the number of user clicks increases. Our pipeline is able to continuously improve IoU with an increasing number of clicks. We observe a larger gap over other methods on SBD, which may be due to bias, as its training and test sets are the most similar. Note that we do not include the GAIS method in the curve comparison since it uses a synthetic dataset for training.
When compared against other contour-based methods, UCP-Net yields a significant drop in the number of user clicks needed to achieve a satisfactory segmentation on SBD, GrabCut, and COCO MVal.
With our generic contour-based approach, we have shown that unconstrained contour clicks enable faster and more accurate segmentation with fewer user clicks. We set a new state of the art for interactive segmentation on SBD, Berkeley, and COCO MVal. Our method is suitable for annotation purposes, enabling datasets to be labeled with only a handful of user interactions. Moreover, it is also well suited to image editing applications, as our iterative scheme makes it possible to reach very high accuracy. In future work, investigating the embedding of contour clicks might prove relevant to best exploit this interaction, as was found for extreme clicks. Moreover, using the previously predicted segmentation yields significant improvement in iterative positive and negative interaction approaches [1, 4] and may be equally applicable to unconstrained clicks. Given the nature of contour clicks, they could also be further exploited to simultaneously segment or correct objects that are close to one another or that overlap, as they share common boundaries.
- (2018) Interactive video object segmentation in the wild. arXiv, abs/1801.00269.
- (2019) Large-scale interactive object segmentation with human annotators. arXiv, abs/1903.10830.
- (2018) Iterative interaction training for segmentation editing networks. In Machine Learning in Medical Imaging (MLMI 2018, held in conjunction with MICCAI 2018), Lecture Notes in Computer Science, Vol. 11046, pp. 363–370.
- (2020) Getting to 99% accuracy in interactive segmentation. arXiv preprint.
- Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B 51, pp. 271–279.
- (2011) Semantic contours from inverse detectors. pp. 991–998.
- (2019) A fully convolutional two-stream fusion network for interactive image segmentation. Neural Networks 109.
- Interactive image segmentation via backpropagating refinement scheme. In Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) imgaug. https://github.com/aleju/imgaug
- (2018) Interactive image segmentation with latent diversity. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 577–585.
- (2019) MultiSeg: semantically meaningful, scale-diverse segmentations from minimal user input. In International Conference on Computer Vision (ICCV), pp. 662–670.
- (2017) Regional interactive image segmentation networks. In International Conference on Computer Vision (ICCV), pp. 2746–2754.
- (2018) Iteratively trained interactive segmentation. In British Machine Vision Conference (BMVC), abs/1805.04398.
- (2019) Content-aware multi-level guidance for interactive instance segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Deep extreme cut: from extreme points to object segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 616–625.
- (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In International Conference on Computer Vision (ICCV), Vol. 2, pp. 416–423.
- (1995) Intelligent scissors for image composition. In Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '95), New York, NY, USA, pp. 191–198.
- (2017) Extreme clicking for efficient object annotation. In International Conference on Computer Vision (ICCV), pp. 4940–4949.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), pp. 234–241.
- (2004) "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics 23 (3), pp. 309–314.
- (2020) FAIRS – soft focus generator and attention for robust object segmentation from extreme points. arXiv preprint.
- (2020) f-BRS: rethinking backpropagating refinement for interactive segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR).
- EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), pp. 6105–6114.
- (2019) Object instance annotation with deep extreme level set evolution. In Conference on Computer Vision and Pattern Recognition (CVPR).
- (1991) Smallest enclosing disks (balls and ellipsoids). In New Results and New Trends in Computer Science.
- (2016) Deep interactive object selection. In Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) Interactive object segmentation with inside-outside guidance. In Conference on Computer Vision and Pattern Recognition (CVPR).