UI-Net: Interactive Artificial Neural Networks for Iterative Image Segmentation Based on a User Model

by Mario Amrehn, et al.

For complex segmentation tasks, fully automatic systems are inherently limited in their achievable accuracy for extracting relevant objects. Especially in cases where only a few data sets need to be processed for a highly accurate result, semi-automatic segmentation techniques exhibit a clear benefit for the user. One area of application is medical image processing during an intervention for a single patient. We propose a learning-based cooperative segmentation approach which involves both the computing entity and the user in the task. Our system builds upon a state-of-the-art fully convolutional artificial neural network (FCN) as well as an active user model for training. During the segmentation process, a user of the trained system can iteratively add hints in the form of pictorial scribbles, which are fed as seed points into the FCN system, to achieve an interactive and precise segmentation result. The segmentation quality of interactive FCNs is evaluated. Iterative FCN approaches can yield superior results compared to networks without the user input channel component, due to a consistent improvement in segmentation quality after each interaction.






1 Introduction

Trans-catheter arterial chemoembolization (TACE) [LGLS11] is a minimally invasive treatment for liver cancer, utilizing image guidance. Vessels which supply the hepatocellular carcinoma (HCC) with oxygenated blood are induced with chemotherapeutic agent and subsequently occluded by a physician [KCL05]. During the intervention, several 2-D projection images are acquired by a cone-beam C-arm CT scanner in order to reconstruct a volumetric DynaCT image of the patient’s abdomen. A segmentation of the cancerous tissue is performed on the image data. Subsequently, the outline of the segmented tumor is used to find collateral vessels.

The more accurate the segmentation, the less healthy tissue surrounding the lesion is occluded, minimizing the toxicity of the procedure. In addition, a higher percentage of the tumor can be treated with chemotherapeutic agent, increasing the efficacy of the therapy [LNT02]. Segmenting the tumor with high accuracy is especially challenging due to variations in shape and size (Fig. 1 a–f) and a high diversity in X-ray attenuation (Fig. 1 a).






Figure 1: Challenges of hepatic lesion segmentation are (a) high diversity in gray-values, no typical shape, (b) intensity overlaps between tumor and surrounding tissue, (c) intensity patches due to necrotic regions, and (d–f) varying appearance of the same tumor between 2-D slices.

Current approaches for intra-procedural tumor segmentation can be divided into three main categories depending on the involvement of the operating user in the segmentation process: manual, automatic, and interactive. (1) With manual segmentation schemes, users draw the complete contour line of the object to be segmented with minimal assistance by the system. A perfect manual segmentation of hepatic lesions is feasible, but would take several minutes until an appropriate result is reached during the intervention due to the primitive tools provided to the user. (2) Fully automated segmentation approaches may also exhibit long runtimes attributable to a lack of domain knowledge of the system. If learning-based, such methods may also need a large amount of training data in order to achieve an acceptable accuracy of their outcome. Still, a perfect segmentation may not be reachable. Users do not have control over the process. However, trained physicians could substantially assist in reaching the goal of a fast and exact segmentation, due to their knowledge of a very good estimate of the true tumor extent. (3) Interactive segmentation methods are applicable, particularly in situations where only few or even no annotated data sets of similar segmentation tasks are available, or the task is to produce only a few new, but accurate segmentations. The limiting factor for scaling this approach is the time spent by users to provide input during each image segmentation task. Therefore, interactive methods are not a replacement for fully automated approaches, but can supersede them in certain niches on account of their high accuracy reached by efficient use of their operators’ expertise.

1.1 User Input

According to [OS01], user interactions can be categorized depending on the interactive segmentation system’s interface: (1) A menu-driven user input scheme as in [RPN15] limits the user’s scope of action, trading their control over the segmentation outcome for more guidance by the system. (2) Setting parameter values directly requires the user to have insight into the algorithm. (3) Pictorial input on the image $I$, which consists of $n$ image elements, is the most intuitive case for the user. This method mimics human behavior during knowledge transfer via a visual medium. For the scope of this paper, a pictorial user input is utilized. This is the most challenging class of user simulation, but also the most intuitive interaction scheme for the human operator.

According to Nickisch et al. [NRKR10], there are three different approaches to include this user-dependent pictorial data into the evaluation process. Given a predefined task, several human participants interact with the system in (1) user studies or by (2) crowd sourcing in order to gather plausible hints at every step of the iterative segmentation. (3) An active user model (also called robot user) aims at a fast and highly scalable method to simulate plausible user interactions with the segmentation system. Such a model may be learned from a sufficiently large user interaction database compiled utilizing data from (1) or (2). Alternatively, the model can be defined by a rule-based system such as [NRKR10, ZND11] or the one proposed here.

1.2 State-of-the-art

Upconvolutional network topologies such as the FCN are a promising technique for solving element-wise (dense) prediction problems on image data [LSD15, LBH15, WGCM16]. Classical convolutional neural networks (CNNs) typically append fully-connected layers or multilayer perceptrons to their contracting path. In contrast, FCNs solely consist of convolutional and pooling layers. The missing layer types are substituted by unpooling/upsampling and deconvolutional/upconvolutional layers in an expanding path. Shift-invariant filter operations are therefore applied in each step of the segmentation computation, forming hierarchies of learned features. In this paper, we utilize the FCN topology for pixel-level classification that is commonly referred to as the U-net [RFB15]. In CNNs, pooling is performed to introduce a hierarchy of features, preserving only a condensed version of the former neighborhood’s information. Some localization information is lost during each pooling operation due to an increasingly coarse image representation. The U-net architecture recovers spatial information by preserving spatial resolution from previous layers and linking it to later neurons in less fine-grained layers. The U-net architecture in combination with augmentation of the input image data [SSP03] allows for particularly high accuracy segmentation results from a relatively small set of training data.

FCNs have been successfully applied to several segmentation tasks [LSD15, RFB15], but have so far only been considered in a fully automated context, thereby omitting valuable prior knowledge of trained personnel. In this paper, we extend the use of FCNs by an interactive component during training. The resulting fully trained network is then able to improve its segmentation suggestions depending on user-defined seed points during the segmentation process. This property is achieved by simulating plausible user inputs during the training phase of the artificial neural network (ANN) by an active user model which reacts to segmentation suggestions. The interactive learning-based system is compared to the well-known U-net [RFB15] without a user model and to the GrowCut [VK05, AGSM16] segmentation method via the Sørensen-Dice coefficient (Dice) [Dic45].
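The Dice coefficient used for this comparison can be computed from two binary masks. The following is a minimal sketch, assuming NumPy arrays as mask representation; the function name and the convention for two empty masks are assumptions of this example and are not taken from the paper.

```python
import numpy as np

def dice_coefficient(segmentation, ground_truth):
    """Sørensen-Dice overlap of two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    seg = np.asarray(segmentation, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    denominator = seg.sum() + gt.sum()
    if denominator == 0:
        # Both masks are empty; counting this as perfect agreement is a convention.
        return 1.0
    return 2.0 * np.logical_and(seg, gt).sum() / denominator

# Toy 4x4 example: the prediction covers 3 of the 4 ground-truth pixels.
ground_truth = np.zeros((4, 4), dtype=bool)
ground_truth[1:3, 1:3] = True
prediction = ground_truth.copy()
prediction[1, 1] = False
print(dice_coefficient(prediction, ground_truth))
```

A value of 1 indicates identical masks, a value of 0 indicates no overlap.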





Figure 2: Active user model: (a) from the current seed mask, (b) a segmentation is computed (cyan). Ground truth is depicted in green. (c) The difference mask (red) is used to randomly select misclassified image elements, in this case a single element. (d) The seed mask is updated by the user model. (e) Improved segmentation is obtained (magenta) w. r. t. the previous segmentation (b) in blue.

2 Methods

2.1 Interactive Network Architecture

Pictorial scribbles (seed points, lines, and shapes) are drawn by the user as an overlay mask on the visualization of the image to segment. Lines and complex shapes are represented as a set of seed points. A seed point denotes a tuple $(p, l)$, where $p$ is a position in the image space and $l \in \{\text{background}, \text{foreground}\}$ represents the label at this position in a binary segmentation system. Seed points are defined by the user in order to act as a representative subset of the segmentation ground truth $G$. The image with the same dimensions as the input image $I$ and values $l$ at image coordinates $p$, for each seed point tuple $(p, l)$, is called the seed mask $S$. In each iteration, active user models add labeled scribbles to $S$ based on the difference of the current segmentation to the ground truth in order to define the next interaction with the system, a strategy human users pursue as well.
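The construction of a seed mask from a set of seed points can be sketched as follows. The numeric label encoding (-1, 0, 1) and the function name are assumptions made for this illustration only; the sketch merely shows how (position, label) tuples map onto an image-sized mask.

```python
import numpy as np

# Hypothetical label encoding; the concrete values are an assumption of this sketch.
BACKGROUND, UNDECIDED, FOREGROUND = -1.0, 0.0, 1.0

def seed_mask_from_points(image_shape, seed_points):
    """Rasterize (position, label) tuples into a mask with the image's shape."""
    mask = np.full(image_shape, UNDECIDED, dtype=np.float32)
    for position, label in seed_points:
        mask[position] = label  # position is a (row, column) tuple
    return mask

# A short foreground scribble of two points and a single background point.
seeds = [((0, 0), BACKGROUND), ((2, 2), FOREGROUND), ((2, 3), FOREGROUND)]
seed_mask = seed_mask_from_points((4, 5), seeds)
print(seed_mask)
```

Lines and shapes drawn by a user would first be decomposed into their individual positions and then passed in the same way.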

2.2 User Model

We propose a rule-based user model as a surrogate operator of the interactive segmentation system during training. The user model simulates a human user during the training phase of the neural network by altering the input of the network for each epoch. User input is considered additive.

For the initial interaction, binary erosion and binary dilation are performed on the voxel data [RRW09]. The foreground-labeled image elements after iterations of foreground erosion are combined with the background-labeled image elements after iterations of foreground dilation. This method prevents initial seed placement near the true contour line and mimics a quickly drawn rough estimate of the object to segment, as shown in Fig. 2 (a), where the outline of is depicted.
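A possible realization of this initial seed generation on a 2-D ground truth mask is sketched below. The 4-connected structuring element, the use of the same iteration count for erosion and dilation, the border handling, and the label values (-1 background, 0 undecided, 1 foreground) are all assumptions of this example.

```python
import numpy as np

def dilate(mask):
    """One step of binary dilation with a 4-connected cross (2-D only)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    return (padded[1:-1, 1:-1] | padded[:-2, 1:-1] | padded[2:, 1:-1]
            | padded[1:-1, :-2] | padded[1:-1, 2:])

def erode(mask):
    """One step of binary erosion, computed as dilation of the complement.
    Elements outside the image are therefore treated as foreground."""
    return ~dilate(~mask)

def initial_seed_mask(ground_truth, iterations):
    """Foreground seeds from the eroded object, background seeds outside the dilated object."""
    gt = np.asarray(ground_truth, dtype=bool)
    inner, outer = gt.copy(), gt.copy()
    for _ in range(iterations):
        inner = erode(inner)
        outer = dilate(outer)
    mask = np.zeros(gt.shape, dtype=np.float32)  # 0 = undecided band around the contour
    mask[inner] = 1.0    # foreground seeds
    mask[~outer] = -1.0  # background seeds
    return mask

ground_truth = np.zeros((9, 9), dtype=bool)
ground_truth[2:7, 2:7] = True
seeds = initial_seed_mask(ground_truth, iterations=1)
print(seeds)
```

More iterations widen the undecided band around the true contour, so fewer image elements carry a label.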

At subsequent iterations, the active user model takes the current binary segmentation and the ground truth as input, as depicted in Fig. 2 (b). The user model extracts the set of incorrect label assignments, i.e. the image elements at which segmentation and ground truth differ. It selects a subset of these elements uniformly at random (Fig. 2 (c)); the fraction to sample is a parameter of the user model (Sec. 3.2). Subsequently, these seed points are added to the current seeds to create the updated seed mask. Fig. 2 (d) depicts a choice of seeds, which is utilized to generate an improved segmentation in Fig. 2 (e). As shown in Fig. 3 (right), the proposed neural network uses the gray-valued image data as input as well as user information in form of the seed mask for each input image. The image and the seed mask are incorporated into the system as two separate input channels as depicted in Fig. 4.
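One update step of such a rule-based user model, followed by the assembly of the two-channel network input, might look as follows. This is a sketch under assumptions: the function names, the label values (-1 background, 0 undecided, 1 foreground), the rounding of the sample size, and the channels-first layout are not taken from the paper.

```python
import numpy as np

def update_seed_mask(seed_mask, segmentation, ground_truth, fraction, rng):
    """Add correctly labeled seeds at a random subset of misclassified elements."""
    seg = np.asarray(segmentation, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    wrong = np.flatnonzero(seg != gt)  # flat indices of incorrect label assignments
    updated = seed_mask.copy()         # additive: existing seeds are kept
    if wrong.size == 0:
        return updated
    count = max(1, int(round(fraction * wrong.size)))
    chosen = rng.choice(wrong, size=count, replace=False)
    index = np.unravel_index(chosen, gt.shape)
    updated[index] = np.where(gt[index], 1.0, -1.0)  # seeds carry the true label
    return updated

def network_input(image, seed_mask):
    """Stack gray values and seed mask as two separate input channels."""
    return np.stack([image, seed_mask], axis=0)

rng = np.random.default_rng(0)
ground_truth = np.zeros((6, 6), dtype=bool)
ground_truth[1:5, 1:5] = True
segmentation = np.zeros((6, 6), dtype=bool)
segmentation[1:5, 1:3] = True  # the right half of the object is missed
seed_mask = np.zeros((6, 6), dtype=np.float32)

new_mask = update_seed_mask(seed_mask, segmentation, ground_truth, 0.25, rng)
x = network_input(np.zeros((6, 6), dtype=np.float32), new_mask)
print(x.shape)
```

In this toy case, eight elements are misclassified, so a fraction of 0.25 yields two new foreground seeds.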

Figure 3: Traditional FCN training procedure (left) and proposed training method by user simulation (right).
Figure 4: Schematic FCN computation including user information as additional input (blue). Purple arrows represent further computational layers based on [RFB15] topology.

3 Experiments

3.1 Data

The data set used in this paper consists of volumetric images. They correspond to reconstructions of DYNA-CT acquisitions of human patients’ abdomina, with voxel resolutions from to . The hepatic lesions are fully annotated by two medical experts and can be fully embedded in manually selected cubic volumes of

voxels. This defines the fixed output size of the FCN. For the input volumes of interest (VOI), the dimensions of the output images have to be increased to compensate for the reduction of image dimensions in each consecutive hidden layer, which is caused by the border handling during individual convolution operations. Since the lesions are not on the border of the abdominal image volume, these cubic volumes can be padded with

voxels of surrounding gray-valued image data to VOIs of voxels. Due to the small number of fully annotated data sets available, and to reduce the time of the learning process as well as the number of trainable weights of the system, we use 2-D slices of the 3-D volumes as input for the FCN. Therefore, part of the spatial context information is neglected by the current system. The VOI cubes are sliced in transverse, coronal, and sagittal orientations. Planes which do not contain any tumor object information are discarded to preserve a more balanced label distribution over all input slices. The data is divided (per patient and volumetric image) into 2-D images for training, for validation, and images for testing ( in total).
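The slicing of a VOI cube along three orientations, with planes lacking tumor labels discarded, can be illustrated as below. The mapping of array axes to anatomical orientations, the function name, and the toy volume size are assumptions of this sketch.

```python
import numpy as np

def tumor_slices(volume, labels):
    """Yield (orientation, image_slice, label_slice) for all 2-D planes containing tumor labels."""
    # Which array axis corresponds to which orientation is an assumption here.
    names = ("transverse", "coronal", "sagittal")
    for axis, name in enumerate(names):
        for index in range(volume.shape[axis]):
            label_slice = np.take(labels, index, axis=axis)
            if label_slice.any():  # discard planes without tumor information
                yield name, np.take(volume, index, axis=axis), label_slice

# Toy 8x8x8 volume with a small box-shaped lesion.
volume = np.random.default_rng(0).random((8, 8, 8)).astype(np.float32)
labels = np.zeros((8, 8, 8), dtype=bool)
labels[2:5, 3:6, 1:3] = True

slices = list(tumor_slices(volume, labels))
print(len(slices))
```

The split into training, validation, and test sets would then be made per patient and volumetric image, before slicing, so that slices of one volume do not end up in different sets.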

3.2 UI-Net Parameters

We decided on a network depth of , initial filters of size , a batch size of , and epochs for training without early stopping. The learning rate of and momentum are set after training several FCNs on the same data set with varying parameters and evaluating their accuracy progression per epoch w. r. t. smoothness, overall slope, and position of the minimal validation loss value. Data augmentation via elastic deformations [SSP03] is used to increase the amount of training and validation data by a factor of four as an additional regularizer, counteracting the risk of over-fitting during training. A standard deviation of the Gaussian in pixels and scaling factor to control the deformations’ intensity are chosen. The active user model’s fraction of new input data to sample is set to , a reasonable value to simulate a human user. Users will not place seeds in all erroneously segmented areas of the image (manual segmentation), but rather sparsely add more seed points. Several variants are tested in order to observe the network’s behavior given more domain knowledge via the user input channel. The values for a user input mask are set to for background, for undecided, and for foreground. Due to the same distance to the value, the network does not inherently favor object or background labels while computing weighted sums during training. Values of are normalized between and accordingly.

3.3 UI-Net Seeds

Three experiments are conducted utilizing the UI-net architecture: (1) A varying number of erosion and dilation operations are used to generate inputs in order to examine segmentation quality w. r. t. additional domain knowledge inserted into the input layer. The smaller , the more information is provided. During training, no update step by the user model is utilized here, since the networks are trained with the same data in each epoch. We will refer to a network with property as static during training. (2) To infer whether additional input data provided by the rule-based user model improves the segmentation quality, seeds are generated with an alternative system to (1). A varying number of seeds () are sampled uniformly at random from , where . Static training is used. (3) Multiple simulated user interactions with UI-nets are evaluated. The UI-nets are trained with as initial seeds and the user model proposed in Sec. 2.2.

4 Results and Discussion

UI-nets were trained and evaluated with different seeding approaches. As depicted in Fig. 5 (a, b), the number of given seed points correlates with the overall segmentation quality (experiments (1) and (2)). For an evaluation with the actual user model, the interactive user input version of the UI-net performs best as depicted in Fig. 5 (c) (experiment (3)). The UI-net trained with an interacting user model consistently performs better with each additional input provided by the user, continuously improving its segmentation results. Training an FCN with a user model which reacts to deficiencies in current segmentation results during training can therefore improve the overall segmentation result. As visualized in Fig. 5 (c, d), UI-net yields superior segmentation results w. r. t. the interactive and non-learning based GrowCut approach. We found an average improvement in Dice score given the same images and active user model.
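The evaluation over several simulated interactions can be outlined as a loop that alternates between segmentation and user model. The sketch below uses a trivial stand-in segmenter that only copies seed labels into a fixed guess; it is not the UI-net and serves solely to make the protocol runnable. All names, the sampling fraction, and the label values are assumptions.

```python
import numpy as np

def dice(a, b):
    """Sørensen-Dice coefficient of two non-empty binary masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def stand_in_segmenter(initial_guess, seed_mask):
    """Placeholder for a trained network: overrides a fixed guess at seeded elements."""
    result = initial_guess.copy()
    result[seed_mask > 0] = True
    result[seed_mask < 0] = False
    return result

def simulate_interactions(initial_guess, ground_truth, fraction, steps, rng):
    """Record the Dice score before each simulated user interaction."""
    seed_mask = np.zeros(ground_truth.shape, dtype=np.float32)
    scores = []
    for _ in range(steps):
        segmentation = stand_in_segmenter(initial_guess, seed_mask)
        scores.append(dice(segmentation, ground_truth))
        wrong = np.argwhere(segmentation != ground_truth)
        if len(wrong) == 0:
            break
        count = max(1, int(round(fraction * len(wrong))))
        picked = wrong[rng.choice(len(wrong), size=count, replace=False)]
        rows, cols = picked[:, 0], picked[:, 1]
        seed_mask[rows, cols] = np.where(ground_truth[rows, cols], 1.0, -1.0)
    return scores

rng = np.random.default_rng(0)
ground_truth = np.zeros((16, 16), dtype=bool)
ground_truth[4:12, 4:12] = True
initial_guess = np.zeros((16, 16), dtype=bool)
initial_guess[6:14, 6:14] = True  # shifted copy of the object

scores = simulate_interactions(initial_guess, ground_truth, 0.2, 5, rng)
print(scores)
```

With a real network in place of the stand-in, the recorded scores per iteration correspond to the kind of curves shown in Fig. 5 (c, d).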




Figure 5: UI-nets trained with (a) varying contour width and (b) randomized seed masks to infer the general ability to learn from (static) user input on first iteration data only. The evaluation of several iterations, during an interactive segmentation by an active user model, is displayed in the bottom row (c, d). The test set for iteration is always the same. Interactive seed changes occur only after the first iteration. Legend: widthfraction of random initial seeds,iteration, as in (b, iteration).

5 Conclusion and Outlook

We described a method to incorporate user scribbles as additional input for a semi-automatic neural network image segmentation. A user model simulates plausible interactions of a user during the learning phase of the network. In contrast to traditional FCNs, during each classification, new user input ground truth is generated by an operator and included into the input of the network. The UI-net can be subsequently trained with this information. The UI-net learns to incorporate the user information into the process of classification. The proposed UI-net architecture can be superior to fully automated approaches in terms of highly accurate segmentation results, especially in medical applications where only few data sets need to be processed and only a small database of fully annotated images is available for training. The interactive user input version would need more training epochs than a network with static user input for equivalent results, if the test setup is non-interactive as in Fig. 5 (a).

The described technique to include user information into an FCN segmentation system can also be implemented via transfer learning from pre-trained non-interactive FCNs. Here, a second FCN is trained to fine-tune the existing model with user data as an additional input besides the output of the first net. The first FCN then acts as a feature extractor [YCBL14] and can be trained separately from the second net for user interaction. The proposed user model is a rule-based system to simulate a user’s behavior. Other rule sets [NRKR10, ZND11] and learning-based systems remain to be evaluated as active user models. User models that are better adapted during training may further improve the segmentation process by reducing the number of interactions a user needs to achieve the same segmentation result.

Disclaimer: The concept and software presented in this paper are based on research and are not commercially available. Due to regulatory reasons its future availability cannot be guaranteed.


  • [AGSM16] Amrehn M. P., Glasbrenner J., Steidl S., Maier A. K.: Comparative evaluation of interactive segmentation approaches. Bildverarbeitung für die Medizin (BVM) (2016), 68–73.
  • [Dic45] Dice L. R.: Measures of the amount of ecologic association between species. Ecology (1945), 297–302.
  • [KCL05] Kim H.-C., Chung J. W., Lee W., Jae H. J., Park J. H.: Recognizing extrahepatic collateral vessels that supply hepatocellular carcinoma to avoid complications of transcatheter arterial chemoembolization. Radiographics (2005), 25–39.
  • [LBH15] LeCun Y., Bengio Y., Hinton G.: Deep learning. Nature (2015), 436–444.
  • [LGLS11] Lewandowski R. J., Geschwind J.-F., Liapi E., Salem R.: Transcatheter intraarterial therapies: rationale and overview. Radiology (2011), 641–657.
  • [LNT02] Lo C.-M., Ngan H., Tso W.-K., Liu C.-L., Lam C.-M., Poon R. T.-P., Fan S.-T., Wong J.: Randomized controlled trial of transarterial lipiodol chemoembolization for unresectable hepatocellular carcinoma. Hepatology (2002), 1164–1171.
  • [LSD15] Long J., Shelhamer E., Darrell T.: Fully convolutional networks for semantic segmentation. Computer Vision and Pattern Recognition (CVPR) (2015), 3431–3440.
  • [NRKR10] Nickisch H., Rother C., Kohli P., Rhemann C.: Learning an interactive segmentation system. Computer Vision, Graphics and Image Processing (ICVGIP) (2010), 274–281.
  • [OS01] Olabarriaga S. D., Smeulders A. W. M.: Interaction in the segmentation of medical images: A survey. Medical Image Analysis (MIA) (2001), 127–142.
  • [RFB15] Ronneberger O., Fischer P., Brox T.: U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015), 234–241.
  • [RPN15] Rupprecht C., Peter L., Navab N.: Image segmentation in twenty questions. Computer Vision and Pattern Recognition (CVPR) (2015), 3314–3322.
  • [RRW09] Rhemann C., Rother C., Wang J., Gelautz M., Kohli P., Rott P.: A perceptually motivated online benchmark for image matting. Computer Vision and Pattern Recognition (CVPR) (2009), 1826–1833.
  • [SSP03] Simard P. Y., Steinkraus D., Platt J. C.: Best practices for convolutional neural networks applied to visual document analysis. Document Analysis and Recognition (ICDAR) (2003), 958–962.
  • [VK05] Vezhnevets V., Konouchine V.: GrowCut: Interactive multi-label ND image segmentation by cellular automata. Computer Graphics and Applications (Graphicon) (2005), 150–156.
  • [WGCM16] Würfl T., Ghesu F. C., Christlein V., Maier A. K.: Deep learning computed tomography. Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2016), 432–440.
  • [YCBL14] Yosinski J., Clune J., Bengio Y., Lipson H.: How transferable are features in deep neural networks? Neural Information Processing Systems (NIPS) (2014), 3320–3328.
  • [ZND11] Zhao Y., Nie X., Duan Y., Huang Y., Luo S.: A benchmark for interactive image segmentation algorithms. Person-Oriented Vision (POV) (2011), 33–38.