Neurofibromatosis type 1 (NF1) is an autosomal dominant neurogenetic disorder characterized by the development of both benign and malignant tumors. The hallmark tumors are neurofibromas, histologically benign tumors that arise from the peripheral nerve sheath and can involve any part of the body. Despite their benign histology, they can cause significant morbidity through compression and invasion of nerves and other vital anatomical structures. Neurofibromas can be located deep inside the body and, if asymptomatic, are usually detected by whole-body magnetic resonance imaging (WBMRI) using short tau inversion recovery (STIR) sequences. Based on tumor morphology on MRI, plexiform neurofibromas (PNFs), which are invasive or involve multiple nerves, carry an increased risk of transformation into malignant peripheral nerve sheath tumors.
Fig. 1 depicts two NF1 cases of WBMRI with the ground truth segmentation of tumor regions contoured in yellow. Accurate detection and evaluation of tumor burden on WBMRI are important for longitudinal tracking of tumor size, which enables accurate assessment of tumor growth and treatment response. However, the detection and segmentation of neurofibromas on WBMRI, particularly PNFs, involves three technical challenges.
Large number of tumors across the entire body and variable anatomical locations. Neurofibromas can develop anywhere along peripheral nerves. Their appearance varies across individuals in number (from none to hundreds) and size (from several cubic centimeters to several thousand cubic centimeters). Traditional interactive segmentation methods cannot reasonably handle such a large number of tumors at a time. Conventional segmentation methods for neurofibromas have been proposed in the literature, including histogram thresholding, region growing using histogram templates, and 3D dynamic-threshold level sets. These interactive methods are labor-intensive and time-consuming, since they involve manual identification of individual tumors by raters and interactive correction of the imperfect contours obtained from automated methods. In general, completing segmentation on a WBMRI may take from a few minutes to 1-2 hours.
Heterogeneous and diffuse tumor architecture. PNFs can be elongated in shape with a characteristic ringlike or septate pattern that typically has a target-like appearance on MRI, with central low signal intensity and peripheral high signal intensity. Recently, deep convolutional neural networks (CNNs), such as U-Net, V-Net, DeepMedic, and nnU-Net, have achieved great success in medical image segmentation. CNNs have brought breakthroughs for tumor segmentation in the brain, lung, liver, and other organs. Nevertheless, their application to neurofibroma segmentation on WBMRI has been minimal. Moreover, CNN-based approaches tend not to generalize well to new data, because the targeted neurofibromas may differ substantially in size, shape, intensity, and boundaries with adjacent organs between the training and testing data sets. Here, we explore how to embed user interactions into CNNs to improve generalizability and obtain an accurate and efficient interactive segmentation approach for WBMRI.
Guide maps suffer from distribution shift under variable image sizes. Some CNN-based interactive segmentation approaches [39, 34] have been proposed to extract foreground objects interactively. These methods convert user interactions into distance maps using either the Euclidean distance transform (EDT) or the geodesic distance transform (GDT). Training CNNs on image patches and fine-tuning/testing on whole images is a common trade-off between GPU memory and accuracy/inference speed [29, 26, 23]. However, both transforms are sensitive to image size, which leads to a distribution shift in the guide maps across sizes and a performance decrease when applied to neurofibroma data.
This paper proposes deep interactive neural networks (DINs) for interactive neurofibroma segmentation on WBMRI. We first adapt the popular 3D U-Net to neurofibroma data on WBMRI by introducing anisotropic convolutional kernels for more accurate tumor-related feature extraction. Then, user interactions are encoded into guide maps by a distance transformation, provided as inputs, and embedded into multiple layers of the model to preserve user knowledge in deeper layers. The guide maps serve as a local appearance prior and a spatial prior. To avoid the effect of variable image sizes, we propose the exponential distance transform (ExpDT), whose intensity distribution is size-agnostic. With the guide maps, DINs embed users' prior knowledge into the neural network to correct segmentation results. Furthermore, to reduce the interaction effort during training and testing of DINs, we develop a strategy to simulate user interactions, which synthesizes various user interaction patterns and enables rapid exploration of the best hyper-parameters. It is well known that medical images acquired from different devices exhibit a distribution shift that cannot be neglected and may lead to poor generalization of CNN models. In this situation, we experimentally demonstrate that DINs are more robust and stable than previous automated and interactive methods.
To evaluate DINs, we collected two WBMRI data sets and a local-region MRI (LRMRI) data set from NF1 patients, obtained using different MRI acquisition parameters. Experiments showed that DINs significantly outperformed automated methods by 44% in Dice similarity coefficient (DSC), demonstrating the effectiveness of CNN-based interactive segmentation. DINs outperformed other CNN-based interactive methods by 29%, and ExpDT outperformed other distance transforms (DTs) by 14% in DSC. Furthermore, comparisons with conventional interactive methods showed that DINs significantly reduced user interactions and running time.
The main contributions of this work are summarized as follows:
We propose DINs to cope with the challenges of neurofibromas for interactive segmentation on WBMRI.
We introduce ExpDT for integrating user interactions into neural networks. ExpDT is size-independent compared with other common DTs and is therefore more suitable for WBMRI.
We propose a deep interactive module to integrate user knowledge into the deeper layers of the model, which effectively enhances the learned features about neurofibromas and improves the segmentation performance.
We develop a strategy to simulate user interactions for training 3D interactive neural network models.
DINs outperform automated and interactive methods by 44% and 29% in DSC, respectively, and ExpDT outperforms other DTs by 14% in DSC. Furthermore, DINs reduce user interactions and running time compared with conventional methods.
II Related Work
II-A Neurofibroma segmentation
A few interactive and semi-automated segmentation methods have been developed for NF1 in the literature. Solomon et al. developed an interactive 2D segmentation method for PNF that detects tumor regions within a manually defined area on each slice using a histogram-based threshold. This method fails if the histogram is unimodal or close to unimodal; thus, manual contouring was frequently required to correct the resulting contours. Following this idea, Weizman et al. proposed 15 histogram templates of various distributions, from bimodal to unimodal, to identify the optimal threshold. Cai et al. developed the 3DQI system for semi-automated neurofibroma segmentation, which uses a dynamic-threshold level-set method initialized from a seed region. These existing methods required a large amount of interaction time and effort from users, either slice by slice or tumor by tumor, with user-provided scribbles or initial seeds. However, as shown in Fig. 1, there may be dozens or hundreds of tumors in a single study, which puts a heavy interaction burden on users. Some works [37, 11] introduced neural networks into the segmentation of neurofibromas. Wu et al. integrated CNNs into the active contour model to predict the parametric maps, while Ho et al. compared a multi-spectral neural network classifier with manual segmentation on diffusion-weighted imaging data. However, these methods still relied on conventional segmentation methods, and state-of-the-art deep CNNs had not yet been explored for neurofibroma segmentation. Therefore, a highly accurate and efficient neurofibroma segmentation method remained a technical challenge.
II-B CNNs in medical image segmentation
CNNs have been successfully adopted in various medical image segmentation applications. In particular, U-Net was designed for medical image semantic segmentation with symmetric encoder and decoder paths; long skip connections between the two paths enhance the fusion of multi-level features. Building on this design, many encoder-decoder CNNs have subsequently been introduced for 2D and 3D medical image segmentation. 3D U-Net extended U-Net from 2D to 3D volumetric image segmentation using 3D convolutions and a training strategy with sparse annotation. V-Net and HighRes3DNet incorporated residual modules into their network structures; the difference between them is that V-Net enlarges the receptive field by downsampling feature maps with large-stride convolutions, while HighRes3DNet adopts dilated convolutions. In addition, cascaded networks [5, 23] were also explored to improve the learning ability of the model. H-DenseUNet adopted a multi-stage strategy, cascading 2D and 3D networks to jointly fuse and optimize the learned intra-slice and inter-slice features for better liver and tumor segmentation. These methods have achieved promising results in various tumor and organ segmentation applications. However, existing CNN models did not generalize well to tumors in new data sets with spatial and density distributions different from the training set (such as the variable locations and tumor morphology in NF1).
II-C Interaction in CNNs
Semi-automated image segmentation techniques achieve high accuracy with minimal interaction effort by letting users provide cues that guide the segmentation algorithm, such as the clicks or scribbles used in Graph Cut and Random Walk. Inspired by these techniques, recent works have introduced user interactions as extra channels of the input images in CNNs [39, 34]. The user interactions are typically transformed into guide maps by a DT. We review two well-known DTs:
Geodesic distance transform. Wang et al. proposed two networks for 2D placenta segmentation: a proposal network (P-Net) for initial segmentation and a refinement network (R-Net) for refined segmentation. GDT was used to transform user interactions into intensity-aware distance maps that provide auxiliary information for accurate segmentation.
Both methods transform user clicks into guide maps using a DT. However, the intensity distributions of the resulting distance maps closely depend on the image size, which is commonly variable in 3D medical image segmentation. Inconsistent intensity distributions of the input guide maps significantly affect the segmentation accuracy of CNN-based models. Instead of interacting at the training stage, image-specific fine-tuning that incorporates user interactions at the test stage is another solution. Nevertheless, fine-tuning a deep neural network at runtime requires massive computational resources, which are unavailable in most situations.
In the context of neurofibroma segmentation on WBMRI, we propose deep interactive neural networks (DINs) for 3D medical image semantic segmentation, inspired by deep object selection for interactive segmentation of natural images and feature modulation [27, 40] for conditional control of neural networks. DINs employ an encoder-decoder backbone with an embedded deep interactive module (DIM). The DIM influences the network's segmentation via image-specific information generated from user interactions represented by distance maps. The structure of DINs is shown in Fig. 2. Furthermore, for efficient training of DINs, we propose a strategy to simulate user interactions in the training process, thereby avoiding the manual creation of thousands of training samples. This strategy can also be used to evaluate the performance of DINs and to tune hyper-parameters quickly.
III-A Exponential distance transform
The DT of a binary mask specifies the minimum distance from each pixel to the boundaries of the non-zero regions, where the distances may be signed to distinguish between the inside and outside of the non-zero regions. Instead of the boundaries, unsigned DTs compute minimum distances to the whole masked regions. Various unsigned DTs have been studied for image segmentation in the literature [7, 39, 34]. Given an N-dimensional gray image $I$ and a corresponding binary mask $M$, $I(p)$ represents the image intensity at point $p$, and $M(p) \in \{0, 1\}$. The point set $S = \{p \mid M(p) = 1\}$ is considered the point set of user interactions. The DT of $I$ with respect to $S$ is formulated as:

$$D(p) = \min_{q \in S} d(p, q), \qquad (1)$$

where $d(p, q)$ is a specific distance function between two points in an image. For the Euclidean distance and the geodesic distance, $d$ can be uniformly defined as:

$$d(p, q) = \min_{\Gamma \in \mathcal{P}_{p,q}} \int_0^1 \sqrt{\lambda \|\Gamma'(t)\|^2 + (1 - \lambda)\left(\nabla I(\Gamma(t)) \cdot \mathbf{u}(t)\right)^2}\,\mathrm{d}t, \qquad (2)$$

where $\mathcal{P}_{p,q}$ is the set of all paths between the points $p$ and $q$, and $\Gamma$ is such a path, parameterized by $t \in [0, 1]$. $\Gamma'(t)$ is the derivative of $\Gamma$ with respect to $t$, and $\mathbf{u}(t) = \Gamma'(t)/\|\Gamma'(t)\|$ is a unit vector along the tangent direction of $\Gamma$. If $\lambda = 1$, Equation (1) is called the Euclidean distance transform, which is not conditioned on the image intensity and thus degenerates to $d(p, q) = \|p - q\|$. If $\lambda = 0$, it becomes a geodesic distance transform. When $0 < \lambda < 1$, Equation (1) is a combination of the two distances. A common characteristic of the two distance functions is that they strongly depend on the image size: for images of different sizes, the intensity distribution is significantly different. We call a DT with this property a “global transform”. An illustration of EDT, GDT, and their blended DT is shown in Fig. 3 (c-e). It can be seen that the grayscale (shown as the color bar) varies significantly with the image size (shown as the dotted box).
Compared with popular CNN models that take fixed-size images as input for classification [30, 10], FCN-like models such as U-Net remove the densely connected layers and can thus accept input images of arbitrary sizes and produce correspondingly sized outputs. Furthermore, FCN-like models allow the size of inference images to differ from that of training images, which is crucial for segmenting large 3D medical images such as WBMRI given limited GPU memory and insufficient training samples. For example, we may need to train a 3D U-Net on small volume patches due to limited GPU memory, but run inference on the whole volume, since inference does not need to store parameter gradients. Most operators in CNNs, such as addition, multiplication, ReLU, convolution, and max pooling, are either element-wise or window-wise, so predictions are hardly affected if image patches are expanded or clipped (i.e., the size changes). However, the distribution of an integrated global DT depends on the actual image size, and inconsistent image sizes between the training and inference stages therefore lead to distribution inconsistency, which may impact segmentation performance. We thus propose a “local transform”, the exponential distance transform, that is unaffected by variable image sizes.
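The size dependence of a “global transform” can be seen in a minimal numpy sketch (the function name is ours, not from the paper): the same single centered click produces very different value ranges as the image grows, which shifts the guide-map distribution between training patches and whole-volume inference.

```python
import numpy as np

def edt_guide_map(shape, clicks):
    """Euclidean distance from every pixel to its nearest click ("global transform")."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), axis=-1)
    dists = [np.linalg.norm(grid - np.asarray(c), axis=-1) for c in clicks]
    return np.min(dists, axis=0)

small = edt_guide_map((64, 64), [(32, 32)])
large = edt_guide_map((256, 256), [(128, 128)])
print(small.max(), large.max())  # the maximum distance grows with the image size
```

A 3D version only needs a third coordinate in `shape` and `clicks`; the size dependence is the same.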
The ExpDT is formulated as:

$$D_{\exp}(p) = \max_{q \in S} \exp\!\left(-\frac{\|p - q\|^2}{2\sigma^2}\right), \qquad (4)$$

with the scale parameter $\sigma$ controlling the influence of the points in $S$ on surrounding points. As shown in Fig. 3 (f), ExpDT is a locally enhanced distance map: pixels with high gray levels are tightly gathered near the points in $S$, so ExpDT is hardly affected by variable image sizes. When $\sigma \to 0$, ExpDT tends to form spikes at the points in $S$; when $\sigma \to \infty$, ExpDT becomes flat and loses locality. Different from the previous two transforms, which use min to compute distances to $S$, ExpDT uses max due to the negative sign in Equation (4). From the perspective of DTs alone, ExpDT neither has global attributes nor incorporates image intensity. We argue that, within the proposed DINs framework, CNNs can still learn discriminative features from the locally enhanced ExpDT.
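The size-agnostic behavior can be sketched in numpy, assuming a Gaussian form $\exp(-\|p-q\|^2/(2\sigma^2))$ for the transform, which is consistent with the described limiting behavior of the scale parameter (the function name and default `sigma` are illustrative assumptions):

```python
import numpy as np

def expdt_guide_map(shape, clicks, sigma=10.0):
    """ExpDT guide map: max over clicks of exp(-||p - q||^2 / (2 * sigma^2)).

    Values always lie in (0, 1], so the intensity distribution does not
    depend on the image size ("local transform")."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), axis=-1)
    maps = [np.exp(-np.sum((grid - np.asarray(c)) ** 2, axis=-1) / (2.0 * sigma ** 2))
            for c in clicks]
    return np.max(maps, axis=0)

small = expdt_guide_map((64, 64), [(32, 32)])
large = expdt_guide_map((256, 256), [(128, 128)])
```

Both maps peak at 1 at the click and decay to near zero within a few multiples of `sigma`, regardless of image size.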
III-B Structure of DINs
The structure of DINs follows the encoder-decoder scheme with skip connections, like 3D U-Net. However, 3D U-Net was originally evaluated on organs with fixed sizes and balanced pixel spacings, such as the Xenopus kidney, and is not suitable for the highly variable presentation of neurofibromas. Considering that WBMRI has an approximate average shape of  with a pixel spacing of , we fix the size of the input image to , which is about half the size of the original image. It should be noted that when the pixel spacings of an image differ significantly, isotropic resampling is not a good choice, due to the missing inter-slice information. Therefore, we build the backbone network with convolutional layers that have different kernel sizes and strides. There are four downsamplings in the coronal plane, but only one in the orthogonal direction. Instead of max pooling layers, downsampling is implemented by large-stride convolutions to save GPU memory and allow a larger batch size. Upsampling is performed by deconvolutional layers. Batch normalization (BN), which is used to reduce internal covariate shift and stabilize training, commonly performs poorly with small batch sizes. In addition, Isensee et al. experimentally demonstrated that instance normalization (IN) performs better than BN on medical images. Therefore, we apply IN after the convolutional layers, followed by ReLU activation. For clarity, we list all details of the internal layers of DINs in Table I.
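The key difference between IN and BN is that IN computes its statistics per sample and per channel over the spatial axes only, so it is insensitive to the batch size. A minimal numpy sketch (our own illustration, not the paper's implementation):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization for (N, D, H, W, C) tensors.

    Statistics are computed per sample and per channel over the spatial
    axes only, so the result is independent of the batch size (unlike
    batch normalization, which averages across the batch axis)."""
    axes = tuple(range(1, x.ndim - 1))          # spatial axes only
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

Normalizing a batch of two volumes and normalizing either volume alone give identical results, which is exactly the property that makes IN robust to the small batch sizes forced by large 3D inputs.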
|Modules||Details of the layers|
|Input||1 channel concat: DIM output 1|
|E1||[conv k: s: ]|
|Output||conv* k: s: softmax|
|“k: 133, 30” – kernel size (1, 3, 3) with 30 output channels|
|“s: 111” – stride (1, 1, 1)|
To incorporate user interactions into deep neural networks, we develop a deep interactive module (DIM). Leveraging feature modulation, the DIM embeds additional image-specific information into the network backbone and guides the model to focus on the features enhanced by the distance maps, or so-called guide maps, as shown in Fig. 2. The DIM consists of an ExpDT, a max pooling layer, and a convolutional layer, transforming user interactions into guide maps of two different sizes (DIM output 1 and DIM output 2). Concretely, user interactions are transformed by ExpDT into a foreground guide map and a background guide map, which are integrated into the input layer by concatenation with the raw image to form a three-channel input. The two guide maps are further encoded and integrated into the deepest layer of the encoder, to prevent the guide information from being gradually diluted as more complex features are extracted. A detailed experiment on the position where the DIM outputs are inserted is presented in Section V-C3. As shown in Fig. 4, the downsampled guide maps are added to the output of the first normalization layer in the deepest layer of the encoder, followed by a ReLU activation function. The layers in the decoder path need no further integration, since the guide information is passed on through the skip connections.
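The deep injection (DIM output 2) can be sketched in 2D numpy. This is a simplified stand-in, not the paper's implementation: the pooling factor is inferred from the shapes, and the convolutional layer is reduced to a 1x1 convolution expressed as a matrix product; all names are ours.

```python
import numpy as np

def max_pool2d(x, k):
    """Non-overlapping k x k max pooling of a 2D map."""
    h, w = x.shape
    return x[: h - h % k, : w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def dim_inject(feat, fg_guide, bg_guide, w):
    """Add encoded guide maps to the deepest encoder features, then apply ReLU.

    feat: (h, w, C) features after the first normalization layer;
    fg_guide / bg_guide: full-resolution guide maps;
    w: (2, C) weights of a 1x1 convolution (a plain matmul here)."""
    k = fg_guide.shape[0] // feat.shape[0]                         # pooling factor
    g = np.stack([max_pool2d(fg_guide, k), max_pool2d(bg_guide, k)], axis=-1)
    return np.maximum(feat + g @ w, 0.0)                           # add, then ReLU
```

The addition before the ReLU mirrors the feature-modulation view: the guide maps bias which features survive the activation, rather than replacing them.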
III-C Simulating strategy
Simulating user interactions in the training and evaluation stages not only frees users from the burdensome interactive work of generating thousands of training samples, but also accelerates the exploration of optimal hyper-parameters. Our strategy for simulating user interactions is based on the work in  and extends it to the setting of 3D images. Let $G$ denote the ground truth segmentation of an image $I$, and let $O$ denote the set of foreground pixels, i.e., those satisfying $G(p) = 1$. We define the background region surrounding the objects as:

$$B = \{\, p \mid G(p) = 0,\ d_{euc}(p, O) < d_b \,\},$$

where $d_{euc}(p, O)$ is the Euclidean distance between the point $p$ and the set $O$, and $d_b$ is the bandwidth.
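The background sampling band described above can be computed with a brute-force numpy sketch (a real implementation would use a distance transform; the function name is ours):

```python
import numpy as np

def background_band(fg_mask, bandwidth):
    """Background pixels whose Euclidean distance to the object is below
    the bandwidth. Brute force over all background pixels, for clarity."""
    fg_pts = np.argwhere(fg_mask)
    band = np.zeros_like(fg_mask)
    for p in np.argwhere(~fg_mask):
        if np.linalg.norm(fg_pts - p, axis=1).min() < bandwidth:
            band[tuple(p)] = True
    return band
```

The band contains only background pixels, so negative clicks sampled from it never land inside a tumor.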
For 2D natural images, Xu et al. proposed to sample positive clicks randomly from $O$, with the number of sampled points following a discrete uniform distribution from 1 to an upper bound. Negative clicks were either randomly selected from the whole background (random selection) or evenly selected from $B$ (uniform selection). The number of negative points could be zero but did not exceed its upper bound. However, if we directly use the same upper bounds in 3D images, user interactions become quite sparse due to the additional axis, which we experimentally found to be harmful to model performance. Therefore, we adapt the upper bounds to the 3D setting as follows:
This strategy may not be the best choice, but it is a simple and effective way to determine a good upper bound for the number of clicks sampled at the training stage. In addition, positive and negative clicks should not be sampled from the -pixel region near the boundaries. Considering the large inter-slice spacing of WBMRI and the infiltrative MRI appearance of neurofibromas, this restriction was applied only within individual slices. Finally, at least  pixels should be kept between any two points in each dimension.
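A simplified numpy sketch of sampling positive clicks under these rules (our own illustration: the boundary-margin rule is omitted, and a Chebyshev-distance check stands in for the per-dimension spacing rule; names and defaults are assumptions):

```python
import numpy as np

def simulate_positive_clicks(fg_mask, n_max, d_step=10, rng=None):
    """Sample n ~ Uniform{1..n_max} positive clicks from the foreground,
    keeping any two accepted clicks at least d_step apart (Chebyshev
    distance here, standing in for the per-dimension spacing rule)."""
    rng = rng or np.random.default_rng()
    n = int(rng.integers(1, n_max + 1))
    candidates = np.argwhere(fg_mask)           # all foreground voxels
    rng.shuffle(candidates)                     # random visiting order
    clicks = []
    for p in candidates:
        if all(np.abs(p - q).max() >= d_step for q in clicks):
            clicks.append(p)
        if len(clicks) == n:
            break
    return clicks
```

Negative clicks can be drawn the same way from the background band or the whole background.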
During evaluation, simulation is performed by placing the next positive/negative click at the center of the largest error region, obtained from the symmetric difference between the current prediction and the ground truth. Specifically, if the largest error region $R$ is part of a foreground object, the next click is a positive point placed at the center $c$ of $R$. If $c \notin R$, we replace $c$ by:

$$c^{*} = \arg\min_{p \in \mathrm{Skel}(R)} \|p - c\|,$$

where $\mathrm{Skel}(R)$ is the skeleton of the region $R$; that is, $c^{*}$ is the point of the skeleton of $R$ nearest to $c$. This situation may occur when $R$ is concave. In this way, we guarantee that positive points are never placed in the wrong region (background), and vice versa. The maximum number of clicks on a single study is limited to . We set a threshold DSC (see Section IV-B) of  in the cross-validation experiments. If the target threshold cannot be reached within the click limit, we terminate the interaction for the current study.
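The evaluation-time click placement can be sketched as follows. This is a deliberately simplified stand-in: it treats all errors as one region (no connected-component analysis), and snapping to the nearest error voxel replaces the skeleton projection used for concave regions.

```python
import numpy as np

def next_click(pred, gt):
    """Place the next corrective click at the center of the error region.

    Returns (click, is_positive); (None, None) if prediction is perfect.
    Simplifications: a single error region, and snapping to the nearest
    error voxel instead of projecting onto the region's skeleton."""
    err = pred != gt                                # symmetric difference
    if not err.any():
        return None, None
    pts = np.argwhere(err)
    c = np.round(pts.mean(axis=0)).astype(int)      # center of the region
    if not err[tuple(c)]:                           # center fell outside (concave region)
        c = pts[np.argmin(np.linalg.norm(pts - c, axis=1))]
    is_positive = bool(gt[tuple(c)])                # under-segmented -> positive click
    return tuple(int(v) for v in c), is_positive
```

Because the click is forced back inside the error region, a positive click can never land in the background, matching the guarantee stated above.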
|(EDT + GDT)-half||0.5||0.5||-||0.70||0.43||0.55||5.6||6.4||12.0|
|(EDT + GDT)-half||0.5||0.5||-||0.43||0.68||30.65||4.1||15.9||<0.01|
 denotes the p-value of the t-test between the results of ExpDT and the other DTs.
IV Experimental settings
IV-A Data set and preprocessing
We collected two WBMRI data sets and an LRMRI data set from NF1 patients; one WBMRI data set was used as the training set and the remaining two as testing sets. Both WBMRI data sets were acquired on 1.5-T MR scanners (MAGNETOM Avanto fit, Siemens Medical Systems, USA) using different software (Syngo MR 2004 V for the training set, Syngo MR E11 for the testing set). We did not shuffle and reassign the training and testing sets, so as to evaluate the ability of DINs to handle such a complicated situation. The training set contained 125 studies with 1156 NF1 tumors manually contoured by clinicians with expertise in identifying peripheral nerve sheath tumors. Their sizes and pixel spacings are described in Section III-B. We adopted online data augmentation to reduce overfitting, including random cropping from the MRI scans, scaling between 1.0 and 1.25, rotation by an angle sampled from a Gaussian distribution, flipping in all three dimensions, and gamma transformation with a range of . The WBMRI testing set was composed of 33 studies with an approximate dimension of  and a unified spacing of . A total of 224 tumors were manually contoured in this data set. The LRMRI testing set contained 45 studies with various dimensions (from  to ) and spacings (from  to ).
IV-B Implementation details
We used weighted cross-entropy as the loss function, with weighting factors of 1.0 and 3.0 for background and foreground pixels, respectively. The Adam optimizer with  and  was used to update the model parameters. The learning rate was initially set to  and reduced by a factor of 0.2 whenever the validation loss did not decrease for 30 epochs, down to a minimum learning rate of . We trained 200 batches per epoch with a batch size of 8 and terminated training after 250 epochs. We forced 50% of the images in each batch to include tumors, while the others were randomly cropped without restriction. We implemented DINs with the TensorFlow package in Python and conducted experiments on a single Tesla V100 GPU with 32 GB of memory. We utilized ITK-SNAP, 3DQI, and imcut for comparison with Active Contour, Random Walk, and Graph Cut, respectively.
Foreground and background clicks were simulated following the strategies described in Section III-C. In DIOS,  is set to  for 2D image segmentation; we therefore set  accordingly. The maximum number of background clicks was set to the same value for simplicity. It should be noted that this is a crucial hyper-parameter for acceptable performance, which we discuss in Section V-C.  was set to 3 pixels by default; to preserve tumors smaller than 6 pixels in either direction, we removed this restriction for such small tumors.  and  were set to 10, and  was relaxed to 1 as a compromise for the huge inter-slice spacing. The bandwidth  was set to 40 pixels.
In the evaluation, the Dice similarity coefficient (DSC), the total number of clicks, the volumetric overlap error (VOE), and the absolute relative volume difference (ARVD) were the primary evaluation metrics. The numbers of positive and negative clicks were also logged for comparison. Let $P$ and $G$ denote the binary prediction and the ground truth, respectively. The three metrics are formulated as:

$$\mathrm{DSC} = \frac{2\,|P \cap G|}{|P| + |G|}, \quad \mathrm{VOE} = 1 - \frac{|P \cap G|}{|P \cup G|}, \quad \mathrm{ARVD} = \left|\frac{|P|}{|G|} - 1\right|,$$

where $|\cdot|$ denotes the number of non-zero elements and the outer bars in ARVD denote the absolute value. In the interactive evaluation process, unless otherwise specified, user interactions were continuously provided until either 20 clicks or the threshold of 0.8 DSC was reached in all of the following experiments.
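The three overlap metrics are straightforward to compute from binary masks; a short numpy sketch (function names ours):

```python
import numpy as np

def dsc(pred, gt):
    """Dice similarity coefficient: 2|P ∩ G| / (|P| + |G|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def voe(pred, gt):
    """Volumetric overlap error: 1 - |P ∩ G| / |P ∪ G|."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 - inter / union

def arvd(pred, gt):
    """Absolute relative volume difference: | |P| / |G| - 1 |."""
    return abs(pred.sum() / gt.sum() - 1.0)
```

A perfect prediction gives DSC = 1, VOE = 0, and ARVD = 0; note that ARVD alone can be 0 even for a wrongly placed prediction of the correct volume, which is why it is reported alongside the overlap metrics.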
V Results and Discussion
V-A Comparison of ExpDT with EDT and GDT
In this section, we compare ExpDT with EDT, GDT, and three other variants. We remove DIM output 2 in this group of experiments to highlight the contribution of the DT. The suffix “half” denotes input images downsampled to half of the original image patches. The cross-validation results are listed in Table II. With the settings of , ExpDT outperforms EDT by 12% DSC and reduces the average number of interactions from 15.3 to 10.2. Compared with GDT, ExpDT does not involve image intensity but achieves better results on all metrics. Interestingly, with a DSC threshold of 80%, EDT-half and GDT-half require fewer interactions than EDT and GDT, respectively. We conjecture that the smaller image sizes alleviate the inconsistency of image sizes between the training and evaluation stages, which affects the performance of “global transforms”. Finally, ExpDT with  achieves the best results on all four primary metrics. The last three rows of Table II indicate that successive improvements can be made by subtly adjusting , even with the model parameters fixed after training. The comparison between ExpDT and the other DTs demonstrates the effectiveness and flexibility of ExpDT.
Table III compares the performance of the different DTs on the more challenging testing set. We report three accuracy metrics and the numbers of foreground and background points when 20 clicks are provided. One can observe that the overall performance is inferior to that on the training set. Potential reasons include the differences in image size and pixel spacing, and the distribution shift between the training and testing sets. We observe that the accuracy of EDT and GDT decreases more than that of ExpDT: their number of background points greatly exceeds their number of foreground points, and their ARVD is notably larger than that of ExpDT. This indicates that EDT and GDT predict many false-positive regions, on account of their “global transform” nature. In contrast, ExpDT yields improved results, and its numbers of positive and negative clicks are relatively balanced. We also report the p-values of the t-test between ExpDT and the other DTs, which indicate the substantial improvement of ExpDT. Therefore, as a “local transform”, ExpDT is more generalizable.
Fig. 5 displays four segmentation cases produced by DINs with different transform functions. For each case, only one positive click is provided to perform segmentation. We observe that, with only one click, DINs with ExpDT achieve better performance in the segmentation of discrete neurofibromas (first and second rows), whereas EDT and GDT produce more false-positive as well as false-negative regions. For plexiform neurofibromas, ExpDT achieves consistently good segmentation accuracy (third and fourth rows). However, EDT and GDT may miss tumor regions far from the clicked object (third row); their high sensitivity to image size is the main reason for such instability. For large plexiform neurofibromas (fourth row), GDT may miss part of the lesions due to the heterogeneous and diffuse tumor architecture, while ExpDT and  show better performance. Fig. 6 shows some results on WBMRI produced by DINs.
|NCI-3DQI ||-21 (-406 to 114)||4.5 (0.3 to 28.4)||43|
|MGH-3DQI ||-30 (-782 to 353)||9.5 (0.1 to 48.8)||34|
|DINs||-3 (-101 to 360)||16.6 (7.9 to 29.2)||40|
|3D U-Net ||0.06||0.97||77.62||<0.01|
|I||nnU-Net + DIM output 1||0.40||0.72||6.70||<0.01|
|DINs (DIM output 1)||0.67||0.46||0.79||0.58|
|Methods||Type of interactions||DSC (20 inters)||# of inters (0.8 DSC)||Running time/inter (in minute)|
|RW ||boxes, points||0.47||19.5||2.12|
|GC ||boxes, points||0.66||17.9||0.71|
|AC ||boxes, thresholds, bubbles||0.71||15.2||5.32|
|DINs-box (ours)||boxes, points||0.80||12.1||0.14|
V-B Comparison with other methods
V-B1 State-of-the-art interactive methods
Cai et al. performed volume measurements on the LRMRI data set using the 3DQI software at Massachusetts General Hospital (MGH) and the National Cancer Institute (NCI), and the MEDx software at NCI. The first two rows of Table IV show the results of NCI-3DQI vs. NCI-MEDx and MGH-3DQI vs. NCI-MEDx. We take the segmentation results of NCI-MEDx as the ground truth and apply DINs to the LRMRI data set. The results in the last row indicate that DINs achieve performance similar to 3DQI. Notice that the results of NCI-3DQI and MGH-3DQI were finalized with various editing tools, while the DINs results were not, to ensure a fair comparison.
V-B2 Deep CNN-based methods
A comparison between DINs and state-of-the-art medical image segmentation methods is shown in Table V. One can observe that DINs outperform automated methods by 44%-63% and outperform “nnU-Net + DIM output 1” by 29%. Automated methods obtain low scores, while interactive methods perform better with the help of user knowledge. The comparison between “nnU-Net + DIM output 1” and “DINs (DIM output 1)” indicates that DINs benefit from the proposed feature-extractor structure adapted to WBMRI. Finally, with DIM output 2, DINs further increase the DSC by 2%, which suggests the effectiveness of integrating user knowledge into deeper layers. In addition, the p-values of the t-test between the results of DINs and the other methods are listed for reference.
V-B3 Conventional interactive methods
We compare DINs with some conventional interactive segmentation methods, including Random Walk (RW) , Graph Cut (GC)  and Active Contour (AC) . RW and GC treat the volume as a discrete static graph, performing segmentation with many positive and negative clicks by solving the linear system (RW) or min-cut (GC) problem. Commonly, to save computation time and reduce irrelevant information, a bounding box is provided before running these conventional approaches. Then the search space can be restricted to a smaller region. Therefore, we implement two versions of DINs:
DINs-full. Feeding the whole 3D volume into DINs for evaluation.
DINs-box. We manually create several bounding boxes in each volume according to three criteria: (1) spatially close tumors are grouped into the same bounding box; (2) the heights and widths of the bounding boxes are at least 128 pixels, while the depths are set as tight to the tumor boundaries as possible, consistent with users' behavior; and (3) there are no more than five bounding boxes per case.
Fig. 7 (a) presents the trend of DSC for DINs and the conventional methods as interactive points are continuously provided. RW, GC, AC, and DINs-box run within the volume of interest, while DINs-full uses the entire 3D volume. In Fig. 7, we observe that RW and GC show a low DSC, which is caused by the limited information used to compute features. We notice that DINs-full offers only a modest improvement over AC, since AC requires setting a bounding box and adjusting two thresholds for filtering the background, which demands much more user effort to tune the thresholds precisely. With additional bounding boxes, DINs-box significantly exceeds all three conventional interactive methods. For a detailed comparison, we summarize the type of interactions, DSC, number of interactions, and running time in Table VI. DINs-full has the minimum interaction requirement to perform the segmentation, which substantially reduces the complexity of the interaction. With extra bounding boxes, DINs-box further improves segmentation accuracy while reducing both the user burden and the running time.
Overall, as the number of interactions increases, DINs improve the segmentation accuracy more stably and consistently. Furthermore, a substantial improvement can be achieved by providing a few bounding boxes for difficult cases (DINs-box). This implies that DINs are more effective and flexible for interactive segmentation.
V-C Ablation studies
In this section, we conduct ablation studies to investigate three crucial design factors of DINs: (1) the upper bound on the number of interactions sampled during training, (2) the scale parameters of ExpDT, and (3) the DIM module.
V-C1 Number of interactions during training
The upper bound on the number of foreground and background clicks sampled during training may significantly influence the performance of the resulting model. We train DINs with four different upper bounds and evaluate them on the test set. The comparison is shown in Fig. 7 (b). We find that one setting yields a higher DSC than the others once the number of clicks exceeds 3, while the models trained with the remaining settings are significantly inferior. Too many points during training may lead the model to rely heavily on user interactions and become more conservative, while too few points may not be adequate to help the model learn discriminative features. This indicates that the hyper-parameter is crucial for training a well-performing model, and that the strategy of linking the upper bound of the interaction number to the image dimension (see Equation (6)) is reasonable and effective.
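The training-time interaction simulation can be sketched as sampling a bounded number of click positions from the regions where the current prediction is wrong. The snippet below is an assumed simplification: the exact sampling rule and the dimension-linked bound of the paper's Equation (6) are left abstract, with `n_max` passed in directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_clicks(error_mask, n_max):
    """Sample up to n_max click positions from a binary error mask,
    mimicking how interactions are simulated during training.

    error_mask: boolean array marking mislabeled voxels
    n_max:      upper bound on the number of sampled clicks; the paper
                ties this bound to the image dimension (its Eq. (6)),
                which we do not reproduce here.
    """
    coords = np.argwhere(error_mask)
    if len(coords) == 0:
        return coords  # nothing to correct
    n = rng.integers(1, n_max + 1)  # draw 1..n_max clicks
    idx = rng.choice(len(coords), size=min(n, len(coords)), replace=False)
    return coords[idx]

err = np.zeros((4, 4), dtype=bool)
err[1:3, 1:3] = True  # four erroneous pixels
clicks = sample_training_clicks(err, n_max=3)
```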
V-C2 Scale parameter of ExpDT
The scale factors of ExpDT are another set of key parameters for training a well-performing model; they correspond to the anatomical anterior-posterior, superior-inferior, and left-right directions, respectively. We fix one factor to 1 and compare different values of the other two. The results are presented in Fig. 7 (c) and indicate that the scale factor is an important parameter for optimal model performance. The potential reason is the varied sizes of neurofibromas. However, this sensitivity can also be seen as flexibility: users can choose different scale factors for tumors of various sizes to achieve higher accuracy. The scale factors can also be adjusted at inference time for better segmentation results; the comparison in Table II and Table III indicates that a slightly larger scale factor gives better results. Furthermore, the scale factors can be adjusted per case to improve performance.
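To make the role of the per-axis scale factors concrete, the sketch below builds a guide map that decays exponentially with the anisotropically scaled distance from a click. This is an assumed form for illustration; the exact ExpDT definition follows the paper's equations.

```python
import numpy as np

def expdt_guide_map(shape, click, scales):
    """Guide map decaying exponentially with anisotropically scaled
    distance from a click (an illustrative sketch of ExpDT).

    shape:  (D, H, W) of the volume
    click:  (z, y, x) voxel position of the interaction
    scales: per-axis scale factors; a larger value widens the
            response along that axis
    """
    grids = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    sq = sum(((g - c) / s) ** 2 for g, c, s in zip(grids, click, scales))
    return np.exp(-np.sqrt(sq))

# The response reaches 1 at the click and spreads wider along the
# axes with larger scale factors.
g = expdt_guide_map((5, 5, 5), click=(2, 2, 2), scales=(1.0, 2.0, 2.0))
```

A larger scale thus produces a broader guide map, matching the observation that bigger tumors benefit from larger scale factors.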
V-C3 Deep interactive module
We conduct experiments to assess the effectiveness of each part of the DIM. Several variants of the DIM are compared: (1) DIM-input: DIM with only the output 1 branch. (2) DIM-highest: DIM with only the output 2 branch. (3) DIM-v2: the path to output 2 is implemented with a max-pooling layer followed by two large-stride convolutional layers. The results are shown in Fig. 8 (a). Intuitively, the user interactions are additional features that discriminate the tumor from the background. We observe that DIM-highest obtains a poor DSC and improves only slightly as the number of clicks increases, while DIM-input achieves a higher DSC. Combining DIM-highest and DIM-input yields a further improvement. The comparison among DIM-highest, DIM-input, and DIM indicates that guide maps with spatial information help the network learn more discriminative features, and that guide maps at the highest layer of the encoder directly enhance the corresponding features. The comparison between DIM and DIM-v2 indicates that a single convolutional layer is adequate to pass the extra features to deeper layers; more layers introduce more parameters, which increases the risk of overfitting.
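The fusion performed by DIM output 2 can be pictured as a single 1x1 convolution that mixes the (downsampled) guide maps into the deepest encoder features. The numpy sketch below is illustrative: the weight names and shapes are assumptions, not the paper's exact implementation.

```python
import numpy as np

def inject_guides(features, guide_maps, w_feat, w_guide, bias):
    """Fuse guide maps into an encoder feature map with a single 1x1
    convolution, in the spirit of DIM output 2 (illustrative sketch).

    features:   (H, W, C_f) deepest encoder features
    guide_maps: (H, W, C_g) guide maps resized to the same spatial size
    w_feat:     (C_f, C_out) 1x1-conv weights applied to the features
    w_guide:    (C_g, C_out) 1x1-conv weights applied to the guide maps
    """
    fused = features @ w_feat + guide_maps @ w_guide + bias
    return np.maximum(fused, 0.0)  # ReLU

# Toy usage with constant inputs and weights.
feat = np.ones((4, 4, 8))
guide = np.ones((4, 4, 2))
out = inject_guides(feat, guide, np.full((8, 16), 0.1),
                    np.full((2, 16), 0.5), 0.0)
```

Because the mixing is a single linear layer, the extra parameter count stays small, which is consistent with the observation that the deeper DIM-v2 path overfits.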
We also compare the position at which DIM output 2 is inserted into the encoder. Let DIM-i denote the variant that connects DIM output 2 to the i-th layer of the encoder; connecting it to the deepest layer is exactly the proposed structure. The results are shown in Fig. 8 (b). Overall, the DSC increases as the guide maps are integrated into deeper layers. The reason is that the guide maps carry limited information compared with the abundant features from the images and are easily diluted as more features are extracted. Therefore, integrating the guide maps into the encoder's deepest layer is the best choice for enhanced feature learning. Besides, DIM-input outperforms DIM-1 and DIM-2, as presented in Fig. 8 (c) for clarity. This indicates that integrating guide maps multiple times into shallow layers of the encoder (including the input layer) hurts feature learning because of overfitting, which impairs generalization.
V-C4 Effect of click positions
Click positions affect interactive segmentation accuracy. To quantify this effect for DINs, we randomly select ten neurofibromas from the training set, and each neurofibroma is clicked once. Each example is evaluated five times with different click positions, and the standard deviation of the five results is computed for each neurofibroma. The median (range) of the ten standard deviations is 0.015 (0.005 to 0.044), which indicates that click positions affect the performance within a reasonable range. As the click number increases to 2 and 3, the segmentation results become more stable, and the median of the standard deviations decreases to 0.008 (0.001 to 0.038) and 0.005 (0.001 to 0.026), respectively.
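This robustness summary reduces to computing per-tumor standard deviations over repeated runs and reporting their median and range. A minimal sketch (the DSC values below are hypothetical stand-ins for repeated-click runs):

```python
import numpy as np

def click_stability(dsc_runs):
    """Summarize sensitivity to click position.

    dsc_runs: (n_tumors, n_repeats) array of DSC scores from repeated
              runs with different click locations per tumor.
    Returns the median and (min, max) range of the per-tumor standard
    deviations, the statistics reported in the text.
    """
    stds = np.std(dsc_runs, axis=1, ddof=1)
    return float(np.median(stds)), (float(stds.min()), float(stds.max()))

# Hypothetical scores: 3 tumors, 5 repeated single clicks each.
runs = np.array([[0.80, 0.81, 0.79, 0.80, 0.82],
                 [0.70, 0.75, 0.72, 0.74, 0.71],
                 [0.90, 0.90, 0.91, 0.89, 0.90]])
med, (lo, hi) = click_stability(runs)
```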
V-D Interactive results
Two interactive segmentation results of plexiform neurofibromas with DINs are displayed in Fig. 9. The ground truth contours (manual segmentation) are red, and the predicted contours are yellow. The positive and negative interactive clicks are marked by red and yellow points, respectively. DINs achieve accurate segmentation of multiple tumors with one click and iteratively improve the segmentation with additional interactions. Note that the scale factors are set to their default values, which are suitable for most neurofibroma segmentation situations.
In Fig. 10, we compare the three interactive methods given the same clicks. (Note: a negative click in the second image is placed at a slightly different location than in the other two methods because the original location produced an empty prediction.) Random walk tends to undersegment, while graph cut cannot distinguish neurofibromas from normal organs and tends to oversegment. In comparison, DINs recognize neurofibromas accurately. These two groups of segmentation results support the advantages of DINs.
In conclusion, we propose the effective and flexible Deep Interactive Networks (DINs) with a novel Exponential Distance Transform (ExpDT) for neurofibroma segmentation on WBMRI. The DINs framework efficiently extracts discriminative tumor features by incorporating user interactions into both low-level and high-level features. The "local transformation" ExpDT is better equipped to address the biased data distribution in medical images. Experiments on the training and test sets show that the proposed method outperforms conventional interactive methods and performs significantly better than automated and interactive CNN-based methods. Limitations of DINs include the following: (1) like conventional semi-automatic segmentation methods, DINs still need extra editing tools to achieve an acceptable volume measurement; (2) ExpDT generates guide maps without considering image intensities, which could otherwise improve the quality of the guide maps; (3) DINs may fail in some cases, such as neurofibromas near the orbits, due to similar intensities. Considering these limitations, integrating anatomical structure into neural networks and combining image intensity into the guide maps are promising directions for developing high-performance interactive neural network methods.
Wei Chen is supported by the National Key R&D Program of China under grant No. 2019YFB1404802 and the National Natural Science Foundation of China (61772456). Wenli Cai is supported by grants R42CA192600 and R42CA189637 from the National Institutes of Health and by the Children's Tumor Foundation. Pengyi Hao is supported by the National Natural Science Foundation of China under grant No. 61801428. Scott Plotkin received support from the Department of Defense (W81XWH-06-1-0739) and philanthropic funds.
- (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- (2001) Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), Vol. 1, pp. 105–112.
- (2018) Volumetric MRI analysis of plexiform neurofibromas in neurofibromatosis type 1: comparison of 2 methods. Acad. Radiol. 25 (2), pp. 144–152.
- (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 834–848.
- (2016) Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In Medical Image Computing and Computer-Assisted Intervention, pp. 415–423.
- (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention, pp. 424–432.
- (2008) GeoS: geodesic image segmentation. In European Conference on Computer Vision, pp. 99–112.
- (2006) Random walks for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 28 (11), pp. 1768–1783.
- (2017) Brain tumor segmentation with deep neural networks. Med. Image Anal. 35, pp. 18–31.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2020) Image segmentation of plexiform neurofibromas from a deep neural network using multiple b-value diffusion data. Sci. Rep. 10 (1), pp. 1–10.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
- (2020) Automated design of deep learning methods for biomedical image segmentation. arXiv:1904.08128.
- (2019) nnU-Net: breaking the spell on successful medical image segmentation. arXiv:1904.08128.
- (2019) Multiple resolution residually connected feature streams for automatic lung tumor segmentation from CT images. IEEE Trans. Med. Imaging 38 (1), pp. 134–144.
- (2013) Image segmentation in medical imaging via graph-cuts. In 11th International Conference on Pattern Recognition and Image Analysis: New Information Technologies.
- (2017) Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 36, pp. 61–78.
- (1988) Snakes: active contour models. Int. J. Comput. Vis. 1 (4), pp. 321–331.
- (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (1994) Building skeleton models via 3-D medial surface/axis thinning algorithms. CVGIP: Graphical Models and Image Processing 56 (6), pp. 462–478.
- (2017) On the compactness, efficiency, and representation of 3D convolutional networks: brain parcellation as a pretext task. In Information Processing in Medical Imaging, pp. 348–360.
- (2018) H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Trans. Med. Imaging 37 (12), pp. 2663–2674.
- (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
- (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
- (2018) Two-stage convolutional neural network for breast cancer histology image classification. In International Conference Image Analysis and Recognition, pp. 717–726.
- (2018) FiLM: visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 32 (1).
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- (2019) From patch to image segmentation using fully convolutional networks – application to retinal images. arXiv:1904.03892.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
- (2004) Automated detection and volume measurement of plexiform neurofibromas in neurofibromatosis 1 using magnetic resonance imaging. Comput. Med. Imaging Graph. 28 (5), pp. 257–265.
- (2016) Instance normalization: the missing ingredient for fast stylization. arXiv:1607.08022.
- (2018) Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE Trans. Med. Imaging 37 (7), pp. 1562–1573.
- (2019) DeepIGeoS: a deep interactive geodesic framework for medical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 41 (7), pp. 1559–1572.
- (2014) PNist: interactive volumetric measurements of plexiform neurofibromas in MRI scans. Int. J. Comput. Assist. Radiol. Surg. 9 (4), pp. 683–693.
- (2012) Interactive segmentation of plexiform neurofibroma tissue: method and preliminary performance evaluation. Med. Biol. Eng. Comput. 50 (8), pp. 877–884.
- (2020) Deep parametric active contour model for neurofibromatosis segmentation. Future Generation Computer Systems 112, pp. 58–66.
- (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV).
- (2016) Deep interactive object selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 373–381.
- (2018) Efficient video object segmentation via network modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- (2006) User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 31 (3), pp. 1116–1128.
- (2011) Adaptive deconvolutional networks for mid and high level feature learning. In 2011 International Conference on Computer Vision, pp. 2018–2025.