Accurate Weakly Supervised Deep Lesion Segmentation on CT Scans: Self-Paced 3D Mask Generation from RECIST

01/25/2018 · Jinzheng Cai, et al. · University of Florida, National Institutes of Health, Ping An Bank, Nvidia

Volumetric lesion segmentation via medical imaging is a powerful means to precisely assess multiple time-point lesion/tumor changes. Because manual 3D segmentation is prohibitively time consuming and requires radiological experience, current practices rely on an imprecise surrogate called response evaluation criteria in solid tumors (RECIST). Despite their coarseness, RECIST marks are commonly found in current hospital picture archiving and communication systems (PACS), meaning they can provide a potentially powerful, yet extraordinarily challenging, source of weak supervision for full 3D segmentation. Toward this end, we introduce a convolutional neural network based weakly supervised self-paced segmentation (WSSS) method to 1) generate the initial lesion segmentation on the axial RECIST-slice; 2) learn the data distribution on RECIST-slices; 3) adapt to segment the whole volume slice by slice to finally obtain a volumetric segmentation. In addition, we explore how super-resolution images (2-5 times beyond the physical CT imaging resolution), generated from a proposed stacked generative adversarial network, can aid the WSSS performance. We employ the DeepLesion dataset, a comprehensive CT-image lesion dataset of 32,735 PACS-bookmarked findings, which include lesions, tumors, and lymph nodes of varying sizes, categories, body regions and surrounding contexts. These are drawn from 10,594 studies of 4,459 patients. We also validate on a lymph-node dataset, where 3D ground truth masks are available for all images. For the DeepLesion dataset, we report mean Dice coefficients of 93% on RECIST-slices and 76% for 3D lesion volumes. We further conduct a user study, where an experienced radiologist accepted our WSSS-generated lesion segmentation results with a high probability of 92.4%.


1 Introduction

Figure 1: Overview of the proposed weakly supervised self-paced segmentation with CNN (Sec.3) for 3D lesion segmentation.

Assessing lesion or tumor growth rates across multiple time-point scans of the same patient represents one of the most critical problems in imaging-based precision medicine. To track lesion progression using current clinical protocols, radiologists manually conduct this task using response evaluation criteria in solid tumors (RECIST) [8] in computed tomography (CT) or magnetic resonance imaging (MRI) scans. A radiologist first selects an axial image slice where the lesion has the longest spatial extent, then he or she measures the diameters of the in-plane longest axis and the orthogonal short axis. Fig. 2 provides some visual examples of RECIST marks provided in the DeepLesion dataset [42]. Typically, the selected RECIST-slice represents the lesion at its maximum cross-sectional area. RECIST is often subjective and prone to inconsistency among different observers, especially when selecting the corresponding slices at different time-points where diameters are measured. However, consistency is critical in assessing actual lesion growth rates, which directly impact patient treatment options. On the other hand, if the volumetric lesion measurements between baseline and follow-ups could be accurately computed and compared, this would avoid the subjective selection of RECIST-slices and allow a more holistic and accurate quantitative assessment of lesion growth rates. Unfortunately, full volumetric lesion measurements are too labor intensive to obtain. For this reason, RECIST is treated as the default, but imperfect, clinical surrogate for measuring lesion progression.

Since the clinical adoption of the RECIST criteria, many modern hospitals’ picture archiving and communication systems (PACS) and radiology information systems (RIS) have stored tremendous amounts of lesion diameter measurements linked to lesion CT and MRI images. In this paper, we tackle the challenging problem of leveraging existing RECIST diameters to produce fully volumetric lesion segmentations in 3D. Our approach is a weakly supervised semantic segmentation setup, using convolutional neural networks (CNNs) as the core building block. The overall method flow-chart is given in Fig. 1. The DeepLesion database [42] that we exploit is composed of 32,735 significant clinical radiology findings (lesions, tumors, lymph nodes, etc.) from 10,594 studies of 4,459 patients, bookmarked and measured via RECIST diameters by physicians as part of their day-to-day work. The lesion instances are also clustered into eight different anatomical categories, as shown in Fig. 2. Without loss of generality, here we only consider CT lesion images.

From any input CT image with RECIST-measured diameters, we aim to segment the lesion region on the RECIST-selected image first in a weakly supervised manner, followed by generalizing the process to other successive slices to finally obtain the lesion’s full volume segmentation. More specifically, with the bookmarked long and short RECIST diameters, we initialize the segmentation using unsupervised learning methods (e.g., GrabCut [31]). Afterwards, we employ an iterative segmentation refinement via a supervised deep neural network model [16], which can segment the lesion with good accuracy on the RECIST-slice. Importantly, the resulting CNN lesion segmentation model, trained from all training instances of CT RECIST-slices, can capture the lesion image appearance distribution. Thus, the model is capable of detecting lesion regions from images off the corresponding RECIST-slice. With more slices segmented, more image data can be extracted and used to fine-tune the deep CNN lesion segmentation model. As such, the proposed weakly supervised segmentation model is a self-paced label-map propagation process, from the RECIST-slice to the whole lesion volume. We thereby leverage a large amount of retrospective (yet clinically annotated) imaging data to automatically achieve the final 3D lesion volume measurement and segmentation.

To further enhance our WSSS model, we also investigate the role of image enhancement. A large portion of clinically significant findings in DeepLesion are spatially small, which often requires significant zooming by radiologists to make precise diameter measurements. To emulate this radiological practice, we develop a stacked generative adversarial network (SGAN) [22] model to perform resolution magnification, noise reduction, contrast adjustment and boundary enhancement on the original-resolution CT images. We stack two GANs together, where the first reduces the noise in the original CT image and the second generates a higher-resolution image with enhanced boundaries and high contrast. Because the CT modality can only be imaged at a limited spatial resolution (approximately 1 mm/pixel) and super-resolution CT images do not physically exist, we train our SGAN on a large quantity of paired low- and high-quality natural images and use transfer learning to apply the model to CT images.

We evaluate the proposed WSSS method on all lesion categories [42]. To the best of our knowledge, this is the first work to develop a class-agnostic lesion segmentation approach. For quantitative evaluation, we manually annotated 1,000 RECIST-slices and 200 lesion volumes, on which the proposed WSSS achieves mean Dice coefficients of 93% and 76%, respectively. To compare WSSS against a fully-supervised approach, we also validate on a lymph node (LN) dataset [29, 33], consisting of LNs with full pixel-wise annotations. Finally, we also conduct a subjective user study, and demonstrate that a radiologist accepts the WSSS-generated lesion masks with a high probability of 92.4%. In summary, we present a weakly supervised semantic segmentation approach for accurate lesion segmentation and volume measurement in the wild, using a comprehensive dataset [42]. With no extra human annotation effort required, we convert tens of thousands of recorded RECIST 2D diameter measurements in PACS/RIS into accurate 3D lesion volume segmentation assessments.


Figure 2: The DeepLesion dataset [42] consists of eight categories of lesions. From left to right and top to bottom, the lesion categories are lung, mediastinum, liver, soft-tissue, abdomen, kidney, pelvis and bone, respectively. For all images, the RECIST-slices are shown with manually delineated boundaries in red and bookmarked RECIST diameters in white.

2 Related Work

Weakly supervised semantic segmentation is a challenging computer vision task that has drawn considerable interest, with promising results reported recently [25, 16, 23, 26, 6, 3, 39, 12]. Most weakly supervised semantic segmentation methods update pixel labels and CNN models over training iterations. The pixel-level training labels are often initialized from weak annotations, e.g., bounding boxes of foreground objects. Subsequently, CNNs are trained to capture the initial masks. Although many pixels may initially be mislabelled, the CNNs are expected to robustly handle label noise (to some extent) and model the complete data distribution. Once the CNN model converges, its output is used to obtain more accurate pixel labels. Updates between the CNN and the pixel training labels loop iteratively until no further changes can be measured. However, as observed in [16], very high-quality initial pixel labels can allow the CNN model to converge in very few iterations.

Thus, it is essential to have the object mask “correctly” initialized and updated. A popular mask initialization uses multiscale combinatorial grouping (MCG) [27] to group superpixels into plausible object parts, using a bounding box to delineate foreground, background, and uncertain regions during CNN training [6, 16]. RECIST diameters are similar to scribbles over foreground objects, and these can be used to coarsely delineate superpixels into foreground and background [23]. However, empirically MCG does not perform well on the CT imaging modality (compared to natural images) and thus is not suitable for our lesion segmentation task. Semantic edge based methods, like GrabCut [16, 25], are also not applicable because unsupervised edge detection methods, such as gPb [27], suffer from abundant false positives or perform unstably across the whole CT image dataset. Another stream of related work extends weakly supervised semantic segmentation from static images to video [12]. In videos, the same object often retains a similar appearance across different temporal frames even though its location may vary. Using this prior, training images with object-wise labels can be extracted and propagated across video frames to facilitate CNN training. However, lesions exhibit not only positional movements but also drastic morphological changes across CT slices, adding further challenges to lesion volume segmentation from RECIST diameters. Hence, our work differs from prior art by employing a principled trimap mask initialization and GrabCut [31], coupled with self-paced learning [20] to progressively refine the 3D lesion segmentations.

Recently, deep CNN based image super-resolution methods [7, 17, 18, 22, 32, 34, 15, 21, 37, 38, 5] have achieved state-of-the-art performance, due to CNNs’ capability to learn powerful features and model long-range contextual information from a collection of low-resolution images. Mappings between low- and high-resolution images are learned by minimizing the mean squared error (MSE) loss or a perceptual loss [15, 22, 32]. Adversarial learning strategies [22] are also employed to train CNN models for better reconstruction of fine details and edges. In this work, we adopt the perceptual loss and adversarial learning to simulate higher-resolution images via CNNs directly from the original CT images (approximately 1 mm per pixel). Unlike other works, we use the enhanced images to improve the WSSS performance.

3 Method

Figure 3: Our weakly supervised deep lesion segmentation framework. We use the CNN output to gradually generate extra training data for self-paced learning. Regions colored with red, orange, green, and blue inside the trimap represent FG, PFG, PBG, and BG, respectively.

In the DeepLesion dataset [42], manually selected axial CT slices contain RECIST diameters that represent the longest axis and its perpendicular counterpart. Given RECIST diameters as weak supervision, we leverage weakly supervised principles to learn a deep CNN lesion segmentation model using both 2D slices and 3D volumes, with no extra pixel-wise manual annotations. Our WSSS pipeline contains four main steps, as follows.

3.1 From RECIST to 2D Training Masks

We denote elements in the DeepLesion dataset [42] as $(V_i, R_i)$ for $i = 1, \ldots, N$, where $N$ is the number of lesions, $V_i$ is the CT region of interest, and $R_i$ is the corresponding RECIST diameters. To create the 2D training data for the WSSS model, the image and label-map pairs, $X_i$ and $Y_i$, respectively, must be generated. Together they comprise the set $\mathcal{T} = \{(X_i, Y_i)\}$ for $i = 1, \ldots, N$. For notational clarity, we drop the index $i$. $X$ is simply the RECIST-slice, i.e., the axial slice that contains $R$, and $z_R$ is its slice number. Both GrabCut [31] and densely-connected conditional random fields (DCRF) [19] can be adopted to produce the training mask $Y$. When GrabCut starts with a good-quality initial trimap (explained below), we observe that it produces better results, which is consistent with the findings in [16].

More specifically, GrabCut is initialized with image foreground and background regions and produces a segmentation using iterative energy minimization. Good initializations located near the global energy minimum aid GrabCut’s convergence. We use the spatial prior information provided by $R$ to compute a high-quality initial trimap from $X$, i.e., regions of probable background (PBG), probable foreground (PFG), background (BG), and foreground (FG). Note that, unlike the original trimap definition [31], we define four region types. More specifically, given the lesion bounding box fitted tightly around the RECIST axes, an ROI is cropped from the RECIST-slice with padding around the bounding box. The trimap assigns the outer portion of this ROI as BG and allocates a region dilated from $R$ as FG. The remaining pixels are divided between PFG and PBG based on their distances to FG and BG. Finally, the long/short axes of $R$ roughly clamp the size of the GrabCut segmentation, ensuring that the segmentation will not shrink into a small region when the CT image intensity distribution is homogeneous and no clear lesion boundary is present. Fig. 3 visually depicts the mask generation process.
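To make the trimap construction and GrabCut step concrete, here is a minimal OpenCV-based sketch under stated assumptions: the dilation radius, the ROI padding, and the helper names (`grabcut_from_recist`, `recist_mask`) are illustrative choices of ours, not the paper's exact settings.

```python
import cv2
import numpy as np

def grabcut_from_recist(roi, recist_mask, n_iter=5):
    """Segment a lesion ROI with GrabCut initialized from a 4-region trimap.

    roi         : 8-bit single-channel CT ROI (already intensity-windowed).
    recist_mask : binary mask of the rasterized RECIST diameters inside roi.
    Returns a binary foreground mask of the same size as roi.
    """
    h, w = roi.shape
    # Foreground seeds: a small dilation of the RECIST diameters.
    fg = cv2.dilate(recist_mask.astype(np.uint8), np.ones((5, 5), np.uint8)) > 0

    # Background seeds: pixels far outside the RECIST bounding box (padding is assumed).
    ys, xs = np.where(recist_mask > 0)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    pad_y, pad_x = (y1 - y0) // 2, (x1 - x0) // 2
    bg = np.ones((h, w), bool)
    bg[max(0, y0 - pad_y):y1 + pad_y, max(0, x0 - pad_x):x1 + pad_x] = False

    # Remaining pixels split into probable FG / probable BG by distance to the seeds.
    dist_fg = cv2.distanceTransform((~fg).astype(np.uint8), cv2.DIST_L2, 3)
    dist_bg = cv2.distanceTransform((~bg).astype(np.uint8), cv2.DIST_L2, 3)
    trimap = np.full((h, w), cv2.GC_PR_BGD, np.uint8)
    trimap[dist_fg < dist_bg] = cv2.GC_PR_FGD
    trimap[fg] = cv2.GC_FGD
    trimap[bg] = cv2.GC_BGD

    # GrabCut expects a 3-channel 8-bit image; replicate the CT ROI across channels.
    img = cv2.cvtColor(roi, cv2.COLOR_GRAY2BGR)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(img, trimap, None, bgd_model, fgd_model, n_iter,
                cv2.GC_INIT_WITH_MASK)
    return np.isin(trimap, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```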

3.2 CNN Lesion Appearance Model

Many image classification and segmentation CNN models have been presented in recent works [35, 11, 13, 40, 28]. Without loss of generality, we use the holistically-nested network (HNN) [40] and UNet [28] as our baselines, both of which provide state-of-the-art, yet straightforward, semantic segmentation architectures. We represent our CNN model as a mapping function $\hat{Y} = f(X; \theta)$, with the goal of optimizing the parameters $\theta$ to minimize the differences between the current model outputs $\hat{Y}$ and the ground-truth training labels $Y$. Although the GrabCut masks are imperfect, we use them as $Y$ for the next step of CNN model training, expecting that the CNN will generalize well even with considerable label noise. Thus, we aim to minimize

$$\mathcal{L}(\theta) = \frac{1}{|Y|} \sum_{j \in Y} \ell\!\left(Y_j, \hat{Y}_j\right), \qquad (1)$$

where $j$ is the pixel index inside $Y$, $|Y|$ is the cardinality of $Y$, and $\ell(\cdot,\cdot)$ is the cross-entropy function.
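As a concrete reference, a minimal NumPy sketch of this pixel-wise loss is given below; the convention of marking ignored (uncertain) pixels with the label 255 is our own assumption for illustration.

```python
import numpy as np

def pixelwise_cross_entropy(prob_fg, labels, ignore_value=255, eps=1e-7):
    """Mean binary cross-entropy over labeled pixels.

    prob_fg : H x W array of predicted foreground probabilities.
    labels  : H x W array with 0 (background), 1 (foreground),
              or ignore_value for pixels excluded from the loss.
    """
    valid = labels != ignore_value
    y = labels[valid].astype(np.float64)
    p = np.clip(prob_fg[valid], eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))
```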

3.3 Self-Pacing for Volume Segmentation

For obtaining 3D volume segmentations, we follow a similar strategy as with the RECIST-slices, except that in the 3D case we must infer estimated RECIST diameters $\hat{R}$ for off-RECIST slices and also incorporate inferences from the CNN model. These two priors are used together for self-paced CNN training.

3D RECIST Estimation: A simple way to generate off-RECIST-slice diameters is to take advantage of the fact that the RECIST-slice lies on the maximal cross-sectional area of the lesion. The rate of reduction of the off-RECIST-slice endpoints is then calculated from their relative distance to the intersection point of the major and minor axes. Estimated 3D RECIST endpoints $\hat{R}$ are then projected from the actual RECIST endpoints by the Pythagorean theorem using physical Euclidean distance.
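A small sketch of this projection step is given below, assuming the lesion cross-section shrinks roughly spherically away from the RECIST-slice; the array layout and the clipping at zero radius are our own assumptions for illustration.

```python
import numpy as np

def project_recist_endpoints(endpoints, center, slice_offset, spacing_z, spacing_xy):
    """Estimate RECIST endpoints on an off-RECIST axial slice.

    endpoints    : (4, 2) array of the long/short-axis endpoints (row, col) in pixels.
    center       : (2,) intersection point of the long and short axes in pixels.
    slice_offset : number of slices away from the RECIST-slice.
    spacing_z    : slice spacing in mm; spacing_xy: in-plane pixel size in mm.
    Returns the (4, 2) projected endpoints; endpoints collapse to the center
    when the physical offset exceeds their in-plane radius.
    """
    dz = abs(slice_offset) * spacing_z                      # physical offset in mm
    vec = (endpoints - center) * spacing_xy                 # in-plane vectors in mm
    radius = np.linalg.norm(vec, axis=1)                    # distance to the intersection
    new_radius = np.sqrt(np.clip(radius**2 - dz**2, 0.0, None))   # Pythagorean projection
    scale = np.divide(new_radius, radius, out=np.zeros_like(radius), where=radius > 0)
    return center + (endpoints - center) * scale[:, None]
```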

Trimap from CNN Output: Different from Sec. 3.1, trimap generation now takes both the CNN output $\hat{Y}$ and the estimated RECIST $\hat{R}$ as inputs. The probability map $\hat{Y}$ is first binarized by adjusting the threshold so that it covers at least a fixed fraction of $\hat{R}$’s pixels. Regions of the binarized map with high foreground probability values that overlap with $\hat{R}$ are set as FG together with $\hat{R}$. Similarly, regions with high background probabilities that have no overlap with $\hat{R}$ are assigned as BG. The remaining pixels are left as uncertain using the same distance criteria as in the 2D mask generation case and fed into GrabCut for lesion segmentation. In the limited cases where the CNN fails to detect any foreground regions, we generate the trimap solely from the estimated RECIST $\hat{R}$. We observe that, when possible, GrabCut initialized by the CNN output performs better than GrabCut initialized solely from $\hat{R}$.
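The following sketch illustrates one way such a trimap could be assembled from the CNN probability map and the estimated RECIST mask; the thresholds and the coverage fraction are placeholder values of ours, since the paper's exact settings are not given here.

```python
import numpy as np
from scipy import ndimage

def trimap_from_cnn(prob_fg, recist_mask, fg_thresh=0.8, bg_thresh=0.2, min_cover=0.5):
    """Build a GrabCut trimap from a CNN probability map and an estimated RECIST mask.

    Labels: 0 = background, 1 = foreground, 2 = uncertain.  The thresholds and
    the coverage fraction are illustrative values, not the paper's settings.
    """
    # Lower the binarization threshold until the CNN mask covers enough RECIST pixels.
    thresh, recist_px = fg_thresh, max(int((recist_mask > 0).sum()), 1)
    while thresh > bg_thresh and (prob_fg >= thresh)[recist_mask > 0].sum() < min_cover * recist_px:
        thresh -= 0.05
    cnn_fg = prob_fg >= thresh

    # Keep only connected foreground components that touch the estimated RECIST.
    comp, _ = ndimage.label(cnn_fg)
    touching = np.unique(comp[(recist_mask > 0) & (comp > 0)])
    fg = np.isin(comp, touching) | (recist_mask > 0)

    trimap = np.full(prob_fg.shape, 2, np.uint8)             # uncertain by default
    trimap[fg] = 1                                            # confident foreground
    trimap[(prob_fg <= bg_thresh) & ~fg] = 0                  # confident background
    return trimap
```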

Self-Paced CNN Training: To generate 3D lesion segmentations from 2D RECIST annotations, we train the CNN model in a self-paced, or self-taught, manner. It begins by learning the lesion appearance via the CNN on RECIST-slices. After the model converges, it is used to generate training masks on successive slices so that it can be further fine-tuned on the newly “harvested” images. We progressively expand the extent of the harvested images to include more CT slices along the longitudinal axis. Taking one lesion volume as an example, the CNN is first trained on its RECIST-slice $z_R$ until convergence; we then apply this CNN model to the neighboring slices $z_R \pm 1$ to compute the predicted probability maps $\hat{Y}$. Given these probability maps, we create trimaps to employ GrabCut refinement [16] on these slices. GrabCut consequently generates the training labels that are used by the CNN model in the next training round on the slices $[z_R - 1, z_R + 1]$, and the harvested slice range grows by one on each side at every subsequent round. As this procedure iterates, we gradually obtain a converged lesion segmentation result in 3D. We summarize our WSSS method in Algorithm 1 and visually depict its process in Fig. 3. Note that it is possible that GrabCut gives no or very few foreground pixels if the initial trimap and CT image are of bad quality. In these cases, we set up foreground, background, and ignored uncertain pixels by combining the estimated RECIST $\hat{R}$ and the CNN output: CNN-predicted foreground regions that overlap with $\hat{R}$ are assigned as foreground; predicted background regions that have no overlap with the RECIST are set as background; and the remaining pixels are ignored during training.

Input: DeepLesion dataset $\{(V_i, R_i)\}$; 3D RECIST estimates $\hat{R}$.
Input: Maximum iteration $T$; initial CNN model $f_0$.

1: Extract the RECIST-slices $X$ from $V$ (RECIST-slice extraction)
2: Compute the initial train-masks $Y$ with GrabCut-R
3: Form the RECIST-slices training set $\mathcal{T}_0 = \{(X, Y)\}$
4: Train the initial 2D segmentation model $f_1$ on $\mathcal{T}_0$
5: for $t = 1$ to $T$ do
6:     expand the harvested slice range by one on each side of the RECIST-slice
7:     for each newly included slice do
8:          $\hat{Y} \leftarrow f_t(X)$, CNN inference
9:          build the trimap from $\hat{Y}$ and $\hat{R}$
10:         refine the label $Y$ with GrabCut
11:         add $(X, Y)$ to the training set $\mathcal{T}_t$
12:     end for
13:     fine-tune $f_{t+1}$ on $\mathcal{T}_t$, CNN training with multi-slices
14: end for
15: return the trained CNN model $f_{T+1}$
Algorithm 1 Weakly Supervised Self-Paced CNN Training
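For readers who prefer code, the loop of Algorithm 1 can be paraphrased as the Python sketch below; `train_cnn`, `predict`, and `grabcut_refine` are placeholders standing in for HNN training, HNN inference, and the trimap-plus-GrabCut refinement, not real library calls.

```python
def self_paced_training(volumes, recist_slices, recist_masks, est_recist_masks,
                        train_cnn, predict, grabcut_refine, max_offset=2):
    """Weakly supervised self-paced CNN training (sketch of Algorithm 1).

    volumes          : list of 3D CT ROIs (slices first), one per lesion.
    recist_slices    : index of the RECIST-slice for each lesion.
    recist_masks     : initial 2D training masks from GrabCut-R on the RECIST-slice.
    est_recist_masks : est_recist_masks[i][z] = estimated RECIST mask on slice z.
    The callables train_cnn / predict / grabcut_refine stand in for HNN training,
    HNN inference, and the trimap + GrabCut refinement step.
    """
    # Steps 1-4: train the initial 2D model on RECIST-slices only.
    dataset = [(vol[z], mask) for vol, z, mask in zip(volumes, recist_slices, recist_masks)]
    model = train_cnn(None, dataset)

    # Steps 5-14: progressively harvest labels on slices further from the RECIST-slice.
    for offset in range(1, max_offset + 1):
        for i, (vol, z) in enumerate(zip(volumes, recist_slices)):
            for dz in (-offset, offset):
                if 0 <= z + dz < vol.shape[0]:
                    prob = predict(model, vol[z + dz])                      # CNN inference
                    label = grabcut_refine(vol[z + dz], prob,
                                           est_recist_masks[i][z + dz])     # GrabCut label
                    dataset.append((vol[z + dz], label))
        model = train_cnn(model, dataset)                                    # fine-tune on the new slices
    return model
```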

3.4 Stacked-GAN Assisted Segmentation

We next investigate improving lesion segmentation accuracy by enhancing the quality and resolution of the input CT images via GANs. A direct application of super-resolution generative adversarial networks (SRGANs) [22] on CT images produces many artificial, noisy edges and even deteriorates the final lesion segmentation outcome. Spurred by this observation, we propose a two-stage stacked GAN (SGAN) process that first reduces image noise and then performs object boundary enhancement. More specifically, each stage is conducted using an independent SRGAN [22]. During model inference, the output of the first GAN will be fed as input to the second GAN. For lesion segmentation, we use both the denoised and enhanced outputs from SGAN and the original CT image as three-channel inputs since they may contain complementary information.
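At inference time, the two stages simply chain together before segmentation; the sketch below assumes two already-trained generators, here called `g_denoise` and `g_enhance` (the names are ours).

```python
import numpy as np

def sgan_three_channel_input(ct_slice, g_denoise, g_enhance):
    """Compose the 3-channel segmentation input from the stacked GAN outputs.

    ct_slice  : H x W windowed CT ROI, scaled to [0, 1].
    g_denoise : first-stage generator (noise reduction), callable on H x W arrays.
    g_enhance : second-stage generator (boundary/contrast enhancement), applied
                to the denoised output.  Both generator names are placeholders.
    """
    denoised = g_denoise(ct_slice)          # stage 1: remove CT noise
    enhanced = g_enhance(denoised)          # stage 2: sharpen boundaries, boost contrast
    # Original, denoised, and enhanced images may carry complementary information,
    # so they are stacked as the three input channels of the segmentation CNN.
    return np.stack([ct_slice, denoised, enhanced], axis=-1)
```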

Synthesized Training Data: Normally, SGAN or SRGAN models are trained with pairs of low- and high-resolution images. Such pairs can be obtained easily for natural images (by down-sampling), but physical CT images are acquired by scanners at a roughly fixed in-plane resolution of approximately 1 mm per pixel, and CT imaging at ultra-high spatial resolutions does not exist. For SGAN training, we therefore leverage transfer learning using a large-scale synthesized natural image dataset, DIV2K [2], where all images are converted into gray scale and down-sampled to produce training pairs. For training the denoising GAN, we randomly crop sub-images from distinct training images of DIV2K, and white Gaussian noise at different intensity variance levels is added to the cropped images to construct the paired model inputs. For training the image-enhancement GAN, the input images are cropped into patches and we perform the following steps: 1) down-sample the cropped image, 2) apply Gaussian spatial smoothing, 3) compress the image contrast, and 4) up-sample back to the original scale, to generate the image pairs.

To fine-tune using CT images, we process training RECIST-slices with the currently trained SGAN and select a subset of slices by subjectively inspecting the CT super-resolution results. The selected CT images are subsequently added to the training set for the next round of SGAN fine-tuning. This iterative process finishes when no further visual improvement can be observed.
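A possible NumPy/SciPy sketch of this training-pair synthesis is shown below; the noise level, blur sigma, contrast factor, and scale are illustrative values only, since the exact settings are not reproduced here.

```python
import numpy as np
from scipy import ndimage

def make_denoise_pair(patch, noise_sigma=0.05, rng=np.random):
    """Training pair for the denoising GAN: (noisy input, clean target)."""
    noisy = patch + rng.normal(0.0, noise_sigma, patch.shape)
    return np.clip(noisy, 0.0, 1.0), patch

def make_enhance_pair(patch, scale=2, blur_sigma=1.0, contrast=0.5):
    """Training pair for the enhancement GAN: (degraded input, sharp target).

    Down-sample, blur, compress contrast, then up-sample back, so the network
    learns to restore resolution, sharp edges, and contrast.  The patch side
    length is assumed to be divisible by `scale`.
    """
    low = patch[::scale, ::scale]                                # 1) down-sample
    low = ndimage.gaussian_filter(low, sigma=blur_sigma)         # 2) Gaussian smoothing
    low = 0.5 + contrast * (low - 0.5)                           # 3) contrast compression
    degraded = ndimage.zoom(low, scale, order=1)                 # 4) up-sample back
    return degraded, patch
```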

4 Experiments

Datasets: The DeepLesion dataset [42] is composed of 32,735 bookmarked CT lesion instances (with RECIST measurements) from 10,594 studies of 4,459 patients. Lesions have been categorized into the eight subtypes of lung, mediastinum, liver, soft-tissue, abdomen, kidney, pelvis, and bone. To facilitate quantitative evaluation, we manually segmented 1,000 testing lesion RECIST-slices. Out of these, 200 lesions are fully segmented in 3D as well. Additionally, we also employ the lymph node (LN) dataset [29, 33] (https://wiki.cancerimagingarchive.net/display/Public/CT+Lymph+Nodes), which consists of mediastinal and abdominal LNs from CT scans with complete pixel-wise annotations. Enlarged LNs are a lesion subtype, and producing accurate LN segmentations is quite challenging even with fully supervised learning [24]. Importantly, the fully annotated LN dataset can be used to evaluate our WSSS method against an upper performance limit, by comparing results with a fully supervised approach [24].

Pre-processing: For the LN dataset, annotation masks are converted into RECIST diameters by measuring the major/minor axes on the axial slice nearest to the largest LN cross-section. For robustness, a small amount of random noise is injected into the RECIST diameter lengths to mimic the uncertainty of manual annotation by radiologists. For both datasets, based on the location of the RECIST bookmarks, CT ROIs are cropped at two times the extent of the lesion’s longest diameter so that sufficient visual context is preserved. The dynamic range of each lesion ROI is then intensity-windowed using the CT windowing meta-information in [42]. The LN dataset is split at the patient level into training and testing subsets. For the DeepLesion dataset [42], we use a subset of lesion volumes for training and the rest for testing.
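The pre-processing can be sketched as follows; the fallback HU window and the helper name are assumptions of ours, as DeepLesion supplies per-lesion window values in its metadata.

```python
import numpy as np

def preprocess_lesion_roi(ct_slice, endpoints, window=(-1024, 3071)):
    """Window a CT slice and crop an ROI at twice the RECIST long-diameter extent.

    ct_slice  : 2-D array of Hounsfield units.
    endpoints : (4, 2) array of RECIST endpoints (row, col) in pixels.
    window    : (low, high) HU window; DeepLesion provides per-lesion values,
                so the default here is only a fallback.
    """
    low, high = window
    img = np.clip(ct_slice, low, high)
    img = (img - low) / float(high - low)                      # normalize to [0, 1]

    center = endpoints.mean(axis=0)
    extent = (endpoints.max(axis=0) - endpoints.min(axis=0)).max()
    half = int(round(extent))                                   # 2x the longest extent
    r, c = int(round(center[0])), int(round(center[1]))
    r0, r1 = max(0, r - half), min(img.shape[0], r + half)
    c0, c1 = max(0, c - half), min(img.shape[1], c + half)
    return img[r0:r1, c0:c1]
```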

Evaluation: The mean Dice similarity coefficient (mDICE) and the pixel-wise precision and recall scores are used to evaluate quantitative segmentation accuracy. We also measure the volumetric similarity (VS) and the averaged Hausdorff distance (AVD) to better capture and evaluate the small but vital changes near object boundaries [36].
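For reference, minimal NumPy implementations of two of these metrics (per-case DICE and VS) might look like the following sketch.

```python
import numpy as np

def dice_coefficient(pred, gt):
    """DICE = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, gt).sum() / denom

def volumetric_similarity(pred, gt):
    """VS = 1 - ||P| - |G|| / (|P| + |G|): sensitive to volume differences only."""
    p, g = int(pred.astype(bool).sum()), int(gt.astype(bool).sum())
    return 1.0 if (p + g) == 0 else 1.0 - abs(p - g) / float(p + g)
```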

Our baseline CNN model is the holistically-nested network (HNN), originally proposed for natural image edge detection [41], which has been adapted successfully for lymph node [24], pancreas [4, 30], and lung segmentation [9]. In all experiments, CNN training is implemented in Tensorflow [1] and Tensorpack (https://github.com/ppwwyyxx/tensorpack) with the Adam optimizer and initialized from a pre-trained ImageNet model [41]. The learning rate starts at a fixed value and is dropped when the training-validation plot plateaus.

Method Recall Precision mDICE
Lymph Node
RECIST 0.35±0.09 0.99±0.05 0.51±0.09
DCRF 0.29±0.20 0.98±0.05 0.41±0.21
GrabCut (bbox only) 0.10±0.25 0.32±0.37 0.11±0.26
GrabCut (bbox + interior FG) 0.53±0.24 0.92±0.10 0.63±0.17
GrabCut-R 0.83±0.11 0.86±0.11 0.83±0.06
Deep Lesion RECIST-Slice (Testing Images)
RECIST 0.39±0.13 0.92±0.14 0.53±0.14
DCRF 0.72±0.26 0.90±0.15 0.77±0.20
GrabCut (bbox only) 0.62±0.46 0.68±0.44 0.62±0.46
GrabCut (bbox + interior FG) 0.94±0.11 0.81±0.16 0.86±0.11
GrabCut-R 0.94±0.10 0.89±0.10 0.91±0.08
Table 1: Methods to Generate Training Labels from RECIST: pixel-wise precision, recall, and mean DICE (mDICE) are reported with standard deviation (±stdv.). Five different setups are compared: 1) RECIST: dilated RECIST, 2) DCRF: dense CRF, 3) GrabCut (bbox only): uses only the RECIST bbox, 4) GrabCut (bbox + interior FG): uses the bbox and an interior foreground, 5) GrabCut-R: uses the bbox and the dilated RECIST. See Sec. 3.1 and the experiment settings for details.

Label Map Initializations: Both GrabCut [16, 31] and densely-connected conditional random fields (DCRF) [19] are extensively evaluated for initializing training label maps from RECIST diameters. We test two alternatives to the trimap approach explained in Sec. 3.1, which we denote GrabCut-R. Both alternatives are based on a tight bounding box (bbox) that matches the extent of the lesion RECIST marks with padding (relative to the lesion’s spatial extent). The first alternative, GrabCut (bbox only), sets the area outside the bbox as background and the area inside the bbox as probable foreground. The second alternative, GrabCut (bbox + interior FG), sets the central bbox region as foreground, regions outside the bbox as background, and the rest as uncertain. This is similar to the bbox setting in [16]. We also test DCRF, using the bbox for the unary potentials and intensities to compute the pairwise potentials [19]. We empirically found that DCRF is moderately sensitive to parameter variations; its optimal performance is reported in Table 1. Finally, we also report results when we directly use the RECIST diameters, dilated to a fraction of the bbox area, which unsurprisingly produces the best precision but at the cost of very low recall. However, as can be seen in the table, GrabCut-R significantly outperforms all alternatives, demonstrating the validity of our mask initialization process.

Method CNN CNN-GC
Lymph-Node
Full-Sup (HNN) 0.710±0.18 0.845±0.06
RECIST 0.614±0.17 0.844±0.06
GC-Mask 0.702±0.17 0.844±0.06
Deep-Lesion RECIST-Slice
Full-Sup (UNet) 0.728±0.18 0.838±0.16
Full-Sup (HNN) 0.837±0.16 0.909±0.10
RECIST 0.644±0.14 0.801±0.12
GC-Mask 0.906±0.09 0.915±0.10
Table 2: Initial Masks to Train the CNN: all results are reported as mDICE±stdv. For CNN training, GC-Mask uses GrabCut-R as the mask initialization, whereas RECIST uses the trimap of Sec. 3.1 directly. The performance achieved by the fully supervised baselines of HNN [40] and UNet [28] is denoted as Full-Sup. CNN-GC is the result post-processed by GrabCut using the CNN outputs to initialize the trimap.

CNN Training under Different Levels of Supervision: Following Sec. 3.1, there are three ways of generating initial lesion masks on the RECIST-slice: using the trimap from the dilated RECIST (ignoring uncertain regions in training), GrabCut-R (processed from the trimap), and the full pixel-wise manual annotation (when available). Using the LN dataset [29, 33] with manual ground truth, HNNs trained from these three label-map initializations achieve 61%, 70%, and 71% mDICE scores, respectively, with increasing levels of supervision. No extra segmentation refinement options, like adaptive sample mining or a weighted training loss, are applied. This observation demonstrates the robustness and effectiveness of the GrabCut-R label map initialization, which performs only slightly worse than the fully annotated pixel-wise masks. On the DeepLesion [42] test set of 1,000 annotated RECIST-slices, the HNN trained on the GrabCut-R initialization outperforms the deep model learned from dilated RECIST maps by a margin of 25% in mDICE (90.6% versus 64.4%). GrabCut post-processing further improves the result from 90.6% to 91.5%. We also aim to demonstrate that our WSSS approach, trained on a large quantity of weakly supervised or “imperfectly-labeled” object masks, can outperform fully-supervised models trained on less data. To do this, we separated the annotated testing images into five folds and report the mDICE scores over 5-fold cross-validation using fully-supervised HNN [40] and UNet [28] models. Impressively, the Dice score of WSSS considerably outperforms the fully supervised HNN and UNet mDICE scores of 83.7% and 72.8%, respectively. This demonstrates the importance of training CNNs from a large-scale “imperfectly-labeled” dataset. All results are described in Table 2.

Figure 4: Volume Segmentation with Offsets: The x-axis represents the voxel distance of axial slices to the RECIST-slice. GrabCut-3DE is GrabCut segmentation performed with the 3D RECIST estimation. HNN is trained on the RECIST-slice only. WSSS (mid) is the HNN self-paced with 3 axial slices, and WSSS is the HNN further self-paced with 5 axial slices.

[Left: RECIST-Slice PR-Curve. Right: Lesion Volume PR-Curve.]

Figure 5: Dataset Precision-Recall: The precision and recall curves of the CNN output on 1,000 testing RECIST-slices and 200 lesion volumes. CNN, WSSS (mid), and WSSS share the same definitions as in Fig. 4, and their F-measures are presented in brackets in the legend.
Metric with SR w/o SR
Recall 0.911±0.097 0.933±0.095
Precision 0.940±0.091 0.893±0.111
AVD 0.189±1.030 0.230±1.070
VS 0.951±0.067 0.942±0.073
DICE 0.920±0.082 0.906±0.089
Table 3: Ablation Study of Stacked-GAN Assisted Lesion Segmentation with and without Super-Resolution (SR).
Method Bone Abdomen Mediastinum Liver Lung Kidney Soft-Tissue Pelvis Total
GrabCut 0.88±0.07 0.92±0.07 0.88±0.09 0.86±0.13 0.92±0.08 0.93±0.06 0.93±0.07 0.91±0.08 0.91±0.09
HNN 0.88±0.06 0.91±0.09 0.89±0.08 0.85±0.15 0.91±0.09 0.93±0.06 0.93±0.06 0.91±0.07 0.91±0.09
HNN-SR 0.89±0.06 0.93±0.09 0.91±0.08 0.88±0.14 0.92±0.07 0.94±0.05 0.94±0.05 0.92±0.08 0.92±0.08
HNN-GC 0.90±0.08 0.92±0.11 0.90±0.08 0.87±0.14 0.92±0.11 0.93±0.11 0.94±0.06 0.92±0.07 0.92±0.10
HNN-SR-GC 0.92±0.07 0.93±0.06 0.90±0.08 0.88±0.11 0.94±0.07 0.94±0.06 0.95±0.06 0.92±0.08 0.93±0.07
Table 4: Category-Wise RECIST-Slice Segmentation Comparison. HNN-SR and HNN-GC denote HNN augmented with super-resolution images and HNN with GrabCut post-processing, respectively, whereas HNN-SR-GC uses both enhancements. mDICE (±stdv.) scores are reported.

3D Segmentation: In Fig. 4, we show the segmentation results on 2D CT slices arranged in order of voxel-distance with respect to the RECIST-selected slice. GrabCut with the 3D RECIST estimation (GrabCut-3DE) produces good segmentations (91%) on the RECIST-slice, but degrades to 55% mDICE when the off-slice distance rises to 4. This is mainly because the 3D RECIST approximation is often not a robust and accurate estimation across slices. In contrast, the HNN trained using RECIST-slices generalizes well to large slice offsets, retaining a reasonable mDICE even at the largest offset distances. However, performance is further improved at higher slice offsets when using self-paced learning with 3 axial slices, i.e., WSSS (mid), and even further when using the full self-paced learning with 5 axial slices, i.e., WSSS. These results demonstrate the value of our self-paced learning approach in generalizing beyond 2D RECIST-slices to full 3D segmentations. Fig. 5 further demonstrates the model improvements from self-paced learning using precision-recall curves. Unsurprisingly, the WSSS schemes do not provide much improvement on the 2D RECIST-slices; however, significant improvements are garnered when considering the full 3D segmentations, again demonstrating the benefits of WSSS for achieving clinically useful volumetric lesion segmentations. The final 3D segmentation results are tabulated in Table 5.

Method Bone Abdomen Mediastinum Liver Lung Kidney Soft-Tissue Pelvis Total
GrabCut-3D 0.221±0.13 0.294±0.23 0.268±0.15 0.358±0.18 0.192±0.18 0.340±0.16 0.364±0.15 0.234±0.16 0.292±0.18
GrabCut-3DE 0.654±0.08 0.628±0.20 0.693±0.15 0.697±0.15 0.667±0.14 0.747±0.15 0.726±0.13 0.580±0.14 0.675±0.16
HNN 0.666±0.11 0.766±0.12 0.745±0.11 0.768±0.07 0.742±0.15 0.777±0.07 0.791±0.08 0.736±0.08 0.756±0.11
WSSS 0.685±0.10 0.766±0.14 0.776±0.10 0.773±0.06 0.757±0.15 0.800±0.06 0.780±0.10 0.728±0.09 0.762±0.11
WSSS-GC 0.683±0.12 0.774±0.15 0.771±0.07 0.765±0.08 0.773±0.15 0.800±0.08 0.787±0.10 0.722±0.10 0.764±0.11
Table 5: Category-Wise Lesion 3D Segmentation Comparison. GrabCut-3D and GrabCut-3DE denote GrabCut initialized with the 2D RECIST and the estimated 3D RECIST, respectively. WSSS is the HNN self-paced with 5 axial slices, and WSSS-GC denotes WSSS with GrabCut post-processing. mDICE (±stdv.) scores are presented across all lesion categories.


Figure 6: Visual examples of accepted and rejected lesion segmentations from the subjective user study. The first row presents good cases that were accepted by the observer. The second row shows failed cases (among the rejected segmentations). The red curves delineate segmented lesion boundaries. Under-segmented lesion regions are indicated by yellow arrows and over-segmented healthy tissue areas by blue arrows. Best viewed in color.

Ablation Study of Stacked-GAN: When we train the HNN using two input channels of the original CT and super-resolution (SR) images, there are significant improvements in the precision, AVD, VS, and mDICE metrics (see Table 3). In particular, the AVD score drops from 0.230 to 0.189, likely implying better lesion boundary delineation. Category-wise 2D lesion segmentation results are reported in Table 4. The SGAN-assisted HNN (HNN-SR) improves over the standard HNN, and, with GrabCut post-processing, the best mDICE score of 0.93 is achieved.

4.1 Subjective Clinical Acceptance Evaluation

The main motivation and goal of this work is to explore the feasibility of replacing and converting the current 2D RECIST-diameter-based lesion measurements into 3D volumetric scores. Consequently, it is critical to evaluate clinicians’ subjective feedback on the acceptance rate of our WSSS-produced lesion segmentations. We measure the acceptance rate of a US board-certified radiologist with over 25 years of clinical practice experience under two different scenarios.

1) The 1,000 testing RECIST-measured lesion CT images (overlaid with manual or automatic segmentation masks) are randomly displayed to the experienced radiologist, who judges whether to accept the segmentation or not. The acceptance rate for WSSS segmentations is 92.4%, which is close to the acceptance rate of the manually segmented masks (which were produced by 2 trainees under the supervision of a radiologist). Examples of accepted and rejected automatic segmentations are displayed in Fig. 6.

2) We show both the manual and the WSSS-computed lesion segmentation results simultaneously, in shuffled order, to the radiologist and let him pick the preferable result (i.e., the human judge does not know which segmentation mask is the manual ground truth). Consequently, the radiologist picks 72 WSSS-over-manual segmentation cases, designates another 645 instances as inseparable in segmentation quality, and marks 269 as manual-over-WSSS.

Figure 7: Volume Measurements: Volume changes measured using manual segmentations are plotted against those from WSSS segmentation and RECIST-based estimation. Lines of best fit are also rendered.

Finally, we also measure how well the WSSS method can track lesion changes over time. Toward this end, pairs of lesion volumes with follow-ups are selected from our test dataset. They are used to evaluate the volume changes at two time points via manual segmentation, RECIST measurements, and the proposed WSSS segmentation. For RECIST, the lesion volume is estimated as the volume of an ellipsoid defined by the long and short axes. In Fig. 7, the volume changes of the follow-up cases are plotted. Manual and computerized segmentation results correlate well, whereas the ellipsoid RECIST estimations tend to report smaller-than-actual volume changes.
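As a small worked example, the ellipsoid estimate from the two RECIST diameters can be computed as below; treating the unmeasured third axis as equal to the short axis is our assumption of the usual convention, not necessarily the paper's exact formula.

```python
import math

def recist_ellipsoid_volume(long_axis_mm, short_axis_mm):
    """Ellipsoid volume estimate from RECIST diameters: V = pi/6 * l * s * s.

    Assumes the unmeasured third axis equals the short axis, the usual
    convention when only two diameters are available.
    """
    return math.pi / 6.0 * long_axis_mm * short_axis_mm * short_axis_mm

# Example: relative volume change between two time points from RECIST alone.
v_baseline = recist_ellipsoid_volume(32.0, 21.0)
v_followup = recist_ellipsoid_volume(25.0, 17.0)
relative_change = (v_followup - v_baseline) / v_baseline
```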

5 Discussions & Conclusion

We present a simple yet surprisingly effective weakly supervised deep segmentation approach that turns massive amounts of RECIST-based lesion diameter measurements (retrospectively stored in hospitals’ digital repositories) into full 3D lesion volume segmentations and measurements. The radiologist’s zooming practice when marking lesion diameters on CT images is emulated via our proposed stacked GAN models for image denoising and super-resolution. Importantly, our approach does not require pre-existing RECIST measurements when processing new cases.

Our method is fully automatic and learned from a large quantity of partially-labeled clinical annotations. The lesion segmentation results are validated through both quantitative evaluation (a mean DICE of 93% on RECIST-slices and 76% for 3D lesion volume segmentation) and a subjective user study. We demonstrate that our self-paced learning improves performance over state-of-the-art CNNs. Moreover, we demonstrate how leveraging weakly supervised, but large-scale, data allows us to outperform fully-supervised approaches that can only be trained on the subsets where full masks are available. Our 3D lesion segmentation also produces more accurate estimations of lesion volume changes than the RECIST criteria. Our work is potentially of high importance for automated and large-scale tumor volume measurement/management in the domain of precision quantitative radiology imaging.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [2] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
  • [3] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-fei. What’s the point : Semantic segmentation with point supervision. In ECCV, pages (7)549–565, 2016.
  • [4] J. Cai, L. Lu, Y. Xie, F. Xing, and L. Yang. Improving deep pancreas segmentation in CT and MRI images via recurrent neural contextual learning and direct loss function. MICCAI, abs/1707.04912, 2017.
  • [5] R. Dahl, M. Norouzi, and J. Shlens. Pixel recursive super resolution. arXiv:1702.00783, 2017.
  • [6] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1635–1643, Dec 2015.
  • [7] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Trans. on Pat. Anal. and Mach. Intell., 38(2):295–307, 2016.
  • [8] E. Eisenhauer, P. Therasse, J. Bogaerts, L. Schwartz, D. Sargent, R. Ford, J. Dancey, S. Arbuck, S. Gwyther, M. Mooney, L. Rubinstein, L. Shankar, L. Dodd, R. Kaplan, D. Lacombe, and J. Verweij. New response evaluation criteria in solid tumours: revised recist guideline (version 1.1). Eur. J. Cancer, pages 45:228–247, 2009.
  • [9] A. P. Harrison, Z. Xu, K. George, L. Lu, R. M. Summers, and D. J. Mollura. Progressive and multi-path holistically nested neural networks for pathological lung segmentation from CT images. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2017 - 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part III, pages 621–629, 2017.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In IEEE ICCV, pages 1026–1034, 2015.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE CVPR, pages 770–778, 2016.
  • [12] S. Hong, D. Yeo, S. Kwak, H. Lee, and B. Han. Weakly supervised semantic segmentation using web-crawled videos. In IEEE CVPR, pages 7322–7330, 2017.
  • [13] G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger. Densely connected convolutional networks. In IEEE CVPR, pages 4700–4708, 2017.
  • [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • [15] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711, 2016.
  • [16] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [17] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
  • [18] J. Kim, J. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1645, 2016.
  • [19] P. Krähenbühl and V. Koltun. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NIPS, pages 1–9, 2012.
  • [20] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems 23, pages 1189–1197. 2010.
  • [21] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. arXiv preprint arXiv:1704.03915, 2017.
  • [22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv:1609.04802, 2016.
  • [23] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3159–3167, June 2016.
  • [24] I. Nogues, L. Lu, X. Wang, H. Roth, G. Bertasius, N. Lay, J. Shi, Y. Tsehay, and R. M. Summers. Automatic Lymph Node Cluster Segmentation Using Holistically-Nested Neural Networks and Structured Optimization in CT Images, pages 388–397. Springer International Publishing, Cham, 2016.
  • [25] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari. Extreme clicking for efficient object annotation. CoRR, abs/1708.02750, 2017.
  • [26] G. Papandreou, L. C. Chen, K. P. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1742–1750, Dec 2015.
  • [27] J. Pont-Tuset, P. Arbeláez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, Jan 2017.
  • [28] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
  • [29] H. Roth, L. Lu, A. Seff, K. Cherry, J. Hoffman, J. Liu, S. Wang, E. Turkbey, and R. Summers. A new 2.5d representation for lymph node detection using random sets of deep convolutional neural network observations. In MICCAI, pages 520–527, 2014.
  • [30] H. R. Roth, L. Lu, A. Farag, A. Sohn, and R. M. Summers. Spatial aggregation of holistically-nested networks for automated pancreas segmentation. MICCAI, 9901 LNCS:451–459, 2016.
  • [31] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23(3), pages 309–314. ACM, 2004.
  • [32] M. S. Sajjadi, B. Schölkopf, and M. Hirsch. Enhancenet: Single image super-resolution through automated texture synthesis. In arXiv:1612.07919, 2016.
  • [33] A. Seff, L. Lu, K. Cherry, H. Roth, J. Liu, S. Wang, E. Turkbey, and R. Summers. 2D view aggregation for lymph node detection using a shallow hierarchy of linear classifiers. In MICCAI, pages 544–552, 2014.
  • [34] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE CVPR, pages 1874–1883, 2016.
  • [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [36] A. A. Taha and A. Hanbury. Metrics for evaluating 3d medical image segmentation: analysis, selection, and tool. BMC Medical Imaging, 15(1):29, Aug 2015.
  • [37] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In IEEE CVPR, 2017.
  • [38] T. Tong, G. Li, X. Liu, and Q. Gao. Image super-resolution using dense skip connections. In IEEE CVPR, pages 4799–4807, 2017.
  • [39] Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, J. Feng, Y. Zhao, and S. Yan. Stc: A simple to complex framework for weakly-supervised semantic segmentation. In IEEE Trans. on Pat. Anal. and Mach. Intell., pages 2314–2320, 2017.
  • [40] S. Xie and Z. Tu. Holistically-nested edge detection. In IEEE ICCV, pages 1395–1403, 2015.
  • [41] S. Xie and Z. Tu. Holistically-nested edge detection. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1395–1403, Dec 2015.
  • [42] K. Yan, X. Wang, L. Lu, and R. M. Summers. Deeplesion: Automated deep mining, categorization and detection of significant radiology image findings using large-scale clinical lesion annotations. In arXiv:1710.01766, 2017.

Supplementary Material


Figure 8: Examples of our stacked GAN based super-resolution approach on lesion images. The first row illustrates the zoomed images using bi-cubic interpolation on the original CT lesion images; the second and third rows demonstrate the corresponding intermediate image de-noising outputs (after the first GAN) and final super-resolution results (after the second GAN), respectively.

Stacked GAN: In [22], generative adversarial networks (GAN) have been successfully used for natural-image super resolution, which produces high-quality images with more low-level visual details and edges from their low-resolution counterparts. For lesion segmentation, if we can improve the visual clarity and contrast of image edges of lesions, the segmentation performance and accuracy may be subsequently improved. As such, our work is partially inspired by [22]. As shown in the first row of Fig. 8, CT images are often noisy and suffer from low contrast due to radiation dosage limits. Directly applying GAN-based super resolution methods on such images generates undesirable visual artifacts and edges that are harmful for lesion segmentation accuracy. To address this problem, the CT imaging noise needs to be reduced before super resolution or spatial zooming. It is challenging to train a single GAN model which directly outputs high resolution images with high visual quality (e.g., clear object-level boundaries from noisy images) from the original lesion CT images. Therefore our proposed stacked GAN operates in two stages, breaking the CT-image super resolution process into two sub-tasks: denoising followed by spatial zooming with enhancement.

Given a CT lesion image (as shown in the first row of Fig. 8), we first generate a denoised version of the input image by employing our first GAN model (consisting of a generator and a discriminator) that focuses on removing random image noise. The denoised image has the same size as the input. Although the noise has been reduced in the generated image (as demonstrated in the second row of Fig. 8), lesions have blurry edges and the imaging contrast between lesion and background regions is generally low. However, clear edges and high contrast are desirable and important for achieving high-precision lesion segmentation results. As well, a considerable number of lesions are quite small (10 mm or less, i.e., fewer than 10 pixels along their long-axis diameters), and human observers typically apply zooming (via commercial clinical PACS workstations). If we intend to develop an effective CNN model that learns discriminative imaging features for lesion segmentation, image resolutions should be sufficiently higher than the physical CT imaging resolution (approximately 1 mm per pixel). To solve this issue, our second GAN model, which also contains a generator and a discriminator, is built upon the output of the first GAN to produce a high-resolution version (as illustrated in the third row of Fig. 8). This high-resolution image provides both clear lesion boundaries and high contrast. Since the three resulting images, i.e., the original, denoised, and high-resolution variants, may contain complementary information as a triplet, we compose them into a three-channel image that is fed into the next lesion segmentation stage.

We adapt similar architectures as [22] for the generators and discriminators. In [22], the generator has 16 identical residual blocks and 2 sub-pixel convolutional layers [34], which are used to increase the resolution. Each block contains two convolutional layers with 64 kernels, each followed by batch-normalization [14] and ParametricReLU [10] layers. For a trained model, the method of [22] can only enlarge the input image by fixed amounts. In the DeepLesion dataset, lesion sizes vary considerably, so different lesions have to be enlarged with correspondingly different zooming factors. Therefore, the sub-pixel layers are removed in the high-resolution generator of our stacked GAN model. Because denoising is an easier subtask, a simpler architecture that contains just 9 identical residual blocks is designed for the denoising generator. Both generators are fully convolutional neural networks and can take input images of arbitrary sizes. For the discriminators, we use the same architecture as [22], which consists of 8 convolutional layers with 3×3 kernels and LeakyReLU activations, and two densely connected layers followed by a final sigmoid layer. The numbers of kernels for the 8 convolutional layers increase from 64 to 512, with strided convolutions used to progressively reduce the spatial resolution.
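A rough tf.keras sketch of such a generator is given below; the 9×9 outer kernels, the global skip connection, and the tanh output follow the SRGAN reference design [22] and are assumptions rather than this paper's verified implementation details.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Residual block: two 3x3 convs with batch norm and PReLU, plus a skip connection."""
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.PReLU(shared_axes=[1, 2])(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([x, y])

def build_generator(n_blocks, filters=64):
    """Fully convolutional generator without sub-pixel layers, so the output keeps
    the input size and arbitrary input shapes are supported (n_blocks = 9 for the
    denoising generator, 16 for the enhancement generator in this sketch)."""
    inp = layers.Input(shape=(None, None, 1))
    x = layers.Conv2D(filters, 9, padding="same")(inp)
    x = layers.PReLU(shared_axes=[1, 2])(x)
    skip = x
    for _ in range(n_blocks):
        x = residual_block(x, filters)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, skip])                       # global skip connection
    x = layers.Conv2D(1, 9, padding="same", activation="tanh")(x)
    return tf.keras.Model(inp, x)
```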

Figure 9: 3D RECIST Estimation: L-AXIS-OFFSET-0/1/2 present the lengths of the (estimated) long axes on the RECIST/offset-1/offset-2 slices, and S-AXIS-OFFSET-0/1/2 present the length measurements of their corresponding (estimated) short axes.

3D RECIST Estimation: As described in the main text, the estimated RECIST of off-RECIST-slices is projected from the actual RECIST diameters via Pythagorean theorem using physical Euclidean distance. In Fig. 9, we visualize the process of RECIST projection, where the length of the long/short axis on the off-RECIST-slice is calculated based on the long/short axis ratio of the actual RECIST diameters. Meanwhile, the intersection of each pair of long and short axes is fixed to the same position as the intersection of the actual RECIST.

Method Recall Precision mDICE
Deep Lesion RECIST-Slice (Testing Images)
RECIST 0.39±0.13 0.92±0.14 0.53±0.14
DCRF 0.72±0.26 0.90±0.15 0.77±0.20
GrabCut (bbox only) 0.62±0.46 0.68±0.44 0.62±0.46
GrabCut (bbox + interior FG) 0.94±0.11 0.81±0.16 0.86±0.11
GrabCut-R 0.94±0.10 0.89±0.10 0.91±0.08
GrabCut-R-SR 0.94±0.11 0.90±0.10 0.91±0.09
Table 6: Methods to Generate Training Labels from RECIST: pixel-wise precision, recall, and mean DICE (mDICE) are reported with standard deviation (±stdv.). Six different setups are compared: 1) RECIST: dilated RECIST, 2) DCRF: dense CRF, 3) GrabCut (bbox only): uses only the RECIST bbox, 4) GrabCut (bbox + interior FG): uses the bbox and an interior foreground, 5) GrabCut-R: uses the bbox and the dilated RECIST, 6) GrabCut-R-SR: applies GrabCut-R on the super-resolution images. See Sec. 3.1 and the experiment settings for details.

Label Map Initializations: We also consider generating the initial training label maps from the images after super-resolution. As shown in Table 6, the row of GrabCut-R-SR presents recall, precision, and mDICE of the training labels produced from applying GrabCut-R on the super-resolution images. However, we observe that the differences between results produced by GrabCut-R or GrabCut-R-SR are not significant. This may be because GrabCut [31] cannot fully make use of the augmented image information compared to deep CNN models. Therefore we only use the original CT images to generate training label maps in the main submission.

Figure 10: Volume Segmentation with Offsets: The x-axis represents the voxel distance of axial slices to the RECIST-slice. GrabCut-3DE is GrabCut segmentation performed with the 3D RECIST estimation. HNN is trained on the RECIST-slice only. WSSS-3/WSSS-5/WSSS-7 is the HNN self-paced with 3, 5, and 7 axial slices, respectively.

[Left: RECIST-Slice PR-Curve. Right: Lesion Volume PR-Curve.]

Figure 11: Dataset Precision-Recall: The precision and recall curves of the CNN output on 1,000 testing RECIST-slices and 200 lesion volumes. HNN, WSSS-3, WSSS-5, and WSSS-7 share the same definitions as in Fig. 10, and their F-measures are presented in brackets in the legend.

3D Segmentation with WSSS: More quantitative evaluations of the proposed weakly supervised self-paced segmentation (WSSS) are presented in Fig. 10 and Fig. 11. These results are based on HNN lesion segmentation models that are trained on the RECIST-slice and under the WSSS-3/5/7 settings, meaning that the HNN model is trained in a self-paced manner with 3, 5, and 7 axial slices, respectively. Limited improvement is observed from WSSS-7 over WSSS-5 in Fig. 10, which indicates the convergence of model training. Therefore, we report the outputs of WSSS-5 as the best 3D segmentation results in the main submission.

Figure 12: GUI for the subjective acceptance rate study with human readers: The GUI displays the RECIST-slice, the zoomed segmentation, and the RECIST marks, from left to right. The input interface is: key ’f’ to accept, key ’j’ to reject, key ’,’ to go back, key ’Esc’ to quit, and key ’h’ for help.
Figure 13: GUI for the visual comparison study: The GUI displays the RECIST-slice and the zoomed RECIST marks on the left and right of the first row, respectively. On the bottom row, the GUI displays the automatic segmentation and the manual annotation in a randomly shuffled order. The input interface is: key ’f’ to select left, key ’j’ to select right, key ’b’ to select both, key ’n’ to deny both, key ’,’ to go back, key ’Esc’ to quit, and key ’h’ for help.

Subjective Study: GUI snapshots of our subjective study are displayed in Fig. 12 and Fig. 13. Of note, we allow the radiologist to reject both the manual and the automatic segmentation in the visual comparison study, which explains why the sum of WSSS-over-manual (72), manual-over-WSSS (269), and inseparable (645) is 986, less than 1,000. In our study, 14 pairs of WSSS and manual segmentation results were both rejected.

RECIST-Slice Segmentation with Super-Resolution CT Images: Finally, in Fig. 14, we compare the results of RECIST-slice segmentation with and without super-resolution image augmentation via our proposed stacked GAN models. We find that, in the abdomen and soft-tissue lesion categories, the super-resolution images preserve sharper lesion boundaries than the original CT images, and their corresponding segmentation results indeed delineate lesion boundaries more accurately than those obtained from the raw CT images.


Figure 14: Lesion segmentation comparison on RECIST-Slices with original images versus super-resolution images. 11 examples are presented where each example consists of 4 sub-images. In each 4 sub-image group, the top two images are the pair of original CT image and its corresponding HNN segmentation, and the bottom two images are the combined (original+denoised+enhanced) SR-image and the corresponding HNN segmentation. The manual and automatic segmentation are delineated in green and red curves, respectively. Best viewed in color with zooming.