Quantitative assessment of changes in lesion/tumor growth remains a challenging problem in medical imaging and precision medicine. In the clinical setting, radiologists must manually assess tumor size using the response evaluation criteria in solid tumors (RECIST), which requires them to find and annotate the region with roughly the greatest cross-sectional area. This process is subjective and lacks the consistency needed for reliable diagnosis. The ability to directly compute lesion segmentations would therefore be valuable for planning treatments as well as for tracking and recording lesion growth rates. Because full manual segmentation is costly and time-consuming, RECIST is used as the default clinical standard throughout hospitals. As a result, a substantial number of RECIST measurements corresponding to patient CT scans are stored in hospitals' picture archiving and communication systems (PACS). Inspired by the successes of deep learning in recent years, many medical image analysis applications [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] have utilized deep learning to improve their performance, including lesion detection [6, 7], segmentation [3, 4], and RECIST estimation [2], some using RECIST measurements as supervision. In [12], we were the first to leverage existing RECIST diameters as weak supervision for a segmentation method that employs an attention-based co-segmentation design to obtain the final lesion masks. In this paper, we build upon [12] by modifying our architecture to obtain and preserve higher-resolution feature maps, and by experimenting with different training methods and attention mechanisms to improve performance.
In [3], GrabCut [13] is used to create initial lesion segmentations on RECIST slices for training a holistically-nested network that generates the final masks. However, it has been shown that co-segmentation [12, 14], the task of jointly segmenting common objects in a pair of images, can leverage similarities in appearance, background, and semantic information to produce more accurate segmentations. Inspired by this intuition, this work leverages RECIST diameters as weak supervision for a convolutional-neural-network-based co-segmentation method that uses attention to obtain refined lesion masks. Because lesions vary significantly in size and appearance, we use lesion embeddings [25] to cluster the lesions into 200 classes prior to training the model.
2 Our Lesion Co-Segmentation Method
In this work, we present a weakly-supervised co-segmentation approach to generate 2D lesion segmentations. Our two-tiered approach is shown in Fig. 1. First, initial lesion masks are generated using RECIST markers. We then train a robust attention-based co-segmentation model on these masks and use it to produce the final segmentations. RECIST diameters (indicated by the purple cross in Fig. 1) act as the weakly-supervised training data. The details are described below.
2.1 Pseudo-mask Generation
The NIH DeepLesion dataset [15] provides only the RECIST annotation for each lesion region, owing to the high cost of manually generating full lesion segmentation masks. It is therefore necessary to create pseudo-masks from the RECIST annotations using unsupervised learning methods, which can then be used to train a supervised model. A popular method for this task is GrabCut [13], which is first initialized with seeds for the image foreground and background regions and then uses iterative energy minimization to produce initial masks. Following [3], we leverage RECIST-slice information and compute the GrabCut seeds from the RECIST markers to obtain initial lesion masks (as shown in the top part of Fig. 1). We then use these masks as training data for the co-segmentation model.
2.2 Lesion Co-segmentation
After the initial lesion masks are generated, we train the co-segmentation model for lesion segmentation. The bottom part of Fig. 1 shows a diagram of how the model is trained using the initial masks. We adopt the co-segmentation model from [16]. The model consists of a Siamese encoder and decoder and utilizes an attention module in the bottleneck layers. Given a pair of images as input, the model produces a mask for each lesion, which is then fed to a densely-connected conditional random field [17] (DCRF) to obtain the final masks.
A channel-attention module and a spatial-attention module comprise our attention mechanism. The right side of Fig. 1 shows the channel-spatial attention mechanism (CSA). We take the feature maps from the encoder network and apply the attention modules to emphasize common semantic information and suppress the rest. By preserving channels that contribute the most to both feature maps, the channel-attention module enhances shared information. The spatial-attention module captures spatial details in the feature maps.
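A minimal PyTorch sketch of this idea follows, assuming an SE-style excitation computed from the pooled descriptors of both feature maps so that channels active in both images are emphasized, plus a mean spatial attention map per image; the layer sizes and the joint-pooling detail are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PairCSA(nn.Module):
    """Channel-spatial attention for a feature-map pair (sketch).

    Channel attention (CA = SE): an SE-style excitation driven by the
    averaged global descriptors of both maps, so shared channels win.
    Spatial attention (SA = MSA): the channel-wise mean at each location.
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, fa, fb):
        # joint squeeze: average both global descriptors so the excitation
        # weights favor channels that contribute to *both* feature maps
        z = 0.5 * (fa.mean(dim=(2, 3)) + fb.mean(dim=(2, 3)))
        w = self.fc(z)[:, :, None, None]             # (B, C, 1, 1)
        fa, fb = fa * w, fb * w                      # channel attention
        # mean spatial attention: average over channels, per image
        sa_a = torch.sigmoid(fa.mean(dim=1, keepdim=True))
        sa_b = torch.sigmoid(fb.mean(dim=1, keepdim=True))
        return fa * sa_a, fb * sa_b
```

The attended pair is then passed on to the (shared-weight) decoder branches.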
For our encoder, we use a dilated ResNet-101 [18], creating feature maps with an output stride of 8 (DRN-101). Thus, we attain higher-resolution feature maps that retain greater context. We also examine a multi-grid strategy for the encoder (DRN-MG-101) in order to extract feature maps at different scales. For our decoder, we experiment with two models. The first is adopted from [16] with slight modifications (i.e., fewer convolutional layers; D1), while the second is adopted from [20] and takes in both the attended and lower-level feature maps to create a more robust segmentation mask (D2). Please refer to [16] and [20] for additional details. When using the improved decoder (D2), we refrain from using the DCRF, as it degrades performance.
Table 1. Lesion segmentation performance under different training strategies (the original decoder, D1, is used for the co-segmentation models). Means and standard deviations are reported. Rec., Prec., Dice, VS: larger is better; AVD: smaller is better.

| Model | Adam | SGD | DCRF | Rec. | Prec. | Dice | AVD | VS |
|---|---|---|---|---|---|---|---|---|
| DRN-101 FCN-32 | | | | 0.917 ± 0.13 | 0.882 ± 0.11 | 0.893 ± 0.10 | 0.396 ± 1.39 | 0.938 ± 0.07 |
| DRN-101 CSA + D1 | ✓ | | | 0.915 ± 0.11 | 0.895 ± 0.10 | 0.898 ± 0.10 | 0.349 ± 1.20 | 0.942 ± 0.08 |
| DRN-101 CSA + D1 | | ✓ | | 0.918 ± 0.11 | 0.890 ± 0.10 | 0.898 ± 0.10 | 0.357 ± 1.18 | 0.942 ± 0.07 |
| DRN-101 CSA + D1 | ✓ | | ✓ | 0.899 ± 0.12 | 0.910 ± 0.10 | 0.898 ± 0.10 | 0.355 ± 1.22 | 0.943 ± 0.08 |
| DRN-101 CSA + D1 | | ✓ | ✓ | 0.907 ± 0.11 | 0.906 ± 0.10 | 0.901 ± 0.09 | 0.342 ± 1.21 | 0.946 ± 0.07 |
2.3 Attention Modules
Following [12], we use squeeze-and-excitation (SE) blocks [21] to obtain the channel attention vectors (CA = SE) and calculate the mean value at each spatial location over all feature maps to obtain the spatial attention maps (SA = MSA). To improve upon this, we experiment with three methods for obtaining channel and spatial attention maps. Introduced in [19], atrous spatial pyramid pooling (ASPP) provides robust multi-scale information; due to its ability to capture long-range contextual information through a series of atrous convolutions with varied rates, we investigate its performance as a spatial attention mechanism (SA = ASPP). Our second attention mechanism focuses on improving channel attention: following [22], we modify the SENet architecture by replacing the fully-connected layers with a 1D convolutional layer whose kernel size adapts to the channel dimension, to better capture cross-channel interactions (CA = ECA). The third attention mechanism is adopted from [23] and consists of a position attention module and a channel attention module (CSA = DANet); both modules utilize a similarity matrix to obtain a weighted sum of features across all positions and channels, respectively, which is combined with the original features. Please refer to [22] and [23] for additional details.
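The ECA-style channel attention can be sketched as below, using the adaptive kernel-size rule from [22]; the hyperparameters gamma and b follow the common defaults and are assumptions here.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention (CA = ECA), sketch.

    SE's fully-connected bottleneck is replaced by a 1-D convolution over
    the pooled channel descriptor; the kernel size k adapts to the number
    of channels via k ~ |log2(C)/gamma + b/gamma|, forced odd.
    """
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                 # odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        z = x.mean(dim=(2, 3))                    # global average pool: (B, C)
        w = self.conv(z.unsqueeze(1)).squeeze(1)  # local cross-channel mixing
        return x * torch.sigmoid(w)[:, :, None, None]
```

Compared with SE, this keeps only local cross-channel interactions, which drastically reduces the parameter count of the attention module.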
2.4 Implementation Details
All models are implemented in PyTorch [24]. Training is conducted with a batch size of 20 images for a total of two epochs of 12,000 iterations each. We experiment with two optimizers: Adam, with a weight decay of 0.0005 for L2 regularization, and SGD, with a weight decay of 0.0005 and a momentum of 0.99. For SGD, the initial learning rate is 0.01 and we decay it following a poly learning-rate policy, in which the initial learning rate is multiplied by (1 - iter/max_iter)^power after each iteration. Prior to training, we pad the lesion-mask images, resize them to 128×128, and normalize them.
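The poly policy can be expressed with a PyTorch LambdaLR scheduler stepped once per iteration; here max_iter follows the 2 × 12,000 iterations above, while power = 0.9 is the common DeepLab choice and is an assumption.

```python
import torch

# poly learning-rate policy: lr = base_lr * (1 - iter / max_iter) ** power
max_iter, power = 24000, 0.9                     # power = 0.9 is assumed
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.01, momentum=0.99, weight_decay=0.0005)
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda it: (1 - it / max_iter) ** power)

for it in range(3):                              # one step per iteration
    opt.step()
    sched.step()
print(opt.param_groups[0]["lr"])                 # decayed from 0.01
```

Because the decay is per iteration rather than per epoch, the learning rate falls smoothly to zero exactly at the end of training.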
| Model | Adam | SGD | DCRF | Rec. | Prec. | Dice | AVD | VS |
|---|---|---|---|---|---|---|---|---|
| DRN-MG-101 CSA | ✓ | | | 0.921 ± 0.11 | 0.890 ± 0.10 | 0.899 ± 0.09 | 0.350 ± 1.18 | 0.944 ± 0.07 |
| DRN-MG-101 CSA | ✓ | | ✓ | 0.907 ± 0.11 | 0.906 ± 0.10 | 0.901 ± 0.09 | 0.342 ± 1.21 | 0.946 ± 0.07 |
| DRN-MG-101 CSA | | ✓ | | 0.920 ± 0.11 | 0.898 ± 0.10 | 0.903 ± 0.09 | 0.321 ± 1.15 | 0.943 ± 0.07 |
| DRN-MG-101 CSA | | ✓ | ✓ | 0.905 ± 0.11 | 0.908 ± 0.09 | 0.901 ± 0.09 | 0.326 ± 1.16 | 0.945 ± 0.07 |
| Model | CA = SE | CA = ECA | SA = MSA | SA = ASPP | CSA = DANet | Rec. | Prec. | Dice | AVD | VS |
|---|---|---|---|---|---|---|---|---|---|---|
| DRN-MG-101 + D2 | ✓ | | ✓ | | | 0.920 ± 0.11 | 0.898 ± 0.10 | 0.903 ± 0.09 | 0.321 ± 1.15 | 0.943 ± 0.07 |
| DRN-MG-101 + D2 | | ✓ | ✓ | | | 0.924 ± 0.11 | 0.892 ± 0.10 | 0.902 ± 0.09 | 0.335 ± 1.16 | 0.941 ± 0.07 |
| DRN-MG-101 + D2 | | ✓ | | ✓ | | 0.926 ± 0.11 | 0.891 ± 0.10 | 0.902 ± 0.09 | 0.331 ± 1.16 | 0.941 ± 0.07 |
| DRN-MG-101 + D2 | ✓ | | | ✓ | | 0.921 ± 0.11 | 0.895 ± 0.10 | 0.902 ± 0.09 | 0.338 ± 1.17 | 0.943 ± 0.07 |
| DRN-MG-101 + D2 | | | | | ✓ | 0.921 ± 0.10 | 0.894 ± 0.10 | 0.902 ± 0.08 | 0.333 ± 1.15 | 0.944 ± 0.07 |
3 Experiments
3.1 Dataset and Evaluation Criterion
The NIH DeepLesion dataset [15] is used for performance evaluation; it is composed of 32,735 PACS CT lesion images annotated with RECIST long and short diameters, derived from 10,594 studies of 4,427 patients. After clustering the lesions into 200 classes using the lesion embeddings of [25], we split the dataset into 80% training, 10% validation, and 10% testing sets using stratified sampling. We pair the images within each cluster to build the co-segmentation dataset, resulting in 270,470 pairs for training, 28,136 pairs for validation, and 3,866 pairs for testing. We hold out 1,000 manually annotated segmentations for evaluation and report the recall, precision, Dice similarity coefficient, averaged Hausdorff distance (AVD), and volumetric similarity (VS) as quantitative metrics, computed pixel-wise with a publicly available segmentation evaluation tool [26].
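Pairing images within each cluster can be sketched as follows; this is a simplified illustration, and the actual pairing and sampling strategy may differ.

```python
import itertools
from collections import defaultdict

def build_pairs(items, labels):
    """Pair every two images that share a cluster label.

    `items` and `labels` are parallel lists; returns all unordered
    within-cluster pairs, mirroring how co-segmentation inputs are formed
    (a pair should depict lesions of similar appearance).
    """
    by_cluster = defaultdict(list)
    for item, lab in zip(items, labels):
        by_cluster[lab].append(item)
    return [pair for group in by_cluster.values()
            for pair in itertools.combinations(group, 2)]

# e.g. three lesions in cluster 0 give 3 pairs; a singleton cluster gives none
pairs = build_pairs(["a", "b", "c", "d"], [0, 0, 0, 1])
print(len(pairs))  # 3
```

Pairing only within clusters keeps the two inputs visually similar, which is what lets the attention module exploit shared semantics.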
3.2 Results and Analyses
We first train a fully-convolutional network (FCN) [27] with a dilated ResNet-101 backbone on the initial lesion masks to establish a baseline against which the efficacy of our co-segmentation approach can be judged. We then examine methods to improve feature extraction using a multi-grid strategy and analyze various mechanisms to enhance our attention modules' ability to capture inter-channel and inter-spatial information. The default attention mechanism, denoted CSA, uses CA = SE and SA = MSA. To create robust segmentations from the attended feature maps, we also experiment with the two previously mentioned decoder structures. Additionally, we explore how the SGD optimizer compares to the Adam optimizer for this task.
Quantitative Comparisons: First, Table 1 shows that co-segmentation with CSA improves upon the baseline DRN-101-backbone FCN with a 0.5% increase in Dice score, validating our approach. Furthermore, training with the SGD poly-learning-rate policy enhances performance compared to the Adam optimizer. The SGD-optimized multi-grid model performs notably better than its Adam-trained counterpart, suggesting that although both optimizers reach low training error, adaptive methods may generalize worse on new data. Regarding encoder strategies, we find that the multi-grid strategy (DRN-MG-101) with dilation rates of 2, 4, and 8 improves performance for both training strategies, especially with the poly policy, where a 0.3% increase in Dice score is noted for the SGD-trained model. This indicates that extracting features at different scales is key to creating more robust segmentation masks. Additionally, the results in Table 2 validate that the improved decoder, which uses lower-level features, leads to more refined segmentations. For both decoders, DCRF post-processing gains significant precision in exchange for a sharp reduction in recall.
We experiment with three different attention mechanisms to obtain more precise attended features for improved segmentations. The results are shown in Table 3. First, the model using CSA has the highest performance, with a Dice score of 90.3%. Comparatively, the newly introduced attention mechanisms show minor decreases in performance. This may be because the pooling-based CSA has fewer parameters to fit and thus overfits less on the training set, giving better generalization. The CSA model also has the lowest averaged Hausdorff distance (0.321), revealing the strength of this attention mechanism in capturing precise boundary information. Among the alternatives, the efficient channel attention achieves the next-best performance, particularly when combined with the ASPP spatial attention. Examining the DANet channel-spatial attention, we find no significant performance gains over the pooling-based methods for extracting feature attention.
Qualitative Comparisons: Fig. 2 shows qualitative results for the base DRN-101 FCN-32 along with three co-segmentation models using the different strategies. The initial segmentations from the fully-convolutional network contain significantly more false-positive pixels than those generated by the other models, demonstrating the strength of the co-segmentation models in obtaining more precise boundaries by utilizing joint semantic information. Precision is further improved by the multi-grid approach, which can be attributed to extracting feature maps at different rates to produce finer segmentation details. The improved decoder obtains predictions closest to the ground truth, indicating that lower-level features for each image help the model utilize independent semantic information and suppress possible interference in the attended feature maps.
4 Conclusions
In this work, a weakly-supervised attention-based deep co-segmentation approach is proposed to obtain accurate lesion segmentations from RECIST measurements. GrabCut is first used to generate initial lesion masks from the RECIST diameters. We then leverage lesion embeddings to cluster lesions for co-segmentation. Our quantitative and qualitative results show that co-segmentation, paired with an efficient training strategy, achieves promising results for weakly-supervised lesion segmentation. This model architecture has the potential to be employed in a slice-propagated manner to generate fully volumetric lesion segmentations.
Acknowledgments. This research was supported by the Intramural Research Program of the National Institutes of Health Clinical Center and by the Ping An Insurance Company through a Cooperative Research and Development Agreement. We thank Nvidia for GPU card donation.
References
- [1] D. Jin, et al., "CT-realistic lung nodule simulation from 3D conditional generative adversarial networks for robust lung segmentation," in MICCAI, 2018, pp. 732–740.
- [2] Y. Tang, et al., "Semi-automatic RECIST labeling on CT scans with cascaded convolutional neural networks," in MICCAI, 2018, pp. 405–413.
- [3] J. Cai, et al., "Accurate weakly-supervised deep lesion segmentation using large-scale clinical annotations: Slice-propagated 3D mask generation from 2D RECIST," in MICCAI, 2018, pp. 396–404.
- [4] Y. Tang, et al., "CT image enhancement using stacked generative adversarial networks and transfer learning for lesion segmentation improvement," in MLMI, 2018, pp. 46–54.
- [5] Y. Tang, et al., "XLSor: A robust and accurate lung segmentor on chest x-rays using criss-cross attention and customized radiorealistic abnormalities generation," in MIDL, 2019, pp. 457–467.
- [6] K. Yan, et al., "MULAN: Multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation," in MICCAI, 2019, pp. 194–202.
- [7] Y. Tang, et al., "ULDor: A universal lesion detector for CT scans with pseudo masks and hard negative example mining," in ISBI, 2019, pp. 833–836.
- [8] Y. Tang, et al., "CT-realistic data augmentation using generative adversarial network for robust lymph node segmentation," in SPIE MI: CAD, 2019, vol. 10950, p. 109503V.
- [9] Y. Tang, et al., "TUNA-Net: Task-oriented unsupervised adversarial network for disease recognition in cross-domain chest x-rays," in MICCAI, 2019, pp. 431–440.
- [10] Y. Tang, et al., "Abnormal chest x-ray identification with generative adversarial one-class classifier," in ISBI, 2019, pp. 1358–1361.
- [11] Y. Tang, et al., "Deep adversarial one-class learning for normal and abnormal chest radiograph classification," in SPIE MI: CAD, 2019, vol. 10950, p. 1095018.
- [12] V. Agarwal, et al., "Weakly-supervised lesion segmentation on CT scans using co-segmentation," in SPIE MI: CAD, 2020, pp. 1–6.
- [13] C. Rother, et al., "GrabCut: Interactive foreground extraction using iterated graph cuts," ACM TOG, vol. 23, pp. 309–314, 2004.
- [14] C. Rother, et al., "Cosegmentation of image pairs by histogram matching-incorporating a global constraint into MRFs," in CVPR, 2006, pp. 993–1000.
- [15] K. Yan, et al., "DeepLesion: Automated mining of large-scale lesion annotations and universal lesion detection with deep learning," JMI, vol. 5, no. 3, p. 036501, 2018.
- [16] H. Chen, et al., "Semantic aware attention based deep object co-segmentation," in ACCV, 2018, pp. 435–450.
- [17] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in NeurIPS, 2011, pp. 109–117.
- [18] K. He, et al., "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
- [19] L. Chen, et al., "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE TPAMI, vol. 40, no. 4, pp. 834–848, 2017.
- [20] L. Chen, et al., "Encoder-decoder with atrous separable convolution for semantic image segmentation," in ECCV, 2018, pp. 801–818.
- [21] J. Hu, et al., "Squeeze-and-excitation networks," in CVPR, 2018, pp. 7132–7141.
- [22] Q. Wang, et al., "ECA-Net: Efficient channel attention for deep convolutional neural networks," arXiv preprint arXiv:1910.03151, 2019.
- [23] J. Fu, et al., "Dual attention network for scene segmentation," in CVPR, 2019, pp. 3146–3154.
- [24] A. Paszke, et al., "Automatic differentiation in PyTorch," in NeurIPS Autodiff Workshop, 2017.
- [25] K. Yan, et al., "Deep lesion graphs in the wild: Relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database," in CVPR, 2018, pp. 9261–9270.
- [26] A. Taha and A. Hanbury, "Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool," BMC MI, vol. 15, no. 1, p. 29, 2015.
- [27] J. Long, et al., "Fully convolutional networks for semantic segmentation," in CVPR, 2015, pp. 3431–3440.