As clinical protocols use the response evaluation criteria in solid tumors (RECIST) in computed tomography (CT) images for cancer patient monitoring, many hospitals’ picture archiving and communication systems (PACS) store a vast number of lesion diameter measurements paired with corresponding CT images. These markers are determined by a radiologist who, for each lesion, selects the axial slice on which the lesion appears largest. The lesion’s diameters are then measured along both the major and minor axes.
RECIST markers are imperfect: the measurements are relatively subjective and can be flawed by variation in slice selection or between different observers. Such inconsistencies are problematic because they make it more difficult to properly measure tumor growth over time, thereby impeding treatment options for patients. With precise measurements, follow-up and quantitative analysis of tumor extents could be performed accurately, providing valuable information for treatment planning and tracking. However, performing full volumetric lesion measurements is highly costly and time-consuming. This establishes the need for computational methods to automate the task.
Furthermore, there are significant differences between lesions with respect to size, shape, and appearance, which make segmentation more difficult. To address this issue, we employ a co-segmentation architecture that extracts joint semantic information from a pair of CT scans to achieve higher accuracy. To ensure that the paired lesions are similar in appearance and background, we cluster the lesions into 200 groups, from which pairs are then extracted. Thus, we leverage existing RECIST diameters as weak supervision for a convolutional neural network based weakly-supervised segmentation method that employs attention-based co-segmentation to obtain refined lesion masks. To the best of our knowledge, this is the first work to utilize a co-segmentation strategy for lesion segmentation on CT scans.
2 Related Works
Despite the recent success of convolutional neural network (CNN) methods on a variety of tasks, they require extensive amounts of densely-annotated training data, especially for segmentation tasks. Weakly-supervised learning methods avoid this issue by training on weak annotations, such as bounding boxes around the objects of interest or scribbles over the foreground. These are used to produce initial labels that a CNN model can then train on and improve upon, which makes it essential that proper labels can be extrapolated from the weak annotations. Many popular techniques exist for this task, such as GrabCut and MCG. Previous work on weakly-supervised lesion segmentation applied GrabCut to generate the initial masks from the RECIST slices, which acted as weak supervision.
Object co-segmentation is the task of jointly segmenting common objects from a pair of images. Vicente et al. showed that this approach achieves better performance than segmenting the objects from each image independently. While many works utilized hand-crafted features such as color histograms, SIFT descriptors, and HOG features, these were not sufficient for more difficult cases with greater variation between the images. Recently, new techniques employing deep learning have improved accuracy over these methods. Mukherjee et al. first developed a deep learning strategy that used a Siamese network to obtain feature vectors for each image. They then used the ANNOY library to measure the distance between the vectors and generated a collage of masks using a fully-convolutional network to segment similar object proposals. This was improved upon by Li et al., who developed a Siamese encoder-decoder network that applied a mutual correlation layer to capture and emphasize similarities between the image features. Recently, Chen et al. modified this architecture by replacing the correlation layer with channel-wise and spatial-wise attention to enhance common features and suppress varying ones.
In this work, we present a weakly-supervised segmentation approach to generate improved 2D lesion segmentations. Fig. 1 shows our two-tiered approach: we first generate initial lesion masks from RECIST measurements to serve as ground-truth data, and then train a robust attention-based co-segmentation model to produce the final refined segmentations. The RECIST diameters (indicated by the purple cross in Fig. 1(a)) act as the weakly-supervised training data. The details are described below.
3.1 Initial Lesion Mask Generation
GrabCut is a popular method that takes seeds for the image foreground and background regions and minimizes an objective energy function to generate segmentations. Following Cai et al., we utilize GrabCut to leverage RECIST-slice information: we compute the seeds from the RECIST annotations and obtain rough lesion masks (as shown in the top part of Fig. 1(a)). We then use these masks as training data for the co-segmentation model.
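The seed computation of Cai et al. is not reproduced here, but the general idea can be sketched: mark pixels along the two RECIST diameters as sure foreground, a dilated bounding box of the diameter endpoints as probable foreground, and everything else as background. A minimal NumPy sketch (the endpoint coordinates, the 10-pixel margin, and the label encoding are illustrative assumptions, and the actual GrabCut call against the CT slice is omitted):

```python
import numpy as np

# Hypothetical RECIST endpoints (row, col) for the major and minor axes.
MAJOR = ((40, 30), (80, 90))
MINOR = ((70, 40), (50, 80))

def rasterize_segment(mask, p0, p1, value):
    """Mark pixels along the segment p0 -> p1 (simple linear sampling)."""
    n = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    for t in np.linspace(0.0, 1.0, n):
        r = int(round(p0[0] + t * (p1[0] - p0[0])))
        c = int(round(p0[1] + t * (p1[1] - p0[1])))
        mask[r, c] = value

def recist_trimap(shape, major, minor, margin=10):
    """Build a GrabCut-style trimap: 0 = background, 2 = probable
    foreground (dilated bounding box), 1 = sure foreground (diameters)."""
    trimap = np.zeros(shape, dtype=np.uint8)
    pts = np.array(major + minor)
    r0, c0 = pts.min(axis=0) - margin
    r1, c1 = pts.max(axis=0) + margin
    trimap[max(r0, 0):r1 + 1, max(c0, 0):c1 + 1] = 2  # probable foreground
    rasterize_segment(trimap, *major, value=1)        # sure foreground seeds
    rasterize_segment(trimap, *minor, value=1)
    return trimap

trimap = recist_trimap((128, 128), MAJOR, MINOR)
```

In practice such a trimap would be mapped to the solver’s label constants and refined by GrabCut’s iterated graph cuts on the actual CT slice.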
3.2 Lesion Co-Segmentation
With the initial lesion masks, we train a co-segmentation model for joint lesion segmentation. The bottom part of Fig. 1(a) shows a diagram of how we train the co-segmentation model using the initial lesion masks. The co-segmentation model is adopted from Chen et al. and consists of a Siamese encoder-decoder framework coupled with an attention module. The model takes a pair of images as input and produces segmentations for each lesion, which are then passed to a densely-connected conditional random field (DCRF) to obtain the final lesion masks.
The attention mechanism consists of a channel-attention module and a spatial-attention module. Fig. 1(b) shows the channel attention mechanism alone, and Fig. 1(c) shows the channel-spatial attention mechanism (CSA). Feature maps obtained from the encoder network are passed to the mechanism to preserve shared semantic information and suppress the rest. The channel-attention module is inspired by the Squeeze-and-Excitation architecture and enhances common information by retaining channels that have high activations in both images, while the spatial-attention module captures inter-spatial information in the feature maps. Please refer to Chen et al. for details about the respective architectures of these attention modules.
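As a rough illustration of the channel-attention idea, the following NumPy sketch squeezes each feature map into per-channel descriptors and reweights both maps with a shared gate. The element-wise product gate is a stand-in assumption for the learned bottleneck MLP of the actual Squeeze-and-Excitation-style module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_coattention(fa, fb):
    """SE-style channel attention shared across a pair of feature maps.

    fa, fb: (C, H, W) encoder features for the two images. Channels that
    are strongly activated in BOTH images receive weights near 1;
    channels active in only one image are suppressed.
    """
    # Squeeze: global average pooling to per-channel descriptors.
    za = fa.mean(axis=(1, 2))  # (C,)
    zb = fb.mean(axis=(1, 2))
    # Excite on the joint descriptor (a learned MLP in the real model;
    # a centered element-wise product gate is used here as a stand-in).
    w = sigmoid(za * zb - np.median(za * zb))  # (C,) shared weights
    return fa * w[:, None, None], fb * w[:, None, None]

fa = np.random.rand(64, 16, 16)
fb = np.random.rand(64, 16, 16)
oa, ob = channel_coattention(fa, fb)
```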
For feature extraction, and taking inspiration from DeepLabV2, our final model’s encoder (denoted DRN-101) is a pre-trained residual network that applies atrous (dilated) convolutions in its last two residual blocks, producing feature maps with an output stride of 8. This yields higher-resolution feature representations with greater context. The model uses the channel-spatial attention module to retain more spatial information. The decoder network up-samples the features and produces a two-channel output representing the foreground and background predictions.
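The effect of atrous convolution can be illustrated in one dimension: a strided convolution halves the spatial resolution, while a dilated convolution keeps the full resolution yet widens the receptive field. A self-contained sketch (not the model’s actual layers):

```python
import numpy as np

def conv1d(x, k, stride=1, dilation=1):
    """'Same'-padded 1-D convolution with stride and dilation."""
    span = dilation * (len(k) - 1) + 1  # effective kernel extent
    pad = span // 2
    xp = np.pad(x, pad)
    taps = [i * dilation for i in range(len(k))]
    out = [sum(k[j] * xp[i + t] for j, t in enumerate(taps))
           for i in range(0, len(x), stride)]
    return np.array(out)

x = np.arange(32, dtype=float)
k = np.array([1.0, 1.0, 1.0])
y_stride = conv1d(x, k, stride=2)    # downsampled to length 16
y_atrous = conv1d(x, k, dilation=2)  # full length 32, receptive field 5
```

Replacing the stride-2 downsampling of the last residual blocks with dilation in this way is what keeps the encoder’s output stride at 8 instead of 32.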
Our model is trained on the NIH DeepLesion dataset, which contains 32,735 lesion images with RECIST annotations. Since lesions vary widely in size and appearance, we utilize the strategy from Yan et al. to obtain a feature vector for each slice. K-means clustering is then applied to group the lesions into 200 classes (a number chosen empirically). We then use stratified sampling based on this clustering to split the dataset into 80% training, 10% validation, and 10% testing sets. A co-segmentation dataset is created by pairing images within each cluster, resulting in 270,470 pairs for training, 28,136 pairs for validation, and 3,866 pairs for testing. We use 1,000 manually annotated segmentations for evaluation and report recall, precision, Dice similarity coefficient, averaged Hausdorff distance (AVD), and volumetric similarity (VS) as quantitative metrics, calculated pixel-wise by a publicly available segmentation evaluation tool. Pre-processing:
Prior to training, we pad the lesion-mask images before resizing them to 128×128, and then normalize them. Training: we adopt the original network’s hyper-parameters, utilizing the Adam optimizer with a learning rate of
and a weight decay of 0.0005 for L2 regularization. For the loss function, we use pixel-wise cross-entropy. We use a batch size of 20 and train the model for two epochs with 12,000 iterations per epoch. For inference, we use the model that performs best on the validation set with regard to Dice score.
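The cluster-based pairing and stratified split described above can be sketched as follows; the cluster assignments here are randomly generated stand-ins for the k-means labels over the lesion embeddings:

```python
import itertools
import random
from collections import defaultdict

# Hypothetical cluster assignments: lesion_id -> cluster label (0..199),
# standing in for k-means over the per-slice lesion embeddings.
random.seed(0)
clusters = {i: random.randrange(200) for i in range(2000)}

# Stratified split: slice each cluster 80/10/10 so every lesion class
# is represented in all three sets.
by_cluster = defaultdict(list)
for lesion, c in clusters.items():
    by_cluster[c].append(lesion)

splits = {"train": [], "val": [], "test": []}
for members in by_cluster.values():
    n = len(members)
    a, b = int(0.8 * n), int(0.9 * n)
    splits["train"] += members[:a]
    splits["val"] += members[a:b]
    splits["test"] += members[b:]

def make_pairs(lesions, clusters):
    """Pair lesions only within the same cluster, so that each training
    pair shares appearance and background statistics."""
    groups = defaultdict(list)
    for lesion in lesions:
        groups[clusters[lesion]].append(lesion)
    return [p for g in groups.values() for p in itertools.combinations(g, 2)]

train_pairs = make_pairs(splits["train"], clusters)
```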
| Method | Recall ↑ | Precision ↑ | Dice ↑ | AVD ↓ | VS ↑ |
|---|---|---|---|---|---|
| FCN-16 | 0.893 ± 0.13 | 0.841 ± 0.13 | 0.858 ± 0.11 | 0.521 ± 1.42 | 0.924 ± 0.08 |
| VGG-16 (without clustering) | 0.800 ± 0.13 | 0.844 ± 0.11 | 0.811 ± 0.09 | 0.683 ± 1.08 | 0.908 ± 0.08 |
| VGG-16 | 0.870 ± 0.13 | 0.898 ± 0.09 | 0.877 ± 0.10 | 0.403 ± 1.19 | 0.935 ± 0.08 |
| VGG-16 + CSA | 0.880 ± 0.12 | 0.895 ± 0.10 | 0.880 ± 0.10 | 0.393 ± 1.22 | 0.938 ± 0.08 |
| ResNet-101 | 0.857 ± 0.14 | 0.920 ± 0.09 | 0.879 ± 0.10 | 0.420 ± 1.49 | 0.928 ± 0.09 |
| ResNet-101 + CSA | 0.862 ± 0.13 | 0.921 ± 0.09 | 0.883 ± 0.10 | 0.385 ± 1.24 | 0.931 ± 0.08 |
| DRN-101 | 0.851 ± 0.16 | 0.932 ± 0.08 | 0.877 ± 0.13 | 0.490 ± 1.39 | 0.920 ± 0.12 |
| DRN-101 + CSA | 0.915 ± 0.11 | 0.895 ± 0.10 | 0.898 ± 0.10 | 0.349 ± 1.20 | 0.942 ± 0.08 |
Table 1: The performance of lesion segmentation with different training strategies in terms of recall, precision, Dice score, AVD, and VS. Means and standard deviations are reported for all strategies. ↑: the larger the better; ↓: the smaller the better.
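For reference, the pixel-wise overlap metrics reported above can be computed from a predicted and a ground-truth binary mask as sketched below (AVD, which requires boundary distance computations, is omitted):

```python
import numpy as np

def seg_metrics(pred, gt):
    """Pixel-wise precision, recall, Dice, and volumetric similarity
    for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()   # true positives
    fp = np.logical_and(pred, ~gt).sum()  # false positives
    fn = np.logical_and(~pred, gt).sum()  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    # Volumetric similarity: 1 minus the relative area difference.
    vs = 1 - abs(fn - fp) / (2 * tp + fp + fn)
    return precision, recall, dice, vs

# Toy example: two offset 4x4 squares on an 8x8 grid.
pred = np.zeros((8, 8), int); pred[2:6, 2:6] = 1
gt = np.zeros((8, 8), int);   gt[3:7, 3:7] = 1
p, r, d, v = seg_metrics(pred, gt)  # overlap of 9 px out of 16 each
```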
4.1 The Performance of Lesion Co-Segmentation
To validate the efficacy of the co-segmentation model, we first train a fully-convolutional network (FCN) with a VGG-16 backbone on the initial lesion masks. For the co-segmentation models, we experiment with the VGG-16 and ResNet-101 architectures, as well as a dilated ResNet-101 (DRN-101) model. The default attention mechanism is channel attention, and we apply the same DCRF model in all experiments to obtain the final segmentations. All models are implemented in PyTorch.
Quantitative results of these experiments are shown in Table 1. The 6.5% increase in Dice score shows that clustering lesions prior to training greatly improves performance, supporting our methodology of leveraging lesion embeddings to generate lesion classes for training. The results also confirm that co-segmentation produces better segmentations, as the Dice score improves by 1.9% between the FCN model and the VGG-16 co-segmentation model with only channel-wise attention. When we adopt the deeper ResNet-101 network, the model gains minor improvements, achieving an 87.9% Dice score. It is worth noting the increase in recall when using channel-spatial attention compared to channel attention alone, which is especially pronounced between the DRN-101 and DRN-101 + CSA masks. Adding spatial attention thus considerably improves performance by reducing the number of pixels incorrectly labeled as background. Additionally, the results show that the atrous convolutions improve the precision of the masks, with a 1.2% improvement between the ResNet-101 and DRN-101 backbone models. Our best model uses the convolutional layers of a dilated ResNet-101 with CSA (DRN-101 + CSA) for the encoder and gains a 4.0% increase in Dice score compared to the FCN model.
Qualitative results are shown in Fig. 2. Firstly, it can be seen that the co-segmentation model trained on the dataset without clustering suffers greatly in performance, as it produces masks that fail to capture the fine boundaries of the lesion. In general, this model is prone to falsely classifying pixels as background, most likely because variations in texture and shade between different lesions confuse it. With clustering over the lesion embeddings, the masks produced by the VGG-16 backbone model have fewer false positives than those produced by the FCN model, indicating that using semantic information from both images helps generate more precise boundaries. Further improvements in recall can be seen in the masks created by the DRN-101 model with channel-spatial attention, as fewer pixels are falsely labeled as background.
In this work, a weakly-supervised attention-based deep co-segmentation approach is proposed to acquire more accurate lesion segmentations from RECIST measurements. Given a set of RECIST diameters, we first utilize GrabCut to obtain initial lesion masks. We then leverage lesion embeddings to create lesion clusters, which serve as different classes for the co-segmentation model. We experiment with various architectures and attention mechanisms to demonstrate the efficacy of our approach. The quantitative and qualitative experimental results demonstrate that co-segmentation with clustering enhances lesion segmentation performance. Future work can employ this model architecture in a slice-propagated fashion to obtain fully volumetric lesion segmentations.
Acknowledgements. This research was supported by the Intramural Research Program of the National Institutes of Health Clinical Center and by the Ping An Insurance Company through a Cooperative Research and Development Agreement. We thank Nvidia for the GPU card donation.
- (2020) Weakly supervised lesion co-segmentation on CT scans. In ISBI, pp. 1–4.
- (2018) Accurate weakly-supervised deep lesion segmentation using large-scale clinical annotations: slice-propagated 3D mask generation from 2D RECIST. In MICCAI, pp. 396–404.
- (2018) Semantic aware attention based deep object co-segmentation. In ACCV, pp. 435–450.
- (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI 40 (4), pp. 834–848.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141.
- (2018) CT-realistic lung nodule simulation from 3D conditional generative adversarial networks for robust lung segmentation. In MICCAI, pp. 732–740.
- (2017) Simple does it: weakly supervised instance and semantic segmentation. In CVPR, pp. 876–885.
- (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In NeurIPS, pp. 109–117.
- (2018) Deep object co-segmentation. In ACCV, pp. 638–653.
- (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440.
- (2018) Object cosegmentation using deep Siamese network. arXiv preprint arXiv:1803.02555.
- (2017) Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop.
- (2016) Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE TPAMI 39 (1), pp. 128–140.
- (2004) GrabCut: interactive foreground extraction using iterated graph cuts. In ACM TOG, Vol. 23, pp. 309–314.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2015) Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Medical Imaging 15 (1), pp. 29.
- (2019) Abnormal chest X-ray identification with generative adversarial one-class classifier. In ISBI, pp. 1358–1361.
- (2019) ULDor: a universal lesion detector for CT scans with pseudo masks and hard negative example mining. In ISBI, pp. 833–836.
- (2019) CT-realistic data augmentation using generative adversarial network for robust lymph node segmentation. In SPIE MI: CAD, Vol. 10950, pp. 109503V.
- (2019) XLSor: a robust and accurate lung segmentor on chest X-rays using criss-cross attention and customized radiorealistic abnormalities generation. In MIDL, pp. 457–467.
- CT image enhancement using stacked generative adversarial networks and transfer learning for lesion segmentation improvement. In MLMI, pp. 46–54.
- (2018) Semi-automatic RECIST labeling on CT scans with cascaded convolutional neural networks. In MICCAI, pp. 405–413.
- (2019) Deep adversarial one-class learning for normal and abnormal chest radiograph classification. In SPIE MI: CAD, Vol. 10950, pp. 1095018.
- (2019) TUNA-Net: task-oriented unsupervised adversarial network for disease recognition in cross-domain chest X-rays. In MICCAI, pp. 431–440.
- (2011) Object cosegmentation. In CVPR, pp. 2217–2224.
- (2018) Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In CVPR, pp. 9261–9270.
- (2018) DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. JMI 5 (3), pp. 036501.
- (2019) MULAN: multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In MICCAI, pp. 194–202.