Deep convolutional neural networks (CNNs) have demonstrated superior performance in medical image analysis tasks such as image segmentation, object detection, and classification. The performance of such methods is, however, constrained by the quantity and diversity of the available training data. Given the imbalanced distribution of real-world data, rare and abnormal cases are generally underrepresented in the training data. It is thus important to improve the data efficiency of medical machine learning systems, either by improving the supervised learning approaches or by synthesizing images based on the existing annotated data. This paper explores the latter path, with a specific application to classifying benign versus malignant lung nodules in CT images. Basic data augmentation techniques, such as random cropping, shifting, scaling, flipping, and rotation, can introduce a certain level of diversity during training, but they cannot account for the diversity of nodule morphology and location. Some recent studies proposed using generative adversarial networks (GANs) to synthesize lesions in medical image patches to augment the training data [1, 2]. Such methods train the GANs to in-paint a cropped area with the objects of interest, such as lesions. The generator network is trained with a reconstruction loss between the synthetic patch and the real patch, as well as an adversarial loss produced by a discriminator network. These studies concluded that synthetic patches could improve the performance of supervised learning tasks. However, such networks were designed to generate objects conditioned only on the surrounding context and random noise. They lack the capability to generate objects with manipulable properties, which we believe to be important for many machine learning applications in medical imaging, such as balancing datasets for classification tasks.
In a more recent study, the authors proposed using synthetic shape models to condition the nodule synthesis.
In this paper, we propose an adversarial learning framework to synthesize lung nodules in CT images by conditioning on the target categories and sizes. The nodules can hence be in-painted at random locations with manipulable attributes. We demonstrate an example application of the class-aware synthesized nodules, which is to augment the training images in an unbalanced dataset in order to improve the assessment of nodule malignancy risk. A data-driven model that can accurately predict nodule malignancy risk from CT images may improve management decisions and prevent unnecessary imaging or invasive follow-up procedures on benign nodules, thus increasing the effectiveness of lung cancer screening programs. By evaluating on the public Lung Image Database Consortium (LIDC-IDRI) dataset [5, 6, 7], we show that our synthesized nodule patches are beneficial for improving the performance of nodule malignancy risk estimation. The proposed framework has the potential to be generalized to synthesizing other objects of interest in medical images.
The proposed framework to generate synthetic nodules is formulated as an in-painting problem which fills a masked area in a 3D lung CT image with a pulmonary nodule with the specified category and size. The framework contains two major components: 1) two generators to perform coarse-to-fine in-painting by incorporating contextual information; 2) local and global discriminators to enforce the local quality and the global consistency of the generated patches, and auxiliary domain classifiers to constrain the generated nodules with the manipulable properties, such as the malignancy. The overall framework is depicted in Fig. 1.
2.1 Coarse-to-Fine Reconstruction
The 3D input nodule patches are extracted from the lung CT images, centered on the annotated nodules. The nodule in each patch is replaced with a 3D spherical noise mask generated according to the annotated nodule diameter. The masked patch and a class-label map are first fed into a 3D hour-glass CNN, the coarse generator, as shown in Fig. 1. The coarse generator is designed to be easier to optimize and reconstructs a coarsely synthesized nodule in the masked region to facilitate subsequent learning. Its output is fed into a second network, the refinement generator, which has a similar architecture and refines the details in the output map. The coarse and refinement generators together form a stacked image in-painting generator. Both generators are optimized with a reconstruction loss between the reconstructed patch and the real nodule patch. This loss combines two normalized terms, computed across the masked area and across the entire patch, respectively. By optimizing the reconstruction loss, the stacked generator is trained to reconstruct the nodules in the original patch based on the lung tissue context, the random noise mask, and the class label. Aside from the reconstruction loss, the stacked generator is also optimized by an adversarial loss provided by two discriminator networks described in Sec. 2.2.
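The masking step and the two-term reconstruction loss can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names, the uniform noise distribution, the use of a mean absolute error for each term, and the unweighted sum of the two terms are all assumptions made for illustration.

```python
import numpy as np


def spherical_mask(shape, diameter_vox):
    """Boolean mask of a sphere centred in a 3D patch of the given shape."""
    center = [(s - 1) / 2.0 for s in shape]
    zz, yy, xx = np.ogrid[:shape[0], :shape[1], :shape[2]]
    dist2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return dist2 <= (diameter_vox / 2.0) ** 2


def mask_nodule(patch, diameter_vox, rng):
    """Replace the central sphere of `patch` with uniform random noise (assumed)."""
    mask = spherical_mask(patch.shape, diameter_vox)
    noise = rng.uniform(patch.min(), patch.max(), size=patch.shape)
    return np.where(mask, noise, patch), mask


def reconstruction_loss(pred, target, mask):
    """Masked-area term plus whole-patch term, each a mean absolute error (assumed)."""
    abs_err = np.abs(pred - target)
    local = abs_err[mask].mean()   # normalized over the masked voxels only
    global_ = abs_err.mean()       # normalized over every voxel in the patch
    return local + global_, local, global_


# Usage on a random stand-in for a CT patch (hypothetical sizes).
rng = np.random.default_rng(0)
patch = rng.normal(size=(16, 16, 16))
masked, mask = mask_nodule(patch, diameter_vox=6, rng=rng)
total, local, global_ = reconstruction_loss(masked, patch, mask)
```

Because the masked patch differs from the original only inside the sphere, the whole-patch term in this example equals the masked-area term scaled by the fraction of masked voxels, which shows why a separate local term keeps small nodules from being ignored.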
Features from surrounding regions can be helpful for in-painting the boundary of the nodules in the masked area. In a recent study, a contextual attention model was proposed to borrow textures from the background patches to generate the missing patches. We use this contextual attention model to match foreground and background textures by measuring the normalized cosine similarity of their features. An attention map over the background voxels is obtained and used to reconstruct the foreground area from the attention-filtered background features. The contextual attention layer is differentiable and fully convolutional. For brevity, we refer the reader to the original paper for the complete definition of the contextual attention.
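The matching step described above can be sketched as follows. This is a simplified voxel-wise version written for illustration, not the convolutional patch-based layer of the original contextual attention paper. The function name, the flattened `(n, channels)` feature layout, and the softmax `scale` value are assumptions.

```python
import numpy as np


def contextual_attention(fg_feats, bg_feats, scale=10.0, eps=1e-8):
    """Rebuild foreground features as attention-weighted sums of background features.

    fg_feats: (n_fg, channels) features of masked (foreground) voxels.
    bg_feats: (n_bg, channels) features of surrounding (background) voxels.
    """
    fg_unit = fg_feats / (np.linalg.norm(fg_feats, axis=1, keepdims=True) + eps)
    bg_unit = bg_feats / (np.linalg.norm(bg_feats, axis=1, keepdims=True) + eps)
    similarity = fg_unit @ bg_unit.T                    # cosine similarity, (n_fg, n_bg)
    logits = scale * similarity
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    attention = np.exp(logits)
    attention /= attention.sum(axis=1, keepdims=True)   # softmax over background voxels
    return attention @ bg_feats, attention


# Usage with random stand-in features (hypothetical sizes).
rng = np.random.default_rng(0)
background = rng.normal(size=(20, 8))
foreground = rng.normal(size=(5, 8))
reconstructed, attention_map = contextual_attention(foreground, background)
```

Every operation here is a matrix product, a normalization, or a softmax, so the step stays differentiable, in line with the property stated above.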
2.2 Class-Aware Synthesis
Two discriminator networks, a local one and a global one, are used to optimize the two generators adversarially, together with the reconstruction losses. The local discriminator is applied to the masked area only, to improve the nodule appearance. The global discriminator is applied to the entire patch, for the global consistency of the in-painting. We denote both discriminators as $D$ here for brevity. We use the conditional Wasserstein GAN (WGAN) objective with the gradient penalty to train the local and global discriminators and the stacked generator as

$$\mathcal{L}_{adv} = \mathbb{E}_{\tilde{x}}\left[D(\tilde{x}, c)\right] - \mathbb{E}_{x}\left[D(x, c)\right] + \lambda\, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}, c) \rVert_2 - 1\right)^2\right]$$

where $x$ is sampled from a real patch, $\tilde{x}$ is the generator in-painting output, $\hat{x}$ is sampled uniformly between real patches and generated patches, $c$ is the class label, and $\lambda$ weights the gradient penalty. The class label $c$ is replicated to the same size as the input patch and concatenated with it as another input channel. Besides the WGAN discriminators $D$, we add an auxiliary domain classifier on top of each discriminator network to ensure that the generator produces nodules in the targeted class $c$. In this training objective, each domain classifier attempts to classify the output patch into a domain class (0 = fake, 1 = benign, and 2 = malignant). The label 0 is used to prevent the generator from duplicating nodules that are easy to classify but less diverse. The class-aware loss used for this optimization is computed from the classifier output, which represents a probability distribution over the domain classes.

Though both the WGAN discriminators and the domain classifiers are optimized to discriminate fake from real patches, we find empirically that the learning system is hard to converge without the WGAN discriminators. In practice, the domain classifiers are added after the generator is well trained to in-paint realistic-looking nodules. In this adversarial learning problem, the generator tries to in-paint patches that are classified as the target domain and that fool the discriminators into judging them as drawn from the distribution of real patches. The objective for the whole class-aware nodule synthesis learning then combines the reconstruction loss, the adversarial loss, and the class-aware loss.
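The conditioning and the gradient-penalty term can be made concrete with a toy NumPy sketch. It is not the networks used in this work. The critic here is a hypothetical linear function, chosen so that its input gradient is known in closed form. The penalty weight of 10, the single interpolation sample, and the negative log-likelihood form of the domain class loss are likewise assumptions.

```python
import numpy as np

FAKE, BENIGN, MALIGNANT = 0, 1, 2   # domain classes used by the auxiliary classifier


def with_label_channel(patch, label):
    """Stack a constant label map onto a (D, H, W) patch -> (2, D, H, W)."""
    return np.stack([patch, np.full_like(patch, float(label))])


def linear_critic(inputs, weights):
    """Toy critic (assumed): a weighted sum over the two-channel input."""
    return float(np.sum(weights * inputs))


def wgan_gp_critic_loss(real, fake, label, weights, rng, gp_weight=10.0):
    """Critic loss D(fake) - D(real) + gp_weight * (||grad D(x_hat)|| - 1)^2."""
    t = rng.uniform()                          # uniform interpolation factor
    x_hat = t * real + (1.0 - t) * fake        # point between a real and a fake patch
    d_real = linear_critic(with_label_channel(real, label), weights)
    d_fake = linear_critic(with_label_channel(fake, label), weights)
    # For a linear critic the gradient w.r.t. the image channel is weights[0]
    # at every x_hat, so no automatic differentiation is needed in this toy case.
    grad = weights[0]
    penalty = (np.linalg.norm(grad) - 1.0) ** 2
    return d_fake - d_real + gp_weight * penalty, x_hat


def domain_class_loss(probs, target):
    """Negative log-likelihood of the target domain class (assumed loss form)."""
    return -float(np.log(probs[target]))


# Usage with random stand-in patches (hypothetical sizes).
rng = np.random.default_rng(0)
real_patch = rng.normal(size=(4, 4, 4))
fake_patch = rng.normal(size=(4, 4, 4))
critic_weights = rng.normal(size=(2, 4, 4, 4))
loss, x_hat = wgan_gp_critic_loss(real_patch, fake_patch, MALIGNANT, critic_weights, rng)
```

With a real network the gradient at `x_hat` would come from automatic differentiation rather than from the closed form used here.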
2.3 Application in 3D Nodule Malignancy Classification using Video Pre-trained Networks
In the context of action recognition in videos, spatiotemporal 3D CNNs have recently been shown to be more effective than 2D CNNs when large-scale frame-wise annotated datasets such as Kinetics are available. To classify the CT nodule patches into benign and malignant classes, we use 3D CNNs pre-trained on natural video classification. The temporal dimension of video data resembles the depth dimension of 3D medical image volumes. It is also easier to scale up the collection of video datasets than that of medical imaging datasets. Thus, using video pre-trained networks could help stabilize the network training and prevent over-fitting.
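The correspondence between the depth axis and the temporal axis can be illustrated by arranging a CT patch in the channel-time-height-width layout commonly expected by video networks. This is a sketch only: the function name, the min-max intensity scaling, and the replication of the single CT channel into three colour-like channels are assumptions, not details taken from this work.

```python
import numpy as np


def ct_patch_to_clip(patch, num_channels=3):
    """Arrange a (D, H, W) CT patch as a (C, T, H, W) video-style clip.

    The depth axis D plays the role of the temporal axis T.
    """
    lo, hi = patch.min(), patch.max()
    if hi > lo:
        scaled = (patch - lo) / (hi - lo)            # min-max scaling to [0, 1] (assumed)
    else:
        scaled = np.zeros_like(patch, dtype=float)   # constant patch: nothing to scale
    return np.repeat(scaled[np.newaxis], num_channels, axis=0)


# Usage: a patch with 8 slices becomes a 3-channel clip with 8 "frames".
clip = ct_patch_to_clip(np.random.default_rng(0).normal(size=(8, 16, 16)))
```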
We evaluated the proposed methods on the public Lung Image Database Consortium (LIDC-IDRI) dataset [5, 6, 7], which consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with annotated lesions. The LIDC-IDRI dataset contains 1,010 patients and 1,308 chest CT imaging studies in total. The nodules in LIDC-IDRI were annotated by four radiologists. Each radiologist assessed the likelihood of malignancy of each nodule and gave a score ranging from 1 (highly unlikely) to 5 (highly suspicious). We define the nodules with the majority score to be malignant and the rest to be benign. In our experiments, we extracted the nodule patches from the LIDC dataset with the resolution mm ( voxels). The patches were randomly split into training, validation, and testing sets according to the patients, as shown in Table 1.
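Table 1: Benign and malignant patch counts per subset (columns: Subset, Benign, Malignant, Total).
Table 2: Classification results per training setting (rows include Raw + Weighted Loss and Raw + Synthesis).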
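Splitting according to the patients means that all patches from one patient fall into the same subset, which avoids leakage between training and testing. A minimal sketch of such a split is given below. The function name, the split fractions, and the seed are hypothetical and are not the values used for Table 1.

```python
import random
from collections import defaultdict


def split_by_patient(patch_patient_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign patch indices to train/val/test so that no patient spans two subsets.

    fractions: assumed train and validation shares; the remainder goes to test.
    """
    by_patient = defaultdict(list)
    for index, patient_id in enumerate(patch_patient_ids):
        by_patient[patient_id].append(index)

    patients = sorted(by_patient)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)

    n_train = int(round(fractions[0] * len(patients)))
    n_val = int(round(fractions[1] * len(patients)))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {name: sorted(i for p in members for i in by_patient[p])
            for name, members in groups.items()}


# Usage: ten hypothetical patients with three patches each.
patient_ids = ["patient_%d" % (i // 3) for i in range(30)]
subsets = split_by_patient(patient_ids)
```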
We trained the proposed nodule synthesis framework on the training patches only. In Fig. 2, we show the nodule synthesis results obtained by conditioning on different labels. Given the same background patch, the framework is capable of generating nodules with different specified malignancy labels. We also show in Fig. 3 that the network can generate nodules with different morphology and textures using different noise masks. Since malignant nodules are relatively rare in the original LIDC-IDRI dataset, the trained generator was used to synthesize 463 patches containing malignant nodules, starting from patches randomly sampled from the real malignant training patches. The synthetic nodule patches were combined with the original training patches to train the 3D classification CNNs.

To evaluate the effectiveness of the synthetic patches for estimating lung nodule malignancy, we trained four 3D CNN architectures with different capacities: ResNet-50, ResNet-101, ResNet-152, and ResNeXt-101. All the networks were initialized with weights pre-trained on the Kinetics video dataset [10, 13]. The cross-entropy loss was used for training the CNN classifiers. We also compared the unweighted cross-entropy loss (Raw) with the weighted cross-entropy loss (Raw + Weighted Loss), whose weights account for the class distribution of the training samples. Traditional data augmentation methods, including random cropping and scaling, were used for training all the networks. The testing accuracy (ACC), sensitivity (SEN), specificity (SPE), and area under the ROC curve (AUC) are presented in Table 2. The models were selected based on the highest AUC on the validation set. With the synthetic patches (Raw + Synthesis), the 3D ResNet-152 achieved the highest accuracy, specificity, and AUC score across all the experiments. Though the weighted loss is typically used for imbalanced datasets, it did not yield a better AUC than either the unweighted loss or the augmented dataset.

The increase in the mean AUC score also indicates that using the synthetic nodule patches is helpful for improving the nodule malignancy classification performance.
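For reference, the class weights and the reported metrics can be computed as in the following sketch. It is illustrative only: inverse-frequency weighting is one common way to account for the class distribution and is assumed here, as are the function names and the decision threshold of 0.5. Both classes must be present in `labels`.

```python
def inverse_frequency_weights(labels):
    """Class weights inversely proportional to class frequency (assumed scheme)."""
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


def binary_metrics(labels, scores, threshold=0.5):
    """Accuracy, sensitivity, specificity, and AUC for labels in {0, 1}."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    # AUC as the probability that a positive outranks a negative (ties count 0.5).
    pos_scores = [s for s, y in zip(scores, labels) if y == 1]
    neg_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return {
        "ACC": (tp + tn) / len(labels),
        "SEN": tp / pos,
        "SPE": tn / neg,
        "AUC": wins / (pos * neg),
    }


# Usage on made-up labels (1 = malignant) and predicted malignancy scores.
example_labels = [0, 0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.6, 0.7, 0.3]
weights = inverse_frequency_weights(example_labels)
metrics = binary_metrics(example_labels, example_scores)
```

AUC is computed from the ranking of the scores and does not depend on the threshold, which makes it a natural criterion for model selection on an imbalanced dataset.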
In this paper, we propose an adversarial in-painting based framework for synthesizing lung nodules with class-aware manipulations. We demonstrate one example application of the generated lung nodule patches on the classification of nodule malignancy. The qualitative results show that the proposed framework is capable of generating lung nodules with the specified malignancy labels. By evaluating on the nodule patches obtained from the LIDC-IDRI dataset, we show that the generated nodules can be helpful for improving the classification performance on an imbalanced dataset.
Disclaimer: This feature is based on research, and is not commercially available. Due to regulatory reasons, its future availability cannot be guaranteed.
- [1] Eric Wu, Kevin Wu, David Cox, and William Lotter, “Conditional infilling GANs for data augmentation in mammogram classification,” in Image Analysis for Moving Organ, Breast, and Thoracic Images, pp. 98–106. Springer, 2018.
- [2] Dakai Jin, Ziyue Xu, Youbao Tang, Adam P. Harrison, and Daniel J. Mollura, “CT-realistic lung nodule simulation from 3D conditional generative adversarial networks for robust lung segmentation,” arXiv preprint arXiv:1806.04051, 2018.
- [3] Siqi Liu, Eli Gibson, Sasa Grbic, Zhoubing Xu, Arnaud Arindra Adiyoso Setio, Jie Yang, Bogdan Georgescu, and Dorin Comaniciu, “Decompose to manipulate: Manipulable object synthesis in 3D medical images with structured image decomposition,” arXiv preprint arXiv:1812.01737, 2018.
- [4] Heber MacMahon, John H. M. Austin, Gordon Gamsu, Christian J. Herold, James R. Jett, David P. Naidich, Edward F. Patz, and Stephen J. Swensen, “Guidelines for management of small pulmonary nodules detected on CT scans: A statement from the Fleischner Society,” Radiology, 2005.
- [5] Samuel G. Armato III et al., “Data from LIDC-IDRI,” 2015.
- [6] Samuel G. Armato III et al., “The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A completed reference database of lung nodules on CT scans,” vol. 38, pp. 915–931, 2011.
- [7] Kenneth Clark, Bruce Vendt, Kirk Smith, John Freymann, Justin Kirby, Paul Koppel, Stephen Moore, Stanley Phillips, David Maffitt, Michael Pringle, Lawrence Tarbox, and Fred Prior, “The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository,” Journal of Digital Imaging, vol. 26, no. 6, pp. 1045–1057, 2013.
- [8] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang, “Generative image inpainting with contextual attention,” arXiv preprint, 2018.
- [9] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
- [10] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh, “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?,” 2018, pp. 18–22.
- [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [12] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [13] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman, “The Kinetics human action video dataset,” 2017.