Recently, many researchers have found that synthetic images generated by deep generative models, especially generative adversarial networks (GANs), can greatly improve the performance of many tasks on both natural and medical images [perez2017effectiveness, goodfellow2020generative, han2018gan]. At the same time, high-quality synthetic images generated by deep generative models can assist doctors with diagnosis. For example, doctors can localize lesions by comparing abnormal images with synthetic pseudo-normal images. Thus, generating medical images of great quantity and quality has become a necessary and urgent research topic.
The idea of pseudo-normality synthesis is to generate pseudo-normal images (i.e. without lesion) from real abnormal images (i.e. with lesion). The generated pseudo-normal images are important in two aspects. (1) The synthetic pseudo-normal images can assist doctors with diagnosis through comparison of the abnormal and pseudo-normal images [tsunoda2014pseudo]. (2) The synthetic pseudo-normal images can be utilized as a data augmentation technique to improve the performance of various downstream tasks (e.g. lesion segmentation, lesion detection, etc.) [alex2017generative, andermatt2018pathology, bowles2017brain, chen2018unsupervised, milletari2016v, sun2020adversarial, tsunoda2014pseudo, ye2013modality]. However, in most cases, paired abnormal and normal images are unavailable for training such a generative network to synthesize pseudo-normal images from abnormal images. Thus, previous works [baumgartner2018visual, sun2020adversarial, xia2020pseudo, yunlong2020generator] leverage unpaired normal and abnormal images to train generative networks (variants of GAN-based models) to generate pseudo-normal images from abnormal images. Nevertheless, they have several limitations: (1) low quality of the generated pseudo-normal images in the absence of segmentation labels, (2) high cost of the required segmentation labels, and (3) the dual generation problem (pseudo-abnormal image synthesis) is not yet explored.
To address the above limitations, we propose a Semi-supervised Medical Image generative LEarning network (SMILE), which can leverage a small set of images with segmentation masks and a large set of images without segmentation masks to achieve robust pseudo-normality synthesis from abnormal images. Specifically, we propose a confidence enhancer component which maximizes the confidence score of the predicted segmentation labels in our semi-supervised setting. For the third limitation, we incorporate a new component into our framework for pseudo-abnormal image synthesis. Thus, our model is capable of generating both pseudo-normal and pseudo-abnormal images, as shown in Fig. 1. Our contributions are summarized as follows:
We propose a semi-supervised generative modeling framework which uses a confidence enhancer to leverage both labeled and unlabeled data to generate synthetic images.
We propose a confidence enhancement technique which enforces maximization of the certainty over the predicted segmentation mask and is especially helpful in the absence of ground-truth segmentation labels.
Extensive experiments show that our model outperforms the best of the state-of-the-art methods by up to 3% in generating high-quality images and by 6% on the data augmentation task. When using only 50% of the labeled data, the proposed model achieves pseudo-normality synthesis results comparable to fully supervised methods.
2 Related Works
2.1 Pseudo-normality Synthesis
Pseudo-normality synthesis is important from various perspectives, including clinical suggestion for doctors and data augmentation for many downstream tasks [andermatt2018pathology, bowles2017brain, sun2020adversarial, ye2013modality, zhu2017unpaired]. Thanks to the satisfying visual quality of the synthetic images, generative adversarial nets have been dominating the pseudo-normality synthesis task. Previous works [sun2020adversarial, xia2020pseudo] introduce the idea of adversarial training, which co-trains a predictor to make binary predictions over the synthetic images. The predictor is expected to detect whether the synthetic images are normal (i.e. 0) or abnormal (i.e. 1), while the generator is expected to generate realistic pseudo-normality images which fool the predictor into believing that the synthetic pseudo-normality images are real. Additionally, a reconstructor is introduced to reconstruct the pseudo-abnormality images, taking the segmentation mask and the pseudo-normality images as input. The most recent work [yunlong2020generator] extends the idea of adversarial training and co-trains a segmentor which provides more detailed information about the lesion than the predictor used in [xia2020pseudo]. Besides GAN-like architectures, [chen2018unsupervised] adopts a variational auto-encoder (VAE) to learn the distribution of the normal images and generate pseudo-normality images based on the learnt constrained latent representations. However, such methods suffer from the poor reconstruction quality of VAEs and struggle to eliminate small lesions.
2.2 Deep Generative Models for Medical Image Synthesis
To alleviate the need for the large amount of data required by deep learning models, deep generative models were first proposed to generate synthetic data for effective data augmentation [goodfellow2020generative]. In medical image analysis, deep generative models are also utilized for data augmentation and anonymization [shin2018medical, han2019combining]. In previous work, various downstream tasks have been utilized to guide the generator to produce high-quality synthetic images, such as anomaly detection, lesion detection, and lesion segmentation. For example, [sun2020adversarial] utilizes anomaly detection, which treats lesions as anomalies, and learns a generator which removes the lesion from the original images. However, since there is no ground truth for the pseudo-abnormal images, the labels are obtained by simply removing everything within the contour of the brain from the images. Obviously, this assumption is too strong, since the synthesized image should maintain the structures and tissues within the brain. Later works [xia2020pseudo, yunlong2020generator] eliminate this assumption by providing more information about the lesions in the images (e.g. binary masks).
2.3 Generative Adversarial Learning in Medical Image Analysis
Generative Adversarial Nets (GAN) [goodfellow2020generative] were proposed to generate realistic images and have achieved satisfying results on natural images [zhu2017unpaired]. The fundamental mechanism of the GAN-based approach is to co-train a generator and a discriminator. The generator is expected to generate realistic images which fool the discriminator into believing the synthetic images are real, while the discriminator is expected to distinguish real from synthetic images. Later, the idea of adversarial training, in which one model attempts to fool the other by generating realistic images, was widely adopted in several different forms. For example, conditional GAN [mirza2014conditional] utilizes adversarial training to make the generator learn to generate images of different classes. In medical image analysis, adversarial training is also widely utilized [han2018gan, yi2019generative]. Specifically, [chen2017deeplab, tack2018knee, xue2018segan] have shown that adversarial training can improve the performance of various medical image analysis tasks, such as segmentation and detection.
[Figure: input image w/ segmentation label; unlabeled input image; predicted segmentation mask; generated pseudo-normality image; reconstructed pseudo-abnormality image]
3 Method
The overall framework includes three main parts: (1) pseudo-normality generation, (2) adversarial segmentation, and (3) pseudo-abnormality generation. The confidence enhancement is designed for the semi-supervised learning setting. The notations used throughout this section are first introduced in Table 1. Then, the three parts are illustrated in the following sections.
3.1 Pseudo-normality Generation
We adopt a U-Net [ronneberger2015u] model, a widely used image segmentation architecture, to build our generator, because the architecture allows information at different resolutions to flow through the network and its skip connections pass messages along, both of which are critical for the generation task. The loss function of the generator consists of two parts: the first part is a pixel-wise measurement of the quality of the generated pseudo-normality image, and the second part measures how good the generated pseudo-normality image is via the prediction of the discriminator. Therefore, the loss of the generator is defined as follows,
$$\mathcal{L}_G = \mathcal{L}_{pix}\big(G(x),\, x\big) + \mathcal{L}_{CE}\big(S(G(x)),\, \mathbf{0}\big),$$
where $G(x)$ is the generated pseudo-normality image, $\mathbf{0}$ is the segmentation mask of the expected normal image (i.e. a zero matrix), $\mathcal{L}_{pix}$ refers to a pixel-wise loss, and $\mathcal{L}_{CE}$ refers to the cross-entropy loss. The pixel-wise generation loss enforces the generator to maintain the identity of the no-lesion part. Notably, in order to keep the model consistent and aware of the presence or absence of the lesion, we also train the model to generate a pseudo-normality image from a real normal input image; in this case, the model is expected to output the original image.
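The two-part generator loss described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the tensor shapes, the choice of an ℓ1 pixel-wise loss, and masking the pixel term to the no-lesion region are assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(x, x_pn, mask, seg_logits):
    """Sketch of the generator objective: a pixel-wise identity term plus
    an adversarial term asking the segmentor to find no lesion.

    x          : real input image, shape (B, 1, H, W)
    x_pn       : generated pseudo-normality image, same shape
    mask       : lesion mask (1 = lesion), same shape
    seg_logits : segmentor output on x_pn, shape (B, 2, H, W)
    """
    # Pixel-wise loss outside the lesion keeps the identity of the
    # no-lesion part of the image.
    pix = F.l1_loss(x_pn * (1 - mask), x * (1 - mask))
    # The expected mask of a normal image is all zeros (class 0 everywhere).
    target = torch.zeros(seg_logits.shape[0], *seg_logits.shape[2:], dtype=torch.long)
    adv = F.cross_entropy(seg_logits, target)
    return pix + adv
```

In this sketch the adversarial term rewards the generator when the segmentor predicts "no lesion" everywhere on the pseudo-normality image.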
3.2 Adversarial Segmentation
For the segmentor, we also adopt a U-Net architecture, due to its excellent performance on segmentation tasks [ronneberger2015u], with a softmax function at the end to produce segmentation masks. The main objective of the segmentor is to recognize the lesion behind the generated pseudo-normality images and to recognize the fake lesion in the generated pseudo-abnormality images, as:
$$\mathcal{L}_S = \mathcal{L}_{CE}\big(S(G(x)),\, m\big) + \mathcal{L}_{CE}\big(S(\tilde{x}),\, m\big),$$
where $m$ is the segmentation mask and $\tilde{x}$ is the reconstructed pseudo-abnormality image.
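The segmentor objective described above can be sketched as follows. This is an illustrative sketch, not the authors' code; the two-class logit layout and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def segmentor_loss(seg_pn_logits, seg_pa_logits, mask):
    """Sketch of the adversarial segmentor objective: recover the lesion
    behind the pseudo-normality image and the fake lesion in the
    reconstructed pseudo-abnormality image.

    seg_pn_logits : segmentor output on the pseudo-normality image, (B, 2, H, W)
    seg_pa_logits : segmentor output on the reconstructed pseudo-abnormality image
    mask          : (ground-truth or pseudo) lesion mask, (B, H, W), values {0, 1}
    """
    target = mask.long()
    # Cross-entropy against the same lesion mask for both synthetic images.
    return F.cross_entropy(seg_pn_logits, target) + F.cross_entropy(seg_pa_logits, target)
```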
3.3 Pseudo-abnormality Generation
To fully utilize the segmentation labels to provide more supervised signals to the model, as well as to better enrich the dataset during data augmentation, we introduce a reconstructor at the end of our model, which learns to reconstruct the input image from the segmentation mask and the pseudo-normality image. As with the generator, we adopt a similar U-Net structure, except with a two-channel input. The objective is to minimize the pixel-wise loss between the reconstructed image and the original input image.
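The reconstructor interface can be sketched as below. The tiny convolutional network here is a toy stand-in for the U-Net described above (layer sizes are arbitrary assumptions); what it illustrates is the two-channel input (pseudo-normality image plus segmentation mask) and the pixel-wise reconstruction objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyReconstructor(nn.Module):
    """Toy stand-in for the U-Net reconstructor: takes a two-channel
    input (pseudo-normality image + segmentation mask) and outputs a
    reconstructed pseudo-abnormality image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1),
        )

    def forward(self, x_pn, mask):
        # Concatenate image and mask along the channel dimension.
        return self.net(torch.cat([x_pn, mask], dim=1))

def reconstruction_loss(x, x_pa):
    # Pixel-wise loss between the reconstruction and the original input.
    return F.l1_loss(x_pa, x)
```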
3.4 Semi-supervised Objective
In our semi-supervised training setting, the input image is of unknown type, either normal or abnormal. The input image first goes through the segmentor, and two things happen: (1) the image trains the segmentor without a label by enhancing the confidence of the segmentor (as in Eq. 4), and (2) the segmentor assigns a pseudo-label to the image. After that, training proceeds as in the supervised setting, with the assigned pseudo-label. In this setting, the segmentor is trained to be more confident about the segmentation mask, which in turn provides a more precise pseudo-label and leads to a better solution for both the segmentor and the generator. The loss function for the confidence enhancement of the segmentor is:
$$\mathcal{L}_{conf} = \max\Big(0,\ \tau - \frac{1}{N}\sum_{i=1}^{N}\max_{c}\, p_{i,c}\Big),$$
where $p_{i,c}$ is the predicted probability of class $c$ at pixel $i$, $N$ is the number of pixels, and $\tau$ is a desired confidence threshold (a fixed value in our model). It is also recommended to dynamically update $\tau$ with the average confidence value every epoch. In this way, the confidence of the segmentor with respect to the accuracy of the predicted segmentation mask is maximized. This is especially useful in the semi-supervised setting, where no ground truth is present to guide the training of the framework.
3.5 Training Scheme
We adopt a step-by-step pre-training and fine-tuning scheme, as shown in Fig. 2. We first train the generator $G$ and the segmentor $S$ in an adversarial setting. Then, we train the reconstructor $R$ while fixing $G$ and $S$. Next, we train $G$ and $S$ again in an adversarial setting. Finally, we fine-tune the framework with $G$, $S$, and $R$ together. We adopt a similar training scheme in the semi-supervised setting, except that the segmentation mask is predicted by the segmentor $S$.
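The four training phases above can be expressed as a simple schedule. This is an illustrative outline only; the per-phase training functions are hypothetical placeholders for the actual optimization loops.

```python
def training_schedule(train_adversarial, train_reconstructor, finetune_all):
    """Run the step-by-step pre-training / fine-tuning schedule.

    train_adversarial   : callable training G and S adversarially
    train_reconstructor : callable training R with G and S frozen
    finetune_all        : callable fine-tuning G, S, R jointly
    """
    phases = [
        ("adversarial G/S #1", train_adversarial),
        ("reconstructor R",    train_reconstructor),
        ("adversarial G/S #2", train_adversarial),
        ("joint fine-tune",    finetune_all),
    ]
    log = []
    for name, step in phases:
        step()          # run one full training phase
        log.append(name)
    return log
```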
4 Experiments
4.1 Dataset
We consider the publicly available dataset of the Multimodal Brain Tumor Segmentation Challenge 2019 (BraTS19) [bakas2017advancing, menze2014multimodal] to evaluate our framework. We take the training set of the BraTS19 challenge dataset. It contains 259 GBM (i.e. glioblastoma) and 76 LGG (i.e. lower-grade glioma) magnetic resonance imaging (MRI) volumes. Each volume is skull-stripped, interpolated to an isotropic spacing of 1 mm, and co-registered to the same anatomical template. Every volume provides four modalities (i.e. T1, T2, T1c and FLAIR) with 240×240 slices. We take the T2 modality of the GBM volumes and split the train/validation/test sets into 130/104/25 volumes, respectively. We clip the intensities of each volume to $[0, v]$, where $v$ is the 99.5th percentile of the pixel values of the volume.
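The intensity clipping step can be sketched as follows (a minimal sketch of the preprocessing described above; the function name is our own).

```python
import numpy as np

def preprocess_volume(volume):
    """Clip a volume's intensities to [0, v], where v is the 99.5th
    percentile of the volume's voxel values."""
    v = np.percentile(volume, 99.5)
    return np.clip(volume, 0.0, v)
```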
4.2 Comparison Methods
We consider four related frameworks for the supervised pseudo-normality synthesis task, introduced below. We take the baseline implementations from their open-source code. VA-GAN [baumgartner2018visual] uses a generator which predicts an additive map to generate a pseudo-normality image, and a discriminator which attempts to distinguish the generated images from real images. Additionally, they aim to learn the smallest additive map required to achieve realistic pseudo-normality generation by minimizing an $\ell_1$ loss on the map. ANT-GAN [sun2020adversarial] is a variant of CycleGAN which enforces a consistency loss between the pseudo-normality and pseudo-abnormality images. Compared with our approach, they do not employ segmentation-based adversarial training; they only use the pixel-wise normality consistency loss. PHS-GAN [xia2020pseudo] introduces adversarial training with a discriminator that distinguishes normal from abnormal images, while GVS [yunlong2020generator] takes the segmentation label as input and co-trains a discriminator to segment the lesion in the normal and abnormal images.
4.3 Evaluation Metrics
For quantitative evaluation, we select two metrics proposed in [xia2020pseudo] to evaluate healthiness and identity, where healthiness refers to the effect of erasing the tumor, while identity refers to the maintenance of the original structure. Healthiness ($h$) measures how normal the generated pseudo-normality image is. The evaluation function of $h$ is defined as:
$$h = 1 - \frac{\mathbb{E}\big[N(S(\hat{x}_n))\big]}{\mathbb{E}\big[N(m)\big]},$$
where $\hat{x}_n$ is the generated pseudo-normality image, $m$ is the ground-truth lesion mask, $S$ is a pre-trained segmentor, and $N(\cdot)$ counts the lesion pixels in a mask.
Identity ($iD$) measures the structure maintenance of the generated pseudo-normality image. The evaluation function of $iD$ is defined as:
$$iD = \text{MS-SSIM}\big[(1-m) \odot \hat{x}_n,\ (1-m) \odot x\big],$$
where $\text{MS-SSIM}$ stands for the multiscale structural similarity [wang2003multiscale], $x$ is the input abnormal image, and $\odot$ denotes element-wise multiplication.
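The healthiness metric can be sketched numerically as below. This reflects one reading of the definition in [xia2020pseudo] (lesion-pixel counts from a segmentor's predicted masks); the function and argument names are our own.

```python
import numpy as np

def healthiness(pred_mask_pseudo, pred_mask_real):
    """Sketch of the healthiness metric: the fraction of lesion pixels a
    segmentor still finds in the pseudo-normality images, relative to the
    lesion pixels it finds in the real abnormal images. A value of 1
    means the lesion was entirely removed.

    pred_mask_pseudo : binary predicted mask(s) on pseudo-normality images
    pred_mask_real   : binary predicted mask(s) on real abnormal images
    """
    n_pseudo = float(np.sum(pred_mask_pseudo))
    n_real = float(np.sum(pred_mask_real))
    # Guard against division by zero when no lesion is found in the real images.
    return 1.0 - n_pseudo / max(n_real, 1.0)
```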
4.4 Experiment Details
The framework is trained with the Adam optimizer. The batch size is set to 8 and the learning rate to 1e-4. The model is trained for 10 epochs in each phase of the training loop. All experiments are conducted on a 64-bit machine with two NVIDIA GPUs (TITAN RTX, 1770 MHz, 24 GB GDDR6).
5 Results
5.1 Evaluation on Generation Performance
In Figure 3, we visualize randomly selected triplets of (abnormal image, pseudo-normal image, pseudo-abnormal image). In Table 2, we report the $h$ and $iD$ metrics of our proposed model as well as the baseline models. It is clear that our model outperforms the other models on both evaluation metrics. Notably, GVS ranks second even though the framework is very simple, which demonstrates that adversarial training with a segmentation task is important for providing detailed information (e.g. shape, location) about the lesion and further improving the quality of the synthetic images. The performance of ANT-GAN and PHS-GAN is comparable and far better than that of VA-GAN, which illustrates that adversarial training helps improve the quality of the synthetic images. In summary, utilizing a segmentation task as an adversarial training task can greatly improve the quality of the synthetic images. Moreover, the additional pseudo-abnormality reconstructor further improves the quality of the synthetic pseudo-normal images by 2%. For the semi-supervised setting, we use only 75% segmentation-labeled data, along with 25% unlabeled data, to train our network, and achieve results comparable to the fully supervised setting. In practice, we do not involve any extra unlabeled data; rather, we remove the labels of 25% of the labeled data from the same training set to produce the unlabeled data, thanks to the proposed confidence maximization technique, which keeps training the model even when labels are missing. Besides, we evaluate how the amount of labeled data influences our semi-supervised learning performance in Table 2(a). It is worth noting that the labeled data we use is at least 50% of the whole labeled training set, because the framework crashes with only a few labeled samples; we leave this as a future study.
5.2 Evaluation on Data Augmentation
We evaluate the effectiveness of the method for data augmentation by providing synthetic pseudo-normality and pseudo-abnormality data to the downstream lesion segmentation task. (Note that we do not apply traditional data augmentation (e.g. rotation, crop, etc.) here.) We show the performance (Dice score) of the models in Table 3. Specifically, we set up the experiments by utilizing our training dataset to build the augmented dataset: we generate two types of data (i.e. pseudo-normality and pseudo-abnormality), each with the same amount as the training dataset. We compare the performance of each type of augmented dataset, namely, without augmentation (None), with pseudo-normality only (PN, or GVS [yunlong2020generator]), with pseudo-abnormality only (PA), and with both pseudo-normality and -abnormality (Both). Note that the PN setting corresponds to the baseline model, GVS [yunlong2020generator], which produces the best performance among the current state-of-the-art models. For each data-augmented experiment, adding pseudo-images means incorporating the synthetic data into the training data. The results show that both the generated pseudo-normality and pseudo-abnormality data improve the performance of the downstream segmentation task when used for data augmentation.
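The assembly of the augmented training sets compared in Table 3 can be sketched as follows (an illustrative sketch; the function and mode names are our own labels for the four settings described above).

```python
def build_augmented_dataset(real_pairs, pseudo_normal, pseudo_abnormal, mode="Both"):
    """Assemble a training set under one of the four augmentation settings:
    "None" (no augmentation), "PN" (pseudo-normality only),
    "PA" (pseudo-abnormality only), or "Both".

    real_pairs      : the original training samples
    pseudo_normal   : synthetic pseudo-normality samples (same size as training set)
    pseudo_abnormal : synthetic pseudo-abnormality samples (same size as training set)
    """
    data = list(real_pairs)
    if mode in ("PN", "Both"):
        data += list(pseudo_normal)
    if mode in ("PA", "Both"):
        data += list(pseudo_abnormal)
    return data
```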
6 Conclusion
In this paper, we proposed a Semi-supervised Medical Image generative LEarning (SMILE) framework to generate pseudo-normality and pseudo-abnormality images to assist medical image analysis tasks through visual screening and data augmentation. A confidence enhancement technique is introduced for semi-supervised generative learning. Extensive experimental results suggest that our proposed SMILE model can generate images of better quality and support better data augmentation than the state-of-the-art models. We plan to study more data-efficient approaches (e.g. self-supervised learning) for generative learning on medical images in the future.