Diabetic retinopathy (DR) is a common disease causing vision loss or even blindness among people with diabetes. Human ophthalmologists usually identify and grade the DR severity based on the type and amount of related lesions. According to the international protocol [gulshan2016development, drgrading], DR severity can be graded into five levels: normal, mild, moderate, severe non-proliferative diabetic retinopathy (NPDR) and PDR. The related lesions consist of hard exudates, soft exudates, hemorrhages, microaneurysms, laser marks, proliferate membranes, etc. It is time-consuming and difficult even for experienced ophthalmologists to diagnose DR, so automatic DR grading models [seoud2015automatic, pratt2016convolutional, yang2017lesion] have begun to be explored over the past decades.
Several previous works [wang2017zoom, lin2018framework, Zhou_2019_CVPR] adopt deep models to implement DR grading and obtain substantial improvement over other methods. Compared with hand-crafted feature extraction and traditional machine learning methods, deep convolutional neural networks (CNNs) have achieved great success for many vision tasks, such as image classification [he2016deep], object detection [liu2018deep], semantic segmentation [garcia2017review] and image synthesis [yi2018generative]. Training an effective deep CNN model usually requires a large amount of diverse and balanced data. However, the DR data distribution over different grades is extremely imbalanced since abnormal fundus images only make up a small portion. For example, as far as we know, for the largest public DR dataset EyePACS [kaggle], images of DR levels 3 and 4 only account for 2.35% and 2.16%, respectively, while normal images of level 0 account for 73.67%. Training on such imbalanced data makes the model less sensitive to samples with higher DR severity levels and leads to overfitting. Although common data augmentation methods such as flipping, random cropping and rotation can mitigate the problem, the poor diversity of samples from those levels still limits model performance. Thus, in this paper, we propose an image generation model that synthesizes more diverse DR images with different grading levels, and use these generated images to help train a grading model.
Generative Adversarial Networks (GANs) [goodfellow2014generative] are widely used in image generation tasks. The framework of GAN usually consists of a generative model and a discriminative model competing against each other in a min-max game, which has led to great progress in synthesizing photorealistic images. One neural network tries to generate realistic data, and the other network tries to discriminate between real data and synthesized data. DCGAN [radford2015unsupervised] extends GAN by using a deconvolutional layer, which acts as an upscaling operation to transform low-resolution images into higher-resolution ones. CGAN [Mirza2014ConditionalGA] concatenates a one-hot vector to the random noise vector to build conditions into the generator. Moreover, CycleGAN [CycleGAN2017] performs unpaired image-to-image translation from a source domain to a target domain. BigGAN [brock2018large] pulls together many of the best recent practices for training class-conditional image generators and scales up the batch size and number of model parameters. The result is the routine generation of both high-resolution and high-fidelity images. These well-designed conditional GAN frameworks inspired us to propose a retina generator capable of synthesizing high-resolution realistic images.
Specifically, our proposed model consists of a high-resolution retina image generator conditioned on vessel, optic disk and lesion masks, a multi-scale spatial and channel attention module, multi-scale discriminators with multi-task learning losses, and an adaptive grading manipulation module to control the severity level of DR images synthesized by the generation model. The main contributions of our method are highlighted as follows:
1. A conditional high-resolution image generator is proposed to synthesize retina images with controllable lesion and grading information. In addition to the normal encoder-decoder design, the multi-scale spatial and channel attention and progressive generation design aims to synthesize more realistic local details. Moreover, multi-scale discriminators are optimized by the adversarial loss, feature matching loss and grading classification loss, simultaneously.
2. Adopting real DR images with available grading annotations, we learn different latent grading spaces for randomly sampling adaptive grading vectors. These vectors can be regarded as grading embeddings that help manipulate multi-scale synthesis blocks for more effective generation.
3. Both qualitative and quantitative experiments have been conducted to evaluate our model. Not only are the fidelity and diversity of the generated samples promising, but we also find that the synthesized images can be used for data augmentation to train better DR grading and lesion segmentation models. Extensive ablation studies and comparison experiments on Kaggle EyePACS [kaggle] and our private dataset demonstrate the effectiveness and superiority of our method.
Our preliminary work was presented in [Zhou_2019_DRGAN]. Compared to the previous work, in this paper, we make five major extensions. 1) A multi-scale spatial and channel attention (SCA) module is proposed in the synthesis blocks to better enhance generation quality in fine details. As a result, the grading performance can be improved as well. 2) We introduce two more important lesions: laser marks and proliferate membranes, which are particularly related to the identification of level 3 and 4 DR images. We use the weakly-supervised trained activation maps as the lesion mask inputs for these two lesions since we do not have their annotated ground-truths. 3) The contours of optic disks in fundus images synthesized by our previous DR-GAN model are usually blurred. In this version, an optic disk mask is added to address this problem and make the generated images more realistic. 4) A private dataset containing 1,834 images is collected with fine-grained pixel-level annotations. The performance of DR-GAN has been further improved. 5) More solid evaluation experiments and results are added, including human evaluation, Fréchet Inception Distance (FID) and Sliced Wasserstein Distance (SWD) for synthesis fidelity, and true positive rate per class for DR grading.
II Related Work
II-A GANs in Medical Image Synthesis.
Using GANs for medical image synthesis [medicalGAN] could potentially address the shortage of large and diverse annotated databases. Methods have been proposed for a variety of medical imaging domains such as Computed Tomography (CT) [Nie2016MedicalIS], Magnetic Resonance Imaging (MRI) [Jiang2018TumorAwareAD], and chest X-Rays [efficientGAN]. For example, CT imaging carries a risk of cancer due to radiation exposure. A cascade of 3D fully convolutional networks was presented by [Nie2016MedicalIS] to synthesize CT images from MR acquisitions. A pixel-wise reconstruction loss and an image gradient loss were adopted for generation in addition to adversarial learning. Mahapatra et al. [efficientGAN] exploited the conditional GAN and Bayesian Neural Networks for active learning to synthesize chest X-Ray images based on segmented masks. Moreover, Positron Emission Tomography (PET) images are widely adopted in diagnosis and staging in oncology but are expensive and radioactive. Thus, a cascade of two conditional GANs was proposed by [MyelinGAN], where the generators are based on 3D U-Nets and the discriminators are based on basic 3D CNNs, to synthesize PET images from various MR volumes.
II-B GANs in Retinal Image Synthesis.
Recently, some researchers have also exploited GANs to synthesize retinal fundus images. Costa et al. [costa2017towards] first adopted a U-Net architecture to transfer vessel segmentation masks to fundus images using a vanilla GAN architecture. However, the generated samples have block defects and do not have controllable grading information. Tub-sGAN [zhao2018synthesizing] was proposed to extend style transfer to the generator and thus increase the diversity of synthesized samples. Though somewhat successful, the DR-related lesions and physiological retina details cannot be synthesized clearly. More recently, Niu et al. [niu2018pathological] tried to generate fundus images given pathological descriptors and vessel segmentation masks. The position and quantity of lesions can be adjusted. However, this method only introduced lesion manipulation, without considering a global-wise representation for discriminating grades. The synthesized images still need to be evaluated by ophthalmologists to determine whether they are sufficiently gradable to benefit a grading model. In this work, our generation model can be manipulated with arbitrary grading and lesion information to synthesize corresponding high-resolution images. Thus, the generated samples can be directly exploited to help train DR grading models and improve the grading accuracy.
III Proposed Methods
Overview. Our model aims to synthesize high-resolution () DR images with controllable grading and lesion information. As shown in Figure 1, our approach consists of three key components. A retina generator employs a two-stage encoder-decoder architecture to generate fundus images, which is conditioned on the inputs of vessel and lesion masks. The optimization of the generator is based on an adversarial learning framework with a proposed multi-scale discriminator which combines multi-task losses for training. Moreover, to better manipulate the grading information of a synthesized image, we propose to learn adaptive grading vectors and embed them into the image synthesis blocks to control the appearance of generated lesions. We also find that the adaptive grading manipulation can increase the diversity of the synthesized data due to the effective random sampling from the pre-learned latent grading spaces.
III-A High-Resolution DR Image Generation.
The resolution requirement for medical images is high since some lesions tend to appear in extremely small regions, such as microaneurysms in fundus images. Inspired by [wang2018high], we design a two-stage encoder-decoder model to synthesize images. As illustrated in Figure 1, the building blocks in blue of the retina generator, denoted as , aim to generate a resolution of . Then, the blocks in yellow, denoted as , can further synthesize more realistic local details to increase the resolution to .
The first-stage sub-generator (blue blocks) consists of three components: encoding blocks, residual blocks and synthesis blocks. The encoding blocks employ a fully-convolutional module with four convolutional layers. The kernel size of the first layer is set as 7 and that of the remaining layers is 3. We configure the convolutional operation with a stride of 2 instead of using pooling for downsampling. The padding type is set to "same", and a ReLU activation and batch normalization are adopted after each layer. The residual blocks increase the depth of the network and are used to learn better transformation functions and representations through a deeper network. Finally, the synthesis blocks are the most significant component, whose basic units employ transposed convolutional operations. We embed an adaptive grading manipulation operation into these blocks, which is explained in Sec. III-D. The hyper-parameter settings of the synthesis blocks are similar to those of the encoding blocks. In this extension work, we design a multi-scale spatial and channel attention (SCA) module, inspired by [Fu_2019_CVPR], in the synthesis blocks to better enhance generation quality in fine details. This part is explained in the next subsection.
The second-stage sub-generator (yellow blocks) has a much simpler design, only including two convolutional layers and two corresponding transposed convolutional layers. The input to its first transposed layer is the element-wise sum of the feature maps of its last convolutional layer and the feature maps of the last transposed layer of the first-stage sub-generator. Such a design helps directly inherit the learned global features from the first stage and further progressively synthesize local details based on mask inputs, with higher resolution. Please note that the first-stage sub-generator is first pre-trained and then the second-stage sub-generator is added for fine-tuning the whole generator.
The proposed retina generator for DR images is conditioned on the input of structural and lesion masks. We adopt a U-Net architecture [ronneberger2015u] to train the vessel, optic disk and various lesion segmentation models. Thus, the pairs of real fundus images and corresponding structural and lesion masks can be adopted to train the generator. Since only a small amount of public pixel-wise annotated data is available for training the vessel (DRIVE [staal2004ridge]), optic disk and lesion (IDRID [porwal2018indian]) segmentation models, we use the trained model to predict masks on the large-scale DR dataset (EyePACS [kaggle]) and then adopt them to train our generator. We have also collected a larger private dataset with pixel-level annotations to improve the performance of segmentation for better lesion masks. Although the predicted masks are not the real ground-truths of data from EyePACS, the noise is tolerable and their large-scale amount does contribute to the synthesis performance.
III-B Multi-Scale Spatial and Channel Attention Module
For the retina generator, proposed in our preliminary work [Zhou_2019_DRGAN], the architecture follows the common encoder-decoder framework, which is unable to effectively preserve small details, such as tiny lesions and fine blood vessels, from the input of masks, and synthesize them clearly in the output fundus images. Thus, in this extension work, we propose a spatial and channel attention module in multi-scale synthesis blocks, as shown in the purple blocks in Figure 1. After adaptive grading manipulation in each synthesis block (which will be explained in the next section), the corresponding feature maps from the encoding and synthesis blocks are concatenated, together with the resized input of the structural and lesion masks. The combined features are further passed through one convolutional layer which is taken as the input of the spatial and channel attention module. Then, upsampling is conducted for the next synthesis block.
The network of the spatial and channel attention (SCA) module is illustrated in Figure 2. It consists of two parts, which operate the spatial and channel attention, respectively. The motivation behind using this self-attention mechanism is to both spatially model richer contextual representations and enhance specific semantics, such as input masks, to improve the dependencies between channel maps. Therefore, small details can be synthesized better.
Spatial Attention: The integrated input feature for the SCA is defined as , where denote the channel, width and height, respectively. Three convolutional layers are operated on the input feature for three branches. The first two branches reduce the channel dimension to , where is used for decreasing the computational complexity. Then, and from the two branches are reshaped into and , separately, and then multiplied. A following Softmax is applied to obtain the spatial attention map . The final spatial attended feature is computed as . In our implementation, we set to 8, 16, 32 for different scales , , , respectively.
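The spatial branch can be illustrated with a minimal NumPy sketch. It assumes a self-attention layout in the style of [Fu_2019_CVPR]: the three convolutional layers are modelled as 1x1 convolutions (plain matrices), and the names `w_q`, `w_k`, `w_v` and the reduction ratio `r` are hypothetical placeholders, not notation from this paper. Any residual connection or learned scaling of the output is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat, w_q, w_k, w_v):
    """feat: (C, H, W). w_q, w_k: (C // r, C). w_v: (C, C).

    The weight matrices stand in for 1x1 convolutions (an assumption).
    Returns the attended feature (C, H, W) and the attention map (N, N),
    where N = H * W.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)        # flatten spatial positions: (C, N)
    q = w_q @ x                       # reduced-channel branch 1: (C/r, N)
    k = w_k @ x                       # reduced-channel branch 2: (C/r, N)
    v = w_v @ x                       # full-channel branch: (C, N)
    attn = softmax(q.T @ k, axis=-1)  # (N, N); row i weights all positions j
    out = v @ attn.T                  # out[:, i] = sum_j attn[i, j] * v[:, j]
    return out.reshape(c, h, w), attn
```

The reduced channel dimension only affects the cost of forming the N x N similarity matrix; the attended output keeps all C channels.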
Channel Attention: The input is first reshaped into and , separately. Then, matrix multiplication is conducted with a Softmax to obtain the channel attention map . The final channel attended feature is computed as . The overall output of the SCA module is the weighted element-wise sum of and with one more convolutional layer for each.
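A corresponding sketch of the channel branch is given here, again assuming the layout of [Fu_2019_CVPR]: no convolution is applied before the attention map is formed, and the channel-to-channel similarity is computed directly from the reshaped input. The function name is hypothetical, and the final weighted fusion with the spatial branch is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(feat):
    """feat: (C, H, W). Returns attended feature (C, H, W) and map (C, C)."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)        # (C, N)
    attn = softmax(x @ x.T, axis=-1)  # (C, C): similarity between channel maps
    out = attn @ x                    # each channel becomes a mix of all channels
    return out.reshape(c, h, w), attn
```

Because the attention map is C x C rather than N x N, this branch stays cheap even at the larger synthesis scales.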
III-C Multi-scale Discriminators with Multi-task Optimization.
To better optimize the high-resolution image generator, the discriminator needs to have receptive fields of different scales to differentiate between real and synthesized images. The most effective method is to design multi-scale discriminators with identical network structures. In this work, we adopt the original size, , and downsample it to two scales, and . The three scales of images are passed forward to the three discriminators . The discriminator applied to the smallest scale provides the largest receptive field to focus on the holistic fundus image structure and some big lesion patterns. In contrast, the one applied to the largest scale provides the smallest receptive field to generate more local details, particularly for small lesion regions. The discriminator structure consists of four convolutional layers with a kernel size of 4 and a stride of 2. A leaky ReLU is used after each layer, with a slope of 0.2. Global average pooling is applied at the end for fitting different scales.
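The three-scale input to the discriminators can be sketched as an image pyramid. The downsampling operator and factors are assumptions: this sketch uses 2x2 average pooling, giving factors of 2 and 4, which may differ from the settings actually used.

```python
import numpy as np

def avg_pool2(img):
    """Halve the height and width of a (C, H, W) array by 2x2 average pooling."""
    c, h, w = img.shape
    h2, w2 = h // 2, w // 2
    cropped = img[:, :h2 * 2, :w2 * 2]  # drop an odd trailing row/column
    return cropped.reshape(c, h2, 2, w2, 2).mean(axis=(2, 4))

def image_pyramid(img, num_scales=3):
    """Return [original, 1/2 scale, 1/4 scale, ...], one entry per discriminator."""
    scales = [img]
    for _ in range(num_scales - 1):
        scales.append(avg_pool2(scales[-1]))
    return scales
```

Since the discriminators share one architecture, the same network applied to the coarsest entry covers the largest portion of the fundus image per output unit.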
In this work, a multi-task loss is carefully designed for training the generator and the discriminator in an adversarial learning architecture. In addition to employing the standard adversarial loss , which maximizes the output of the discriminator for generated data, we also adopt the feature matching loss [improveGAN] to optimize the generator and match the statistics of features in the intermediate layers of the discriminator. The feature matching loss aims to address the instability of training GANs to prevent the generator from overtraining on the discriminator. Moreover, we also incorporate an auxiliary classification loss (adopting focal loss [lin2017focal] due to the imbalanced data distribution) to enable the discriminator to learn discriminative representations for DR grading, on both the synthesized and real data. For the largest-scale input, an additional perceptual network based on the VGG-19 backbone is equipped to contribute to the training by .
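To make the role of the focal loss [lin2017focal] concrete, the following sketch computes it for one sample. The focusing parameter `gamma=2.0` is an assumed default, not a value reported in this paper, and the optional class-balancing weight is omitted.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample.

    probs:  predicted class probabilities (summing to 1).
    target: index of the ground-truth grade.
    gamma:  focusing parameter (assumed value); gamma = 0 gives cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

Well-classified samples (large `p_t`), which are mostly the abundant level-0 images, are down-weighted, so the rarer severe grades contribute relatively more to the gradient.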
For the vessel and lesion masks input to the generator , which combines and , we fuse them into one conditional map denoted as . The corresponding real fundus image and grading label are indicated as and , respectively. The overall training loss is defined in the following equation:
where and denote the and intermediate layer in and , respectively. During implementation, we compute all the layers for the feature matching and perceptual losses. Moreover, , and control the weights of different losses, and are set as 10, 10 and 1 to make the magnitude of each loss the same size, for the best result. For the discriminator input, a channel-wise concatenation of the conditional maps and the real/synthesized images is conducted.
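As an illustration of the feature matching term only, the sketch below averages the L1 distance between discriminator features of a real and a synthesized image over all intermediate layers. The per-layer mean and the division by the number of layers are assumed normalizations; the function name and the list-of-arrays interface are hypothetical.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between paired intermediate feature maps.

    real_feats, fake_feats: lists of arrays, one per discriminator layer,
    extracted for a real image and its synthesized counterpart.
    """
    per_layer = [np.abs(r - f).mean() for r, f in zip(real_feats, fake_feats)]
    return float(sum(per_layer) / len(per_layer))
```

In a full training loop this value would be computed for each of the three discriminators and combined with the other terms using the loss weights.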
III-D Adaptive Grading Manipulation.
Our final aim is to generate fundus images with controllable DR severity levels which can be used for data augmentation to improve the performance of DR grading models. In this paper, we propose to learn adaptive grading vectors that can be used for manipulation in the synthesis blocks. The overall idea is illustrated in the bottom part of Figure 1. The adaptive grading vectors are learned and sampled from the latent grading spaces.
We first employ ResNet-50 to train a DR grading model based on the fundus images to extract discriminative features. After visualizing the extracted features using t-SNE [maaten2008visualizing], we clearly observe five clusters corresponding to the five DR grading levels. The features are accordingly grouped into five feature sets, one per grading level. For each set, we compute the mean to obtain the corresponding normal distribution space. Once the five latent grading distributions are learned, we randomly sample latent vectors from them. In the training phase, based on the grading ground-truth of an input pair consisting of a mask and real image, the latent vector is sampled from the corresponding space.
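The construction of the latent grading spaces can be sketched as follows. This is an illustration under stated assumptions: each space is modelled as a diagonal Gaussian defined by the per-dimension mean and standard deviation of the grade's features, and the function names are hypothetical. The feature extractor itself (the pre-trained grading model) is not included.

```python
import numpy as np

def fit_grading_spaces(features, labels, num_grades=5):
    """features: (N, D) grading features; labels: (N,) grades in 0..num_grades-1.

    Returns {grade: (mean, std)} with one diagonal Gaussian per grade
    (the diagonal form is an assumption).
    """
    spaces = {}
    for g in range(num_grades):
        fg = features[labels == g]
        spaces[g] = (fg.mean(axis=0), fg.std(axis=0))
    return spaces

def sample_grading_vector(spaces, grade, rng):
    """Draw one latent grading vector from the space of the given grade."""
    mean, std = spaces[grade]
    return rng.normal(mean, std)
```

During training the grade would come from the ground-truth label of the input pair; repeated sampling for one mask yields different latent vectors and hence more varied outputs.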
We take the latent grading vectors as inputs to the synthesis blocks of the generator. Inspired by [karras2018style], a four-layer non-linear mapping network is first devised to encode the affine transformations, which can benefit the generator manipulated by grading. In each synthesis block, the feature maps are first concatenated with a random noise volume whose number of channels is a quarter that of . Then, a is employed to fuse the features. Adaptive instance normalization (AdaIN) [huang2017arbitrary] embeds the grading vectors to transform the original features. The AdaIN is defined as the following function:
where the fused feature is normalized independently. The scale and bias vectors are learned from the adaptive grading vectors, rather than from the feature itself as in normal batch normalization. In each synthesis block with a different channel-wise dimension of feature maps, the corresponding grading vector is split into the scale and bias vectors.
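For reference, the standard AdaIN operation of [huang2017arbitrary] normalizes each channel of a feature map to zero mean and unit variance and then applies an externally supplied scale and bias. The sketch below follows that standard definition; the function name, the `eps` constant and the layout of the grading vector (first half scale, second half bias) are assumptions for illustration.

```python
import numpy as np

def adain_with_grading(feat, grading_vec, eps=1e-5):
    """feat: (C, H, W) fused feature; grading_vec: (2 * C,) transformed vector.

    The vector is split into a per-channel scale and bias (assumed layout),
    which modulate the instance-normalized feature.
    """
    c = feat.shape[0]
    scale, bias = grading_vec[:c], grading_vec[c:]
    mu = feat.mean(axis=(1, 2), keepdims=True)    # per-channel mean
    sigma = feat.std(axis=(1, 2), keepdims=True)  # per-channel std
    normed = (feat - mu) / (sigma + eps)
    return scale[:, None, None] * normed + bias[:, None, None]
```

After the operation, the per-channel statistics of the output are dictated by the grading vector, which is how the sampled grade influences the synthesized appearance.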
III-E Testing Phase and Implementation Details.
Based on the input vessel and lesion masks in the testing phase, we need to decide which grading distribution to sample the latent vector from for the grading manipulation. To address this problem, an additional convolution and global average pooling are operated after the last encoding block to further train the grading function. During testing, the predicted grading label is used for the selection. Moreover, once the whole model is trained, we can automatically imitate multiple lesion masks for one vessel mask to synthesize various fundus images with controllable grading labels.
The training scheme for our method consists of two steps. In the first step, five latent spaces for different grading levels are learned by pre-training a grading model. Then, the proposed DR-GAN is optimized in an end-to-end manner. The number of residual blocks in the generator is set to 7 for the best performance. The Adam optimizer is adopted with a learning rate of 0.0001 and momentum of 0.5. The mini-batch size is set to 16. The first-stage sub-generator is first trained over 10 epochs, then the second-stage sub-generator is added for fine-tuning over 10 epochs. All the experiments are conducted on an NVIDIA DGX-1.
IV Experimental Evaluation
IV-A Datasets and Pre-processing
To train the whole DR-GAN model, large-scale data with pixel-wise annotated structural and lesion masks is required. However, the largest public DR dataset is EyePACS [kaggle], which consists of 35,126 training images and 53,576 testing images only containing grading labels. The images are collected from different sources with various lighting conditions and weak annotation quality. Thus, we adopt a small-scale dataset with pixel-level annotations to train the vessel, optic disk and lesion segmentation models, and then perform inference on EyePACS to obtain masks, which are coarsely used as weak ground-truths for training DR-GAN. In the previous version, we used DRIVE [staal2004ridge] and IDRID [porwal2018indian] (which contain four lesions: microaneurysms, haemorrhages, hard exudates and soft exudates) to train the segmentation models. However, IDRID only has 81 annotated images. In our extended model, we collect a private dataset with much more annotated data to enhance model training. Details are as follows:
Private Dataset: Our private dataset (which we name SKA) is collected from cooperative local hospitals. We select a subset of 1,834 DR images containing various lesion appearances for fine-grained pixel-wise lesion annotation as well as DR grading by ophthalmologists. Inspired by [porwal2018indian], microaneurysms, haemorrhages, hard exudates and soft exudates are annotated. Moreover, we also add image-level labels for laser marks and proliferate membranes, since these two lesions are very useful for identifying level 3 and 4 DR images but usually appear with holistic features that are difficult to annotate at a pixel level. Figure 3 provides some examples from the SKA dataset with annotations. In this extended work, we adopt our private dataset to train the lesion mask segmentation models. We split the 1,834 annotated images into 1,500 images for training and 334 images for testing. The segmentation performance is reported in terms of the area under curve (AUC) of the precision-recall (PR) curve. We obtain 0.3364 for microaneurysms, 0.6650 for haemorrhages, 0.7950 for hard exudates and 0.6306 for soft exudates. Moreover, we also test our model on the IDRID dataset [idrid] for fair comparison and consistently achieve state-of-the-art results: 0.5182 for microaneurysms, 0.7089 for haemorrhages, 0.8916 for hard exudates and 0.7577 for soft exudates.
Laser Mark and Proliferate Membrane Activation Maps: Laser marks and proliferate membranes are important lesions that usually appear in severe DR images (i.e. DR-3 and DR-4). However, they are global features, which are not suitable for pixel-wise annotation. Thus, only image-level labels for these two lesions are provided, which indicate whether or not an image has the lesion. Following the discriminative localization approach of [zhou2016learning], a weakly supervised method is adopted to obtain the activation maps. We select a subset containing all the available images with laser marks and proliferate membranes to train their classifiers. For laser mark classification, 1450 training images and 398 testing images (60% of which have positive lesions) are evaluated to get a classification accuracy of 94.97%. For proliferate membrane classification, 1382 training images and 398 testing images (20% of which have positive lesions) are evaluated to get a classification accuracy of 92.21%. Once the models are trained, the activation maps are extracted for all the images from EyePACS as weak mask ground-truths. Figure 4 shows some examples of the results of the weakly supervised activation maps.
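A class activation map in the sense of [zhou2016learning] is a weighted sum of the last convolutional feature maps, using the classifier weights of the class of interest. The sketch below shows that computation only; the min-max rescaling to [0, 1] and the function name are assumptions, and the classifier training is not included.

```python
import numpy as np

def class_activation_map(feats, fc_weights, cls):
    """feats: (K, H, W) last-layer feature maps; fc_weights: (num_classes, K).

    Returns an (H, W) activation map for class `cls`, rescaled to [0, 1]
    (the rescaling is an assumed post-processing step).
    """
    cam = np.tensordot(fc_weights[cls], feats, axes=1)  # sum_k w[cls, k] * feats[k]
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)
```

The resulting low-resolution map would then be resized to the image size before being used as a weak lesion mask.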
IV-B Qualitative Evaluation of Image Synthesis
Before evaluating the quantitative improvement of the DR grading performance by data augmentation with synthesized data, we first both qualitatively and quantitatively demonstrate the generated image fidelity and evaluate the influence of grading and lesion conditions.
In this work, we mainly adopt training samples from EyePACS, as well as the extracted vessel, optic disk and lesion masks, to train the model. Once this is done, for each structural mask, we can arbitrarily control the quantity and position of the lesion spots within the corresponding masks to synthesize different DR-level images. Specifically, by manipulating the lesion masks, the corresponding grading labels can be coarsely predicted. Thus, we synthesize 10,000 images for each grading level to augment the data. The upper part of Figure 5 provides examples of synthesized images with different DR levels, for a given input with vessel and optic disk structures. We find that the fidelity of generated images and the manipulation performance using lesions and grading are highly promising. Moreover, the lower part of Figure 5 shows more detailed examples of synthesized lesion appearances, including microaneurysms, hemorrhages, hard exudates, soft exudates, laser marks and proliferate membranes. The limitation of our method is that the input of structural and lesion masks are not real ground-truths but inferred from the segmentation models, due to the expense of data annotation. Thus, the performance of the generator will be affected by the correctness of input masks. For example, false alarms for the segmentation of microaneurysms have an influence on synthesizing this kind of tiny lesions by the generator, which results in only a small improvement of the grading model discriminating between level 1 and 2.
| Methods | AVG. FID | AVG. SWD |
|---|---|---|
| DR-GAN w SCA | 4.24 | 6.17 |
| w/o Lesion Masks | 8.68 | 16.12 |
|Tr-Real||Tr-Real & Fake|
|DR-GAN w SCA||89.45||88.11||88.06||85.90|
|w/o Lesion Masks||76.31||72.40||84.21||80.61|
|Grade||Tr-Real / Te-Real||Tr-Real & Fake / Te-Real|
IV-C Quantitative Evaluation of Image Synthesis
To better evaluate the fidelity of the synthesized images, we conduct a review by human experts and compute the Fréchet Inception Distance and Sliced Wasserstein Distance scores.
IV-C1 Evaluation by Ophthalmologists
We ask three professional ophthalmologists to independently evaluate the synthesized images. First, 500 images are randomly selected from the 50,000 generated images that include different DR levels. Human scoring, ranging from 1 to 10, is used to describe the realness of the fundus image, in terms of details such as vessels, lesion textures and colors, where a higher value is better. Moreover, we select another 500 real fundus images, also displaying different DR levels, and mix them with the fake ones for discrimination by ophthalmologists. The average fidelity score and discrimination accuracy are reported in Table I. For comparison, the images synthesized by Tub-sGAN [zhao2018synthesizing] and real images are also evaluated for realness, since some real images do not have perfect image quality. As shown in the results, only 66% of the mixed real and DR-GAN-generated images can be correctly identified by ophthalmologists, compared to 92% for Tub-sGAN. Since the random-guess accuracy is around 50%, this indicates that the fidelity of our synthesized images is high. For the human-rated image fidelity score, the images synthesized by our method obtain 7.97, compared to 8.89 for real images, which indicates promising synthesis ability. The images synthesized by Tub-sGAN receive a much lower score of 5.36.
IV-C2 Evaluation with Fréchet Inception Distance
To evaluate the performance of image generation models, Fréchet Inception Distance (FID) [heusel2017gans] is usually adopted to measure the similarity between the real image set and the synthesized image set. It has been shown to correlate well with human judgement of visual quality and is most often used to evaluate the quality of samples from Generative Adversarial Networks. FID is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network. A lower score indicates better performance. Table II shows the average FID scores over the five levels, obtained by different methods, as well as some ablation comparisons, which are explained in Sec. IV-E. Our DR-GAN, without adding the multi-scale spatial and channel attention module, achieves a score of 4.53, which is a large improvement over Tub-sGAN. With the additional SCA (DR-GAN w SCA), a slight decrease in the FID score can be further obtained. Moreover, the result of DR-GAN w/o Lesion Masks also shows that the input of the lesion masks is the most important factor for synthesizing good images. If the input lesion masks are not given, the fidelity of the synthesized lesion patterns is largely decreased.
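The Fréchet distance between two Gaussians with means m1, m2 and covariances S1, S2 is ||m1 - m2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)). The sketch below evaluates this formula on precomputed feature arrays; extracting the Inception features is outside its scope, and the function names are hypothetical. The trace of the matrix square root is obtained from the eigenvalues of S1 S2, which is an implementation choice made here to avoid extra dependencies.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov1 @ cov2, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

def fid_from_features(feats_a, feats_b):
    """feats_a, feats_b: (N, D) feature arrays of two image sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    return frechet_distance(mu_a, cov_a, mu_b, cov_b)
```

Two identical feature sets give a distance of zero, and a pure shift of the mean adds its squared norm, which is why lower values indicate closer distributions.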
IV-C3 Evaluation with Sliced Wasserstein Distance
In addition to the FID, Sliced Wasserstein Distance (SWD) [Shmelkov_2018_ECCV] is also a suitable metric to evaluate images generated by GANs. SWD is usually adopted to evaluate high-resolution GANs, computing a multi-scale statistical similarity based on local image patches extracted from the Laplacian pyramid representation between generated and real images. Similar to FID, a lower SWD score is better. As shown in Table II, our DR-GAN with the multi-scale SCA achieves the best score of 6.17, which is much lower than that obtained when removing the input lesion masks. Moreover, compared with the DR-GAN without adaptive grading manipulation (AGM), the SWD score is also decreased by 2.5, demonstrating that the adaptive instance normalization by the grading vectors contributes to the synthesis blocks.
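The core of the metric, a sliced Wasserstein distance between two sets of descriptors, can be sketched as follows. This is a simplified illustration: the Laplacian pyramid and patch extraction are omitted, both sets are assumed to contain the same number of descriptors, and the number of projections and the seed are arbitrary choices.

```python
import numpy as np

def sliced_wasserstein(a, b, num_projections=64, seed=0):
    """a, b: (N, D) descriptor sets with the same N.

    Projects both sets onto random unit directions and averages the
    1-D Wasserstein-1 distances, computed by matching sorted projections.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(a.shape[1], num_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit-length columns
    proj_a = np.sort(a @ dirs, axis=0)
    proj_b = np.sort(b @ dirs, axis=0)
    return float(np.abs(proj_a - proj_b).mean())
```

In a full evaluation this distance would be computed per pyramid level on normalized patches and then averaged across levels.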
IV-D Data Augmentation by Synthesis for DR Grading
Our biggest concern is to evaluate whether the synthesized data can mitigate the unbalanced data distribution over different grading levels and be beneficial for training grading models. We train the baselines for the grading model, which adopt three different classic backbones VGG-16, ResNet-50 and Inception-v3, with and without using the synthesized data for augmentation. Some state-of-the-art approaches are also re-implemented and compared.
We configure two experimental settings. For the first setting, we train the grading models only using the real samples (denoted as Tr-Real) from the EyePACS training set and test on the synthesized data (denoted as Te-Fake). For the second one, we combine the training set of EyePACS and the synthesized data for training (denoted as Tr-Real & Fake), and evaluate on the real testing set of EyePACS (denoted as Te-Real). As illustrated in Table IV, both the classification accuracy and the quadratic weighted kappa metric [kaggle] are employed for evaluation. First, adopting the model trained on the real data with grading ground-truths, we observe that a highly promising grading performance on the synthesized data can be achieved. A classification accuracy of 90.46% and kappa value of 89.15% are obtained by the best-performing AFN [lin2018framework] model. Moreover, for the evaluation on real test images from EyePACS, the synthesized data are added into the training set for augmentation. The results show that consistent improvement is achieved over all five compared approaches. The accuracy is increased on average by 1.75% and the kappa is increased by 1.87%. We believe that once we obtain more training data with more accurate lesion masks, the proposed DR-GAN will be further enhanced and contribute to the grading performance even more significantly.
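For readers unfamiliar with the quadratic weighted kappa, a minimal pure-Python sketch of the standard definition follows. It compares the observed confusion matrix with the one expected from the marginal histograms, penalizing disagreements by the squared grade difference; the function name is hypothetical, and the degenerate case where every label and prediction fall in a single class is not handled.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Quadratic weighted kappa between two lists of integer grades."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2  # quadratic penalty
            num += w * observed[i][j]                  # observed disagreement
            den += w * hist_true[i] * hist_pred[j] / n # chance disagreement
    return 1.0 - num / den
```

Unlike plain accuracy, confusing level 0 with level 4 costs far more than confusing adjacent grades, which suits the ordinal nature of DR severity.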
In addition to the kappa and average accuracy metrics, we also calculate the true positive rate (TPR) per class to determine whether the synthesized samples improve the performance on each class. Here we mainly evaluate the effect of adding the fake data to the training set. The fake data are synthesized by DR-GAN with the multi-scale SCA module, and the evaluated model is based on the ResNet-50 backbone. The results are reported in Table V. We find that improvements are achieved over all the classes, particularly for grades 3 and 4, for which real data are scarce.
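The per-class TPR is the recall of each grade, that is, the fraction of images of a given true grade that the model assigns to that grade. A minimal sketch follows; the function name and the choice to return `None` for a grade with no test samples are illustrative assumptions.

```python
def per_class_tpr(y_true, y_pred, n_classes=5):
    """True positive rate (recall) for each grade in [0, n_classes)."""
    correct = [0] * n_classes
    support = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            correct[t] += 1
    # A grade without any true samples has an undefined TPR.
    return [c / s if s else None for c, s in zip(correct, support)]
```

Unlike the overall accuracy, this measure is not dominated by the majority class, so a gain on the rare grades 3 and 4 remains visible.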
IV-E Ablation Studies
To separately evaluate the effectiveness of the manipulation by lesion and grading information and the contribution of the multi-loss training, we conduct five ablation studies based on the ResNet-50 baseline. Without (w/o) Lesion Masks: We first study the effect of dropping the input lesion masks and using only the grading manipulation module, arbitrarily sampling the latent grading space. w/o Adaptive Grading Manipulation (AGM): Conversely, we investigate the effectiveness of the AGM module by detaching it while keeping the input lesion masks. w/o individual loss terms: Two variants, each dropping one term of the multi-loss training (one of them being the classification loss embedded in the discriminator), are also explored for their respective contributions to the overall loss function. DR-GAN w SCA: In this extension work, we add the multi-scale spatial and channel attention module to the generator and also explore the effectiveness of this design.
Table. IV compares the grading performance of each variant. We find that dropping the input lesion masks significantly degrades the grading performance because of the poor quality of the synthesized lesion patterns, so augmentation with these generated data does not help the grading model. Besides, compared with the model without AGM, including the AGM increases kappa by 1.68%. The AGM improves the fidelity and diversity of the synthesized lesion appearances within the corresponding grading levels. Moreover, dropping either of the two loss terms from the multi-loss training reduces the grading performance. In particular, for the discriminator without the embedded classification loss, the grading accuracy decreases by 1.17% and the kappa value by 1.83%. Finally, with the multi-scale SCA module, not only is the fidelity of the synthesized images improved, but the performance of the grading model also increases slightly.
In this paper, we proposed an effective high-resolution DR image generation model conditioned on grading and lesion information. The synthesized data can be used for data augmentation, particularly for abnormal images with severe DR levels, to improve the performance of grading models. In future work, more real images with pixel-level lesion mask annotations will be added to train DR-GAN better.
The authors would like to thank the collaborative local hospitals for providing the data and annotations.