Identifying the key components in ResNet-50 for diabetic retinopathy grading from fundus images: a systematic investigation

by Yijin Huang, et al.

Although deep learning based diabetic retinopathy (DR) classification methods typically benefit from well-designed architectures of convolutional neural networks, the training setting also has a non-negligible impact on the prediction performance. The training setting includes various interdependent components, such as objective function, data sampling strategy and data augmentation approach. To identify the key components in a standard deep learning framework (ResNet-50) for DR grading, we systematically analyze the impact of several major components. Extensive experiments are conducted on a publicly-available dataset EyePACS. We demonstrate that (1) the ResNet-50 framework for DR grading is sensitive to input resolution, objective function, and composition of data augmentation, (2) using mean square error as the loss function can effectively improve the performance with respect to a task-specific evaluation metric, namely the quadratically-weighted Kappa, (3) utilizing eye pairs boosts the performance of DR grading and (4) using data resampling to address the problem of imbalanced data distribution in EyePACS hurts the performance. Based on these observations and an optimal combination of the investigated components, our framework, without any specialized network design, achieves the state-of-the-art result (0.8631 for Kappa) on the EyePACS test set (a total of 42670 fundus images) with only image-level labels. Our codes and pre-trained model are available at




1 Introduction

Diabetic retinopathy (DR) is one of the microvascular complications of diabetes, causing vision impairment and blindness (Li et al., 2021; Alyoubi et al., 2020). The major pathological signs of DR include hemorrhages, exudates, microaneurysms, and retinal neovascularization. The digital color fundus image is the most widely used imaging modality for ophthalmologists to screen and identify the severity of DR, which can reveal the presence of different lesions. An early diagnosis and timely intervention of DR is of vital importance in preventing patients from vision malfunction. However, due to the rapid increase in the number of patients at risk of developing DR, ophthalmologists in regions with limited medical resources bear a heavy labor-intensive burden in DR screening. As such, developing automated and efficient DR diagnosis and prognosis approaches is urgently needed to reduce the number of untreated patients and the burden of ophthalmic experts.

Based on the type and quantity of lesions in fundus images, DR can be classified into five grades: 0 (normal), 1 (mild DR), 2 (moderate DR), 3 (severe DR), and 4 (proliferative DR) (Lin et al., 2020). Red dot-shaped microaneurysms are the first visible sign of DR, and their presence indicates a mild grade of DR. Red lesions (e.g., hemorrhages) and yellow-white lesions (e.g., hard exudates and soft exudates) come in various shapes, from tiny points to large patches. A larger number of such lesions indicates a more severe DR grade. Neovascularization, the formation of new retinal vessels in the optic disc or its periphery, is a significant sign of proliferative DR. Fig. 1 shows examples of fundus images with different types of lesions.

In recent years, deep learning based methods have achieved great success in the field of computer vision. With the capability of highly representative feature extraction, convolutional neural networks (CNNs) have been proposed to tackle different tasks. They have also been widely used in the medical image analysis realm (Quellec et al., 2017; Lyu et al., 2019; Araújo et al., 2020; Guo and Yuan, 2020; Huang et al., 2020; Kervadec et al., 2021). In DR grading, Pratt et al. (2016) adopts a pre-trained CNN as a feature extractor and re-trains the last fully connected layer for DR detection. Given that lesions are important in DR grading, Attention Fusion Network (Lin et al., 2018) employs a lesion detector to predict the probabilities of lesions and proposes an information fusion method based on an attention mechanism to identify DR. Zoom-in-net (Wang et al., 2017) consists of three sub-networks which respectively localize suspicious regions, analyze lesion patches and classify the image of interest. To enhance the capability of a standard CNN, CABNet (He et al., 2020) introduces two extra modules, one for exploring region-wise features for each DR grade and one for generating attention feature maps.

It can be observed that recent progress in automatic DR grading is largely attributed to carefully designed model architecture. Nevertheless, the task-specific designs and specialized configurations may limit their transferability and extensibility. Other than model architecture, the training setting is also a key factor affecting the performance of a deep learning method. A variety of interdependent components are typically involved in a training setting, including the design of configurations (e.g., preprocessing, loss function, sampling strategy, and data augmentation) and empirical decisions of hyper-parameters (e.g., input resolution, learning rate, and training epochs). Proper training settings can benefit automatic DR grading, while improper ones may damage the grading performance. However, the importance of the training setting has been overlooked or received less attention in the past few years, especially in the DR grading field. In computer vision, there have been growing efforts in improving the performance of deep learning methods by refining the training setting rather than the network architecture. For example, He et al. (2019) boosts ResNet-50’s (He et al., 2016) top-1 validation accuracy from 75.3% to 79.29% on ImageNet (Deng et al., 2009) by applying numerous training procedure refinements. Bochkovskiy et al. (2020) examines combinations of training configurations such as batch-normalization and residual-connection, and utilizes them to improve the performance of object detection. In the biomedical domain, efforts in this direction have also emerged. For example, Isensee et al. (2021) proposes an efficient deep learning-based segmentation framework for biomedical images, namely nnU-Net, which can automatically and optimally configure its own setting including preprocessing, training and post-processing. In such context, we believe that refining the training setting has a great potential in enhancing the DR grading performance.

Figure 1: A normal fundus image (left) and a representative DR fundus image with lesions (right).

In this work, we systematically analyze the influence of several major components of a standard DR classification framework and identify the key elements in the training setting for improving the DR grading performance. The components analyzed in our work are shown in Fig. 2. The main contributions of this work can be summarized as follows:

  • We examine a collection of designs with respect to the training setting and evaluate them on the most challenging and largest publicly-available fundus image dataset, EyePACS. We further analyze and illustrate the impact of each component on the DR grading performance to identify the core ones.

  • Based on our observations, we adopt ResNet-50 (He et al., 2016) as the backbone and achieve a quadratically-weighted Kappa of 0.8631 on the EyePACS test set, which outperforms many specifically-designed state-of-the-art methods with only image-level labels. With the plain ResNet-50, our framework can serve as a strong, standardized, and scalable DR grading baseline. That is, most methodological improvements and modifications can be easily incorporated into our framework to further improve the DR grading performance.

  • We emphasize that the superior performance of our framework is not achieved by a new network architecture, a new objective function nor a new scheme. The key contribution of this work, in a more generalizable sense, is that we outline another direction to improve the performance of deep learning methods for DR grading and highlight the importance of training setting refinements in developing deep learning based pipelines. This may also shed new insights into other related fields.

The remainder of this paper is organized as follows. Section 2 describes the details of our baseline framework, the default training setting, and the evaluation protocol. Descriptions of the investigated components in the training setting are presented in section 3. Extensive experiments are conducted in section 4 to evaluate the performance and influence of each refinement. Discussion and conclusion are respectively provided in section 5 and section 6.

Figure 2: Components analyzed in our deep learning-based DR grading framework. The evaluation process of a framework can be divided into two parts: training (top) and testing (bottom). In the training phase, we first fix the architecture of the selected network (ResNet-50). Then we examine a collection of designs with respect to the training setting including preprocessing (image resizing and enhancement), training strategies (compositions of data augmentation (DA) and sampling strategies) and optimization configurations (objective functions and learning rate (LR) schedules). In the testing phase, we apply the same preprocessing as in the training phase and employ paired feature fusion to make use of the correlation between the two eyes (the training step of the fusion network is omitted in this figure). Then, we select the best ensemble method for the final prediction.

Figure 3: The imbalanced data distribution of EyePACS.

2 Method

2.1 Dataset description

The EyePACS dataset is the largest publicly-available DR grading dataset released in the Kaggle DR grading competition, consisting of 88702 color fundus images from the left and right eyes of 44351 patients. Images were officially split into 35126/10906/42670 fundus images for training/validation/testing. According to the severity of DR, they have also been divided by ophthalmologists into the aforementioned five grades. The fundus images were acquired under a variety of conditions and from different imaging devices, resulting in variations in image resolution, aspect ratio, intensity, and quality. As shown in Fig. 3, the class distribution of EyePACS is extremely imbalanced: DR fundus images are far fewer than normal images.

2.2 Baseline setting

We first specify our baseline for DR grading. In the preprocessing step, for each image, we first identify the smallest rectangle that contains the entire field of view and use the identified rectangle for cropping. After that, we resize each cropped image to a square of the chosen input resolution and rescale each pixel intensity value into [0, 1]. Next, we normalize the RGB channels using z-score transformations with the mean and the standard deviations obtained from the entire preprocessed training set. Common random data augmentation operations including horizontal flipping, vertical flipping, and rotation, described in section 3.4, are performed during training.

A widely used architecture, ResNet-50, is employed in this work. We adopt the SGD optimizer with an initial learning rate of 0.001 and Nesterov Accelerated Gradient Descent (Nesterov, 1983) with a momentum factor of 0.9 to train the network. A weight decay of 0.0005 is applied for regularization. Convolutional layers are initialized with parameters obtained from a ResNet-50 pre-trained on the ImageNet dataset (Deng et al., 2009) and the fully connected layer is initialized using He’s initialization method (He et al., 2015). We train the model for 25 epochs with a mini-batch size of 16 on a single NVIDIA RTX TITAN. All code is implemented in PyTorch (Paszke et al., 2017). Unless otherwise specified, all models are trained with a fixed random seed for fair comparisons. The model with the highest metric on the validation set is selected for testing.

2.3 Evaluation metric

The DR grading performance is evaluated using the quadratically-weighted Kappa $\kappa$ (Cohen, 1968), which is the official metric of the Kaggle DR grading competition. In an ordinal multi-class task, given an observed confusion matrix $O$ and an expected matrix $E$, $\kappa$ measures their agreement by quadratically penalizing the distance between the prediction and the ground truth,

$$\kappa = 1 - \frac{\sum_{i=0}^{C-1}\sum_{j=0}^{C-1} W_{ij}\, O_{ij}}{\sum_{i=0}^{C-1}\sum_{j=0}^{C-1} W_{ij}\, E_{ij}},$$

where $C$ denotes the total number of classes, $W$ is a quadratic weight matrix, and subscripts $i$ and $j$ respectively denote the row and column indices of the matrix. The weight is defined as $W_{ij} = \frac{(i-j)^2}{(C-1)^2}$. $\kappa$ ranges from $-1$ to $1$, with $-1$ and $1$ respectively indicating total disagreement and complete agreement.
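As an illustration of the metric described above, the following is a minimal pure-Python sketch of the quadratically-weighted Kappa for integer grades. The function name and the list-based interface are assumptions made for this example, not the authors' implementation; the expected matrix is taken as the outer product of the two marginal histograms divided by the number of samples, which is the usual convention.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Quadratically-weighted Kappa for integer grades in [0, num_classes - 1]."""
    n = len(y_true)
    # Observed confusion matrix: rows index the ground truth, columns the prediction.
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    # Marginal histograms of ground-truth and predicted grades.
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n  # agreement expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    # Note: the denominator is zero if every label and prediction is the same grade.
    return 1.0 - numerator / denominator
```

Perfect agreement gives a Kappa of 1, while systematically swapping the two extreme grades gives -1; predictions that are off by one grade are penalized far less than predictions that are off by four.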

3 Training setting components

3.1 Input resolution

The resolution of the input image has a direct impact on the DR grading performance. Generally, ResNet-50 is designed for images of $224 \times 224$ input resolution (He et al., 2016). In ResNet-50, a convolution layer with a kernel size of $7 \times 7$ and a stride of 2 followed by a max-pooling layer is applied to dramatically downsample the input image first. Therefore, using images with very small input resolution may lose key features for DR grading, such as tiny lesions. In contrast, a network fed with large resolution images can extract more fine-grained and dense features at the cost of a smaller receptive field relative to the image and a higher computational cost. In this work, a range of resolutions is evaluated to identify the trade-off.
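To make the downsampling argument concrete, the sketch below computes the spatial size of the feature map after the ResNet-50 stem. It assumes the stem is a 7×7 convolution with stride 2 followed by a 3×3 max-pooling layer with stride 2; the padding values (3 and 1) follow common implementations and are an assumption here, as are the helper names.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_stem_output(size):
    """Side length after the assumed stem: 7x7 conv (stride 2), 3x3 max-pool (stride 2)."""
    after_conv = conv_out(size, kernel=7, stride=2, padding=3)
    return conv_out(after_conv, kernel=3, stride=2, padding=1)


# Illustrative resolutions only: the stem reduces each side by a factor of 4.
for side in (128, 224, 512):
    print(side, "->", resnet_stem_output(side))
```

Under these assumptions a lesion that spans 8 pixels in the input covers only about 2 positions after the stem, which is why very small inputs risk losing tiny lesions before any residual block sees them.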

3.2 Loss function

The objective function plays a critical role in deep learning. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the training set, where $x_i$ is the input image and $y_i$ is the corresponding ground truth label. There are a variety of objective functions that can be used to measure the discrepancy between the predicted probability distribution $\hat{y}_i$ and the ground truth distribution $\tilde{y}_i$ (the one-hot encoding of $y_i$) of the given label.

3.2.1 Cross-entropy loss

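The cross-entropy loss is the most commonly used loss function for classification tasks. It is the negative log-likelihood of a Bernoulli or categorical distribution,

$$\mathrm{CE}(\tilde{y}, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=0}^{C-1} \tilde{y}_{i,c}\,\log \hat{y}_{i,c},$$

where $N$ is the number of training samples, $C$ is the number of classes, and $\tilde{y}_{i,c}$ and $\hat{y}_{i,c}$ are respectively the one-hot ground truth and the predicted probability of the $c$-th class for the $i$-th sample.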

3.2.2 Focal loss

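The focal loss was initially proposed in RetinaNet (Lin et al., 2017), which introduces a modulating factor into cross-entropy to down-weigh the loss of well-classified samples, giving more attention to challenging and misclassified ones. The focal loss is widely used to address the class imbalance problem in training deep neural networks. As mentioned before, EyePACS is an extremely imbalanced dataset, with the number of images per class ranging from 25810 down to 708. Therefore, the focal loss is applied for better feature learning with samples from the minority classes. The focal loss is defined as

$$\mathrm{FL}(\tilde{y}, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=0}^{C-1} \left(1 - \hat{y}_{i,c}\right)^{\gamma}\,\tilde{y}_{i,c}\,\log \hat{y}_{i,c},$$

where $\gamma$ is a hyper-parameter, and $\tilde{y}_{i,c}$ and $\hat{y}_{i,c}$ are respectively the one-hot ground truth and the predicted probability of the $c$-th class for the $i$-th sample. When the predicted probability $\hat{y}_{i,c}$ is small, the modulating factor $(1 - \hat{y}_{i,c})^{\gamma}$ is close to 1. When $\hat{y}_{i,c}$ is large, this factor goes to 0 to down-weigh the corresponding loss.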
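The effect of the modulating factor can be checked numerically. The sketch below is a plain-Python illustration, assuming the standard focal-loss form in which the cross-entropy term of the true class is scaled by (1 - p)^γ; the function names and the list-of-probabilities interface are hypothetical and chosen only for this example.

```python
import math


def focal_loss(probs, labels, gamma=2.0):
    """Mean focal loss; probs[i][c] is the predicted probability of class c for sample i."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_true = p[y]  # probability assigned to the ground-truth class
        total += -((1.0 - p_true) ** gamma) * math.log(p_true)
    return total / len(labels)


def cross_entropy(probs, labels):
    """Cross-entropy is the special case gamma = 0 (no modulation)."""
    return focal_loss(probs, labels, gamma=0.0)


easy = ([[0.9, 0.1]], [0])  # well-classified sample
hard = ([[0.1, 0.9]], [0])  # misclassified sample
print(focal_loss(*easy) / cross_entropy(*easy))  # scaled by (1 - 0.9)^2
print(focal_loss(*hard) / cross_entropy(*hard))  # scaled by (1 - 0.1)^2
```

With γ = 2, the well-classified sample keeps only about 1% of its cross-entropy loss, whereas the misclassified one keeps about 81%, so the gradient is dominated by hard samples.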

3.2.3 Kappa loss

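The quadratically-weighted Kappa is sensitive to disagreements in marginal distributions, whereas cross-entropy loss does not take into account the distribution of the predictions and the magnitude of the incorrect predictions. Therefore, the soft Kappa loss (Fauw, 2015) based on the Kappa metric is another common choice for training the DR grading model,

$$\mathrm{KL}(y, \hat{y}) = \frac{o}{e}, \qquad o = \sum_{i=1}^{N}\sum_{c=0}^{C-1} \frac{(y_i - c)^2}{(C-1)^2}\,\hat{y}_{i,c}, \qquad e = \sum_{m=0}^{C-1}\sum_{n=0}^{C-1} \frac{(m-n)^2}{(C-1)^2}\cdot\frac{\left(\sum_{i=1}^{N}\mathbb{1}_{\{y_i = n\}}\right)\left(\sum_{j=1}^{N}\hat{y}_{j,m}\right)}{N},$$

where $C$ is the number of classes, $\hat{y}_{i,c}$ is the predicted probability of the $c$-th class of $x_i$, and $\mathbb{1}_{\{y_i = n\}}$ is an indicator function equal to 1 if $y_i = n$ and 0 otherwise. Minimizing $o/e$ is equivalent to maximizing a soft version of the Kappa metric, since the latter equals $1 - o/e$. As suggested by a previous work (Fauw, 2015), combining the Kappa loss with the standard cross-entropy loss can stabilize the gradient at the beginning of training to achieve better prediction performance.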

3.2.4 Regression loss

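In addition to the Kappa loss, a regression loss also penalizes the distance between the prediction and the ground truth. When a regression loss is applied, the softmax activation after the fully connected layer is removed and the output dimension is set to 1 to produce a scalar prediction score $\hat{y}_i$ for the DR grade. Three regression loss functions are considered in this work, namely L1 loss (Mean Absolute Error, MAE), L2 loss (Mean Square Error, MSE), and smooth L1 loss (SmoothL1), which are respectively defined as

$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N} \left|y_i - \hat{y}_i\right|,$$

$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2,$$

$$\mathrm{SmoothL1}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N} \begin{cases} 0.5\,(y_i - \hat{y}_i)^2, & \text{if } |y_i - \hat{y}_i| < 1, \\ |y_i - \hat{y}_i| - 0.5, & \text{otherwise.} \end{cases}$$

In the testing phase, the prediction scores are clipped to [0, 4] and then simply rounded to integers to serve as the final predicted grades.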
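The regression formulation and the test-time conversion from scores to grades can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming the standard definitions of the three losses (with the smooth L1 threshold at 1) and round-half-up as the rounding rule, which the text does not specify.

```python
import math


def mae(scores, grades):
    """Mean absolute error between predicted scores and integer grades."""
    return sum(abs(g - s) for s, g in zip(scores, grades)) / len(grades)


def mse(scores, grades):
    """Mean square error; large deviations are penalized quadratically."""
    return sum((g - s) ** 2 for s, g in zip(scores, grades)) / len(grades)


def smooth_l1(scores, grades):
    """Quadratic for absolute errors below 1, linear otherwise."""
    total = 0.0
    for s, g in zip(scores, grades):
        d = abs(g - s)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(grades)


def scores_to_grades(scores, low=0, high=4):
    """Clip each score to [low, high], then round to the nearest integer grade."""
    grades = []
    for s in scores:
        clipped = min(max(s, low), high)
        # Round half up (an assumption); Python's built-in round() rounds half to even.
        grades.append(int(math.floor(clipped + 0.5)))
    return grades


print(scores_to_grades([-0.7, 0.4, 2.6, 5.3]))
```

For an error of 2 grades, MSE contributes 4 while MAE contributes 2 and smooth L1 contributes 1.5, which illustrates why MSE penalizes distant misgradings most heavily.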

3.3 Learning rate schedule

The learning rate is important in gradient descent methods, as it has a non-trivial impact on the convergence of the objective function. However, the optimal learning rate may vary at different training phases. Therefore, a learning rate schedule is widely used to adjust the learning rate during training. Multiple-step decaying, exponential decaying, and cosine decaying (Loshchilov and Hutter, 2016) are popular learning rate adjustment strategies in deep learning. Specifically, the multiple-step decaying schedule decreases the learning rate by a constant factor at specific training epochs. The exponential decaying schedule exponentially decreases the learning rate by a factor $p$ at every epoch, namely

$$\eta_t = p^{t}\,\eta_0,$$

where $\eta_t$ is the learning rate at epoch $t$ and $\eta_0$ is the initial learning rate. A typical choice of $p$ is 0.9. The cosine decaying schedule decreases the learning rate following the cosine function. Given a total number of training epochs $T$, the learning rate in the cosine decaying schedule is defined as

$$\eta_t = \frac{1}{2}\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)\eta_0.$$

Due to the observation that too small learning rates may lead to overfitting of the model in the last few epochs, we set a minimum learning rate $\eta_{\min}$ for the cosine decaying schedule, which yields the clipped cosine decaying schedule

$$\eta_t = \max\left(\frac{1}{2}\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)\eta_0,\ \eta_{\min}\right).$$

The setting of the cosine and clipped cosine decaying schedules is independent of the number of epochs, making them more flexible than other schedules.
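The three decaying schedules can be written as small functions of the epoch index. The sketch below assumes the usual closed forms (a constant per-epoch decay factor for the exponential schedule and a half-cosine from the initial rate down to zero for the cosine schedule); the function names are hypothetical, and the default values mirror the initial learning rate of 0.001 and the 25 training epochs used in this work.

```python
import math


def exponential_lr(epoch, base_lr=0.001, decay=0.9):
    """Learning rate multiplied by a constant decay factor after every epoch."""
    return base_lr * decay ** epoch


def cosine_lr(epoch, total_epochs=25, base_lr=0.001):
    """Half-cosine decay from base_lr at epoch 0 to zero at the final epoch."""
    return 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs)) * base_lr


def clipped_cosine_lr(epoch, total_epochs=25, base_lr=0.001, min_lr=1e-4):
    """Cosine decay that never drops below min_lr."""
    return max(cosine_lr(epoch, total_epochs, base_lr), min_lr)


for epoch in (0, 5, 10, 20, 25):
    print(epoch, exponential_lr(epoch), cosine_lr(epoch), clipped_cosine_lr(epoch))
```

Printing the values side by side shows that the exponential schedule loses roughly 40% of the initial rate within the first five epochs, whereas the cosine schedules stay close to the initial rate early on and only differ from each other near the end of training.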

3.4 Composition of data augmentation

Figure 4: Illustration of common data augmentation operations.

Applying online data augmentation during training can increase the distribution variability of the input images to improve the generalization capacity and robustness of a model of interest. To systematically study the impact of the composition of data augmentation on DR grading, as shown in Fig. 4, various popular augmentation operations are considered in this work. For geometric transformations, we apply horizontal and vertical flipping, random rotation, and random cropping. For color transformations, color distortion is a common choice, including adjustments of brightness, contrast, saturation, and hue. Moreover, Krizhevsky color augmentation (Krizhevsky et al., 2012) is evaluated in our experiments, which has been suggested to be effective by the group that ranked third in the Kaggle DR grading competition (Antony, 2015).

For the cropping operation, we randomly crop a rectangular region the size of which is randomly sampled in [1/1.15, 1.15] times the original one and the aspect ratio is randomly sampled in [0.7, 1.3], and then we resize this region back to be of the original size. Horizontal and vertical flipping is applied with a probability of 0.5. The color distortion operation adjusts the brightness, contrast, and saturation of the images with a random factor in [-0.2, 0.2] and the hue with a random factor in [-0.1, 0.1]. The rotation operation randomly rotates each image of interest by an arbitrary angle.

3.5 Preprocessing

In addition to background removal, two popular preprocessing operations for fundus images are considered in this work, namely Graham processing (Graham, 2015) and contrast limited adaptive histogram equalization (CLAHE) (Huang et al., 2012). Both of them can alleviate the blur, low contrast, and inhomogeneous illumination issues that exist in the EyePACS dataset.

The Graham method was proposed by B. Graham, the winner of the Kaggle DR grading competition. This preprocessing method has also been used in many previous works (Quellec et al., 2017; Yang et al., 2017) to remove image variations due to different lighting conditions or imaging devices. Given a fundus image $I$, the processed image $\hat{I}$ after Graham is obtained by

$$\hat{I} = \alpha I + \beta\, G(\theta) * I + \gamma,$$

where $G(\theta)$ is a 2D Gaussian filter with a standard deviation $\theta$, $*$ is the convolution operator, and $\alpha$, $\beta$ and $\gamma$ are weighting factors. Following Yang et al. (2017), $\theta$, $\alpha$, $\beta$, and $\gamma$ are respectively set as 10, 4, -4, and 128. As shown in Fig. 5, all images are normalized to be relatively consistent with each other and vessels as well as lesions are particularly highlighted after Graham processing.

CLAHE is a contrast enhancement method based on Histogram Equalization (HE) (Huang et al., 2006), which has also been widely used to process fundus images and has been suggested to be able to highlight lesions (Huang et al., 2020; Sahu et al., 2019; Datta et al., 2013). HE improves the image contrast by spreading out the most frequently occurring intensity values in the histogram, but it amplifies noise as well. CLAHE was proposed to prevent an over-amplification of noise by clipping the histogram at a predefined value. Representative enhanced images via CLAHE are also illustrated in Fig. 5.

3.6 Sampling strategy

As mentioned in section 2.1, EyePACS is an extremely imbalanced dataset. To address this problem, several sampling strategies (Kang et al., 2019; Antony, 2015) for the training set have been proposed to rebalance the data distribution. Three commonly used sampling strategies are examined in this work: (1) instance-balanced sampling samples each data point with an equal probability. In this case, the class with more samples than the others can be dominant in the training phase, leading to model bias during testing; (2) class-balanced sampling first selects each class with an equal probability, and then uniformly samples data points from the selected class. In this way, samples in the minority classes are given more attention for better representation learning; (3) progressively-balanced sampling starts with class-balanced sampling and then exponentially moves to instance-balanced sampling. Please note that we follow the interpolation strategy adopted by Antony (2015) instead of the one presented by Kang et al. (2019), which linearly interpolates the sampling weight from instance-balanced sampling to class-balanced sampling. Specifically, the sampling weight in this work is defined as

$$p_i^{\mathrm{PB}}(t) = \alpha^{t}\, p_i^{\mathrm{CB}} + \left(1 - \alpha^{t}\right) p_i^{\mathrm{IB}},$$

where $p^{\mathrm{PB}}$, $p^{\mathrm{CB}}$ and $p^{\mathrm{IB}}$ are sampling weights in progressively-balanced, class-balanced and instance-balanced sampling, $t$ indexes the training epoch and $\alpha$ is a hyper-parameter that controls the change rate.
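The three sampling strategies can be compared at the level of per-class sampling probabilities. The sketch below is only an illustration: it assumes the progressively-balanced weight is an exponential interpolation of the form α^t between the class-balanced and instance-balanced distributions, and the class counts, the value of α, and the function names are all hypothetical.

```python
def instance_balanced(class_counts):
    """Each image is equally likely, so a class is drawn in proportion to its size."""
    total = sum(class_counts)
    return [count / total for count in class_counts]


def class_balanced(class_counts):
    """Each class is equally likely, regardless of how many images it has."""
    num_classes = len(class_counts)
    return [1.0 / num_classes] * num_classes


def progressively_balanced(class_counts, epoch, alpha=0.9):
    """Start class-balanced at epoch 0 and move toward instance-balanced sampling."""
    mix = alpha ** epoch  # weight on the class-balanced distribution (assumed form)
    cb = class_balanced(class_counts)
    ib = instance_balanced(class_counts)
    return [mix * c + (1.0 - mix) * i for c, i in zip(cb, ib)]


toy_counts = [800, 150, 50]  # hypothetical imbalanced class sizes
for epoch in (0, 5, 25):
    print(epoch, [round(p, 3) for p in progressively_balanced(toy_counts, epoch)])
```

At epoch 0 every class is drawn with probability 1/3; as training proceeds the probabilities approach the empirical class frequencies, so minority-class images are repeated less and less often.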

Figure 5: Representative enhanced fundus images using Graham processing and CLAHE.

3.7 Prior knowledge

Figure 6: Representative eye pairs with different quality of the left and right fields.

For medical image analysis, prior knowledge can significantly enhance the performance of deep learning frameworks. In the EyePACS dataset, both the left and right eyes of a patient are provided. Evidence shows that in more than 95% of cases the difference in the DR grade between the left and right eyes is no more than 1 (Wang et al., 2017). Moreover, as demonstrated in Fig. 6, the quality of the left and right fields of an eye pair may be different, and it is difficult to identify the grade of a fundus image with poor quality. In this case, information from the other eye may greatly benefit the estimation of the grade of the poor-quality one.

As such, to utilize the correlation between the two eyes, we concatenate the feature vectors of both eyes from the global average pooling layer of ResNet-50 and then input it into a paired feature fusion network. The network consists of 3 linear layers each followed by a 1D max-pooling layer with a stride of 2 and rectified linear unit (ReLU). Considering that the grading criterion for left and right eyes is the same, the feature fusion network only outputs the prediction for one eye and then changes the order of the two feature vectors during concatenation for the prediction of the other eye.
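The order-swapping trick for the eye pair can be sketched independently of the actual fusion network. In the example below, `predict_eye_pair` and `toy_fusion` are hypothetical names; the toy fusion function merely stands in for the trained paired feature fusion network, and the convention that the eye being graded comes first in the concatenation is an assumption.

```python
def predict_eye_pair(left_feat, right_feat, fusion_fn):
    """Grade both eyes with one fusion function by swapping the concatenation order."""
    # Assumed convention: the eye to be graded is placed first.
    left_pred = fusion_fn(left_feat + right_feat)
    right_pred = fusion_fn(right_feat + left_feat)
    return left_pred, right_pred


def toy_fusion(pair_feat):
    """Stand-in for the fusion network: weights the first eye more than the second."""
    half = len(pair_feat) // 2
    return 0.8 * sum(pair_feat[:half]) + 0.2 * sum(pair_feat[half:])


left, right = [1.0, 1.0], [3.0, 3.0]  # toy feature vectors as Python lists
print(predict_eye_pair(left, right, toy_fusion))
```

Because the same function is applied to both orderings, a single set of fusion parameters serves both eyes, and each prediction is pulled toward the evidence from the fellow eye.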

3.8 Ensembling

Ensemble methods (Opitz and Maclin, 1999) are widely used in data science competitions to achieve better performance. The variance in the predictions and the generalization errors can be considerably reduced by combining predictions from multiple models or inputs. However, ensembling too many models can be computationally expensive and the performance gains may diminish with the increasing number of models. To make our proposed pipeline generalizable, two simple ensemble methods are considered: 1) for the ensemble method that uses multiple models (Krizhevsky et al., 2012; Caruana et al., 2004), we average the predictions from models trained with different random seeds. In this way, the datasets have different sampling orders and different data augmentation parameters to train each model, resulting in differently trained models for ensembling; 2) for the ensemble method that uses multiple views (Simonyan and Zisserman, 2014; Szegedy et al., 2016), we first generate different image views via random flipping and rotation (test-time augmentation). Then these views, including the original one, are input into a single model to generate each view’s DR grade score. We then use the averaged score as the final predicted one.
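Both ensemble methods reduce to averaging several score vectors for the same set of images, whether they come from different models or from different views. The following sketch shows that averaging step followed by conversion to a grade; the function names are hypothetical, the scores are made-up values, and round-half-up after clipping to [0, 4] is an assumed rounding rule.

```python
import math


def average_scores(score_lists):
    """Average per-image scores; score_lists[k][i] is the score of image i from source k."""
    num_sources = len(score_lists)
    return [sum(per_image) / num_sources for per_image in zip(*score_lists)]


def to_grade(score, low=0, high=4):
    """Clip the averaged score and round it to the nearest integer grade."""
    clipped = min(max(score, low), high)
    return int(math.floor(clipped + 0.5))


# Three sources (models or augmented views) scoring the same two images.
scores = [[1.2, 3.8], [1.6, 3.0], [2.0, 3.4]]
averaged = average_scores(scores)
print(averaged, [to_grade(s) for s in averaged])
```

Averaging before rounding matters: rounding each source first and then voting would discard how close each score was to a grade boundary.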

4 Experimental Results

4.1 Influence of different input resolutions

First, we study the influence of different input resolutions using the default setting specified in section 2.2. The experimental results are shown in Table 1. As suggested by the results, DR grading benefits from larger input resolutions at the cost of higher training and inference computational expenses. A significant performance improvement of 16.42% in the test Kappa is obtained by increasing the resolution from $128 \times 128$ to $512 \times 512$. Increasing the resolution to $1024 \times 1024$ further improves the test Kappa by another 1.32% but with a large computational cost increase of 64.84G floating-point operations (FLOPs). Considering the trade-off between performance and computational cost, the $512 \times 512$ input resolution is adopted for all our subsequent experiments.

Resolution | Training time | FLOPs | Validation Kappa | Test Kappa
128 × 128 | 1h 54m | 1.35G | 0.6535 | 0.6388
256 × 256 | 2h 19m | 5.40G | 0.7563 | 0.7435
512 × 512 | 5h 16m | 21.61G | 0.8054 | 0.8032
768 × 768 | 11h 15m | 48.63G | 0.8176 | 0.8137
1024 × 1024 | 11h 46m (2 GPUs) | 86.45G | 0.8187 | 0.8164

Table 1: DR grading performance with different input resolutions. Two GPUs are used to train the model with the 1024 × 1024 input resolution due to the CUDA memory limitation.
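The FLOPs column of Table 1 is consistent with the cost of a convolutional network growing roughly with the number of input pixels, that is, quadratically with the side length. The short check below uses only the FLOPs values reported in the table; the variable and function names are hypothetical.

```python
import math

# FLOPs (in G) reported in Table 1, from the smallest to the largest resolution.
TABLE1_FLOPS_G = [1.35, 5.40, 21.61, 48.63, 86.45]


def implied_side_ratios(flops):
    """Side-length ratios relative to the first row, assuming FLOPs scale with side^2."""
    return [math.sqrt(f / flops[0]) for f in flops]


ratios = implied_side_ratios(TABLE1_FLOPS_G)
print([round(r, 3) for r in ratios])
```

The implied side lengths are very close to 1, 2, 4, 6 and 8 times the smallest one, so moving from the third to the fifth row quadruples the computation, which matches the 64.84G FLOPs increase quoted in the text.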

4.2 Influence of different objective functions

We further evaluate the seven objective functions described in section 3.2. We also evaluate the objective function by combining the Kappa loss and the cross-entropy loss (Fauw, 2015). All objective functions are observed to converge after 25 epochs of training. The validation and test Kappa for applying different loss functions are reported in Table 2. The results demonstrate that the focal loss and the combination of the Kappa loss and the cross-entropy loss slightly improve the performance compared to the standard cross-entropy loss. The observation that using the Kappa loss alone makes the training unstable and results in inferior performance is consistent with that reported in Fauw (2015). The MSE loss takes into account the distance between the prediction and the ground truth, yielding a 2.02% improvement compared to the cross-entropy loss. It penalizes outliers more heavily than the MAE loss and the smooth L1 loss, giving it the highest validation and test Kappa among all the objective functions we consider.

To demonstrate the influence of different objective functions on the distribution of predictions, we present the confusion matrices of the test set for the cross-entropy loss and the MSE loss in Fig. 7. Considering the imbalanced distribution of different classes in EyePACS, we normalize the matrices by dividing each value by the sum of its corresponding row. As shown in Fig. 7, although employing the MSE loss does not improve the performance of correctly discriminating each category, the prediction-versus-ground truth distance from using MSE is smaller than that from using cross-entropy (e.g., 7.9% of proliferative DR images (Grade 4) are predicted to be normal when using the cross-entropy loss, while only 1.0% when using the MSE loss). That is, the predictions from the model using the MSE loss as the objective function show more diagonal tendency compared to those using the cross-entropy loss, which contributes to the improvement in the Kappa metric. This diagonal tendency is important for DR grading in clinical practice because even if the diagnosis is wrong we expect our prediction to be at least close to the correct one.
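The row normalization applied to the confusion matrices is a one-line operation per row. The sketch below uses a hypothetical function name and made-up counts; each entry becomes the fraction of images of a given ground-truth grade that received each predicted grade.

```python
def normalize_rows(confusion):
    """Divide every entry by the sum of its row; all-zero rows are left as zeros."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Hypothetical 2-class counts: rows are ground truth, columns are predictions.
print(normalize_rows([[8, 2], [1, 3]]))
```

After normalization each row sums to 1, so a large majority class no longer visually dominates the matrix and per-grade error patterns can be compared directly.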

Figure 7: Confusion matrices from models respectively using the cross-entropy loss and the MSE loss as the objective function. All values in the confusion matrices are normalized.

Loss | Validation Kappa | Test Kappa
Cross Entropy (CE) | 0.8054 | 0.8032
Focal ($\gamma$=2) | 0.8079 | 0.8059
Kappa | 0.7818 | 0.7775
Kappa + CE | 0.8047 | 0.8050
MAE | 0.7655 | 0.7679
Smooth L1 | 0.8094 | 0.8117
MSE | 0.8207 | 0.8235

Table 2: DR grading performance of models using different objective functions. $\gamma$ is empirically set to 2 for the focal loss.

4.3 Influence of different learning rate schedules

Next, we study the influence of different learning rate schedules. During some experiments, we observe that, with the cosine decaying schedule, the training becomes ineffective and makes the model prone to overfitting when the learning rate decays below $10^{-4}$. Therefore, clipped cosine decaying with $\eta_{\min} = 10^{-4}$ is also evaluated. All experiments are conducted using the baseline setting with the $512 \times 512$ input resolution and the MSE loss. The experimental results are shown in Table 3. The results demonstrate that, except for the exponential decaying schedule, all schedules improve the Kappa on both the validation and test sets, and the clipped cosine decaying schedule gives the highest improvement of 0.37% in the test Kappa. A plausible reason for the performance drop caused by the exponential decaying schedule is that the learning rate decreases too fast at the beginning of training. Therefore, the initial learning rate should be carefully tuned when the exponential decaying schedule is employed.

Schedule | Validation Kappa | Test Kappa
Constant | 0.8207 | 0.8235
Multiple Steps [15, 20] | 0.8297 | 0.8264
Exponential (p=0.9) | 0.8214 | 0.8185
Cosine | 0.8269 | 0.8267
Clipped Cosine ($\eta_{\min}$=1e-4) | 0.8258 | 0.8272

Table 3: DR grading performance of models using different learning rate schedules. We set the initial learning rate to be 0.001 in all experiments. For the multiple-step decaying schedule, we decrease the learning rate by a factor of 0.1 at epoch 15 and epoch 20. For the exponential decaying schedule, we set the decay factor to be 0.9.

4.4 Influence of different compositions of data augmentation

We evaluate ResNet-50 with different compositions of data augmentation. In addition to flipping and rotation in the baseline setting, we consider random cropping, color jitter, and Krizhevsky color augmentation. We also evaluate the model trained without any data augmentation. All experiments are based on the best setting from previous evaluations. As shown in Table 4, even a simple composition of geometric data augmentation operations (the third row of Table 4) in the baseline setting can provide a significant improvement of 3.49% on the test Kappa. Each data augmentation operation combined with flipping can improve the corresponding model’s performance. However, the composition of all data augmentation operations considered in this work degrades the DR grading performance because too strong transformations may shift the distribution of the training data far away from the original one. Therefore, we do not simultaneously employ the two color transformations. The best test Kappa of 0.8310 is achieved by applying the composition of flipping, rotation, cropping, and color jitter for data augmentation during training. We adopt this composition in our following experiments.

flipping rotation cropping Color jitter Krizhevsky Validation Kappa Test Kappa
0.7913 0.7923
0.8124 0.8125
0.8258 0.8272
0.8194 0.8217
0.8129 0.8167
0.8082 0.8159
0.8276 0.8247
0.8307 0.8310
0.8308 0.8277
0.8247 0.8252
Table 4: DR grading performance of models using different compositions of data augmentation.
Figure 8: The performance of models using different sampling strategies for training. The dotted red line represents the best validation Kappa among these four experiments, which is achieved by instance-balanced sampling.
Preprocessing Validation Kappa Test Kappa
Default 0.8307 0.8310
Default + Graham (Graham, 2015) 0.8262 0.8260
Default + CLAHE (Huang et al., 2012) 0.8243 0.8238
Table 5: Our default preprocessing setting consists of background removal and image resizing. The parameters used in the Graham method are set following Yang et al. (2017). The clipping value and tile grid size of CLAHE are respectively set to be 3 and 8.

4.5 Influence of different preprocessing methods

Two popular image enhancement methods are evaluated in our study: Graham processing and CLAHE. Both have been suggested to be beneficial for DR identification (Yang et al., 2017; Sahu et al., 2019). Although lesions become more recognizable after applying either preprocessing method, neither is helpful for DR grading. As shown in Table 5, our framework with the Graham method achieves a test Kappa of 0.8260, which is lower than that of the default setting by about 0.5%. Applying CLAHE also hurts the performance of our framework, decreasing the test Kappa by about 0.7%. Unexpected noise and artifacts introduced by the preprocessing may be a cause of the performance degradation in our experiments. As such, no image enhancement is applied in the following experiments.
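For reference, the Graham enhancement is commonly written as a weighted difference between the image and a Gaussian-blurred copy of itself. The form below is a sketch only: the weights shown are those usually quoted from Graham's competition report and are an assumption here, since this study sets its parameters following Yang et al. (2017), which may differ. The symbols I, G_σ, α, β, and γ are introduced purely for illustration.

```latex
I_{\text{out}}(x, y) = \alpha\, I(x, y) + \beta\, (G_{\sigma} * I)(x, y) + \gamma,
\qquad \alpha = 4,\ \beta = -4,\ \gamma = 128
```

Here G_σ * I denotes the image convolved with a Gaussian kernel of width σ. Subtracting the blurred image removes slowly varying illumination, which makes small lesions stand out but also amplifies high-frequency noise, consistent with the artifacts mentioned above.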

4.6 Influence of different sampling strategies

We further examine the influence of different sampling strategies. To alleviate the class imbalance in EyePACS, class-balanced sampling and progressively-balanced sampling are applied in the training phase. However, as illustrated in Fig. 8, because data points from the minority classes are sampled repeatedly in each epoch, the model overfits, which results in poor performance on the validation set. The gap between the training Kappa and the validation Kappa increases as the probability of sampling the minority classes increases. Instance-balanced sampling, the most commonly used strategy, achieves the highest validation Kappa at the end of training. A plausible reason for this result is that the class distribution of the training set is consistent with that of the validation set as well as those of real-world datasets. Class-based sampling strategies may be more effective in cases where the training set is imbalanced and the test set is balanced (Kang et al., 2019).
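The three strategies can be compared through the probability of drawing a sample from each class. The sketch below follows the formulation of Kang et al. (2019), in which the class probability is proportional to n_j^q (q = 1 for instance-balanced, q = 0 for class-balanced) and progressively-balanced sampling interpolates linearly between the two over training. The class counts are hypothetical values chosen only to mimic a strong imbalance, and the function names are illustrative.

```python
def class_sampling_probs(counts, q):
    """Probability of drawing from each class: n_j**q / sum_i n_i**q."""
    weights = [n ** q for n in counts]
    total = sum(weights)
    return [w / total for w in weights]


def progressively_balanced_probs(counts, epoch, total_epochs):
    """Interpolate from instance-balanced (q=1) to class-balanced (q=0)."""
    t = epoch / total_epochs
    instance_balanced = class_sampling_probs(counts, q=1)
    class_balanced = class_sampling_probs(counts, q=0)
    return [(1 - t) * a + t * b
            for a, b in zip(instance_balanced, class_balanced)]


def expected_repeats(counts, probs):
    """Expected number of draws per image of each class in one epoch of
    sum(counts) draws with replacement."""
    epoch_size = sum(counts)
    return [p * epoch_size / n for p, n in zip(probs, counts)]


if __name__ == "__main__":
    # Hypothetical class counts for grades 0..4 (not the EyePACS statistics).
    counts = [7000, 700, 1500, 250, 200]
    for name, probs in (
        ("instance-balanced", class_sampling_probs(counts, q=1)),
        ("class-balanced", class_sampling_probs(counts, q=0)),
        ("progressive, mid-training", progressively_balanced_probs(counts, 12, 24)),
    ):
        print(name, [round(r, 2) for r in expected_repeats(counts, probs)])
```

With these hypothetical counts, class-balanced sampling shows each image of the rarest class several times per epoch while most images of the majority class are never seen, which illustrates why the gap between training and validation Kappa widens as minority classes are sampled more often.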

4.7 Influence of feature fusion of paired eyes

We evaluate the improvement resulting from utilizing the correlation between the two eyes of a pair for DR grading. The best model from previous evaluations is fixed and used to generate the feature vector of each fundus image. The simple paired feature fusion network described in section 3.7 is trained for 20 epochs with a batch size of 64. The learning rate is set to 0.02 without any decaying schedule. As shown in Table 6, paired feature fusion improves the validation Kappa by 2.90% and the test Kappa by 2.71%, demonstrating the importance of the eye pair correlation to DR grading.

4.8 Influence of different ensemble methods

HR  MSE  CCD  DA  PFF  ENS  Validation Kappa  Test Kappa  Δ Test Kappa
-   -    -    -   -    -    0.7563            0.7435      0%
✓   -    -    -   -    -    0.8054            0.8032      +5.97%
✓   ✓    -    -   -    -    0.8207            0.8235      +2.03%
✓   ✓    ✓    -   -    -    0.8258            0.8272      +0.37%
✓   ✓    ✓    ✓   -    -    0.8307            0.8310      +0.38%
✓   ✓    ✓    ✓   ✓    -    0.8597            0.8581      +2.71%
✓   ✓    ✓    ✓   ✓    ✓    0.8660            0.8631      +0.50%
Table 6: The performance of models for stacking refinements one by one. The first row is the result of the baseline we describe in section 2.1. HR, MSE, CCD, DA, PFF, and ENS respectively denote the application of high resolution, MSE loss, clipped cosine decaying schedule, data augmentation, paired feature fusion, and ensemble of multiple models.
# views / models   Multiple views                  Multiple models
                   Validation Kappa   Test Kappa   Validation Kappa   Test Kappa
1                  0.8597             0.8581       0.8597             0.8581
2                  0.8611             0.8593       0.8622             0.8596
3                  0.8608             0.8601       0.8635             0.8615
5                  0.8607             0.8609       0.8644             0.8617
10                 0.8633             0.8603       0.8660             0.8631
15                 0.8631             0.8611       0.8653             0.8631
Table 7: The performance of models with different ensemble methods.

We also evaluate the impact of the number of input views for the multiple-view ensemble method and the number of models for the multiple-model ensemble method. The experimental results are tabulated in Table 7. We observe that as the number of models increases, both the test Kappa and the validation Kappa generally increase. Unsurprisingly, the computational cost also increases monotonically with the number of ensemble members. For the ensemble method that uses multiple models, the performance gain from adding models eventually diminishes: the best test Kappa of 0.8631 is reached with 10 models, and using 15 models gives no further gain.
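The averaging step behind both ensemble variants can be sketched as follows. The sketch assumes that each ensemble member (a separately trained model, or one augmented view passed through a single model) outputs one scalar score per image, as a model trained with the MSE loss would, and that scores are mapped to grades 0 to 4 by fixed thresholds. The threshold values and function names are assumptions for illustration.

```python
def average_scores(per_member_scores):
    """Average scalar outputs across ensemble members, image by image.

    per_member_scores[m][i] is the score of member m for image i.
    """
    n_members = len(per_member_scores)
    return [sum(column) / n_members for column in zip(*per_member_scores)]


def score_to_grade(score, thresholds=(0.5, 1.5, 2.5, 3.5)):
    """Map a regression score to a DR grade in 0..4 (assumed thresholds)."""
    return sum(score >= t for t in thresholds)


if __name__ == "__main__":
    # Made-up scores of three members for three images.
    scores = [
        [0.4, 2.6, 3.2],
        [0.7, 2.3, 3.9],
        [0.3, 2.4, 3.6],
    ]
    averaged = average_scores(scores)
    print([score_to_grade(s) for s in averaged])
```

Averaging before thresholding lets members that disagree near a grade boundary cancel out, which is one plausible reason why the gain saturates once enough members are included.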

4.9 Comparison of the importance of all components

Finally, we investigate and compare the importance of all considered components in our DR grading task. We quantify the improvement from each component by applying them one by one; the results are shown in Table 6. Three significant improvements stand out from that table. First, increasing the input resolution gives the largest improvement of 5.97%. Then, the choice of the MSE loss and the utilization of the eye pair fusion improve the test Kappa by another 2.03% and 2.71%, respectively. Additional improvements of 0.37%, 0.38%, and 0.50% on the test Kappa are obtained by applying the clipped cosine decaying schedule, data augmentation, and the ensemble of multiple models. Note that the incremental results alone do not completely reflect the importance of the different components, because the configuration onto which a component is added also affects the corresponding improvement. In Fig. 9, we present the ranges and standard deviations of all experiments in this work. If the range of a box is large, the results of different choices for this component vary significantly. The top bar of the box represents the highest test Kappa that can be achieved by specifically refining the corresponding component. A bad choice of resolution, objective function, or data augmentation may lead to a large performance drop. Applying a learning rate schedule and ensembling both provide steady improvements, but using different schedules or ensemble methods does not significantly change the DR grading result.
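The percentages quoted above are absolute differences in test Kappa between consecutive rows of Table 6, multiplied by 100. A short sketch that recomputes them from the table values (the variable and function names are illustrative):

```python
# Test Kappa after each cumulative refinement, copied from Table 6.
STAGES = [
    ("baseline", 0.7435),
    ("HR", 0.8032),
    ("MSE", 0.8235),
    ("CCD", 0.8272),
    ("DA", 0.8310),
    ("PFF", 0.8581),
    ("ENS", 0.8631),
]


def incremental_gains(stages):
    """Gain of each refinement over the previous row, in Kappa points x 100."""
    gains = {}
    for (_, previous), (name, current) in zip(stages, stages[1:]):
        gains[name] = round((current - previous) * 100, 2)
    return gains


def total_gain(stages):
    """Overall gain from the first to the last row, in Kappa points x 100."""
    return round((stages[-1][1] - stages[0][1]) * 100, 2)


if __name__ == "__main__":
    for name, gain in incremental_gains(STAGES).items():
        print(f"{name}: +{gain:.2f}")
    print(f"total: +{total_gain(STAGES):.2f}")
```

Because each gain is measured relative to the row above it, the same component could yield a different number if the refinements were stacked in another order.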

Figure 9: Box plots of the test Kappa of all experiments in this work. The experiments in each column are set up based on the best model considering all its left components. DA and PFF denote the experiment results of different compositions of data augmentation and applying paired feature fusion or not.

4.10 Comparison with state-of-the-art

Method Backbone Test Kappa
Min-Pooling - 0.8490
o_O - 0.8450
RG - 0.8390
Zoom-in Net (Wang et al., 2017) - 0.8540
CABNet (He et al., 2020) ResNet-50 0.8456
Ours ResNet-50 0.8631
Table 8: Comparisons with state-of-the-art methods with only image-level labels. Symbol ‘-’ indicates the backbone of the method is designed by the corresponding authors. The results listed in the first three rows denote the top-3 entries on Kaggle’s challenge.
Figure 10: Visualization results from Grad-CAM. Representative eye pairs of four grades (mild DR, moderate DR, severe DR, and proliferative DR) are presented from top to bottom. The intensity of the heatmap indicates the importance of each pixel in the corresponding image for making the prediction.

To assess the performance of our framework with the optimal set of all components investigated in this work, Table 8 compares the proposed method with previously reported state-of-the-art methods that use neither additional datasets nor additional annotations. Our proposed method, without any specialized technique, outperforms the previous best result (0.8540, Zoom-in Net) by 0.91% in terms of the test Kappa.

We then visualize our results using Grad-CAM (Selvaraju et al., 2017). Fig. 10 shows representative results for four eye pairs corresponding to DR grades 1 to 4. The visualization suggests that our method’s performance in DR grading may result from its ability to recognize different signs of DR, namely lesions. We observe that the highlighted region of the heatmap in a severe DR image is usually larger than that in a mild one, because the number of lesions to some degree reflects the DR grade and the lesions are what the network focuses on.
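As a reminder of what the heatmaps represent, the core Grad-CAM computation can be sketched in a few lines: each feature map of a chosen convolutional layer is weighted by the spatial average of the gradient of the output score with respect to that map, the weighted maps are summed, and negative values are discarded. The sketch works on plain nested lists and assumes that the activations and gradients have already been extracted from the network; the final rescaling to [0, 1] is a display convenience and not part of the method's definition.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations[k][i][j]: value of feature map k at position (i, j).
    gradients[k][i][j]: gradient of the output score w.r.t. that value.
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    alphas = [sum(sum(row) for row in grad) / (height * width)
              for grad in gradients]

    # Weighted sum of the feature maps.
    heatmap = [[0.0] * width for _ in range(height)]
    for alpha, feature_map in zip(alphas, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += alpha * feature_map[i][j]

    # ReLU keeps only regions with a positive influence on the score.
    heatmap = [[max(value, 0.0) for value in row] for row in heatmap]

    # Rescale to [0, 1] for display (not part of the Grad-CAM definition).
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap


if __name__ == "__main__":
    # Toy example with two 2x2 feature maps.
    activations = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
    gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
    print(grad_cam(activations, gradients))
```

In practice the low-resolution heatmap is upsampled to the input size and overlaid on the fundus image, as in Fig. 10.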

5 Discussion

Recently, deep learning methods have exhibited great performance on the DR grading task, but deep neural networks have tended to become very large and highly sophisticated, making them difficult to transfer and extend. Inspired by Litjens et al. (2017), who state that ‘the exact architecture is not the most important determinant in getting a good solution’, we present a simple but effective framework without any elaborate design in the network itself. Our proposed framework outperforms several state-of-the-art, specifically designed approaches tested on the EyePACS dataset. The promising performance of our proposed framework comes from the right choices of the input resolution, the objective function, the learning rate schedule, the composition of data augmentation, the utilization of the eye pair, and the ensemble of multiple models. We also show that some popular techniques for fundus image-related tasks, such as image enhancement approaches and re-sampling strategies, are not always beneficial for DR grading.

In this work, we focus on improving the DR grading performance of ResNet-50 on the EyePACS dataset. All refinements and configurations are determined empirically based on our experimental results. Therefore, our solutions for DR grading may depend on the properties of the specific dataset and the specific network of interest. In other words, our empirically selected parameters may not work well on other neural network architectures or datasets. Specifically, the learning rate and its schedule need to be adjusted to identify the optimal solutions for frameworks using other types of neural networks as the backbone. The data augmentation composition may also need to be modified, and the paired feature fusion strategy may not always be applicable to other DR grading datasets. Nevertheless, our framework and the empirically selected parameters can be a good starting point for the trial-and-error process during method design.

Our framework still has considerable room for improvement. First, in addition to the components we analyzed, there are other major components in deep learning based frameworks that are also worth investigating and refining systematically, such as regularization techniques and optimization methods. Second, deeper CNNs have the potential to achieve better performance on DR grading. As such, more advanced CNN architectures such as ResNeXt (Xie et al., 2017) and DenseNet (Huang et al., 2017) will be evaluated in our future work. Third, a further direction is to improve the generalization capability of our framework on other DR grading datasets such as Messidor (Decencière et al., 2014) and DDR (Li et al., 2019). More robust configurations can be identified through experiments across different DR grading datasets.

6 Conclusion

In this work, we systematically investigate several important components in deep convolutional neural networks for improving the performance of ResNet-50 based DR grading. Specifically, the input resolution, objective function, learning rate schedule, data augmentation, preprocessing, data sampling strategy, prior knowledge, and ensemble method are examined in our study. Extensive experiments on the publicly available EyePACS dataset are conducted to evaluate the influence of different selections for each component. Finally, based on our findings, a simple yet effective framework for DR grading is proposed. The experimental results of this study can be summarized as follows.

  • We raise the ResNet-50 Kappa metric from 0.7435 to 0.8631 on the EyePACS dataset, outperforming other specially-designed DR grading methods.

  • Achieving state-of-the-art performance without any network architecture modification, we emphasize the importance of training setting refining in the development of deep learning based frameworks.

  • Our codes and pre-trained model are publicly accessible. We believe our simple yet effective framework can serve as a strong, standardized, and scalable baseline for further studies and developments of DR grading algorithms.


Acknowledgments

The authors would like to thank Meng Li from Zhongshan Ophthalmic Centre of Sun Yat-sen University as well as Yue Zhang from the University of Hong Kong for their help on this work. This study was supported by the National Natural Science Foundation of China (62071210), the Shenzhen Basic Research Program (JCYJ20190809120205578), the National Key R&D Program of China (2017YFC0112404), and the High-level University Fund (G02236002).


  • W. L. Alyoubi, W. M. Shalash, and M. F. Abulkhair (2020) Diabetic retinopathy detection through deep learning techniques: a review. Informatics in Medicine Unlocked, pp. 100377. Cited by: §1.
  • M. Antony (2015) Cited by: §3.4, §3.6.
  • T. Araújo, G. Aresta, L. Mendonça, S. Penas, C. Maia, Â. Carneiro, A. M. Mendonça, and A. Campilho (2020) DR|GRADUATE: uncertainty-aware deep learning-based diabetic retinopathy grading in eye fundus images. Medical Image Analysis 63, pp. 101715. Cited by: §1.
  • A. Bochkovskiy, C. Wang, and H. M. Liao (2020) Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. Cited by: §1.
  • R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes (2004) Ensemble selection from libraries of models. In Proceedings of the twenty-first international conference on Machine learning, pp. 18. Cited by: §3.8.
  • J. Cohen (1968) Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit.. Psychological bulletin 70 (4), pp. 213. Cited by: §2.3.
  • N. S. Datta, H. S. Dutta, M. De, and S. Mondal (2013) An effective approach: image quality enhancement for microaneurysms detection of non-dilated retinal fundus image. Procedia Technology 10, pp. 731–737. Cited by: §3.5.
  • E. Decencière, X. Zhang, G. Cazuguel, B. Lay, B. Cochener, C. Trone, P. Gain, R. Ordonez, P. Massin, A. Erginay, et al. (2014) Feedback on a publicly distributed image database: the messidor database. Image Analysis & Stereology 33 (3), pp. 231–234. Cited by: §5.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §2.2.
  • J. D. Fauw (2015) Cited by: §3.2.3, §4.2.
  • B. Graham (2015) Kaggle diabetic retinopathy detection competition report. University of Warwick. Cited by: §3.5, Table 5.
  • X. Guo and Y. Yuan (2020) Semi-supervised wce image classification with adaptive aggregated attention. Medical Image Analysis, pp. 101733. Cited by: §1.
  • A. He, T. Li, N. Li, K. Wang, and H. Fu (2020) CABNet: category attention block for imbalanced diabetic retinopathy grading. IEEE Transactions on Medical Imaging. Cited by: §1, Table 8.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: item 2, §1, §3.1.
  • T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li (2019) Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 558–567. Cited by: §1.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §5.
  • K. Huang, Q. Wang, and Z. Wu (2006) Natural color image enhancement and evaluation algorithm based on human visual system. Computer Vision and Image Understanding 103 (1), pp. 52–63. Cited by: §3.5.
  • S. Huang, F. Cheng, and Y. Chiu (2012) Efficient contrast enhancement using adaptive gamma correction with weighting distribution. IEEE transactions on image processing 22 (3), pp. 1032–1041. Cited by: §3.5, Table 5.
  • Y. Huang, L. Lin, M. Li, J. Wu, P. Cheng, K. Wang, J. Yuan, and X. Tang (2020) Automated hemorrhage detection from coarsely annotated fundus images in diabetic retinopathy. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1369–1372. Cited by: §1, §3.5.
  • F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein (2021) nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18 (2), pp. 203–211. Cited by: §1.
  • B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis (2019) Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217. Cited by: §3.6, §4.6.
  • H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. B. Ayed (2021) Boundary loss for highly unbalanced segmentation. Medical Image Analysis 67, pp. 101851. Cited by: §1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §3.4, §3.8.
  • T. Li, W. Bo, C. Hu, H. Kang, H. Liu, K. Wang, and H. Fu (2021) Applications of deep learning in fundus images: a review. Medical Image Analysis, pp. 101971. Cited by: §1.
  • T. Li, Y. Gao, K. Wang, S. Guo, H. Liu, and H. Kang (2019) Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Information Sciences 501, pp. 511–522. Cited by: §5.
  • L. Lin, M. Li, Y. Huang, P. Cheng, H. Xia, K. Wang, J. Yuan, and X. Tang (2020) The sustech-sysu dataset for automated exudate detection and diabetic retinopathy grading. Scientific Data 7 (1), pp. 1–10. Cited by: §1.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §3.2.2.
  • Z. Lin, R. Guo, Y. Wang, B. Wu, T. Chen, W. Wang, D. Z. Chen, and J. Wu (2018) A framework for identifying diabetic retinopathy based on anti-noise detection and attention-based fusion. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 74–82. Cited by: §1.
  • G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez (2017) A survey on deep learning in medical image analysis. Medical image analysis 42, pp. 60–88. Cited by: §5.
  • I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §3.3.
  • J. Lyu, P. Cheng, and X. Tang (2019) Fundus image based retinal vessel segmentation utilizing a fast and accurate fully convolutional network. In International Workshop on Ophthalmic Medical Image Analysis, pp. 112–120. Cited by: §1.
  • Y. E. Nesterov (1983) A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. akad. nauk Sssr, Vol. 269, pp. 543–547. Cited by: §2.2.
  • D. Opitz and R. Maclin (1999) Popular ensemble methods: an empirical study. Journal of artificial intelligence research 11, pp. 169–198. Cited by: §3.8.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §2.2.
  • H. Pratt, F. Coenen, D. M. Broadbent, S. P. Harding, and Y. Zheng (2016) Convolutional neural networks for diabetic retinopathy. Procedia computer science 90, pp. 200–205. Cited by: §1.
  • G. Quellec, K. Charrière, Y. Boudi, B. Cochener, and M. Lamard (2017) Deep image mining for diabetic retinopathy screening. Medical image analysis 39, pp. 178–193. Cited by: §1, §3.5.
  • S. Sahu, A. K. Singh, S. Ghrera, M. Elhoseny, et al. (2019) An approach for de-noising and contrast enhancement of retinal fundus image using clahe. Optics & Laser Technology 110, pp. 87–98. Cited by: §3.5, §4.5.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §4.10.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.8.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §3.8.
  • Z. Wang, Y. Yin, J. Shi, W. Fang, H. Li, and X. Wang (2017) Zoom-in-net: deep mining lesions for diabetic retinopathy detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 267–275. Cited by: §1, §3.7, Table 8.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §5.
  • Y. Yang, T. Li, W. Li, H. Wu, W. Fan, and W. Zhang (2017) Lesion detection and grading of diabetic retinopathy via two-stages deep convolutional neural networks. In International conference on medical image computing and computer-assisted intervention, pp. 533–540. Cited by: §3.5, §4.5, Table 5.