Diabetic patients are at constant risk of developing Diabetic Retinopathy (DR), which may eventually lead to permanent vision loss if left undetected or untreated. In such patients, elevated blood sugar, blood pressure, and cholesterol can cause small blood vessels in the retina to protrude and, in due course, haemorrhage blood into the retinal layers and/or vitreous humour. In severe cases, scar tissue and newly proliferated fragile blood vessels blanket the retina and obstruct incoming light from falling on it. As a result, the retina is unable to translate light into neural signals, which results in blindness. Diabetic retinopathy advances gradually and may take years to reach the proliferative stage; however, almost every diabetic patient is potentially susceptible to this complication.
Timely diagnosis is the key to appropriate prognosis. Ophthalmologists usually detect DR by examining the retinal fundus for signs of microaneurysms (bulging of blood vessels), blood leakage, and/or neovascularization. While the indications of advanced stages of DR are rather prominent, these symptoms remain largely inconspicuous in early stages. Figure 1 shows the progression of DR from healthy to the proliferative stage in Retinal Fundus Images (RFIs) taken from the EyePACS dataset (https://www.kaggle.com/c/diabetic-retinopathy-detection/data). It can be observed from the figure that the differences between healthy and early stages of DR are very subtle and not readily discernible. Manual analysis of these images requires highly qualified and specialized ophthalmologists, who may not be easily accessible in developing countries or in remote areas of developed countries. Even when medical experts are available, large-scale analysis of RFIs is highly time-consuming, labour-intensive, and prone to human error and bias. Furthermore, manual diagnosis by clinicians is largely subjective and rarely reproducible; inter-expert agreement for a given diagnosis is therefore generally very poor.
Computer-Aided Diagnosis (CAD) based on deep learning can provide an easily accessible, efficient, and economical solution for large-scale initial screening of many diseases, including diabetic retinopathy. CAD can perform objective analysis of a given image and produce a standardized, reproducible diagnosis, free from bias and fatigue. It can not only help physicians by reducing their workload but can also reach underprivileged populations and afford them the opportunity of swift and cost-effective initial screening, which may effectively prevent advancement of the disease into a more severe stage.
Convolutional Neural Networks (CNNs) are computer algorithms inspired by the biological visual cortex. They work especially well in visual recognition tasks and have matched or even outperformed humans in various challenging image recognition problems [14, 13]. Automated image recognition can be divided into coarse-grained and fine-grained classification. In the former, images are classified into high-level categories such as humans, animals, or vehicles in a natural scene; in the latter, classification targets low-level categories such as species of dogs or models of cars. Fine-grained classification is particularly challenging owing to high intra-class variation and low inter-class variation. Although DR grading is also a fine-grained classification task, it has usually been addressed with simple coarse-grained classification algorithms.
In this work we used a combination of general and fine-grained deep CNNs to analyze RFIs and predict an automated diagnosis for DR. We used two of the most popular conventional image classification architectures, i.e., Residual Networks and Densely Connected Networks, a network search framework called NASNet, and two recently proposed methods for fine-grained classification, namely NTS-Net and the SBS Layer. We tried to harness the combined potential of these two approaches by training the networks separately and taking their ensemble during inference. We used the EyePACS and Messidor datasets for evaluation. Since previous studies have used vastly disparate experimental setups, we cannot directly compare our results with most of them. However, we performed a broad range of experiments following the most common problem settings in the literature, i.e., normal vs abnormal, referable vs non-referable, ternary, and quaternary classification, in order to define benchmarks that will afford future works an opportunity for fair comparison.
1.1 Related Work
Over the past decade, machine learning and deep learning have been used to detect various pathologies, segment vessels, and classify DR grades using RFIs. Welikala et al. detected proliferative DR by identifying neovascularization. They used an ensemble of two networks, each trained separately on 100 different patches taken from a selected set of 60 images collected from Messidor and a private dataset. Since the dataset had only 60 images, they performed leave-one-out cross validation and achieved 0.9505 Area Under the Curve (AUC) and a sensitivity of 1 with a specificity of 0.95 at the optimal operating point. Wang et al. identified suspicious regions in RFIs and classified DR into normal (nDR) vs abnormal (aDR) and referable (rDR) vs non-referable (nrDR). They developed a CNN-based model called Zoom-in-Net to identify important regions; to classify an image, the network uses an overview of the whole image while paying particular attention to important regions. They took 182 images from the EyePACS dataset and had a trained ophthalmologist draw bounding boxes around 306 lesions. On the Messidor dataset they achieved 0.921 AUC, 0.905 accuracy, and 0.960 sensitivity at 0.50 specificity for nDR vs aDR.
Gulshan et al. conducted a comprehensive study to distinguish rDR from nrDR grades. They trained a deep CNN on 128175 fundus images from a private dataset and tested on 9963 images from EyePACS-I and 1748 images from Messidor-2, achieving AUCs of 0.991 on EyePACS-I and 0.990 on Messidor-2. Guan et al. proposed that modeling each classifier after an individual human grader, instead of training a single classifier on the average grading of all human experts, improves classification performance. They trained 31 classifiers using a dataset of 126522 images collected from EyePACS and three other clinics. The method was tested on 3547 images from EyePACS-I and Messidor-2, and achieved 0.9728 AUC, 0.9025 accuracy, and 0.8181 specificity at 0.97 sensitivity. However, it would have been more informative had they compared their suggested approach with an ensemble of 31 networks modeled after average grading. Costa et al. used adversarial learning to synthesize colour retinal images; however, a classifier trained on their synthetic images performed worse than one trained on real images. Aujih et al. found that blood vessels play an important role in disease classification, and that fundus images with blood vessels removed resulted in poor classifier performance.
The role of multiple filter sizes in learning fine-grained features was studied by Vo et al. To this end, they used a VGG network with extra kernels and combined kernels with multiple loss networks. They achieved 0.891 AUC for rDR vs nrDR and 0.870 AUC for normal vs abnormal on the Messidor dataset using 10-fold cross validation. Somkuwar et al. classified hard exudates by exploiting intensity features on 90 images from the Messidor dataset and achieved 100% accuracy on normal and 90% accuracy on abnormal images. Seoud et al. focused on red lesions in RFIs, such as haemorrhages and microaneurysms, and detected these biomarkers using dynamic shape features in order to classify DR. They achieved 0.899 AUC and 0.916 AUC for nDR vs aDR and rDR vs nrDR, respectively, on Messidor. Rakhlin et al. used around 82000 EyePACS images for training, and around 7000 EyePACS images and 1748 Messidor-2 images for testing their deep learning based classifier, achieving 0.967 AUC on Messidor and 0.923 AUC on EyePACS for binary classification. Ramachandran et al. used 485 private images and 1200 Messidor images to test a third-party deep learning based classification platform trained on more than 100000 images; their validation yielded 0.980 AUC on Messidor for rDR vs nrDR classification. Quellec et al. leveraged a large private dataset of around 110000 images together with around 89000 EyePACS images to train and test a classifier for rDR vs nrDR grades and achieved 0.995 AUC on EyePACS.
2 Materials and Methods
This section provides details on the datasets used in this work and the ensemble methodology employed to perform classification.
We used the EyePACS dataset published publicly by Kaggle for a competition on Diabetic Retinopathy Detection. Table 1 gives an overview of the EyePACS dataset. Although this dataset is very large, only about 75% of its images are of sufficient quality to be graded by human experts. EyePACS is graded on a scale of 0 to 4 in accordance with International Clinical Diabetic Retinopathy (ICDR) guidelines. The low gradability of this dataset, however, raises concerns about the fidelity of the labels provided with each image. We pruned the train set to remove 657 completely uninterpretable images. For testing on EyePACS we used 33423 images randomly taken from the test set.
| Severity Grade | Criterion | Train Set | Train (%) | Test Set | Test (%) |
|---|---|---|---|---|---|
| 2 | More than just microaneurysms but less than Grade 3 | 5292 | 15.07 | 7861 | 14.67 |
The Messidor dataset consists of 1200 images collected at three different clinics in France, each contributing 400 images. The dataset is graded for DR on a scale of 0 to 3 following the criteria given in Table 2. Messidor is validated by experts and is therefore of higher quality than EyePACS in terms of both image quality and labels.
Figure 1 illustrates the complete pipeline of the system, which combines coarse-grained and fine-grained classifiers. Before feeding an image to a network, we first applied Otsu thresholding to extract and crop the retinal rim from the RFI and discard the irrelevant black background. Since the images in both datasets were taken with different cameras and under different clinical settings, they suffer from large brightness and colour variations. We used adaptive histogram equalization to normalize brightness and enhance the contrast of the visual artefacts that are critical for DR detection. Since the images are in RGB colour space, we first convert them to YCbCr colour space, which gathers all luminosity information in the Y channel and colour information in the Cb and Cr channels. Adaptive histogram equalization is then applied to the Y channel only, and the resultant image is converted back to RGB. We further normalized the images by subtracting the local average colour from each pixel to highlight the foreground and help the networks detect small features. Figure 1 shows the effects of these preprocessing steps on RFIs. The pre-processed images are then used to train all five networks individually. During inference, each network produces a diagnosis, and these are ensembled to compute the final prediction.
2.2.1 Experimental Setup
From the EyePACS train set, we randomly selected 30000 images for training; the remaining 4469 images were used for validation. The EyePACS test set was used for reporting results on this dataset. From Messidor, we used 800 images for training and the 400 images from Lariboisière Hospital for testing (as done by Lam et al.). We employed a broad range of hyperparameters during training. All networks were initialized with pre-trained weights and fine-tuned on the ophthalmology datasets. To evaluate these models on the EyePACS and Messidor datasets under similar problem settings, we first aligned the DR grades of both datasets using the criteria given in Figure 2.
3 Results and Analysis
From Section 1.1 we observe that previous works on EyePACS and Messidor have used disparate train and test splits and different classification tasks, for example quaternary, ternary, and binary (rDR vs nrDR and nDR vs aDR). Furthermore, different researchers use different performance metrics to evaluate their methods. In such a scenario, a direct comparison between any two works is not possible. However, we conducted extensive experiments covering all four classification tasks mentioned above and report comprehensive results to allow a rough comparison with some of the published state-of-the-art results on these datasets.
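For reference, the four metrics reported throughout this section can be computed from predicted scores as follows. This is a toy sketch, not the paper's evaluation code; the AUC is computed as the Mann-Whitney rank statistic, and the 0.5 decision threshold is an assumed default.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, AUC, sensitivity, and specificity for a binary task."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    # AUC = P(score of a random positive > score of a random negative);
    # ties count as half.
    auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "auc": auc,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }
```

Sensitivity and specificity trade off against each other as the threshold moves, which is why papers report sensitivity at a stated specificity (e.g. 96% sensitivity at 50% specificity).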
3.1 Results of Binary Classification
As discussed above, many previous works focus primarily on binary classification, either nDR vs aDR or rDR vs nrDR grading. The criteria for converting 4 or 5 grades into binary grades are given in Figure 2. For our binary classification, the numbers of images used for training, validation, and testing from EyePACS and Messidor are given in Tables 4 and 4. It can be seen from the tables that there is an extensive class imbalance between the two classes. Table 5 provides detailed performance metrics for all classification tasks, including nDR vs aDR classification. Our results are competitive with those of Wang et al. in terms of accuracy, and we outperform them on all other metrics. It should be noted that Wang et al. performed 10-fold cross validation, and although their sensitivity of 96 is higher than our 89.75, theirs is calculated at 50% specificity while ours is at 90% specificity.
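The grade-to-binary conversion can be expressed as two small mapping functions. The exact criteria are given in the paper's Figure 2 (not reproduced here); the thresholds below follow the common ICDR convention and are therefore assumptions.

```python
def to_normal_vs_abnormal(grade: int) -> int:
    """nDR (0) = grade 0; aDR (1) = any retinopathy, grades 1-4.
    Threshold assumed from common ICDR practice."""
    return int(grade > 0)

def to_referable(grade: int) -> int:
    """nrDR (0) = grades 0-1; rDR (1) = moderate or worse, grades 2-4.
    Threshold assumed from common ICDR practice."""
    return int(grade >= 2)
```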
Results of Binary (Normal vs Abnormal) Classification (each cell: EyePACS / Messidor)

| Model | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| Vo et al. | N/A / 87.10 | N/A / 87.00 | N/A / 88.2 | N/A / 85.7 |
| Wang et al. | N/A / 90.50 | N/A / 92.10 | N/A / 96 | N/A / 50 |
| Seoud et al. | N/A / N/A | N/A / 89.90 | N/A / N/A | N/A / N/A |

Results of Binary (Referable vs Non-Referable) Classification (each cell: EyePACS / Messidor)

| Model | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| Lam et al. | N/A / 74.5 | N/A / N/A | N/A / N/A | N/A / N/A |
| Vo et al. | N/A / 89.70 | N/A / 89.10 | N/A / 89.3 | N/A / 90 |
| Wang et al. | N/A / 91.10 | N/A / 95.70 | N/A / 97.8 | N/A / 50 |
| Seoud et al. | N/A / 74.5 | N/A / 91.60 | N/A / N/A | N/A / N/A |

Results of Ternary Classification (each cell: EyePACS / Messidor)

| Model | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| Lam et al. | N/A / 68.8 | N/A / N/A | N/A / N/A | N/A / N/A |

Results of Quaternary Classification (each cell: EyePACS / Messidor)

| Model | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| Lam et al. | N/A / 57.2 | N/A / N/A | N/A / N/A | N/A / N/A |
Results of rDR vs nrDR classification can also be found in Table 5. All networks performed significantly better on this task than on normal vs abnormal classification on the EyePACS dataset, reaching a maximum accuracy of around 96% with 99.44% AUC using the SBS Layer architecture. On the Messidor dataset, both NTS-Net and the SBS Layer stand out from the traditional classifiers. NTS-Net outperforms all other methods on all metrics, whereas the ensemble of all methods performs worse than the individual fine-grained methods. This can happen when the majority of the classifiers being ensembled perform poorly and only a few give standout results.
3.2 Results of Multi-Class Classification
The complexity of the classification task was gradually increased from binary to ternary and quaternary classification. Tables 7 and 7 show the class distribution in the train, validation, and test splits for this multi-class setting. For ternary classification we used the criterion used by , as shown in Figure 2.
The performance of the individual networks and their ensemble for ternary and quaternary classification is given in Table 5. The ensemble of all models gave better performance in this case. We also observe that NTS-Net outperforms all other individual networks. Our accuracies for both ternary and quaternary classification are superior to those reported by Lam et al. Figure 2 provides a detailed overview of the classification performance of the ensemble.
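The ensembling step can be sketched as soft voting over the per-class probabilities of the individually trained networks. This is an assumed fusion rule for illustration; the paper does not spell out its exact combination scheme.

```python
def ensemble_predict(prob_lists):
    """Average the per-class probability vectors produced by the
    individually trained networks, then return the argmax class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])
```

Soft voting lets a confident minority of models outvote an uncertain majority, which is consistent with the observation above that an ensemble can underperform when most of its members are weak.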
Diabetic retinopathy detection using retinal fundus images is a fine-grained classification task. The biomarkers of this disease in retinal images are usually very small, especially in early stages, and are scattered across the image. The ratio of the pathologically important region to the whole input volume is therefore minuscule. For this reason, traditional deep CNNs usually struggle to identify the regions of interest and do not learn discriminative features well. This problem of small, distributed visual artefacts, coupled with the unavailability of a large, high-quality public dataset with reasonable class balance, makes DR detection particularly challenging for deep CNN models. However, fine-grained classification networks have high potential to provide standardized, large-scale initial screening of diabetic retinopathy and to help in the prevention and better management of this disease. These networks are equipped with specialized algorithms to discover the important regions of an image and pay particular attention to learning characterizing features from those regions.
We achieved performance superior to many previously reported results for diabetic retinopathy detection on binary, ternary, and quaternary classification tasks. However, due to hugely different experimental setups and choices of performance metrics, it would be unfair to draw a direct comparison with any of the cited research. Nevertheless, we have provided a wide spectrum of performance metrics and a detailed experimental setup for comparison by any future work.
-  (2016) Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Investigative Ophthalmology & Visual Science 57 (13), pp. 5200–5206.
-  (2014) Detection and classification of retinal lesions for grading of diabetic retinopathy. Computers in Biology and Medicine 45, pp. 161–171.
-  (2002) (Website).
-  (2017) A method for the detection and classification of diabetic retinopathy using structural predictors of bright lesions. Journal of Computational Science 19, pp. 153–164.
-  (2018) Analysis of retinal vessel segmentation with deep learning and its effect on diabetic retinopathy classification. In 2018 International Conference on Intelligent and Advanced Systems (ICIAS), pp. 1–6.
-  (2018) Automated detection of diabetic retinopathy using deep learning. AMIA Summits on Translational Science Proceedings 2017, pp. 147.
-  (2018) End-to-end adversarial retinal image synthesis. IEEE Transactions on Medical Imaging 37 (3), pp. 781–791.
-  (2014) Feedback on a publicly distributed image database: the Messidor database. Image Analysis & Stereology 33 (3), pp. 231–234.
-  (2018) Who said what: modeling individual labelers improves classification. In Thirty-Second AAAI Conference on Artificial Intelligence.
-  (2016) Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 (22), pp. 2402–2410.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
-  (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
-  (2016) Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree. In Artificial Intelligence and Statistics, pp. 464–472.
-  (2017) Progressive neural architecture search. CoRR abs/1712.00559.
-  (2017) Deep image mining for diabetic retinopathy screening. Medical Image Analysis 39, pp. 178–193.
-  (2018) Diabetic retinopathy detection through integration of deep learning classification framework. bioRxiv, pp. 225508.
-  (2018) Diabetic retinopathy screening using deep neural network. Clinical & Experimental Ophthalmology 46 (4), pp. 412–416.
-  (2018) Learning to zoom: a saliency-based sampling layer for neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 51–66.
-  (2016) Red lesion detection using dynamic shape features for diabetic retinopathy screening. IEEE Transactions on Medical Imaging 35 (4), pp. 1116–1126.
-  (2015) Intensity features based classification of hard exudates in retinal images. In 2015 Annual IEEE India Conference (INDICON), pp. 1–5.
-  (2016) New deep neural nets for fine-grained diabetic retinopathy recognition on hybrid color space. In 2016 IEEE International Symposium on Multimedia (ISM), pp. 209–215.
-  (2017) Zoom-in-net: deep mining lesions for diabetic retinopathy detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 267–275.
-  (2014) Automated detection of proliferative diabetic retinopathy using a modified line operator and dual classification. Computer Methods and Programs in Biomedicine 114 (3), pp. 247–261.
-  (2018) Learning to navigate for fine-grained classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 420–435.
-  (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.