Two Eyes Are Better Than One: Exploiting Binocular Correlation for Diabetic Retinopathy Severity Grading

by   Peisheng Qian, et al.

Diabetic retinopathy (DR) is one of the most common eye conditions among diabetic patients. However, vision loss occurs primarily in the late stages of DR, and the symptoms of visual impairment, ranging from mild to severe, can vary greatly, adding to the burden of diagnosis and treatment in clinical practice. Deep learning methods based on retinal images have achieved remarkable success in automatic DR grading, but most of them neglect that the presence of diabetes usually affects both eyes, and ophthalmologists usually compare both eyes concurrently for DR diagnosis, leaving correlations between left and right eyes unexploited. In this study, simulating the diagnostic process, we propose a two-stream binocular network to capture the subtle correlations between left and right eyes, in which, paired images of eyes are fed into two identical subnetworks separately during training. We design a contrastive grading loss to learn binocular correlation for five-class DR detection, which maximizes inter-class dissimilarity while minimizing the intra-class difference. Experimental results on the EyePACS dataset show the superiority of the proposed binocular model, outperforming monocular methods by a large margin.




I Introduction

Diabetic retinopathy (DR) is one of the most prevalent eye diseases among patients with diabetes. It has become the primary cause of blindness in the working-age population of the developed world [Das_2018]. In Singapore, diabetic patients suffer from DR at various stages, from mild to severe [2020SEA]. Prevention of DR is challenging because its symptoms are hardly recognizable at the early stage. The gold standard for diagnosing DR is digital color fundus photography. However, observing and evaluating fundus images is time-consuming and labor-intensive, and requires experienced ophthalmologists.

Fig. 1: (a) The distribution of disease severity levels of left eyes. (b) The distribution of the difference between disease severity levels of each patient's left and right eyes. (c) The scatter plot of disease severity levels between each patient's left and right eyes. (d) The distribution of disease severity levels of right eyes.

Deep learning approaches have achieved great success in DR grading based on retinal images [li2021reiew]. Different from conventional machine learning methods, which rely on hand-crafted features such as the retinal blood vessels and the optic disc, deep neural networks can effectively extract high-level features and learn complex representations from retinal images, which better supports the clinical process and reduces human error. Most of the existing methods take monocular images as the model inputs, regardless of the difference between left and right eyes [2020SEA, wang2017zoom, zhao2019biranet]. However, in the clinical diagnosis of eye diseases, both the left and right eyes are taken into consideration [liu2005variation, Eppig2018]. In other words, the correlation between left and right eyes can be used for DR grading in clinical practice. As shown in Fig. 1, we performed exploratory data analysis on the Kaggle dataset provided by EyePACS [kaggle]. For better visualization, we add random variance to each level of the two eyes concurrently in Fig. 1 (c). We can see that both eyes of the same patient are highly correlated, with a Pearson's correlation coefficient of 0.85 [kaggle]. Motivated by the clinical process and our analysis, we hypothesize that the left-right correlation can be used in deep learning for better DR grading.
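The left-right correlation analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis script: it assumes labels arrive as `(image_id, level)` rows whose identifiers look like `10_left` and `10_right`, and the helper names `pair_grades` and `pearson` are hypothetical.

```python
import math
from collections import defaultdict


def pair_grades(rows):
    """Group (image_id, level) rows into (left_grade, right_grade) pairs.

    Assumes image ids of the form '<patient>_left' / '<patient>_right'.
    Patients with only one labeled eye are skipped.
    """
    eyes = defaultdict(dict)
    for image_id, level in rows:
        patient, side = image_id.rsplit("_", 1)
        eyes[patient][side] = int(level)
    return [(e["left"], e["right"]) for e in eyes.values()
            if "left" in e and "right" in e]


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

Given the paired grades, `pearson([l for l, _ in pairs], [r for _, r in pairs])` yields the coefficient of the kind reported in the text; the histogram of `l - r` corresponds to panel (b) of Fig. 1.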

In this work, we propose a two-stream binocular network. The network consists of convolutional neural networks (CNNs) that share the same weights and take the left and right eyes of the same patient as inputs, respectively. The model is simultaneously updated by both left and right eye images. With this learning framework, the model can recognize individual patterns in each eye as well as the similarities between the left and right eyes for DR grading. We propose a hybrid loss to optimize the network. The loss consists of a contrastive grading loss and a weighted cross-entropy loss. More specifically, we introduce the contrastive grading loss to optimize the network in accordance with the left-right similarity, in which we scale the contrastive loss in proportion to the discrepancies in DR grading for finer classification granularity. The contributions of this paper are summarized as follows:

  • We construct a two-stream binocular network that consists of two identical networks with shared weights. We design a novel training strategy that takes both left and right eyes of the same patient as inputs.

  • We propose a hybrid loss function which consists of the contrastive grading loss and the weighted cross-entropy loss. The contrastive grading loss optimizes the network based on the similarity between the left and right eyes. Experiments show that with the proposed loss function, the network can recognize similarities between left and right eyes and obtain results superior to the baselines.

II Related Work

Early studies relied heavily on experts manually extracting features and certain textural properties for DR classification [ahmad2014image]. In recent years, deep learning techniques, such as CNNs, have proved effective in DR grading. Bravo et al. designed a VGG-based network architecture and combined it with variously pre-processed image sets [Mar2017Automatic], reaching a classification accuracy of 0.5051 on a balanced dataset (see Table I). Zhao et al. described a bilinear model with an attention mechanism for fine-grained classification of DR [zhao2019biranet]. Wang et al. implemented Zoom-in-Net with multiple sub-networks for DR grading and localization of suspicious regions [wang2017zoom]. Zhao et al. further investigated the subtle differences between different DR severity levels and developed a new network architecture, SEA-Net, in which spatial and channel attention are alternately stacked [2020SEA]. The aforementioned methods enhance model architectures and demonstrate their effectiveness in DR grading. However, they do not leverage the left-right eye correlations.

Existing research suggests that the correlation between left and right eyes could be explored on eye symptoms [liu2005variation, Eppig2018, zeng2019automated]. We are not the first to address the correlation between the two eyes. Zeng et al. presented promising results with binocular inputs to siamese-like deep learning models for DR classification [zeng2019automated]. While this method introduces weight sharing between the CNNs in the model, it omits the calculation of similarities or variances among different DR grades. It also leaves the original contrastive loss unmodified. To overcome these shortcomings and clearly reflect the differences between DR grades in the loss function, we propose a two-stream binocular network and a novel contrastive grading loss, which are described in Section III.

III Methodology

The proposed two-stream binocular network and training strategy for DR grading are illustrated in Fig. 2, in which a pair of left and right eye images is taken as input to two identical sub-networks, respectively. The two sub-networks with shared weights extract features from the two eyes and classify their DR grading separately. To leverage the left-right eye correlations, we apply a novel loss function during training, which includes a contrastive grading loss and a weighted cross-entropy loss.

Fig. 2: The architecture of the two-stream binocular network.

III-A Two-stream Binocular Network

The proposed framework consists of two convolutional neural networks (CNNs) with shared weights. In this study, we use ResNet-50 [ResNet] and BiRA-Net [zhao2019biranet] as the backbones of the network to study the effectiveness of the proposed architecture. Both eyes from the same patient are paired and passed to each of the CNNs. The network captures features from both left and right eyes as well as the similarities between them, based on which the network classifies DR grading for both eyes. The process simulates the real-life clinical DR diagnosis on both eyes and utilizes the correlations between them for diagnosis.
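The weight sharing described above can be sketched in PyTorch by routing both eyes through a single backbone module. This is a minimal sketch under stated assumptions, not the authors' implementation: `TwoStreamBinocularNet` and `tiny_backbone` are hypothetical names, the tiny CNN stands in for ResNet-50 or BiRA-Net, and taking the Euclidean distance on the backbone features is one reading of "outputs from the two sub-networks".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamBinocularNet(nn.Module):
    """Applies one shared backbone to the left and the right eye image."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int = 5):
        super().__init__()
        # A single module instance is reused for both eyes, so the two
        # "streams" share every weight and receive gradients from both inputs.
        self.backbone = backbone
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, left, right):
        feat_left = self.backbone(left)
        feat_right = self.backbone(right)
        # Euclidean distance between the two feature vectors (assumption:
        # the distance is taken on the features that feed the classifier).
        distance = F.pairwise_distance(feat_left, feat_right)
        return self.classifier(feat_left), self.classifier(feat_right), distance


def tiny_backbone(feat_dim: int = 16) -> nn.Module:
    """Placeholder feature extractor; a real run would use a deeper CNN."""
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, feat_dim),
    )
```

Calling `model(left, right)` on two batches of shape `(N, 3, H, W)` returns per-eye logits of shape `(N, 5)` and a distance vector of shape `(N,)`, which is what a cross-entropy term and a contrastive term would consume.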

III-B The Proposed Hybrid Loss

III-B1 Contrastive Grading Loss

To optimize the network and extract similarities between left and right eyes, we propose the contrastive grading loss. The loss function extends the contrastive loss, which is commonly used in conjunction with siamese networks [li2020siamese]. The loss calculates the similarity between eyes based on Euclidean distances between hidden features in the last layers of the CNNs. Assuming that the disease severity levels of paired images $x_l$ and $x_r$ are $y_l$ and $y_r$ respectively, the grading difference between the paired images can be represented as $|y_l - y_r|$. The contrastive grading loss function is defined as:

$$L_{cg} = Y d^2 + (1 - Y)\,|y_l - y_r|\,\max(m - d,\ 0)^2 \qquad (1)$$

where $Y$ is the binary control factor for the first term in the loss function. If the disease severity level in both images is the same, $Y = 1$. Otherwise, $Y = 0$. $d$ represents the Euclidean distance between outputs from the two sub-networks in the two-stream binocular network. $m$ is the threshold, which is adjusted empirically in this study [li2020siamese]. Compared to the existing contrastive loss, Eqn. 1 scales the second term in the loss function in proportion to the grading difference between the left and right eyes and therefore optimizes the network to recognize similarities between left and right eyes in the feature space.
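One reading of the contrastive grading loss for a single pair of eyes is sketched below. The exact functional form is an assumption based on the description (a standard contrastive loss whose second term is scaled by the grading difference); the function names and the default `margin` value are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_grading_loss(feat_left, feat_right, grade_left, grade_right,
                             margin=1.0):
    """Contrastive loss whose dissimilar-pair term grows with the grade gap.

    margin is the empirically chosen threshold; 1.0 is only a placeholder.
    """
    d = euclidean_distance(feat_left, feat_right)
    if grade_left == grade_right:
        # Same grade: pull the two feature vectors together.
        return d ** 2
    # Different grades: push features apart up to the margin, weighted by
    # how far apart the two DR grades are.
    grade_gap = abs(grade_left - grade_right)
    return grade_gap * max(margin - d, 0.0) ** 2
```

With this form, a pair graded 0 and 4 that sits inside the margin is penalized four times as much as a pair graded 0 and 1 at the same distance, which is the finer granularity the text aims for.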

III-B2 Weighted Cross-Entropy Loss

To alleviate the overfitting problem caused by the class imbalance of the DR dataset, we add a weighting mechanism to our hybrid loss function. $p_c(x)$ denotes the predicted probability of class $c$ for input $x$, and the class index $c$ is in the range $[0, C-1]$, where $C$ is the number of classes. Each sample is scaled by a weight proportional to the inverse of the percentage of the sample's class in the training set, denoted as $w_c$ in Eqn. 2. The weighted cross-entropy loss [article] is formulated as:

$$L_{wce} = -\sum_{c=0}^{C-1} w_c\, \mathbb{1}[y = c]\, \log p_c(x) \qquad (2)$$

where $y$ is the ground-truth grade of $x$. Finally, the proposed hybrid loss function is a weighted sum of the contrastive grading loss $L_{cg}$ and the weighted cross-entropy loss, which is defined in Eqn. 3:

$$L = L_{wce} + \lambda L_{cg} \qquad (3)$$

where $\lambda$ is the factor controlling the scale of the contrastive grading loss in the hybrid loss. In the experiments, $\lambda$ is set empirically for model optimization.
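The weighting and the final combination can be sketched as follows. This is an illustration under assumptions: the weights are taken as the plain inverse of each class's share of the training labels (the text only says "proportional to"), the scaling factor is passed in as `lam` because its value is not given here, and all function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(train_labels, num_classes=5):
    """Weight each class by the inverse of its share of the training set."""
    counts = Counter(train_labels)
    total = len(train_labels)
    # Classes absent from the training labels get weight 0.0.
    return [total / counts[c] if counts[c] else 0.0 for c in range(num_classes)]


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy of one sample, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])


def hybrid_loss(wce, contrastive_grading, lam):
    """Weighted sum of the cross-entropy term and the contrastive grading term."""
    return wce + lam * contrastive_grading
```

Because rare grades receive larger weights, a misclassified severe case contributes more to the loss than a misclassified normal case, which counteracts the dominance of grade 0 in the training data.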

IV Experiments

IV-A Dataset and Implementation

We use the dataset provided by EyePACS, hosted on Kaggle [kaggle]. The dataset is labeled with a set of definitions to ensure label consistency. The grades of DR are categorized into five classes from 0 to 4 with increasing disease severity. Grade 0 indicates no DR, while grade 4 indicates the most severe stage (proliferative DR). We randomly split the retinal images from the dataset into a training set and a test set. The test set is balanced.

The implementation details are described as follows. Left and right retina images of the same patients are selected in pairs as the input. To augment the training set, random horizontal and vertical flipping and random rotation are applied to the input images. The images are then resized to a fixed resolution. Finally, the images are standardized across the RGB channels by subtracting the mean and dividing by the standard deviation of each channel.
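The per-channel standardization step can be illustrated with a small pure-Python sketch. It assumes the statistics are computed from the pixels passed in; whether the authors used dataset-level or ImageNet statistics is not stated, and the helper names are hypothetical.

```python
import math


def channel_stats(pixels):
    """Per-channel mean and standard deviation of a list of (r, g, b) pixels."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    stds = [math.sqrt(sum((p[c] - means[c]) ** 2 for p in pixels) / n)
            for c in range(3)]
    return means, stds


def standardize(pixels, means, stds):
    """Subtract the channel mean and divide by the channel standard deviation.

    Assumes every channel has a non-zero standard deviation.
    """
    return [tuple((p[c] - means[c]) / stds[c] for c in range(3))
            for p in pixels]
```

After this step each channel has zero mean and unit variance over the pixels used for the statistics, which keeps the input scale consistent with what ImageNet pre-trained weights typically expect.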

We load the ImageNet pre-trained weights into the network before starting the training process. The network is trained using the stochastic gradient descent (SGD) optimizer with weight decay. The learning rate is multiplied by a fixed factor when the model performance on the test set does not improve for a set number of consecutive epochs. The network is trained for a fixed number of epochs with a fixed batch size. The experiments are implemented on NVIDIA RTX 2080Ti GPUs with PyTorch 1.7.1.
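The plateau-based learning-rate rule can be sketched in plain Python as below. The class name `PlateauSchedule` and the values used for `factor` and `patience` are hypothetical, since the actual settings are not given here. PyTorch offers `torch.optim.lr_scheduler.ReduceLROnPlateau` for this kind of schedule, although the text does not say whether it was used and its patience counting differs slightly from this sketch.

```python
class PlateauSchedule:
    """Multiplies the learning rate by `factor` after `patience` consecutive
    epochs without improvement in the monitored metric (higher is better)."""

    def __init__(self, lr, factor, patience):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = None
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric and return the learning rate to use next."""
        if self.best is None or metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

A training loop would call `schedule.step(accuracy)` once per epoch and pass the returned value to the optimizer.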

IV-B Performance Metrics

For a comprehensive evaluation, we employ commonly-used metrics to quantitatively evaluate the performance, which have also been used in previous research [2020SEA, zhao2019biranet]. They are:

  • ACA: Average classification accuracy.

  • F1: Averaged Macro-F1 score of the 5 classes.

  • AUC: The area under the receiver operating characteristic (ROC) curve.
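The first two metrics can be computed from predicted and true grades as sketched below. Two assumptions are made: ACA is taken as the mean of per-class accuracies (which equals overall accuracy on a balanced test set), and classes with no predictions and no samples contribute an F1 of 0. The function names are hypothetical.

```python
def per_class_counts(y_true, y_pred, num_classes):
    """True positives, false positives and false negatives per class."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    return tp, fp, fn


def average_classification_accuracy(y_true, y_pred, num_classes=5):
    """Mean of per-class accuracies over the classes present in y_true."""
    tp, _, fn = per_class_counts(y_true, y_pred, num_classes)
    recalls = [tp[c] / (tp[c] + fn[c])
               for c in range(num_classes) if tp[c] + fn[c] > 0]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred, num_classes=5):
    """Unweighted mean of the per-class F1 scores."""
    tp, fp, fn = per_class_counts(y_true, y_pred, num_classes)
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes
```

Both metrics weight every grade equally, so a model cannot score well by predicting only the majority class.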

IV-C Baseline Methods

We compare our framework with several baselines. In [Mar2017Automatic], a VGG-based classifier is trained on the dataset with preprocessing techniques including circular RGB, grayscale and color-centered sets. Zhao et al. report results of ResNet-50 on the Kaggle dataset, and we re-implement ResNet-50 with slightly higher ACA [zhao2019biranet]. We combine ResNet-50 with mean squared error (MSE) in the loss function as another baseline. In [zhao2019biranet], BiRA-Net is proposed, which features a bilinear learning strategy together with a grading loss for fine-grained classification. Our methods with different backbones are listed below:

  • TSBN (ResNet-50): two-stream binocular network, with ResNet-50 as the backbone.

  • TSBN (BiRA-Net): two-stream binocular network, with BiRA-Net as the backbone. The network consists of two BiRA-Net models that share weights.

IV-D Results and Discussion

Fig. 3: Illustrative examples of original images and saliency maps from TSBN (ResNet-50). Each column contains left (top) and right (bottom) retina images from the same patient who has the same DR level in both eyes.

Method                             ACA      Macro-F1   AUC
Bravo et al. [Mar2017Automatic]    0.5051   0.5081     -
ResNet-50 [ResNet]                 0.4820   0.4877     0.8091
ResNet-50, MSE                     0.4985   0.4995     0.8144
TSBN (ResNet-50)                   0.5212   0.5242     0.8218
BiRA-Net [zhao2019biranet]         0.5431   0.5723     0.8448
TSBN (BiRA-Net)                    0.5513   0.5792     0.8490

TABLE I: Experimental results on DR grading.

Class                ResNet-50   TSBN (ResNet-50)
Normal (0)           0.6250      0.6442
Mild (1)             0.3814      0.4327
Moderate (2)         0.4327      0.4423
Severe (3)           0.4103      0.4615
Proliferative (4)    0.5609      0.6250

TABLE II: Grading accuracies in each DR level.

Fig. 4: Confusion matrices of (a) ResNet-50, (b) TSBN (ResNet-50).

In Table I, the experimental results of our approach are compared with baseline methods. When using ResNet-50 as the backbone, our method raises the classification accuracy from 0.4820 to 0.5212. This indicates that our training strategy and loss function can better optimize the model. With BiRA-Net as the backbone, we also outperform the original BiRA-Net (0.5513 versus 0.5431 ACA), a more sophisticated architecture engineered with a dedicated grading loss [zhao2019biranet]. The results suggest that our approach is applicable to different backbone models for DR grading.

In Table II, our approach reaches higher classification accuracy at all levels of DR grading, especially at the higher levels where training samples are sparse. Fig. 4 shows the confusion matrices for ResNet-50 and our method, respectively. Our model correctly distinguishes more cases among the moderate to severe DR grades (grades 2 to 4). In other words, our model better identifies the patients who most require clinical attention, which indicates its clinical relevance.
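The per-grade gains can be read off Table II with a few lines of arithmetic. The snippet below only copies the tabulated accuracies and subtracts them; the variable names are illustrative.

```python
# Per-class accuracies copied from Table II, ordered by grade 0 to 4.
resnet50 = [0.6250, 0.3814, 0.4327, 0.4103, 0.5609]
tsbn_resnet50 = [0.6442, 0.4327, 0.4423, 0.4615, 0.6250]

# Absolute accuracy gain of TSBN (ResNet-50) over ResNet-50 for each grade.
gains = [round(t - r, 4) for r, t in zip(resnet50, tsbn_resnet50)]

# Unweighted means over the five grades.
mean_resnet50 = sum(resnet50) / len(resnet50)
mean_tsbn = sum(tsbn_resnet50) / len(tsbn_resnet50)

print(gains)  # [0.0192, 0.0513, 0.0096, 0.0512, 0.0641]
```

The largest gain is at grade 4 (0.0641) and the smallest at grade 2 (0.0096). The unweighted means, about 0.4821 and 0.5211, are close to the ACA values reported in Table I for the same two models.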

Examples of saliency maps on the test data, generated with Grad-CAM [Selvaraju_2017_ICCV], are illustrated in Fig. 3. Comparing the left and right saliency maps of the same patient, our model is activated at symmetrical positions with similar intensities. This suggests that the similarities of DR symptoms in both eyes are informative for DR grading and that our approach learns such similarities for classification. We also observe some discrepancies between the saliency maps of left and right eyes, due to variances of the DR symptoms and limitations in the model capability.

V Conclusions and Future Work

We propose a two-stream binocular network, which explores similarities and correlations between left and right eyes for DR grading. The framework consists of CNNs with shared weights and classifies the left and right eyes of the same patient respectively. To capture the similarities between left and right eyes, a hybrid loss function is proposed, which combines a contrastive grading loss and a weighted cross-entropy loss. Experiments show that our approach is effective. The left-right eye similarities are visualized in the saliency maps of our model. In the future, we can further improve the model performance by exploring left-right eye correlations with domain knowledge.