Diabetic retinopathy (DR) is one of the most common vision-threatening diseases. Early diagnosis and timely treatment are vital for preventing visual impairment and have been shown to reduce the risk of blindness by up to 90 percent. However, retinopathy has no obvious symptoms in its early stage, so diabetic patients need frequent fundus image examinations to avoid vision problems. Diagnosing DR is highly time-consuming because there are many types of lesions with diverse features. Therefore, an automatic and effective image grading algorithm for DR is urgently needed.
Methods based on CNNs have significantly improved the performance of DR classification. However, accurate diagnosis still faces great challenges: 1) Unlike coarse-grained classification, DR grading is a fine-grained classification task with large intra-class variance. 2) Medical image labeling is costly. 3) The datasets follow a long-tailed class distribution. In this paper, we propose a fine-grained diabetic retinopathy grading scheme to address the above challenges.
Image recognition tasks include coarse-grained and fine-grained image classification. The goal of coarse-grained image classification is to recognize categories with large appearance differences, such as cats and dogs. Fine-grained classification, in contrast, concentrates on fine categories with small inter-class variance, such as species of birds, and puts more effort into extracting local discriminative features. For DR detection, the disease severity level mostly depends on small lesion regions, such as microaneurysms and hemorrhages. Therefore, unlike most previous works, we treat DR grading as a fine-grained classification task. We build our neural network architecture on hierarchical bilinear pooling (HBP), which has been demonstrated to be effective for fine-grained classification.
Conventional image classification methods encode the ground-truth labels as one-hot labels, which treat all wrong classes equally. However, most previous works ignore the fact that the severity of DR follows a natural order. This inspires us to use an ordinal regression approach to encode the inter-class distance. In addition, the success of deep learning methods depends heavily on the quality of the feature representations. Since metric learning tends to learn a feature embedding space where images belonging to the same class are closer to each other than to images from other classes, we incorporate an effective feature embedding technique to learn a more discriminative representation for fundus images.
Our main contributions can be summarized as follows: (1) identifying the discriminative parts of fundus images with a fine-grained approach; (2) using an ordinal regression method to take advantage of ordinal information; (3) obtaining a more discriminative embedding space via deep metric learning; (4) demonstrating the superior performance of the proposed method through extensive experiments.
2 Related Work
With its remarkable success in medical image analysis in the past decade, deep learning has been applied to the classification of DR. Most works on DR grading directly apply a feature extractor followed by a classifier, a design that is effective for general image classification tasks. Gulshan et al. used the Inception network to detect diabetic retinopathy and obtained promising results. Pratt et al. proposed a 10-layer CNN followed by three fully connected layers to grade DR in real time. Gargeya et al. utilized the residual network to extract features, and then used these features together with other metadata to train a decision tree for classification. However, these methods treat DR classification as conventional natural image classification, which ignores the small inter-class variance and the inter-class ordinal relationships of the DR problem.
Bajwa et al. proposed an ensembling method (ResNet and DenseNet as two coarse-grained classification methods, NTS-Net and SBS layer as two fine-grained classification methods) to predict the severity of DR. Although it uses fine-grained classification, the ensemble of four neural networks is somewhat complicated and time-consuming. Moreover, ordinal information among classes is not taken into account.
This section presents the proposed method, which is illustrated in Fig. 1. To reduce the influence of lighting conditions, we first preprocess the fundus images. The preprocessed images are fed into the backbone to extract the feature maps. Then, a bilinear module captures the cross-layer interaction information. Finally, the classifier gives the predictions.
3.1 Network architecture
Because the progression of DR is a gradual process, there are only subtle variations among different sub-categories. Moreover, the appearance of fundus images belonging to the same category can be quite different since there exist various types of lesions. To tackle these challenges, we develop the network based on the HBP module, which was proposed for fine-grained classification.
Each preprocessed fundus image is fed into the CNN-based feature extractor to obtain its representation. HBP utilizes VGG16 as the backbone. To better capture multi-scale information, the proposed method instead uses the DSOD architecture, which combines DenseNet169 with a feature pyramid network (FPN), as the backbone.
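After obtaining the feature maps, the bilinear module fuses multi-scale features from layers F1, F3, F5 by concatenating the outputs of bilinear vectors as follows:

$$\mathbf{f}_{HBP} = P^{T}\,\mathrm{concat}\left(U^{T}\mathbf{x} \circ V^{T}\mathbf{y},\; U^{T}\mathbf{x} \circ S^{T}\mathbf{z},\; V^{T}\mathbf{y} \circ S^{T}\mathbf{z}\right),$$

where $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$ are the output features from layers F1, F3, F5, $U$, $V$, $S$ are projection matrices which map the features into bilinear vectors, $\circ$ denotes the element-wise product, and $P$ is the classification parameter matrix.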
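The cross-layer fusion described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single spatial location, stores matrices as lists of rows, and the helper names `project`, `hadamard` and `cross_layer_bilinear` are hypothetical.

```python
def project(matrix, vector):
    """Return matrix^T @ vector, with the matrix stored as a list of rows."""
    return [sum(matrix[i][j] * vector[i] for i in range(len(vector)))
            for j in range(len(matrix[0]))]


def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [p * q for p, q in zip(a, b)]


def cross_layer_bilinear(x, y, z, U, V, S, P):
    """Fuse three layer features at one spatial location and classify them."""
    ux, vy, sz = project(U, x), project(V, y), project(S, z)
    # Pairwise cross-layer interactions, concatenated into one vector.
    fused = hadamard(ux, vy) + hadamard(ux, sz) + hadamard(vy, sz)
    return project(P, fused)


# Toy example: identity projections and one output unit that sums the fused vector.
identity = [[1, 0], [0, 1]]
ones = [[1] for _ in range(6)]
scores = cross_layer_bilinear([1, 2], [3, 4], [5, 6],
                              identity, identity, identity, ones)
print(scores)  # [67]
```

In a full model the fused vector would typically be pooled over all spatial locations before classification; the sketch omits that step for brevity.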
3.2 Loss function
Unlike traditional classification problems, most medical datasets carry ordinal information among disease severity levels. Therefore, we use an ordinal regression approach to obtain a soft label rather than the classical one-hot label, which contains no ordinal information. In particular, for the $i$-th example $x_i$, the ground-truth label is written as $\mathbf{y}_i = [y_{i1}, \dots, y_{iC}]$ and

$$y_{ij} = \frac{e^{-\phi(r_t, r_j)}}{\sum_{k=1}^{C} e^{-\phi(r_t, r_k)}},$$

where $C$ is the number of classes and $r_t$ is the rank of the true metric value. For example, when there are no abnormalities in a fundus image, the value of $r_t$ is 0. Rank $r_j$ is one of the $C$ ordinal labels, and $\phi$ is the pre-defined metric function that penalizes the distance between $r_t$ and $r_j$. For the penalty function $\phi$, we use the simple squared error.
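As a minimal sketch of the soft-label encoding with a squared-error penalty, the function below turns an integer severity rank into a probability vector by applying a softmax to the negative penalties. The function name and the use of integer ranks 0 to C-1 are assumptions made for illustration.

```python
import math


def soft_ordinal_label(true_rank, num_classes):
    """Soft label: softmax over negative squared distances to the true rank."""
    scores = [-float((true_rank - r) ** 2) for r in range(num_classes)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Five DR grades, true grade 2 (moderate NPDR in a 0-4 encoding).
label = soft_ordinal_label(2, 5)
print([round(p, 4) for p in label])
```

Neighboring grades receive more probability mass than distant ones, so a prediction that is off by one grade is penalized less than one that is off by three.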
In order to learn a discriminative feature space, we use a metric loss as:

$$L_{metric} = -\log \frac{\exp\left(\lambda (S'_{i,y_i} - \delta)\right)}{\exp\left(\lambda (S'_{i,y_i} - \delta)\right) + \sum_{j \neq y_i} \exp\left(\lambda S'_{i,j}\right)},$$

where $\lambda$ is a smooth term. For each class $c$, there are $K$ centers to account for the large intra-class variance. $\delta$ is a pre-defined margin and $w$ is the parameter of the fully connected layer. $S'_{i,c}$ is the relaxed similarity between $x_i$ and the class $c$. For each category with a single representative center, the SoftMax loss with a smooth term is equivalent to a smoothed triplet loss. In real-world data, however, each class contains multiple centers due to the large intra-class variance. By expanding the weight matrix of each class to have multiple columns, this metric loss can be optimized with triplets.
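The multi-center metric loss can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the relaxed similarity is taken to be a softmax-weighted average of the dot products with a class's K centers, as in the cited SoftTriple formulation; `gamma` (the temperature of that softmax) and the default values of `lam` (smooth term), `delta` (margin) and `gamma` are illustrative choices; embeddings and centers are assumed to be unit-normalized.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def relaxed_similarity(x, centers, gamma=0.1):
    """Softmax-weighted average of the similarities to one class's centers."""
    sims = [dot(x, w) for w in centers]
    top = max(sims)
    weights = [math.exp((s - top) / gamma) for s in sims]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, sims))


def metric_loss(x, label, class_centers, lam=20.0, delta=0.01, gamma=0.1):
    """Cross-entropy over scaled relaxed similarities, with a margin on the true class."""
    sims = [relaxed_similarity(x, centers, gamma) for centers in class_centers]
    logits = [lam * (s - delta) if c == label else lam * s
              for c, s in enumerate(sims)]
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[label]


# Two classes with K = 2 unit-norm centers each (toy values).
centers = [[[1.0, 0.0], [0.8, 0.6]], [[0.0, 1.0], [-0.6, 0.8]]]
embedding = [1.0, 0.0]
print(metric_loss(embedding, 0, centers) < metric_loss(embedding, 1, centers))  # True
```

An embedding that lies close to one of its own class centers yields a loss near zero, while assigning it to the other class yields a large loss.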
Since the DR datasets suffer from serious class imbalance, we use the focal loss to deal with the imbalance problem:

$$L_{focal} = -(1 - p_t)^{\gamma} \log(p_t),$$

where $\gamma$ is a focusing parameter and $p_t$ is the predicted probability of the label. In the experiments, we set $\gamma$ to 2.
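A minimal sketch of the focal loss for a single example is given below, assuming a hard integer label and a list of predicted class probabilities; the function name is hypothetical. With a focusing parameter of 0 it reduces to the ordinary cross-entropy.

```python
import math


def focal_loss(probs, label, gamma=2.0):
    """Focal loss for one example: down-weights well-classified examples."""
    p_t = probs[label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss([0.1, 0.9], label=1)  # confident and correct
hard = focal_loss([0.7, 0.3], label=1)  # mostly wrong
print(easy < hard)  # True
```

Because the factor (1 - p_t)^gamma shrinks the loss of confidently classified examples, rare and hard classes contribute relatively more to the gradient.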
The final loss is the combination of the focal loss and the metric loss:

$$L = \alpha L_{focal} + \beta L_{metric},$$

where $\alpha$ and $\beta$ are the hyperparameters.
4.1 Datasets and evaluation metrics
In this paper, we use 5-fold cross-validation on two public DR datasets, namely IDRiD and DeepDR, which were provided by the IEEE International Symposium on Biomedical Imaging (ISBI) in 2018 and 2020 respectively. According to the International Clinical Diabetic Retinopathy Disease Severity Scale, the progression of DR from the healthy to the proliferative phase can be divided into five categories: 1) healthy, 2) mild-NPDR, 3) moderate-NPDR, 4) severe-NPDR and 5) PDR. PDR and NPDR stand for proliferative diabetic retinopathy and nonproliferative diabetic retinopathy respectively. The IDRiD and DeepDR datasets contain 516 and 1600 images with resolutions of 4288 x 2848 and 1736 x 1824 respectively.
To evaluate the proposed method on the IDRiD and DeepDR datasets, we use the quadratic weighted kappa, which typically ranges from 0 to 1. This metric measures the consistency between two ratings: the larger the value, the higher the consistency. We also report the accuracy of each class by calculating the confusion matrix.
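For reference, the quadratic weighted kappa can be computed from integer grades as in the sketch below. The function name is hypothetical and the code assumes that at least two distinct grades occur, so the expected disagreement is non-zero.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Agreement between two ratings, penalizing errors by squared grade distance."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    disagreement = 0.0
    expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            disagreement += weight * observed[i][j]
            expected += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 - disagreement / expected


grades = [0, 1, 2, 3, 4]
print(quadratic_weighted_kappa(grades, grades, 5))        # 1.0
print(quadratic_weighted_kappa(grades, grades[::-1], 5))  # -1.0
```

Perfect agreement gives 1, chance-level agreement gives about 0, and systematic disagreement can push the value below 0, which is why the range is described as typically, rather than strictly, 0 to 1.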
4.2 Implementation details
The quality of some fundus images is poor due to illumination. To reduce the influence of lighting conditions, we preprocess the datasets with Ben’s preprocessing method (https://www.kaggle.com/ratthachat/aptos-eye-preprocessing-in-diabetic-retinopathy). Besides, we remove the redundant black background. Then, we resize the images to 512x512 pixels.
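The background-removal step can be illustrated with a small sketch that crops a grayscale image, stored as a list of pixel rows, to the bounding box of pixels brighter than a threshold. The function name and the threshold value are assumptions for illustration; the exact cropping rule used in the experiments is not specified here.

```python
def crop_black_border(image, threshold=10):
    """Crop a grayscale image (list of rows) to its non-black bounding box."""
    rows = [r for r, row in enumerate(image) if max(row) > threshold]
    cols = [c for c in range(len(image[0]))
            if max(row[c] for row in image) > threshold]
    if not rows or not cols:
        return image  # nothing brighter than the threshold: leave unchanged
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


toy = [
    [0, 0, 0, 0],
    [0, 50, 60, 0],
    [0, 70, 80, 0],
    [0, 0, 0, 0],
]
print(crop_black_border(toy))  # [[50, 60], [70, 80]]
```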
During the training phase, we use the following data augmentations: random resized crop, random affine transformation, and random horizontal and vertical flips.
Our experiments are conducted on four NVIDIA GTX 1080Ti GPUs using PyTorch. All backbone networks are pre-trained on the ImageNet dataset. For the optimizer, we choose Stochastic Gradient Descent (SGD) with the momentum set to 0.9. The batch size and the number of epochs are set to 32 and 150 respectively. The learning rate is initially set to 0.01 and divided by 10 every 50 epochs. The hyperparameters of the final loss are both set to 0.5.
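The step-wise learning rate schedule described above amounts to the following rule, shown here as a framework-independent sketch with a hypothetical function name and zero-based epoch indices.

```python
def learning_rate(epoch, base_lr=0.01, step_size=50, factor=10.0):
    """Learning rate at a zero-based epoch: divided by `factor` every `step_size` epochs."""
    return base_lr / (factor ** (epoch // step_size))


for epoch in (0, 49, 50, 100, 149):
    print(epoch, learning_rate(epoch))
```

Over 150 epochs this gives three plateaus: 0.01 for epochs 0 to 49, 0.001 for epochs 50 to 99, and 0.0001 for epochs 100 to 149.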
4.3 Comparison with the state-of-the-art
In this section, we compare the performance of the proposed method with the state-of-the-art fine-grained classification methods, such as B-CNN , HBP , DFL , and PMG . Meanwhile, the proposed method is also compared with the ensembling method  which is the fusion of AlexNet and GoogleNet. As illustrated in Table 1, our method achieves superior performance. This shows the effectiveness of the proposed approach.
|B-CNN (ICCV15)|VGG16|0.8631|0.8702|
|HBP (ECCV18)|VGG16|0.8511|0.8586|
|DFL (CVPR18)|ResNet50|0.8804|0.8926|
|PMG (CVPR20)|ResNet50|0.8694|0.8825|
|AG_Net (MIA20)|AlexNet + GoogleNet|0.8573|0.8644|
4.4 Ablation study
This section examines the effectiveness of each component of the proposed method. Firstly, replacing the backbone of the HBP model with DSOD yields better results, as shown in Table 2. In addition, the focal loss also has some impact. The comparison in rows 3-4 of Table 2 indicates that the ordinal regression approach improves the average quadratic weighted kappa score by at least 1% on both datasets. This suggests the importance of ordinal information in the DR problem.
Introducing the metric loss further improves the average quadratic weighted kappa by around 0.7% and 1.0% on the IDRiD and DeepDR datasets respectively, as shown in the last two rows of Table 2. This suggests that the metric loss helps to learn a better feature embedding.
To illustrate the superior performance of the proposed method, we conduct visualization experiments via Grad-CAM. Columns (a)-(c) in Fig. 2 show an original DR image sampled from the DeepDR dataset, the activation map from HBP, and the activation map from the proposed method respectively. Our method focuses on the discriminative regions rather than the black background and less informative regions. The class activation maps show that the proposed method indeed makes predictions based on the discriminative parts. Furthermore, to better present the accuracy of each class, we plot the confusion matrix on the DeepDR dataset in Fig. 3. The diagonal elements of the confusion matrix represent the accuracy of each category. Notably, the wrong predictions are close to the ground-truth labels, which is extremely important for DR diagnosis. Due to the serious imbalance of the DeepDR dataset, which contains only 92 PDR images, the accuracy for PDR is slightly worse.
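The per-class accuracy read off the diagonal of a confusion matrix can be computed as in the sketch below, assuming rows index the ground-truth grade and columns the predicted grade. The function name is hypothetical and the matrix values are made-up toy numbers, not results from the experiments.

```python
def per_class_accuracy(confusion):
    """Diagonal of the row-normalized confusion matrix (rows = true classes)."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]


# Toy 3-class confusion matrix (illustrative counts only).
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 4],
]
print(per_class_accuracy(toy_confusion))
```

In the toy matrix every error falls in a cell adjacent to the diagonal, which is the pattern of near-miss predictions discussed above.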
To obtain a discriminative feature representation for fundus images, we propose a method based on the fine-grained classification framework. By taking the ordinal information among classes into account, the proposed method achieves competitive performance. Furthermore, the metric loss pulls semantically similar examples closer together than examples from different classes in the feature embedding space. As a result, the proposed method achieves state-of-the-art performance on two public DR datasets, which are more challenging than general images.
-  V. Gulshan, L. Peng, M. Coram, M.C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, R. Kim, R. Raman, P.C. Nelson, J.L. Mega, and D.R. Webster, “Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” JAMA, vol. 316, pp. 2402–2410, December 2016.
-  H. Pratt, F. Coenen, D.M. Broadbent, S.P. Harding, and Y. Zheng, “Convolutional neural networks for diabetic retinopathy,” Procedia Computer Science, vol. 90, pp. 200–205, 2016.
-  R. Gargeya and T. Leng, “Automated identification of diabetic retinopathy using deep learning,” Ophthalmology, vol. 124, pp. 962–969, July 2017.
-  C. Yu, X. Zhao, Q. Zheng, P. Zhang, and X. You, “Hierarchical bilinear pooling for fine-grained visual recognition,” in Proceedings of the European Conference on Computer Vision. Springer, 2018, pp. 595–610.
-  R. Díaz and A. Marathe, “Soft labels for ordinal regression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4738–4747.
-  Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin, “Softtriple loss: Deep metric learning without triplet sampling,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6450–6458.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  M.N. Bajwa, Y. Taniguchi, M.I. Malik, W. Neumeier, A. Dengel, and S. Ahmed, “Combining fine-and coarse-grained classifiers for diabetic retinopathy detection,” in Annual Conference on Medical Image Understanding and Analysis, 2019, vol. 1065, pp. 242–253.
-  G. Huang, Z. Liu, L.J.P. van der Maaten, and K.Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
-  Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang, “Learning to navigate for fine-grained classification,” in Proceedings of the European Conference on Computer Vision. Springer, 2018, pp. 420–435.
-  A. Recasens, P. Kellnhofer, S. Stent, W. Matusik, and A. Torralba, “Learning to zoom: a saliency-based sampling layer for neural networks,” in Proceedings of the European Conference on Computer Vision. Springer, 2018, vol. 11213, pp. 52–67.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  Z. Shen, Z. Liu, J. Li, Y. Jiang, Y. Chen, and X. Xue, “Dsod: Learning deeply supervised object detectors from scratch,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1937–1945.
-  T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 936–944.
-  T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2999–3007.
-  C. P. Wilkinson, F. L. Ferris III, R. E. Klein, P. P. Lee, C. D. Agardh, and M. Davis, “Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales,” Ophthalmology, vol. 110, pp. 1677–1682, September 2003.
-  T. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.
-  Y. Wang, V.I. Morariu, and L.S. Davis, “Learning a discriminative filter bank within a cnn for fine-grained recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4148–4157.
-  R. Du, D. Chang, A. K. Bhunia, J. Xie, Y. Z. Song, Z. Ma, and J. Guo, “Fine-grained visual classification via progressive multi-granularity training of jigsaw patches,” arXiv preprint arXiv:2003.03836, 2020.
-  P. Porwal, S. Pachade, M. Kokare, G. Deshmukh, J. Son, W. Bae, and T. Wu, “Idrid: Diabetic retinopathy–segmentation and grading challenge,” Medical Image Analysis, vol. 59, pp. 101561, January 2020.
-  R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.