Diabetic retinopathy (DR) is one of the microvascular complications of diabetes, which may cause vision impairment and even blindness. The major pathological signs of DR include hemorrhages, exudates, microaneurysms, and retinal neovascularization, as shown in Fig. 1. Color fundus photography, which can clearly reveal the presence of different lesions, is the imaging modality most widely used by ophthalmologists to assess the severity of DR. Early diagnosis and timely intervention are of vital importance in preventing DR patients from vision loss. As such, automated and efficient fundus image based DR diagnosis systems are urgently needed.
Self-supervised learning (SSL) methods [4, 11, 18, 21] have been explored to learn feature representations from unlabeled images. As a representative SSL method, contrastive learning (CL) [1, 2, 8] has been very successful in the natural image field. It defines a contrastive prediction task that, via a contrastive loss, maximizes the similarity between features from differently augmented views of the same image while maximizing the distance between features from different images. However, the application of CL in the fundus image field [15, 20] is relatively limited, and CL for fundus image based DR grading is even rarer, mainly due to fundus images' high resolution and low proportion of diagnostic features. First, fundus images need to be of high resolution so as to clearly reveal structural and pathological details, and it is challenging to train CL on high-resolution images with the large batch sizes that CL generally requires to provide sufficient negative samples. Second, data augmentation is typically applied in CL to generate different views of the same image, but some strong data augmentation operations, such as cropping and Cutout, may destroy important domain-specific information when applied to fundus images. In a natural image, salient objects generally occupy a large proportion of the image and each of their parts may contribute to recognizing the object of interest, whereas the diagnostic features in a fundus image, such as lesions, may occupy only a small part of the whole image. As a result, most cropped patches fed into the CL framework may contain few or no diagnostic features, and the network is thus prone to learning feature representations that distinguish different views of an image but are not discriminative for downstream tasks.
To address the aforementioned issues, we propose a lesion-based contrastive learning approach for fundus image based DR grading. Instead of using entire fundus images, lesion patches are taken as the input for our contrastive prediction task. By focusing on patches with lesions, the network is encouraged to learn more discriminative features for DR grading. The main steps of our framework are as follows. First, an object detection network is trained on the publicly-available IDRiD dataset, which consists of 81 fundus images with lesion annotations. Then, the detection network is applied to the training set of EyePACS to predict lesions with a relatively high confidence threshold. Next, random data augmentation operations are applied to the lesion patches to generate multiple views of them. The feature extraction network in our CL framework is expected to map input patches into an embedding feature space, wherein the similarity between features from different views of the same lesion patch and the distance between features from different patches are maximized, by minimizing a contrastive loss. The performance of our proposed method is evaluated based on linear evaluation and transfer capacity evaluation on EyePACS.
The main contributions of this paper are three-fold: (1) We present a self-supervised framework for DR grading, namely lesion-based contrastive learning. Our framework's contrastive prediction task takes lesion patches as the input, which addresses the high memory requirements and the lack of diagnostic features that common CL schemes suffer from. This design can be easily extended to other types of medical images with relatively weak physiological characteristics. (2) We study different data augmentation operations for defining our contrastive prediction task. Results show that a composition of cropping, color distortion, gray scaling, and rotation is beneficial for defining pretext tasks on fundus images to learn discriminative feature representations. (3) We evaluate our framework on the large-scale EyePACS dataset for DR grading. Results from linear evaluation and transfer capacity evaluation demonstrate our method's superiority. The source code is available at https://github.com/YijinHuang/Lesion-based-Contrastive-Learning
2.1 Generation of lesion patches
Two datasets are used in this work. An object detection network is trained on one dataset with lesion annotations. This detection network is then used to generate lesion patches from fundus images of the other dataset for subsequent contrastive learning. Because of the limited number of training samples in the first dataset, the detection network has a relatively poor generalization ability and cannot precisely predict lesions in fundus images from the other dataset. Therefore, a high confidence threshold is set to eliminate unconfident predictions. Then, we resize all fundus images to a fixed resolution, and the bounding boxes of the patches are scaled correspondingly. After that, we expand every predicted bounding box to a fixed patch size with the lesion lying in the center and then randomly shift the box within a range such that the resulting box still covers the lesion. In this way, we increase the difficulty of the contrastive prediction task while ensuring that the training of our CL framework can be performed with a large batch size. Note that the lesion detection network is not involved in the testing phase.
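The box expansion and random shift described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name and the 128-pixel patch size are assumptions, and the shift range is derived so that the lesion always stays inside the patch.

```python
import random

def make_patch_box(lesion_box, image_size, patch_size=128, rng=random):
    """Expand a detected lesion box to a fixed-size patch box centered on the
    lesion, then randomly shift it while keeping the lesion fully covered."""
    x1, y1, x2, y2 = lesion_box            # lesion bounding box (pixels)
    w, h = image_size                      # size of the resized fundus image

    def shifted_start(lo, hi, limit):
        # the patch [s, s + patch_size] covers [lo, hi] iff s in [hi - patch_size, lo]
        s_min = max(0, hi - patch_size)
        s_max = min(limit - patch_size, lo)
        if s_min <= s_max:
            return rng.uniform(s_min, s_max)   # random shift, lesion still covered
        # lesion larger than the patch: fall back to a clamped centered box
        center_start = (lo + hi - patch_size) / 2
        return min(max(center_start, 0), limit - patch_size)

    px1 = shifted_start(x1, x2, w)
    py1 = shifted_start(y1, y2, h)
    return (px1, py1, px1 + patch_size, py1 + patch_size)
```

In practice, one such box would be generated per confident detection, and the crops would then feed the contrastive prediction task.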
2.2 Generation of multiple views of lesion patches
Rather than employing a carefully designed task to learn feature representations [4, 11], CL tries to maximize the agreement between differently augmented views of the same image. Data augmentation, a widely used technique in deep learning, is applied to generate different views of a sample in CL. Some data augmentation operations that are commonly used in the natural image field may destroy important diagnostic features when transferred to the fundus image field. Therefore, as shown in part B of Fig. 2, four augmentation operations are considered in our work: cropping, color distortion, gray scaling, and rotation.
Let $\mathcal{B} = \{x_k\}_{k=1}^{N}$ denote a randomly-sampled batch with a batch size of $N$. Two random compositions of data augmentation operators are applied to each data point $x_k$ to generate two different views $\tilde{x}_{2k-1}$ and $\tilde{x}_{2k}$. Note that the parameters of these augmentation operators may differ from data point to data point. Now, we obtain a new batch $\tilde{\mathcal{B}} = \{\tilde{x}_l\}_{l=1}^{2N}$. Given a patch $\tilde{x}_i$ that is generated from $x_k$, we consider the patch $\tilde{x}_j$ that is also generated from $x_k$ as a positive sample and every patch in the set $\{\tilde{x}_l \in \tilde{\mathcal{B}} \mid l \neq i, l \neq j\}$ as a negative sample.
2.3 Lesion-based contrastive learning
Given a data point $\tilde{x}_i$, we first use a feature extractor $f(\cdot)$ to extract its feature vector $h_i = f(\tilde{x}_i)$. Specifically, $f(\cdot)$ is a CNN and $h_i$ is the feature vector right before the fully connected layer of the network. Then, a projection head $g(\cdot)$ is applied to map the feature vector into an embedding space to obtain $z_i = g(h_i)$. Given a batch $\tilde{\mathcal{B}}$, we define $z_l = g(f(\tilde{x}_l))$ for every $\tilde{x}_l \in \tilde{\mathcal{B}}$. For every $z_i$, our contrastive prediction task is to identify the embedded feature $z_j$ of the positive sample from $\{z_l \mid l \neq i\}$. To find $z_j$, we define the one having the highest cosine similarity with $z_i$ as our prediction. To maximize the similarity of positive samples and to minimize that of negative samples, we define our contrastive loss as

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{l=1}^{2N} \mathbb{1}_{[l \neq i]} \exp(\mathrm{sim}(z_i, z_l)/\tau)},$$

where $2N$ denotes the batch size of $\tilde{\mathcal{B}}$, $\mathrm{sim}(u, v) = u^{\top}v / (\lVert u \rVert \lVert v \rVert)$ denotes cosine similarity, $\mathbb{1}_{[l \neq i]}$ is an indicator function, and $\tau$ is a temperature parameter. In the testing phase, we do not use the projection head but only the feature extractor $f(\cdot)$ for downstream tasks. Our framework of lesion-based CL is depicted in part C of Fig. 2.
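The contrastive loss can be sketched in NumPy as follows. This is a minimal stand-in rather than training code: it assumes the embeddings are stacked so that rows $2k$ and $2k+1$ (0-indexed) are the two views of patch $k$, and the temperature value is illustrative.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """Contrastive loss averaged over 2N embeddings, where rows 2k and 2k+1
    are the two augmented views of patch k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize -> cosine similarity
    sim = z @ z.T / tau                                # temperature-scaled pairwise similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs (the l != i indicator)
    sim -= sim.max(axis=1, keepdims=True)              # numerical stabilization
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))  # row-wise log-softmax
    pos = np.arange(len(z)) ^ 1                        # partner-view index: 0<->1, 2<->3, ...
    return -log_prob[np.arange(len(z)), pos].mean()
```

Minimizing this quantity pushes each embedding toward its partner view and away from all other patches in the batch.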
2.4 Implementation details
2.4.1 Data augmentation operations.
For the cropping operation, we randomly crop lesion patches with a random factor in [0.8, 1.2]. For the gray scaling operation, each patch has a 0.2 probability of being gray scaled. The color distortion operation adjusts the brightness, contrast, and saturation of patches with a random factor in [-0.4, 0.4] and also changes the hue with a random factor in [-0.1, 0.1]. The rotation operation randomly rotates patches by an arbitrary angle.
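A lightweight NumPy sketch of generating one augmented view is given below. It is deliberately simplified: 90-degree rotations stand in for arbitrary-angle rotation, only the zoom-in half of the [0.8, 1.2] crop range is covered, and brightness/contrast jitter stands in for the full color distortion; square patches are assumed. A real pipeline would typically use a transform library instead.

```python
import random
import numpy as np

def augment_view(patch, rng=random):
    """One random view of a square lesion patch (H x H x 3 float array in [0, 1])."""
    img = patch.copy()
    h, w, _ = img.shape
    # rotation (90-degree multiples stand in for arbitrary angles)
    img = np.rot90(img, k=rng.randrange(4)).copy()
    # random crop with a factor in [0.8, 1.0], nearest-neighbour resize back
    f = rng.uniform(0.8, 1.0)
    ch, cw = max(1, int(h * f)), max(1, int(w * f))
    y0, x0 = rng.randrange(h - ch + 1), rng.randrange(w - cw + 1)
    crop = img[y0:y0 + ch, x0:x0 + cw]
    yi = np.minimum(np.arange(h) * ch // h, ch - 1)
    xi = np.minimum(np.arange(w) * cw // w, cw - 1)
    img = crop[np.ix_(yi, xi)]
    # brightness and contrast jitter with factors in [-0.4, 0.4]
    img = img * (1 + rng.uniform(-0.4, 0.4))
    mean = img.mean()
    img = (img - mean) * (1 + rng.uniform(-0.4, 0.4)) + mean
    # gray scaling with probability 0.2
    if rng.random() < 0.2:
        img = np.repeat(img.mean(axis=2, keepdims=True), 3, axis=2)
    return np.clip(img, 0.0, 1.0)
```

Calling this twice on the same patch yields the two views that form a positive pair in the contrastive prediction task.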
2.4.2 Lesion detection network.
The lesion detection network is trained on the IDRiD dataset with the Adam optimizer for 100 epochs, with an initial learning rate of 0.01 that is decayed by a factor of 0.1 at the 50th and the 80th epoch.
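The stated schedule can be written as a simple step function. The milestone and decay values follow the text; the helper itself is only an illustrative sketch of the schedule, not the training loop.

```python
def detection_lr(epoch, base_lr=0.01, milestones=(50, 80), gamma=0.1):
    """Learning rate at a given epoch: start at base_lr and multiply by gamma
    at each milestone epoch (here, the 50th and the 80th)."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)
```

So the learning rate is 0.01 for the first 50 epochs, 0.001 until epoch 80, and 0.0001 afterwards.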
2.4.3 Contrastive learning network.
Table 1. Number of fundus images and lesion patches under different confidence thresholds of the detection network.
2.5 Evaluation protocol
2.5.1 Linear evaluation.
Linear evaluation is a widely used method for evaluating the quality of the learned representations of a self-supervised model. The pre-trained feature extractor described in Section 2.3
is frozen and a linear classifier on top of it is trained in a fully-supervised manner. The performance of downstream tasks is then used as a proxy of the quality of the learned representations.
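As a concrete stand-in, the protocol can be sketched with a logistic-regression head on frozen features. The `extract` callable and the data are hypothetical; in the described setup the head is a linear layer trained on top of the frozen CNN feature extractor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_evaluation(extract, x_train, y_train, x_test, y_test):
    """Freeze the feature extractor (`extract` is treated as a fixed mapping)
    and train only a linear classifier on its output features."""
    f_train = extract(x_train)   # no gradient ever reaches the extractor
    f_test = extract(x_test)
    head = LogisticRegression(max_iter=1000).fit(f_train, y_train)
    return head.score(f_test, y_test)  # test accuracy as a proxy for representation quality
```

In the transfer capacity setting of Section 2.5.2, the extractor's parameters would instead be updated jointly with the head.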
2.5.2 Transfer capacity evaluation.
Pre-trained parameters of the feature extractor can be transferred to models used in downstream tasks. To evaluate the transfer learning capacity, we unfreeze the feature extractor and fine-tune it, together with a linear classifier on top, in supervised downstream tasks.
3.1 Dataset and evaluation metric
IDRiD consists of 81 fundus images with pixel-wise lesion annotations of hemorrhages, microaneurysms, soft exudates, and hard exudates. These manual annotations are converted to bounding boxes to train an object detection network. Microaneurysms are excluded in this work because detecting them is challenging and would lead to a large number of false positive predictions. In training the lesion detection network, 54 samples are used for training and 27 for validation.
EyePACS provides 35k/11k/43k fundus images for training/validation/testing (the class distribution of EyePACS is shown in Fig. A1 of the appendix). According to the severity of DR, images are classified into five grades: 0 (normal), 1 (mild), 2 (moderate), 3 (severe), and 4 (proliferative). The training and validation sets, without annotations, are used to train our self-supervised model. The total number of lesion patches under different confidence thresholds of the detection network is shown in Table 1. Representative lesion detection results are provided in Fig. A2 of the appendix. Partial datasets are obtained by randomly sampling 1%/5%/10%/25%/100% (0.3k/1.7k/3.5k/8.7k/35k images) of the training set, together with the corresponding annotations. Images in the partial datasets and the test set are resized to a fixed resolution for training and testing the subsequent DR grading models in both the linear evaluation and transfer capacity evaluation settings.
3.1.3 Evaluation metric.
We adopt the quadratic weighted kappa for evaluation, which works well for unbalanced datasets.
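Assuming scikit-learn is available, the metric can be computed directly. The example below also illustrates why quadratic weighting suits ordinal DR grades: predictions far from the true grade are penalized much more heavily than near-misses.

```python
from sklearn.metrics import cohen_kappa_score

def quadratic_weighted_kappa(y_true, y_pred):
    """Quadratic weighted kappa: disagreement between grades i and j is
    weighted by (i - j)^2, so confusing grade 0 with grade 4 costs far more
    than confusing grade 0 with grade 1."""
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")
```

For instance, a classifier that is systematically off by one grade scores much higher under this metric than one whose predictions are reversed across the grade scale.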
3.2 Composition of data augmentation operations
We evaluate the importance of each augmentation operation by removing it from the composition or applying it alone. As shown in the top panel of Table 2, a single augmentation operation is insufficient for learning discriminative representations. Even so, color distortion works much better than the other operations, showing its importance in defining our contrastive prediction task for fundus images. This indicates that DR grading benefits from color invariance, presumably because the EyePACS images are highly diverse, especially in terms of image intensity profiles. From the bottom panel of Table 2, we notice that the absence of any of the four augmentation operations leads to a decrease in performance. Applying the composition of all four augmentation operations considerably increases the difficulty of our contrastive prediction task, but it also significantly improves the quality of the learned representations. Notably, this contrasts with previous observations that heavy data augmentation typically hurts the performance of disease diagnosis.
Table 2. Quadratic weighted kappa of linear evaluation under different compositions of the four data augmentation operations (cropping, rotation, color distortion, gray scaling).
Table 3. Quadratic weighted kappa (%) of linear evaluation and transfer capacity evaluation on the partial EyePACS training sets.

| Method | Confidence threshold | 1% | 5% | 10% | 25% | 100% |
|---|---|---|---|---|---|---|
| *Linear evaluation* | | | | | | |
| SimCLR (128×128) | - | 16.19 | 26.70 | 31.62 | 37.41 | 43.64 |
| SimCLR (224×224) | - | 12.15 | 26.56 | 29.94 | 37.86 | 55.32 |
| Lesion-based CL (ours) | 0.7 | 24.72 | 43.98 | 53.33 | 61.55 | 66.87 |
| Lesion-based CL (ours) | 0.8 | 26.22 | 48.18 | 56.30 | 62.49 | 66.80 |
| Lesion-based CL (ours) | 0.9 | 31.92 | 56.57 | 60.70 | 63.45 | 66.88 |
| *Transfer capacity evaluation* | | | | | | |
| SimCLR (128×128) | - | 63.16 | 72.30 | 76.33 | 79.59 | 82.72 |
| SimCLR (224×224) | - | 55.43 | 70.66 | 75.15 | 77.32 | 82.11 |
| Lesion-based CL (ours) | 0.7 | 68.37 | 75.40 | 77.34 | 80.34 | 82.80 |
| Lesion-based CL (ours) | 0.8 | 68.14 | 74.18 | 77.49 | 80.74 | 83.22 |
| Lesion-based CL (ours) | 0.9 | 66.43 | 73.85 | 76.93 | 80.27 | 83.04 |
3.3 Evaluation on DR grading
To evaluate the quality of the learned representations from fundus images, we perform linear evaluation and transfer capacity evaluation on the partial datasets. A state-of-the-art CL method, SimCLR, is adopted as our baseline for comparison; note that it takes an entire fundus image as the input for the contrastive prediction task. In Table 3, SimCLR (128×128) denotes that the multiple views of fundus images are resized to 128×128 as the input for the contrastive prediction task in SimCLR, consistent with the input size of our lesion-based CL framework. However, the crop-and-resize transformation may critically change the pixel size of the input. The SimCLR (224×224) experiments are conducted based on the consideration that aligning the pixel size of the input for CL with that for downstream tasks may achieve better performance. The ResNet50 model in the fully-supervised DR grading is initialized with parameters from a model trained on the ImageNet dataset. Training curves are provided in Fig. A3 of the appendix.
As shown in Table 3, our lesion-based CL under a detection confidence threshold of 0.9 achieves a kappa of 66.88% on linear evaluation on the full training set, a significant improvement over SimCLR: 23.24% over SimCLR (128×128) and 11.56% over SimCLR (224×224). The superiority of our method is more evident when a smaller training set is used for linear evaluation. Note that although using a higher confidence threshold results in fewer lesion patches for training, a better linear evaluation performance is achieved. This implies that improving the quality of the lesions in the training set of the contrastive prediction task helps the model learn discriminative feature representations more effectively. For the transfer capacity evaluation, when fine-tuning on the full training set, there is little difference between the fully-supervised method and the CL methods. This is because feature representations can be sufficiently learned under full supervision, leaving little room for CL-based pre-training to help. Accordingly, the advantage of our proposed method becomes more evident as the number of training samples for fine-tuning decreases. With only the 1% partial dataset for fine-tuning, the proposed method under a confidence threshold of 0.7 achieves a kappa 15.01% higher than that of the fully-supervised method. Both linear evaluation and transfer capacity evaluation suggest that our proposed method learns better feature representations by exploiting unlabeled images and is thus able to enhance DR grading (see also Fig. A4 in the appendix).
In this paper, we propose a novel self-supervised framework for DR grading. We use lesion patches rather than entire fundus images as the input for our contrastive prediction task, which encourages our feature extractor to learn representations of diagnostic features and improves the transfer capacity for downstream tasks. We also analyze the importance of different data augmentation operations in our CL task. By performing linear evaluation and transfer capacity evaluation on EyePACS, we show that our method achieves superior DR grading performance, especially when the amount of annotated training data is limited. To the best of our knowledge, this work is the first attempt at contrastive learning for fundus image based DR grading.
-  Bachman, P., Hjelm, R.D. and Buchwalter, W.: Learning representations by maximizing mutual information across views. In: NeurIPS (2019)
-  Chen, T., Kornblith, S., Norouzi, M. and Hinton, G.: A simple framework for contrastive learning of visual representations. In: PMLR (2020)
-  DeVries, T. and Taylor, G.W.: Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552 (2017)
-  Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)
-  He, A., Li, T., Li, N., Wang, K. and Fu, H.: CABNet: Category Attention Block for Imbalanced Diabetic Retinopathy Grading. IEEE Trans. Med. Imaging, TMI, 40(1), pp.143-153 (2020)
-  He, K., Zhang, X., Ren, S. and Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
-  Huang, Y., Lin, L., Li, M., Wu, J., Cheng, P., Wang, K., … Tang, X.: Automated hemorrhage detection from coarsely annotated fundus images in diabetic retinopathy. In: IEEE 17th International Symposium on Biomedical Imaging, ISBI, pp. 1369-1372 (2020)
-  Jiao, J., Cai, Y., Alsharid, M., Drukker, L., Papageorghiou, A.T. and Noble, J.A.: Self-Supervised Contrastive Video-Speech Representation Learning for Ultrasound. In: Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L. (eds.) MICCAI 2020. LNCS, volume 12263, pp. 534–543, Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59716-0_51
-  Kaggle diabetic retinopathy detection competition. https://www.kaggle.com/c/diabetic-retinopathy-detection
-  Krizhevsky, A., Sutskever, I. and Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NeurIPS (2012)
-  Lee, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Unsupervised representation learning by sorting sequences. In: ICCV (2017)
-  Lin, L., Li, M., Huang, Y., Cheng, P., Xia, H., et al.: The SUSTech-SYSU dataset for automated exudate detection and diabetic retinopathy grading. Scientific Data 7(1), 1-1 (2020)
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
-  Lin, Z., Guo, R., Wang, Y., Wu, B., Chen, T., Wang, W., Chen, D.Z. and Wu, J.: A framework for identifying diabetic retinopathy based on anti-noise detection and attention-based fusion. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, volume 11071, pp. 74-82, Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_9
-  Li, X., Jia, M., Islam, M.T., Yu, L. and Xing, L.: Self-supervised Feature Learning via Exploiting Multi-modal Data for Retinal Disease Diagnosis. IEEE Trans. Med. Imaging, TMI, 39(12), pp.4023-4033(2020)
-  Porwal, P., et al.: Indian diabetic retinopathy image dataset (IDRID): a database for diabetic retinopathy screening research. Data 3(3), 25 (2018)
-  Ren, S., He, K., Girshick, R. and Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6), pp.1137–1149(2017)
-  Spitzer, H., Kiwitz, K., Amunts, K., Harmeling, S., Dickscheid, T.: Improving cytoarchitectonic segmentation of human brain areas with self-supervised siamese networks. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11072, pp. 663–671. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00931-1_76
-  Wang, Z., Yin, Y., Shi, J., Fang, W., Li, H., Wang, X.: Zoom-in-Net: deep mining lesions for diabetic retinopathy detection. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 267–275. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_31
-  Zeng, X., Chen, H., Luo, Y. and Ye, W.: Automated detection of diabetic retinopathy using a binocular siamese-like convolutional network. In: ISCAS (2019)
-  Zhuang, X., Li, Y., Hu, Y., Ma, K., Yang, Y., Zheng, Y.: Self-supervised feature learning for 3D medical images by playing a rubik’s cube. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11767, pp. 420–428. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32251-9_46