Lesion-based Contrastive Learning for Diabetic Retinopathy Grading from Fundus Images

by Yijin Huang, et al.

Manually annotating medical images is extremely expensive, especially for large-scale datasets. Self-supervised contrastive learning has been explored to learn feature representations from unlabeled images. However, compared with natural images, the application of contrastive learning to medical images remains relatively limited. In this work, we propose a self-supervised framework, namely lesion-based contrastive learning, for automated diabetic retinopathy (DR) grading. Instead of taking entire images as the input, as in the common contrastive learning scheme, lesion patches are employed to encourage the feature extractor to learn representations that are highly discriminative for DR grading. We also investigate different data augmentation operations in defining our contrastive prediction task. Extensive experiments are conducted on the publicly-accessible dataset EyePACS, demonstrating that our proposed framework achieves outstanding performance on DR grading in terms of both linear evaluation and transfer capacity evaluation.






1 Introduction

Diabetic retinopathy (DR) is one of the microvascular complications of diabetes, which may cause vision impairments and even blindness [12]. The major pathological signs of DR include hemorrhages, exudates, microaneurysms, and retinal neovascularization, as shown in Fig. 1. Color fundus photography is the most widely used imaging modality for ophthalmologists to identify the severity of DR, as it clearly reveals the presence of different lesions. Early diagnoses and timely interventions are of vital importance in preventing DR patients from vision loss. As such, automated and efficient fundus image based DR diagnosis systems are urgently needed.

Figure 1: A representative fundus image with four types of DR related lesions.

Recently, deep learning has achieved great success in the field of medical image analysis. Convolutional neural networks (CNNs) have been proposed to tackle the automated DR grading task [5, 14, 19]. The success of CNNs is mainly attributed to their capability of extracting highly representative features. However, training a network usually requires a large-scale annotated dataset. The annotation process is time-intensive, tedious, and error-prone, and hence ophthalmologists bear a heavy burden in building a well-annotated dataset.

Self-supervised learning (SSL) methods [4, 11, 18, 21] have been explored to learn feature representations from unlabeled images. As a representative SSL method, contrastive learning (CL) [1, 2, 8] has been very successful in the natural image field. It defines a contrastive prediction task that tries to maximize the similarity between features from differently augmented views of the same image and simultaneously maximize the distance between features from different images, via a contrastive loss. However, the application of CL in the fundus image field [15, 20] is relatively limited, mainly due to fundus images' high resolution and low proportion of diagnostic features. First, fundus images are of high resolution so as to clearly reveal structural and pathological details, and it is challenging to train CL on high-resolution images with the large batch sizes that CL generally requires to provide sufficient negative samples. Second, data augmentation is typically applied in CL to generate different views of the same image, but some strong data augmentation operations, such as cropping and Cutout [3], may destroy important domain-specific information when applied to fundus images. In a natural image, salient objects generally occupy a large proportion of the image and each part of them may contribute to the recognition of the object of interest, whereas the diagnostic features in a fundus image, such as lesions, may occupy only a small part of the whole image. As a result, most cropped patches fed into the CL framework may contain few or no diagnostic features. In this case, the network is prone to learn feature representations that are distinguishable for different views of the image but not discriminative for downstream tasks. CL for fundus image based DR grading is even rarer.

To address the aforementioned issues, we propose a lesion-based contrastive learning approach for fundus image based DR grading. Instead of using entire fundus images, lesion patches are taken as the input for our contrastive prediction task. By focusing on patches with lesions, the network is encouraged to learn more discriminative features for DR grading. The main steps of our framework are as follows. First, an object detection network is trained on a publicly-available dataset, IDRiD [16], that consists of 81 fundus images with annotations of lesions. Then, the detection network is applied to the training set of EyePACS [9] to predict lesions with a relatively high confidence threshold. Next, random data augmentation operations are applied to the lesion patches to generate multiple views of them. The feature extraction network in our CL framework is expected to map inputted patches into an embedding feature space, wherein the similarity between features from different views of the same lesion patch is maximized and the distance between features from different patches is maximized, by minimizing a contrastive loss. The performance of our proposed method is evaluated based on linear evaluation and transfer capacity evaluation on EyePACS.

The main contributions of this paper are three-fold: (1) We present a self-supervised framework for DR grading, namely lesion-based contrastive learning. Our framework's contrastive prediction task takes lesion patches as the input, which addresses the high memory requirements and the lack of diagnostic features of common CL schemes. This design can be easily extended to other types of medical images with relatively weak physiological characteristics. (2) We study different data augmentation operations in defining our contrastive prediction task. Results show that a composition of cropping, color distortion, gray scaling, and rotation is beneficial in defining pretext tasks for fundus images to learn discriminative feature representations. (3) We evaluate our framework on the large-scale EyePACS dataset for DR grading. Results from linear evaluation and transfer capacity evaluation demonstrate our method's superiority. The source code is available at https://github.com/YijinHuang/Lesion-based-Contrastive-Learning.

Figure 2: The proposed framework. In part A, an object detection network is trained to predict lesion patches in fundus images for subsequent contrastive learning. Illustrations of data augmentation operations are provided in part B. In part C, lesion patches are processed by a composition of data augmentation operations to generate multiple views of the lesion patches. A feature extractor is applied to encode patches into an embedding feature space that minimizes a contrastive loss.

2 Methods

2.1 Generation of lesion patches

Two datasets are used in this work. An object detection network is trained on one dataset with lesion annotations. Then, this detection network is used to generate lesion patches of fundus images from the other dataset for subsequent contrastive learning. Because of the limited training samples in the first dataset, the detection network has a relatively poor generalization ability and cannot precisely predict lesions in fundus images from the other dataset. Therefore, a high confidence threshold is set to eliminate unconfident predictions. Then, we resize all fundus images to a uniform resolution, and the bounding boxes of the predicted lesions are scaled correspondingly. After that, we expand every predicted bounding box to a fixed patch size with the lesion lying in the center, and then randomly shift the box within a range such that the resulting box still covers the lesion. In this way, we increase the difficulty of the contrastive prediction task while ensuring that the training of our CL framework can be performed with a large batch size. Note that the lesion detection network is not involved in the testing phase.
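The expand-and-shift step above can be sketched as follows. This is an illustrative implementation, not the authors' code: the patch size and the shift distribution (uniform over all origins that keep the lesion covered and the patch inside the image) are assumptions, since the paper does not specify them.

```python
import random

def sample_lesion_patch(box, patch_size, image_size, rng=random):
    """Expand a predicted lesion box to a fixed-size square patch.

    The patch is conceptually centered on the lesion, then its origin is
    randomly shifted so that the patch still fully covers the lesion and
    stays within the image. `box` is (x1, y1, x2, y2) in pixels.
    """
    x1, y1, x2, y2 = box

    def sample_origin(lo_edge, hi_edge):
        # Valid origins keep the patch covering [lo_edge, hi_edge]
        # while remaining inside [0, image_size].
        lo = max(hi_edge - patch_size, 0.0)
        hi = min(lo_edge, image_size - patch_size)
        hi = max(hi, lo)  # degenerate case: lesion wider than the patch
        return rng.uniform(lo, hi)

    px = sample_origin(x1, x2)
    py = sample_origin(y1, y2)
    return (px, py, px + patch_size, py + patch_size)
```

Sampling the origin uniformly over the full valid range both covers the lesion and randomizes its position within the patch, which is what makes the contrastive task harder.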

2.2 Generation of multiple views of lesion patches

Rather than employing a carefully designed task to learn feature representations [4, 11], CL tries to maximize the agreement between differently augmented views of the same image. Data augmentation, a widely used technique in deep learning, is applied to generate different views of a sample in CL. Some data augmentation operations that are commonly used in the natural image field may destroy important diagnostic features when transferred to the fundus image field. Therefore, as shown in part B of Fig. 2, four augmentation operations are considered in our work: cropping, color distortion, gray scaling, and rotation.

Let $X = \{x_1, \ldots, x_N\}$ denote a randomly-sampled batch with a batch size of $N$. Two random compositions of data augmentation operators are applied to each data point $x_i$ to generate two different views $\tilde{x}_{2i-1}$ and $\tilde{x}_{2i}$. Note that the parameters of these augmentation operators may differ from data point to data point. We thus obtain a new batch $\tilde{X} = \{\tilde{x}_1, \ldots, \tilde{x}_{2N}\}$. Given a patch $\tilde{x}_i$ that is generated from $x_j$, we consider the view $\tilde{x}_{i'}$ that is also generated from $x_j$ as a positive sample and every patch in the set $\tilde{X} \setminus \{\tilde{x}_i, \tilde{x}_{i'}\}$ as a negative sample.

2.3 Lesion-based contrastive learning

Given a data point $\tilde{x}_i$, we first use a feature extractor $f(\cdot)$ to extract its feature vector $h_i = f(\tilde{x}_i)$. Specifically, $f(\cdot)$ is a CNN and $h_i$ is the feature vector right before the fully connected layer of the network. Then, a projection head $g(\cdot)$ is applied to map the feature vector into an embedding space to obtain $z_i = g(h_i)$. Given a batch $\tilde{X}$, we define $H = \{h_1, \ldots, h_{2N}\}$ and $Z = \{z_1, \ldots, z_{2N}\}$. For every $z_i \in Z$, our contrastive prediction task is to identify the embedded feature $z_{i'}$ of its positive sample from $Z \setminus \{z_i\}$. To find $z_{i'}$, we define the one having the highest cosine similarity with $z_i$ as our prediction. To maximize the similarity of positive samples and to minimize that of negative samples, we define our contrastive loss as

$$\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{2N} -\log \frac{\exp(\mathrm{sim}(z_i, z_{i'})/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)},$$

where $2N$ denotes the batch size of $\tilde{X}$, $\mathrm{sim}(u, v) = u^{\top}v / (\|u\|\|v\|)$ denotes the cosine similarity, $\mathbb{1}_{[k \neq i]}$ is an indicator function, and $\tau$ is a temperature parameter. In the testing phase, we do not use the projection head but only the feature extractor for downstream tasks. Our framework of lesion-based CL is depicted in part C of Fig. 2.
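This contrastive (NT-Xent-style) loss can be sketched in NumPy as below. This is an illustrative re-implementation, not the authors' code: it assumes rows i and i+N of the embedding matrix are the two views of the same patch, and the temperature value is arbitrary.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """Contrastive loss over 2N embeddings (NumPy sketch).

    Rows i and i + N of `z` are assumed to be the two augmented views of
    the same lesion patch. `tau` is the temperature parameter.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # for cosine similarity
    sim = (z @ z.T) / tau                             # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude k == i terms
    # Row-wise log-softmax: the log of each positive pair's probability.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n = z.shape[0] // 2
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

Since the loss is the negative log of a softmax probability, it is always non-negative and is minimized when each embedding's nearest neighbor (by cosine similarity) is its positive view.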

2.4 Implementation details

2.4.1 Data augmentation operations.

For the cropping operation, we randomly crop lesion patches with a random factor in [0.8, 1.2]. For the gray scaling operation, each patch has a 0.2 probability of being gray scaled. The color distortion operation adjusts the brightness, contrast, and saturation of patches with a random factor in [-0.4, 0.4] and also changes the hue with a random factor in [-0.1, 0.1]. The rotation operation randomly rotates patches by an arbitrary angle.

2.4.2 Lesion detection network.

Faster-RCNN [17] with ResNet50 [6] as the backbone is adopted as our lesion detection network. We apply transfer learning by initializing the network with parameters from a model pre-trained on the COCO dataset [13]. The detection network is trained with the Adam optimizer for 100 epochs, with an initial learning rate of 0.01, decayed by a factor of 0.1 at the 50th and the 80th epochs.

2.4.3 Contrastive learning network.

We also use ResNet50 as our feature extractor. The projection head is a one-layer MLP with ReLU as the activation function, which projects the feature vector into a lower-dimensional embedding space. We adopt the SGD optimizer with a 0.001 initial learning rate and a cosine decay strategy to train the network. The batch size is set to 768, and the temperature parameter $\tau$ is fixed. The augmented views of lesion patches are resized to 128×128 as the input to our contrastive learning task. All experiments are equally trained for 1000 epochs with a fixed random seed.
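The optimizer and schedule can be set up as below. The momentum value and the per-epoch granularity of the cosine decay are assumptions; the paper specifies only SGD, a 0.001 initial learning rate, cosine decay, and 1000 epochs.

```python
import torch

def build_cl_optimizer(model, epochs=1000):
    """SGD with a cosine-decayed learning rate for the contrastive task.
    Momentum 0.9 is an assumed (common) default, not from the paper."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```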

Confidence threshold # images # lesions
0.7 25226 88867
0.8 21578 63362
0.9 15889 35550
Table 1: The total number of lesion patches for the lesion-based CL under different confidence thresholds of the lesion detection network with a 31.17% mAP.

2.5 Evaluation protocol

2.5.1 Linear evaluation.

Linear evaluation is a widely used method for evaluating the quality of the learned representations of a self-supervised model. The pre-trained feature extractor described in Section 2.3 is frozen, and a linear classifier on top of it is trained in a fully-supervised manner. The performance on downstream tasks is then used as a proxy for the quality of the learned representations.

2.5.2 Transfer capacity evaluation.

Pre-trained parameters of the feature extractor can be transferred to models used in downstream tasks. To evaluate the transfer learning capacity, we unfreeze and fine-tune the feature extractor followed by a linear classifier in supervised downstream tasks.

3 Experiment

3.1 Dataset and evaluation metric

3.1.1 IDRiD.

IDRiD [16] consists of 81 fundus images, with pixel-wise lesion annotations of hemorrhages, microaneurysms, soft exudates, and hard exudates. These manual annotations are converted to bounding boxes to train an object detection network [7]. Microaneurysms are excluded in this work because detecting them is challenging and will lead to a large number of false positive predictions. In training the lesion detection network, 54 samples are used for training and 27 for validation.

3.1.2 EyePACS.

35k/11k/43k fundus images are provided in EyePACS [9] for training/validation/testing (the class distribution of EyePACS is shown in Fig. A1 of the appendix). According to the severity of DR, images are classified into five grades: 0 (normal), 1 (mild), 2 (moderate), 3 (severe), and 4 (proliferative). The training and validation sets, without annotations, are used for training our self-supervised model. The total number of lesion patches under different confidence thresholds of the detection network is shown in Table 1. Representative lesion detection results are provided in Fig. A2 of the appendix. Partial datasets are obtained by randomly sampling 1%/5%/10%/25%/100% (0.3k/1.7k/3.5k/8.7k/35k images) from the training set, together with the corresponding annotations. Images in the partial datasets and the test set are resized to 224×224 for training and testing the subsequent DR grading models in both the linear evaluation and transfer capacity evaluation settings.

3.1.3 Evaluation metric.

We adopt the quadratic weighted kappa [9] for evaluation, which works well for unbalanced datasets.
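Quadratic weighted kappa penalizes disagreements by the squared distance between the predicted and true grades, normalized by chance agreement. A self-contained NumPy version of the standard definition (which should match scikit-learn's cohen_kappa_score with weights='quadratic'):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for ordinal DR grades 0..n_classes-1."""
    O = np.zeros((n_classes, n_classes))  # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    idx = np.arange(n_classes)
    # Quadratic penalty matrix: 0 on the diagonal, growing with distance.
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected confusion matrix under chance agreement (outer product
    # of the marginal grade histograms).
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement gives kappa 1, chance-level agreement gives 0, and near-miss errors (e.g. grade 3 predicted as 2) are penalized far less than distant ones, which is why this metric suits ordinal, unbalanced DR grades.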

3.2 Composition of data augmentation operations

We evaluate the importance of an augmentation operation by removing it from the composition or applying it alone. As shown in the top panel of Table 2, a single augmentation operation is insufficient for learning discriminative representations. Even so, color distortion works much better than the other operations, showing its importance in defining our contrastive prediction task for fundus images. This clearly indicates that DR grading benefits from color invariance. We conjecture this is because the EyePACS images are highly diverse, especially in terms of image intensity profiles. From the bottom panel of Table 2, we notice that the absence of any of the four augmentation operations leads to a decrease in performance. Applying the composition of all four augmentation operations considerably increases the difficulty of our contrastive prediction task, but it also significantly improves the quality of the learned representations. Notably, this contrasts with previous observations on medical images, where heavy data augmentation typically hurts the performance of disease diagnosis [15].

Cropping Rotation Color distortion Gray Scaling Kappa
Table 2: Impact of different compositions of data augmentation operations. The kappa is the result of linear evaluation on the 25% partial dataset, under a detection confidence threshold of 0.8.
Method | Confidence threshold | Partial dataset: 1% / 5% / 10% / 25% / 100%

Linear evaluation (quadratic weighted kappa)
SimCLR (128×128)       | -   | 16.19 / 26.70 / 31.62 / 37.41 / 43.64
SimCLR (224×224)       | -   | 12.15 / 26.56 / 29.94 / 37.86 / 55.32
Lesion-based CL (ours) | 0.7 | 24.72 / 43.98 / 53.33 / 61.55 / 66.87
Lesion-based CL (ours) | 0.8 | 26.22 / 48.18 / 56.30 / 62.49 / 66.80
Lesion-based CL (ours) | 0.9 | 31.92 / 56.57 / 60.70 / 63.45 / 66.88

Transfer capacity evaluation (quadratic weighted kappa)
Supervised             | -   | 53.36 / 72.19 / 76.35 / 79.85 / 83.10
SimCLR (128×128)       | -   | 63.16 / 72.30 / 76.33 / 79.59 / 82.72
SimCLR (224×224)       | -   | 55.43 / 70.66 / 75.15 / 77.32 / 82.11
Lesion-based CL (ours) | 0.7 | 68.37 / 75.40 / 77.34 / 80.34 / 82.80
Lesion-based CL (ours) | 0.8 | 68.14 / 74.18 / 77.49 / 80.74 / 83.22
Lesion-based CL (ours) | 0.9 | 66.43 / 73.85 / 76.93 / 80.27 / 83.04

Table 3: Linear evaluation and transfer capacity evaluation results on partial datasets.

3.3 Evaluation on DR grading

To evaluate the quality of the learned representations from fundus images, we perform linear evaluation and transfer capacity evaluation on the partial datasets. A state-of-the-art CL method, SimCLR [2], is adopted as our baseline for comparison, which nevertheless takes an entire fundus image as the input for the contrastive prediction task. In Table 3, SimCLR (128×128) denotes that the multiple views of fundus images are resized to 128×128 as the input for the contrastive prediction task in SimCLR, consistent with the input size of our lesion-based CL framework. However, the crop-and-resize transformation may critically change the pixel size of the input. The SimCLR (224×224) experiments are conducted based on the consideration that aligning the pixel size of the input for CL with that for downstream tasks may achieve better performance. The ResNet50 model in the fully-supervised DR grading is initialized with parameters from a model trained on the ImageNet [10] dataset. Training curves are provided in Fig. A3 of the appendix.

As shown in Table 3, our lesion-based CL under a detection confidence threshold of 0.9 achieves a 66.88% kappa on linear evaluation on the full training set. Significant improvements over SimCLR are observed: 23.24% over SimCLR (128×128) and 11.56% over SimCLR (224×224). The superiority of our method is more evident when a smaller training set is used for linear evaluation. Note that although using a higher confidence threshold results in fewer lesion patches for training, better performance is achieved on linear evaluation. This implies that by improving the quality of the lesions in the training set of the contrastive prediction task, the model can more effectively learn discriminative feature representations. For the transfer capacity evaluation, when fine-tuning on the full training set, there is not much difference between the fully-supervised method and the CL methods. This is because feature representations can be sufficiently learned under full supervision, and thus there may be no need for CL based learning. Accordingly, the advantage of our proposed method becomes more evident when the training samples for fine-tuning become fewer. With only the 1% partial dataset for fine-tuning, the proposed method under a confidence threshold of 0.7 achieves a kappa 15.01% higher than the fully-supervised method. Both linear evaluation and transfer capacity evaluation suggest that our proposed method can better learn feature representations, and thus is able to enhance DR grading, by exploiting unlabeled images (see also Fig. A4 in the appendix).

4 Conclusion

In this paper, we propose a novel self-supervised framework for DR grading. We use lesion patches as the input for our contrastive prediction task rather than entire fundus images, which encourages our feature extractor to learn representations of diagnostic features, improving the transfer capacity for downstream tasks. We also demonstrate the importance of different data augmentation operations in our CL task. By performing linear evaluation and transfer capacity evaluation on EyePACS, we show that our method delivers superior DR grading performance, especially when the amount of annotated training data is limited. To the best of our knowledge, this work is the first to apply contrastive learning to fundus image based DR grading.


  • [1] Bachman, P., Hjelm, R.D. and Buchwalter, W.: Learning representations by maximizing mutual information across views. In: NeurIPS (2019)
  • [2] Chen, T., Kornblith, S., Norouzi, M. and Hinton, G.: A simple framework for contrastive learning of visual representations. In: PMLR (2020)
  • [3] DeVries, T. and Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint (2017)
  • [4] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)
  • [5] He, A., Li, T., Li, N., Wang, K. and Fu, H.: CABNet: Category Attention Block for Imbalanced Diabetic Retinopathy Grading. IEEE Trans. Med. Imaging 40(1), pp. 143-153 (2020)
  • [6] He, K., Zhang, X., Ren, S. and Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • [7] Huang, Y., Lin, L., Li, M., Wu, J., Cheng, P., Wang, K., … Tang, X.: Automated hemorrhage detection from coarsely annotated fundus images in diabetic retinopathy. In: IEEE 17th International Symposium on Biomedical Imaging, ISBI, pp. 1369-1372 (2020)
  • [8] Jiao, J., Cai, Y., Alsharid, M., Drukker, L., Papageorghiou, A.T. and Noble, J.A.: Self-Supervised Contrastive Video-Speech Representation Learning for Ultrasound. In: Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L. (eds.) MICCAI 2020. LNCS, volume 12263, pp. 534–543, Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59716-0_51
  • [9] Kaggle diabetic retinopathy detection competition. https://www.kaggle.com/c/diabetic-retinopathy-detection
  • [10] Krizhevsky, A., Sutskever, I. and Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NeurIPS (2012)

  • [11] Lee, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Unsupervised representation learning by sorting sequences. In: ICCV (2017)
  • [12] Lin, L., Li, M., Huang, Y., Cheng, P., Xia, H., et al.: The SUSTech-SYSU dataset for automated exudate detection and diabetic retinopathy grading. Scientific Data 7(1), 1-1 (2020)
  • [13] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
  • [14] Lin, Z., Guo, R., Wang, Y., Wu, B., Chen, T., Wang, W., Chen, D.Z. and Wu, J.: A framework for identifying diabetic retinopathy based on anti-noise detection and attention-based fusion. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 74-82, Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_9
  • [15] Li, X., Jia, M., Islam, M.T., Yu, L. and Xing, L.: Self-supervised Feature Learning via Exploiting Multi-modal Data for Retinal Disease Diagnosis. IEEE Trans. Med. Imaging, TMI, 39(12), pp.4023-4033(2020)
  • [16] Porwal, P., et al.: Indian diabetic retinopathy image dataset (IDRID): a database for diabetic retinopathy screening research. Data 3(3), 25 (2018)
  • [17] Ren, S., He, K., Girshick, R. and Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6), pp.1137–1149(2017)
  • [18] Spitzer, H., Kiwitz, K., Amunts, K., Harmeling, S., Dickscheid, T.: Improving cytoarchitectonic segmentation of human brain areas with self-supervised siamese networks. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11072, pp. 663–671. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00931-1_76
  • [19] Wang, Z., Yin, Y., Shi, J., Fang, W., Li, H., Wang, X.: Zoom-in-Net: deep mining lesions for diabetic retinopathy detection. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 267–275. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_31
  • [20] Zeng, X., Chen, H., Luo, Y. and Ye, W.: Automated detection of diabetic retinopathy using a binocular siamese-like convolutional network. In: ISCAS (2019)
  • [21] Zhuang, X., Li, Y., Hu, Y., Ma, K., Yang, Y., Zheng, Y.: Self-supervised feature learning for 3D medical images by playing a rubik’s cube. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11767, pp. 420–428. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32251-9_46