Real-World Multi-Domain Data Applications for Generalizations to Clinical Settings

07/24/2020 ∙ by Nooshin Mojab, et al. ∙ University of Illinois at Chicago

With the promising results of machine learning based models in computer vision, applications on medical imaging data have been increasing exponentially. However, generalization to complex real-world clinical data remains a persistent problem. Deep learning models perform well when trained on standardized datasets from artificial settings, such as clinical trials, but real-world data is different, and clinical translations are yielding varying results. The complexity of real-world applications in healthcare emanates from a mixture of different data distributions across multiple device domains, alongside the inevitable noise sourced from varying image resolutions, human errors, and the lack of manual gradings. In addition, healthcare applications not only suffer from the scarcity of labeled data, but also face limited access to unlabeled data due to HIPAA regulations, patient privacy, ambiguity in data ownership, and challenges in collecting data from different sources. These limitations pose additional challenges to applying deep learning algorithms in healthcare and clinical translation. In this paper, we utilize self-supervised representation learning methods, formulated effectively in transfer learning settings, to address limited data availability. Our experiments verify the importance of diverse real-world data for generalization to clinical settings. We show that by employing a self-supervised approach with transfer learning on a multi-domain real-world dataset, we can achieve a 16% improvement over supervised baselines when generalizing to a standardized dataset.







I Introduction

With successful applications of deep neural networks in computer vision, there has been a growing interest in utilizing these machine learning based models on real-world data, particularly in medical imaging. However, applications on real-world data are limited and challenging. Most existing deep learning based approaches have shown promising results by relying on the availability of standardized and adequately labeled datasets, whereas real-world data is characterized by variability in quality, machine type, setting, and source, lacking standardization and labels. Therefore, with the existence of complex real-world data across different clinical applications, and the significantly growing interest in employing deep learning models in medical imaging, it becomes crucial to assess the generalization capacity of such models on real-world data applications.

In this paper, we aim to assess the capacity of deep learning models in coping with real-world data and their generalization aspect by answering two main questions: (1) How well can deep learning based models cope with the complexity of real-world data comprised of multiple device domains versus standardized datasets with a single device domain? And (2) What is the role of real-world data in generalizing to a clinical setting? We answer these two questions by applying deep learning based models on real-world ophthalmic imaging data for the task of glaucoma detection.

Glaucoma is a complex disease that gradually leads to optic nerve damage, resulting in progressive, irreversible vision loss. Over 60 million people are diagnosed with glaucoma, encompassing more than 8 million cases with irreversible blindness [5]. The global incidence of glaucoma is anticipated to reach 111.8 million by 2040 [14].

Ocular imaging is one of the main modalities used for glaucoma screening. Among the imaging modalities most commonly used for glaucoma, digital Fundus photos are heavily utilized, given their ease of acquisition and their ability to visualize the disc and cup regions for noninvasive evaluation of the optic nerve head. Assessment of the optic nerve head is based on measuring the optic disc and optic cup regions in Fundus photos and calculating the cup-to-disc ratio (CDR). A sample of a glaucomatous and a non-glaucomatous optic nerve, with localization of the optic disc and optic cup regions, is illustrated in Fig. 1. All types of glaucoma involve glaucomatous optic neuropathy, and structural progression is characterized by optic nerve thinning, resulting in larger CDR measurements. Glaucoma is suspected when the CDR exceeds a certain threshold.
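The CDR computation described above can be sketched as a short function. This is an illustrative sketch only: the function names are ours, and the 0.6 suspicion threshold below is a hypothetical placeholder, not a value taken from this paper or a clinical recommendation.

```python
def cup_to_disc_ratio(cup_height: float, disc_height: float) -> float:
    """Vertical cup-to-disc ratio: cup height divided by disc height."""
    if disc_height <= 0:
        raise ValueError("disc height must be positive")
    return cup_height / disc_height

# Hypothetical screening threshold, purely for illustration.
SUSPECT_THRESHOLD = 0.6

def is_glaucoma_suspect(cup_height: float, disc_height: float,
                        threshold: float = SUSPECT_THRESHOLD) -> bool:
    """A larger CDR suggests optic nerve thinning; flag when it exceeds the threshold."""
    return cup_to_disc_ratio(cup_height, disc_height) >= threshold
```

For example, a cup height of 4.2 and a disc height of 6.0 give a CDR of 0.7, which this sketch would flag.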

Fig. 1: Samples of full Fundus images from I-ODA dataset. The top row represents a glaucomatous Fundus of the eye and the bottom row shows a non-glaucomatous Fundus. The area of the optic nerve head is zoomed in for better visualization of optic disc and cup. CDR is calculated by dividing the cup height (in green arrows) over disc height (in bright blue arrows).

Most approaches used to detect the presence of glaucoma in Fundus photos fall into two main categories. (i) Segmentation based approaches utilizing the CDR measurement to detect glaucoma. These approaches involve localizing the disc and cup regions and identifying the presence of glaucoma by measuring the CDR in Fundus photos [8], [13], [17]. These methods require segmented labeled data, in turn requiring the expertise of at least two ophthalmologists to manually mark the region of interest (ROI) in each image. The high variability among graders and low reproducibility limit the application of such approaches. Some approaches propose to address the scarcity of segmented labeled data using a multi-task framework [10]; however, their optimal performance is achieved when labeled data for both tasks are available from the same domain. (ii) Classification based approaches in which Fundus photos are fed directly to a neural network for detecting glaucoma [4], [1], [12]. These approaches mostly employ specifically designed Convolutional Neural Networks (CNNs) trained on small-scale standardized datasets, limiting their applications to other clinical settings.

Acquiring labeled data in healthcare is an ongoing challenge, as the labeling process is not only time-consuming and expensive but also requires at least two expert graders. Additionally, medical applications suffer from limited access to even unlabeled data due to HIPAA regulations, patient privacy, ethical considerations, data ownership, and challenges in collecting data from different sources.

Transfer learning and self-supervised methods have shown promising results in exploiting unlabeled data to learn visual representations [2], [15] in applications with data limitations, such as healthcare. Although self-supervised learning could be a potential solution for applications with data shortages, most of the proposed methods rely on specific design choices for network architecture and predictive pretext tasks [6], [11], [7], [16]. Clinical applications such as glaucoma detection mostly involve complex multi-domain datasets, which limits the employment of methods specifically designed for a particular task or dataset. The recent approach proposed in [3], currently the state of the art in self-supervised learning, attempts to avoid the complexity of specific design choices by incorporating a broad family of augmentations and a contrastive loss into a simple off-the-shelf architecture, ResNet-50. The composition of data augmentations incorporated in a simple design choice in [3] could lead to learning more generalizable visual features that potentially transfer better to other settings. This inspired us to apply the method proposed in [3] to a real-world clinical application, glaucoma detection. We believe that real-world clinical applications with diverse multi-domain datasets can benefit from a broad composition of augmentations embedded in a simple framework. This could lessen the sensitivity of the model to domain-specific information in the data to some degree and hence improve the generalization capacity of the model.

We propose to formulate our problem in a transfer learning framework where we employ the self-supervised approach proposed in [3] for visual feature extraction, to both alleviate the scarcity of data and improve generalization. We evaluate our work by performing extensive experiments on the task of glaucoma detection using real-world datasets. Due to the lack of publicly available large and diverse datasets, we employ a subset of a private clinical ophthalmology dataset referred to as I-ODA, a diverse multi-domain dataset capturing the complexity characteristic of real-world data.

The main contributions of our work are as follows:

  • We effectively apply the self-supervised representation method via transfer learning on multi-domain real-world data applications and show its superiority over supervised approaches.

  • We show that leveraging self-supervised visual representation learning via transfer learning could be a potential solution to the limited data in healthcare applications.

  • Our experiments show that deep neural networks have a limited capacity for coping with real-world datasets compared to standardized datasets.

  • Our experiments demonstrate that training deep learning models with diverse real-world data generalizes better to clinical settings.

II Method

II-A Problem Formulation

The training data in our problem is comprised of N samples denoted as D = {(x_i, y_i)}_{i=1}^N, where y_i ∈ {0, 1} indicates the binary label of input x_i, with values of 1 and 0 representing the diseased and non-diseased class respectively. Given D, our goal is to learn a binary classifier function f: X → Y parameterized by θ. We define the following functions

f_e: X → H,    f_d: H → Y

where f_e, parameterized by θ_e, represents the encoder function mapping a given input x to the latent feature encoding h, and f_d, parameterized by θ_d, represents the decoder function mapping the feature encoding h to the label space. Given the input x, the function f can be decomposed such that

f(x) = f_d(f_e(x))

where θ = (θ_e, θ_d). Given the input x, the binary classifier f estimates the probability of the input image being diseased. Both the encoder and decoder functions f_e and f_d are modeled using neural networks. In particular, the encoder function f_e is modeled by the Convolutional Neural Network proposed in [3], and f_d is taken to be a simple linear layer. The loss function of the model comprises a contrastive loss and a classification loss, which are explained in the following sections.

II-B Contrastive Loss

To model our encoder function we employ the approach proposed in [3], in which representations are learned via a contrastive loss in the latent space of two augmented views of the same data input. More specifically, this approach comprises three key components: (i) Given the input x, two augmented views x̃_i and x̃_j are obtained by applying stochastic augmentations, (ii) The base encoder f_e maps the augmented inputs x̃_i and x̃_j to feature maps h_i and h_j, (iii) A projection head g maps the feature maps h_i and h_j to latent vectors z_i and z_j respectively. The encoder function f_e and projection head g are both modeled by neural networks. Given two latent vectors z_i and z_j, the contrastive loss is defined as

ℓ(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1, k≠i}^{2N} exp(sim(z_i, z_k)/τ) ]

where sim(u, v) = uᵀv / (‖u‖ ‖v‖) represents the cosine similarity and τ represents a temperature parameter. Given a batch of N samples, the application of the two augmentations results in 2N samples, where pairs (i, j) derived from the same input are accounted as positive samples, and the remaining 2(N − 1) pairs in the augmented batch are accounted as negative. Therefore the overall loss is defined as

L_con = (1/2N) Σ_{k=1}^{N} [ℓ(2k−1, 2k) + ℓ(2k, 2k−1)]
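The contrastive (NT-Xent) loss above can be sketched in a few lines of NumPy. This is an illustrative sketch of the loss from [3], not the authors' implementation; the function and variable names are ours, and rows 2k and 2k+1 of the input are assumed to hold the two augmented views of sample k.

```python
import numpy as np

def nt_xent_loss(z: np.ndarray, tau: float = 0.5) -> float:
    """NT-Xent loss over a batch of projected embeddings.

    z: array of shape (2N, d); rows 2k and 2k+1 are the two views of sample k.
    """
    # Normalize rows so dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                      # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)           # exclude self-similarity (k != i)
    pos = np.arange(z.shape[0]) ^ 1          # positive partner: (0,1), (2,3), ...
    # log softmax over each row, evaluated at the positive partner
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(z.shape[0]), pos].mean())
```

As a sanity check, a batch whose positive pairs are identical and whose negatives are orthogonal yields a lower loss than one where a negative is closer than the positive.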
II-C Classification Loss

Given the training set of N samples, the loss function over the samples is as follows

L_cls = (1/N) Σ_{i=1}^{N} ℓ_i

where ℓ_i represents the classification loss for sample i, defined as the cross-entropy between the model's estimation and the ground-truth label

ℓ_i = −[ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

where ŷ_i indicates the model's estimation of sample i being diseased and y_i represents its corresponding ground-truth label. We aim to perform the task of detecting glaucoma given a Fundus photo. Therefore our dataset is a collection of Fundus photos labeled as either glaucoma (y = 1) or non-glaucoma (y = 0).
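The per-sample cross-entropy above averages directly over a batch; a minimal NumPy sketch (names ours, with a small epsilon added for numerical stability):

```python
import numpy as np

def binary_cross_entropy(y_true: np.ndarray, y_hat: np.ndarray,
                         eps: float = 1e-12) -> float:
    """Mean cross-entropy between ground-truth labels y_true (0/1) and
    predicted probabilities y_hat of the diseased (glaucoma) class."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_hat)
                          + (1 - y_true) * np.log(1 - y_hat)))
```

Confident, correct predictions give a lower loss than uncertain ones, as expected from the formula.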

II-D Transfer Learning

We propose the solution to our task by formulating our problem in a transfer learning framework. For the purpose of this paper, we employ the pretrained encoder networks obtained via the approach proposed in [3], where the projection head is discarded and the encoder function and feature maps are used to perform the binary classification task of glaucoma detection. We take the base encoder to be ResNet-50, as suggested in [3], and consider two settings for transfer learning across our datasets: (i) using the fixed features extracted from a pretrained encoder network to train a linear classifier on top of the frozen base network, and (ii) fine-tuning a fraction of the network, utilizing the weights of the pretrained network as initialization.
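The first setting (linear evaluation on frozen features) can be sketched as plain logistic regression trained by gradient descent. This is an illustrative sketch only: in the actual pipeline the features would come from the frozen self-supervised encoder, whereas here they are synthetic stand-ins, and all names and hyperparameters are ours.

```python
import numpy as np

def linear_evaluation(features: np.ndarray, labels: np.ndarray,
                      lr: float = 0.1, epochs: int = 200):
    """Train a logistic-regression head on frozen encoder features.
    Only the linear layer's weights are learned; the encoder that
    produced `features` stays fixed."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # sigmoid probabilities
        grad = p - labels                               # dL/dlogits for cross-entropy
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-in for frozen-encoder features (hypothetical data).
rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 1.0, size=(100, 16))  # "non-glaucoma" features
x1 = rng.normal(+1.0, 1.0, size=(100, 16))  # "glaucoma" features
X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(100), np.ones(100)])
w, b = linear_evaluation(X, y)
acc = ((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
```

On this toy data the linear head separates the two classes almost perfectly, which is the point of the setting: all representational power lives in the frozen encoder.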

III Experiments

III-A Dataset

I-ODA (for questions related to accessing the I-ODA database, please contact the author, Joelle Hallak) has been created from imaging data belonging to patients who visited the Illinois Eye and Ear Infirmary of the University of Illinois at Chicago (UIC). This dataset was created by assigning unlabeled images to categories based on the imaging devices, utilizing metadata information and a hybrid approach of machine learning algorithms and manual search. After labeling images into one of the categories, patients were labeled with their corresponding diagnosis using billing information (ICD codes). We isolated glaucoma and non-glaucoma patients from this dataset; non-glaucoma patients are those diagnosed with neither glaucoma nor glaucoma suspect, with no potential damage to the optic nerve head. Among the categories, we isolated Fundus images for the purpose of our experiments in this paper, which we refer to as the ODA-G dataset. Each Fundus image can be generated by a different imaging device. The statistics of the image data and the corresponding imaging device distribution in the ODA-G dataset are illustrated in Fig. 2.

Fig. 2: The number of Fundus photos per imaging device. The majority of Fundus images are generated by devices A and B.

Imaging data in the ODA-G dataset is spread across different imaging devices, where devices A, B, C, and D comprise the majority of the data, while the other devices are responsible for the remainder. The data distribution for images generated by each device can be different, and hence we can have a dataset that is a mixture of different domains. A snapshot of samples of Fundus photos from the major imaging devices A, B, C, and D is illustrated in Fig. 3.

(a) Fundus samples from device A
(b) Fundus samples from device B
(c) Fundus samples from device C
(d) Fundus samples from device D
Fig. 3: A snapshot of samples of Fundus photos generated by different devices. The top row illustrates a subset of images from devices A and B from left to right. The bottom row illustrates a subset of Fundus generated by devices C and D from left to right.

As can be seen, images across different devices can have different color distributions, different shapes, and different positioning of the cup and disc regions. We refer to imaging devices as different domains in our dataset. In order to analyze the complexity of coping with real-world data and its generalization capacity, we consider two subsets of data: (i) Fundus images generated from only one imaging device; we isolated the Fundus photos from device A, which we refer to as ODA-A, and (ii) ODA-G, which contains Fundus images generated by all devices used for glaucoma screening in our dataset. Both subsets are generated from a real-world setting; however, ODA-A comprises single-domain data, resembling the properties of a standardized dataset to some extent, while ODA-G comprises multi-domain data, capturing the complex aspect of diverse real-world data. We aim to answer two main questions using these two subsets: (1) How do deep learning models cope with the complexity of real-world data as opposed to a standardized dataset? (2) Is training the model on real-world data important for generalization?

Our goal was to also validate our results on an external dataset. However, to the best of our knowledge, the availability of public datasets with adequate samples of full Fundus photos for glaucoma detection is very limited. Therefore we performed all of our experiments on our datasets, ODA-G and ODA-A.

III-B Experimental Setting

III-B1 Base Encoder

The base encoder and projection function are modeled with ResNet-50 and an MLP with Rectified Linear Unit (ReLU) activation respectively. We employ networks pretrained on ImageNet and CIFAR-10 in our experiments. The CIFAR-10 based encoder employs a simpler, ResNet-like architecture composed of main residual blocks, each consisting of two stacked convolutional layers followed by batch normalization. The ImageNet based encoder employs ResNet-50 with three hidden layer widths (1×, 2×, 4×) [3], [9], which we utilize in our experiments. When using the ImageNet pretrained network, random crop and resize with random flip, color distortions, and Gaussian blur are employed for data augmentation. When CIFAR-10 is used, only random crop and resize with random flip and color distortions are employed for augmentation [3].

III-B2 Transfer Learning

We evaluate our transfer learning setting across the two datasets ODA-G and ODA-A by (1) learning a linear classifier on top of the pretrained network and (2) fine-tuning a fraction of the network.

In the first setting, a logistic regression classifier is trained on top of the frozen base encoder network. The extracted features from the pretrained network are used to perform binary classification for detecting glaucoma in a given Fundus input. In this scenario, no data augmentation is applied. The number of epochs, batch size, and learning rate are selected via grid search.

In the second setting, we fine-tune the base network using the pretrained network's weights as initialization. We fine-tune a fraction of the network, with the learning rate and batch size selected via grid search and the rest of the hyperparameters set to the default values reported in [3]. For the optimizer, we use SGD with Nesterov momentum as well as Adam.

In all of our experiments, input images were resized to match the pretrained network's expected input resolution, both for the ImageNet based and the CIFAR-10 based pretrained networks. The data is split into training and testing sets. Hyperparameter tuning for each dataset is performed on a validation set chosen randomly as a subset of the training set; after finding appropriate values for the hyperparameters, the whole training set is used for training. Notably, each patient in our dataset can have more than one Fundus photo, taken in multiple exam sessions. To prevent leakage of information from the test set into training, we split the data by patient. Finally, the result on the test data is reported in terms of the accuracy metric.
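The patient-level split described above can be sketched as follows. This is a minimal illustration, assuming records are (patient_id, image) pairs; the function name and split fraction are ours.

```python
import random
from collections import defaultdict

def split_by_patient(records, test_frac=0.2, seed=0):
    """Split (patient_id, image) records so that all images from a given
    patient land in the same partition, preventing test-set leakage."""
    by_patient = defaultdict(list)
    for pid, image in records:
        by_patient[pid].append(image)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)          # deterministic shuffle
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [(p, im) for p in patients[n_test:] for im in by_patient[p]]
    test = [(p, im) for p in sorted(test_ids) for im in by_patient[p]]
    return train, test
```

Splitting by image instead of by patient would let near-duplicate photos of the same eye appear on both sides of the split, inflating test accuracy.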

III-B3 Baselines

We compare our results with fully supervised methods. Previous approaches proposed for glaucoma detection mostly employ a Convolutional Neural Network (CNN) trained on small, unique datasets. Since most of these studies do not release their code publicly, we simulate the general approach by developing a CNN model and consider that as our supervised baseline. We employ two approaches for our supervised baselines. (1) For the sake of comparison with our proposed approach, we employ ResNet-50, matching the base encoder network architecture, followed by an MLP with a hidden layer, batch normalization, ReLU activation, and dropout; a sigmoid activation function is applied on the last layer to get the final output. (2) A simpler CNN specifically designed for our dataset. This network is composed of convolutional blocks, each consisting of convolutional layers with batch normalization, ReLU activation, and max pooling. Two fully connected layers, each followed by batch normalization, ReLU activation, and dropout, are further applied. A sigmoid activation function is used on the final layer to get the output of the model.

We consider both random and ImageNet weights as initialization for the ResNet-50 model. We consider our supervised baselines in both the absence and presence of data augmentation. We explore color distortion and random flip for our augmentation.

III-C Experimental Results

In this section we attempt to answer our two main questions through extensive experiments performed on the task of glaucoma detection: (1) How well can deep learning based models cope with the complexity of real-world data comprised of multiple device domains versus standardized datasets with a single device domain? and (2) What is the role of real-world data in generalizing to a clinical setting?

We first demonstrate the effectiveness of combining a self-supervised method with transfer learning on applications with limited data compared to fully supervised models. Further, to address our two main goals, we assess the capacity of neural networks in dealing with complex real-world data by comparing the performance of the model on a multi-domain real-world dataset, ODA-G, as opposed to a single-domain dataset ODA-A, resembling a standardized and simpler dataset. Additionally, we assess the role of learning with real-world data for generalization to a clinical setting (glaucoma detection).

We also analyze the effect of data augmentation, training time, and fraction of fine-tuning on the performance of the model.

III-C1 Comparison against fully supervised approaches

In this section, we demonstrate the result of our work on both the ODA-G and ODA-A datasets for the task of glaucoma detection using transfer learning in both linear evaluation and fine-tuning settings. We first evaluate our approach by learning a linear classifier using self-supervised representations learned by a pretrained network, and compare the result with supervised baselines. To potentially improve performance compared to using the fixed extracted features, we further investigate fine-tuning the base encoder network using the weights of the pretrained network as initialization; we experiment with fine-tuning a fraction of the network. For self-supervised pretrained networks, we consider both the ImageNet and CIFAR-10 based networks proposed in [3]. For the ImageNet based pretrained network, we report results for ResNet-50 with all three widths (1×, 2×, 4×) [3], [9]. For the CIFAR-10 based network, we report the result on the simpler ResNet. We report the best result achieved for each combination of dataset and model in Table I.

Model ODA-G ODA-A Weights
Supervised Baselines
ResNet-50 (1) 59.93 83.57 Random
ResNet-50 (1) (+Augmentation) 82.61 86.44 Random
CNN 64.67 88.35 Random
CNN (+Augmentation) 82.93 90.43 Random
self-supervised TR+Linear
ResNet 72.87 82.40 CIFAR-10
ResNet-50 (1) 80.26 88.92 ImageNet
ResNet-50 (2) 83.84 91.00 ImageNet
ResNet-50 (4) 84.30 92.34 ImageNet
self-supervised TR+Fine-tuning
ResNet 82.05 86.04 CIFAR-10
ResNet-50 (1) 83.28 90.35 ImageNet
ResNet-50 (2) 83.14 90.43 ImageNet
ResNet-50 (4) 82.05 90.11 ImageNet
TABLE I: Comparison of employing self-supervised learned representations via transfer learning (TR) in linear evaluation and fine-tuning settings against fully supervised baselines.

The result depicted in Table I shows the superiority of transfer learning using self-supervised learned representations over fully supervised approaches. We achieve our best result with ResNet-50 (4). For supervised baselines, we only report the result for ResNet-50 with random initialization as we did not observe any significant change between the two weight initializations (i.e. random and ImageNet weights). We can also see that employing wider networks can further improve the performance of the linear classifier on the test set.

We observed that fine-tuning the network does not significantly change the performance of the model compared to training a linear classifier. Moreover, the result shown in Table I indicates that employing ImageNet pretrained weights compared to CIFAR-10 achieves a better result on both datasets ODA-G and ODA-A. This result is expected as training with a more diverse dataset comprised of a broader range of image categories, helps with learning more generalizable features.

III-C2 Medical domains benefit from transfer learning using self-supervised learned features

Among the supervised baselines in Table I, the CNN (+Augmentation) method, which is specifically designed for our dataset and task, achieves a better result than the off-the-shelf ResNet networks. This suggests that networks specifically designed for each dataset-task combination are usually crucial to the success of supervised approaches, which limits their applicability and generalization capacity to some extent. The superiority of our framework over the supervised baselines shows that we can avoid the complexity of design choices for each particular task by simply using an off-the-shelf network. This result is particularly important in medical imaging, where applications have limited data for training robust supervised models.

III-C3 Neural Networks on Real-World Data

In this section, we assess the capacity of neural networks in coping with real-world datasets versus the standardized datasets commonly used by most deep learning algorithms. We analyze our experimental results on the two datasets ODA-G and ODA-A. As a reminder, ODA-G comprises data from multiple imaging devices, forming a complex multi-domain dataset, while ODA-A comprises single-domain data, representing a less challenging, standardized dataset. Naturally, the mixture of different data distributions emanating from different imaging devices in the ODA-G dataset poses a major challenge for training deep neural networks. From Table I we observe that experiments on ODA-A significantly outperform the experiments on ODA-G in all three settings: supervised, linear evaluation, and fine-tuning. The performance gap is particularly noticeable in supervised settings, as their success usually relies on the availability of large standardized labeled datasets. This experiment verifies that deep learning models do not perform as well on complex real-world datasets as on standardized datasets.

III-C4 Training deep learning models with diverse real-world data generalizes better to clinical settings.

To assess the role of training with complex real-world data in generalization to a clinical setting, we design an experiment where we form four pairs of (ODA-G, ODA-G), (ODA-G, ODA-A), (ODA-A, ODA-G) and (ODA-A, ODA-A) datasets. We train on the first element of each pair and evaluate on the second element. The result of this experiment is shown in Table II.

Training Data Testing Data Test Accuracy(%)
ODA-G ODA-G 84.30
ODA-G ODA-A 90.35
ODA-A ODA-G 76.53
ODA-A ODA-A 92.34
TABLE II: Generalization across pairs of datasets employing self-supervised learned representation via transfer learning.

As the results suggest, if we only train the model on one small, unique dataset, ODA-A, and evaluate on an extremely complex real-world dataset, ODA-G, we achieve the worst result (76.53%, Table II). On the other hand, when we train the model using a diverse multi-domain real-world dataset, ODA-G, and evaluate it on a smaller, single-domain dataset, ODA-A, we achieve a promising result (90.35%, Table II). This indicates that even though complex data makes the learning process harder, it leads the model towards learning more generalizable features. It is not surprising that the (ODA-A, ODA-A) experiment achieves the best result, as this is where neural networks usually perform best. However, the result obtained from the (ODA-G, ODA-A) experiment is competitive with that of the (ODA-A, ODA-A) experiment, indicating the advantage of learning with real-world data. The overall results validate the crucial role of real-world data in generalizing to clinical settings. If the network trains on small, unique datasets, it fails to generalize to other domains. Hence we need complex, diverse datasets that capture aspects of real-world data to generalize to other domains, especially in clinical settings.

Self-supervised learning approaches generalize better on real-world data. We would like to assess the capacity of self-supervised approaches in generalization compared to supervised methods. We perform the same experiments but this time under supervised settings. The result is shown in Table III.

Training Data Testing Data Test Accuracy (%)
ODA-G ODA-G 82.93
ODA-G ODA-A 74.00
ODA-A ODA-G 69.19
ODA-A ODA-A 90.43
TABLE III: Generalization across pairs of datasets employing supervised approaches.

The observations from Table III support the result of the previous section: training with a real-world dataset, ODA-G, generalizes better to the ODA-A dataset than the reverse experiment (ODA-A, ODA-G). However, the performance gap between the (ODA-G, ODA-G) and (ODA-G, ODA-A) experiments, and between (ODA-A, ODA-A) and (ODA-A, ODA-G), is more noticeable compared to the results in Table II. This may indicate that supervised approaches perform best when trained and evaluated on the same dataset. Moreover, the comparison between the results in Tables II and III shows that (ODA-G, ODA-A) performs poorly under the supervised approach, achieving only 74.00% accuracy, while under the self-supervised approach it achieves 90.35%, more than a 16% improvement over the supervised setting. This verifies the superior capacity of self-supervised approaches to generalize across different device domains, indicating another advantage of self-supervised representation learning methods.

III-C5 Performance Analysis

In this section, we first analyze the role of data augmentation on generalization. Then we evaluate the effect of training time and fine-tuning of the network on the performance.

Data augmentation is crucial for generalization on real-world data. To assess the role of data augmentation in generalization, we explore the behavior of our supervised approach in both the presence and absence of data augmentation. For ODA-G we applied the composition of color distortion and random flip, and for ODA-A we only applied the random flip. As the results in Table I suggest, data augmentation can efficiently improve generalization on the test data, enhancing the capacity of the model to learn more generalizable features. This improvement is particularly noticeable for the ODA-G dataset containing multi-domain data, which could indicate that the model benefits the most from data augmentation when training on a diverse real-world dataset. We believe that the composition of data augmentations utilized in [3] also contributes significantly to the success of this method on our task.
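The flip-plus-color-distortion composition above can be sketched with NumPy image arrays. This is a toy sketch: the per-channel brightness scaling below is our simplification standing in for the color distortions of [3], and all names are illustrative.

```python
import numpy as np

def random_flip(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Horizontal flip with probability 0.5 (img is H x W x C, values in [0, 1])."""
    return img[:, ::-1] if rng.random() < 0.5 else img

def color_distort(img: np.ndarray, rng: np.random.Generator,
                  strength: float = 0.4) -> np.ndarray:
    """Simplified color distortion: random per-channel brightness scaling."""
    scale = 1.0 + rng.uniform(-strength, strength, size=(1, 1, img.shape[2]))
    return np.clip(img * scale, 0.0, 1.0)

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Compose the two stochastic transforms, as in the supervised baseline."""
    return color_distort(random_flip(img, rng), rng)
```

Each call draws fresh randomness, so repeated augmentation of the same Fundus image yields distinct views while preserving shape and valid pixel range.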

Diverse large-scale datasets benefit from longer training. We perform our experiments by increasing the number of training epochs while keeping the batch size fixed for each model, and report the accuracy on the testing set. Figs. 4 and 5 show plots of test accuracy versus training epochs for linear evaluation employing the ImageNet based ResNet-50 (1×, 2×, 4×) and the CIFAR-10 based ResNet respectively, on both the ODA-G and ODA-A datasets.

(a) ODA-G dataset
(b) ODA-A dataset
Fig. 4: Effect of the number of training epochs on testing accuracy employing ImageNet based pretrained network for ODA-G and ODA-A datasets.
Fig. 5: Effect of the number of training epochs on testing accuracy employing CIFAR-10 based pretrained network for ODA-G and ODA-A datasets.

As the plots in Fig. 4 suggest, performance improves as we increase the number of training epochs when employing the ImageNet based encoder network on the ODA-G dataset; however, increasing the number of epochs has the opposite effect on the ODA-A dataset. We observe a similar pattern when using the CIFAR-10 based encoder, as Fig. 5 suggests, although increasing the number of epochs has less effect on performance improvement on ODA-G than with the ImageNet based encoder. The overall result indicates that larger, more diverse datasets benefit more from longer training.

Additionally, we plot the effect of fine-tuning the network on test accuracy. Fig. 6 shows the result when fine-tuning the network using the CIFAR-10 based encoder network.

Fig. 6: The effect of fine-tuned network on test accuracy using CIFAR-10 based encoder network.

As the plot in Fig. 6 suggests, test accuracy improves as we fine-tune a larger portion of the network when training on the ODA-G dataset. Training on ODA-A also benefits from fine-tuning, especially when a larger fraction of the network is fine-tuned, but beyond that point we do not observe a significant change in performance. We achieved our best result in this experiment by fine-tuning the whole network when using the ODA-G dataset and a portion of the network when using ODA-A. We also observed that the model does not benefit substantially from fine-tuning the network initialized with ImageNet weights.

IV Conclusion

In this paper, we utilized self-supervised visual representation learning methods, effectively formulated in transfer learning settings, to alleviate the shortage of data in medical imaging applications and improve the generalization capacity of the model. We verified our results by performing a glaucoma detection task on a real-world ophthalmic data application. We showcased the importance of learning with real-world data for generalization through extensive experiments on a multi-domain real-world dataset versus a single-domain standardized dataset. Additionally, we showed that without learning on complex real-world data, deep learning models cannot generalize well to clinical settings.


References

  • [1] Q. Abbas (2017) Glaucoma-deep: detection of glaucoma eye disease on retinal fundus images using deep learning. International Journal of Advanced Computer Science and Applications 8 (6), pp. 41–45. Cited by: §I.
  • [2] Y. Bengio (2012) Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML workshop on unsupervised and transfer learning, pp. 17–36. Cited by: §I.
  • [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §I, §I, §II-A, §II-B, §II-D, §III-B1, §III-B2, §III-C1, §III-C5.
  • [4] X. Chen, Y. Xu, D. W. K. Wong, T. Y. Wong, and J. Liu (2015) Glaucoma detection based on deep convolutional neural network. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 715–718. Cited by: §I.
  • [5] B. M. Davis, L. Crawley, M. Pahlitzsch, F. Javaid, and M. F. Cordeiro (2016) Glaucoma: the retina and beyond. Acta neuropathologica 132 (6), pp. 807–826. Cited by: §I.
  • [6] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pp. 1422–1430. Cited by: §I.
  • [7] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: §I.
  • [8] F. Khan, S. A. Khan, U. U. Yasin, I. ul Haq, and U. Qamar (2013) Detection of glaucoma using retinal fundus images. In The 6th 2013 Biomedical Engineering International Conference, pp. 1–5. Cited by: §I.
  • [9] A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1920–1929. Cited by: §III-B1, §III-C1.
  • [10] N. Mojab, V. Noroozi, P. Yu, and J. Hallak (2019) Deep multi-task learning for interpretable glaucoma detection. In 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), pp. 167–174. Cited by: §I.
  • [11] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §I.
  • [12] U. Raghavendra, H. Fujita, S. V. Bhandary, A. Gudigar, J. H. Tan, and U. R. Acharya (2018) Deep convolution neural network for accurate diagnosis of glaucoma using digital fundus images. Information Sciences 441, pp. 41–49. Cited by: §I.
  • [13] A. Sevastopolsky (2017) Optic disc and cup segmentation methods for glaucoma detection with modification of u-net convolutional neural network. Pattern Recognition and Image Analysis 27 (3), pp. 618–624. Cited by: §I.
  • [14] Y. Tham, X. Li, T. Y. Wong, H. A. Quigley, T. Aung, and C. Cheng (2014) Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology 121 (11), pp. 2081–2090. Cited by: §I.
  • [15] K. Weiss, T. M. Khoshgoftaar, and D. Wang (2016) A survey of transfer learning. Journal of Big data 3 (1), pp. 9. Cited by: §I.
  • [16] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European Conference on Computer Vision, pp. 649–666. Cited by: §I.
  • [17] J. G. Zilly, J. M. Buhmann, and D. Mahapatra (2015) Boosting convolutional filters with entropy sampling for optic cup and disc image segmentation from fundus images. In International Workshop on Machine Learning in Medical Imaging, pp. 136–143. Cited by: §I.