Automatic detection of multiple pathologies in fundus photographs using spin-off learning

07/22/2019 · by Gwenolé Quellec, et al. · Inserm

In recent decades, large datasets of fundus photographs have been collected in diabetic retinopathy (DR) screening networks. Through deep learning, these datasets were used to train automatic detectors for DR and a few other frequent pathologies, with the goal of automating screening. One challenge has limited the adoption of such systems so far: automatic detectors ignore rare conditions that ophthalmologists currently detect. To address this limitation, we propose a new machine learning (ML) framework, called spin-off learning, for the automatic detection of rare conditions. This framework extends convolutional neural networks (CNNs), trained for frequent conditions, with an unsupervised probabilistic model for rare condition detection. Spin-off learning is based on the observation that CNNs often perceive photographs containing the same anomalies as similar, even though these CNNs were trained to detect unrelated conditions. This observation, made with the t-SNE visualization tool, led us to include t-SNE in our probabilistic model. Spin-off learning supports heatmap generation, so the detected anomalies can be highlighted in images for decision support. Experiments on a dataset of more than 160,000 screening examinations from the OPHDIAT screening network show that spin-off learning can detect 37 conditions, out of 41, with an area under the ROC curve (AUC) greater than 0.8 (average AUC: 0.938). In particular, spin-off learning significantly outperforms other candidate ML frameworks for detecting rare conditions: multitask learning, transfer learning and one-shot learning. We expect these richer predictions to trigger the adoption of automated eye pathology screening, which will revolutionize clinical practice in ophthalmology.


1 Introduction

According to the World Health Organization, 285 million people are visually impaired worldwide, and preventable causes account for 80% of this burden (Pascolini and Mariotti, 2012). Early detection and management of ocular pathologies is one major strategy to prevent vision impairment. Following the recent success of deep learning, many automatic screening systems based on fundus photography have been proposed. Diabetic retinopathy (DR) was historically the first pathology targeted by these systems (Gulshan et al., 2016; Abràmoff et al., 2016; Quellec et al., 2017; Raju et al., 2017; Gargeya and Leng, 2017; Quellec et al., 2019; Nielsen et al., 2019). The reason is that large datasets of images have been collected within DR screening programs for diabetic patients over the past decades (Massin et al., 2008; Cuadros and Bresnick, 2009): those images were interpreted by human readers, which allows efficient training of supervised deep learning classifiers. Automatic screening systems were also proposed for glaucoma (Li et al., 2018; Shibata et al., 2018; Christopher et al., 2018; Phan et al., 2019; Ahn et al., 2019; Diaz-Pinto et al., 2019) and age-related macular degeneration (AMD) (Matsuba et al., 2018; Pead et al., 2019), the other two major sight-threatening pathologies in developed countries. Other pathologies such as retinopathy of prematurity (Wang et al., 2018) have also been targeted. A few studies also addressed multiple-pathology screening (Keel et al., 2018; Choi et al., 2017). Ting et al. (2017) thus proposed to detect AMD and glaucoma, in addition to DR, in DR screening images. The motivation is that diabetic patients, targeted by DR screening programs, may also suffer from AMD or glaucoma: ophthalmologists may not be willing to replace their interpretations with automatic interpretations if other sight-threatening pathologies are ignored.

In this study, we propose to go one step further and detect all conditions annotated by human readers in DR screening reports. In the OPHDIAT screening program (Massin et al., 2008), for instance, this represents 41 conditions. Targeting those conditions has become possible because more than 160,000 screening examinations (about 760,000 images) have been performed so far. Yet some of these conditions are still very rare and appear in fewer than ten screening reports: this impacts the type of machine learning (ML) strategy to employ. In particular, training a specific deep learning model for each of these conditions is prohibitive, even through transfer learning (Cheplygina, 2019). Targeting 41 conditions is a big leap compared to the state of the art. Choi et al. (2017) focused on the classification of 10 pathologies, but not in a screening context: the goal was to differentiate pathologies, not to detect them in a large population. Fauw et al. (2018) mention that they target 53 “key diagnoses” in optical coherence tomography, but these diagnoses are not listed and the detection performance is not reported: the main goal was to propose automatic referral decisions. We expect this additional information to facilitate the adoption of automatic screening.

This paper presents the ML solution we propose to address the challenge of detecting rare conditions. The genesis of this framework was the use of t-distributed stochastic neighbor embedding (t-SNE) (Maaten and Hinton, 2008) to visualize what convolutional neural networks (CNNs), trained to detect DR in the OPHDIAT dataset (Quellec et al., 2019), have learnt. We observed that many conditions unrelated to DR were clustered in feature space, even though the models were only trained to detect DR. This suggests that CNNs perform differential diagnosis to detect DR. We hypothesized that CNNs trained to detect several frequent conditions simultaneously could strengthen this phenomenon further. Therefore, in the proposed framework, a standard deep learning classifier is trained to detect frequent conditions, and simple probabilistic models are derived from these deep learning models to detect rare conditions. One specificity of this framework, called spin-off learning, is that the probabilistic models rely on t-SNE. Like standard deep learning models, these models allow heatmap generation for decision support. Spin-off learning combines ideas from transfer learning, multitask learning and one-shot learning, while outperforming each of these frameworks individually, as demonstrated in this paper.

The paper is organized as follows. ML frameworks related to spin-off learning are presented in section 2. Spin-off learning is described in section 3. Experiments in the OPHDIAT dataset are reported in section 4. We end with a discussion and conclusions in section 5.

2 Related Machine Learning Frameworks

A well-known solution for dealing with data scarcity is transfer learning (Cheplygina, 2019). In transfer learning, an initial classification model is trained on a large dataset, such as ImageNet (1.2 million images, http://www.image-net.org), to perform unrelated tasks. Then, this model is fine-tuned on the dataset of interest, to detect a target condition. The idea is that parts of the feature extraction process, such as edge detection, are common to many computer vision tasks and can therefore be reused, with or without modifications. This approach has become the leading strategy in medical image analysis (Litjens et al., 2017).

Another solution to this problem is multitask learning (Caruana, 1997). The difference with transfer learning is that one learns to address multiple tasks simultaneously rather than sequentially. In multitask learning, auxiliary tasks are usually chosen because training labels are abundant or not needed, unlike the target task (Zhang et al., 2014; Mordan et al., 2018). Multitask learning can thus be used to train a unique detector for multiple (both rare and frequent) conditions (Guendel et al., 2019): detecting frequent conditions can be regarded as an auxiliary task, for the main task of detecting rare conditions.

Finally, spin-off learning is related to one-shot learning (Fei-Fei et al., 2006). In both solutions, a probabilistic model is used to define a classifier for a new category, based on classification models trained for auxiliary categories, using a very small number of training examples. The methodological differences between one-shot learning and spin-off learning stem from the different types of features considered: a sparse image representation (with local features from the pre-deep-learning era) versus a holistic image representation (derived from a classification model). A few variations on one-shot learning (or few-shot learning) were proposed in the context of deep learning. These solutions rely on specific networks, such as Siamese networks (Koch et al., 2015; Shyam et al., 2017), matching networks (Vinyals et al., 2016) or relation networks (Sung et al., 2018), with the aim of comparing two images and deciding whether or not they belong to the same category. This approach is very different from spin-off learning.

3 Spin-off Learning

As illustrated in Fig. 1, spin-off learning can be summarized as follows. A multitask detector for frequent conditions is trained first (see section 3.2). Next, a probabilistic detection model is defined for each rare condition (see sections 3.3 and 3.4). Then, new images can be processed: predictions are computed for both frequent and rare conditions (see section 3.5) and heatmaps are generated to document predictions (see section 3.6).

Figure 1: Spin-off learning pipeline.

3.1 Notations

Let $\mathcal{D}$ denote an image dataset where the presence or absence of $N$ conditions has been annotated by one or multiple human readers for each image $I_i \in \mathcal{D}$. Let $\{c_n\}_{n=1,\dots,N}$ denote these conditions. Let $y_{i,n}$ denote a label indicating the presence ($y_{i,n} = 1$) or absence ($y_{i,n} = 0$) of condition $c_n$ in image $I_i$ according to experts. Let $f_n$ denote the frequency (the raw count) of condition $c_n$ in dataset $\mathcal{D}$: $f_n = \sum_i y_{i,n}$. Conditions are sorted by decreasing frequency order: $f_1 \geq f_2 \geq \dots \geq f_N$.

We assume that dataset $\mathcal{D}$ was divided into a learning (or training) subset $\mathcal{D}_L$, used for deep learning, a validation subset $\mathcal{D}_V$, and a test subset $\mathcal{D}_T$. These subsets are mutually exclusive ($\mathcal{D}_L \cap \mathcal{D}_V = \emptyset$, $\mathcal{D}_L \cap \mathcal{D}_T = \emptyset$, $\mathcal{D}_V \cap \mathcal{D}_T = \emptyset$). We also define a subset of “reference images” $\mathcal{D}_R \subset \mathcal{D}$, whose definition will vary depending on whether the algorithm is being validated or tested (see section 4.5).

All processing steps described hereafter are performed on preprocessed images (see section 4.3): for simplicity, $I_i$ denotes the preprocessed image in the following sections.

3.2 Deep Learning for Frequent Condition Detection

The first step in spin-off learning is to define a deep learning model for recognizing the $p$ most frequent conditions. This model relies on a convolutional neural network (CNN). This CNN is defined as a multilabel classifier; it is trained to minimize the following cross-entropy cost function $\mathcal{L}$:

$$\mathcal{L} = -\sum_{I_i \in \mathcal{D}_L} \sum_{n=1}^{p} \Big[ y_{i,n} \log \sigma\big(\omega_n(I_i)\big) + (1 - y_{i,n}) \log \big(1 - \sigma\big(\omega_n(I_i)\big)\big) \Big] \qquad (1)$$

where $\omega_n(I_i)$ denotes the output of the model for image $I_i$ and condition $c_n$. Through the logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$, this output is converted into a probability $P(c_n \mid I_i) = \sigma(\omega_n(I_i))$ (simply noted $P_n$ in the absence of ambiguity). The logistic function was selected as activation function since patients can have multiple conditions simultaneously, which is properly modeled by multiple logistic functions. We note that training this initial classification model, defined for frequent conditions, is a multitask learning problem: spin-off learning extends multitask learning to rare conditions as described hereafter.
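For illustration, here is a minimal NumPy sketch of the cost function of equation (1); the helper name `multilabel_bce`, the array shapes and the numerical-stability clipping are our assumptions, not details from the paper:

```python
import numpy as np

def multilabel_bce(omega, y):
    """Equation (1): binary cross-entropy summed over the p frequent
    conditions, with one logistic (sigmoid) unit per condition.
    omega: (batch, p) raw CNN outputs; y: (batch, p) binary labels."""
    P = 1.0 / (1.0 + np.exp(-omega))   # logistic function, P(c_n | I_i)
    P = np.clip(P, 1e-7, 1.0 - 1e-7)   # numerical stability (our addition)
    return -np.sum(y * np.log(P) + (1.0 - y) * np.log(1.0 - P))
```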

3.3 Feature Space Definition

Since a unique CNN is defined to detect the $p$ most frequent conditions, the penultimate layer of this CNN is very general: it extracts all features required to detect those $p$ conditions. We use the output of this layer to define a feature space in which the remaining conditions will be detected. Let $\mathcal{F}$ denote this feature space and $\phi(I_i) \in \mathcal{F}$ the projection of a given image $I_i$ in this space. The number of neurons in the penultimate layer of a classification CNN is generally high, for instance 2,049 for Inception-v3 (Szegedy et al., 2016) or 1,537 for Inception-v4 (Szegedy et al., 2017): let $d$ denote the dimension of this space ($d$ = 2,049 or 1,537 for instance). To address the curse of dimensionality, dimension reduction is performed afterwards.

We propose the use of t-SNE (Maaten and Hinton, 2008) for this purpose. In t-SNE, dimension reduction is unsupervised, but it is data-driven: it relies on the reference subset $\mathcal{D}_R$. Following common practice, a two-step procedure was adopted:

  • A first reduction step relies on principal component analysis (PCA), which transforms the initial feature space $\mathcal{F}$ into a new $d'$-dimensional feature space $\mathcal{F}'$ ($d' \ll d$). Let $\phi'(I_i)$ denote the projection of image $I_i$ into $\mathcal{F}'$.

  • In a second step, t-SNE itself transforms $\mathcal{F}'$ into a $d''$-dimensional feature space $\mathcal{F}''$ ($d'' \ll d'$). Let $\phi''(I_i)$ denote the projection of image $I_i$ into $\mathcal{F}''$.

In t-SNE, the projection $\phi''$ is optimized so that the conditional probability $p_{j|i}$ that image $I_i$ would pick image $I_j$ as its neighbor is preserved between feature spaces $\mathcal{F}'$ and $\mathcal{F}''$:

$$p_{j|i} = \frac{k_{\sigma_i}\big(\phi'(I_i), \phi'(I_j)\big)}{\sum_{l \neq i} k_{\sigma_i}\big(\phi'(I_i), \phi'(I_l)\big)} \qquad (2)$$

where $k_\sigma$ is a Gaussian kernel, with an image-specific bandwidth $\sigma_i$ controlled by the local data density in $\mathcal{F}'$:

$$k_\sigma(u, v) = \exp\left(-\frac{\lVert u - v \rVert^2}{2\sigma^2}\right) \qquad (3)$$
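As a concrete sketch of this two-step reduction, using scikit-learn (the paper does not state which t-SNE implementation it used); the intermediate dimension $d' = 50$, the perplexity and the input file are our assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Penultimate-layer features phi(I_i) of the reference images
# (d = 2,049 for Inception-v3); the file name is hypothetical.
F = np.load("reference_features.npy")

pca = PCA(n_components=50)                    # d' = 50 (assumed)
F_prime = pca.fit_transform(F)                # phi'(I_i), in F'

# t-SNE is non-parametric: it only embeds the reference samples
# (see section 3.5 for how new images are handled).
tsne = TSNE(n_components=2, perplexity=30.0)  # d'' = 2, as selected in section 4.6
F_second = tsne.fit_transform(F_prime)        # phi''(I_i), in F''
```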

3.4 Probability Function Estimation

As mentioned in the introduction, the t-SNE algorithm has a very interesting property: although unsupervised, the output feature space $\mathcal{F}''$ allows very good separation of the various conditions. This property is leveraged to define a probabilistic condition detection model in $\mathcal{F}''$. For that purpose, a probability density function $f_n^+$ is first defined in $\mathcal{F}''$ for each condition $c_n$. Probability density functions $f_n^-$ are also defined for the absence of each condition. These estimations are performed in the reference subset $\mathcal{D}_R$; training images are discarded in case the CNN has overfit the training data. Density estimations rely on the Parzen-Rosenblatt method (Parzen, 1962), using the Gaussian kernel of equation (3). For each location $z \in \mathcal{F}''$:

$$f_n^+(z) = \frac{1}{\left|\mathcal{D}_R^{n+}\right|} \sum_{I_i \in \mathcal{D}_R^{n+}} k_{h_n}\big(z, \phi''(I_i)\big), \quad \mathcal{D}_R^{n+} = \{I_i \in \mathcal{D}_R \mid y_{i,n} = 1\} \qquad (4)$$

and similarly for $f_n^-$, using the reference images where $c_n$ is absent. For each density function, one parameter needs to be set: the bandwidth $h_n$, which controls the smoothness of the estimated function. This parameter is set in an unsupervised fashion, according to Scott’s criterion (Scott, 1992):

$$h_n = s_n \cdot m_n^{-\frac{1}{d'' + 4}} \qquad (5)$$

where $m_n$ is the number of samples used for the estimation and $s_n$ their standard deviation. Finally, based on these two probability density functions, $f_n^+$ and $f_n^-$, the probability that image $I_i$ contains condition $c_n$ (simply noted $P_n$ in the absence of ambiguity) is defined as follows:

$$P(c_n \mid I_i) = \frac{f_n^+\big(\phi''(I_i)\big)}{f_n^+\big(\phi''(I_i)\big) + f_n^-\big(\phi''(I_i)\big)} \qquad (6)$$

A strong similarity can be noted between equation (2) of t-SNE and equations (4) and (6) of probability function estimation: the main difference is the change of emphasis from sample level in t-SNE to class level in probability function estimation.
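A minimal NumPy sketch of equations (4) to (6), following the reconstructed forms above; using a single isotropic bandwidth per class and the mean per-dimension standard deviation in Scott's criterion are our simplifications:

```python
import numpy as np

def scott_bandwidth(Z):
    """Scott's criterion (equation 5): h = s * m**(-1 / (d'' + 4))."""
    m, d = Z.shape
    return Z.std(axis=0).mean() * m ** (-1.0 / (d + 4))

def parzen_density(z, Z, h):
    """Parzen-Rosenblatt estimate (equation 4) at location z in F'',
    using the Gaussian kernel of equation (3) with bandwidth h."""
    d = Z.shape[1]
    sq = np.sum((Z - z) ** 2, axis=1)
    norm = (2.0 * np.pi * h * h) ** (d / 2.0)   # kernel normalization
    return np.mean(np.exp(-sq / (2.0 * h * h))) / norm

def condition_probability(z, Z_pos, Z_neg):
    """Equation (6): presence probability from the two densities, where
    Z_pos / Z_neg hold phi''(I_i) for reference images with / without c_n."""
    f_pos = parzen_density(z, Z_pos, scott_bandwidth(Z_pos))
    f_neg = parzen_density(z, Z_neg, scott_bandwidth(Z_neg))
    return f_pos / (f_pos + f_neg + 1e-12)
```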

3.5 Detecting Rare Conditions in one Image

One challenge arises once we need to process a new image: equation (6) is only theoretical. Indeed, the projection $\phi''$ based on t-SNE has no analytical expression. It is only defined for the reference samples (i.e. $I_i \in \mathcal{D}_R$): it does not allow the projection of new samples into the output feature space $\mathcal{F}''$. This limitation does not apply to the projection $\phi$ from image space to $\mathcal{F}$ (CNN) or to the projection $\phi'$ from $\mathcal{F}$ to $\mathcal{F}'$ (PCA).

In order to bypass this lack of expression, the following procedure is proposed to determine the probability that condition $c_n$ is present in a new (preprocessed) image $I$:

  1. $I$ is processed by the CNN and the output $\phi(I)$ of the penultimate layer is computed (see sections 3.2 and 3.3).

  2. The PCA-based projection is applied to obtain $\phi'(I)$ (see section 3.3).

  3. A $K$-nearest neighbor regression is performed to approximate $P(c_n \mid I)$. The search for the $K$ nearest neighbors is performed in $\mathcal{F}'$. The reference samples are the $\big(\phi'(I_i), P(c_n \mid I_i)\big)$ couples, $I_i \in \mathcal{D}_R$, where the $P(c_n \mid I_i)$ values are computed exactly through equation (6). The approximate prediction is given by:

$$\hat{P}(c_n \mid I) = \frac{1}{K} \sum_{I_j \in \mathcal{N}_K(I)} P(c_n \mid I_j) \qquad (7)$$

where $\mathcal{N}_K(I)$ contains the $K$ nearest neighbors of $\phi'(I)$ in $\mathcal{D}_R$.

In summary, the probability that a condition $c_n$ is present in any image $I$ can be estimated using equation (7). If $n \leq p$, two probabilities of presence can be used: either $\sigma(\omega_n(I))$ or $\hat{P}(c_n \mid I)$ (see section 3.2).
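A sketch of this procedure with scikit-learn; the unweighted average follows the reconstruction of equation (7) above, and K, the file names and the array layout are assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

K = 10                                               # hypothetical value
# phi'(I_i) for the reference images, and the exact probabilities
# P(c_n | I_i) computed through equation (6); file names are hypothetical.
F_prime_ref = np.load("reference_projections.npy")   # shape (m, d')
P_ref = np.load("reference_probabilities.npy")       # shape (m, N)

knn = NearestNeighbors(n_neighbors=K).fit(F_prime_ref)

def predict(phi_prime_new):
    """Equation (7): average the exact probabilities of the K nearest
    reference images, found in the PCA space F'."""
    _, idx = knn.kneighbors(phi_prime_new[None, :])
    return P_ref[idx[0]].mean(axis=0)                # one value per condition
```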

3.6 Heatmap Generation

Once conditions have been detected in image $I$, it can be useful to show how much each pixel contributes to those predictions. The result is a heatmap the size of $I$ showing the detected structures. We have already solved this problem in the past, in the case where the predictions rely on $\sigma(\omega_n(I))$ (Quellec et al., 2017). The proposed solution extended sensitivity analysis (Simonyan et al., 2014): the idea is to compute the gradient of the model predictions with respect to each input pixel, using the backpropagation algorithm. When predictions rely on $\hat{P}(c_n \mid I)$, the full processing chain is differentiable, provided that the nearest neighbors of $\phi'(I)$ are considered fixed. A differentiable processing graph can be formed by stacking the following operations (see section 3.5):

  1. the CNN up to the penultimate layer,

  2. the PCA-based linear projection,

  3. and the regression of equation (7).

The contribution $\Xi_n(I)$ of each pixel, for condition $c_n$, is determined as follows:

$$\Xi_n(I) = \left( \frac{\partial \hat{P}(c_n \mid I)}{\partial I} \odot \frac{\partial \hat{P}(c_n \mid I)}{\partial I} \right) \mathbb{1} \qquad (8)$$

where $\mathbb{1}$ denotes a matrix of ones used to sum terms across the three color components of each pixel, and $\odot$ denotes the Hadamard product. Computing $\left( \frac{\partial \hat{P}_n}{\partial I} \odot \frac{\partial \hat{P}_n}{\partial I} \right) \mathbb{1}$ rather than $\frac{\partial \hat{P}_n}{\partial I}$ places the focus on pixels as a whole, rather than on individual pixel color components (Quellec et al., 2017), which facilitates the analysis. Criterion $\Xi_n$ highlights the pixels explaining the probability density for condition $c_n$ at location $\phi''(I)$.

One limitation of this direct application of sensitivity analysis is that we lose the connection with neighboring images. To analyze those connections, we can also highlight the pixels which explain the relative similarity of $I$ to its neighbors. This can be done by applying sensitivity analysis to the following similarity term $S(I)$:

$$S(I) = -\sum_{I_j \in \mathcal{N}_{K'}(I)} \left\lVert \phi'(I) - \phi'(I_j) \right\rVert^2 \qquad (9)$$

where $\lVert \cdot \rVert$ is the L2 norm and $\mathcal{N}_{K'}(I)$ contains the $K'$ nearest neighbors of $I$ inside $\mathcal{D}_R$, in $\mathcal{F}'$ space. Precisely, we propose that the contribution $\Xi(I)$ of each pixel is determined as follows:

$$\Xi(I) = \left( \mathrm{sign}\!\left(\frac{\partial S(I)}{\partial I}\right) \odot \frac{\partial S(I)}{\partial I} \odot \frac{\partial S(I)}{\partial I} \right) \mathbb{1} \qquad (10)$$

where $\mathrm{sign}$ is the sign function. The sign corrective term is used to know whether a pixel pushes the image towards its neighbors or away from them. Positively highlighted patterns indicate similarities with the neighbors. Negatively highlighted patterns indicate differences. One additional advantage of this criterion is that it produces a single heatmap per image.
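The following sketch illustrates the similarity-based heatmap of equations (9) and (10), assuming the reconstructed forms above and a differentiable PyTorch reimplementation of the processing chain (the paper's models were trained with TF-slim; all names here are ours):

```python
import torch

def similarity_heatmap(image, cnn_trunk, pca_W, pca_mu, neighbors):
    """image: (3, H, W) preprocessed tensor; cnn_trunk: CNN up to the
    penultimate layer; pca_mu (d,), pca_W (d, d'): PCA projection;
    neighbors: (K', d') fixed projections phi'(I_j) of the neighbors."""
    x = image.clone().requires_grad_(True)
    phi = cnn_trunk(x[None])[0]                  # phi(I), in F
    phi_p = (phi - pca_mu) @ pca_W               # phi'(I), in F'
    s = -((phi_p[None] - neighbors) ** 2).sum()  # equation (9), neighbors fixed
    s.backward()                                 # sensitivity analysis
    g = x.grad                                   # dS/dI, one value per channel
    return (torch.sign(g) * g * g).sum(dim=0)    # equation (10), per pixel
```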

4 Experiments in the OPHDIAT Dataset

We have presented a probabilistic framework, called spin-off learning, to detect rare conditions in images. This framework is now applied to DR screening images from the OPHDIAT network.

4.1 The OPHDIAT Dataset

Figure 2: Examples of images from each targeted condition. For improved visualization, the preprocessed images (see section 4.3) are reported. DR: diabetic retinopathy; AMD: age-related macular degeneration; DME: diabetic macular edema; HR: hypertensive retinopathy; BRVO: branch retinal vein occlusion; RPE: retinal pigment epithelium; CRVO: central retinal vein occlusion; MIDD: maternally inherited diabetes and deafness; AION: anterior ischemic optic neuropathy.

The OPHDIAT network consists of 40 screening centers located in 22 diabetic wards of hospitals, 15 primary health-care centers and 3 prisons in the Ile-de-France area (Massin et al., 2008). Each center is equipped with one of the following digital non-mydriatic cameras: Canon CR-DGI or CR2 (Tokyo, Japan), Topcon TRC-NW6 or TRC-NW400 (Rotterdam, The Netherlands). Two photographs were taken per eye, one centered on the posterior pole and the other on the optic disc, and transmitted to a central server for interpretation and storage. From 2004 to the end of 2017, a total of 164,660 screening procedures were performed and 763,848 images were collected. Each screening exam was analyzed by an ophthalmologist, through a web interface, in order to generate a structured report (Massin et al., 2008). This structured report includes the grading of diabetic retinopathy (DR) in each eye. It also indicates the presence or suspicion of presence of a few other pathologies in each eye. In addition to the structured report, the ophthalmologist also reported his or her findings as free-form text.

4.2 Ground Truth Annotations

For the purpose of this study, these reports were analyzed by a retina specialist and 41 conditions were identified ($N = 41$ — see Fig. 2). Ground-truth annotations were obtained for each eye by combining structured information and manually extracted textual information. Next, annotations were assigned to images thanks to our laterality identification algorithm (Quellec et al., 2019). One limitation of this approach is that ophthalmologists may not have written down all their findings. To ensure that “normal images” are indeed non-pathological, normal images were visually inspected and images containing anomalies were discarded: a total of 16,955 normal images, out of 18,000 inspected images, were included. In total, 115,159 images were included in dataset $\mathcal{D}$.

4.3 Image Preprocessing

Figure 3: Fundus photograph preprocessing. Original images (a) and (c) are transformed into (b) and (d).

                Inception-v3  Inception-v4  ResNet-50  ResNet-101  ResNet-152  NASNet-A
  Inception-v3  0.966         0.963         0.935      0.924       0.920       0.921
  Inception-v4                0.960         0.930      0.926       0.931       0.927
  ResNet-50                                 0.912      0.924       0.925       0.916
  ResNet-101                                           0.916       0.919       0.918
  ResNet-152                                                       0.890       0.895
  NASNet-A                                                                     0.882

Table 1: Average classification scores (AUC) in the test subset for frequent conditions using various CNN architectures. Scores for single CNNs are on the diagonal; scores for CNN pairs are above it. The architectures selected in the validation subset were Inception-v3, Inception-v4 and the “Inception-v3 + Inception-v4” pair (see section 4.6).

Because fundus photographs were acquired with various cameras, the size and appearance of images were normalized to allow device-independent analysis. For size normalization, a square region of interest was defined around the camera’s field of view; this region of interest was then resized to 299 × 299 pixels. For appearance normalization, illumination variations within and across images were attenuated. This step was performed in the YCrCb color space. In this color space, components Cr and Cb contain chrominance information: these components were left unchanged. Component Y represents luminance: this channel was normalized to compensate for illumination variations. For that purpose, the image background was estimated using a Gaussian kernel with a large kernel size (standard deviation: 5 pixels). Next, this background image was subtracted from the Y channel. Finally, the obtained image was converted back to an RGB image.

We had used a similar representation for diabetic retinopathy (DR) screening, except that each channel in the RGB color space was normalized independently, as described above for the Y channel (Quellec et al., 2017). Although suitable for DR screening, that representation proved inefficient for detecting pigmentation conditions in particular.
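A sketch of this preprocessing with OpenCV; the centered square crop (instead of fitting the square around the detected field of view) and the +128 offset that keeps the luminance channel displayable are our simplifications:

```python
import cv2
import numpy as np

def preprocess(path, size=299):
    """Size normalization to size x size pixels, then illumination
    normalization of the luminance (Y) channel in YCrCb (section 4.3)."""
    bgr = cv2.imread(path)
    h, w = bgr.shape[:2]
    m = min(h, w)                                # centered square crop (assumed)
    bgr = bgr[(h - m) // 2:(h + m) // 2, (w - m) // 2:(w + m) // 2]
    bgr = cv2.resize(bgr, (size, size))
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    y = ycrcb[:, :, 0]
    background = cv2.GaussianBlur(y, (0, 0), 5)  # std dev: 5 pixels
    ycrcb[:, :, 0] = np.clip(y - background + 128.0, 0, 255)
    # Returned in OpenCV's BGR channel order.
    return cv2.cvtColor(ycrcb.astype(np.uint8), cv2.COLOR_YCrCb2BGR)
```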

4.4 Cutoff between Rare and Frequent Conditions

The choice of the $p$ frequent conditions is arbitrary, so we performed spin-off learning experiments for various values of $p$, varied by steps of 6 conditions from a minimum value $p_{\min} = 11$. $p_{\min}$ was chosen such that each of the $p_{\min}$ most frequent conditions has enough positive examples for supervised CNN training.

4.5 Learning, Validation and Testing

In the presence of numerous and highly unbalanced conditions, dividing the dataset into learning, validation and test subsets is critical. The following strategy was used to 1) distribute the data between subsets, 2) train and validate the models and 3) test the selected models; one model is selected per condition based on validation scores. The proposed data distribution strategy ensures that two images from the same patient are always assigned to the same subset (either $\mathcal{D}_L$, $\mathcal{D}_V$ or $\mathcal{D}_T$).

For the purpose of training CNNs, a “balanced” dataset $\mathcal{B} \subset \mathcal{D}$ was created in such a way that, ideally, all frequent conditions were equally represented. For each condition $c_n$, $n \leq p$, images with condition $c_n$ were selected at random until all images containing $c_n$ had been selected or until the number of selected images with condition $c_n$ reached 1,500. Images containing rare conditions were excluded from this selection. 5,000 normal images were also selected at random. The size of these balanced datasets ranges from 17,205 images (for the smallest value of $p$) to 21,973 images (for the largest).

The learning subset $\mathcal{D}_L$ was populated by most of $\mathcal{B}$. The validation subset $\mathcal{D}_V$ was populated by part of $\mathcal{B}$ plus half of $\mathcal{D} \setminus \mathcal{B}$. The test subset $\mathcal{D}_T$ was populated by the remaining images of $\mathcal{B}$ plus the other half of $\mathcal{D} \setminus \mathcal{B}$. Taking validation and test images from $\mathcal{D} \setminus \mathcal{B}$ ensures that all conditions $c_n$, $n > p$, can be validated and tested.

To validate rare-condition detectors, a 10-fold cross-validation strategy was employed: for each fold, probability functions were built using 90% of $\mathcal{D}_V$ as reference images ($\mathcal{D}_R$), and probabilities were computed for the other 10%. Similarly, performance of rare-condition detectors in the test subset relied on 10-fold cross-testing: for each fold, probability functions were built using $\mathcal{D}_V$ plus 90% of $\mathcal{D}_T$ as reference images, and probabilities were computed for the other 10%. The usual validation and testing strategies were used for frequent-condition detectors.
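A sketch of these constraints with scikit-learn; the patient-id array, the file names and the use of GroupShuffleSplit / GroupKFold are our choices for illustrating the patient-level separation and the 10 folds:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, GroupKFold

features = np.load("image_features.npy")         # one row per image (hypothetical)
patient_ids = np.load("image_patient_ids.npy")   # one patient id per image

# Two images from the same patient always land in the same subset.
split = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
val_idx, test_idx = next(split.split(features, groups=patient_ids))

# 10-fold cross-validation of rare-condition detectors: in each fold, most
# of the validation subset serves as reference images D_R, and the
# remaining images are scored with the K-NN regression of equation (7).
folds = GroupKFold(n_splits=10)
for ref_idx, probe_idx in folds.split(val_idx, groups=patient_ids[val_idx]):
    reference_images = val_idx[ref_idx]          # defines D_R for this fold
    probe_images = val_idx[probe_idx]            # probabilities computed here
```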

Figure 4: ROC curves in the test subset for each condition.

4.6 Parameter Selection

The following CNN architectures were investigated: Inception-v3 (Szegedy et al., 2016), Inception-v4 (Szegedy et al., 2017), ResNet-50, ResNet-101 and ResNet-152 (He et al., 2016), and NASNet-A (Zoph et al., 2018). The TF-slim image classification library (https://github.com/tensorflow/models/tree/master/research/slim) was used. The combination of two CNNs was also investigated: in that case, their penultimate layers were concatenated to define the initial feature space $\mathcal{F}$. An initial experiment, involving the most frequent conditions alone, was performed to select the most promising architectures. Architectures maximizing classification performance on the validation subset were selected. Classification performance is defined as the area under the ROC (receiver operating characteristic) curve, noted AUC. Average AUC scores for the most frequent conditions are reported in Table 1 for each architecture. This experiment reveals that three architectures lead to particularly good classification performance: Inception-v3, Inception-v4 and “Inception-v3 + Inception-v4”. We only considered those three architectures in subsequent experiments.

Two important parameters also had to be set:

  • the dimension $d''$ of the reduced feature space generated by t-SNE (for visualization, $d''$ is generally set to 2 or 3, but higher values can be used),

  • the number $K$ of neighbors used to approximate the probability functions in section 3.5.

These parameters were chosen to maximize classification performance in the validation subset. The optimal t-SNE dimension was $d'' = 2$. The number of neighbors $K'$ used to compute heatmaps in equation (10) was set based on visual inspections in the validation subset.

4.7 ROC Analysis

The ROC analysis of spin-off learning in the test subset is reported in Fig. 4. The influence of condition frequency on the area under the ROC curve is illustrated in Fig. 5. We observe that detection performance is poorly correlated with the frequency of a condition.

Figure 5: Detection performance of conditions in the test subset as a function of their frequency $f_n$. An area under the ROC curve less than 0.8 is considered a failure.

4.8 Visualization

The probability density functions obtained for one example CNN are shown in Fig. 6. Heatmap examples for conditions unknown to the CNNs (i.e. for rare conditions) are given in Fig. 7. Condition-specific heatmaps of equation (8) tend to be more sensitive. However, similarity-based heatmaps are still quite useful and provide a good summary of the relevant information in images.

Figure 6: Probability density functions (Inception-v3 CNN). A wide probability density function indicates that images with or without the condition could not be separated well.
Figure 7: Heatmap generation. Examples of images are given in the first row. From left to right, these images present an anterior ischemic optic neuropathy, a macular hole, maternally inherited diabetes and deafness, optic disc pallor and retinal pigment epithelium (RPE) alterations. Heatmaps obtained using equation (8) are given in the second row for those conditions. Black means zero; positive values are in green. Heatmaps obtained using equation (10) are given in the third row (these heatmaps are condition-independent). Gray means zero; positive values are bright and negative values are dark. The CNN of Fig. 6 was used (Inception-v3 CNN): with the exception of RPE alterations (last column), those conditions are thus unknown to this CNN.

4.9 Comparison with Other Machine Learning Frameworks

The proposed spin-off learning framework is now compared with competing ML frameworks, namely one-shot learning, transfer learning and multitask learning. The same CNN architectures (Inception-v3, Inception-v4 and “Inception-v3 + Inception-v4”) were considered in all experiments.

For one-shot learning, the similarity between two images $I_i$ and $I_j$ was defined using a Siamese network (Koch et al., 2015) derived from one of the three selected CNN architectures. The outputs of the penultimate CNN layer, i.e. the $\phi(I_i)$ and $\phi(I_j)$ vectors, were used to compute the similarity between $I_i$ and $I_j$. This similarity is defined as a logistic regression of the absolute difference between $\phi(I_i)$ and $\phi(I_j)$ (Koch et al., 2015). For training the Siamese networks, $I_i$ and $I_j$ were considered to match if at least one condition was present in both images. To detect $c_n$ in a test image, the average similarity to validation images containing condition $c_n$ was used: it proved more efficient than considering the maximal similarity (Koch et al., 2015).
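A minimal sketch of this similarity, assuming penultimate-layer features have already been extracted; the weight names and the helper functions are ours:

```python
import numpy as np

def siamese_similarity(phi_i, phi_j, w, b):
    """Logistic regression of the absolute feature difference
    (Koch et al., 2015); w, b come from training the Siamese head."""
    z = float(np.dot(w, np.abs(phi_i - phi_j))) + b
    return 1.0 / (1.0 + np.exp(-z))

def detect_condition(phi_test, phi_positives, w, b):
    """Average similarity to the validation images containing c_n,
    used as the detection score for the test image."""
    return np.mean([siamese_similarity(phi_test, p, w, b)
                    for p in phi_positives])
```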

For transfer learning, CNNs were trained for the $p = 11$ most frequent conditions. Then, these CNNs were fine-tuned to detect each of the remaining 30 conditions individually. For multitask learning, CNNs were trained to detect all 41 conditions together.

Results are reported in Table 2. Statistical comparisons between frameworks were performed using repeated measures ANOVA, using the MedCalc Statistical Software version 19.0.6 (MedCalc Software bvba, Ostend, Belgium). The corresponding p-values are reported in Table 3.

condition — spin-off learning / one-shot learning / transfer learning / multitask learning

referable DR 0.9882 0.7422 0.9882 0.9251
glaucoma 0.9733 0.7567 0.9733 0.9600
cataract 0.9834 0.7947 0.9780 0.9754
AMD 0.9916 0.7441 0.9916 0.9717
drusen 0.9895 0.8304 0.9895 0.9470
DME 0.9982 0.8709 0.9959 0.9973
HR 0.9347 0.8087 0.9299 0.8568
laser scars 0.9993 0.7647 0.9993 0.9971
arteriosclerosis 0.8289 0.8110 0.7998 0.8083
tortuous vessels 0.9888 0.7603 0.9888 0.9774
degenerative myopia 0.9999 0.8726 0.9999 0.9973
BRVO 0.9289 0.8285 0.8020 0.8613
epiretinal membrane 0.9611 0.8521 0.9162 0.9456
nevi 0.9337 0.8250 0.8035 0.6311
retinal atrophy 0.9842 0.7605 0.8293 0.9178
myelinated nerve fibers 0.9815 0.8255 0.9462 0.9059
RPE alterations 0.9458 0.8600 0.9177 0.8818
optic disc pallor 0.9268 0.8537 0.8794 0.8878
synchisis 1.0000 0.9091 0.9932 0.9991
tilted optic disc 0.9522 0.8568 0.9031 0.9258
CRVO 0.9873 0.8325 0.9028 0.9879
chorioretinitis 0.9913 0.7635 0.9555 0.9862
dystrophy 0.9760 0.9493 0.8303 0.9058
retinitis pigmentosa 0.9984 0.9889 0.9740 0.9967
chorioretinal atrophy 0.9909 0.8435 0.8405 0.9714
dilated veins 0.8433 0.8804 0.8093 0.8063
angioid streaks 0.7947 0.9314 0.7594 0.8090
papilledema 0.9510 0.9403 0.9363 0.9192
macular hole 0.9002 0.8784 0.6734 0.7404
embolus 0.7946 0.8565 0.6690 0.6916
MIDD 0.9555 0.9132 0.9206 0.9270
coloboma 0.9948 0.7188 0.9346 0.6446
shunt 0.7586 0.8380 0.6818 0.6782
AION 0.9534 0.9330 0.9108 0.8330
bear track dystrophy 0.8154 0.6921 0.6245 0.5912
pseudovitelliform dystrophy 0.9676 0.9412 0.9176 0.9157
pigmentary migration 0.9986 0.7750 0.8714 0.9251
prethrombosis 0.8050 0.6688 0.5123 0.4325
hyaloid remnant 0.8916 0.7247 0.8000 0.6426
asteroid hyalosis 0.9979 0.7635 0.9327 0.8382
telangiectasia 0.7999 0.6423 0.3982 0.4236
average (all 41 conditions) 0.9380 0.8245 0.8654 0.8545
average (30 rarest conditions, $n > 11$) 0.9260 0.8349 0.8282 0.8207
Table 2: Comparison between ML frameworks in terms of AUC in the test subset. Abbreviations are listed in the legend of Fig. 2. The best AUC for each condition is in bold.

All conditions:
                     one-shot learning  transfer learning  multitask learning
  spin-off learning  0.0001             0.0001             0.0001
  one-shot learning                     0.3507             0.9401
  transfer learning                                        1.0000

Rare conditions ($n > 11$):
                     one-shot learning  transfer learning  multitask learning
  spin-off learning  0.0002             0.0001             0.0002
  one-shot learning                     1.0000             1.0000
  transfer learning                                        1.0000

Table 3: p-values of pairwise comparisons between ML frameworks using repeated measures ANOVA. Comparisons consider either all conditions (top) or rare conditions alone ($n > 11$, bottom). The significant differences (p < 0.05) are those involving spin-off learning.

5 Discussion and Conclusions

We have presented spin-off learning, a new machine learning (ML) framework for detecting rare conditions in medical images using deep learning. This framework takes advantage of many annotations available for more frequent conditions in a large image dataset. Spin-off learning was successfully applied to the detection of 41 conditions in fundus photographs from the OPHDIAT diabetic retinopathy (DR) screening program.

This framework takes advantage of an interesting behavior of convolutional neural networks (CNNs): CNNs tend to cluster similar images in feature space, a phenomenon exploited in content-based image retrieval systems for instance (Tolias et al., 2016). In our context, we observed that conditions unknown to the CNNs are also clustered in feature space (see Fig. 6). A probabilistic framework, based on the t-SNE representation, was thus proposed to take advantage of this observation. Detection results are very good: an average area under the ROC curve (AUC) of 0.9380 was obtained (see Table 2). Detection performance is also good if we consider the rarest conditions alone: the average AUC only drops to 0.9260 when the 30 rarest pathologies are considered (see Table 2). More generally, we observed in Fig. 5 that detection performance is poorly correlated with the frequency of a condition, even when using a logarithmic scale for frequency. Prior to the study, we established that a detector would be considered useful should its AUC exceed 0.8: that cutoff was reached for 37 conditions out of 41.

Spin-off learning was compared to other candidate ML frameworks for detecting rare conditions: one-shot learning, transfer learning and multitask learning. Interestingly, the performance of one-shot learning (Koch et al., 2015) proved to be largely independent of the frequency of a condition. In fact, one-shot learning outperformed spin-off learning for four conditions: all these conditions are among the rarest, and three of them were poorly detected (AUC < 0.8) by spin-off learning (see Table 2). However, spin-off learning significantly outperforms one-shot learning, including for rare condition detection (see Table 3). We have shown that multitask learning and transfer learning are not well suited to detecting rare conditions. Worse, in multitask learning, the detection of frequent conditions, which is trained simultaneously, is negatively impacted. In transfer learning, we hypothesize that good properties learnt for the detection of frequent conditions are lost when fine-tuning for rare conditions. In summary, spin-off learning clearly is the most relevant ML framework for the target task (see Table 3). However, one-shot learning, which also relies on similarity analysis in CNN feature space (Koch et al., 2015), is an interesting candidate ML framework as well.

Although the probabilistic model is based on the t-SNE dimension reduction technique, which has no analytical expression, we designed it to be differentiable. This property allows heatmap generation through sensitivity analysis (Quellec et al., 2017). We first defined condition-specific heatmaps (in green in Fig. 7). Even for conditions unknown to the CNNs, we see that the pathological structures are well captured by the CNNs. More interestingly, we also defined condition-independent heatmaps (in gray levels in Fig. 7). Those heatmaps are based on the similarity between neighboring images in feature space: they validate the fact that images are clustered by similarity, and that this similarity relies on diagnostically relevant features. We note that both types of heatmaps lead to similar results, the advantage of the second being that it summarizes the relevant information in a single heatmap (instead of 41).

We believe the use of a visualization technique, namely t-SNE, for classification is an interesting feature of spin-off learning. In particular, we found that reducing the feature space to two dimensions, the value generally used for visualization, maximizes classification performance (see section 4.6). This can be explained in part by more reliable kernel density estimations in low-dimensional feature spaces (Scott, 1992). One advantage is that we can conveniently browse the image dataset in a 2-D viewer and, with the help of the proposed similarity-based heatmaps, understand how the dataset is organized by CNNs. This could be used to show human readers similar images for decision support (Quellec et al., 2011).

The proposed framework is mostly unsupervised, which can be regarded as a limitation. We note, however, that it can easily be transformed into a supervised framework. The solution is to 1) approximate the t-SNE projection with a multilayer perceptron and 2) optimize this approximation to maximize the separation between the probability density functions $f_n^+$ and $f_n^-$ (Patrick and Fischer, 1969). The CNN weights can thus be optimized through the probabilistic model. However, the number of degrees of freedom increases significantly, which makes the framework less relevant for rare conditions.

This study has one undeniable limitation: each image was interpreted by a single human reader (an ophthalmologist), who was not obligated to annotate all visible conditions. Therefore, the quality of performance assessment could be improved. However, given the large number of conditions considered in this study, the relevance of the approach was clearly validated.

In conclusion, we have presented the first study on the automatic detection of a large number of conditions in retinal images. A simple ML framework was proposed for this purpose. The results are highly encouraging and open new perspectives for ocular pathology screening. In particular, the trained detectors could be used to generate warnings when rare conditions are detected, both in traditional and automatic screening scenarios. We believe this will favor the adoption of automatic screening systems, which currently focus on the most frequent pathologies and ignore all others.

References

  • Abràmoff et al. (2016) Abràmoff, M. D., Lou, Y., Erginay, A., Clarida, W., Amelon, R., Folk, J. C., Niemeijer, M., Oct. 2016. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest Ophthalmol Vis Sci 57 (13), 5200–5206.
  • Ahn et al. (2019) Ahn, J. M., Kim, S., Ahn, K.-S., Cho, S.-H., Lee, K. B., Kim, U. S., Jan. 2019. A deep learning model for the detection of both advanced and early glaucoma using fundus photography. PLoS ONE 14 (1), e0211579.
  • Caruana (1997) Caruana, R., Jul. 1997. Multitask learning. Mach Learn 28 (1), 41–75.
  • Cheplygina (2019) Cheplygina, V., Mar. 2019. Cats or CAT scans: Transfer learning from natural or medical image source data sets? Curr Opin Biomed Eng 9, 21–27.
  • Choi et al. (2017) Choi, J. Y., Yoo, T. K., Seo, J. G., Kwak, J., Um, T. T., Rim, T. H., Nov. 2017. Multi-categorical deep learning neural network to classify retinal images: A pilot study employing small database. PLoS ONE 12 (11), e0187336.
  • Christopher et al. (2018) Christopher, M., Belghith, A., Bowd, C., Proudfoot, J. A., Goldbaum, M. H., Weinreb, R. N., Girkin, C. A., Liebmann, J. M., Zangwill, L. M., Nov. 2018. Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs. Sci Rep 8 (1), 16685.
  • Cuadros and Bresnick (2009) Cuadros, J., Bresnick, G., May 2009. EyePACS: an adaptable telemedicine system for diabetic retinopathy screening. J Diabetes Sci Technol 3 (3), 509–516.
  • Diaz-Pinto et al. (2019) Diaz-Pinto, A., Morales, S., Naranjo, V., Köhler, T., Mossi, J. M., Navea, A., Mar. 2019. CNNs for automatic glaucoma assessment using fundus images: An extensive validation. Biomed Eng Online 18 (1), 29.
  • Fauw et al. (2018) Fauw, J. D., Ledsam, J. R., Romera-Paredes, B., Nikolov, S., Tomasev, N., Blackwell, S., Askham, H., Glorot, X., O’Donoghue, B., Visentin, D., Driessche, G. v. d., Lakshminarayanan, B., Meyer, C., Mackinder, F., Bouton, S., Ayoub, K., Chopra, R., King, D., Karthikesalingam, A., Hughes, C. O., Raine, R., Hughes, J., Sim, D. A., Egan, C., Tufail, A., Montgomery, H., Hassabis, D., Rees, G., Back, T., Khaw, P. T., Suleyman, M., Cornebise, J., Keane, P. A., Ronneberger, O., Sep. 2018. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med 24 (9), 1342.
  • Fei-Fei et al. (2006) Fei-Fei, L., Fergus, R., Perona, P., Apr. 2006. One-shot learning of object categories. IEEE Trans Pattern Anal Mach Intell 28 (4), 594–611.
  • Gargeya and Leng (2017) Gargeya, R., Leng, T., Jul. 2017. Automated identification of diabetic retinopathy using deep learning. Ophthalmology 124 (7), 962–969.
  • Guendel et al. (2019) Guendel, S., Ghesu, F. C., Grbic, S., Gibson, E., Georgescu, B., Maier, A., Comaniciu, D., May 2019. Multi-task learning for chest X-ray abnormality classification on noisy labels. Tech. Rep. arXiv:1905.06362 [cs].
  • Gulshan et al. (2016) Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., Kim, R., Raman, R., Nelson, P. C., Mega, J. L., Webster, D. R., Dec. 2016. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 (22), 2402–2410.
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., Jun. 2016. Deep residual learning for image recognition. In: Proc CVPR. Las Vegas, NV, USA, pp. 770–778.
  • Keel et al. (2018) Keel, S., Wu, J., Lee, P. Y., Scheetz, J., He, M., Dec. 2018. Visualizing deep learning models for the detection of referable diabetic retinopathy and glaucoma. JAMA Ophthalmol.
  • Koch et al. (2015) Koch, G., Zemel, R., Salakhutdinov, R., Jul. 2015. Siamese neural networks for one-shot image recognition. In: Proc ICML. University of Toronto, Lille, France.
  • Li et al. (2018) Li, Z., He, Y., Keel, S., Meng, W., Chang, R. T., He, M., Aug. 2018. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology 125 (8), 1199–1206.
  • Litjens et al. (2017) Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., van der Laak, J. A. W. M., van Ginneken, B., Sánchez, C. I., Dec. 2017. A survey on deep learning in medical image analysis. Med Image Anal 42, 60–88.
  • Maaten and Hinton (2008) Maaten, L. v. d., Hinton, G., Nov. 2008. Visualizing data using t-SNE. J Mach Learn Res 9, 2579–2605.
  • Massin et al. (2008) Massin, P., Chabouis, A., Erginay, A., Viens-Bitker, C., Lecleire-Collet, A., Meas, T., Guillausseau, P.-J., Choupot, G., André, B., Denormandie, P., Jun. 2008. OPHDIAT: a telemedical network screening system for diabetic retinopathy in the Ile-de-France. Diabetes Metab 34 (3), 227–234.
  • Matsuba et al. (2018) Matsuba, S., Tabuchi, H., Ohsugi, H., Enno, H., Ishitobi, N., Masumoto, H., Kiuchi, Y., May 2018. Accuracy of ultra-wide-field fundus ophthalmoscopy-assisted deep learning, a machine-learning technology, for detecting age-related macular degeneration. Int Ophthalmol.
  • Mordan et al. (2018) Mordan, T., Thome, N., Henaff, G., Cord, M., Dec. 2018. Revisiting Multi-Task Learning with ROCK: A Deep Residual Auxiliary Block for Visual Detection. In: Proc NIPS. Montreal, Canada, pp. 1310–1322.
  • Nielsen et al. (2019) Nielsen, K. B., Lautrup, M. L., Andersen, J. K. H., Savarimuthu, T. R., Grauslund, J., Apr. 2019. Deep learning-based algorithms in screening of diabetic retinopathy: A systematic review of diagnostic performance. Ophthalmol Retina 3 (4), 294–304.
  • Parzen (1962) Parzen, E., 1962. On estimation of a probability density function and mode. Ann Math Stat 33 (3), 1065–1076.
  • Pascolini and Mariotti (2012) Pascolini, D., Mariotti, S. P., May 2012. Global estimates of visual impairment: 2010. Br J Ophthalmol 96 (5), 614–618.
  • Patrick and Fischer (1969) Patrick, E., Fischer, F., Sep. 1969. Nonparametric feature selection. IEEE Trans Inform Theory 15 (5), 577–584.

  • Pead et al. (2019) Pead, E., Megaw, R., Cameron, J., Fleming, A., Dhillon, B., Trucco, E., MacGillivray, T., 2019. Automated detection of age-related macular degeneration in color fundus photography: A systematic review. Surv Ophthalmol.
  • Phan et al. (2019) Phan, S., Satoh, S., Yoda, Y., Kashiwagi, K., Oshika, T., Japan Ocular Imaging Registry Research Group, May 2019. Evaluation of deep convolutional neural networks for glaucoma detection. Jpn J Ophthalmol 63 (3), 276–283.
  • Quellec et al. (2017) Quellec, G., Charrière, K., Boudi, Y., Cochener, B., Lamard, M., Jul. 2017. Deep image mining for diabetic retinopathy screening. Med Image Anal 39, 178–193.
  • Quellec et al. (2011) Quellec, G., Lamard, M., Cazuguel, G., Bekri, L., Daccache, W., Roux, C., Cochener, B., 2011. Automated assessment of diabetic retinopathy severity using content-based image retrieval in multimodal fundus photographs. Invest Ophthalmol Vis Sci 52 (11), 8342–8348.
  • Quellec et al. (2019) Quellec, G., Lamard, M., Lay, B., Guilcher, A. L., Erginay, A., Cochener, B., Massin, P., Jun. 2019. Instant automatic diagnosis of diabetic retinopathy. Tech. Rep. arXiv:1906.11875 [cs, eess].
  • Raju et al. (2017) Raju, M., Pagidimarri, V., Barreto, R., Kadam, A., Kasivajjala, V., Aswath, A., 2017. Development of a deep learning algorithm for automatic diagnosis of diabetic retinopathy. Stud Health Technol Inform 245, 559–563.
  • Scott (1992) Scott, D., 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, New York, Chicester.
  • Shibata et al. (2018) Shibata, N., Tanito, M., Mitsuhashi, K., Fujino, Y., Matsuura, M., Murata, H., Asaoka, R., Oct. 2018. Development of a deep residual learning algorithm to screen for glaucoma from fundus photography. Sci Rep 8 (1), 14665.
  • Shyam et al. (2017) Shyam, P., Gupta, S., Dukkipati, A., Aug. 2017. Attentive recurrent comparators. In: Proc ICML. Sydney, Australia.
  • Simonyan et al. (2014) Simonyan, K., Vedaldi, A., Zisserman, A., Apr. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In: ICLR Workshop. Calgary, Canada.
  • Sung et al. (2018) Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. S., Hospedales, T. M., Jun. 2018. Learning to compare: Relation network for few-shot learning. In: Proc CVPR. Salt Lake City, UT, USA.
  • Szegedy et al. (2017) Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A., Feb. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proc AAAI. San Francisco, CA, USA, pp. 4278–4284.

  • Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., Jun. 2016. Rethinking the inception architecture for computer vision. In: Proc IEEE CVPR. Las Vegas, NV, USA, pp. 2818–2826.
  • Ting et al. (2017) Ting, D. S. W., Cheung, C. Y.-L., Lim, G., Tan, G. S. W., Quang, N. D., Gan, A., Hamzah, H., Garcia-Franco, R., San Yeo, I. Y., Lee, S. Y., Wong, E. Y. M., Sabanayagam, C., Baskaran, M., Ibrahim, F., Tan, N. C., Finkelstein, E. A., Lamoureux, E. L., Wong, I. Y., Bressler, N. M., Sivaprasad, S., Varma, R., Jonas, J. B., He, M. G., Cheng, C.-Y., Cheung, G. C. M., Aung, T., Hsu, W., Lee, M. L., Wong, T. Y., Dec. 2017. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 318 (22), 2211–2223.
  • Tolias et al. (2016) Tolias, G., Sicre, R., Jégou, H., May 2016. Particular object retrieval with integral max-pooling of CNN activations. In: Proc ICLR. San Juan, Puerto Rico.


  • Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D., Dec. 2016. Matching networks for one shot learning. In: Proc NIPS. Barcelona, Spain, pp. 3637–3645.
  • Wang et al. (2018) Wang, J., Ju, R., Chen, Y., Zhang, L., Hu, J., Wu, Y., Dong, W., Zhong, J., Yi, Z., Sep. 2018. Automated retinopathy of prematurity screening using deep neural networks. EBioMedicine 35, 361–368.
  • Zhang et al. (2014) Zhang, Z., Luo, P., Loy, C. C., Tang, X., Sep. 2014. Facial landmark detection by deep multi-task learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (Eds.), Proc ECCV. Lecture Notes in Computer Science. Springer International Publishing, Zurich, Switzerland, pp. 94–108.
  • Zoph et al. (2018) Zoph, B., Vasudevan, V., Shlens, J., Le, Q. V., Jun. 2018. Learning transferable architectures for scalable image recognition. In: Proc IEEE CVPR. Salt Lake City, UT, USA.