Development and Validation of a Novel Prognostic Model for Predicting AMD Progression Using Longitudinal Fundus Images

07/10/2020 ∙ by Joshua Bridge, et al. ∙ University of Liverpool

Prognostic models aim to predict the future course of a disease or condition and are a vital component of personalized medicine. Statistical models make use of longitudinal data to capture the temporal aspect of disease progression; however, these models require prior feature extraction. Deep learning avoids explicit feature extraction, meaning we can develop models for images where features are either unknown or impossible to quantify accurately. Previous prognostic models using deep learning with imaging data require annotation during training or only utilize a single time point. We propose a novel deep learning method to predict the progression of diseases using longitudinal imaging data with uneven time intervals, which requires no prior feature extraction. Given previous images from a patient, our method aims to predict whether the patient will progress to the next stage of the disease. The proposed method uses InceptionV3 to produce a feature vector for each image. To account for uneven intervals, a novel interval scaling is proposed. Finally, a recurrent neural network is used to predict progression. We demonstrate our method on a longitudinal dataset of color fundus images from 4903 eyes with age-related macular degeneration (AMD), taken from the Age-Related Eye Disease Study, to predict progression to late AMD. Our method attains a testing sensitivity of 0.878, a specificity of 0.887, and an area under the receiver operating characteristic curve of 0.950. Compared with previous methods, our model shows superior performance. Class activation maps display how the network reaches the final decision.

1 Introduction

Prognostic models are an essential component of personalized medicine, allowing health experts to predict the future course of disease in individual patients [25]. Advances in computing power and an abundance of data have allowed increasingly sophisticated models to be developed. Most prognostic models use statistical methods such as logistic regression; these models require prior feature extraction, either manual or automatic [5], and are limited in the number of variables they can include. Feature extraction can be costly and time-consuming, especially for imaging data. Deep learning avoids explicit feature extraction, allowing us to develop models without handcrafted features, which makes it especially useful for imaging data. Prognostic deep learning models have been developed in several fields, primarily ophthalmology [10], cardiology [17], and neurology [13], and for several modalities, including magnetic resonance imaging (MRI), optical coherence tomography (OCT), color fundus photography, and X-ray.

Current prognostic models that use deep learning to analyze imaging data either rely on automatic feature extraction algorithms to extract known features or consider only a single time point. Models built on feature extraction train algorithms on annotated images to extract relevant features, such as volumes in OCT data; those features are then fed into a traditional statistical model (see [10, 20, 19] for examples). Manual feature extraction is time-consuming and requires expert readers. More recently, Yim et al. [29] proposed a method that automatically segments OCT layers before classification. This method outperformed human experts; however, automatic feature extraction requires annotations during training, which are not always available when the features are unknown or difficult to quantify, as is the case with color fundus imaging.

An alternative to explicit feature extraction is to use deep learning to extract features implicitly, as in [4, 3]. Many models take the last available image and fine-tune a pretrained convolutional neural network (CNN), with Inception V3 [26] being a popular choice due to its generalizability and high performance in a variety of tasks. Unlike the feature extraction approach, this approach may be applied to any image even when features are not explicitly known. However, it creates a separate issue: by using only one image, these models may fail to capture the temporal pattern across time points.

Here, we develop a prognostic model to predict the progression of disease from longitudinal images. The method is applicable to any modality, even when the causes of progression are unknown or cannot be quantified. The proposed method is demonstrated on a dataset consisting of 4903 eyes with age-related macular degeneration (AMD), taken from the AREDS dataset [2], and is generalizable to any longitudinal imaging data. We show that by considering the time interval between images and adopting a method from time series analysis, we can significantly improve prediction performance.

Our contributions are as follows:

  • Propose a novel method to predict the future prognosis of a patient from longitudinal images

  • Introduce interval scaling which allows for uneven time intervals between visits

  • Demonstrate the method on the largest longitudinal dataset and outperform previous state-of-the-art methods

2 Method

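Given images at times $t_1, \dots, t_T$, we wish to predict the diagnosis at time $t_{T+1}$. The intervals between visits need not be equal, that is, $t_2 - t_1 = t_3 - t_2 = \dots = t_{T+1} - t_T$ does not necessarily hold, which is common in a clinical setting.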

The proposed method consists of three stages. First, we use a pretrained CNN, with shared weights, to reduce each image to a single feature vector. Second, the feature vectors are combined and an interval scaling is applied to account for the uneven time intervals; this weights the most recent time points as more important in making the final prediction. Finally, a recurrent neural network (RNN) classifies the image sequence as progressing or non-progressing. An overview of the proposed framework is shown in Figure 1.

Figure 1: Overview of the proposed method. For each of the $T$ time points, we fit a CNN with shared weights, resulting in a vector of length $F$ per image. Each vector is multiplied by a corresponding interval scaling. The scaled vectors are combined into a single $T \times F$ matrix, and a Gated Recurrent Unit (GRU) with sigmoid activation gives a probability of progression. For simplicity, 3 time points are shown; the method is extendable to any number of time points.

2.1 Inception V3

We begin by fine-tuning a pretrained CNN on each image, with shared weights, to extract feature vectors. In our work, we chose InceptionV3 [26] pretrained on ImageNet [23]. InceptionV3 increases accuracy over previous networks while remaining computationally efficient, through the use of factorized kernels, batch normalization, and regularization. InceptionV3 is considered highly generalizable, with an accuracy greater than 78.1% on the ImageNet dataset. The InceptionV3 network produces a feature vector of length F=2048 for each image at each time point. This network has previously been used to provide state-of-the-art results in single time point methods [4, 3], and is used here as a feature extractor.

2.2 Interval scaling

To account for uneven time intervals, we implement a triangular window function to create a smoothing model. Whereas a simple moving average model weights all time points equally, smoothing models weight values closer to the prediction time $t_{T+1}$ as being more useful in the prediction. For each sequence of times $t_1, \dots, t_{T+1}$, where $t_{T+1}$ is the time point that we want to predict at, we rescale each time $t_i$ to an interval scale that grows as $t_i$ approaches $t_{T+1}$. The feature vector for each image is then multiplied by its corresponding interval scale. This scaling weights images closer to the time point of interest as more important than those observed further away, thus allowing the network to account for uneven time intervals.
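The idea of interval scaling can be illustrated with a minimal sketch. It assumes a one-sided triangular window whose weight is 1 at the prediction time and falls linearly to 0 at a distance `width` before it; the exact rescaling used in the paper may differ. The names `triangular_interval_scale`, `scale_features`, and the `width` parameter are hypothetical and chosen only for illustration.

```python
def triangular_interval_scale(times, t_pred, width):
    """Return one weight per observation time (hypothetical helper).

    Assumption: a one-sided triangular window that equals 1 at t_pred and
    falls linearly to 0 at t_pred - width; earlier times get weight 0.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    weights = []
    for t in times:
        if t > t_pred:
            raise ValueError("observation times must not exceed t_pred")
        weights.append(max(0.0, 1.0 - (t_pred - t) / width))
    return weights


def scale_features(features, weights):
    """Multiply each feature vector (one per time point) by its weight."""
    return [[w * x for x in vec] for vec, w in zip(features, weights)]


# Visits at 0, 2 and 3 years with a prediction at 7.5 years.
# The window width of 10 years is an illustrative assumption.
weights = triangular_interval_scale([0.0, 2.0, 3.0], t_pred=7.5, width=10.0)
scaled = scale_features([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]], weights)
```

Under this assumed window, later visits always receive larger weights than earlier ones, regardless of how unevenly the visits are spaced.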

3 GRU prediction

To predict whether the patient will progress to advanced AMD or not, we combine the interval-corrected vectors into a $T \times F$ matrix, where $T$ corresponds to the number of time points and $F$ is the number of features. We apply a Gated Recurrent Unit (GRU) [7] with a filter size of 1, resulting in a single value. GRU was chosen as opposed to Long Short-Term Memory (LSTM) [14] units, as GRU is more computationally efficient. LSTM units perform better on longer sequences; however, in this case, we only have three time points [9]. The sigmoid activation function then scales the value between 0 and 1. Any values greater than a threshold of 0.5 are predicted to be future progressing patients.
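To make the prediction step concrete, here is a minimal pure-Python sketch of a one-unit GRU forward pass. It is not the authors' Keras model. It assumes the hidden state starts at 0, that the candidate activation is a sigmoid (so the final state lies between 0 and 1), and that the state update is `h = z * h + (1 - z) * candidate`. The parameter dictionary layout and function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gru_single_unit(sequence, params):
    """Run a one-unit GRU over a T x F sequence and return the final state.

    Assumptions: the hidden state starts at 0 and the candidate activation
    is a sigmoid. `params` is a hypothetical dict holding input weights
    (w_*), recurrent weights (u_*) and biases (b_*).
    """
    h = 0.0
    for x in sequence:
        z = sigmoid(dot(params["w_z"], x) + params["u_z"] * h + params["b_z"])
        r = sigmoid(dot(params["w_r"], x) + params["u_r"] * h + params["b_r"])
        cand = sigmoid(
            dot(params["w_h"], x) + params["u_h"] * (r * h) + params["b_h"]
        )
        h = z * h + (1.0 - z) * cand  # blend old state and candidate
    return h


def predict_progression(sequence, params, threshold=0.5):
    """Return (probability, predicted_progressing)."""
    prob = gru_single_unit(sequence, params)
    return prob, prob > threshold


# Toy example with F = 4 features and T = 3 time points; all-zero
# parameters are used only so the arithmetic is easy to follow.
F = 4
zero_params = {
    "w_z": [0.0] * F, "u_z": 0.0, "b_z": 0.0,
    "w_r": [0.0] * F, "u_r": 0.0, "b_r": 0.0,
    "w_h": [0.0] * F, "u_h": 0.0, "b_h": 0.0,
}
sequence = [[0.1] * F, [0.2] * F, [0.3] * F]
prob, progressing = predict_progression(sequence, zero_params)
```

In practice the input sequence would be the interval-scaled feature vectors and the parameters would be learned during training.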

4 Experiments

In order to evaluate the performance, the proposed method is demonstrated on a dataset of AMD images with two and three time points and compared to a single time point method.

4.1 Data

Data consists of color fundus images taken from the Age-Related Eye Disease Study (AREDS) [2], the most extensive clinical study into AMD.

AMD is a leading cause of vision loss worldwide [28]. There are two main stages of AMD, early/intermediate, defined by small- to medium-sized drusen, and advanced, defined by geographic atrophy (GA) or neovascularization (nAMD) [2]. Drusen can be observed as yellow-white lipid deposits under the retina, varying greatly in size and morphology [27]. The exact causes of AMD are unknown; however, studies have shown that smoking and genetics are significant risk factors [6]. Risk factors for progression from early/intermediate to advanced AMD are also unknown; however, there is evidence that drusen and optic disk characteristics are important [18, 24]. Vision loss can be avoided with interventions such as anti-VEGF treatment; however, disease progression and the need for treatment are often hard to predict [16]. This highlights the need for accurate prognostic models.

We extracted 4,903 eyes, which had four visits, complete with images and diagnoses at each visit, with no diagnosis of advanced AMD during the first three visits. Advanced AMD was defined as either Central GA, nAMD, or both GA and nAMD. We used the last visit as ground truth to make our prediction based on the first three visits. Of the 4,903 included eyes, 453 (9.2%) progressed to advanced AMD.

We randomly split the data into 60% training (2942 eyes, 272 progressing), 20% validation (981 eyes, 91 progressing), and 20% testing (980 eyes, 90 progressing) datasets. To reduce the possibility of data leakage, patients with both eyes included were kept within the same data split. Example images are given in Figure 2.
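A patient-level split of this kind can be sketched as follows. This is an illustration rather than the authors' code: the helper name `split_by_patient`, the identifier format, and the seed are assumptions. Because the fractions are applied to patients, the resulting eye counts are only approximately 60/20/20.

```python
import random
from collections import defaultdict


def split_by_patient(eye_to_patient, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign eyes to train/val/test so that both eyes of a patient always
    land in the same split (hypothetical helper).

    eye_to_patient maps an eye identifier to its patient identifier.
    """
    patients = sorted(set(eye_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    split_of = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            split_of[patient] = "train"
        elif i < n_train + n_val:
            split_of[patient] = "val"
        else:
            split_of[patient] = "test"
    splits = defaultdict(list)
    for eye, patient in eye_to_patient.items():
        splits[split_of[patient]].append(eye)
    return dict(splits)


# Toy cohort: 10 patients with two eyes each (identifiers are made up).
eyes = {f"p{i}_{side}": f"p{i}" for i in range(10) for side in ("L", "R")}
splits = split_by_patient(eyes)
```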

Figure 2: Sample images from a progressing patient (top) and non-progressing patient (bottom). The first three images show early/intermediate AMD, while the fourth image shows whether they progressed to advanced AMD or not. (a) Progressing patient: visits at 0, 2, 3, and 7.5 years, graded early/intermediate, early/intermediate, early/intermediate, and advanced. (b) Non-progressing patient: visits at 0, 2, 3, and 4 years, all graded early/intermediate.

4.2 Preprocessing

Any images in the dataset where the patient had already progressed to advanced AMD, or without the required three previous images plus a fourth image for the outcome, were excluded. The images were automatically cropped: the difference between the original image and the background color was calculated, an offset was added to the difference, and the bounding box was calculated from the result. Image values were rescaled from between 0 and 255 to between 0 and 1. All images were resized to 256x256 pixels to reduce computational requirements. Right eye images were flipped, such that the optic disc on all images was located on the left. No prior feature extraction or segmentation/registration is required, so that our method is as generalizable to other diseases and modalities as possible. All preprocessing was automated, with no subjective human input required.
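The cropping, rescaling, and flipping steps can be sketched on a small grayscale image stored as nested lists. This is a simplified illustration under stated assumptions: the offset is treated as a tolerance on the difference from the background (the value 10 is arbitrary), the image has a single channel, and resizing to 256x256 is omitted because it would normally be delegated to an imaging library. All function names are hypothetical.

```python
def crop_to_content(image, background=0, offset=10):
    """Crop a 2D grayscale image (list of rows, values 0-255) to the bounding
    box of pixels that differ from `background` by more than `offset`.

    Assumption: `offset` acts as a tolerance so near-background noise is
    ignored; the default of 10 is illustrative only.
    """
    rows = [i for i, row in enumerate(image)
            if any(abs(v - background) > offset for v in row)]
    cols = [j for j in range(len(image[0]))
            if any(abs(row[j] - background) > offset for row in image)]
    if not rows or not cols:
        return image  # nothing stands out from the background
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def rescale(image):
    """Map pixel values from [0, 255] to [0, 1]."""
    return [[v / 255.0 for v in row] for row in image]


def flip_horizontal(image):
    """Mirror the image left to right (applied to right-eye images)."""
    return [row[::-1] for row in image]


def preprocess(image, is_right_eye):
    out = rescale(crop_to_content(image))
    return flip_horizontal(out) if is_right_eye else out


# Toy 4 x 5 image with a dark border and one faint noise pixel (value 5).
toy = [
    [0, 0, 0, 0, 0],
    [0, 5, 255, 51, 0],
    [0, 0, 102, 0, 0],
    [0, 0, 0, 0, 0],
]
left_eye = preprocess(toy, is_right_eye=False)
right_eye = preprocess(toy, is_right_eye=True)
```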

4.3 Computing

All analyses were carried out on a Linux machine with a Titan X 12GB GPU and 32GB of memory. Deep learning was conducted in Python 3.7 using the Keras 2.2.4 library [8] with TensorFlow [1] as the base library. Confidence intervals were calculated using R 3.4.4 [21], with the pROC package [22].

Optimization was carried out with the Adam optimizer [15] with an initial learning rate of 0.0001. We used binary cross-entropy as the loss function. If the loss did not improve after ten epochs, the learning rate was reduced to two-thirds of its value. Model checkpoints and early stopping were used to prevent overfitting, with the best model being picked according to the validation loss.
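The learning rate schedule and checkpoint selection can be sketched in plain Python. This is an illustration, not the Keras callbacks used in the experiments. It assumes that the monitored quantity is the validation loss and that "did not improve" means no new minimum for ten consecutive epochs; the class name `ReduceOnPlateau` is hypothetical.

```python
class ReduceOnPlateau:
    """Track the monitored loss and shrink the learning rate on a plateau
    (hypothetical helper mirroring the schedule described above)."""

    def __init__(self, lr=1e-4, factor=2.0 / 3.0, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's loss and return (lr, is_best)."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
            return self.lr, True  # caller would save a checkpoint here
        self.wait += 1
        if self.wait >= self.patience:
            self.lr *= self.factor  # reduce to two-thirds of its value
            self.wait = 0
        return self.lr, False


# One improving epoch followed by ten epochs without improvement.
schedule = ReduceOnPlateau()
history = [schedule.step(loss) for loss in [0.9] + [0.95] * 10]
```

The `is_best` flag marks the epochs at which a checkpoint would be kept, so the model with the lowest monitored loss is the one retained.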

4.4 Metrics

We evaluate model performance using the commonly used area under the receiver operating characteristic curve (AUC) [12], optimal sensitivity, and optimal specificity, with the optimal operating point determined by Youden’s index. To assess whether differences in these measures between models are significant, we construct 95% confidence intervals: De Long’s method [11] is used for AUC, and bootstrapping with 2000 samples is used for sensitivity and specificity. Results from De Long’s test [11] are also reported.
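As an illustration of these metrics, the sketch below computes the AUC as the probability that a positive case scores above a negative case, and finds the threshold maximizing Youden's index J = sensitivity + specificity - 1. It does not reproduce the pROC computations or the confidence intervals; the function names and toy data are hypothetical.

```python
def auc(labels, scores):
    """AUC as the probability that a positive scores above a negative
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_operating_point(labels, scores):
    """Return (threshold, sensitivity, specificity) maximizing
    J = sensitivity + specificity - 1, predicting positive when
    score >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < thr)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (thr, sens, spec)
    return best


# Toy data: 1 = progressed, 0 = did not progress.
labels = [0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.9]
toy_auc = auc(labels, scores)
threshold, sensitivity, specificity = youden_operating_point(labels, scores)
```

A bootstrap interval for sensitivity or specificity could be layered on top by resampling the label and score pairs with replacement and recomputing the operating point each time.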

4.5 Results and comparisons

Results are reported using two and three time points with our method, to assess the benefit of adding additional time points. We compare our results with a method similar to those used in previous work [4, 3], using single time points with a CNN. Taking the last available image, we fine tune InceptionV3 [26] pretrained on ImageNet [23] to classify as progression or no progression.

The proposed method using three time points achieves an AUC, optimal sensitivity, and optimal specificity of 0.950 (0.923, 0.977), 0.878 (0.810, 0.945), and 0.887 (0.866, 0.907), respectively. This is a significant improvement over the single time point method, which had an AUC, sensitivity, and specificity of 0.857 (0.823, 0.890), 0.867 (0.796, 0.937), and 0.760 (0.731, 0.788). These results show a statistically significant increase in AUC and specificity and a non-significant increase in sensitivity when using the proposed three time point method over the previous single time point methods. De Long’s test gave a p-value of 0.0001, indicating a significant difference in AUCs. The significant increase in specificity without a loss in sensitivity shows that our model can reduce false positives without increasing false negatives, compared with the previous model.

The method utilizing two time points gave an AUC, sensitivity, and specificity of 0.932 (0.905, 0.958), 0.811 (0.730, 0.892), and 0.892 (0.872, 0.913). The three time point method had a non-significant increase over two time points. This may indicate that, in this example, more than two time points do not add any significant predictive value. Results are presented in Table 1 and the receiver operating characteristic is shown in Figure 3. Experiments without interval scaling were also conducted and showed a significant decrease in performance.

Method                        AUC                     Sensitivity             Specificity
Single image method           0.857 (0.823, 0.890)    0.867 (0.796, 0.937)    0.760 (0.731, 0.788)
Proposed method (2 images)    0.932 (0.905, 0.958)    0.811 (0.730, 0.892)    0.892 (0.872, 0.913)
Proposed method (3 images)    0.950 (0.923, 0.977)    0.878 (0.810, 0.945)    0.887 (0.866, 0.907)

Table 1: Area Under the Receiver Operating Characteristic (AUC) with confidence intervals constructed by De Long’s method. Sensitivity and Specificity with confidence intervals constructed by bootstrapping with 2000 samples.
Figure 3: Receiver operating characteristic curve for the single time point InceptionV3 method, the proposed method with two time points, and the proposed method with three time points. Increasing the number of time points appears to increase the area under the curve. Faded bands show 95% confidence intervals.

4.6 Class activation maps

To determine if our network is identifying the correct features and to reduce the black-box nature of deep learning, we create class activation maps [30] for each time point. We altered the top of the network slightly to achieve this, adding a dense layer after the GRU layer. While this altered network showed no significant change in predictive performance, it increased the network size by around a factor of 2. The class activation maps are shown in Figure 4, alongside original images for comparison.
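For readers unfamiliar with class activation maps, the sketch below shows the generic formulation from [30]: the map is a weighted sum of the final convolutional feature maps, using the weights that connect the pooled features to the output score. How this is wired into the per-time-point network with the added dense layer is not reproduced here; the function names and toy values are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) using the K weights that
    connect the pooled features to the output score."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, w in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * fmap[i][j]
    return cam


def normalize(cam):
    """Rescale to [0, 1] for display as a heat map."""
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Toy example with K = 2 feature maps of size 2 x 2.
maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
cam = class_activation_map(maps, class_weights=[0.5, 1.0])
heat = normalize(cam)
```

The normalized map would then be upsampled to the input resolution and overlaid on the fundus image.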

The class activation maps show that areas with high concentrations of drusen are considered relevant by the network; this is expected and shows that our network is identifying the correct features. In some images, the optic disk is also highlighted, confirming that optic disk characteristics are indeed important factors in AMD progression, as observed previously [18, 24]. In images where drusen are challenging to see, the network appears to rely solely on the optic disk in making a prediction. It is also interesting to note that the network seems to be surer of the area of interest in images that are closer to the prediction time point. In a clinical setting, these maps may be useful when justifying the prediction.

Figure 4: Class activation maps show the areas that the network finds useful in making the prediction, for two example eyes (Eye 1 and Eye 2), each at 0, 2, and 3 years. Original images are also shown for reference. The network correctly identifies areas of interest in AMD progression. In blurred images, drusen are difficult to see; the class activations show that the network uses the optic disk to reach decisions in this case. All example images are taken from the testing dataset.

5 Conclusions

In this work, we proposed a novel deep learning prognostic model to predict the future onset of disease. The proposed method addresses the challenge of analyzing multiple longitudinal images with uneven time points, without the need for prior image annotation. By taking into account the varying times between observed images through interval scaling, we significantly improve performance over a single time point method. Our method provides a statistically significant increase in specificity, which is critical in contexts such as screening. Because the method uses time intervals, the interval to the observed outcome can be extended to predict further into the future, which is also useful for screening. Future work is required to assess the generalizability of the proposed method to other diseases and to extend its use to a screening context.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. Cited by: §4.3.
  • [2] Age-Related Eye Disease Study Research Group (1999) The age-related eye disease study (AREDS): design implications. AREDS report no. 1. Controlled clinical trials 20 (6), pp. 573–600. External Links: Document Cited by: §1, §4.1, §4.1.
  • [3] F. Arcadu, F. Benmansour, A. Maunz, J. Willis, Z. Haskova, and M. Prunotto (2019) Deep learning algorithm predicts diabetic retinopathy progression in individual patients. NPJ Digit Med 2, pp. 92. External Links: Document Cited by: §1, §2.1, §4.5.
  • [4] B. Babenko, S. Balasubramanian, K. E. Blumer, G. S. Corrado, L. Peng, D. R. Webster, N. Hammel, and A. V. Varadarajan (2019) Predicting progression of age-related macular degeneration from fundus images using deep learning. arXiv preprint arXiv:1904.05478. Cited by: §1, §2.1, §4.5.
  • [5] E. Bohn, N. Tangri, B. Gali, B. Henderson, M. M. Sood, P. Komenda, and C. Rigatto (2013) Predicting risk of mortality in dialysis patients: a retrospective cohort study evaluating the prognostic value of a simple chest X-ray. BMC Nephrology 14 (1), pp. 263. External Links: Document Cited by: §1.
  • [6] U. Chakravarthy, T. Y. Wong, A. Fletcher, E. Piault, C. Evans, G. Zlateva, R. Buggage, A. Pleil, and P. Mitchell (2010) Clinical risk factors for age-related macular degeneration: a systematic review and meta-analysis. BMC Ophthalmology 10 (1), pp. 31. External Links: Document Cited by: §4.1.
  • [7] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §3.
  • [8] F. Chollet et al. (2015) Keras. External Links: Link Cited by: §4.3.
  • [9] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.
  • [10] L. de Sisternes, N. Simon, R. Tibshirani, T. Leng, and D. L. Rubin (2014) Quantitative SD-OCT imaging biomarkers as indicators of age-related macular degeneration progression. Invest Ophthalmol Vis Sci 55 (11), pp. 7093–103. External Links: Document Cited by: §1, §1.
  • [11] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44 (3), pp. 837–845. External Links: ISSN 0006-341X Cited by: §4.4.
  • [12] F. E. Harrell, R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati (1982) Evaluating the yield of medical tests. JAMA 247 (18), pp. 2543–2546. External Links: ISSN 0098-7484 Cited by: §4.4.
  • [13] A. Hilario, J. M. Sepulveda, A. Perez-Nuñez, E. Salvador, J. M. Millan, A. Hernandez-Lain, V. Rodriguez-Gonzalez, A. Lagares, and A. Ramos (2014) A prognostic model based on preoperative MRI predicts overall survival in patients with diffuse gliomas. American Journal of Neuroradiology 35 (6), pp. 1096. External Links: Document Cited by: §1.
  • [14] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: Document Cited by: §3.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.
  • [16] J. L. Kovach, S. G. Schwartz, H. W. Flynn, and I. U. Scott (2012) Anti-VEGF treatment strategies for wet amd. Journal of ophthalmology 2012. External Links: ISSN 2090-004X Cited by: §4.1.
  • [17] J. Kwon, K. Kim, K. Jeon, S. E. Lee, H. Lee, H. Cho, J. O. Choi, E. Jeon, M. Kim, J. Kim, K. Hwang, S. C. Chae, S. H. Baek, S. Kang, D. Choi, B. Yoo, K. H. Kim, H. Park, M. Cho, and B. Oh (2019) Artificial intelligence algorithm for predicting mortality of patients with acute heart failure. PLOS ONE 14 (7), pp. e0219302. External Links: Document Cited by: §1.
  • [18] S. K. Law, Y. H. Sohn, D. Hoffman, K. Small, A. L. Coleman, and J. Caprioli (2004) Optic disk appearance in advanced age-related macular degeneration. American Journal of Ophthalmology 138 (1), pp. 38–45. External Links: Document Cited by: §4.1, §4.6.
  • [19] T. Leng, L. de Sisternes, Q. Chen, J. Ma, V. Mahendra, and D. Rubin (2013) Automated prediction of AMD progression from quantified sd-oct images. Investigative Ophthalmology & Visual Science 54 (15), pp. 4150–4150. External Links: ISSN 1552-5783 Cited by: §1.
  • [20] S. Niu, L. de Sisternes, Q. Chen, D. L. Rubin, and T. Leng (2016) Fully automated prediction of geographic atrophy growth using quantitative spectral-domain optical coherence tomography biomarkers. Ophthalmology 123 (8), pp. 1737–1750. External Links: Document Cited by: §1.
  • [21] R Core Team (2019) R: a language and environment for statistical computing.. R Foundation for Statistical Computing. External Links: Link Cited by: §4.3.
  • [22] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J. Sanchez, and M. Müller (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12 (1), pp. 77. External Links: Document Cited by: §4.3.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. External Links: ISSN 0920-5691 Cited by: §2.1, §4.5.
  • [24] T.A. Scheufele, J.G. McHenry, and A.O. Edwards (2004) Optic neuropathy and age related macular degeneration. Investigative Ophthalmology & Visual Science 45 (13), pp. 1627–1627. Cited by: §4.1, §4.6.
  • [25] E. W. Steyerberg, K. G. M. Moons, D. A. van der Windt, J. A. Hayden, P. Perel, S. Schroter, R. D. Riley, H. Hemingway, D. G. Altman, and P. Group (2013) Prognosis research strategy (PROGRESS) 3: prognostic model research. PLoS Medicine 10 (2), pp. e1001381–e1001381. External Links: Document Cited by: §1.
  • [26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. Cited by: §1, §2.1, §4.5.
  • [27] B. M. Williams, P. I. Burgess, and Y. Zheng (2019) Chapter 13 - drusen and macular degeneration. Book Section In Computational Retinal Image Analysis, E. Trucco, T. MacGillivray, and Y. Xu (Eds.), pp. 245–272. External Links: Document Cited by: §4.1.
  • [28] W. L. Wong, X. Su, X. Li, C. M. G. Cheung, R. Klein, C. Cheng, and T. Y. Wong (2014) Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. The Lancet Global Health 2 (2), pp. e106–e116. External Links: Document Cited by: §4.1.
  • [29] J. Yim, R. Chopra, T. Spitz, J. Winkens, A. Obika, C. Kelly, H. Askham, M. Lukic, J. Huemer, K. Fasler, G. Moraes, C. Meyer, M. Wilson, J. Dixon, C. Hughes, G. Rees, P. T. Khaw, A. Karthikesalingam, D. King, D. Hassabis, M. Suleyman, T. Back, J. R. Ledsam, P. A. Keane, and J. De Fauw (2020) Predicting conversion to wet age-related macular degeneration using deep learning. Nature Medicine 26 (6), pp. 892–899. Cited by: §1.
  • [30] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929. Cited by: §4.6.