Prognostic models are an essential component of personalized medicine, allowing health experts to predict the future course of disease in individual patients. Advances in computing power and an abundance of data have allowed increasingly sophisticated models to be developed. Most prognostic models developed to date use statistical methods such as logistic regression; these models require prior feature extraction, either manual or automatic, and are limited in the number of variables they can include. Feature extraction can be costly and time-consuming, especially for imaging data. Deep learning avoids explicit feature extraction, allowing models to be developed without handcrafted features, which makes it especially useful for imaging data. Prognostic deep learning models have been developed in several fields, primarily ophthalmology, cardiology, and neurology, and for several modalities, including magnetic resonance imaging (MRI), optical coherence tomography (OCT), color fundus photography, and X-ray.
Current prognostic models that use deep learning to analyze imaging data either use automatic feature extraction algorithms to extract known features or consider only a single time point. Models built on feature extraction train algorithms on annotated images to extract relevant features, such as volumes in OCT data; those features are then fed into a traditional statistical model (see [10, 20, 19] for examples). Manual feature extraction is time-consuming and requires expert readers. More recently, Yim et al. proposed a method that automatically segments OCT layers before classification. This method outperformed human experts; however, automatic feature extraction requires annotations during training, which are not always available when the features are unknown or difficult to quantify, as is the case with color fundus imaging.
Many models take the last available image and fine-tune a pretrained convolutional neural network (CNN) on it, with Inception V3 being a popular choice due to its generalizability and high performance in a variety of tasks. Unlike the feature extraction approach, this method may be applied to any image, even when the features are not explicitly known. However, it creates a separate issue: by using only one image, these models may fail to capture the temporal pattern across time points.
Here, we develop a prognostic model to predict the progression of disease from longitudinal images. The method is applicable to any modality and any longitudinal imaging data, even when the causes of progression are unknown or cannot be quantified. The proposed method is demonstrated on a dataset consisting of 4,903 eyes with age-related macular degeneration (AMD), taken from the AREDS dataset. We show that by considering the time interval between images and adopting a method from time series analysis, we can provide significantly improved prediction performance.
Our contributions are as follows:
Propose a novel method to predict the future prognosis of a patient from longitudinal images
Introduce interval scaling, which allows for uneven time intervals between visits
Demonstrate the method on the largest longitudinal dataset and attain state-of-the-art performance, outperforming previous methods
Given images at times $t_1, \dots, t_n$, we wish to predict the diagnosis at time $t_{n+1}$, where $t_2 - t_1 = t_3 - t_2 = \dots = t_{n+1} - t_n$ does not necessarily hold; such uneven spacing is common in a clinical setting.
The proposed method consists of three stages. First, we use a pretrained CNN, with weights shared across time points, to reduce each image to a single feature vector. Then, the feature vectors are combined and an interval scaling is applied to account for the uneven time intervals; this weights the most recent time points as more important in making the final prediction. Finally, a recurrent neural network (RNN) classifies the image sequence as progressing or non-progressing. An overview of the proposed framework is shown in Figure 1.
2.1 Inception V3
We begin by fine-tuning a pretrained CNN, with shared weights, on each image to extract feature vectors. In our work, we chose InceptionV3 pretrained on ImageNet. InceptionV3 increases accuracy over previous networks while remaining computationally efficient through the use of factorized kernels, batch normalization, and regularization. InceptionV3 is considered highly generalizable, with an accuracy greater than 78.1% on the ImageNet dataset. The InceptionV3 network produces a feature vector of length $F = 2048$ for each image at each time point. This network has previously been used to provide state-of-the-art results in single time point methods [4, 3] and is used here as a feature extractor.
2.2 Interval scaling
To account for uneven time intervals, we implement a triangular window function to create a smoothing model. Whereas a simple moving average model weights all time points equally, smoothing models weight values closer to the prediction time as more useful in the prediction. For each sequence of images at times $t_1, \dots, t_n$, where $t_{n+1}$ is the time point at which we want to predict, we rescale each time $t_i$ into an interval scale that grows as $t_i$ approaches $t_{n+1}$. The feature vector for each image is then multiplied by its corresponding interval scale. This scaling weights the images such that those closer to the time point of interest are considered more important than those observed further away, thus allowing the network to account for uneven time intervals.
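The following is a minimal sketch of interval scaling, not the exact scaling used in this work. It assumes a linear (triangular) weight that equals 1 at the prediction time and falls to 0 one `window` earlier; the function name `interval_scale` and the `window` parameter are hypothetical and introduced only for illustration.

```python
import numpy as np


def interval_scale(features, times, t_pred, window):
    """Scale each visit's feature vector by an assumed triangular-window weight.

    features: array of shape (n_visits, n_features), one row per visit
    times:    visit times t_1..t_n in years
    t_pred:   prediction time t_{n+1}, later than every visit
    window:   hypothetical half-width of the triangular window, in years
    """
    features = np.asarray(features, dtype=float)
    times = np.asarray(times, dtype=float)
    if np.any(times >= t_pred):
        raise ValueError("every visit must precede the prediction time")
    # Assumed weight: falls linearly from 1 at t_pred to 0 at t_pred - window.
    weights = np.clip(1.0 - (t_pred - times) / window, 0.0, 1.0)
    # Broadcast one weight per visit across that visit's features.
    return features * weights[:, None], weights


# Visits at 0, 2 and 3 years, predicting at 4 years with an 8-year window.
features = np.ones((3, 2048))
scaled, weights = interval_scale(features, [0.0, 2.0, 3.0], t_pred=4.0, window=8.0)
```

Under these assumptions, later visits receive larger weights, and two sequences with the same images but different gaps to the prediction time produce different inputs to the recurrent layer.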
3 GRU prediction
To predict whether the patient will progress to advanced AMD, we combine the interval-corrected feature vectors into an $n \times F$ matrix, where $n$ is the number of time points and $F$ is the number of features. We apply a Gated Recurrent Unit (GRU) with a filter size of 1, resulting in a single value. A GRU was chosen over Long Short-Term Memory (LSTM) units because it is more computationally efficient. LSTM units perform better on longer sequences; however, in this case, we only have three time points. The sigmoid activation function then scales the value to between 0 and 1. Patients with values greater than a threshold of 0.5 are predicted to progress.
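The prediction step can be sketched as a forward pass in NumPy. This is an illustration under stated assumptions rather than the trained model: the weights are random stand-ins, the gate equations follow one common GRU convention, and applying the sigmoid to the final hidden state is one reading of the description above. The names `init_gru`, `gru_last_state`, and `predict_progression` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(n_features, units, rng, scale=0.01):
    """Random stand-in weights for the update (z), reset (r) and candidate (h) parts."""
    return {
        gate: {
            "W": rng.normal(0.0, scale, (units, n_features)),  # input weights
            "U": rng.normal(0.0, scale, (units, units)),       # recurrent weights
            "b": np.zeros(units),                              # bias
        }
        for gate in ("z", "r", "h")
    }


def gru_last_state(sequence, p):
    """Run a GRU over sequence (n_steps, n_features) and return the final state."""
    h = np.zeros(p["z"]["b"].shape[0])
    for x in sequence:
        z = sigmoid(p["z"]["W"] @ x + p["z"]["U"] @ h + p["z"]["b"])  # update gate
        r = sigmoid(p["r"]["W"] @ x + p["r"]["U"] @ h + p["r"]["b"])  # reset gate
        cand = np.tanh(p["h"]["W"] @ x + p["h"]["U"] @ (r * h) + p["h"]["b"])
        h = z * h + (1.0 - z) * cand  # blend previous state and candidate
    return h


def predict_progression(sequence, p, threshold=0.5):
    """Map the single GRU output to a probability and a progressing/non-progressing label."""
    prob = float(sigmoid(gru_last_state(sequence, p)[0]))
    return prob, prob > threshold


rng = np.random.default_rng(0)
params = init_gru(n_features=2048, units=1, rng=rng)
sequence = rng.random((3, 2048))  # stand-in for three interval-scaled feature vectors
prob, progressing = predict_progression(sequence, params)
```

With a single unit, the recurrent layer reduces the $n \times F$ input to one scalar, so the classification head adds very few parameters on top of the CNN.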
To evaluate its performance, the proposed method is demonstrated on a dataset of AMD images using two and three time points and compared to a single time point method.
The data consist of color fundus images taken from the Age-Related Eye Disease Study (AREDS), the most extensive clinical study into AMD.
AMD is a leading cause of vision loss worldwide. There are two main stages of AMD: early/intermediate, defined by small- to medium-sized drusen, and advanced, defined by geographic atrophy (GA) or neovascularization (nAMD). Drusen can be observed as yellow-white lipid deposits under the retina, varying greatly in size and morphology. The exact causes of AMD are unknown; however, studies have shown that smoking and genetics are significant risk factors. Risk factors for progression from early/intermediate to advanced AMD are also unknown; however, there is evidence that drusen and optic disk characteristics are important [18, 24]. Vision loss can be avoided with interventions such as anti-VEGF treatment; however, disease progression and the need for treatment are often hard to predict. This highlights the need for accurate prognostic models.
We extracted 4,903 eyes that had four visits, complete with images and diagnoses at each visit, and no diagnosis of advanced AMD during the first three visits. Advanced AMD was defined as central GA, nAMD, or both GA and nAMD. We used the first three visits as input and the diagnosis at the last visit as the ground truth for prediction. Of the 4,903 included eyes, 453 (9.2%) progressed to advanced AMD.
We randomly split the data into 60% training (2942 eyes, 272 progressing), 20% validation (981 eyes, 91 progressing), and 20% testing (980 eyes, 90 progressing) datasets. To reduce the possibility of data leakage, patients with both eyes included were kept within the same data split. Example images are given in Figure 2.
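A patient-level split of this kind can be sketched as follows. This is an assumption-laden illustration, not the procedure used here: the function name `split_by_patient`, the eye and patient identifiers, and the seed are all hypothetical, and the fractions are applied to patients, so the resulting eye counts are only approximately 60/20/20.

```python
import random


def split_by_patient(eye_to_patient, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole patients to train/val/test so both eyes share a split."""
    patients = sorted(set(eye_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    split_of = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            split_of[patient] = "train"
        elif i < n_train + n_val:
            split_of[patient] = "val"
        else:
            split_of[patient] = "test"
    # Every eye follows its patient, which prevents leakage between splits.
    splits = {"train": [], "val": [], "test": []}
    for eye, patient in eye_to_patient.items():
        splits[split_of[patient]].append(eye)
    return splits


# Hypothetical IDs: ten patients, the even-numbered ones contributing both eyes.
eye_to_patient = {}
for pid in range(10):
    eye_to_patient[f"p{pid}_left"] = pid
    if pid % 2 == 0:
        eye_to_patient[f"p{pid}_right"] = pid
splits = split_by_patient(eye_to_patient)
```

Splitting on patients rather than eyes matters because the two eyes of one person are strongly correlated, so placing them in different splits would inflate the measured test performance.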
Figure 2: Example images. (a) Progressing patient at 0, 2, 3, and 7.5 years. (b) Non-progressing patient at 0, 2, 3, and 4 years.
Any images in the dataset where the patient had already progressed to advanced AMD, or which lacked the required three previous images plus a fourth image for prediction, were excluded. The images were automatically cropped: the difference between the original image and the background color was calculated, an offset was added to the difference, and the bounding box was calculated from the result. Image values were rescaled from the range 0 to 255 to the range 0 to 1. All images were resized to 256×256 pixels to reduce computational requirements. Right eye images were flipped so that the optic disk was located on the left in all images. No prior feature extraction, segmentation, or registration is required, so that our method is as generalizable to other diseases and modalities as possible. All preprocessing was automated, with no subjective human input required.
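The preprocessing steps above can be sketched in NumPy as follows. The background color, the offset value, and the nearest-neighbour resize are assumptions made for illustration (a real pipeline would more likely use an image library's resampling), and the function name `preprocess` is hypothetical.

```python
import numpy as np


def preprocess(img, is_right_eye, size=256, background=0, offset=-10):
    """Crop to the fundus, resize, flip right eyes, and rescale to [0, 1].

    img: uint8 array of shape (height, width, 3)
    background, offset: assumed values; the offset suppresses near-background noise
    """
    # Difference from the background colour, with the offset added.
    diff = np.abs(img.astype(np.int16) - np.asarray(background, dtype=np.int16)).max(axis=2)
    shifted = np.clip(diff + offset, 0, 255)
    # Bounding box of everything that is still non-zero.
    rows = np.flatnonzero(shifted.any(axis=1))
    cols = np.flatnonzero(shifted.any(axis=0))
    if rows.size == 0:
        raise ValueError("image contains only background")
    img = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Nearest-neighbour resize to size x size (a stand-in for library resampling).
    r_idx = np.arange(size) * img.shape[0] // size
    c_idx = np.arange(size) * img.shape[1] // size
    img = img[r_idx][:, c_idx]
    # Mirror right eyes so the optic disk sits on the same side in every image.
    if is_right_eye:
        img = img[:, ::-1]
    return img.astype(np.float32) / 255.0


# Synthetic example: a bright rectangle on a black background, dimmer on its left edge.
img = np.zeros((100, 120, 3), dtype=np.uint8)
img[20:80, 30:90] = 200
img[20:80, 30:40] = 100
left = preprocess(img, is_right_eye=False)
right = preprocess(img, is_right_eye=True)
```

Because every step is deterministic and parameterized only by the background color and offset, the same function can be applied unchanged to every visit of every eye.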
All analyses were carried out on a Linux machine with a Titan X 12GB GPU and 32GB of memory. Deep learning was conducted in Python 3.7 using the Keras 2.2.4 library as the base library. Confidence intervals were calculated using R 3.4.4 with the pROC package.
Optimization was carried out with the Adam optimizer with an initial learning rate of 0.0001. We used binary cross-entropy as the loss function. If the loss did not improve for ten epochs, the learning rate was reduced to two-thirds of its current value. Model checkpoints and early stopping were used to prevent overfitting, with the best model selected according to the validation loss.
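The learning rate schedule and model selection described above can be illustrated with a framework-free sketch that replays a list of per-epoch losses. It assumes that a single monitored loss drives the learning rate reduction, the checkpoint, and early stopping; the early stopping patience of 25 epochs and the function name `replay_schedule` are hypothetical, since the text does not state them.

```python
def replay_schedule(losses, lr=1e-4, factor=2 / 3, lr_patience=10, stop_patience=25):
    """Replay per-epoch losses and report the checkpoint epoch, final lr, and epochs run."""
    best_loss = float("inf")
    best_epoch = None
    wait_lr = 0    # epochs since the last improvement or lr reduction
    wait_stop = 0  # epochs since the last improvement
    epochs_run = 0
    for epoch, loss in enumerate(losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # Improvement: this is where a model checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            wait_lr = wait_stop = 0
            continue
        wait_lr += 1
        wait_stop += 1
        if wait_lr >= lr_patience:
            lr *= factor  # reduce the learning rate to two-thirds
            wait_lr = 0
        if wait_stop >= stop_patience:
            break  # early stopping
    return best_epoch, lr, epochs_run


# Two improving epochs followed by ten epochs without improvement.
best_epoch, final_lr, epochs_run = replay_schedule([1.0, 0.9] + [0.95] * 10)
```

In this example the checkpoint stays at the second epoch and the learning rate is reduced once, which mirrors how the best model is picked by loss rather than by the last epoch trained.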
We evaluate model performance using the area under the receiver operating characteristic curve (AUC), together with the optimal sensitivity and optimal specificity determined by Youden’s index. To assess whether the differences in these measures between models are significant, we construct confidence intervals. De Long’s method is used to construct confidence intervals for the AUC, and bootstrapping with 2000 samples is used to calculate 95% confidence intervals for sensitivity and specificity. Results from De Long’s test are also reported.
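As an illustration of these metrics, the sketch below computes the AUC, the Youden-optimal operating point, and a percentile bootstrap interval in NumPy. It is not the pROC implementation used for the reported results: the bootstrap here is a plain, non-stratified percentile bootstrap, De Long's method is not implemented, and the function names and the synthetic scores are assumptions.

```python
import numpy as np


def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def youden_optimum(y_true, scores):
    """Return (threshold, sensitivity, specificity) maximising sens + spec - 1."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)[::-1]
    pred = scores[None, :] >= thresholds[:, None]  # one row per threshold
    sens = (pred & y_true).sum(axis=1) / y_true.sum()
    spec = (~pred & ~y_true).sum(axis=1) / (~y_true).sum()
    k = int(np.argmax(sens + spec - 1.0))
    return thresholds[k], sens[k], spec[k]


def bootstrap_ci(y_true, scores, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(y_true, scores)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, y_true.size, y_true.size)
        if y_true[idx].all() or not y_true[idx].any():
            continue  # a resample needs both classes
        values.append(stat(y_true[idx], scores[idx]))
    return np.percentile(values, [100 * alpha / 2, 100 * (1 - alpha / 2)])


# Synthetic scores: positives tend to score higher than negatives.
rng = np.random.default_rng(1)
labels = np.array([0] * 150 + [1] * 50)
scores = np.concatenate([rng.normal(0.3, 0.15, 150), rng.normal(0.6, 0.15, 50)])
point_auc = auc(labels, scores)
threshold, sensitivity, specificity = youden_optimum(labels, scores)
sens_ci = bootstrap_ci(labels, scores, lambda y, s: youden_optimum(y, s)[1], n_boot=200)
```

Re-deriving the Youden threshold inside each resample, as done here, is one possible choice; fixing the threshold from the full sample is another, and the two give different interval widths.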
4.5 Results and comparisons
Results are reported for our method using two and three time points to assess the benefit of additional time points. We compare our results with a method similar to those used in previous work [4, 3], which applies a CNN to a single time point: taking the last available image, we fine-tune InceptionV3 pretrained on ImageNet to classify each eye as progressing or non-progressing.
The proposed method using three time points achieves an AUC, optimal sensitivity, and optimal specificity of 0.950 (0.923, 0.977), 0.878 (0.810, 0.945), and 0.887 (0.866, 0.907), respectively; this is a significant improvement over the single time point method which had AUC, sensitivity, and specificity of 0.857 (0.823, 0.890), 0.867 (0.796, 0.937), and 0.760 (0.731, 0.788). These results show a statistically significant increase in AUC and specificity and a non-significant increase in sensitivity when using the proposed three time point method over the previous single time point methods. De Long’s test gave a p-value 0.0001, indicating a significant difference in AUCs. This significant increase in specificity without a loss in sensitivity shows our model can reduce false positives without increasing false negatives, over the previous model.
The method using two time points gave an AUC, sensitivity, and specificity of 0.932 (0.905, 0.958), 0.811 (0.730, 0.892), and 0.760 (0.731, 0.788), respectively. The improvement of the three time point method over the two time point method was not significant. This may indicate that, in this example, a third time point does not add significant predictive value. Results are presented in Table 1, and the receiver operating characteristic curves are shown in Figure 3. Experiments without interval scaling were also conducted and showed a significant decrease in performance.
4.6 Class activation maps
To determine whether our network is identifying the correct features and to reduce the black-box nature of deep learning, we create class activation maps for each time point. To achieve this, we altered the top of the network slightly, adding a dense layer after the GRU layer. While this altered network showed no significant change in predictive performance, it increased the network size by around a factor of 2. The class activation maps are shown in Figure 4, alongside the original images for comparison.
The class activation maps show that areas with high concentrations of drusen are considered relevant by the network; this is expected and suggests that our network is identifying the correct features. In some images, the optic disk is also highlighted, which is consistent with previous observations that optic disk characteristics are important factors in AMD progression [18, 24]. In images where drusen are difficult to see, the network appears to rely solely on the optic disk in making a prediction. It is also interesting to note that the network seems more certain of the area of interest in images that are closer to the prediction time point. In a clinical setting, these maps may be useful when justifying the prediction.
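For readers unfamiliar with class activation maps, the sketch below shows the generic computation: a weighted sum of the final convolutional feature maps, normalized and upsampled for overlay. It does not reproduce how the maps were derived from the altered network with the added dense layer; the array shapes, the per-channel weights, and the function name `class_activation_map` are assumptions for illustration.

```python
import numpy as np


def class_activation_map(feature_maps, channel_weights, upsample=1):
    """Weighted sum of feature maps, normalised to [0, 1] and block-upsampled.

    feature_maps:    array of shape (height, width, channels) for one image
    channel_weights: array of shape (channels,), one weight per feature map
    upsample:        integer factor for nearest-neighbour upsampling
    """
    cam = np.asarray(feature_maps, dtype=float) @ np.asarray(channel_weights, dtype=float)
    cam = cam - cam.min()
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    # Repeat each cell so the map can be overlaid on the larger input image.
    return np.kron(cam, np.ones((upsample, upsample)))


# Toy example: only the top-right cell of a 2x2 grid is active.
feature_maps = np.zeros((2, 2, 3))
feature_maps[0, 1, :] = 1.0
heatmap = class_activation_map(feature_maps, np.array([1.0, 2.0, 3.0]), upsample=2)
```

Computing one such map per time point is what allows the maps for earlier and later visits to be compared side by side.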
Figure 4: Class activation maps alongside the original images for Eye 1 and Eye 2, each at 0, 2, and 3 years.
In this work, we proposed a novel deep learning prognostic model to predict the future onset of disease. The proposed method addresses the challenge of analyzing multiple longitudinal images with uneven time points, without the need for prior image annotation. Introducing interval scaling was shown to significantly improve performance over a single time point method. We show that by taking into account the varying times between observed images, we can significantly improve the performance of a longitudinal prognostic model. Our method provides a statistically significant increase in specificity, which is critical in contexts such as screening. Because our method uses time intervals, the interval to the observed outcome can be extended to predict further into the future, which is also useful in a screening context. Future work is required to assess the generalizability of the proposed method to other diseases and to extend its use to a screening context.
[1] TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
[2] (1999) The age-related eye disease study (AREDS): design implications. AREDS report no. 1. Controlled Clinical Trials 20 (6), pp. 573–600.
[3] (2019) Deep learning algorithm predicts diabetic retinopathy progression in individual patients. NPJ Digit Med 2, pp. 92.
[4] (2019) Predicting progression of age-related macular degeneration from fundus images using deep learning. arXiv preprint arXiv:1904.05478.
[5] (2013) Predicting risk of mortality in dialysis patients: a retrospective cohort study evaluating the prognostic value of a simple chest X-ray. BMC Nephrology 14 (1), pp. 263.
[6] (2010) Clinical risk factors for age-related macular degeneration: a systematic review and meta-analysis. BMC Ophthalmology 10 (1), pp. 31.
[7] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[8] (2015) Keras.
[9] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[10] (2014) Quantitative SD-OCT imaging biomarkers as indicators of age-related macular degeneration progression. Invest Ophthalmol Vis Sci 55 (11), pp. 7093–103.
[11] (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44 (3), pp. 837–845.
[12] (1982) Evaluating the yield of medical tests. JAMA 247 (18), pp. 2543–2546.
[13] (2014) A prognostic model based on preoperative MRI predicts overall survival in patients with diffuse gliomas. American Journal of Neuroradiology 35 (6), pp. 1096.
[14] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[15] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[16] (2012) Anti-VEGF treatment strategies for wet AMD. Journal of Ophthalmology 2012.
[17] (2019) Artificial intelligence algorithm for predicting mortality of patients with acute heart failure. PLOS ONE 14 (7), pp. e0219302.
[18] (2004) Optic disk appearance in advanced age-related macular degeneration. American Journal of Ophthalmology 138 (1), pp. 38–45.
[19] (2013) Automated prediction of AMD progression from quantified SD-OCT images. Investigative Ophthalmology & Visual Science 54 (15), pp. 4150.
[20] (2016) Fully automated prediction of geographic atrophy growth using quantitative spectral-domain optical coherence tomography biomarkers. Ophthalmology 123 (8), pp. 1737–1750.
[21] (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing.
[22] pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12 (1), pp. 77.
[23] ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
[24] (2004) Optic neuropathy and age related macular degeneration. Investigative Ophthalmology & Visual Science 45 (13), pp. 1627.
[25] (2013) Prognosis research strategy (PROGRESS) 3: prognostic model research. PLoS Medicine 10 (2), pp. e1001381.
[26] Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
[27] (2019) Chapter 13 - drusen and macular degeneration. In Computational Retinal Image Analysis, E. Trucco, T. MacGillivray, and Y. Xu (Eds.), pp. 245–272.
[28] (2014) Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. The Lancet Global Health 2 (2), pp. e106–e116.
[29] (2020) Predicting conversion to wet age-related macular degeneration using deep learning. Nature Medicine 26 (6), pp. 892–899.
[30] Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929.