Owing to rapid advances in medical imaging, the amount of longitudinal imaging data is growing quickly [Fujimoto_2016]. Longitudinal imaging is an especially effective observational approach for exploring how disease processes develop over time in a number of patients, and it provides a good indication of disease progression. It enables personalized precision medicine [Vogl_2017] and is a rich source of data for automated image analysis. However, the automated modelling of disease progression faces many challenges: despite the large amount of data, associated human-level annotations are rarely available, which leads to several limitations in current modelling methods. They are either limited to the annotated cross-sectional samples and miss most of the temporal information, or they are based on known handcrafted features and simplified models of low complexity.
To overcome these limitations, we propose a solution based on self-supervised learning. Self-supervised learning consists of learning an auxiliary, or so-called pretext, task on a dataset without the need for human annotations, in order to generate a generic representation. This representation can then be transferred to solve complex supervised tasks with a limited amount of data. We propose a pretext task that exploits the availability of large numbers of unlabelled longitudinal images, focuses on learning temporal-specific patterns, and is not limited by irregular time-sampling or poor image registration, which are common issues in longitudinal datasets. The learnt representation is compact and allows for transfer learning to more specific problems with a limited amount of data. We demonstrate the capability of our proposed method on a longitudinal retinal optical coherence tomography (OCT) dataset of patients with early/intermediate age-related macular degeneration (AMD).
In current ophthalmic clinical practice, optical coherence tomography (OCT) is the most commonly used retinal imaging modality. It provides 3-dimensional in-vivo information of the (pathological) retina at micrometer resolution. Typically, volumetric OCTs are rasterized scans, where each B-scan is a cross-sectional image of the retinal morphology. AMD is a major epidemic among the elderly, and the advanced stage of AMD is the most common cause of blindness in industrialized countries, with two identified main forms: geographic atrophy (GA) and choroidal neovascularization (CNV). The progression from the early or intermediate, symptomless stages of AMD to the advanced stage is extremely variable between patients and very difficult to estimate clinically. Robust and accurate prediction at the individual patient level is a critically important medical need in order to manage or prevent irreversible vision loss.
Current analysis of clinical data and disease course development revolves around traditional statistical approaches, where time-series models are fit to a limited number of known biomarkers describing the disease status [Vogl_2017, Vogl2017a]. Such models are problematic for a disease such as AMD, where the underlying mechanisms are still poorly understood and the main biomarkers are yet undiscovered [schmidt-erfurth_prediction_2018]. Disease progression modelling from longitudinal imaging data has been most active in neuroimaging, for modelling the progression of Alzheimer’s disease, largely due to the public availability of longitudinal brain magnetic resonance images from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). There, a variety of regression-based methods have been applied to fit logistic or polynomial functions to the longitudinal dynamics of each imaging biomarker [sabuncu_event_2014]. Other efforts have focused on non-parametric Gaussian process (GP) models [lorenzi_efficient_2015], but specifying the joint covariance structure of the image features to account for spatial and temporal correlation has been found to be computationally prohibitive. In addition, these methods are often linear and/or treat the data as cross-sectional, and thus do not exploit non-linear relationships. Self-supervised learning has already been applied successfully to time-series video data in computer vision, where Lee et al. developed a solution based on time-shuffling [DBLP:journals/corr/abs-1708-01246], which inspired our method.
We propose a novel self-supervised task suited for learning a compact representation of longitudinal imaging data that captures time-specific patterns without the need for segmentation or a priori information. We assume that information about the future evolution and disease progression is encoded, to a certain degree, in an observed series of images. Hence, we train our model on a pretext task: estimating the time interval between pairs of images of the same patient. Thus, an implicit aging model is built, resulting in a compact representation that contains knowledge about healthy and disease evolution. As this task does not rely on annotations, perfect registration or regular sampling intervals, we are able to incorporate large unlabelled longitudinal datasets without time- and cost-intensive pre-processing or annotation generation. We demonstrate that the model is able to learn the given pretext task and that it is capable of capturing the longitudinal evolution. Furthermore, we show that such a representation can be transferred to other longitudinal problems, such as a prediction or survival estimation setting, with a limited amount of training data. In our case, we predict the future conversion to the advanced stage of AMD within a certain time interval. Compared with a model learnt from scratch or transferred from a non-longitudinal task (e.g., an autoencoder), we observe a boost in accuracy when fine-tuning a model trained on our new pretext task.
2 Self-supervised learning of spatio-temporal representations
In this section, we present the pretext task that we chose for self-supervised learning of longitudinal imaging data, the deep networks that we implemented to solve it, and finally how we extract the representations to transfer them to different problems.
Self-supervised learning paradigm
To learn the pretext task, we train a deep Siamese network (Fig. 1) [bromley1994signature]. The Siamese structure reflects the symmetry of the problem and allows us to train a single encoder, which can later be transferred either as a fixed feature extractor or as a pretrained network for fine-tuning.
Consider two OCT images acquired from the same patient at two time-points of acquisition. Both images are encoded into compact representations by the same encoder network. The pair-interval network takes the two representations and predicts the time interval, that is, the relative time difference between the pair of B-scans. Note that the pair does not need to be in chronological order.
The entire Siamese network is trained by minimizing the loss of this regression task. After training, the encoder network can be transferred to extract features for other tasks.
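The construction of training pairs for this pretext task can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_interval_pairs`, the scan identifiers, and the sign convention (second time-point minus first) are assumptions made for the example.

```python
from itertools import permutations


def make_interval_pairs(visits):
    """Build (scan_a, scan_b, interval) training triples for one patient.

    `visits` is a list of (scan_id, time_in_months) tuples. Both orderings
    of every pair are emitted, so the target interval can be negative.
    Assumed sign convention: interval = time of scan_b - time of scan_a.
    """
    pairs = []
    for (scan_a, t_a), (scan_b, t_b) in permutations(visits, 2):
        pairs.append((scan_a, scan_b, t_b - t_a))
    return pairs


# Hypothetical patient with irregularly spaced visits (months since baseline).
visits = [("scan_0", 0.0), ("scan_1", 3.0), ("scan_2", 9.5)]
pairs = make_interval_pairs(visits)
```

Because only the difference between acquisition times enters the target, irregular sampling needs no special handling, and emitting both orderings yields a set of targets that is symmetric around zero.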
The encoder is implemented as a deep convolutional network with three blocks of three layers each (each layer: a 3x3 convolution with batch normalization and ReLU activation, with 16, 32 and 64 channels for blocks 1, 2 and 3, respectively) and a max pooling layer at the end of each block. The last block is followed by a fully connected layer (128 units), which outputs the encoded version of the B-scan (denoted as vgg). We also tested a version with skip connections followed by concatenation between the blocks (denoted as dense). The pair-interval network has two fully connected layers and outputs the estimated time interval.
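The layer layout of the vgg-like encoder can be written down as a small specification. The sketch below is framework-agnostic and only illustrative: `build_encoder_spec` and `conv_weight_count` are hypothetical helpers, and the single-channel (grayscale) input is an assumption, as the text does not state the number of input channels.

```python
def build_encoder_spec(in_channels=1, block_channels=(16, 32, 64),
                       layers_per_block=3, embedding_units=128):
    """Return a list of (kind, in_channels, out_channels) layer descriptors.

    in_channels=1 is an assumption (grayscale B-scans).
    """
    spec = []
    channels = in_channels
    for out_channels in block_channels:
        for _ in range(layers_per_block):
            spec.append(("conv3x3+bn+relu", channels, out_channels))
            channels = out_channels
        # One max pooling layer closes each block.
        spec.append(("maxpool", channels, channels))
    # The input size of the dense layer depends on the image size, so it is
    # left unspecified here.
    spec.append(("dense", None, embedding_units))
    return spec


def conv_weight_count(spec):
    """Count 3x3 kernel weights only (no biases, no batch-norm parameters)."""
    return sum(9 * c_in * c_out
               for kind, c_in, c_out in spec
               if kind.startswith("conv"))


spec = build_encoder_spec()
n_conv_weights = conv_weight_count(spec)
```

Under these assumptions the convolutional part has roughly 120k kernel weights, which gives a sense of the capacity that has to be fitted during fine-tuning.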
The network is trained by minimizing the L2 loss with the Adam gradient-descent algorithm [kingma2014adam]. We trained the network for 600k steps and computed the validation loss every 12k steps. We kept the model with the lowest validation loss.
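The regression objective and the checkpoint selection can be illustrated with a few lines of plain Python. This is a sketch under the assumption that the checkpoint with the smallest validation loss is the one retained; the validation values are hypothetical and the helper names are not taken from the original work.

```python
def l2_loss(predicted, target):
    """Mean squared error between predicted and true time intervals."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def select_checkpoint(validation_losses):
    """Return the training step whose validation loss is smallest."""
    return min(validation_losses, key=validation_losses.get)


# Hypothetical validation curve: training step -> validation L2 loss.
validation_losses = {12_000: 95.1, 24_000: 81.4, 36_000: 84.0}
best_step = select_checkpoint(validation_losses)
example_loss = l2_loss([1.0, 3.0], [0.0, 1.0])
```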
Transfer of representations
The representations extracted from the encoder network are used as input for a classification problem. This transfer allows us to evaluate whether these representations contain meaningful information regarding the patient-specific evolution of AMD. We directly transfer the trained encoder from the deep Siamese network to a classification task by adding a final block to perform classification (Fig. 2). The classification block consists of two fully connected layers, the first one with ReLU activation and the last one with softmax. The resulting network is fine-tuned by minimizing cross-entropy.
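A forward pass through such a classification block, together with the cross-entropy used for fine-tuning, can be sketched in plain Python. The weights and the two-dimensional representation below are toy values chosen for illustration only; all function names are assumptions.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    # Subtract the maximum for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_block(representation, w1, b1, w2, b2):
    """Two dense layers: ReLU after the first, softmax after the second."""
    hidden = relu(dense(representation, w1, b1))
    return softmax(dense(hidden, w2, b2))


def cross_entropy(probabilities, true_class):
    return -math.log(probabilities[true_class])


# Toy example: 2-d representation, 2 hidden units, 2 classes.
probs = classification_block([1.0, -2.0],
                             w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
                             w2=[[1.0, 0.0], [-1.0, 0.0]], b2=[0.0, 0.0])
loss = cross_entropy(probs, true_class=0)
```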
3 Experiments and Results
Here, we provide details about the self-supervised training and its evaluation with respect to the time prediction. Then, we transfer the representation obtained from self-supervised training to a classification task, where we predict from longitudinal retinal OCT the conversion from intermediate to advanced AMD within different time intervals.
The longitudinal dataset used for training and validation contains 3308 OCT scans from 221 patients (420 eyes) diagnosed with intermediate AMD. Follow-up scans were acquired at three- or six-month intervals for up to 7 years and were included in the dataset up to the time-point of conversion to advanced AMD. Follow-up acquisitions were automatically registered by the scanner software (Spectralis OCT®, Heidelberg Engineering, GER). Within this population, 48 eyes converted to GA, an advanced stage of AMD.
The patients are divided into 6 fixed folds to perform cross-validation. For the pretext task and the prediction of conversion, we used one fold as test, one as validation and the remaining folds as training data. The pretraining with self-supervised learning uses the same training sets as the prediction of conversion.
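A patient-level split of this kind can be sketched as follows. The round-robin assignment and the choice of the fold following the test fold as validation fold are assumptions made for the example; the text only states that the folds are fixed and that one fold each serves as test and validation set.

```python
def assign_folds(patient_ids, n_folds=6):
    """Assign each patient to one fixed fold (round-robin over sorted ids)."""
    return {pid: i % n_folds for i, pid in enumerate(sorted(patient_ids))}


def split_for_fold(fold_of, test_fold, n_folds=6):
    """Return (train, validation, test) patient lists for one CV iteration."""
    val_fold = (test_fold + 1) % n_folds  # assumption: next fold validates
    train, val, test = [], [], []
    for pid, fold in fold_of.items():
        if fold == test_fold:
            test.append(pid)
        elif fold == val_fold:
            val.append(pid)
        else:
            train.append(pid)
    return train, val, test


# Hypothetical patient identifiers.
patients = [f"patient_{i:03d}" for i in range(12)]
fold_of = assign_folds(patients)
train, val, test = split_for_fold(fold_of, test_fold=5)
```

Splitting by patient rather than by scan keeps both eyes and all visits of a patient in the same subset, so that the same split can be reused for pretraining and for the conversion prediction without leakage.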
First, the bottom-most layer of the retina, the Bruch’s membrane (BM), was segmented using the method of [chen_three-dimensional_2012], followed by a flattening of the concave structure and an alignment of the BM over all scans. Finally, scans were cropped to the same physical field of view (6 mm × 0.5 mm) and resampled to a common size.
3.2 Learning the pretext task
The success of learning the pretext task was evaluated using R and the mean absolute error (MAE) of the interval prediction. In addition, we verified how well the network could predict the temporal order of samples by evaluating the accuracy of predicting the correct sign of the interval. To obtain a volume prediction (the network is trained on B-scans), we took the mean of all B-scan predictions. The best performance was achieved by the vgg-like model, with an R of 0.566, an MAE of 7.69 months and an accuracy for order prediction of 0.843 (see Table 1). Fig. 3 displays the MAE and order prediction accuracy for the different time intervals (the visit intervals are roughly multiples of three months). We observed that the network was able to predict the order even for the smallest interval (3 months), with an accuracy of 0.66 (random performance would yield 0.5). The absolute time-interval regression error was greater for large intervals (Fig. 3), with a tendency to underestimate the interval, probably because of a uniform distribution of training intervals centered on zero. However, the relative error decreased for larger intervals. These results show that it is possible to estimate the time interval between two OCTs to a certain extent, which makes it possible to learn a generic evolution model for the retina. In the next experiment, we verified that this model contained relevant longitudinal information to solve a specific clinical prediction task.
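The volume-level aggregation and two of the reported metrics (MAE and order accuracy) can be sketched in a few lines. The function names and the numbers are hypothetical; the sketch assumes that true intervals are never zero, which holds when pairs consist of two distinct visits.

```python
def volume_prediction(bscan_predictions):
    """Mean of the per-B-scan interval predictions of one volume pair."""
    return sum(bscan_predictions) / len(bscan_predictions)


def mean_absolute_error(predicted, target):
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def order_accuracy(predicted, target):
    """Fraction of pairs whose predicted interval has the correct sign."""
    correct = sum((p > 0) == (t > 0) for p, t in zip(predicted, target))
    return correct / len(target)


# Hypothetical per-B-scan predictions (months) for three volume pairs.
per_bscan = [[2.0, 4.0], [-7.0, -5.0], [1.0, -3.0]]
predicted = [volume_prediction(p) for p in per_bscan]
target = [6.0, -6.0, 3.0]

mae = mean_absolute_error(predicted, target)
accuracy = order_accuracy(predicted, target)
```

In this toy example the first pair is underestimated but correctly ordered, which mirrors the situation described above: the sign of the interval can be right even when its magnitude is not.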
3.3 Conversion to advanced AMD classification
We applied the representation to a binary classification task, where we predicted from a single OCT representation whether a patient's eye will convert to GA within defined intervals of 6, 12 and 18 months (three separate binary problems). For the evaluation dataset we used the same 6 folds and their train, validation and test subsets. However, we included only one OCT per patient, in order to simulate a single visit. For patients who converted, we chose the acquisition with the largest distance to conversion within the given interval. For patients showing no conversion within the study, we chose the acquisition within the given interval with the furthest distance to the patient's last acquisition. For each OCT volume, we used the trained encoder to extract a representation vector from the central B-scan.
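One possible reading of this selection rule is sketched below. It is an interpretation, not the authors' code: the helper `select_visit` is hypothetical, the reference time-point is taken to be the conversion for converting eyes and the last acquisition otherwise, and the visit at the reference time-point itself is excluded.

```python
def select_visit(visit_times, interval, conversion_time=None):
    """Pick one visit for an eye and return (visit_time, label), or None.

    Times are in months. For a converting eye the reference is
    `conversion_time` and the label is 1; otherwise the reference is the
    last visit and the label is 0. Among the visits lying at most
    `interval` months before the reference, the one furthest from the
    reference (i.e. the earliest) is returned.
    """
    converted = conversion_time is not None
    reference = conversion_time if converted else max(visit_times)
    candidates = [t for t in visit_times if 0 < reference - t <= interval]
    if not candidates:
        return None
    return min(candidates), int(converted)


# Hypothetical eyes: one converting at month 21, one without conversion.
converter = select_visit([0, 3, 6, 9, 12, 18], interval=12, conversion_time=21)
non_converter = select_visit([0, 6, 12, 18, 24], interval=6)
```

Choosing the earliest admissible visit makes the task as hard as the interval allows, since the selected scan is as far as possible from the event it is supposed to anticipate.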
We restricted the input to a single time-point to verify that the encoded information allows the stage of the patient to be evaluated directly. We fine-tuned the encoder on each fold and kept the best epoch on the validation set. We tested two different baselines with the same structure as the model transferred from the self-supervised learning: (i) trained from scratch (no transfer), and (ii) transferred from an autoencoder trained on cross-sectional OCTs with mean squared error as reconstruction loss. The autoencoder baseline allows us to verify that our method learns specific longitudinal features. The networks pretrained with our method or with an autoencoder are fine-tuned with the Adam optimizer by minimizing cross-entropy. After 20 epochs, the best epoch is selected using the validation AuC. We performed a grid search on the learning rate, the number of features in the classification block, and the dropout rate; each setting was repeated 5 times. The best hyperparameter combination was chosen based on the best average validation AuC. The final cross-validated test performance was evaluated using ROC AuC and average precision. We observed that all the networks rapidly overfit on the dataset, which was expected given the size of the dataset (around 260 training samples) and the high capacity of the network. For all intervals, the self-supervised method shows the best performance for both ROC AuC and average precision. On the other hand, transfer from the OCT autoencoder is only marginally better than the network trained from scratch (Table 2). The difference between the transfer from the OCT autoencoder and our new self-supervised task shows that the latter captures longitudinal information that is not available in the autoencoder. Our method makes it possible to quickly train a deep network on this challenging task with a small number of annotations.
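The hyperparameter selection described above (grid search, five repetitions per setting, choice by average validation AuC) can be sketched as follows. The grid values and the validation scores are purely hypothetical, as the text does not list them, and `best_configuration` is an illustrative helper.

```python
from itertools import product
from statistics import mean


def best_configuration(validation_auc):
    """Return the hyperparameter tuple with the highest mean validation AuC."""
    return max(validation_auc, key=lambda cfg: mean(validation_auc[cfg]))


# Hypothetical grid: (learning rate, classification features, dropout rate).
grid = list(product([1e-3, 1e-4], [32, 64], [0.0, 0.5]))

# Hypothetical validation AuC of five repetitions for three of the settings.
validation_auc = {
    (1e-3, 32, 0.5): [0.70, 0.72, 0.68, 0.71, 0.69],
    (1e-4, 64, 0.5): [0.74, 0.73, 0.76, 0.75, 0.72],
    (1e-4, 64, 0.0): [0.80, 0.60, 0.70, 0.65, 0.75],
}
best = best_configuration(validation_auc)
```

Averaging over repetitions guards against selecting a setting because of one lucky run, which matters here because the networks overfit quickly on the small dataset.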
ROC AuC
| Model | 6 m. | 12 m. | 18 m. |
|---|---|---|---|
| Training from scratch (i) | 0.640 ± 0.067 | 0.651 ± 0.076 | 0.676 ± 0.095 |
| OCT autoencoder (ii) | 0.650 ± 0.144 | 0.519 ± 0.060 | 0.677 ± 0.088 |
| Self-supervised (ours) | 0.753 ± 0.061 | 0.784 ± 0.067 | 0.773 ± 0.074 |

Average precision
| Model | 6 m. | 12 m. | 18 m. |
|---|---|---|---|
| Training from scratch (i) | 0.309 ± 0.114 | 0.282 ± 0.0125 | 0.300 ± 0.107 |
| OCT autoencoder (ii) | 0.277 ± 0.152 | 0.283 ± 0.144 | 0.329 ± 0.117 |
| Self-supervised (ours) | 0.367 ± 0.084 | 0.394 ± 0.115 | 0.463 ± 0.133 |

Table 2: Conversion classification of patients suffering from intermediate AMD. We performed a 6-fold cross-validation and display the ROC AuC and average precision (mean ± standard deviation) for three settings: conversion within 6 months (m.), 12 months and 18 months.
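For reference, the two reported metrics can be computed from binary labels and predicted scores as in the following sketch. It uses the pairwise-ranking definition of the ROC AuC and the non-interpolated definition of average precision, and it assumes that no two samples share the same score; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Hypothetical predictions for five eyes (1 = converted within the interval).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

Average precision depends on the proportion of positives, which is low in this dataset; this is why its values are considerably lower than the ROC AuC values for the same models.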
4 Discussion and conclusions
Effective modeling of disease progression from longitudinal data has been a long-pursued goal in medical image analysis. We presented a method based on self-supervised learning, which builds an implicit evolution model by taking advantage of longitudinal unlabelled data. This method builds, in an unsupervised way, representations that capture time-specific patterns in the data. The representation can be transferred to solve many longitudinal problems, such as patient-specific early prediction or risk estimation. Unlike reconstruction-based methods, the pretext task can be trained on irregular longitudinal data, with irregular time-sampling or limited anatomical registration. The trained encoder can be transferred easily to related longitudinal problems with a limited amount of annotated data. In this paper, we applied the method to longitudinal OCTs of patients with intermediate AMD. The learned features were successfully transferred to the problem of predicting upcoming conversion to advanced AMD. There are, however, some limitations in the proposed method. The method is trained on single B-scans instead of the full volume, which greatly reduces the memory footprint but introduces some intermediate steps to generate a patient representation and makes the pretext task harder, as the evolution within each OCT volume might not be uniformly distributed. Although we demonstrated the capability of our approach on retinal OCT scans, the method is not limited to this imaging modality or anatomical region, and may be applied to other longitudinal medical imaging datasets as well.
This work was funded by the Christian Doppler Research Association, the Austrian Federal Ministry for Digital and Economic Affairs and the National Foundation for Research, Technology and Development. We thank the NVIDIA Corporation for a GPU donation.