
TINC: Temporally Informed Non-Contrastive Learning for Disease Progression Modeling in Retinal OCT Volumes

by   Taha Emre, et al.
MedUni Wien

Recent contrastive learning methods achieved state-of-the-art in low label regimes. However, the training requires large batch sizes and heavy augmentations to create multiple views of an image. With non-contrastive methods, the negatives are implicitly incorporated in the loss, allowing different images and modalities as pairs. Although the meta-information (i.e., age, sex) in medical imaging is abundant, the annotations are noisy and prone to class imbalance. In this work, we exploited already existing temporal information (different visits from a patient) in a longitudinal optical coherence tomography (OCT) dataset using temporally informed non-contrastive loss (TINC) without increasing complexity and need for negative pairs. Moreover, our novel pair-forming scheme can avoid heavy augmentations and implicitly incorporates the temporal information in the pairs. Finally, these representations learned from the pretraining are more successful in predicting disease progression where the temporal information is crucial for the downstream task. More specifically, our model outperforms existing models in predicting the risk of conversion within a time frame from intermediate age-related macular degeneration (AMD) to the late wet-AMD stage.



1 Introduction

The scarcity of manually annotated labels is a major limitation for classification tasks in medical image analysis. Self-supervised learning (SSL) has shown great promise in exploiting the availability of unlabeled medical data, outperforming models trained from random weights or pretrained with non-medical images in difficult supervised settings [1]. Traditional SSL methods rely on pretext tasks that are believed to be semantically relevant to the downstream task, such as jigsaw-puzzle solving and rotation angle prediction.

In recent years, contrastive learning (CL) methods surpassed the pretext-based SSL in unsupervised representation learning. They learn similar representations of two heavily augmented views (positive pairs) of a sample while pushing away the others in the representation space as negatives. The goal is to find representations that are semantically meaningful and robust to image perturbations. Following this, CL methods have also been adapted for medical images. Li et al. [11] addressed the class imbalance problem by sample re-weighting during contrastive training and devised an augmentation scheme for 3D volumes. Chen et al. [7] proposed a sampling strategy that feeds two frames from an ultrasound video as a pair to encode the temporal information for the CL. They found that sampling and augmentation strategies were crucial for the downstream task. Azizi et al. [1] showed that if two different images of a patient included the same pathology, they formed more informative positive pairs than the heavily augmented pairs. They also reported that mixing the supervised and the unsupervised data for the CL improved the downstream task’s performance. These methods were based on the contrastive InfoNCE loss [15]. However, the success of the CL largely depends on the quality of the negative samples [8]. This introduces two challenges for medical imaging: (i) large batch sizes, and (ii) explicit negatives in a batch. Furthermore, particularly in longitudinal studies, the large batch sizes (over a thousand) are not compatible with the number of patients (a few hundred) and would create negative pairs from the same patient.

The developments in non-contrastive learning methods [6, 20, 3] avoid the need for explicit negatives and consequently for large batch sizes. They implicitly learn to push the negatives away with stop-gradient [6, 9], clustering [5], or by creating discrepancy between pairs through a specific loss [3, 20]. The Barlow Twins [20] loss function achieves this by making the correlation matrix of the embeddings of the two views close to the identity matrix, and VICReg [3] by calculating a hinge variance loss within a batch of embeddings. Notably, VICReg requires no architectural tricks, large batch sizes, or normalization. Both of them use a simple Siamese network and a multi-layer perceptron (MLP) projector on the representations for the embeddings. They also allow constructing pairs from different images, and unlike contrastive methods, it is even possible to use multiple modalities.

In this paper, we focus on non-contrastive SSL in longitudinal imaging datasets, and we propose a new similarity loss to exploit the temporal meta-information without increasing the complexity of the non-contrastive training. Also, we introduce a new pair-forming strategy by using scans from different visits of a patient as inputs to the model. In this regard, our work is one of the first to build on non-contrastive learning with continuous labels (time difference between visits).

Clinical background

Optical coherence tomography (OCT) imaging is widely used in clinical practice to provide a 3D volume of the retina as a series of cross-sectional slices called B-scans. Age-related macular degeneration (AMD) is the leading cause of vision loss in the elderly population [4]. It progresses from the early/intermediate stage with few visual symptoms to a late stage with severe vision loss. Conversion to the late stage can take two forms, wet and dry AMD. Wet-AMD is defined by the formation of new vessels. An intravitreal injection can improve the patients’ vision, but it is most effective when applied soon after conversion to wet-AMD. This motivated the medical imaging research community to develop risk estimation models for conversion to wet-AMD. AMD progression prediction in OCT has been studied using statistical methods [16, 18, 14] based on biomarkers and genomics. Schmidt-Erfurth et al. [14] approached the conversion prediction as a survival problem and used a Cox model. Initial deep learning works showed the importance of standardized preprocessing in B-scans [13]. Recently, Yim et al. [19] exploited surrogate tasks like retinal biomarker segmentation in OCT volumes to improve the performance. On the other hand, Banerjee et al. [2] used imaging biomarkers and demographics to train an RNN on sequential data. As an SSL approach, Rivail et al. [12] pre-trained a Siamese network by predicting the time difference between the branches.


We improve on learning representations such that they (a) capture temporal meta-information in longitudinal imaging data, and (b) do not suffer from dimensional collapse (only a small part of the representation space is useful [10]). Our contribution is twofold. First, we propose a simple yet effective similarity loss called TINC to implicitly embed the temporal information without increasing the non-contrastive loss complexity. We hypothesize that B-scans acquired closer in time should be closer in the representation space. We chose VICReg [3] as the base for TINC because its similarity loss explicitly reduces the distance between representations and allows alternatives in its place. Second, instead of aggressive augmentation for pair generation from a single B-scan, we form pairs by picking two moderately augmented versions of two different B-scans from a patient’s OCT volumes acquired at different times. Finally, TINC improved performance in the difficult task of predicting conversion from the intermediate stage to wet-AMD within a clinically relevant time interval of 6 months.

2 Method

We modified VICReg’s invariance (similarity) term by constraining it with the normalized absolute time difference (no temporal order among the inputs) between the input images. The original invariance term is the mean-squared error (MSE) between two unnormalized embeddings. The time difference acts as a margin on how low the invariance term can get. As the time difference between two visits increases, the similarity between a pair of the respective B-scans should decrease. Thus, the distance measure should lie within a margin, rather than collapse to a single point as the VICReg invariance term encourages. In other words, the representations should slightly differ due to the time difference between the scans of a patient.

Figure 1: The overall workflow. Two B-scans sampled from different visits of a patient are fed to the network. $\tau$ and $\tau'$ are the transformations for the views $X$ and $X'$. An encoder produces representations $Y$ and $Y'$, then a projector expands them to embeddings $Z$ and $Z'$, on which the loss is calculated.

Given a batch of $n$ patients with multiple visits, let visits $t_1$ and $t_2$ be a pair of time points randomly sampled from each patient’s available visit dates within a certain time interval. Randomly selected B-scans from the OCT volumes at times $t_1$ and $t_2$ are augmented by random augmentations $\tau$ and $\tau'$ to give the two views $X$ and $X'$. First, the encoder produces representations $Y$ and $Y'$, then the projector (also called expander) expands the representations to embeddings $Z$ and $Z'$ with embedding dimension $d$ (Fig. 1).

The loss terms of the original VICReg work as follows: $s$ (Eq. 1) is the invariance term or the similarity loss, $v$ (Eq. 2) is the variance term to keep a variance margin between different pairs of embeddings (prevents representation collapse), and $c$ (Eq. 3) is the covariance term that forces each component to be as informative as possible (prevents dimensional collapse).

$$s(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}(z_i, z'_i), \qquad \mathrm{MSE}(z_i, z'_i) = \frac{1}{d} \lVert z_i - z'_i \rVert_2^2 \tag{1}$$

$$v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\left(0, \gamma - \mathrm{std}(z^j)\right) \tag{2}$$

$$c(Z) = \frac{1}{d} \sum_{j \neq k} \left[C(Z)\right]_{j,k}^2 \tag{3}$$

where in $v$, std is the standard deviation of $z^j$, which is the vector of the $j$th embedding component values along the batch. For $\gamma$, we used 1. In the covariance term $c$, $C(Z)$ is the covariance matrix of the embedding components computed over the batch. $v$ and $c$ are calculated for $Z$ and $Z'$ separately. $s$ is the MSE between two vectors. Finally, the total loss is the weighted sum of these three.
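The three VICReg terms described above can be written out in a few lines. The following is a minimal, framework-free Python sketch rather than the authors' implementation: embeddings are plain lists of lists with shape n × d, the function names are illustrative, the small constant `eps` inside the square root is an assumption borrowed from common VICReg implementations, and the default coefficients 25, 5 and 1 are the values reported in the experiments section.

```python
import math


def invariance(z, z_prime):
    """Invariance term: batch mean of the per-pair MSE between embeddings."""
    d = len(z[0])
    per_pair = [
        sum((a - b) ** 2 for a, b in zip(u, w)) / d
        for u, w in zip(z, z_prime)
    ]
    return sum(per_pair) / len(per_pair)


def variance(z, gamma=1.0, eps=1e-4):
    """Variance term: hinge on the std of each component along the batch."""
    n, d = len(z), len(z[0])
    total = 0.0
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        # eps is an assumed numerical stabilizer, not stated in the text.
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance(z):
    """Covariance term: squared off-diagonal entries of the batch covariance."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    total = 0.0
    for j in range(d):
        for k in range(d):
            if j != k:
                cov_jk = sum(
                    (row[j] - means[j]) * (row[k] - means[k]) for row in z
                ) / (n - 1)
                total += cov_jk ** 2
    return total / d


def vicreg_loss(z, z_prime, lam=25.0, mu=5.0, nu=1.0):
    """Weighted sum; variance and covariance are computed per branch."""
    return (
        lam * invariance(z, z_prime)
        + mu * (variance(z) + variance(z_prime))
        + nu * (covariance(z) + covariance(z_prime))
    )
```

The nested loops keep the correspondence with the three terms easy to follow; a practical implementation would use batched tensor operations instead.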

2.0.1 Temporally Informed Non-Contrastive Loss

The temporal label is defined as the difference between visit dates $t_1$ and $t_2$. $\Delta t$ is the absolute value of the difference scaled to 0-1 using a Min-Max scaler with given $\Delta t_{min}$ and $\Delta t_{max}$. We use $\Delta t$ as a margin, so that the distance between the two embeddings should not be greater than it. Inspired by the epsilon insensitive loss from support vector regression, the invariance term in VICReg is replaced with our TINC loss:

$$\ell_{TINC}(z, z', \Delta t) = \max\left(0, \mathrm{MSE}(z, z') - \Delta t\right)$$

where $z$ and $z'$ are the embeddings of the two views and MSE is their mean-squared error.

The TINC term forces the distance between the two representations to be within a non-zero margin, set proportionally to the time difference between visits. As $\Delta t$ gets close to 1, the margin becomes wider, and the loss does not enforce a strict similarity. On the other hand, when $\Delta t$ is close to 0, the TINC loss becomes similar to the MSE between the two representations. In principle, the values of the embedding components could diminish, resulting in a collapse. Then the distance would be smaller than the margin, not contributing to the overall loss. However, the variance term in VICReg counteracts this by enforcing a standard deviation between different pairs, preventing the components from having infinitesimal values. We kept the variance and covariance losses from VICReg unmodified.
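As a concrete illustration of the margin behaviour described above, the sketch below assumes the TINC term has the epsilon-insensitive form max(0, MSE(z, z') - Δt), applied per pair and averaged over the batch. This form, the function names, and the example scaler bounds of 90 and 540 days are assumptions made for illustration, not values taken from the paper.

```python
def scaled_time_difference(day_a, day_b, min_days, max_days):
    """Absolute visit-date difference, min-max scaled to [0, 1]."""
    delta = abs(day_a - day_b)
    scaled = (delta - min_days) / (max_days - min_days)
    return min(1.0, max(0.0, scaled))  # clamp pairs that fall outside the bounds


def tinc_pair(z, z_prime, delta_t):
    """Epsilon-insensitive distance: zero inside the margin, linear outside."""
    mse = sum((a - b) ** 2 for a, b in zip(z, z_prime)) / len(z)
    return max(0.0, mse - delta_t)


def tinc_loss(z_batch, z_prime_batch, delta_ts):
    """Batch mean of the per-pair TINC term."""
    terms = [
        tinc_pair(z, zp, dt)
        for z, zp, dt in zip(z_batch, z_prime_batch, delta_ts)
    ]
    return sum(terms) / len(terms)


# Hypothetical bounds (90 and 540 days), used only for this example.
delta_t = scaled_time_difference(0, 315, min_days=90, max_days=540)
print(delta_t)                                      # 0.5
print(tinc_pair([0.0, 0.0], [1.0, 1.0], delta_t))   # 0.5 (outside the margin)
print(tinc_pair([0.0, 0.0], [0.5, 0.5], delta_t))   # 0.0 (inside the margin)
```

With a margin of zero the per-pair term reduces to the plain MSE, which matches the limiting behaviour stated in the text.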

3 Experiments & Results


The self-supervised and supervised trainings were performed and evaluated on a longitudinal dataset of OCT volumes of the fellow eyes (the other eye, which is not part of the interventional study) of 1,096 patients from a clinical trial (NCT00891735) studying the efficacy of wet-AMD treatment of the study eyes. Patients had both eyes scanned monthly over two years with Cirrus OCT (Zeiss, Dublin, US). The volumes consisted of 128 B-scans with a resolution of 512×1024 px covering a volume of mm.

For the wet-AMD conversion prediction task, we selected fellow eyes that either remained in the intermediate AMD stage throughout the trial, or converted to wet-AMD during the trial, excluding those that had late AMD from baseline or converted to late dry-AMD. The final supervised dataset consisted of 463 eyes and 10,096 OCT scans with 117 converter eyes and 346 non-converters. The rest of the eyes were included in the unsupervised dataset, which consisted of 541 eyes and 12,494 volumes. Following [17, 19, 12, 2], wet-AMD conversion was defined as a binary classification task, i.e., predicting whether an eye is going to convert to wet-AMD within a clinically relevant 6-month time frame. For the supervised training, eyes were split into 60% for training, 20% for validation, and 20% for testing, stratified by the detected conversion. The evaluation is reported on the scan level.
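The eye-level stratified split can be sketched as follows. This is an illustrative Python example, not the authors' code: the function name, the seed, and the use of rounding to size each partition are assumptions; only the 60/20/20 fractions and the 117 converter and 346 non-converter eye counts come from the text.

```python
import random


def stratified_split(eye_ids, labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split eye IDs into train/val/test while preserving the converter ratio."""
    rng = random.Random(seed)
    splits = ([], [], [])
    for cls in sorted(set(labels)):
        members = [e for e, l in zip(eye_ids, labels) if l == cls]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_val = round(fractions[1] * len(members))
        splits[0].extend(members[:n_train])
        splits[1].extend(members[n_train:n_train + n_val])
        splits[2].extend(members[n_train + n_val:])  # remainder goes to test
    return splits


# 117 converter eyes (label 1) and 346 non-converters (label 0).
eye_ids = list(range(463))
labels = [1] * 117 + [0] * 346
train, val, test = stratified_split(eye_ids, labels)
print(len(train), len(val), len(test))
```

Splitting by eye rather than by scan keeps all visits of one eye in the same partition, so scans of an eye seen in training never appear in validation or testing.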

Preprocessing & Augmentations

In the supervised setting, we extracted from each OCT volume a set of 6 B-scans covering the central 0.28 mm, whereas 9 B-scans covering the central 0.42 mm were extracted for the SSL. To standardize the view, the retina in each B-scan was flattened with a quadratic fit to the RPE layer (segmentations were obtained with the IOWA Reference Algorithm [21]). Then, each B-scan was cropped to mm and resized to pixels, with intensities normalized between 0-1. All B-scans from an OCT volume were assigned the same conversion label for the supervised training. During validation and testing, however, the conversion probability of a volume was computed by picking the maximum probability among its B-scans.
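The label assignment and the volume-level aggregation described above amount to two small rules, sketched here in Python. The function names are illustrative, and the default of 6 B-scans per volume is the supervised setting from the text.

```python
def training_labels(volume_label, n_bscans=6):
    """Every B-scan inherits the conversion label of its volume."""
    return [volume_label] * n_bscans


def volume_probability(bscan_probs):
    """Volume-level conversion probability: the maximum over its B-scans."""
    return max(bscan_probs)


print(training_labels(1))                                   # [1, 1, 1, 1, 1, 1]
print(volume_probability([0.1, 0.7, 0.3, 0.2, 0.05, 0.4]))  # 0.7
```

Taking the maximum means a single B-scan with a strong conversion signal is enough to flag the whole volume.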

For the supervised training augmentations, random translation, rotation (max 10 degrees) and horizontal flip were used. When forming the pairs for the SSL, we followed [3, 1] but with an increased minimum area ratio for the random cropping from 0.08 to 0.4, because in OCT volumes the noise ratio is higher than in natural images (Fig. 2), which makes small crops uninformative. Also, random grayscaling augmentation is not applicable. We picked the B-scans of the same patient from different visits with the time difference in the range of days. The time difference acts as an additional augmentation, which makes the task non-trivial. Additionally, two large crops from two B-scans yield similar color histograms, preventing the network from memorizing color histograms and overfitting (Fig. 2(d)). Following the protocol in [1], the supervised and unsupervised data were combined for the SSL.
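The pair-forming scheme can be illustrated with a short sampling routine. This is a sketch under stated assumptions: visits are represented as day offsets, the gap bounds `min_gap` and `max_gap` are left as parameters because the exact range is not given here (the values 90 and 540 in the example are placeholders), and the function name is hypothetical. The 9 candidate B-scans per volume follow the SSL setting in the text.

```python
import random


def sample_pair(visit_days, min_gap, max_gap, rng, n_bscans=9):
    """Pick two visits of one patient whose gap lies in [min_gap, max_gap] days,
    plus one random B-scan index per visit. Returns None if no pair qualifies."""
    candidates = [
        (a, b)
        for i, a in enumerate(visit_days)
        for b in visit_days[i + 1:]
        if min_gap <= abs(b - a) <= max_gap
    ]
    if not candidates:
        return None
    first, second = rng.choice(candidates)
    if rng.random() < 0.5:  # no temporal order among the inputs
        first, second = second, first
    return (first, rng.randrange(n_bscans)), (second, rng.randrange(n_bscans))


rng = random.Random(0)
# Placeholder gap bounds; the actual range is not specified in this text.
pair = sample_pair([0, 30, 60, 400], min_gap=90, max_gap=540, rng=rng)
print(pair)
```

Each element of the returned pair is a (visit day, B-scan index) tuple; the two B-scans would then be moderately augmented and passed to the two branches.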

Figure 2: Two examples of different random crop strategies. a: flattened B-scans, b: small crop area ratio [3], c: big crop area ratio between 0.4-0.8, d: color histograms of c


ResNet-50 was chosen as the encoder backbone, and an MLP with two hidden layers with batch normalization as the projector, similar to [3] except the dimensions were chosen as 4096 for all the SSL steps. In SSL, we used AdamW with batch size of 128, learning rate of and a weight decay of for 400 epochs. Following [3, 6], a cosine learning rate scheduler with a warm-up of 10 epochs was used. In VICReg, the coefficient of the invariance term was fixed to 25, and we found improved performance when the coefficients of the variance and the covariance terms were set to 5 and 1, respectively. For the Barlow Twins, the coefficient for the redundancy reduction term was kept at 0.005.
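A cosine schedule with warm-up can be sketched as below. The 10 warm-up epochs and 400 total epochs come from the text; the linear shape of the warm-up, the decay to zero, and the per-epoch (rather than per-step) update are assumptions, and `base_lr` is left as a parameter because the learning rate value is not given here.

```python
import math


def lr_at_epoch(epoch, base_lr, warmup_epochs=10, total_epochs=400):
    """Linear warm-up to base_lr, then cosine decay towards zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Relative schedule (base_lr = 1.0) at a few epochs.
for epoch in (0, 9, 10, 205, 400):
    print(epoch, round(lr_at_epoch(epoch, base_lr=1.0), 4))
```

The schedule reaches the base rate at the end of warm-up, is at half the base rate midway through the decay, and ends at zero.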

For the downstream task, we provided the results from both the linear evaluation and the fine-tuning. The performance was evaluated with the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (PRAUC). The linear evaluation was conducted by training a linear layer on top of the pre-trained and frozen encoder. It was trained with the Adam optimizer, batch size of 128, learning rate of , and 5-to-1 class weights in the cross-entropy loss for 10 epochs. Fine-tuning had the same parameters with the addition of weight decay for 100 epochs. When training from scratch, the model was trained for 300 epochs. The best epoch was selected as the one with the highest AUROC score on the validation set. The learning rate was selected between - . The weight decay was selected between - including 0.
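The AUROC-based model selection described above can be sketched in plain Python. The pairwise formulation below (the probability that a random positive scan is scored above a random negative one, with ties counted as half) is one standard way to compute AUROC; the function names and the toy scores are illustrative only.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def best_epoch(val_scores_per_epoch, val_labels):
    """Index of the epoch with the highest validation AUROC."""
    aurocs = [auroc(val_labels, scores) for scores in val_scores_per_epoch]
    return max(range(len(aurocs)), key=aurocs.__getitem__)


val_labels = [0, 0, 1, 1]
print(auroc(val_labels, [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of scans, which is fine for a sketch; rank-based implementations are preferable for large validation sets.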

3.0.1 Experiments

We report results for wet-AMD conversion prediction task from linear evaluation and fine-tuning. We compared TINC against popular non-contrastive learning methods Barlow Twins and VICReg, which can accept different images as input pairs. When testing our new pair-forming scheme, we used the original VICReg and Barlow Twins along with their modified versions. In order to show the performance of TINC, we compared it against ResNet50 trained from scratch, VICReg and Barlow Twins modified with our new input scheme, and VICReg with additional explicit time difference prediction loss term.

Self-supervised learning              AUROC   PRAUC
VICReg [3]                            Representational collapse
Barlow Twins [20]                     0.686   0.103
VICReg w. two visits                  0.685   0.085
Barlow Twins w. two visits            0.708   0.098
VICReg + Explicit Time Difference     0.701   0.107
TINC                                  0.738   0.112
Table 1: Linear evaluation results of SSL approaches with ResNet50 backbone. TINC is the proposed VICReg with TINC loss. "w. two visits" indicates the model modified with the new pair-forming scheme.

Model                                 AUROC   PRAUC
Backbone (random initialization)      0.713   0.110
AMDNet [13] (random initialization)   0.676   0.087
Barlow Twins w. two visits            0.692   0.091
VICReg w. two visits                  0.737   0.117
TINC                                  0.756   0.142
Table 2: Model performances for the fine-tuning and training from scratch.

Although random crop & resize are the most crucial augmentations for non-contrastive learning, small crop area may not be ideal for OCT images due to the loss in contextual information. To verify this, we first tested Barlow Twins and VICReg with their original augmentations and input scheme as baselines followed by training them with the proposed pair-forming scheme and larger random crops. We observed (first section of Table 1) that with vanilla VICReg, its similarity loss quickly reached close to zero, and the representations collapsed. A significant improvement in performance was observed for both the methods with the proposed input scheme. The AUROC of Barlow Twins increased from 0.686 to 0.708 and VICReg achieved 0.685 AUROC score on linear evaluation.

On linear evaluation, the TINC loss clearly outperformed both Barlow Twins and VICReg even after modifying them with our novel pair-forming scheme (Table 1). TINC achieved 0.738 AUROC, while modified VICReg and Barlow Twins achieved 0.685 and 0.708 respectively. TINC captures the temporal information better with its temporally induced margin-based approach, leading to these improvements.

With end-to-end fine-tuning (Table 2) the performance improvement due to the proposed TINC loss is more apparent, even after optimizing VICReg and Barlow Twins with our input pair scheme. We also compared our results against AMDNet [13], an architecture specifically designed for 2D B-scans. Interestingly, AMDNet could not outperform a ResNet-50 initialized with random weights.

In order to demonstrate that the temporal information is crucial in the conversion prediction task, we added time difference prediction as an additional loss term to VICReg. For this, we concatenated the two embeddings and fed them to an MLP to obtain the time difference predictions. The labels are calculated as the signed time difference w.r.t. the input order and scaled between -1 and 1. The MSE between the labels and the predictions is added to the other VICReg loss terms. This clearly improved AUROC (Table 1, line 1-4), but the additional term increased the complexity of VICReg training. In contrast, TINC implicitly uses the temporal information through its margin, so that the representations capture the anatomical changes due to the disease progression. The experiments demonstrated that the TINC approach performs better in the downstream AMD conversion prediction task than adding the time difference as a separate term.

Additionally, we modified our loss as a squared epsilon insensitive loss to have smoother boundaries, but it degraded the performance. With Barlow Twins, we observed that the fine-tuning AUROC performance was worse than its linear evaluation by 0.016. This can be explained by the fact that in Barlow Twins, the fine-tuning reached the peak validation score within 10 epochs, same as the number of linear training epochs.
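To make the difference between the two margin variants concrete, the sketch below compares them on a precomputed distance. It assumes the squared variant simply squares the hinge, as in the squared epsilon-insensitive loss known from support vector regression; the exact form used in the experiments is not given here, and the function names are illustrative.

```python
def tinc_linear(mse, delta_t):
    """Epsilon-insensitive form: zero inside the margin, linear outside."""
    return max(0.0, mse - delta_t)


def tinc_squared(mse, delta_t):
    """Squared variant: zero inside the margin, quadratic outside,
    so the penalty grows smoothly from the margin boundary."""
    return max(0.0, mse - delta_t) ** 2


for mse in (0.3, 0.6, 1.0, 2.0):
    print(mse, tinc_linear(mse, 0.5), tinc_squared(mse, 0.5))
```

Near the boundary the squared variant penalizes violations much less than the linear one, and far from it much more, which changes how strongly distant pairs are pulled together.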

4 Discussion & Conclusion

Temporal information is one of the key factors in correctly modeling disease progression. However, popular contrastive and non-contrastive methods are not designed specifically to capture it. Additionally, they require strong augmentations to create two views of an image, which are not always applicable to medical images. We proposed TINC as a modified similarity term of the recent non-contrastive method VICReg, without increasing its complexity. Models trained with TINC outperformed the original VICReg and Barlow Twins in the task of predicting conversion to wet-AMD. TINC is also not task or dataset specific; it is applicable to any longitudinal imaging dataset. Moreover, we proposed a new input pair-forming scheme for OCT volumes from different time points, which replaced the heavy augmentations required in the original VICReg and Barlow Twins and improved the performance.

4.0.1 Acknowledgements

The work has been partially funded by FWF Austrian Science Fund (FG 9-N), and a Wellcome Trust Collaborative Award (PINNACLE Ref. 210572/Z/18/Z).


References

  • [1] S. Azizi, B. Mustafa, F. Ryan, Z. Beaver, J. Freyberg, J. Deaton, A. Loh, A. Karthikesalingam, S. Kornblith, T. Chen, et al. (2021) Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3478–3488.
  • [2] I. Banerjee, L. de Sisternes, J. A. Hallak, T. Leng, A. Osborne, P. J. Rosenfeld, G. Gregori, M. Durbin, and D. Rubin (2020) Prediction of age-related macular degeneration disease using a sequential deep learning approach on longitudinal SD-OCT imaging biomarkers. Scientific Reports 10 (1), pp. 15434.
  • [3] A. Bardes, J. Ponce, and Y. LeCun (2022) VICReg: variance-invariance-covariance regularization for self-supervised learning. In International Conference on Learning Representations.
  • [4] N. M. Bressler (2004) Age-related macular degeneration is the leading cause of blindness. JAMA 291 (15), pp. 1900–1901.
  • [5] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33, pp. 9912–9924.
  • [6] X. Chen and K. He (2021) Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758.
  • [7] Y. Chen, C. Zhang, L. Liu, C. Feng, C. Dong, Y. Luo, and X. Wan (2021) USCL: pretraining deep ultrasound image diagnosis model through video contrastive representation learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 627–637.
  • [8] A. Ermolov, A. Siarohin, E. Sangineto, and N. Sebe (2021) Whitening for self-supervised representation learning. In International Conference on Machine Learning, pp. 3015–3024.
  • [9] J. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. (2020) Bootstrap your own latent - a new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, pp. 21271–21284.
  • [10] L. Jing, P. Vincent, Y. LeCun, and Y. Tian (2022) Understanding dimensional collapse in contrastive self-supervised learning. In International Conference on Learning Representations.
  • [11] H. Li, F. Xue, K. Chaitanya, S. Luo, I. Ezhov, B. Wiestler, J. Zhang, and B. Menze (2021) Imbalance-aware self-supervised learning for 3D radiomic representations. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 36–46.
  • [12] A. Rivail, U. Schmidt-Erfurth, W. Vogl, S. M. Waldstein, S. Riedl, C. Grechenig, Z. Wu, and H. Bogunovic (2019) Modeling disease progression in retinal OCTs with longitudinal self-supervised learning. In International Workshop on PRedictive Intelligence In MEdicine, pp. 44–52.
  • [13] D. B. Russakoff, A. Lamin, J. D. Oakley, A. M. Dubis, and S. Sivaprasad (2019) Deep learning for prediction of AMD progression: a pilot study. Investigative Ophthalmology & Visual Science 60 (2), pp. 712–722.
  • [14] U. Schmidt-Erfurth, S. M. Waldstein, S. Klimscha, A. Sadeghipour, X. Hu, B. S. Gerendas, A. Osborne, and H. Bogunović (2018) Prediction of individual disease conversion in early AMD using artificial intelligence. Investigative Ophthalmology & Visual Science 59 (8), pp. 3199–3208.
  • [15] A. van den Oord, Y. Li, and O. Vinyals (2019) Representation learning with contrastive predictive coding. arXiv:1807.03748.
  • [16] Z. Wu, H. Bogunović, R. Asgari, U. Schmidt-Erfurth, and R. H. Guymer (2021) Predicting progression of age-related macular degeneration using OCT and fundus photography. Ophthalmology Retina 5 (2), pp. 118–125.
  • [17] Q. Yan, D. E. Weeks, H. Xin, A. Swaroop, E. Y. Chew, H. Huang, Y. Ding, and W. Chen (2020) Deep-learning-based prediction of late age-related macular degeneration progression. Nature Machine Intelligence 2 (2), pp. 141–150.
  • [18] J. Yang, Q. Zhang, E. H. Motulsky, M. Thulliez, Y. Shi, C. Lyu, L. de Sisternes, M. K. Durbin, W. Feuer, R. K. Wang, G. Gregori, and P. J. Rosenfeld (2019) Two-year risk of exudation in eyes with nonexudative age-related macular degeneration and subclinical neovascularization detected with swept source optical coherence tomography angiography. American Journal of Ophthalmology 208, pp. 1–11.
  • [19] J. Yim, R. Chopra, T. Spitz, J. Winkens, A. Obika, C. Kelly, H. Askham, M. Lukic, J. Huemer, K. Fasler, et al. (2020) Predicting conversion to wet age-related macular degeneration using deep learning. Nature Medicine 26 (6), pp. 892–899.
  • [20] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pp. 12310–12320.
  • [21] L. Zhang, G. H. S. Buitendijk, K. Lee, M. Sonka, H. Springelkamp, A. Hofman, J. R. Vingerling, R. F. Mullins, C. C. W. Klaver, and M. D. Abràmoff (2015) Validity of automated choroidal segmentation in SS-OCT and SD-OCT. Investigative Ophthalmology & Visual Science 56 (5), pp. 3202–3211.