Robust and efficient computation of retinal fractal dimension through deep approximation

by Justin Engelmann, et al.

A retinal trait, or phenotype, summarises a specific aspect of a retinal image in a single number. This can then be used for further analyses, e.g. with statistical methods. However, reducing an aspect of a complex image to a single, meaningful number is challenging. Thus, methods for calculating retinal traits tend to be complex, multi-step pipelines that can only be applied to high quality images. This means that researchers often have to discard substantial portions of the available data. We hypothesise that such pipelines can be approximated with a single, simpler step that can be made robust to common quality issues. We propose Deep Approximation of Retinal Traits (DART), where a deep neural network is used to predict the output of an existing pipeline on high quality images from synthetically degraded versions of these images. We demonstrate DART on retinal Fractal Dimension (FD) calculated by VAMPIRE, using retinal images from UK Biobank that previous work identified as high quality. Our method shows very high agreement with FD VAMPIRE on unseen test images (Pearson r=0.9572). Even when those images are severely degraded, DART can still recover an FD estimate that shows good agreement with FD VAMPIRE obtained from the original images (Pearson r=0.8817). This suggests that our method could enable researchers to discard fewer images in the future. Our method can compute FD for over 1,000 img/s using a single GPU. We consider these to be very encouraging initial results and hope to develop this approach into a useful tool for retinal analysis.



1 Introduction

Retinal fundus images are non-invasive and low-cost. They are important for ophthalmology and also capture a detailed picture of the retinal vasculature. Thus, they can be used for studying and potentially predicting diseases such as diabetes, stroke, hypertension and neurovascular disease [macgillivray2014retinal]. To analyse the relationships between aspects of the retina and other quantities of interest, retinal traits (also called features, parameters or phenotypes) are used as a quantitative description of a specific aspect of the retinal image. Reducing a complex image to a single, meaningful number is necessary for standard statistical methods, yet it is a challenging task. It is challenging to identify a potentially salient aspect of the retina in the first place and to then design a method that can reliably quantify this aspect. This is further complicated by the large variability in retinal images, stemming from idiosyncrasies of the imaged retinas (e.g. due to retinal diseases or rare phenotypes) and image quality (e.g. due to operator inexperience or time pressures in large scale cohort studies). Thus, pipelines for extracting such retinal traits tend to be complex, comprise multiple steps, and can only be applied to images of sufficient quality.

Poor image quality is a key problem in retinal image analysis. Particularly in large scale studies such as UK Biobank, many images are of poor quality, being blurred, obscured, or hazy [macgillivray2015suitability]. Imaging artefacts such as noise, non-uniform illumination or blur can also lead to poor vessel segmentations [mookiah2021review]. Previous work analysing 2,690 UK Biobank participants found that only 60% had an image that could be adequately analysed by VAMPIRE [macgillivray2015suitability]. Two recent large-scale studies using retinal Fractal Dimension (FD) for predicting cardiovascular disease risk discarded 26% [zekavat2022deep] and 43% [velasco2021decreased] of the images in UK Biobank. Although necessary, this is unfortunate as it leads to lower sample sizes and makes it hard to study rare diseases in particular.

Figure 1: Overview of our proposed framework. a) A typical pipeline for computing FD: an encoder-decoder neural network for segmentation, potentially some refinement steps like optic disc segmentation and removal, and a method to calculate FD of the segmentation (e.g. box counting or multifractal). b) DART, our proposed approach, outputs a deep approximation of FD in a single step using an encoder-only neural network, with drastically reduced complexity. c) We can train our model to be robust to image quality issues by synthetically degrading input images and training our model to minimise the loss between its output and the FD obtained with the original high quality image.

We hypothesise that it is possible to approximate pipelines for calculating retinal traits with a single, simpler step and propose Deep Approximation of Retinal Traits (DART). Fig. 1 gives a high-level overview of our approach. DART trains a deep neural network (DNN) to predict the output of an original method (OM) for calculating a retinal trait. We can then train the model to be robust to image quality issues by synthetically degrading the input images during training and asking the DNN model to predict the output of the OM on the original high quality image. The intuition behind this approach is that obtaining a high quality segmentation of the entire retina is a much harder task than describing an aspect of the vasculature, like vascular complexity, directly. DART offers a segmentation-free way of computing retinal traits related to the vasculature, but can also be applied to any other retinal image analysis method, like feature extraction for disease grading or pathology segmentation.

In the present work, we focus on retinal Fractal Dimension (FD), a key retinal trait that has been used to predict cardiovascular disease risk [velasco2021decreased, zekavat2022deep] and is associated with neurodegeneration and stroke [lemmens2020systematic]. We use FD as calculated by VAMPIRE [trucco2013novel] with the multifractal [stosic2006multifractal] method as the OM we apply DART to. At minimum, FDDART should have very high agreement with FDVAMPIRE on high quality images so that it can be interpreted in the same way. To be a useful method, it should further be robust to image quality issues and efficient. Robustness would enable researchers to discard fewer images than currently necessary, while efficiency allows analyses to be conducted at large scale without requiring large compute resources.

2 Deep Approximation of Retinal Traits (DART)

2.1 Motivation and theory

We hypothesise that it is possible to approximate the entire pipeline of an original method (OM) for calculating a retinal trait in a single, simpler step. We denote the distribution of high quality retinal fundus images as D, where each image x ∈ R^(H×W×C) has height H, width W, and C channels. The OM can be interpreted as a function f_OM: R^(H×W×C) → R that maps from image space to the one-dimensional retinal trait space (in our case, FD), i.e. given an image x, the FD computed by the OM is f_OM(x). Our goal is to find an alternative function f_DART that is both simpler than f_OM and has high agreement with it for all images of sufficient quality for the OM to be used, i.e. f_DART(x) ≈ f_OM(x) for all x ∈ D.

Designing such a simpler function by hand would be very challenging. Thus, we use a deep neural network (DNN). DNNs are universal function approximators in theory and very effective for image analysis in practice. We can then find a good approximation of f_OM by simply updating the model parameters θ (weights, biases, normalisation layer parameters) to minimise some differentiable measure of divergence between f_DART(x; θ) and f_OM(x), e.g. the mean squared error.

2.1.1 Accuracy

The output of the OM is fully determined by the given image, so we would expect that very high accuracy can be achieved. This contrasts with other problems, e.g. clinicians take into account additional information like symptoms and family history, and might disagree with each other, or even with themselves, if shown the same image multiple times.

2.1.2 Simplicity & Efficiency

Some readers might not perceive DNNs as simple or efficient. However, modern pipelines for retinal image analysis tend to use DNNs for vessel segmentation, so not requiring additional steps implies strictly lower complexity both computationally and in terms of required code. Furthermore, segmentation models tend to have an encoder-decoder structure (e.g. UNet) whereas models for classification/regression only need an encoder and small prediction head, making them more parameter-, memory-, and compute-efficient. Finally, given the widespread adoption of deep learning, the frameworks are very mature and can be very efficiently GPU-accelerated.

2.1.3 Robustness

We hypothesise that there are images of lower quality such that a) current pipelines would not produce a useful FD number, but b) there is still sufficient information to give an accurate estimate of the FD number we would have obtained on a counterfactual high quality image. For example, in an image with an obstruction, only part of the retina might be visible. Thus, the resulting vessel segmentation map would be poor and the FD of this map would be very different from that of the counterfactual high quality image, yet the visible parts of the retina might contain sufficient information about the vascular complexity of the retina as a whole to recover an accurate estimate of the FD.

As we do not observe counterfactual high quality images or objective ground truth FD values, we artificially degrade high quality images x with a degradation function d and train our model to minimise the difference between the predicted FD for the degraded image, f_DART(d(x)), and the OM's FD for the high quality image, f_OM(x). If there indeed is sufficient information in the degraded images, then our model should be able to predict the OM's FD for the high quality image reasonably well. However, this is a much harder task than matching the OM on high quality images, as the degradations lose information and for a given degraded image there are multiple possible counterfactual high quality images.
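This training signal can be sketched in a few lines. Below, `degrade` and `fake_fd` are hypothetical stand-ins for the degradation function d and the OM f_OM (the real pipeline uses VAMPIRE on fundus images); only the pairing of degraded inputs with clean-image targets is the point:

```python
import numpy as np

def degrade(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Hypothetical stand-in for the degradation function d:
    # add pixel-wise Gaussian noise and darken the image slightly.
    noisy = img + rng.normal(0.0, 0.05, size=img.shape)
    return np.clip(noisy * 0.9, 0.0, 1.0)

def training_pairs(images, fd_om, rng):
    """Pair each degraded image with the OM's FD of the *original* image.

    fd_om is a callable standing in for the original method; the model
    only ever sees d(x) as input, but is trained towards f_OM(x)."""
    inputs = [degrade(x, rng) for x in images]
    targets = [fd_om(x) for x in images]  # computed on the clean images
    return inputs, targets

rng = np.random.default_rng(0)
images = [rng.random((8, 8, 3)) for _ in range(4)]
fake_fd = lambda x: float(x.mean())  # placeholder for f_OM
xs, ys = training_pairs(images, fake_fd, rng)
# Loss a model would minimise: MSE between prediction and clean-image FD.
mse = float(np.mean([(fake_fd(xd) - y) ** 2 for xd, y in zip(xs, ys)]))
```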

2.2 Implementation

2.2.1 Model & Training

Our model consists of a pretrained ResNet18 [he2016deep] backbone that extracts a feature map from the images, followed by spatial average pooling and a small multi-layer perceptron with two hidden layers of 128 and 32 units, and a single output. Each hidden layer is followed by layernorm [ba2016layer] and GELU [hendrycks2016gaussian] activation. No activation is applied to the final output. ResNet is a well-established architecture that has been shown to perform competitively with more recent architectures when using modern training techniques [bello2021revisiting, wightman2021resnet]. We use ResNet18 as it is the most light-weight member of the ResNet family. We initialise the backbone with weights pre-trained on natural images from Instagram [yalniz2019billion]. Those images are very different from retinal images, so this is merely a minor refinement on random initialisation. We resize images to 224x224 pixels for computational efficiency and lower memory requirements. Apart from standard normalisation using channel-wise ImageNet means and standard deviations, no further preprocessing is done and all 3 colour channels are kept.

We train our model with a batch size of 256 to minimise the mean squared error between prediction and target, after normalising the target to zero mean and unit variance using the mean and standard deviation of the training data to avoid data leakage. The model output can then be mapped back to the FD range by applying the inverse transformation. We use the AdamW optimiser [loshchilov2017decoupled] with weight decay and a cosine learning rate schedule [loshchilov2016sgdr]. We train for 35 epochs with a linear learning rate warmup for the first 5 epochs, followed by 3 cycles of 10 epochs each. During each cycle, the per-epoch learning rate is set according to a cosine schedule, and after each cycle the learning rate is decayed by taking the square root. We apply generic data augmentations (horizontal and vertical flips; mild affine transformations: rotation by up to ±10°, shear of up to ±5°, and scaling by ±5%) as well as the image degradations described in the next section (sampling all 5 severity levels uniformly) during training. We implemented our code in Python 3.9 using PyTorch and timm [rw2019timm] and plan to make it publicly available upon publication.

2.2.2 Synthetic degradations


Severity                                    1              2              3              4              5
Brightness/Contrast/Gamma                  ±5%            ±10%           ±15%           ±20%           ±25%
Mini Artifacts (holes/height/width)    2-20/1-3/5-8   2-24/1-5/5-12  2-28/1-5/5-16  2-32/1-3/5-20  2-40/1-3/5-24
Square Artifacts (side length)              25             50             75            100            125
Chop Artifacts (% of image removed)       10-15          10-25          10-35          10-45          10-50
Advanced Blur (kernel size/sigma)      3-5/0.2-0.5    3-7/0.2-0.7    3-9/0.2-0.8    3-11/0.2-0.9   3-13/0.2-1.0
Gaussian Noise (variance)                  1-10           5-10           5-20           5-25           5-30

Table 1: Severity levels for the degradations. Brightness, contrast and gamma changes are independently sampled from the given interval. Dimensions in pixels.

We focus on three types of quality issues in retinal images [mookiah2021review, macgillivray2015suitability]: lighting issues, artifacts/obstructions, and imaging issues. To simulate general lighting issues, we independently change the brightness, contrast and gamma of the image. To simulate artifacts/obstructions and severely inconsistent lighting, we introduce one of three artifacts: a) many smaller rectangular holes placed across the retina, b) a single large square hole, or c) "chopping" off the bottom or top part of the image. The latter is inspired by the observation that in UK Biobank some images only have the top or bottom part properly illuminated. To simulate general imaging issues, we add pixel-wise Gaussian noise and blur the image. Standard isotropic Gaussian blur kernels do not mimic realistic image blur, so we use an advanced anisotropic blurring technique developed for image super-resolution [wang2021real], where the standard deviations for both dimensions of the kernel are sampled independently, and the kernel is then rotated and has some noise added before being applied to the image.

We specify degradation parameters for five levels of severity, shown in Table 1. For a given level, we sample parameters for each image independently from the given ranges. Degradations are applied after images have already been downsized to 224x224. We apply an artifact with a probability that depends on the severity level. If an image was chosen to have an artifact applied to it, we then choose one of the Mini, Square, or Chop Artifacts at random. Degradations are implemented using the albumentations package [info11020125].
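Two of the artifacts can be illustrated with simplified numpy versions (the actual implementation uses albumentations; `frac` and `side` correspond to the chop percentage and square side length in Table 1, and the function names are ours):

```python
import numpy as np

def chop_artifact(img: np.ndarray, frac: float, top: bool = True) -> np.ndarray:
    """Black out the top (or bottom) `frac` of the image rows."""
    out = img.copy()
    n = int(round(img.shape[0] * frac))
    if top:
        out[:n] = 0
    else:
        out[img.shape[0] - n:] = 0
    return out

def square_artifact(img: np.ndarray, side: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Place a single black square of the given side length at random."""
    out = img.copy()
    h, w = img.shape[:2]
    y = int(rng.integers(0, max(1, h - side)))
    x = int(rng.integers(0, max(1, w - side)))
    out[y:y + side, x:x + side] = 0
    return out

rng = np.random.default_rng(0)
img = np.ones((224, 224, 3), dtype=np.float32)
chopped = chop_artifact(img, 0.25)       # chop off top 25% of the rows
squared = square_artifact(img, 50, rng)  # severity-2 square artifact
```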

Figure 2: Random examples of synthetically degraded versions of the same fundus image. Best viewed zoomed in, especially for the advanced blur. UK Biobank asks to only reproduce imaging data where necessary, so we demonstrate the degradations on an image taken from DRIVE [staal2004ridge] which is similar in appearance to those in UK Biobank.

3 Experiments

3.1 Data

We apply our DART framework to multi-fractal FD [stosic2006multifractal] calculated with VAMPIRE [trucco2013novel]. We use only images that had been identified as high quality in a previous study [velasco2021decreased], as for those images FDVAMPIRE should be reliable and can be considered a reasonable "ground truth". We randomly split the data into train, validation, and test sets containing 70, 10, and 20% of the participants in UK Biobank, resulting in 52,242 / 7,478 / 14,907 images belonging to 32,300 / 4,614 / 9,229 participants in each set. We split at the participant level such that no images of the same participant occur in different sets. Images are cropped to square to remove black non-retinal regions and processed at 224x224 as described above.

3.2 Results

3.2.1 Agreement & Robustness

Degradations   R²       Pearson r (p-value)   Spearman ρ (p-value)   OLS regression fit
None           0.9160   0.9572 (0.0000)       0.9561 (0.0000)        y = 0.01 + 1.00x
Severity 1     0.8957   0.9467 (0.0000)       0.9446 (0.0000)        y = 0.01 + 0.99x
Severity 2     0.8859   0.9414 (0.0000)       0.9396 (0.0000)        y = 0.01 + 0.99x
Severity 3     0.8623   0.9287 (0.0000)       0.9282 (0.0000)        y = 0.00 + 1.00x
Severity 4     0.8309   0.9116 (0.0000)       0.9103 (0.0000)        y = 0.01 + 0.99x
Severity 5     0.7773   0.8817 (0.0000)       0.8840 (0.0000)        y = 0.02 + 0.99x

Table 2: Agreement between FDVAMPIRE obtained on high quality images, and FDDART for different levels of degradation measured on 14,907 held-out test set images.

We find very high agreement between FDVAMPIRE and FDDART on the original images, with Pearson r=0.9572 and Spearman ρ=0.9561. Table 2 shows results for different levels of degradation. When degrading the images and asking our model to predict the FDVAMPIRE obtained from the high quality image, agreement goes down as the images become more degraded, which is what we would expect as these degradations remove substantial information about the retinal vasculature. However, despite this, we still observe good agreement with the FDVAMPIRE obtained on the original image even at severity level 5, where extreme degradations are applied (Pearson r=0.8817 and Spearman ρ=0.8840). This suggests that DART can recover good estimates of the retinal trait that would have been obtained from a counterfactual high quality image even if the available image has very poor quality. Thus, it might allow for discarding far fewer images than currently necessary.
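The agreement metrics in Table 2 can be computed with numpy alone, as sketched below; the arrays are synthetic placeholder values for illustration, not UK Biobank data:

```python
import numpy as np

def pearson_r(a: np.ndarray, b: np.ndarray) -> float:
    # Off-diagonal entry of the 2x2 correlation matrix.
    return float(np.corrcoef(a, b)[0, 1])

def spearman_rho(a: np.ndarray, b: np.ndarray) -> float:
    # Rank-transform both variables, then take the Pearson correlation
    # of the ranks. Ties are unlikely for continuous FD values, so a
    # simple argsort-based ranking suffices for this sketch.
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson_r(rank(a), rank(b))

rng = np.random.default_rng(0)
fd_vampire = rng.normal(1.5, 0.05, size=1000)            # placeholder FDs
fd_dart = fd_vampire + rng.normal(0, 0.015, size=1000)   # noisy approximation
r = pearson_r(fd_vampire, fd_dart)
rho = spearman_rho(fd_vampire, fd_dart)
```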

Figure 5: Agreement results for 14,907 held-out test set images. Best viewed zoomed in. a) Scatterplots of FDDART against FDVAMPIRE obtained from the original images, for different levels of degradation; red line: best linear fit; dashed black line: identity. b) Boxplots of the residuals for different levels of degradation; faint red line: zero residual; vertical black lines: ± one interquartile range (IQR) of FDVAMPIRE for reference.

For comparison, a previous study comparing FD for arteries and veins separately between VAMPIRE and SIVA [mcgrory2018towards] found very poor agreement between the measures of the two tools, for both arteries and veins. Another study comparing vessel calibre-related retinal traits obtained with VAMPIRE, SIVA, and IVAN found that they agreed with Pearson r's of 0.29 to 0.86. Thus, the observed agreement between FDVAMPIRE and FDDART (Pearson r=0.9572, Spearman ρ=0.9561) is very high, and even when DART is applied to the most degraded images the agreement (Pearson r=0.8817, Spearman ρ=0.8840) is higher than what could be expected when using two different tools on the same high quality images.

Finally, our method shows very low bias even as degradation severity is increased (Fig. 5). The best OLS fit is very close to the identity line for all levels of severity, or equivalently, the optimal linear translation function from FDDART to FDVAMPIRE is almost simply the identity function. This also implies that no post-hoc adjustment for image quality is needed and FDDART values obtained for images of varying quality are on the same scale out-of-the-box. As degradation severity increases, the variance of the residuals also increases but most residuals are still less than one interquartile range (IQR), a robust equivalent of the standard deviation, even when applying the strongest degradation.

3.2.2 Speed

Images were loaded into RAM so that hard disk speed is not a factor. We then measured the time it took to process all 52,242 training images, including normalisation and moving them from RAM to GPU VRAM, as well as the time to move the results back to RAM. We used a modern workstation (Intel i9-9920X CPU with 12 cores / 24 threads, a single Nvidia RTX A6000 24GB GPU, 126GB of RAM) and a batch size of 440. With ResNet18 as backbone, our model processed all 52,242 images in 48.5s ± 93.6ms (mean ± std over 5 runs), yielding a rate of 1,077 img/s.

4 Conclusion

We have shown that we can use DART to approximate the multi-step pipeline for obtaining FDVAMPIRE with very high agreement. Our resulting model can compute FDDART at over 1,000 img/s using a GPU. Furthermore, our model can compute FDDART values from severely degraded images that still match the FDVAMPIRE values obtained on the high quality images well. This could allow researchers interested in studying retinal traits to discard fewer images than currently necessary and thus have higher sample sizes. We consider these to be very encouraging initial results.

There are a number of directions for future work. First, the proposed framework can be easily applied to other retinal traits like vessel tortuosity or width, or FD as calculated by other pipelines. We would expect this to be similarly successful. Second, the robustness of the resulting DART model should be evaluated in more depth, and the cases with extreme residuals should be manually examined. We expect that robustness can be further improved, especially if we identify common failure cases and use those as data augmentations. Third, many straightforward, incremental technical improvements should be possible, such as improved training procedures to further increase performance, trying different architectures and resolutions, and further speeding up inference through common tricks like fusing batch norm layers into the convolutional layers. Finally, we hope that our approach will eventually enable other researchers to conduct better analyses, e.g. by not having to discard as many images and thus having a larger sample size available.


We thank our colleagues for their help and support.

This research has been conducted using the UK Biobank Resource under project 72144. This work was supported by the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. For the purpose of open access, the author has applied a creative commons attribution (CC BY) licence to any author accepted manuscript version arising.