Optical coherence tomography (OCT) is a non-invasive imaging technique that provides high-resolution volumetric visualization of the retina [li2017statistical]. However, it offers poor contrast between vessels and nerve tissue layers [gao2016optical]. This can be overcome by decoupling the dynamic blood flow within vessels from the stationary nerve tissue, by decorrelating multiple cross-sectional images (B-scans) taken at the same spatial location. By computing the variance of these repeated B-scans, we obtain an OCT angiography (OCT-A) volume that offers better visualization of the retinal vasculature than traditional OCT [jia2012split]. In contrast to other techniques such as fluorescein angiography (FA), OCT-A is advantageous because it both provides depth-resolved information in 3D and is free of risks related to dye leakage or potential allergic reaction [gao2016optical]. OCT-A is popular for studying various retinal pathologies [burke2017application, ishibazawa2015optical]. The recent use of vascular plexus density as a disease severity indicator [hollo2018comparison] highlights the need for vessel segmentation in OCT-A.
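The decorrelation step described above amounts to a per-voxel variance over the repeated B-scans. A minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def octa_from_repeats(bscans):
    """Compute an OCT-A B-scan as the temporal variance of repeated
    B-scans acquired at the same spatial location.

    bscans: array of shape (n_repeats, depth, width)
    returns: (depth, width) angiography image
    """
    return np.var(bscans.astype(np.float64), axis=0)

# Toy example: a "vessel" voxel fluctuates across repeats; static tissue does not.
rng = np.random.default_rng(0)
repeats = np.ones((4, 8, 8))                      # stationary tissue
repeats[:, 4, 4] = rng.normal(1.0, 0.5, size=4)   # dynamic flow voxel
angio = octa_from_repeats(repeats)
```

Stationary tissue yields near-zero variance and is suppressed, while flow voxels retain a high signal.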
Unlike magnetic resonance angiography (MRA) and computed tomography angiography (CTA), OCT-A suffers from severe speckle noise, which induces poor contrast and vessel discontinuity. Consequently, unsupervised vessel segmentation approaches [bozkurt2020texture, zhao2018vascular, 993126, Lorigo:2001jv, Vasilevskiy01fluxmaximizing] developed for other modalities do not translate well to OCT-A. Denoising OCT/OCT-A images has thus been an active topic of research [oguz2020self, hu2020retinal, devalla2019deep]. The noise is compounded in OCT-A due to the unpredictable patterns of blood flow as well as artifacts caused by residual registration errors, which lead to insufficient suppression of stationary tissue. This severe noise level, coupled with the intricate detail of the retinal capillaries, leads to a fundamental roadblock to 3D segmentation of the retinal blood vessels: the task is too challenging for unsupervised methods, and yet, obtaining manual segmentations to train supervised models is prohibitively expensive. For instance, a single patch capturing only about 5% of the whole fovea (Fig. 4f) took approximately 30 hours to manually segment. The large inter-subject variability and the vast inter-rater variability which is inevitable in such a detailed task make the creation of a suitably large manual training dataset intractable.
As a workaround, retinal vessel segmentation attempts have been largely limited to 2D images with better SNR, such as the depth-projection of the OCT-A volume [giarratano2019automated]. This produces only a single 2D segmentation from a whole 3D volume, evidently sacrificing the depth information. Similar approaches to segmenting inherently 2D data such as fundus images have also been reported [lahiri2016deep]. Recently, Liu et al. [liu2020variational] proposed an unsupervised 2D vessel segmentation method using two registered modalities from two different imaging devices. Unfortunately, multiple scans of a single subject are not typically available in practice. Further, the extension to 3D can be problematic due to inaccurate volumetric registration between modalities. Zhang et al. applied the optimally oriented flux [law2008three] (OOF) to 3D OCT-A segmentation, but, as their focus is shape analysis, neither a detailed discussion nor a numerical evaluation of segmentation is provided.
We propose the local intensity fusion encoder (LIFE), a self-supervised method to segment the 3D retinal vasculature from OCT-A. LIFE requires neither manual delineation nor multiple acquisition devices. To the best of our knowledge, it is the first label-free learning method with quantitative validation for 3D OCT-A vessel segmentation. Fig. 1 summarizes the pipeline. Our novel contributions are:
2.1 Local intensity fusion: LIF
Small capillaries have low intensity in OCT-A since they have slower blood flow, and are therefore hard to distinguish from the ubiquitous speckle noise. We exploit the similarity of vasculature between consecutive en-face OCT-A slices to improve the image quality. This local intensity fusion (LIF) technique derives from the Joint Label Fusion [wang2012multi] and related synthesis methods [fleishman2017joint, 10.1117/12.2550009, oguz2020self].
Joint label fusion (JLF) [wang2012multi] is a well-known multi-atlas label fusion method for segmentation. In JLF, a library of atlases with known segmentations is deformably registered to the target image, yielding a set of warped atlases. Locally varying weight maps are computed for each atlas based on the local residual registration error between the warped atlas and the target. The weighted sum of the warped atlas segmentations provides the consensus segmentation on the target image.
JLF has been extended to joint intensity fusion (JIF), an image synthesis method that does not require atlas segmentations. JIF has been used for lesion in-painting [fleishman2017joint] and cross-modality synthesis [10.1117/12.2550009]. Here, we propose a JIF variant, LIF, performing fusion between the 2D en-face slices of a 3D OCT-A volume.
Instead of an external group of atlases, for each 2D en-face slice of a 3D OCT-A volume, the adjacent slices within an R-neighborhood are regarded as our group of "atlases" for that slice. Note that the target slice itself is included among the atlases, represented as the image with a red rim in Fig. 1. We perform registration using the greedy software [yushkevich2016fast]. While closely related, we note that the self-fusion method reported in [oguz2020self] for tissue layer enhancement is not suitable for vessel enhancement, as it tends to substantially blur and distort blood vessels [hu2020retinal].
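As a rough illustration of the fusion step, the sketch below fuses an R-neighborhood of en-face slices using locally varying similarity weights. The deformable registration with greedy that the paper performs first is omitted for brevity, and all names are ours:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_intensity_fusion(volume, target_idx, R=2, eps=1e-6, patch=5):
    """Minimal LIF sketch: fuse the en-face slices within an R-neighborhood
    of the target slice, weighting each neighbor by its local (patch-wise)
    similarity to the target.

    volume: (n_slices, H, W) en-face OCT-A stack
    """
    target = volume[target_idx].astype(np.float64)
    lo, hi = max(0, target_idx - R), min(len(volume), target_idx + R + 1)
    num = np.zeros_like(target)
    den = np.zeros_like(target)
    for k in range(lo, hi):          # the target itself is one of the "atlases"
        atlas = volume[k].astype(np.float64)
        # local residual error between atlas and target, averaged over a patch
        err = uniform_filter((atlas - target) ** 2, size=patch)
        w = 1.0 / (err + eps)        # locally varying weight map
        num += w * atlas
        den += w
    return num / den

vol = np.random.default_rng(1).random((5, 16, 16))
fused = local_intensity_fusion(vol, target_idx=2)
```

Neighbors that locally agree with the target contribute more, which homogenizes vessels while down-weighting uncorrelated speckle.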
Fig. 2: (a) original, (b) LIF, (c) CE-LIF.
Similar to a 1D Gaussian filter along the depth axis, LIF has a blurring effect that improves the homogeneity of vessels without dilating their thickness in the en-face image. Further, it also smooths the speckle noise in the background while raising the overall intensity level, as shown in Fig. 2a/2b. To make vessels stand out better, we introduce contrast-enhanced local intensity fusion (CE-LIF), using Pillow's ImageEnhance module (https://pillow.readthedocs.io/en/stable/reference/ImageEnhance.html), shown in Fig. 2c. However, intensity fusion of en-face images sacrifices the accuracy of vessel diameter in the depth direction. Specifically, some vessels existing exclusively in neighboring slices are inadvertently projected onto the target slice. For example, the small red box in Fig. 2 highlights a phantom vessel caused by incorrect fusion. As a result, LIF and CE-LIF are not appropriate for direct use, in spite of the desirable improvement they offer in the visibility of capillaries (e.g., large red box). In the following section, we propose a novel method that leverages LIF as an auxiliary modality for feature extraction, during which these excess projections are filtered out.
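For reference, Pillow's ImageEnhance.Contrast blends an image with its mean gray level; an equivalent NumPy sketch for images in [0, 1] (the enhancement factor here is illustrative, not the paper's value):

```python
import numpy as np

def contrast_enhance(lif_slice, factor=2.0):
    """CE-LIF sketch: mimic Pillow's ImageEnhance.Contrast, which blends
    the image with its mean gray level: out = mean + factor * (img - mean).
    factor > 1 increases contrast; factor = 1 leaves the image unchanged.
    """
    mean = lif_slice.mean()
    return np.clip(mean + factor * (lif_slice - mean), 0.0, 1.0)
```

Vessels brighter than the mean are pushed up and background speckle near the mean is pushed down, widening the intensity gap between them.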
2.2 Cross-modality feature extraction: LIFE
Liu et al. [liu2020variational] introduced an important concept for unsupervised feature extraction. Two depth-projected 2D OCT-A images, A and B, are acquired using different devices on the same retina. If they are well aligned, then aside from noise and differences in style, the majority of the anatomical structure is the same. A variational autoencoder (VAE) is set up as a pix2pix translator from A to B in which the latent space keeps full resolution. If B is well reconstructed from A, then the latent feature map can be regarded as encoding the features common to A and B, namely, the vasculature. The encoder is considered a segmentation network (Seg-Net) and the decoder a synthesis network (Syn-Net).
Unfortunately, this method has several drawbacks in practice. Imaging the same retina with different devices is rarely possible even in research settings and unrealistic in clinical practice. Furthermore, the 3D extension does not appear straightforward due to differences in image spacing between OCT devices and the difficulty of volumetric registration in these very noisy images. In contrast, we propose to use a single OCT-A volume and its LIF as the two modalities. This removes the need for multiple devices or registration, and allows us to produce a 3D segmentation by operating on individual en-face OCT-A slices rather than a single depth-projection image. We call the new translator network the local intensity fusion encoder (LIFE).
Fig. 3 shows the network architecture. To reduce the influence of speckle noise, we train a residual U-Net as a denoising network (Dn-Net), supervised by LIF. For the encoder, we implement a more complex model (R2U-Net) [alom2018recurrent] than Liu et al. [liu2020variational], supervised by CE-LIF. As the decoder, we use a shallow residual U-Net to balance computational cost and segmentation performance. The reparameterization trick enables gradient backpropagation when sampling is involved in a deep network [kingma2013auto]. This sampling is achieved by z = μ + σ ⊙ ε, where ε ~ N(0, I), and μ and σ are the mean and standard deviation of the latent space. The intensity ranges of all images are normalized to [0, 1]. To introduce some blurring effect, both L1 and L2 norms are added to the VAE loss function:

L_rec = (1/N) Σ_(i,j) [ λ1 |y(i,j) − ŷ(i,j)| + λ2 (y(i,j) − ŷ(i,j))² ],   (2)

where (i, j) are pixel coordinates, N is the number of pixels, y is the CE-LIF image, ŷ is the output of Syn-Net, and λ1 and λ2 are hyperparameters. Eq. 2 is also used as the loss for the Dn-Net, with the LIF image as y.
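The reparameterization step and the L1+L2 reconstruction term can be sketched as follows (a NumPy illustration with our own variable names; the full training loss, including any KL term, is not reproduced):

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
    The random draw is moved outside the computation graph so gradients
    can flow through mu and sigma during training."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def recon_loss(y_hat, y, lam1=1.0, lam2=1.0):
    """L1 + L2 reconstruction term (sketch of Eq. 2); lam1 and lam2
    mirror the paper's hyperparameters (values unspecified here)."""
    n = y.size
    return (lam1 * np.abs(y_hat - y).sum()
            + lam2 * ((y_hat - y) ** 2).sum()) / n

mu, log_var = np.zeros(3), np.zeros(3)
z = reparameterize(mu, log_var)
```

The same `recon_loss` can supervise the Dn-Net by passing the LIF image as `y`.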
As discussed above, LIF enhances the appearance of blood vessels but also introduces phantom vessels through fusion. The set of true vessel features in the original slice is thus a subset of the vessel features in its LIF. Because LIFE works to extract the features shared by both inputs, the phantom features that exist only in the LIF are cancelled out, as long as the model is properly trained without overfitting.
2.3 Experimental details
Preprocessing for motion artifact removal. Decorrelation allows OCT-A to emphasize vessels while other tissue types get suppressed (Fig. 4a). However, this requires the repeated OCT B-scans to be precisely aligned. Any registration errors cause motion artifacts, such that stationary tissue is not properly suppressed (Fig. 4b). These appear as horizontal artifacts in en-face images (Fig. 4c). We remove these artifacts by matching the histogram of the artifact B-scan to its closest well-decorrelated neighbor (Fig. 4d).
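The histogram-matching repair can be sketched with a simple quantile-mapping implementation (names are ours; a library routine such as scikit-image's match_histograms would serve the same purpose):

```python
import numpy as np

def match_histogram(source, reference):
    """Match the intensity histogram of an artifact-corrupted B-scan
    (`source`) to that of its closest well-decorrelated neighbor
    (`reference`), as in the motion-artifact removal step."""
    s_vals, s_idx, s_cnt = np.unique(source.ravel(),
                                     return_inverse=True, return_counts=True)
    r_vals, r_cnt = np.unique(reference.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_cnt) / source.size
    r_cdf = np.cumsum(r_cnt) / reference.size
    # map each source quantile onto the reference intensity at that quantile
    mapped = np.interp(s_cdf, r_cdf, r_vals)
    return mapped[s_idx].reshape(source.shape)

rng = np.random.default_rng(0)
src = rng.random((20, 20))            # B-scan with a motion artifact
ref = rng.random((20, 20)) * 2 + 1    # well-decorrelated neighbor
matched = match_histogram(src, ref)
```

After matching, the artifact B-scan's intensity distribution follows its neighbor's, suppressing the bright horizontal stripes in the en-face view.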
Binarization. To binarize the latent space estimated by LIFE, we apply the Perona-Malik diffusion equation [perona1990scale] followed by the global Otsu threshold [otsu1979threshold]. Any islands smaller than 30 voxels are removed.
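A minimal sketch of this binarization pipeline, with a 2D Perona-Malik diffusion and a NumPy Otsu threshold (parameter values are illustrative, not the paper's; the paper operates on the 3D latent volume):

```python
import numpy as np
from scipy import ndimage

def perona_malik(img, n_iter=10, kappa=0.1, gamma=0.2):
    """Perona-Malik anisotropic diffusion (2D sketch with periodic borders)."""
    u = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # differences toward the four neighbors
        dn = np.roll(u, -1, 0) - u
        ds = np.roll(u, 1, 0) - u
        de = np.roll(u, -1, 1) - u
        dw = np.roll(u, 1, 1) - u
        # edge-stopping conductance g = exp(-(|grad u| / kappa)^2)
        u += gamma * sum(np.exp(-(d / kappa) ** 2) * d
                         for d in (dn, ds, de, dw))
    return u

def binarize(latent, min_size=30):
    """Global Otsu threshold followed by removal of islands under min_size."""
    hist, edges = np.histogram(latent, bins=256)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)
    w1 = w0[-1] - w0
    m0 = np.cumsum(hist * centers)
    mu0 = m0 / np.maximum(w0, 1)
    mu1 = (m0[-1] - m0) / np.maximum(w1, 1)
    t = centers[np.argmax(w0 * w1 * (mu0 - mu1) ** 2)]  # Otsu threshold
    mask = latent > t
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    return np.isin(labels, 1 + np.flatnonzero(sizes >= min_size))
```

The diffusion smooths speckle while preserving vessel edges, so the subsequent global threshold produces fewer isolated noise islands.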
Dataset. The OCT volumes were acquired as repeated B-scans (spectral lines × frames × repeated frames) [malone2019handheld, el2018spectrally]. OCT-A is performed on motion-corrected [guizar2008efficient] OCT volumes using singular value decomposition. We manually crop each volume to retain only the depth slices that contain most of the vessels near the fovea, between the ganglion cell layer (GCL) and inner plexiform layer (IPL). Three fovea volumes are used for training and one for testing. As the number of slices between the GCL and IPL is limited, we aggressively augment the dataset by randomly cropping and flipping 10 windows from each en-face image. To evaluate vessels of differing size, we labeled three interacting plexuses near the fovea, displayed in Fig. 4e. A smaller ROI cropped from the center (Fig. 4f) is used for numerical evaluation.
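The crop-and-flip augmentation might be sketched as follows; the window size and flip probabilities are placeholders, as the exact values are not given in this excerpt:

```python
import numpy as np

def augment(slice_, n_windows=10, win=64, rng=None):
    """Augmentation sketch: randomly crop and flip n_windows windows from
    one en-face slice. win=64 is a placeholder window size."""
    rng = rng or np.random.default_rng(0)
    h, w = slice_.shape
    out = []
    for _ in range(n_windows):
        y = rng.integers(0, h - win + 1)
        x = rng.integers(0, w - win + 1)
        patch = slice_[y:y + win, x:x + win]
        if rng.random() < 0.5:
            patch = patch[:, ::-1]   # horizontal flip
        if rng.random() < 0.5:
            patch = patch[::-1, :]   # vertical flip
        out.append(patch)
    return np.stack(out)

windows = augment(np.zeros((100, 100)))
```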
To further evaluate the method, we train and test our model on OCT-A of zebrafish eyes, which have a simple vessel structure ideal for manual labeling. This also allows us to test the generalizability of our method to images from a different species. Furthermore, the fish dataset contains stronger speckle noise than the human data, which allows us to test the robustness of the method to high noise levels. Three volumes are labeled for testing and five volumes are used for training. All manual labeling is done in ITK-SNAP [py06nimg].
Due to the lack of labeled data, no supervised learning method is applicable. Following the same enhance-then-binarize pattern as our approach, we apply Frangi's multi-scale vesselness filter [frangi1998multiscale] and the optimally oriented flux (OOF) [8879535, law2008three], respectively, to enhance the artifact-removed original image, and then use the same binarization steps described above. We also present results using Otsu thresholding and k-means clustering.
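For intuition about the Hessian-based baselines, here is a simplified single-scale 2D Frangi-style vesselness (our sketch, not the multi-scale filter used as the actual baseline):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def vesselness2d(img, sigma=2.0, beta=0.5, c=0.5):
    """Minimal Frangi-style vesselness: score pixels where the smaller
    Hessian eigenvalue is near zero and the larger one is strongly
    negative (a bright tubular structure on a dark background)."""
    s = gaussian_filter(img.astype(np.float64), sigma)
    gy, gx = np.gradient(s)
    hyy, hyx = np.gradient(gy)
    hxy, hxx = np.gradient(gx)
    # eigenvalues of the 2x2 symmetric Hessian, ordered so |l1| <= |l2|
    tmp = np.sqrt((hxx - hyy) ** 2 + 4 * hxy ** 2)
    l1 = 0.5 * (hxx + hyy + tmp)
    l2 = 0.5 * (hxx + hyy - tmp)
    swap = np.abs(l1) > np.abs(l2)
    l1, l2 = np.where(swap, l2, l1), np.where(swap, l1, l2)
    rb = np.abs(l1) / (np.abs(l2) + 1e-12)    # blobness ratio
    s2 = np.sqrt(l1 ** 2 + l2 ** 2)           # second-order structureness
    v = (np.exp(-rb ** 2 / (2 * beta ** 2))
         * (1 - np.exp(-s2 ** 2 / (2 * c ** 2))))
    v[l2 > 0] = 0.0                           # keep bright ridges only
    return v
```

Because such filters respond to any ridge-like intensity structure, the horizontal motion-artifact stripes also score highly, which is consistent with the excessive false positives reported for Frangi and OOF below.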
All networks are trained on an NVIDIA RTX 2080 Ti 11GB GPU for 50 epochs with a batch size of 2. For the first 3 epochs, the entire network uses the same Adam optimizer with a learning rate of 0.001. After that, LIFE and the decoder are optimized separately with starting learning rates of 0.002 and 0.0001, respectively, in order to shift more of the workload onto LIFE. Both learning rates decay every 3 epochs at a rate of 0.5.
Fig. 5: human input, human latent, fish input, fish latent.
Fig. 5 displays examples of extracted latent images. It is visually evident that LIFE successfully highlights the vasculature. Compared with the raw input, even delicate capillaries show improved homogeneity and separability from the background. Fig. 6 illustrates 2D segmentation results within the manually segmented ROI, where LIFE can be seen to have better sensitivity and connectivity than the baseline methods. Fig. 6 also shows a 3D rendering (via marching cubes [10.1145/37402.37422]) of each method. In the middle row, we filtered out the false positives (FP) to highlight the false negatives (FN); the omitted FP areas are highlighted in yellow in the bottom row. These FPs are often distributed along horizontal lines, caused by unresolved motion artifacts. Hessian-based methods appear especially sensitive to motion artifacts and noise; hence Frangi's method and OOF introduce excessive FPs. Clearly, LIFE achieves the best preservation of small capillaries, such as the areas highlighted in white boxes, without introducing too many FPs.
Fig. 7 shows that LIFE has superior performance on the zebrafish data. The white boxes highlight that only LIFE can capture smaller branches.
Fig. 8 shows quantitative evaluation across B-scans, and Table 1 across the whole volume. Consistent with our qualitative assessments, LIFE significantly and substantially (over 0.20 Dice gain) outperforms the baseline methods on both human and fish data.
Finally, we directly binarize LIF and CE-LIF as additional baselines. The Dice scores on the human data are 0.5293 and 0.4892, well below LIFE (0.7736).
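The Dice scores reported above are computed as twice the intersection over the sum of the two masks; a minimal sketch:

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient between two binary segmentation volumes."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: define as a perfect match
    return 2.0 * np.logical_and(pred, gt).sum() / denom
```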
4 Discussion and Conclusion
We proposed a method for 3D segmentation of foveal vessels and capillaries from OCT-A volumes that requires neither manual annotation nor multiple image acquisitions to train. The introduction of the LIF modality brings several benefits. Since LIF is computed directly from the input data, no inter-volume registration is needed between the two modalities input to LIFE. Further, rather than depending purely on image intensity, LIF exploits local structural information to enhance small features like capillaries. Still, there are some disadvantages to overcome in future research. For instance, LIFE cannot directly provide a binarized output, and hence the crude thresholding method used for binarization influences the segmentation performance.
This work is supported by NIH R01EY031769, NIH R01EY030490 and Vanderbilt University Discovery Grant Program.