Fused Detection of Retinal Biomarkers in OCT Volumes

07/16/2019 ∙ by Thomas Kurmann, et al. ∙ 2

Optical Coherence Tomography (OCT) is the primary imaging modality for detecting pathological biomarkers associated with retinal diseases such as Age-Related Macular Degeneration. In practice, clinical diagnosis and treatment strategies are closely linked to biomarkers visible in OCT volumes, and the ability to identify these plays an important role in the development of ophthalmic pharmaceutical products. In this context, we present a method that automatically predicts the presence of biomarkers in OCT cross-sections by incorporating information from the entire volume. We do so by adding a bidirectional LSTM to fuse the outputs of a Convolutional Neural Network that predicts individual biomarkers. We thus avoid the need for pixel-wise annotations to train our method, yet still provide fine-grained biomarker information. On a dataset of 416 volumes, we show that our approach imposes coherence between biomarker predictions across volume slices and that our predictions are superior to several existing approaches.




1 Introduction

Age-Related Macular Degeneration (AMD) and Diabetic Macular Edema (DME) are chronic sight-threatening conditions that affect over 250 million people worldwide [1]. To diagnose and manage these diseases, Optical Coherence Tomography (OCT) is the standard of care to image the retina safely and quickly (see Fig. 1). However, with a growing global patient population and over 30 million volumetric OCT scans acquired each year, the resources needed to assess these scans have already surpassed the capacity of knowledgeable experts to do so [1].

For ophthalmologists, identifying biological markers of the retina, or biomarkers, plays a critical role in both clinical routine and research. Biomarkers can include the presence of different types of fluid buildups in the retina, retinal shape and thickness characteristics, the presence of cysts, atrophy or scar tissue. Beyond this, biomarkers are paramount to assess disease severity in clinical routine and have a major role in the development of new pharmaceutical therapeutics. With over a dozen clinical and research biomarkers, their identification is both challenging and time consuming due to their number, size, shape and extent.

Figure 1: (left) Example of an OCT volume of a patient with AMD. Slices are misaligned, even after post-processing registration. (right) four slices extracted from the volume, each containing a different set of biomarkers indicated for each image.

To support clinicians with OCT-based diagnosis, numerous automated methods have attempted to segment and identify specific biomarkers from OCT scans. For instance, retinal layer [2, 3, 4] and fluid [4] segmentation, as well as drusen detection [5] have previously been proposed. While these methods perform well, they are limited in the number of biomarkers they consider at a time and often use pixel-wise annotations to train supervised machine learning frameworks. Given the enormous annotation task involved in manually segmenting volumes, they are often trained and evaluated on relatively small amounts of data (20 to 100 volumes) [2, 4, 6].

Instead, we present a novel strategy that automatically identifies the presence of a wide range of biomarkers throughout an OCT volume. Our method does not require biomarker segmentation annotations, but rather biomarker tags indicating which biomarkers are present on a given OCT slice. Using a large dataset of OCT slices and annotated tags, our approach estimates all biomarkers on each slice of a new volume. We first do this separately for each slice, without considering adjacent slices, as these are typically highly anisotropic and not aligned within the volume. We then treat these predictions as sufficient statistics for each slice and impose biomarker coherence across slices using a bidirectional Long Short-Term Memory (LSTM) neural network [7]. By doing so, we force our network to learn the desired biomarker co-dependencies within the volume from slice predictions only, so as to avoid dealing with anisotropic and non-registered slices common in OCT volumes. We show in our experiments that this leads to superior performance over a number of existing methods.

2 Method

We describe here our approach for predicting biomarkers across all slices in an OCT volume. Formally, we wish to predict the presence of $B$ different biomarkers in a volume using a deep network, $f$, that maps from a volume of $S$ slices, $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_S)$, to a set of predicted probabilities $\hat{\mathbf{Y}} \in [0, 1]^{S \times B}$. We denote as $\hat{y}_{sb}$ the estimated probability that biomarker $b$ occurs in slice $s$.

While there are many possible network architectures for $f$, one simple approach would be to express $f$ as $S$ copies of the same CNN, $f'$, whereby each slice in the volume is individually predicted. However, such an architecture ignores the fact that biomarkers are deeply correlated across an entire volume. The other extreme would be to define $f$ as a single 3D CNN. Doing so, however, would be difficult because (1) 3D CNNs assume spatial coherence in their convolutional layers and (2) the output of $f$ would be of dimension $S \times B$. Assumption (1) is strongly violated by OCT volume structure, since consecutive OCT slices typically exhibit non-rigid transformations, while (2) would imply training with an enormous amount of training data.

For these reasons, we take an intermediate approach between the above mentioned extremes and express our network as a composition $f = f_v \circ f_s$, where $f_s$ processes slices individually and produces a $D$-dimensional descriptor for each slice. Then, $f_v$ fuses all $S$ slice descriptors and predicts the biomarker probabilities for each slice, thereby taking into account the information of the entire volume. Fig. 2 depicts our framework and we detail each of its components in the subsequent sections.

2.1 Slice network $f_s$

When presented with a volume $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_S)$, $f_s$ processes each slice independently using the same slice convolutional network $f'_s$ that maps from a single slice to a $D$-dimensional descriptor. The output of $f_s$ is then the concatenation of the individual descriptors,

$$f_s(\mathbf{X}) = \left[ f'_s(\mathbf{x}_1), \ldots, f'_s(\mathbf{x}_S) \right]. \tag{1}$$

In our experiments, we implemented $f'_s$ as the convolutional part of a Dilated Residual Network [8] up to the global pooling layer.

2.2 Volume fusion network $f_v$

Let $\mathbf{Z} = f_s(\mathbf{X}) \in \mathbb{R}^{S \times D}$ be the set of descriptors of a volume $\mathbf{X}$ computed by $f_s$. The fusion network $f_v$ takes $\mathbf{Z}$ and produces a final probability prediction $\hat{\mathbf{Y}}$.

The most straightforward architecture for $f_v$ would be a multilayer perceptron (MLP), which is typical after convolutional layers. However, MLPs make no assumptions about the underlying nature of the data. Consequently, MLPs are hard to train, requiring either huge amounts of training data or aggressive data augmentation, particularly when the dimensionality of the input space is large as in this case. More importantly, an MLP would ignore two important aspects of $\mathbf{Z}$: (1) the rows of $\mathbf{Z}$ belong to the same feature space and share a common distribution; (2) volumes have spatial structure with respect to the biomarkers within them, and slices that are near one another have similar descriptors.

Figure 2: Overview diagram of our approach including the CNN, the bidirectional LSTM, and the output of the fully connected stage. OCT volumes consist of 49 slices.

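To account for this, we use an LSTM to process slices in a sequential way and implicitly leverage spatial dependencies, while performing the same operations on every input (i.e., implicitly assuming a common distribution in the input space). Formally, our LSTM is a network $l$ that receives a descriptor and the previous $H$-dimensional LSTM state, to produce a new state. We use the LSTM to iteratively process the descriptors $\mathbf{Z} = (\mathbf{z}_1, \ldots, \mathbf{z}_S)$, generating a sequence of LSTM states,

$$\overrightarrow{\mathbf{h}}_s = l\left(\mathbf{z}_s, \overrightarrow{\mathbf{h}}_{s-1}\right), \quad s = 1, \ldots, S, \tag{2}$$

where $\mathbf{z}_s$ is the descriptor of slice $s$. Additionally, since the underlying distribution of OCT volumes is symmetric (flipping the slice order in a volume produces another statistically correct volume), we use the same LSTM to process the descriptors backwards,

$$\overleftarrow{\mathbf{h}}_s = l\left(\mathbf{z}_s, \overleftarrow{\mathbf{h}}_{s+1}\right), \quad s = S, \ldots, 1, \tag{3}$$

generating a second sequence of LSTM states. Initial states are $\overrightarrow{\mathbf{h}}_0$ and $\overleftarrow{\mathbf{h}}_{S+1}$, respectively.

Note that at each position $s$, $\overrightarrow{\mathbf{h}}_s$ and $\overleftarrow{\mathbf{h}}_s$ combine the information from the current descriptor $\mathbf{z}_s$ with additional information coming from neighboring slices. We then concatenate both states in a single vector and feed it to a final fully connected layer $f_{fc}$ that computes the estimated probabilities. The complete volume fusion network $f_v$ is the concatenation of the outputs of $f_{fc}$ for all the slices:

$$f_v(\mathbf{Z}) = \left[ f_{fc}\left(\left[\overrightarrow{\mathbf{h}}_1, \overleftarrow{\mathbf{h}}_1\right]\right), \ldots, f_{fc}\left(\left[\overrightarrow{\mathbf{h}}_S, \overleftarrow{\mathbf{h}}_S\right]\right) \right]. \tag{4}$$
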
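The fusion network described above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names `LSTMCell` and `fuse_volume`, the random weight initialisation, the zero initial states, the sigmoid on the fully connected output, and the toy descriptor and state sizes are all hypothetical. Only the overall structure follows the text: one shared LSTM run over the slice descriptors in both directions, concatenation of the two states per slice, and a fully connected layer applied to every slice.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMCell:
    """Plain LSTM cell (hypothetical sizes and initialisation)."""

    def __init__(self, input_size, hidden_size, rng):
        # One stacked weight matrix for the input, forget, cell and output gates.
        self.W = rng.normal(0.0, 0.1, (4 * hidden_size, input_size + hidden_size))
        self.b = np.zeros(4 * hidden_size)
        self.hidden_size = hidden_size

    def step(self, z, h, c):
        gates = self.W @ np.concatenate([z, h]) + self.b
        i, f, g, o = np.split(gates, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new


def fuse_volume(descriptors, cell, W_fc, b_fc):
    """Map S x D slice descriptors to S x B biomarker probabilities."""
    n_slices = descriptors.shape[0]
    size = cell.hidden_size
    forward = np.zeros((n_slices, size))
    backward = np.zeros((n_slices, size))

    # Forward pass over the slices (zero initial state is an assumption).
    h, c = np.zeros(size), np.zeros(size)
    for s in range(n_slices):
        h, c = cell.step(descriptors[s], h, c)
        forward[s] = h

    # Backward pass with the SAME cell, i.e. shared weights.
    h, c = np.zeros(size), np.zeros(size)
    for s in reversed(range(n_slices)):
        h, c = cell.step(descriptors[s], h, c)
        backward[s] = h

    # Concatenate both states per slice and apply the fully connected layer.
    states = np.concatenate([forward, backward], axis=1)  # S x 2H
    return sigmoid(states @ W_fc.T + b_fc)                # S x B


rng = np.random.default_rng(0)
cell = LSTMCell(input_size=8, hidden_size=4, rng=rng)   # toy sizes
W_fc = rng.normal(0.0, 0.1, (11, 2 * 4))                # 11 biomarkers
b_fc = np.zeros(11)
descriptors = rng.normal(size=(49, 8))                  # 49 slices per volume
probabilities = fuse_volume(descriptors, cell, W_fc, b_fc)
print(probabilities.shape)  # (49, 11)
```

Because the states at a slice depend on all preceding slices in one direction and all following slices in the other, changing the descriptor of one slice changes the predictions of its neighbours, which is the mechanism that lets the fusion stage impose coherence across the volume.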
2.3 Training

Training requires a dataset of $N$ annotated volumes, $\mathcal{T} = \{(\mathbf{X}^{(n)}, \mathbf{Y}^{(n)})\}_{n=1}^{N}$, where for each volume $\mathbf{X}^{(n)}$ a set of binary labels $\mathbf{Y}^{(n)} \in \{0, 1\}^{S \times B}$ is provided. $y_{sb}$ is 1 if biomarker $b$ is present in slice $s$. We then use the standard binary cross entropy as our loss function,

$$\ell(\mathbf{Y}, \hat{\mathbf{Y}}) = -\sum_{s=1}^{S} \sum_{b=1}^{B} \left[ y_{sb} \log \hat{y}_{sb} + (1 - y_{sb}) \log\left(1 - \hat{y}_{sb}\right) \right], \tag{5}$$

where $\hat{\mathbf{Y}} = f(\mathbf{X})$ is the estimation of our network for a given volume. The goal during training is then to minimize the expected value of the loss $\ell$ over the training dataset $\mathcal{T}$.

While we could perform this minimization with a gradient-based method in an end-to-end fashion from scratch, we found that a two-stage training procedure helped boost performance at test time. In the first stage, we train the slice network $f_s$ alone to predict the biomarkers of individual slices. More specifically, we append a temporary fully connected layer at the end of $f_s$, and then minimize a cross entropy loss while presenting to the network randomly sampled slices from $\mathcal{T}$. In the second stage, we fix the weights of $f_s$ and minimize the loss of Eq. (5) for the whole architecture $f$, updating only the weights of the volume fusion network $f_v$.
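As a small worked example, the binary cross entropy over all slices and biomarkers of one volume can be computed as below. This is a sketch under assumptions: the function name `binary_cross_entropy`, the clamping constant `eps`, and the choice to sum rather than average over entries are illustrative and not taken from the paper (averaging only rescales the loss and does not change its minimizer).

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-7):
    """Summed binary cross entropy for one volume.

    labels:        S rows of B binary values (1 if the biomarker is present).
    probabilities: S rows of B predicted probabilities.
    """
    total = 0.0
    for label_row, prob_row in zip(labels, probabilities):
        for y, p in zip(label_row, prob_row):
            # Clamp to avoid log(0); eps is an illustrative choice.
            p = min(max(p, eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Two slices, two biomarkers: an uninformative prediction of 0.5 everywhere
# costs log(2) per entry.
labels = [[1, 0], [0, 1]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
loss = binary_cross_entropy(labels, uninformative)
```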

3 Experiments

3.1 Data

Our dataset consists of 416 volumes (originating from 327 individuals with Age-Related Macular Degeneration and Diabetic Retinopathy), whereby each volume scan consists of 49 slices for a total of 20’384 slices. Volumes were obtained using the Heidelberg Spectralis OCT and each OCT slice has a resolution of pixels. Trained annotators provided slice-level annotations for 11 common biomarkers: Subretinal Fluid (SRF), Intraretinal Fluid (IRF), Hyperreflective Foci (HF), Drusen, Reticular Pseudodrusen (RPD), Epiretinal Membrane (ERM), Geographic Atrophy (GA), Outer Retinal Atrophy (ORA), Intraretinal Cysts (IRC), Fibrovascular PED (FPED) and Healthy. The dataset was randomly split for training and testing purposes, making sure that no volume from the same individual was present in both the training and test sets. These sets contained a total of 18’179 and 2’205 slices, respectively. The distribution of biomarkers in the training and test sets is reported in Table 1. For all our experiments, we performed 10-fold cross validation, where the training set was split into a training (90%) and validation (10%) set.
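The slice counts above are consistent with 49 slices per volume: 416 × 49 = 20’384, and the two sets correspond to 371 training and 45 test volumes (371 × 49 = 18’179 and 45 × 49 = 2’205). A patient-wise split of the kind described can be sketched as follows. The function `split_by_patient`, the identifiers, the test fraction and the seed are hypothetical; the paper does not describe how the split was implemented.

```python
import random
from collections import defaultdict

# Sanity check of the counts reported in the text.
assert 416 * 49 == 20384
assert 371 * 49 == 18179 and 45 * 49 == 2205 and 371 + 45 == 416


def split_by_patient(volume_to_patient, test_fraction, seed=0):
    """Split volumes so that no patient appears in both sets."""
    by_patient = defaultdict(list)
    for volume_id, patient_id in volume_to_patient.items():
        by_patient[patient_id].append(volume_id)

    # Shuffle patients (not volumes) and assign whole patients to the test set.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test_patients = set(patients[:n_test])

    train, test = [], []
    for patient_id, volumes in by_patient.items():
        target = test if patient_id in test_patients else train
        target.extend(volumes)
    return sorted(train), sorted(test)


# Toy example: 10 volumes from 6 hypothetical patients.
volume_to_patient = {f"vol{i}": f"patient{i % 6}" for i in range(10)}
train_volumes, test_volumes = split_by_patient(volume_to_patient, test_fraction=0.3)
```

Splitting at the patient level rather than the volume level matters here because several volumes come from the same individual (416 volumes from 327 individuals), and a volume-level split could leak patient-specific appearance into the test set.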

3.2 Parameters and baselines

For our approach, we set for the size of the descriptors and for the size of the LSTM hidden state. We train the fusion stage using a batch size of 4 volumes, while training using SGD with momentum of  and a base learning rate of 

which we decrease after 10 epochs of no improvement in the validation score.
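The learning rate schedule described above (a decrease after 10 epochs without improvement in the validation score) can be sketched as a small helper. The class name `PlateauDecay`, the decay factor and the base learning rate used in the example are assumptions for illustration; only the patience of 10 epochs comes from the text.

```python
class PlateauDecay:
    """Decrease the learning rate when the validation score stops improving."""

    def __init__(self, base_lr, patience=10, factor=0.1):
        self.lr = base_lr
        self.patience = patience  # epochs without improvement before decaying
        self.factor = factor      # multiplicative decay (assumed value)
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, validation_score):
        """Call once per epoch; returns the learning rate for the next epoch."""
        if validation_score > self.best:
            self.best = validation_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


schedule = PlateauDecay(base_lr=1.0, patience=10, factor=0.5)
schedule.step(0.80)                                   # first epoch sets the best score
rates = [schedule.step(0.79) for _ in range(10)]      # ten epochs without improvement
```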

To demonstrate the performance of our approach, we compare it to the following baselines:

  • Base: the output of the slice classifier alone (i.e., no slice fusion).

  • MLP: output of size $S \times B$ using the $S \times D$ sized feature matrix from the Base classifier.

  • Conv-BLSTM: fuses the last convolutional channels of with a size of and a hidden state of channels. This is then followed by a global pooling and a fully connected layer.

  • Conditional Random Field (CRF): trained to learn co-occurrence of biomarkers within each slice and to smooth the predictions for each biomarker along different slices of the volume. Logit outputs of the Base classifier are used as unary terms, and learned pairwise weights are constrained to be positive to enforce submodularity of the CRF energy. We use the method from [9] for training and standard graph-cuts for inference at test time.
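To make the submodularity remark concrete, one possible form of such a CRF energy over binary labels $y_{sb}$ is written out below. The exact potentials are not given in the text, so the unary terms $\theta_{sb}$ (derived from the Base logits), the co-occurrence weights $w^{\mathrm{co}}_{bb'}$, the smoothness weights $w^{\mathrm{sm}}_{b}$ and the disagreement-penalty form of the pairwise terms are assumptions for illustration only.

$$E(\mathbf{y}) = \sum_{s=1}^{S} \sum_{b=1}^{B} \theta_{sb}\, y_{sb} \;+\; \sum_{s=1}^{S} \sum_{b < b'} w^{\mathrm{co}}_{bb'}\, [y_{sb} \neq y_{sb'}] \;+\; \sum_{s=1}^{S-1} \sum_{b=1}^{B} w^{\mathrm{sm}}_{b}\, [y_{sb} \neq y_{s+1,b}], \qquad w^{\mathrm{co}}_{bb'},\, w^{\mathrm{sm}}_{b} \geq 0.$$

With non-negative weights, every pairwise term $\psi$ satisfies $\psi(0,0) + \psi(1,1) \leq \psi(0,1) + \psi(1,0)$, since the left side is $0$ and the right side is $2w \geq 0$. This is the submodularity condition under which a binary energy of this kind can be minimized exactly with graph cuts.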

For all methods we use the same Base classifier and train it as a multi-label classification task using a learning rate of , a batch size of 32 with SGD and a momentum of 

. Rotational and flipping data augmentation was applied during training. We retain the best model for evaluation and do not perform test-time augmentation or cropping. The network was pre-trained on ImageNet [10].

Our primary evaluation metrics are the micro and macro mean Average Precision (mAP). In addition, we also report the Exact Match Ratio (EMR), which is equal to the percentage of slices predicted without failing to detect any biomarker in them. The mAP of the CRF baseline is not directly comparable, as the CRF output is binary and hence allows only a single precision-recall point to be evaluated. We therefore also state the maximum F1 scores for each method.
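The difference between the micro and macro averages, and the exact-match criterion, can be illustrated with a short pure-Python sketch. The function names, the rank-based definition of average precision (ties are not treated specially), the 0.5 threshold and the toy data are assumptions; the exact-match function uses the common multi-label definition (predicted and annotated label sets are identical), which may differ in detail from the evaluation code used for Table 1.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else float("nan")


def macro_map(labels, scores):
    """Average precision per biomarker (column), then the mean over biomarkers."""
    n_biomarkers = len(labels[0])
    aps = [
        average_precision([row[b] for row in labels], [row[b] for row in scores])
        for b in range(n_biomarkers)
    ]
    return sum(aps) / len(aps)


def micro_map(labels, scores):
    """Average precision over all (slice, biomarker) pairs pooled together."""
    flat_labels = [y for row in labels for y in row]
    flat_scores = [p for row in scores for p in row]
    return average_precision(flat_labels, flat_scores)


def exact_match_ratio(labels, scores, threshold=0.5):
    """Fraction of slices whose thresholded prediction equals the label vector."""
    matches = sum(
        all((p >= threshold) == bool(y) for y, p in zip(label_row, score_row))
        for label_row, score_row in zip(labels, scores)
    )
    return matches / len(labels)


# Toy example: 3 slices, 2 biomarkers.
labels = [[1, 0], [0, 1], [1, 1]]
scores = [[0.9, 0.2], [0.4, 0.8], [0.7, 0.3]]
macro = macro_map(labels, scores)
micro = micro_map(labels, scores)
emr = exact_match_ratio(labels, scores)
```

In this toy case each biomarker is ranked perfectly on its own, so the macro average is 1, while pooling all predictions lowers the micro average because a positive with score 0.3 is ranked below a negative with score 0.4.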

3.3 Results

Table 1 reports the performance of all methods. Using the proposed method, we see an increase in mAP across all biomarkers except for GA and ORA. Both biomarkers have a very low sample size in the test set. The proposed method outperforms all other fusion methods in terms of mAP and F1 score and considerably improves over the unfused baseline, which confirms our hypothesis that inter-slice dependencies can be used to increase the per-slice performance. The poor performance of the Convolutional BLSTM can be explained by the misalignment of adjacent slices.

In Fig. 3, we show a typical example illustrating the performance improving ability of our proposed method. In particular, we show here the prediction of our approach on each slice for each biomarker and highlight three consecutive slices of the tested volume (right). For comparison, we also show the corresponding ground-truth (top left) and the outcome from the Base classifier (middle left). Here we see that our approach is capable of inferring more accurately the set of biomarkers across the different slices.

Biomarker            Base         MLP          Conv-BLSTM   CRF          Proposed
Healthy (5310/494)   0.797±0.023  0.730±0.025  0.795±0.013  -            0.800±0.022
SRF (942/103)        0.847±0.024  0.796±0.043  0.877±0.030  -            0.905±0.017
IRF (2019/339)       0.691±0.044  0.705±0.039  0.701±0.052  -            0.761±0.047
HF (4261/684)        0.877±0.008  0.839±0.010  0.863±0.018  -            0.896±0.007
Drusen (3990/399)    0.762±0.024  0.731±0.024  0.766±0.029  -            0.775±0.038
RPD (1620/146)       0.291±0.044  0.302±0.036  0.288±0.069  -            0.335±0.077
ERM (4338/670)       0.885±0.009  0.849±0.014  0.850±0.022  -            0.903±0.010
GA (897/67)          0.557±0.063  0.234±0.047  0.330±0.049  -            0.556±0.057
ORA (1999/84)        0.151±0.018  0.105±0.008  0.143±0.025  -            0.131±0.019
IRC (3097/553)       0.932±0.006  0.880±0.011  0.928±0.012  -            0.940±0.006
FPED (3654/387)      0.931±0.007  0.920±0.008  0.936±0.009  -            0.949±0.006
mAP (micro)          0.814±0.006  0.768±0.012  0.794±0.010  0.599±0.003  0.834±0.012
mAP (macro)          0.702±0.008  0.645±0.009  0.680±0.012  0.523±0.006  0.723±0.014
EMR*                 0.423±0.015  0.164±0.048  0.413±0.019  0.440±0.003  0.438±0.011
F1*                  0.676±0.006  0.502±0.024  0.676±0.011  0.649±0.013  0.694±0.009
Table 1: Experimental results comparing our proposed method to other approaches. The per-biomarker scores are shown as mean Average Precision (mAP). The training and test label occurrence is stated beside the biomarker name (training/test). (*) threshold taken at the max F1 score.

Figure 3: Volume classification example, where (top left) depicts the ground truth, (middle left) the Base classification and (bottom left) our proposed method. Warm colors indicate higher likelihood of presence. (Right) Three consecutive slices where the Base classifier incorrectly misses biomarker IRF in the center slice (yellow). Our proposed method manages to fuse information from adjacent slices (pink and orange) to infer the proper prediction.

4 Conclusion

We have presented a novel method to identify pathological biomarkers in OCT slices. Our approach involves detecting biomarkers first slice by slice in the OCT volume and then using a bidirectional LSTM to coherently adjust predictions. As far as we are aware, we are the first to demonstrate that such fine-grained biomarker detection can be achieved in the context of retinal diseases. We have shown that our approach performs well on a substantial patient dataset outperforming other common fusion methods. Future efforts will be focused on extending these results to infer pixel-wise segmentations of found biomarkers relying solely on the per-image labels.


This work received partial financial support from the Innosuisse Grant #6362.1 PFLS-LS.


  • [1] Bourne, R., et al.: Magnitude, temporal trends, and projections of the global prevalence of blindness and distance and near vision impairment: a systematic review and meta-analysis. Lancet Global Health 5 (2017) e888 – e897
  • [2] Apostolopoulos, S., De Zanet, S., Ciller, C., Wolf, S., Sznitman, R.: Pathological OCT Retinal Layer Segmentation Using Branch Residual U-Shape Networks. In: Medical Image Computing and Computer-Assisted Intervention. (2017) 294–301
  • [3] Hussain, M.A., Bhuiyan, A., Turpin, A., Luu, C.D., Smith, R.T., Guymer, R.H., Kotagiri, R.: Automatic Identification of Pathology-Distorted Retinal Layer Boundaries Using SD-OCT Imaging. IEEE Transactions on Biomedical Engineering 64(7) (2017) 1638–1649
  • [4] Roy, A.G., Conjeti, S., Karri, S.P.K., Sheet, D., Katouzian, A., Wachinger, C., Navab, N.: ReLayNet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks. Biomedical Optics Express 8(8) (2017)
  • [5] Zhao, R., Camino, A., Wang, J., Hagag, A.M., Lu, Y., Bailey, S.T., Flaxel, C.J., Hwang, T.S., Huang, D., Li, D., Jia, Y.: Automated drusen detection in dry age-related macular degeneration by multiple-depth, en face optical coherence tomography. Biomedical Optics Express 8(11) (2017) 5049–5064
  • [6] Bogunovic, H., et al.: RETOUCH - the retinal OCT fluid detection and segmentation benchmark and challenge. IEEE Trans. Med. Imaging (February 2019)
  • [7] Graves, A., Fernández, S., Schmidhuber, J.: Bidirectional lstm networks for improved phoneme classification and recognition. In: Artificial Neural Networks: Formal Models and Their Applications. (2005) 799–804
  • [8] Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. (May 2017)
  • [9] Szummer, M., Kohli, P., Hoiem, D.: Learning crfs using graph cuts. Volume 5303. (10 2008) 582–595
  • [10] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3) (2015) 211–252