Small, well-defined, dot-shaped lesions with reflectivity equal to or higher than that of the retinal pigment epithelium (RPE), visualized by optical coherence tomography (OCT), have been termed hyperreflective foci (HRF). They have been shown to occur in various retinal diseases, including neovascular age-related macular degeneration, diabetic retinopathy and retinal vein occlusion [1, 2, 3]. Histopathologic analyses have proposed several etiologies, of which the concept of activated, migrating RPE cells currently seems to be the most widely accepted in the setting of age-related macular degeneration [4]. HRF may also represent lipid exudates, which are frequently seen in diabetic maculopathy [5]. Several studies have linked the presence of HRF to disease progression [1, 6]. Furthermore, the location and presence of HRF have been proposed as negative prognostic factors for visual function [3, 7]. Given that formation and particularly migration of HRF have been shown to occur in the course of various retinal diseases, robust identification is the first important step in further exploring the role of these lesions in pathomorphologic disease dynamics. Manual identification and counting of HRF throughout OCT volumes, which usually consist of a few dozen to hundreds of B-scan sections, is error-prone and tedious. Automated identification allows easy analysis of HRF presence, load and dynamics and promotes reproducible research on this highly relevant biomarker.
Here, we apply supervised machine learning and deep neural networks for accurate fully automated HRF segmentation.
Residual networks (ResNet), introduced by He et al. [8], improved the state of the art in diverse visual recognition tasks. They won the classification, detection and localization challenges of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and the Microsoft Common Objects in Context (MSCOCO) detection and segmentation challenges. Residual networks ease the training of deep neural networks by introducing residual units, which allow even very deep networks to learn the identity mapping. Deep learning based semantic segmentation approaches, for example FCN by Long et al. [9] or DeepLab by Chen et al. [10], have been shown to achieve higher segmentation accuracy than formulations of segmentation as an image-level convolutional neural network (CNN) classification problem. Since its introduction by Ronneberger et al. [12], U-Net has been a widely used architecture for semantic segmentation. It won the International Symposium on Biomedical Imaging (ISBI) cell tracking challenge (2015) with a network trained on transmitted light microscopy images. Furthermore, this architecture allows fast end-to-end training and inference, and can be trained from only a few training samples [12]. Both architectures, ResNet and U-Net, are state-of-the-art components for image classification and segmentation tasks. Anas et al. [13] presented a residual U-Net (ResUNet), a U-Net that implements residual units, for clinical target-volume delineation in prostate brachytherapy.
In contrast to our work, previous studies focus only on manual or semi-automated analysis of HRF. In the literature we find manual analysis of HRF in SD-OCT scans, where HRF was correlated with disease progression of age-related macular degeneration (AMD). Korot et al. [14] developed a semi-automated pipeline to quantify vitreous HRF in SD-OCT scans. They evaluated the repeatability of the algorithm by comparing the results on successive OCT scans of the same patients. Since Korot et al. did not leverage machine learning methods, no training step was involved. Furthermore, no pixel-level ground-truth data was used for performance analysis of the algorithm.
In this paper, we propose to leverage deep neural networks for fully automated segmentation of HRF in spectral-domain OCT (SD-OCT) scans. We utilize a ResUNet for accurate HRF segmentation, described in Section 2.2. We comprehensively evaluate different semantic segmentation architectures, the influence of different training objectives (cross-entropy loss vs. Dice loss), and single-vendor vs. joint-vendor training. Experiments (Section 3) on labeled data extracted from SD-OCT scans show that this approach segments HRF with high accuracy. To the best of our knowledge, this is the first published work on fully automated detection and segmentation of HRF.
2 Semantic segmentation of hyperreflective foci
Image classification describes the task of assigning the entirety of pixels of an image to a single class out of a prespecified number of object classes, depending on the main object visible in the image. The classification result will most likely conform to the class label of the most prominent image object. In contrast, semantic segmentation [15, 16, 9, 10, 11] classifies each pixel of an image in a single pass and thus allows not only detecting but also localizing multiple objects in an image. This can be very run-time efficient compared to performing image segmentation with an image-level classification approach. The latter is time consuming, because it involves extracting multiple small image patches, classifying these patches and finally aggregating the classification outputs into a pixel-level classification map, i.e. the segmentation of the whole image. In the following, we describe the semantic segmentation approach in more detail.
2.1 Data representation
The data comprise tuples of medical images and pixel-level ground-truth annotations. Each image is an intensity image, and each annotation is a binary image of the same size that carries the information about the pixel-level presence of the object of interest (in our case, a pixel value of 1 indicates the occurrence of HRF at that pixel). We extract small 2D image patches from each image at randomly sampled positions. Analogously, we extract 2D patches of the same size from the corresponding positions of the annotation image, which results in paired image and label patches. The overall data is divided into disjoint sets, used for training, validation, or testing of the model.
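The paired patch sampling described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: the function name, the nested-list image representation, the toy image and the patch size of 4 x 4 are all assumptions made for the example.

```python
import random


def extract_paired_patches(image, annotation, patch_h, patch_w, n_patches, seed=0):
    """Crop image and annotation at the same randomly sampled positions.

    image and annotation are lists of rows with identical shape.
    All names and sizes are illustrative assumptions.
    """
    rows, cols = len(image), len(image[0])
    if len(annotation) != rows or len(annotation[0]) != cols:
        raise ValueError("image and annotation must have the same size")
    if patch_h > rows or patch_w > cols:
        raise ValueError("patch must fit inside the image")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_patches):
        # Top-left corner chosen so that the patch lies fully inside the image.
        r0 = rng.randint(0, rows - patch_h)
        c0 = rng.randint(0, cols - patch_w)
        img_patch = [row[c0:c0 + patch_w] for row in image[r0:r0 + patch_h]]
        lab_patch = [row[c0:c0 + patch_w] for row in annotation[r0:r0 + patch_h]]
        pairs.append((img_patch, lab_patch))
    return pairs


# Toy 6 x 8 intensity image; the "annotation" marks bright pixels so that
# the alignment of image and label patches can be checked afterwards.
image = [[(r * 8 + c) / 47.0 for c in range(8)] for r in range(6)]
annotation = [[1 if v > 0.5 else 0 for v in row] for row in image]
pairs = extract_paired_patches(image, annotation, patch_h=4, patch_w=4, n_patches=5)
```

Using one random position for both crops is what keeps every label patch aligned with its image patch.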
2.2 Semantic segmentation methodology
We leverage deep learning based semantic segmentation to obtain a mapping from intensity images to corresponding images of dense pixel-level class labels. The underlying feed-forward neural network comprises two main building blocks, which are jointly trained. First, an encoder transforms the input image into a low-dimensional abstract context representation. Second, a decoder maps the low-dimensional embedding, i.e. the output of the encoder, to an image of class label predictions at full input resolution. The most basic processing units of encoder and decoder are convolutional layers. Typically, the encoder produces successively lower-resolution feature maps by using strided convolutions, or convolutions with stride 1 followed by a pooling layer. The decoder produces successively higher-resolution images by using the unpooling operation or fractionally strided convolutions [17, 18]. The network is trained end-to-end, i.e. parameter updates in every update iteration are based on tuples of intensity images and corresponding images of target labels.
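The resolution changes in encoder and decoder follow standard convolution arithmetic. As a minimal sketch, assuming a hypothetical input side length of 256 pixels, kernel size 3, stride 2 and padding 1 (none of these values are taken from this work), the spatial size through four encoder and four decoder layers can be traced as follows:

```python
def conv_out(n, kernel, stride, pad):
    """Output side length of a strided convolution."""
    return (n + 2 * pad - kernel) // stride + 1


def transposed_conv_out(n, kernel, stride, pad, output_pad=0):
    """Output side length of a fractionally strided (transposed) convolution."""
    return (n - 1) * stride - 2 * pad + kernel + output_pad


# Hypothetical configuration, chosen only for illustration.
INPUT_SIDE, KERNEL, STRIDE, PAD, N_LAYERS = 256, 3, 2, 1, 4

encoder_sizes = [INPUT_SIDE]
for _ in range(N_LAYERS):
    encoder_sizes.append(conv_out(encoder_sizes[-1], KERNEL, STRIDE, PAD))

decoder_sizes = [encoder_sizes[-1]]
for _ in range(N_LAYERS):
    # output_pad=1 makes each decoder layer exactly double the resolution.
    decoder_sizes.append(
        transposed_conv_out(decoder_sizes[-1], KERNEL, STRIDE, PAD, output_pad=1)
    )
```

With these assumed settings each encoder layer halves the side length and each decoder layer doubles it, so the prediction map returns to the input resolution.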
Residual U-Net based semantic segmentation
The U-Net architecture is based on a contracting path (encoder) and an expanding path (decoder). The main contribution of Ronneberger et al. [12] in the conception of the U-Net architecture is the concatenation of the feature maps of every layer of the encoder with the feature maps of the corresponding level of the decoder. In this way, higher-resolution context information can be propagated to the last decoder layers, which improves localization and allows precise segmentation. The main building blocks of ResNet are residual units [8]. Residual units not only learn the mapping from inputs to outputs but also learn residual functions between inputs and outputs of individual layers and thus allow even very deep networks to learn the identity mapping. Residual units implement “shortcut connections”, which perform the identity mapping by skipping one or more layers. The outputs of the shortcut connections are added to the outputs of the skipped layers. Since shortcut connections neither add model parameters nor increase the computational complexity, even very deep networks can be trained end-to-end by stochastic gradient descent (SGD).
Both architectures can be combined into a residual U-Net (ResUNet), a deep neural network for semantic segmentation with a U-Net architecture that uses residual units as individual layers.
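The two ingredients, additive shortcut connections and encoder-decoder concatenation, can be illustrated on plain feature vectors. The sketch below is a simplification under stated assumptions: small dense layers stand in for the convolutional layers, and all function names are hypothetical rather than taken from this work or from any library.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Matrix-vector product; a stand-in for a convolutional layer."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_unit(v, w1, w2):
    """y = v + F(v): the shortcut adds the input to the output of the skipped layers."""
    f = linear(relu(linear(v, w1)), w2)
    return [x + fx for x, fx in zip(v, f)]


def unet_skip(encoder_features, decoder_features):
    """U-Net style skip connection: concatenate the channels of one resolution level."""
    return encoder_features + decoder_features


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_unit(x, zeros, zeros)   # F(x) = 0, so the unit passes x through
merged = unet_skip([0.1, 0.2], [0.3])           # channel counts add up: 2 + 1 = 3
```

With all weights at zero the residual function vanishes and the unit reduces to the identity mapping, which is the property that eases the training of very deep networks.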
During training, we learn the mapping from intensity image patches to the corresponding label patches by training a deep neural network. During testing, the model yields images of dense pixel-level predictions for unseen test images.
2.3 Training objectives
We train the networks with two objective functions, cross-entropy loss and Dice loss.
2.3.1 Cross entropy loss
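The cross-entropy loss is a widely used objective function in classification problems. A pixel-level cross-entropy is computed between network predictions $p$ and target labels $t$:

$$L_{CE} = -\sum_{i} \sum_{k=1}^{K} t_{i,k} \log p_{i,k},$$

where $i$ indexes the i-th output element (i.e. the output at a single pixel) and $k$, with $k = 1, \dots, K$, denotes the class; $t_{i,k}$ is 1 if output element $i$ belongs to class $k$ and 0 otherwise. The predictions for this multi-class definition of the cross-entropy loss are computed with a pixel-wise softmax function applied to the network outputs $z$:

$$p_{i,k} = \frac{\exp(z_{i,k})}{\sum_{k'=1}^{K} \exp(z_{i,k'})}.$$

The softmax function maps a vector of arbitrary real-valued network outputs to a vector of values ranging from 0 to 1 that sum to 1. At each pixel location, the cross-entropy loss penalizes the deviation of the network prediction from the ground-truth label. In the binary case, the cross-entropy loss is defined as:

$$L_{BCE} = -\sum_{i} \left[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \right],$$

with binary target labels $t_i \in \{0, 1\}$ and class probabilities $p_i$, which are the sigmoidal outputs of a neural network:

$$p_i = \frac{1}{1 + \exp(-z_i)}.$$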
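For illustration, the softmax and sigmoid variants of the pixel-level cross-entropy can be written out directly. This is a minimal sketch under assumptions: the loss is summed over pixels (a mean would differ only by a constant factor), targets are given as class indices, and the function names are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over the class scores of one pixel."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, targets):
    """Multi-class loss: sum over pixels of -log p[target class]."""
    loss = 0.0
    for z, t in zip(logits, targets):
        loss -= math.log(softmax(z)[t])
    return loss


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(logits, targets):
    """Binary loss on sigmoid outputs; targets are 0 or 1."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss


# A two-class softmax with scores [0, z] gives the same loss as a sigmoid on z.
multi = cross_entropy([[0.0, 2.0]], [1])
binary = binary_cross_entropy([2.0], [1])
```

The final two lines show why the binary case is a special case of the multi-class definition: fixing one class score at zero turns the softmax into a sigmoid.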
2.3.2 Dice loss
The Sørensen–Dice index [19, 20], also commonly referred to as the Dice similarity coefficient (DSC), is a performance measure for binary classifiers. For binary segmentation problems, the DSC quantifies the “similarity” of the predicted and the true segmentation, and is defined by:

$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FN + FP},$$

where $TP$ is the number of true positives, $FN$ is the number of false negatives, and $FP$ is the number of false positives. Possible values of DSC range from 0.0 to 1.0. A perfect classifier or segmentation model achieves a DSC of 1.0. For binary classification or segmentation problems, the DSC can be used not only to evaluate the performance of a trained model on the test set, but also as an objective function during training. When a model is trained to minimize the objective function, a smooth Dice loss can be defined as follows:
between real-valued network predictions $p_i$ and binary target labels $t_i$, where $i$ denotes the i-th pixel.
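One commonly used smooth formulation that matches this description is $L_{Dice} = 1 - \frac{2 \sum_i p_i t_i + \epsilon}{\sum_i p_i + \sum_i t_i + \epsilon}$. Both the "one minus" form and the small constant $\epsilon$ are assumptions made here for illustration and are not necessarily the exact definition used in this work. The sketch below computes the DSC from binary masks and this assumed loss from real-valued predictions; all function names are hypothetical.

```python
def dice_coefficient(pred, target):
    """DSC = 2 TP / (2 TP + FP + FN) for flat binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * tp / denom


def soft_dice_loss(probs, target, eps=1e-6):
    """Assumed smooth Dice loss on real-valued predictions in [0, 1]."""
    intersection = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(target) + eps)


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
dsc = dice_coefficient(pred, target)             # TP = 1, FP = 1, FN = 1
perfect = soft_dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
worst = soft_dice_loss([0.0, 0.0], [1, 0])
```

Because the products and sums operate on real-valued predictions, the loss stays differentiable with respect to the network outputs, which the count-based DSC is not.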
Data, Data Selection and Preprocessing
We trained and evaluated the method on clinical SD-OCT scans of the retina acquired with devices of two different OCT vendors (Cirrus HD-OCT, Carl Zeiss Meditec, and Spectralis OCT, Heidelberg Engineering). Cirrus scans comprise image slices (“B-scans”) with an image resolution of pixels (pixel sizes ), whereas Spectralis scans comprise 49 B-scans with an image resolution of pixels (pixel sizes ) in row and column direction, respectively. Since images of the two vendors have different numbers of image rows (i.e. different pixel sizes in row direction), we rescale the images of both vendors to a uniform image resolution of pixels, which results in a row-direction pixel size of for Cirrus images and of for Spectralis images. This row-dimension choice was a trade-off between keeping annotation information and moving the pixels beyond isotropy. As a second preprocessing step, the gray values were normalized on a per-scan basis to range from 0 to 1.
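The two preprocessing steps can be sketched as follows. The text does not state how the gray values were mapped to the range from 0 to 1, so min-max scaling over the whole scan is an assumption here; the row counts and pixel size in the usage lines are hypothetical numbers, not the device values.

```python
def rescaled_pixel_size(original_rows, original_pixel_size, new_rows):
    """Pixel size in row direction after resampling to a new number of rows."""
    return original_pixel_size * original_rows / new_rows


def normalize_scan(scan):
    """Min-max normalize all B-scans of one OCT scan jointly to [0, 1].

    scan is a list of B-scans, each a list of rows of gray values.
    """
    values = [v for bscan in scan for row in bscan for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant scan: map everything to 0 to avoid division by zero.
        return [[[0.0 for _ in row] for row in bscan] for bscan in scan]
    scale = hi - lo
    return [[[(v - lo) / scale for v in row] for row in bscan] for bscan in scan]


# Hypothetical example: halving the number of rows doubles the row pixel size.
new_size = rescaled_pixel_size(original_rows=1000, original_pixel_size=2.0, new_rows=500)

# Toy scan with two 2 x 2 B-scans.
scan = [[[10, 20], [30, 40]], [[50, 60], [70, 110]]]
normalized = normalize_scan(scan)
```

Normalizing over the whole scan rather than per B-scan preserves relative brightness between the slices of one volume.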
The overall dataset comprises 145 OCT scans from different retinal diseases including age-related macular degeneration (AMD, 60 scans), diabetic macular edema (DME, 43 scans) and retinal vein occlusion (RVO, 42 scans). To keep annotation effort in check, approximately only every image of each OCT scan was annotated by clinical retina experts. For training and evaluation, we took only those images that had at least a single pixel annotated as HRF. The dataset was split into a training set (119 OCT scans, 1051 images), a validation set (6 OCT scans, 41 images), and a test set (20 OCT scans, 137 images). Table 1 lists full details on the data split regarding the number of OCT scans and the number of images (B-scans). The split was performed on a patient basis so that the images of each patient are assigned to only one of these sets. For training we extracted image patches of size pixels at randomly sampled positions.
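A patient-wise split assigns whole patients, rather than individual scans or images, to the sets. The following sketch shows one way to do this; the function name, the scan and patient identifiers, and the random assignment are assumptions for illustration and do not reproduce the actual split of this work.

```python
import random


def split_by_patient(scan_to_patient, n_val_patients, n_test_patients, seed=0):
    """Assign every scan to train, val or test so that no patient spans two sets."""
    patients = sorted(set(scan_to_patient.values()))
    if n_val_patients + n_test_patients > len(patients):
        raise ValueError("not enough patients for the requested split")
    rng = random.Random(seed)
    rng.shuffle(patients)
    val_patients = set(patients[:n_val_patients])
    test_patients = set(patients[n_val_patients:n_val_patients + n_test_patients])
    split = {"train": [], "val": [], "test": []}
    for scan, patient in sorted(scan_to_patient.items()):
        if patient in val_patients:
            split["val"].append(scan)
        elif patient in test_patients:
            split["test"].append(scan)
        else:
            split["train"].append(scan)
    return split


# Hypothetical identifiers; patient p1 contributes two scans.
scan_to_patient = {
    "scan_a": "p1", "scan_b": "p1", "scan_c": "p2", "scan_d": "p3", "scan_e": "p4",
}
split = split_by_patient(scan_to_patient, n_val_patients=1, n_test_patients=1)
```

Splitting at the patient level prevents images of the same eye from appearing in both training and test data, which would otherwise inflate the measured performance.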
In addition to the OCT scans, used for training, validation, and testing, we had images extracted from 3 additional OCT scans (2 Cirrus and 1 Spectralis) annotated independently by two clinical retina experts. Based on these cases, we evaluate the inter-rater variability as baseline for achievable maximal accuracy.
HRF appear in OCT data as bright spots, and their segmentation can be formulated as a pixel-wise binary classification problem. We examine whether state-of-the-art semantic segmentation models (ResUNet) are required or whether even a simple approach suffices. We evaluate the semantic segmentation performance of the following three model architectures:
SemSeg is a simple semantic segmentation model with convolutional layers as main units, where the encoder and decoder comprise four layers with and filters of size pixels, respectively.
ResUnet is a residual U-Net with basically the same number of layers (and feature maps per layer) as the SemSeg model, i.e., the encoder and decoder comprise four layers with and filters of size pixels, respectively. In this model, the standard convolutional layers of the SemSeg model are replaced with 1 residual unit per scale.
ResUnet is similar (same number of layers) to the ResUnet model but is slightly more complex. The encoder and decoder comprise four layers with and filters of size pixels, respectively. In the ResUnet model, we use 3 residual units per layer.
To examine the influence of training objectives on the segmentation performance, each of these models is trained with (1) a cross-entropy loss or (2) a Dice loss. Receiver operating characteristic (ROC) curves can give a misleadingly optimistic picture of model performance if the class distribution of the data is strongly skewed. Precision-recall curves are an alternative to ROC curves on data with high class imbalance [22, 23, 24]. Since our data comprises highly imbalanced classes, i.e. the coverage of HRF even in positive OCT scans is relatively small, we use precision-recall curves as summary statistic to visualize the model performance. We report the average precision (AP) to quantitatively summarize the precision-recall curve. We report area under the ROC curve (AUC) values for the sake of completeness only.
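As an illustration of how a precision-recall curve is summarized by a single number, the sketch below computes AP as the sum of precision values weighted by the increase in recall at each score threshold. This step-wise definition is one common choice and an assumption here; the text does not state which AP definition was used, and the function names are hypothetical.

```python
def precision_recall_points(scores, labels):
    """Precision and recall at every distinct score threshold, highest score first."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = []
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit one point per threshold, i.e. after the last element of a tie group.
        last_of_group = rank == len(order) - 1 or scores[order[rank + 1]] != scores[i]
        if last_of_group:
            points.append((tp / (tp + fp), tp / n_pos))
    return points


def average_precision(scores, labels):
    """AP = sum over thresholds of (recall increase) * precision."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in precision_recall_points(scores, labels):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Four pixels with predicted HRF probabilities and ground-truth labels.
ap_mixed = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
ap_perfect = average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

Precision depends on the number of false positives relative to the detected positives, so it reacts to errors on the rare HRF class that the large number of true negatives would mask in a ROC curve.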
For training the ResUNet models, we utilized the Deep Learning Toolkit for Medical Imaging (DLTK) [25]. All models were trained for 200 epochs, and model parameters were stored at the best-performing epoch on the validation set. After model selection and hyperparameter tuning, the final performance was evaluated on the test set using the learned model parameters. We utilized the stochastic optimizer Adam [26] during training. All experiments were performed using Python 2.7 with the TensorFlow library [27] version 1.2, CUDA 8.0, and a Titan X graphics processing unit.
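The model selection rule, keeping the parameters of the epoch that performs best on the validation set, can be sketched as below. The validation metric is not specified in the text; the sketch assumes a score where higher is better, and the function name and the example scores are hypothetical.

```python
def select_best_epoch(validation_scores):
    """Return (epoch, score) of the best validation score; epochs are 1-indexed."""
    best_epoch, best_score = None, float("-inf")
    for epoch, score in enumerate(validation_scores, start=1):
        # Strict comparison: in case of a tie the earliest epoch is kept.
        if score > best_score:
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Hypothetical validation scores of five epochs.
best = select_best_epoch([0.41, 0.55, 0.62, 0.60, 0.62])
```

In a training loop, the model parameters would be written to disk whenever the comparison succeeds, so that the stored checkpoint always corresponds to the selected epoch.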
Results demonstrate the applicability of all examined semantic segmentation algorithms. Detailed quantitative results are listed in Tables 2 to 4, which show that the ResUNet model jointly trained on Cirrus and Spectralis data with a cross-entropy loss is the best-performing model. This observation holds true for testing data of both vendors, with the best AP on Cirrus data of (DSC of ) and with the best AP on Spectralis data of (DSC of ). The superiority of the ResUNet model is also evident from the precision-recall curves shown in Figure 1. Qualitative segmentation results of this model are shown for Cirrus OCT data in Figure 2(a) and for Spectralis OCT data in Figure 2(b).
Comparison to the inter-rater variability
In addition to the OCT scans used for training, validation, and testing, we had 25 images extracted from 3 additional OCT scans (2 Cirrus and 1 Spectralis) annotated independently by two clinical retina experts. The DSC over the double annotated images is , which is a measure for the inter-rater variability. The average accuracy (DSC) of the best-performing model (ResUNet) over all test images of Cirrus scans and Spectralis scans is and thus lies in the order of magnitude of the inter-rater variability. Even though we only had a limited number of double annotated images, the results suggest that the presented approach is highly accurate.
We applied semantic segmentation for fully automated segmentation of HRF in retinal OCT scans. This is the first time that a fully automated method for the detection and segmentation of HRF is reported. The results of all experiments suggest that a cross-entropy training loss should be given preference over a Dice-based training objective. Results demonstrate the general applicability of all examined semantic segmentation algorithms. However, ResU-Nets detect HRF with slightly higher accuracy and handle the visual variability of input images best, i.e. joint training on OCT scans of different vendors. Since we trained one model on data acquired with different devices (i.e., vendors) from patients with different retinal diseases including AMD, DME and RVO, as opposed to training specific models on individual diseases, the algorithm can safely be applied for screening in all of them even if the pathophysiological origins are different.
This work has received funding from IBM, FWF (I2714-B31), OeNB (15356, 15929), the Austrian Federal Ministry of Science, Research and Economy (CDL OPTIMA). We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU used for this research.
-  Christenbury, J.G., Folgar, F.A., O’Connell, R.V., Chiu, S.J., Farsiu, S., Toth, C.A.: Progression of intermediate age-related macular degeneration with proliferation and inner retinal migration of hyperreflective foci. Ophthalmology 120(5) (2013) 1038–1045
-  De Benedetto, U., Sacconi, R., Pierro, L., Lattanzio, R., Bandello, F.: Optical coherence tomographic hyperreflective foci in early stages of diabetic retinopathy. Retina 35(3) (2015) 449–453
-  Chatziralli, I.P., Sergentanis, T.N., Sivaprasad, S.: Hyperreflective foci as an independent visual outcome predictor in macular edema due to retinal vascular diseases treated with intravitreal dexamethasone or ranibizumab. Retina 36(12) (2016) 2319–2328
-  Curcio, C.A., Zanzottera, E.C., Ach, T., Balaratnasingam, C., Freund, K.B.: Activated retinal pigment epithelium, an optical coherence tomography biomarker for progression in age-related macular degeneration. Investigative ophthalmology & visual science 58(6) (2017) BIO211–BIO226
-  Bolz, M., Schmidt-Erfurth, U., Deak, G., Mylonas, G., Kriechbaum, K., Scholda, C.: Optical coherence tomographic hyperreflective foci: a morphologic sign of lipid extravasation in diabetic macular edema. Ophthalmology 116(5) (2009) 914–920
-  Sleiman, K., Veerappan, M., Winter, K.P., McCall, M.N., Yiu, G., Farsiu, S., Chew, E.Y., Clemons, T., Toth, C.A., Wong, W., et al.: Optical coherence tomography predictors of risk for progression to non-neovascular atrophic age-related macular degeneration. Ophthalmology 124(12) (2017) 1764–1777
-  Segal, O., Barayev, E., Nemet, A.Y., Geffen, N., Vainer, I., Mimouni, M.: Prognostic value of hyperreflective foci in neovascular age-related macular degeneration treated with bevacizumab. Retina 36(11) (2016) 2175–2182
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 3431–3440
-  Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: International Conference on Learning Representations. (2015)
-  Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1529–1537
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, Springer (2015) 234–241
-  Anas, E.M.A., Nouranian, S., Mahdavi, S.S., Spadinger, I., Morris, W.J., Salcudean, S.E., Mousavi, P., Abolmaesumi, P.: Clinical target-volume delineation in prostate brachytherapy using residual neural networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2017) 365–373
-  Korot, E., Comer, G., Steffens, T., Antonetti, D.A.: Algorithm for the measure of vitreous hyperreflective foci in optical coherence tomographic scans of patients with diabetic macular edema. JAMA ophthalmology 134(1) (2016) 15–20
-  Mostajabi, M., Yadollahpour, P., Shakhnarovich, G.: Feedforward semantic segmentation with zoom-out features. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 3376–3385
-  Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1520–1528
-  Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434 (2015)
-  Dumoulin, V., Visin, F.: A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285 (2016)
-  Sørensen, T.: A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on danish commons. Biol. Skr. 5(4) (1948) 1–34
-  Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3) (1945) 297–302
-  Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves. In: Proceedings of the 23rd international conference on Machine learning, ACM (2006) 233–240
-  Goadrich, M., Oliphant, L., Shavlik, J.: Learning ensembles of first-order clauses for recall-precision curves: A case study in biomedical information extraction. In: International Conference on Inductive Logic Programming, Springer (2004) 98–115
-  Bunescu, R., Ge, R., Kate, R.J., Marcotte, E.M., Mooney, R.J., Ramani, A.K., Wong, Y.W.: Comparative experiments on learning information extractors for proteins and their interactions. Artificial intelligence in medicine 33(2) (2005) 139–155
-  Craven, M., Bockhorst, J.: Markov networks for detecting overlapping elements in sequence data. In: Advances in Neural Information Processing Systems. (2005) 193–200
-  Pawlowski, N., Ktena, S.I., Lee, M.C., Kainz, B., Rueckert, D., Glocker, B., Rajchl, M.: Dltk: State of the art reference implementations for deep learning on medical images. arXiv preprint arXiv:1711.06853 (2017)
-  Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
-  Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015) Software available from tensorflow.org.