Glaucoma is the leading cause of irreversible blindness worldwide [jonas2018]. It is characterised by progressive structural and functional damage to the optic nerve head (ONH). Recent studies estimate that roughly 50% of people suffering from glaucoma worldwide are undiagnosed, and ageing populations suggest that the impact of glaucoma will continue to rise, affecting 111.8 million people by 2040 [reference2_intro]. Early diagnosis and treatment of this chronic disease are therefore essential to prevent irreversible vision loss.
Currently, a complete glaucoma examination usually includes the medical history, fundus photography, visual field (VF) analysis, tonometry and optic nerve imaging tests such as optical coherence tomography (OCT). Most state-of-the-art studies have addressed glaucoma detection via fundus image analysis, making use of visual field tests and relevant parameters such as the intraocular pressure (IOP) [kim2017, wang2019]. Specifically, J. Gómez-Valverde et al. [gomez2019automatic] compared convolutional neural networks (CNNs) trained from scratch with fine-tuned ones. Also, the authors in [shibata2018development, christopher2018performance] considered transfer learning and fine-tuning methods applied to very popular state-of-the-art network architectures to identify glaucoma in fundus images. Other studies, such as [muhammad2017hybrid, thakoor2019],
combined OCT B-scans and fundus images to obtain an RNFL thickness probability map, which was used as input to the CNNs. In this paper, in contrast to the studies in the literature, we propose an end-to-end system for glaucoma detection based only on raw circumpapillary OCT images, without using other kinds of images or expensive external tests related to the VF and IOP parameters. It is important to highlight that circumpapillary OCT images, as shown in Fig. 1, correspond to circular scans located around the ONH, where rich information about the different retinal layer structures can be found. Additionally, several studies have claimed that the circumpapillary retinal nerve fiber layer (RNFL) is essential to detect early glaucomatous damage [reference_objetive1, reference_objetive2, reference_objetive3]. For that reason, one of the main novelties of this paper is to demonstrate that a single circumpapillary OCT image may suffice for accurate glaucoma detection.
We propose two different data-driven learning strategies to develop computer-aided diagnosis systems capable of discerning between glaucomatous and healthy eyes just from B-scans around the ONH. Several CNNs trained from scratch and different fine-tuned state-of-the-art architectures were considered. Furthermore, we propose, for the first time on this kind of images, the computation of class activation maps in order to compare the location information reported by the clinicians with the heat maps generated by the developed models. Heat maps highlight the regions to which the networks pay attention when determining the class of each specific sample.
The experiments detailed in this paper were performed on a private database composed of 249 OCT images of 496 x 768 pixels. In particular, 156 normal and 93 glaucomatous circumpapillary samples from 89 and 59 patients, respectively, were analysed. Each B-scan was diagnosed by expert ophthalmologists from the Oftalvist Ophthalmic Clinic. Note that the Heidelberg Spectralis OCT system was employed to acquire the circumpapillary OCT images, with an axial resolution of 4-5 μm.
3.1 Data Partitioning
A data partitioning stage was carried out to divide the database into training and test sets. Specifically, the training set comprised 73 glaucomatous and 124 normal samples, from 12 and 18 patients respectively, whereas the test set was defined by the remaining data (20 glaucomatous and 32 normal B-scans from 12 and 18 patients). In addition, within the training set, we also performed an internal cross-validation (ICV) stage to control overfitting, as well as to select the best neural network hyper-parameters. Finally, the independent test set was used to evaluate the definitive predictive models, which were created using the entire training set.
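A patient-wise split of this kind (implied by the per-subset patient counts) can be sketched with scikit-learn's `GroupShuffleSplit`; the patient IDs and labels below are synthetic placeholders, not the study data:

```python
# Sketch of a patient-level train/test split: grouping by patient ID
# prevents B-scans from the same patient leaking across subsets.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_scans = 249
patient_ids = rng.integers(0, 148, size=n_scans)   # 148 patients in the study
labels = rng.integers(0, 2, size=n_scans)          # 0 = normal, 1 = glaucoma

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(
    splitter.split(np.zeros((n_scans, 1)), labels, groups=patient_ids)
)

# No patient appears in both subsets
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```

The ICV folds used for hyper-parameter selection would be generated the same way, grouping by patient within the training indices.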
3.2 Learning from scratch
Similarly to the methodology exposed in [gomez2019automatic], we propose the use of shallow CNNs trained from scratch to address glaucoma detection, taking into account the significant differences between our grey-scale circumpapillary OCT images and the large databases of natural images widely used for transfer learning.
During the internal cross-validation (ICV) stage, an empirical exploration was carried out to determine the best hyper-parameter combination in terms of minimisation of the binary cross-entropy loss function. Different network architectures composed of diverse learning blocks were developed. In particular, convolutional, pooling, batch normalisation and dropout layers were considered for the feature extraction stage. The variable components of each layer, such as the convolutional filters, pooling size and dropout coefficients, as well as the number of convolutional layers in each block, were optimised during the experimental phase. Regarding the top model, the use of flatten, dropout and fully-connected layers with different numbers of neurons was studied. Also, global max and global average pooling layers were analysed in order to reduce the number of trainable parameters. Moreover, we applied a weighting factor of [1.35, 0.79] during the training of the models to alleviate the class imbalance.
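The reported weighting factor of [1.35, 0.79] is consistent with the standard "balanced" class-weight formula, w_c = n_samples / (n_classes · n_c), applied to the 73/124 training counts; a quick check:

```python
# Verify that [1.35, 0.79] matches balanced class weights for the
# 73 glaucomatous / 124 normal training samples.
n_glaucoma, n_normal = 73, 124
n_total, n_classes = n_glaucoma + n_normal, 2

w_glaucoma = n_total / (n_classes * n_glaucoma)
w_normal = n_total / (n_classes * n_normal)
print(round(w_glaucoma, 2), round(w_normal, 2))  # 1.35 0.79
```

The minority (glaucoma) class thus contributes more to the loss, counteracting the imbalance.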
After the ICV stage, the best CNN architecture was found to use four convolutional blocks, as detailed in Table 1. Remarkably, a global max-pooling (GMP) layer is applied in the last block, which extracts the maximum activation of each convolutional filter before the classification layer. Also, note that batch normalisation and dropout layers were not used because no improvement was reported during the validation phase. Only a dense layer with a softmax activation and 2 neurons, corresponding to the glaucoma and healthy classes, was defined.
| Layer name  | Output shape    | Filter size  |
|-------------|-----------------|--------------|
| Input layer | 496 x 768 x 1   | N/A          |
| Conv1_1     | 496 x 768 x 32  | 3 x 3 x 32   |
| MaxPooling  | 248 x 384 x 32  | 2 x 2 x 32   |
| Conv2_1     | 248 x 384 x 64  | 3 x 3 x 64   |
| MaxPooling  | 124 x 192 x 64  | 2 x 2 x 64   |
| Conv3_1     | 124 x 192 x 128 | 3 x 3 x 128  |
| MaxPooling  | 62 x 96 x 128   | 2 x 2 x 128  |
| Conv4_1     | 62 x 96 x 256   | 3 x 3 x 256  |
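Assuming "same" padding, ReLU activations and 2 x 2 pooling (details the table does not state explicitly), the Table 1 architecture can be sketched in Keras as:

```python
# Hedged Keras reconstruction of the Table 1 from-scratch CNN:
# four conv blocks, GMP instead of flattening, 2-neuron softmax head.
from tensorflow import keras
from tensorflow.keras import layers

def build_scratch_cnn(input_shape=(496, 768, 1), n_classes=2):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for n_filters in (32, 64, 128):                 # blocks 1-3: conv + pool
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)  # Conv4_1
    x = layers.GlobalMaxPooling2D()(x)              # GMP before the classifier
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_scratch_cnn()
```

The GMP layer reduces the 62 x 96 x 256 activation volume to a 256-dimensional vector, keeping the trainable-parameter count low before the final dense layer.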
The optimal hyper-parameter combination was achieved by training the CNNs for 150 epochs, using the Adadelta optimizer with a learning rate of 0.05 and a batch size of 16. It should be noticed that we also propose the use of data augmentation (DA) techniques [wong2016] to elucidate how important the creation of artificial samples is when addressing small databases. Specifically, a factor of 0.2 was applied here to perform random geometric and dense elastic transformations of the original images.
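A minimal sketch of the geometric part of this augmentation with Keras' `ImageDataGenerator`; the mapping of the 0.2 factor to specific transform ranges is an assumption, and the dense elastic transformations would require an external library (e.g. albumentations):

```python
# Sketch of on-the-fly geometric augmentation for grey-scale B-scans.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    width_shift_range=0.2,    # assumed interpretation of the 0.2 factor
    height_shift_range=0.2,
    zoom_range=0.2,
    fill_mode="nearest",
)

# One augmented batch of four synthetic 496 x 768 x 1 images
batch = next(datagen.flow(np.random.rand(4, 496, 768, 1), batch_size=4))
```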
3.3 Learning by fine tuning
Deeper network architectures could improve the models' performance, but a large number of expert-annotated images would be necessary to train a deep CNN from scratch. For this reason, we propose in this section the use of fine-tuning techniques [hoo2016], which allow training deeper CNNs starting from weights pre-trained on large databases. In particular, we applied a deep fine-tuning strategy [tajbakhsh2016] to transfer the knowledge acquired by several state-of-the-art networks, such as VGG16, VGG19, InceptionV3, Xception and ResNet, when trained on the large ImageNet data set. Given the small database used in this work, only the coefficients of the last convolutional blocks (4 and 5) were retrained with the specific knowledge corresponding to the circumpapillary OCT images. The rest of the coefficients were frozen at the values pre-trained on the 14 million natural images contained in the ImageNet database.
Additionally, similarly to the proposed learning-from-scratch strategy, an empirical exploration of different hyper-parameters and top-model architectures was considered for all networks. It is important to notice that the InceptionV3, Xception and ResNet architectures reported a poor performance due to their extensive depth (42, 36 and 53 convolutional layers, respectively). However, the family of VGG architectures achieved the best performance, in line with the findings in the literature [gomez2019automatic]. Specifically, the VGG16 base model is composed of five convolutional blocks according to Fig. 2, where blue boxes correspond to convolutional layers activated with ReLU functions and red-grey boxes represent max-pooling layers. The VGG19 base model follows the same architecture, but includes an extra convolutional layer in each of the last three blocks.
A top model composed of a global max-pooling layer and a dropout layer with a coefficient of 0.4, followed by a softmax layer with two neurons, provided the best model performance when the VGG architectures were fine-tuned (see Fig. 2). Regarding the hyper-parameter selection, the Adadelta optimizer with a learning rate of 0.001 reported the best learning curves when the model was forward- and backward-propagated during 125 epochs with a batch size of 16, minimising the binary cross-entropy loss function.
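Putting the pieces together, the deep fine-tuning setup can be sketched as follows; the input size is an assumption (the down-sampled resolution is not stated), and `weights=None` keeps the sketch offline, whereas in practice `weights="imagenet"` would be passed:

```python
# Sketch of deep fine-tuning on VGG16: freeze blocks 1-3, retrain
# blocks 4-5, attach the GMP + dropout(0.4) + softmax top model.
from tensorflow import keras
from tensorflow.keras import layers

def build_finetuned_vgg16(input_shape=(224, 224, 3), weights=None):
    base = keras.applications.VGG16(include_top=False, weights=weights,
                                    input_shape=input_shape)
    for layer in base.layers:
        # Only blocks 4 and 5 are retrained on the OCT images
        layer.trainable = layer.name.startswith(("block4", "block5"))

    x = layers.GlobalMaxPooling2D()(base.output)
    x = layers.Dropout(0.4)(x)
    outputs = layers.Dense(2, activation="softmax")(x)

    model = keras.Model(base.input, outputs)
    model.compile(optimizer=keras.optimizers.Adadelta(learning_rate=0.001),
                  loss="categorical_crossentropy")  # two-class cross-entropy
    return model

model = build_finetuned_vgg16()
```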
Note that an initial down-sampling of the original images was necessary to alleviate GPU memory problems during the training phase. Besides, replicating the grey-scale channel was necessary to adapt the input images to the three-channel format expected by the fine-tuned CNNs. Data augmentation (DA) techniques with a factor of 0.2 were also considered.
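The channel replication amounts to a single numpy call, repeating the grey channel three times so a B-scan matches the RGB input of the ImageNet-pretrained networks:

```python
# Replicate the single grey channel to obtain a three-channel image.
import numpy as np

grey = np.random.rand(496, 768, 1)      # one synthetic circumpapillary B-scan
rgb = np.repeat(grey, 3, axis=-1)       # shape becomes (496, 768, 3)

assert np.array_equal(rgb[..., 0], rgb[..., 2])  # identical channels
```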
4 Results and discussion
4.1 Validation results
In this section, we present the results achieved during the ICV stage for each of the proposed CNNs. Table 2 compares the CNNs trained from scratch, in terms of mean ± standard deviation. Several figures of merit are calculated to evidence the differences between using and not using data augmentation (DA) techniques. In particular, sensitivity (SN), specificity (SPC), positive predictive value (PPV), negative predictive value (NPV), F-score (FS), accuracy (ACC) and area under the ROC curve (AUC) are employed.
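For reference, these figures of merit follow directly from the confusion-matrix counts, taking glaucoma as the positive class; a minimal sketch (the counts below are hypothetical, and AUC is omitted since it requires the continuous prediction scores):

```python
# Figures of merit from confusion-matrix counts (glaucoma = positive class).
def figures_of_merit(tp, fp, tn, fn):
    sn = tp / (tp + fn)                      # sensitivity (recall)
    spc = tn / (tn + fp)                     # specificity
    ppv = tp / (tp + fp)                     # positive predictive value
    npv = tn / (tn + fn)                     # negative predictive value
    fs = 2 * ppv * sn / (ppv + sn)           # F-score
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    return dict(SN=sn, SPC=spc, PPV=ppv, NPV=npv, FS=fs, ACC=acc)

m = figures_of_merit(tp=18, fp=3, tn=29, fn=2)   # hypothetical test counts
```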
Table 2: validation results of the CNNs trained from scratch, without DA and with DA.
Significant differences between the CNNs trained with and without data augmentation can be appreciated in Table 2, especially in the sensitivity and specificity metrics. It is worth noting that the learning curves of the CNN trained without DA reported slight overfitting during the validation phase, which is evidenced by the model's high sensitivity standard deviation.
Additionally, we also detail in Table 3 the validation results achieved by the fine-tuned VGG networks, since they considerably outperformed the rest of the state-of-the-art architectures during the ICV stage. Specifically, VGG16 reaches better results for all figures of merit, although both architectures report similar behaviour. VGG16 also provides the best model performance in comparison with the CNNs trained from scratch.
4.2 Test results
In order to provide reliable results, an independent test set was used to carry out the prediction stage. Table 4 compares all the proposed deep-learning models, evaluating their predictive ability by means of different figures of merit. Additionally, Fig. 3 shows the ROC curve of each proposed CNN to visualise the differences.
Table 4: test results for FS without DA, FS with DA, VGG16 and VGG19.
The test results exposed in Table 4 are in line with those achieved during the validation phase. However, due to the randomness of the data partitioning (which is accentuated in small databases), significant differences may exist in the prediction of each subset. This fact mainly affects the CNNs trained from scratch, because all the weights of these networks were trained on the images of a specific subset, whereas the proposed fine-tuned architectures keep most of their weights frozen. Regarding the ROC curves, Fig. 3 shows that the fine-tuned CNNs report a significant improvement with respect to the networks trained from scratch.
It is important to remark that an objective comparison with other state-of-the-art studies is difficult because there are no public databases of circumpapillary OCT images. Additionally, each group of researchers addresses glaucoma detection using a different kind of images. Notwithstanding, we detail a qualitative comparison with other works based on similar methodologies applied to fundus images. In particular, [gomez2019automatic] fine-tuned the VGG19 architecture and achieved an AUC of 0.94 when predicting glaucoma. Also, [christopher2018performance] reached an AUC of 0.91 applying transfer-learning techniques to the ResNet architecture. In contrast, the authors in [chen2015glaucoma] proposed a CNN trained from scratch, obtaining AUC values of 0.83 and 0.89 on two independent databases. Based on this, the proposed learning methodology exceeds the state-of-the-art results, achieving an AUC of 0.96 in the prediction of the test set.
4.3 Class Activation Maps (CAMs)
We compute the class activation maps to generate heat maps highlighting the regions to which the proposed model pays attention when determining the class of each specific circumpapillary OCT image. In Fig. 4, we expose the CAMs of randomly selected glaucomatous and normal samples in order to elucidate what VGG19 takes into account to discern between classes.
The findings from the CAMs are directly in line with those reported by expert clinicians, who claim that a thick RNFL is intimately linked with healthy eyes, whereas a thinning of the RNFL evidences a glaucomatous case. That is just what the heat maps in Fig. 4 reveal. Therefore, the results suggest that the proposed circumpapillary OCT-based methodology can provide great added value for glaucoma diagnosis, taking into account that the model reports information similar to that of the specialists without including any prior clinical knowledge.
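For a classifier built from a global pooling layer followed by a dense softmax, like the models used here, the CAM of a class is simply the dense-layer weighted sum of the last convolutional feature maps; a sketch with synthetic activations:

```python
# CAM for a global-pooling + dense head: weight each last-conv feature
# map by the dense weight connecting it to the target class, then sum.
import numpy as np

def class_activation_map(feature_maps, dense_weights, class_idx):
    """feature_maps: (H, W, K) last-conv activations for one image;
    dense_weights: (K, n_classes) weights of the final dense layer."""
    cam = feature_maps @ dense_weights[:, class_idx]     # (H, W) heat map
    cam = np.maximum(cam, 0)                             # keep positive evidence
    return cam / cam.max() if cam.max() > 0 else cam     # normalise to [0, 1]

rng = np.random.default_rng(0)
cam = class_activation_map(rng.random((62, 96, 256)),    # synthetic activations
                           rng.random((256, 2)),         # synthetic dense weights
                           class_idx=1)                  # glaucoma class
```

In practice the resulting map is upsampled to the B-scan resolution and overlaid on the image, which is how the RNFL-centred heat maps of Fig. 4 are produced.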
5 Conclusion
In this paper, two different deep-learning methodologies have been evaluated to elucidate the added value enclosed in circumpapillary OCT images for glaucoma detection. The reported results suggest the fine-tuned VGG family as the most promising networks. The extracted CAMs evidence the great potential of the proposed model, since it is able to highlight areas such as the RNFL, in line with the clinical interpretation. In future research, external validation of the proposed strategy on larger databases will be considered.