I. Introduction
Glaucoma is a serious vision-threatening ocular disorder in which the optic nerve head (ONH) of the retina gradually degenerates. Screening is performed using 2D fundus imaging to assess the optic disc (OD) and cup. The vertical cup-to-disc ratio (CDR), a quantitative measure of the enlargement of the cup with respect to the disc, is an important indicator of the disease; computing it requires accurate delineation of the OD and cup, which is typically done by a skilled grader.
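To make the CDR computation concrete, the ratio of vertical extents can be read directly off binary disc and cup masks. The sketch below is illustrative only; the function and the mask representation are our assumptions, not part of any grading protocol described here.

```python
import numpy as np

def vertical_cdr(disc_mask, cup_mask):
    """Vertical cup-to-disc ratio from binary segmentation masks.

    disc_mask, cup_mask: 2D boolean arrays of the same shape, where
    True marks pixels belonging to the disc or cup respectively.
    """
    # Rows that contain at least one disc (resp. cup) pixel.
    disc_rows = np.where(disc_mask.any(axis=1))[0]
    cup_rows = np.where(cup_mask.any(axis=1))[0]
    # Vertical extent = number of rows spanned by the region.
    disc_height = disc_rows.max() - disc_rows.min() + 1
    cup_height = cup_rows.max() - cup_rows.min() + 1
    return cup_height / disc_height
```

A CDR close to 1 indicates severe cupping, since the cup then nearly fills the disc vertically.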
There have been many works on optic disc and cup segmentation; [1] provides a survey of different techniques. Many methods based on morphological operations [2], deformable energy-based models [3], [6], and graph cuts [8] have been proposed. More recently, with the advent of deep learning, U-Net [4]-like fully convolutional architectures have been used for many kinds of semantic segmentation tasks. For glaucoma, [10] proposed the use of CNNs whose filters are learned in a greedy fashion; the image is passed through the CNN to obtain pixelwise predictions for disc and cup segmentation. We recently proposed an end-to-end fully convolutional network for joint optic disc and cup segmentation [5]; that work also explored adversarial training for the segmentation task.

Depth is also an important cue for the assessment of glaucoma, but explicitly measuring depth requires complicated imaging techniques such as stereoscopic imaging or optical coherence tomography (OCT). Using these modalities at large scale is infeasible due to their cost, difficulty of operation, and lack of portability, which motivates depth estimation from a single image. Single-image depth estimation is a highly challenging task that has been explored using deep learning [11], [12]; the authors of [13] perform not only depth estimation but also surface normal estimation and semantic segmentation with a common architecture. In retinal imaging, there have been a few works on depth estimation: a method for estimating depth from stereo is proposed in [14], single-image depth estimation using a coupled sparse dictionary based supervision method is proposed in [15], and a fast-marching based depth estimation is proposed in [16]. Most of these single-image retinal depth estimation methods rely predominantly on image intensities and hence are not very robust, motivating the need for a more robust method.
The main contributions of this work are as follows:

We propose a new scheme for depth estimation of monocular fundus images using a fully convolutional network.

We also explore the effectiveness of depth estimation as a proxy task for joint optic disc and cup segmentation. Since the cupping phenomenon occurs in the optic nerve head (ONH), leading to variations in depth in the ONH, the task of depth estimation could naturally serve as a pretraining method for segmentation.
To the best of our knowledge, this is the first work addressing monocular retinal depth estimation using deep learning.
II. Methods
Our work consists of two main parts: depth estimation and joint optic disc-cup segmentation. For both tasks, we employ the fully convolutional network architecture we proposed in [5]. The network in [5] is a generative adversarial network (GAN) based architecture consisting of a generator and a discriminator. The generator is a U-Net [4]-like encoder-decoder architecture with residual connections, and the discriminator is a standard CNN architecture employed for classification. Readers are referred to [5] for more details about the architecture. In our setup, the generator is tasked with learning the required mappings for both depth estimation and joint optic disc-cup segmentation.

Single-image depth estimation is a challenging task, and more so in our case because of the unavailability of a large dataset. Hence, we first collect retinal images from various datasets and crop the region of interest, which in this case is the region around the optic disc. With this dataset, and using the same architecture as the generator, we train a denoising autoencoder in order to learn retinal representations. Thus, we pretrain the generator part of the network as a denoising autoencoder. Once pretrained, we proceed with depth estimation and disc-cup segmentation (see Fig. 2) as described in the subsequent subsections.
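The corruption step of the denoising pretraining can be sketched as follows; the noise model (additive Gaussian) and its level are illustrative assumptions, since the exact corruption settings are not restated here. The autoencoder is trained to map the corrupted crop back to the clean one.

```python
import numpy as np

def corrupt(image, sigma=0.1, rng=None):
    """Corrupt a clean retinal crop for denoising-autoencoder pretraining.

    image: float array with intensities in [0, 1].
    sigma: standard deviation of the additive Gaussian noise (assumed value).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    # Keep the corrupted image in the valid intensity range.
    return np.clip(noisy, 0.0, 1.0)
```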
II-A. Depth Estimation
For the first task, our goal is to train a fully convolutional network to predict depth from a single RGB fundus image. For this, we employ the pretrained generator network to learn a mapping between the fundus image and the corresponding depth map, treating depth estimation as a regression problem. Given an RGB fundus image $I$ and the corresponding depth map $d$, our network learns the mapping $G_{depth}: I \rightarrow d$. As with any regression problem, we employ the standard $L_2$ loss:

$$L_{L2}(G_{depth}) = \mathbb{E}_{I,d \sim p_{data}(I,d)}\left[\| d - G_{depth}(I) \|_2\right]$$

Additionally, we augment the $L_2$ loss with an adversarial loss so as to improve the depth estimation:

$$G_{depth}^{*} = \arg \min_{G_{depth}} \max_{D_{depth}} \left( L_{GAN}(G_{depth}, D_{depth}) + \lambda \, L_{L2}(G_{depth}) \right)$$

where $D_{depth}$ is the discriminator network for depth and $L_{GAN}$ is the adversarial loss given by:

$$L_{GAN}(G_{depth}, D_{depth}) = \mathbb{E}_{d \sim p_{data}(d)}\left[\log D_{depth}(d)\right] + \mathbb{E}_{I \sim p_{data}(I)}\left[\log\left(1 - D_{depth}(G_{depth}(I))\right)\right]$$
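For concreteness, the value of the generator-side objective can be computed as follows, given ground truth depths, predictions, and discriminator scores for the predictions. This is a minimal numpy sketch of the loss terms only (not a training loop), and the weight `lam` is an illustrative assumption.

```python
import numpy as np

def l2_loss(d_true, d_pred):
    """The ||d - G(I)||_2 term, averaged over a batch of depth maps."""
    return np.mean(np.sqrt(np.sum((d_true - d_pred) ** 2, axis=(1, 2))))

def generator_objective(d_true, d_pred, disc_score_fake, lam=10.0):
    """Value of L_GAN + lambda * L_L2 as seen by the generator.

    disc_score_fake: D_depth(G_depth(I)) in (0, 1) for each batch item.
    """
    # Generator term of the minimax adversarial loss
    # (epsilon added for numerical stability near D = 1).
    adv = np.mean(np.log(1.0 - disc_score_fake + 1e-12))
    return adv + lam * l2_loss(d_true, d_pred)
```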
II-B. Joint Optic Disc and Cup Segmentation
The phenomenon of cupping leads to an increase in the relative depth between the cup and the disc. This is our main motivation for exploring depth estimation as a proxy task for joint optic disc-cup segmentation. Our goal in this task is to train a fully convolutional network for segmentation. We use a GAN based framework where the generator is tasked with learning a mapping between an RGB image $x$ and the corresponding segmentation map $y$. The generator must produce outputs that fool an adversarially trained discriminator, which is trained to discriminate between generated and real segmentation maps. The final objective function for segmentation is given by

$$G_{segment}^{*} = \arg \min_{G_{segment}} \max_{D_{segment}} \left( L_{GAN}(G_{segment}, D_{segment}) + \lambda \, L_{L1}(G_{segment}) \right)$$

where $L_{GAN}$ and $L_{L1}$ are the adversarial and $L_1$ loss functions given by

$$L_{GAN}(G_{segment}, D_{segment}) = \mathbb{E}_{y \sim p_{data}(y)}\left[\log D_{segment}(y)\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D_{segment}(G_{segment}(x))\right)\right]$$

$$L_{L1}(G_{segment}) = \mathbb{E}_{x,y \sim p_{data}(x,y)}\left[\| y - G_{segment}(x) \|_1\right]$$

In the above equations, $p_{data}$ denotes the underlying distribution of real data and $\mathbb{E}$ denotes the expectation over samples drawn from it. Since we propose depth estimation as a proxy task, we initialize the weights of the generator with the weights of the fully trained depth estimation network.
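The weight transfer from the depth network can be sketched framework-agnostically by treating each generator as a name-to-array parameter dictionary; only parameters with matching names and shapes are copied. The dictionary representation is our assumption for illustration.

```python
import numpy as np

def init_from_depth(segment_params, depth_params):
    """Initialize the segmentation generator from the depth generator.

    Both arguments are dicts mapping parameter names to arrays; entries
    whose names and shapes match are copied over in place.
    Returns the list of copied parameter names.
    """
    copied = []
    for name, w in depth_params.items():
        if name in segment_params and segment_params[name].shape == w.shape:
            segment_params[name] = w.copy()
            copied.append(name)
    return copied
```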
III. Experiments and Results
Since the first task in our work is pretraining with a denoising autoencoder, we first collect retinal images from various sources such as RIM-ONE, MESSIDOR, DRIVE, and STARE. We then crop the OD region, add noise to the images, and train a denoising autoencoder with different generator architectures, namely Unet [4] and ResUnet [5].
We then use the pretrained deep networks for estimating depth, using the INSPIRE-stereo dataset [14], which consists of color fundus images along with ground truth depth maps obtained from OCT.
We use five-fold cross-validation. We perform very heavy data augmentation on the training set with various levels of zoom and gamma jitter along with standard techniques such as flips and rotations, substantially enlarging the training data to aid in training a deep network, and then train the network for a fixed number of epochs. The results of depth estimation are shown in Fig. 3. The metrics used for quantitative evaluation of the depth maps are the root mean squared error (RMSE), given by

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat{d}_i - d_i\right)^2}$$

and the correlation coefficient, given by

$$\rho = \frac{\sum_i \left(\hat{d}_i - \bar{\hat{d}}\right)\left(d_i - \bar{d}\right)}{\sqrt{\sum_i \left(\hat{d}_i - \bar{\hat{d}}\right)^2 \sum_i \left(d_i - \bar{d}\right)^2}}$$

where $\hat{d}$ and $d$ are the estimated and ground truth depth maps respectively, and $i$ is the pixel index. We report the values obtained for four experiments:
Unet with only the L2 loss (Unet)

Residual Unet with only the L2 loss (ResUnet)

Unet with additional adversarial loss (UGAN)

Residual Unet with additional adversarial loss (ResUGAN)
The values obtained are listed in Table 1. It can be seen from the table that the deep learning based methods yield superior results compared to the other methods for monocular depth estimation.
Table 1: Depth estimation results (correlation coefficient and RMSE, mean and standard deviation over folds).

Method            | CC Mean | CC StdDev | RMSE Mean | RMSE StdDev
[14]              | -       | -         | 0.1592    | 0.0879
[15]              | 0.8000  | 0.1200    | -         | -
[16]              | 0.8225  | -         | 0.1532    | 0.1206
Proposed Unet     | 0.8322  | 0.1077    | 0.0190    | 0.0094
Proposed ResUnet  | 0.8984  | 0.0698    | 0.0124    | 0.0079
Proposed UGAN     | 0.9268  | 0.0377    | 0.0105    | 0.0080
Proposed ResUGAN  | 0.9269  | 0.0434    | 0.0099    | 0.0054
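The two depth evaluation metrics reported in Table 1 are straightforward to implement; below is a numpy sketch using the standard Pearson form of the correlation coefficient.

```python
import numpy as np

def rmse(d_pred, d_true):
    """Root mean squared error between two depth maps."""
    return np.sqrt(np.mean((d_pred - d_true) ** 2))

def correlation_coefficient(d_pred, d_true):
    """Pearson correlation between predicted and ground truth depths."""
    a = d_pred.ravel() - d_pred.mean()
    b = d_true.ravel() - d_true.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
```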
For the task of joint optic disc and cup segmentation, we use the RIM-ONE dataset, which contains images with labels for the optic disc and cup. Instead of training from scratch, we use the respective depth-pretrained networks for weight initialization. Accordingly, we again have four experiments for segmentation: depth-pretrained Unet and ResUnet without adversarial loss (DP Unet and DP ResUnet respectively), and depth-pretrained Unet and ResUnet with adversarial loss (DP UGAN and DP ResUGAN respectively).
The delineated outputs can be seen in Fig. 4. The quantitative metrics employed for semantic segmentation are the F-score and intersection over union (IOU); the values are tabulated in Table 2.
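The segmentation metrics can likewise be computed directly from binary prediction and ground truth masks. The sketch below uses one common definition of the F-measure (the Dice form); the exact evaluation protocol may differ in details.

```python
import numpy as np

def f_score(pred, gt):
    """F-measure (Dice coefficient) between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()  # true positive pixels
    return 2.0 * tp / (pred.sum() + gt.sum())

def iou(pred, gt):
    """Intersection over union between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union
```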
It is interesting to note that depth pretraining leads to improved segmentation accuracy for Unet compared to the network trained from scratch. For ResUnet, it leads to similar performance, but pretraining yields more consistent results across the cases with and without adversarial training. One reason the depth-pretrained models do not give significantly better results than the models trained from scratch could be dataset bias, since the RIM-ONE and INSPIRE-stereo datasets appear to differ significantly in quality, luminance, and color distribution when examined visually. Indeed, as can be seen in Fig. 4, depth estimation on the RIM-ONE dataset does not appear to yield very accurate results. The unavailability of a large dataset for depth estimation could be another cause.
Table 2: Joint optic disc and cup segmentation results.

Method       | Disc F-Measure | Disc IOU | Cup F-Measure | Cup IOU
[2]          | 0.901          | 0.842    | -             | -
[7]          | 0.931          | 0.880    | 0.801         | 0.764
[9]          | 0.892          | 0.829    | 0.744         | 0.732
[10]         | 0.942          | 0.890    | 0.824         | 0.802
Unet [5]     | 0.973          | 0.886    | 0.927         | 0.749
UGAN [5]     | 0.984          | 0.949    | 0.779         | 0.675
ResUnet [5]  | 0.977          | 0.901    | 0.945         | 0.786
ResUGAN [5]  | 0.987          | 0.961    | 0.906         | 0.739
DP Unet      | 0.9841         | 0.9472   | 0.9347        | 0.7395
DP UGAN      | 0.9841         | 0.9497   | 0.9285        | 0.7390
DP ResUnet   | 0.9857         | 0.9575   | 0.9354        | 0.7458
DP ResUGAN   | 0.9861         | 0.9575   | 0.9354        | 0.7488
IV. Conclusion
In this work, we proposed a new method for monocular retinal depth estimation using deep learning. Although this method outperforms existing depth estimation methods by a large margin in terms of the usual metrics, its generalization ability remains one of the main concerns. The study of depth estimation as a proxy task for joint optic disc-cup segmentation highlights this generalization issue. One way to address it would be to use better augmentation techniques to reduce the dataset bias. In the future, we would also like to explore methods for using the depth information explicitly for semantic segmentation.
References
[1] A. Almazroa et al., "Optic disc and optic cup segmentation methodologies for glaucoma image detection: A survey," J. Ophthalmol., 2015.

[2] A. Aquino, M. Gegundez-Arias, and D. Marin, "Detecting the optic disc boundary in digital fundus images using morphological, edge detection, and feature extraction techniques," IEEE Trans. Med. Imag., vol. 29, no. 11, pp. 1860–1869, 2010.

[3] J. Lowell, A. Hunter, D. Steel, A. Basu, R. Ryder, E. Fletcher et al., "Optic nerve head segmentation," IEEE Trans. Med. Imag., vol. 23, no. 2, pp. 256–264, 2004.

[4] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in MICCAI. Springer, 2015, pp. 234–241.

[5] S. M. Shankaranarayana, K. Ram, K. Mitra, and M. Sivaprakasam, "Joint optic disc and cup segmentation using fully convolutional and adversarial networks," in Fetal, Infant and Ophthalmic Medical Image Analysis. Springer, 2017, pp. 168–176.

[6] J. Xu, O. Chutatape, E. Sung, C. Zheng, and P. Chew, "Optic disk feature extraction via modified deformable model technique for glaucoma analysis," Pattern Recognit., vol. 40, no. 7, pp. 2063–2076, 2007.

[7] G. D. Joshi, J. Sivaswamy, and S. R. Krishnadas, "Optic disk and cup segmentation from monocular color retinal images for glaucoma assessment," IEEE Trans. Med. Imag., vol. 30, no. 6, pp. 1192–1205, 2011.

[8] Y. Zheng, D. Stambolian, J. O'Brien, and J. C. Gee, "Optic disc and cup segmentation from color fundus photograph using graph cut with priors," in MICCAI 2013, 2013, pp. 75–82.

[9] J. Cheng, J. Liu, et al., "Superpixel classification based optic disc and optic cup segmentation for glaucoma screening," IEEE Trans. Med. Imag., vol. 32, no. 6, pp. 1019–1032, 2013.

[10] J. Zilly, J. M. Buhmann, and D. Mahapatra, "Glaucoma detection using entropy sampling and ensemble learning for automatic optic cup and disc segmentation," Computerized Medical Imaging and Graphics, vol. 55, pp. 28–41, 2017.

[11] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014, pp. 2366–2374.

[12] F. Liu, C. Shen, and G. Lin, "Deep convolutional neural fields for depth estimation from a single image," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 5162–5170.

[13] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 2650–2658.

[14] L. Tang, M. K. Garvin, K. Lee, W. L. W. Alward, Y. H. Kwon, and M. D. Abramoff, "Robust multiscale stereo matching from fundus images with radiometric differences," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2245–2258, Nov. 2011, doi: 10.1109/TPAMI.2011.6.

[15] A. Chakravarty and J. Sivaswamy, "Coupled sparse dictionary for depth-based cup segmentation from single color fundus image," in MICCAI. Springer, Cham, 2014, pp. 747–754.

[16] A. Ramaswamy, K. Ram, and M. Sivaprakasam, "A depth based approach to glaucoma detection using retinal fundus images," in Proc. Ophthalmic Medical Image Analysis Third International Workshop (OMIA 2016), held in conjunction with MICCAI 2016, Athens, Greece, 2016, pp. 9–16.

[17] A. Chakravarty and J. Sivaswamy, "Joint optic disc and cup boundary extraction from monocular fundus images," Computer Methods and Programs in Biomedicine, vol. 147, pp. 51–61, 2017.

[18] S. M. Shankaranarayana, K. Ram, K. Mitra, and M. Sivaprakasam, "Fully convolutional networks for monocular retinal depth estimation and optic disc-cup segmentation," IEEE J. Biomed. Health Inform., vol. 23, no. 4, pp. 1417–1426, 2019.