Monocular Retinal Depth Estimation and Joint Optic Disc and Cup Segmentation using Adversarial Networks

by Sharath M Shankaranarayana, et al.

One of the important parameters for the assessment of glaucoma is optic nerve head (ONH) evaluation, which usually involves depth estimation and subsequent optic disc and cup boundary extraction. Depth is usually obtained explicitly from imaging modalities like optical coherence tomography (OCT), and it is very challenging to estimate from a single RGB image. To this end, we propose a novel method that uses an adversarial network to predict a depth map from a single image. The proposed depth estimation technique is trained and evaluated using individual retinal images from the INSPIRE-stereo dataset. We obtain a very high average correlation coefficient of 0.92 upon five-fold cross-validation, outperforming the state of the art. We then use the depth estimation process as a proxy task for joint optic disc and cup segmentation.




I Introduction

Glaucoma is a serious vision-threatening ocular disorder in which there is a gradual degeneration of the optic nerve head (ONH) of the retina. Screening is done using 2D fundus imaging for the assessment of the optic disc (OD) and cup. The vertical cup-to-disc ratio (CDR), a quantitative measure of the enlargement of the cup with respect to the disc, is an important indicator of the disease. It requires accurate delineation of the OD and the cup, which is typically done by a skilled grader.
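To make the vertical CDR concrete, the following minimal sketch computes it from two binary segmentation masks. The function names and the toy masks are hypothetical, and the vertical diameter is approximated here by the number of image rows each structure spans, which is an assumption rather than the grading protocol used in the paper.

```python
def vertical_extent(mask):
    """Number of rows spanned by the foreground (non-zero pixels) of a 2D mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    return 0 if not rows else rows[-1] - rows[0] + 1


def vertical_cdr(disc_mask, cup_mask):
    """Vertical cup-to-disc ratio: cup height divided by disc height."""
    disc_height = vertical_extent(disc_mask)
    if disc_height == 0:
        raise ValueError("disc mask is empty")
    return vertical_extent(cup_mask) / disc_height


# Toy masks (1 = inside the structure): the disc spans 6 rows, the cup 3 rows.
disc = [[0, 1, 1, 0]] * 6
cup = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(vertical_cdr(disc, cup))
```

A larger ratio indicates a larger cup relative to the disc, which is why accurate disc and cup boundaries matter for this measure.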

There have been many works on optic disc and cup segmentation, and [1] provides a survey of the different techniques. Many methods based on morphological techniques [2], deformable energy-based models [3][6] and graph cuts [8] have been proposed. Recently, with the advent of deep learning, U-net-like [4] fully convolutional architectures have been used for many kinds of semantic segmentation tasks. For the case of glaucoma, [10] recently proposed the use of CNNs where filters are learned in a greedy fashion and the image is passed through the CNN to get pixelwise predictions for disc and cup segmentation. We recently proposed an end-to-end fully convolutional network for the task of joint optic disc and cup segmentation [5]. That work also explored the use of adversarial training for the segmentation task.

Depth is also an important cue for the assessment of glaucoma, and explicitly measuring depth requires complicated imaging techniques such as stereoscopic imaging or optical coherence tomography (OCT). Using these modalities at large scale is infeasible due to their cost, difficulty of operation and limited portability. This necessitates a method for depth estimation from a single image. Single image depth estimation is a highly challenging task that has been explored using deep learning [11] [12]; the authors of [13] perform not only depth estimation but also surface normal estimation and semantic segmentation with a common architecture. In the case of retinal imaging, there have been a few works on depth estimation. A method for estimating depth from stereo is proposed in [14]. Single image depth estimation using a coupled sparse dictionary based supervision method is proposed in [15]. A fast marching based depth estimation is proposed in [16]. Most of these single image retinal depth estimation methods rely predominantly on image intensities and hence are not very robust, which motivates the need for a more robust method.

The main contributions of this work are as follows:

  1. We propose a new scheme for depth estimation of monocular fundus images using a fully convolutional network.

  2. We also explore the effectiveness of depth estimation as a proxy task for joint optic disc and cup segmentation. Since the cupping phenomenon occurs in the optic nerve head (ONH), leading to the variations in depth in ONH, the task of depth estimation could naturally serve as a pretraining method for segmentation.

To the best of our knowledge, this is the first work addressing monocular retinal depth estimation using deep learning.

Fig. 1: Sample Results from our method

II Methods

Our work consists of two main parts: depth estimation and joint optic disc-cup segmentation. For both these tasks, we employ the fully convolutional network architecture that we proposed in [5]. The network proposed in [5] is a generative adversarial network (GAN) based architecture consisting of a generator and a discriminator. The generator is a U-net-like [4] encoder-decoder architecture with residual connections. The discriminator has a standard CNN architecture employed for classification. Readers are advised to refer to [5] for more details about the architecture. In our set-up, the generator is tasked with learning the required mappings for both depth estimation and joint optic disc-cup segmentation.

Fig. 2: Our proposed framework: the first part (a) consists of pretraining using a denoising autoencoder, which serves as weight initialization for part (b); part (b) consists of depth estimation and in turn serves as weight initialization for part (c), which consists of segmentation

Single image depth estimation is a challenging task, and more so in our case because of the unavailability of a large dataset. Hence, we first collect retinal images from various datasets and crop the region of interest, which in this case is the region around the optic disc. With this dataset, and with the same architecture as the generator, we train a denoising autoencoder in order to learn retinal representations. Thus, we pretrain the generator part of the network as a denoising autoencoder. Once pretrained, we proceed with depth estimation and disc-cup segmentation (see Fig. 2) as described in the subsequent subsections.
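As a rough illustration of how training pairs for such a denoising autoencoder can be built, the sketch below corrupts a clean crop and keeps the clean crop as the target. The source does not state the noise model, so additive Gaussian noise, the `sigma` value and the function names are all assumptions, and a single-channel image with intensities in [0, 1] is used for brevity.

```python
import random


def add_gaussian_noise(image, sigma=0.1, rng=None):
    """Return a noisy copy of a 2D image, with values clipped to [0, 1]."""
    rng = rng or random.Random()
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def make_denoising_pair(image, sigma=0.1, rng=None):
    """Build one (input, target) pair: the noisy image and the clean image."""
    return add_gaussian_noise(image, sigma, rng), image


# Hypothetical 2x2 crop around the optic disc.
clean = [[0.2, 0.4], [0.6, 0.8]]
noisy, target = make_denoising_pair(clean, sigma=0.05, rng=random.Random(0))
print(noisy)
```

The autoencoder is then trained to reconstruct `target` from `noisy`, so no depth or segmentation labels are needed at this stage.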

II-A Depth Estimation

For the first task, our goal is to train a fully convolutional network to predict depth from a single RGB fundus image. For this, we employ the pretrained generator network to learn a mapping between the fundus image and the corresponding depth map. We treat depth estimation as a regression problem. Given an RGB fundus image $I$ and the corresponding depth map $d$, our network learns the mapping $G_{depth}: I \to d$. As is the case with any regression problem, we employ the standard $L_2$ loss function:

$$L_{L2}(G_{depth}) = \mathbb{E}_{I,d \sim p_{data}(I,d)}\left[\lVert d - G_{depth}(I) \rVert_2\right]$$

Additionally, we augment the $L_2$ loss with an adversarial loss so as to improve the depth estimation:

$$G_{depth}^{*} = \arg\min_{G_{depth}} \max_{D_{depth}} \left( L_{GAN}(G_{depth}, D_{depth}) + \lambda\, L_{L2}(G_{depth}) \right)$$

where $D_{depth}$ is the discriminator network for depth and $L_{GAN}$ is the adversarial loss given by:

$$L_{GAN}(G_{depth}, D_{depth}) = \mathbb{E}_{d \sim p_{data}(d)}\left[\log D_{depth}(d)\right] + \mathbb{E}_{I \sim p_{data}(I)}\left[\log\left(1 - D_{depth}(G_{depth}(I))\right)\right]$$
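To show how the two terms of the depth objective combine, the following sketch evaluates the objective value for a single image and depth pair. The function names, the weight `lam` and the toy numbers are assumptions (the source does not report the value of λ), and a real implementation would use a deep learning framework and average over batches.

```python
import math


def l2_loss(pred, target):
    """Euclidean norm of the pixelwise difference between two depth maps."""
    return math.sqrt(sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ))


def gan_loss(d_real, d_fake, eps=1e-12):
    """log D(d) + log(1 - D(G(I))) for discriminator outputs in (0, 1)."""
    return math.log(d_real + eps) + math.log(1.0 - d_fake + eps)


def depth_objective(pred, target, d_real, d_fake, lam=10.0):
    """Value of L_GAN + lam * L_L2 for one (image, depth) pair."""
    return gan_loss(d_real, d_fake) + lam * l2_loss(pred, target)


# Toy 2x2 depth maps; d_real and d_fake stand in for discriminator outputs.
true_depth = [[0.5, 0.1], [0.3, 0.3]]
pred_depth = [[0.5, 0.5], [0.3, 0.3]]
value = depth_objective(pred_depth, true_depth, d_real=0.9, d_fake=0.1)
print(round(value, 3))
```

The generator tries to lower this value while the discriminator tries to raise the adversarial part of it, so the weight `lam` controls how strongly the regression term dominates.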

II-B Joint Optic Disc and Cup Segmentation

The phenomenon of cupping leads to an increase in the relative depth between the cup and the disc. This serves as the main motivation for us to explore depth estimation as a proxy task for joint optic disc-cup segmentation. Our goal in this task is to train a fully convolutional network for segmentation. We use a GAN based framework where the generator is tasked with learning a mapping between an RGB image $x$ as input and the corresponding segmentation map $y$. The generator needs to produce outputs that fool an adversarially trained discriminator, which is trained to discriminate between generated and real segmentation maps. The final objective function for segmentation is given by:

$$G_{segment}^{*} = \arg\min_{G_{segment}} \max_{D_{segment}} \left( L_{GAN}(G_{segment}, D_{segment}) + \lambda\, L_{L1}(G_{segment}) \right)$$

where $L_{GAN}$ and $L_{L1}$ are the adversarial loss and the $L_1$ loss functions given by:

$$L_{GAN}(G_{segment}, D_{segment}) = \mathbb{E}_{y \sim p_{data}(y)}\left[\log D_{segment}(y)\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D_{segment}(G_{segment}(x))\right)\right]$$

$$L_{L1}(G_{segment}) = \mathbb{E}_{x,y \sim p_{data}(x,y)}\left[\lVert y - G_{segment}(x) \rVert_1\right]$$

Note that in the equations of Sections II-A and II-B, $p_{data}(x)$ corresponds to the distribution of $x$, and the expectation of the log-likelihood of the pair $(x, y)$ being sampled from the underlying probability distribution of real pairs $p_{data}(x, y)$ is represented by $\mathbb{E}_{x,y \sim p_{data}(x,y)}$; the depth equations use $(I, d)$ in place of $(x, y)$.
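A min-max objective of this form is commonly optimized by alternating updates of the discriminator and the generator. The source does not describe the update schedule, so the split below is a sketch under that assumption. Since $\log D_{segment}(y)$ does not depend on the generator and the $L_1$ term does not depend on the discriminator, the segmentation objective separates into two sub-problems:

$$\max_{D_{segment}} \; \mathbb{E}_{y \sim p_{data}(y)}\left[\log D_{segment}(y)\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D_{segment}(G_{segment}(x))\right)\right]$$

$$\min_{G_{segment}} \; \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D_{segment}(G_{segment}(x))\right)\right] + \lambda\, \mathbb{E}_{x,y \sim p_{data}(x,y)}\left[\lVert y - G_{segment}(x) \rVert_1\right]$$

The same separation applies to the depth objective, with $(I, d)$ in place of $(x, y)$ and the $L_2$ term in place of the $L_1$ term.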

Since we propose depth estimation as a proxy task, we initialize the weights of the generator with the weights of the fully trained depth estimation network.

III Experiments and Results

Fig. 3: Qualitative results for depth estimation with the input image and depth maps and corresponding surface reconstruction

Since the first task in our work is to perform pretraining using a denoising autoencoder, we first collect retinal images from various sources such as RIM-ONE, MESSIDOR, DRIVE and STARE. We then crop the OD region, add noise to the images and train a denoising autoencoder with different generator architectures, namely U-net [4] and ResU-net [5].

We then use the pretrained deep networks for estimating depth. We use the INSPIRE-stereo dataset [14], which consists of color fundus images along with ground truth depth maps obtained from OCT. We use five-fold cross-validation, holding out one fold for validation and training on the remaining folds. We apply heavy data augmentation to the training set, with various levels of zoom and gamma jitter along with standard techniques such as flips and rotations, in turn enlarging the training data considerably to aid in training a deep network. We then train the network on the augmented training set. The results of depth estimation are shown in Fig. 3. The metrics used for quantitative evaluation of the depth maps are the root mean squared error (RMSE), given by

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{d}_i - d_i\right)^2}$$

and the correlation coefficient, given by

$$\mathrm{CC} = \frac{\sum_{i=1}^{N}\left(\hat{d}_i - \bar{\hat{d}}\right)\left(d_i - \bar{d}\right)}{\sqrt{\sum_{i=1}^{N}\left(\hat{d}_i - \bar{\hat{d}}\right)^2}\,\sqrt{\sum_{i=1}^{N}\left(d_i - \bar{d}\right)^2}}$$

where $\hat{d}$ and $d$ are the estimated and ground truth depth maps respectively, $i$ is the pixel index, $N$ is the number of pixels, and $\bar{\hat{d}}$ and $\bar{d}$ are the mean values of the respective depth maps. We list the values obtained for four experiments:
  1. U-net with only the $L_2$ loss (U-net)
  2. Residual U-net with only the $L_2$ loss (ResU-net)
  3. U-net with adversarial loss (U-GAN)
  4. Residual U-net with adversarial loss (ResU-GAN)

The values obtained are listed in Table I. It can be seen from the table that the deep learning based methods yield superior results compared to the other methods for monocular depth estimation.
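The flips, rotations and gamma jitter mentioned in the training description could be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the gamma range and the choice of 90 degree rotations are assumptions, zoom is omitted, and a single-channel image stands in for the RGB input. The one design point worth noting is that geometric transforms are applied identically to the image and its depth map, while the intensity transform touches the image only.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def gamma_jitter(image, gamma):
    """Apply gamma correction to intensities in [0, 1]."""
    return [[px ** gamma for px in row] for row in image]


def augment(image, depth, rng):
    """Return one randomly augmented (image, depth) pair."""
    if rng.random() < 0.5:
        image, depth = horizontal_flip(image), horizontal_flip(depth)
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        image, depth = rotate90(image), rotate90(depth)
    gamma = rng.uniform(0.8, 1.2)  # assumed jitter range
    return gamma_jitter(image, gamma), depth


rng = random.Random(0)
image = [[0.1, 0.2], [0.3, 0.4]]
depth = [[1, 2], [3, 4]]
aug_image, aug_depth = augment(image, depth, rng)
print(aug_image, aug_depth)
```

Calling `augment` repeatedly on each training pair produces many distinct variants, which is how a small dataset can be enlarged before training.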

Method               Corr. Coef. Mean   Corr. Coef. Std-Dev   RMSE Mean   RMSE Std-Dev
[14]                 -                  -                     0.1592      0.0879
[15]                 0.8000             0.1200                -           -
[16]                 0.8225             -                     0.1532      0.1206
Proposed U-net       0.8322             0.1077                0.0190      0.0094
Proposed ResU-net    0.8984             0.0698                0.0124      0.0079
Proposed U-GAN       0.9268             0.0377                0.0105      0.0080
Proposed ResU-GAN    0.9269             0.0434                0.0099      0.0054
TABLE I: Comparison of various Depth estimation methods
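The two depth metrics reported in Table I can be computed per image as in the sketch below. It assumes the standard definitions of RMSE and of the Pearson correlation coefficient over all pixels; the function names and the toy depth maps are hypothetical.

```python
import math


def flatten(depth_map):
    """Turn a 2D depth map into a flat list of pixel values."""
    return [v for row in depth_map for v in row]


def rmse(pred, truth):
    """Root mean squared error between two depth maps of equal size."""
    p, t = flatten(pred), flatten(truth)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, t)) / len(p))


def correlation(pred, truth):
    """Pearson correlation coefficient between two depth maps."""
    p, t = flatten(pred), flatten(truth)
    mean_p, mean_t = sum(p) / len(p), sum(t) / len(t)
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_t = sum((b - mean_t) ** 2 for b in t)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant depth map")
    return cov / math.sqrt(var_p * var_t)


# Toy example: the prediction is the ground truth shifted by a constant 0.1.
truth = [[0.0, 0.2], [0.4, 0.6]]
pred = [[0.1, 0.3], [0.5, 0.7]]
print(rmse(pred, truth), correlation(pred, truth))
```

The toy example shows why both metrics are reported: a constant offset leaves the correlation at its maximum while the RMSE still registers the error.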

For the task of joint optic disc and cup segmentation, we use the RIM-ONE dataset, which contains labeled images for the optic disc and cup. Instead of training from scratch, we use the respective depth-pretrained networks for weight initialization. Accordingly, we again have four experiments for segmentation: depth-pretrained U-net and ResU-net without adversarial loss (DP U-net and DP ResU-net respectively), and depth-pretrained U-net and ResU-net with adversarial loss (DP U-GAN and DP ResU-GAN respectively).

The delineated outputs can be seen in Fig. 4. The quantitative metrics employed for semantic segmentation are the F-score and Intersection over Union (IOU). The values are tabulated in Table II.
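For reference, the F-score and IOU of a single predicted mask can be computed from pixel counts as sketched below. The function name and the toy masks are hypothetical, and how the paper aggregates the scores over images is not stated in the source.

```python
def f_score_and_iou(pred_mask, true_mask):
    """Return (F-score, IOU) for two binary masks of equal size."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1  # pixel correctly marked as foreground
            elif p and not t:
                fp += 1  # predicted foreground, actually background
            elif t and not p:
                fn += 1  # missed foreground pixel
    if tp + fp + fn == 0:
        raise ValueError("both masks are empty")
    f_score = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f_score, iou


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0], [0, 1, 0]]
true = [[1, 0, 0], [0, 1, 1]]
print(f_score_and_iou(pred, true))
```

Both measures reward overlap between the predicted and ground truth regions, with IOU penalizing mismatched pixels more heavily than the F-score.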

It is interesting to note that depth pretraining leads to improved segmentation accuracy in the case of U-net compared to the network trained from scratch. For ResU-net, it leads to similar performance, but pretraining gives consistent results both with and without adversarial training. One possible reason why the depth-pretrained models do not give significantly better results than models trained from scratch is dataset bias, since the RIM-ONE and INSPIRE-stereo datasets appear to differ significantly in quality, luminance and color distribution when examined visually. It can also be seen from Fig. 4 that depth estimation on the RIM-ONE dataset does not yield very accurate results. The unavailability of a large dataset for depth estimation could be another reason for not outperforming the models trained from scratch.

Method          Optic Disc F-Measure   Optic Disc IOU   Optic Cup F-Measure   Optic Cup IOU
[2]             0.901                  0.842            -                     -
[7]             0.931                  0.880            0.801                 0.764
[9]             0.892                  0.829            0.744                 0.732
[10]            0.942                  0.890            0.824                 0.802
U-net [5]       0.973                  0.886            0.927                 0.749
U-GAN [5]       0.984                  0.949            0.779                 0.675
ResU-net [5]    0.977                  0.901            0.945                 0.786
ResU-GAN [5]    0.987                  0.961            0.906                 0.739
DP U-net        0.9841                 0.9472           0.9347                0.7395
DP U-GAN        0.9841                 0.9497           0.9285                0.7390
DP ResU-net     0.9857                 0.9575           0.9354                0.7458
DP ResU-GAN     0.9861                 0.9575           0.9354                0.7488
TABLE II: Comparison of various segmentation methods
Fig. 4: Results for segmentation: the first four columns show the results for network trained from scratch and the last four columns show the results for depth pretrained networks while the middle column displays the estimated depth map for RIMONE dataset

IV Conclusion

In this work, we proposed a new method for monocular retinal depth estimation using deep learning. Although this method outperforms existing depth estimation methods by a large margin in terms of the usual metrics, its generalization ability is one of the main concerns. The study of depth pretraining as a proxy task for joint optic disc-cup segmentation highlights this issue of generalization. One way to address it would be to use better augmentation techniques to reduce the dataset bias. In the future, we would also like to explore methods for using the depth information explicitly for semantic segmentation.


  • [1] Almazroa, Ahmed, et al. Optic Disc and Optic Cup Segmentation Methodologies for Glaucoma Image Detection: A Survey, J. Ophthalmol, 2015.
  • [2] A. Aquino, M. Gegundez-Arias, and D. Marin, Detecting the optic disc boundary in digital fundus images using morphological edge detection and feature extraction techniques, IEEE Trans. Med. Imag, vol. 20, no. 11, pp. 1860-1869, 2010.

  • [3] J. Lowell, A. Hunter, D. Steel, A. Basu, R. Ryder, E. Fletcher et al., “Optic nerve head segmentation,” IEEE Trans. Med. Imag. , vol. 23, no. 2, pp. 256–264, 2004
  • [4] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015
  • [5] S. M. Shankaranarayana, K. Ram, K. Mitra, and M. Sivaprakasam, “Joint optic disc and cup segmentation using fully convolutional and adversarial networks,” in Fetal, Infant and Ophthalmic Medical Image Analysis. Springer, 2017, pp. 168–176.
  • [6] J. Xu, O. Chutatape, E. Sung, C. Zheng, and P. Chew, “Optic disk feature extraction via modified deformable model technique for glaucoma analysis,” Pattern Recognit., vol. 40, no. 7, pp. 2063–2076, 2007.

  • [7] G.D. Joshi, J. Sivaswamy, and S.R. Krishnadas, Optic disk and cup segmentation from monocular color retinal images for glaucoma assessment, IEEE Trans. Med. Imag, vol. 30, no. 6, pp. 1192-1205, 2011.
  • [8] Y. Zheng, D. Stambolian, J. OBrien, and J. C. Gee, “Optic disc and cup segmentation from color fundus photograph using graph cut with priors,” in MICCAI 2013, 2013, pp. 75–82
  • [9] J. Cheng, J. Liu, and et. al., Superpixel classification based optic disc and optic cup segmentation for glaucoma screening, IEEE Trans. Med. Imag, vol. 32, no. 6, pp. 1019-1032, 2013.
  • [10] J. Zilly, J. M. Buhmann, and D. Mahapatra, “Glaucoma detection using entropy sampling and ensemble learning for automatic optic cup and disc segmentation,” Computerized Medical Imaging and Graphics , vol. 55, pp. 28–41, 2017
  • [11] Eigen, David, Christian Puhrsch, and Rob Fergus. ”Depth map prediction from a single image using a multi-scale deep network.” In Advances in neural information processing systems, pp. 2366-2374. 201
  • [12] Liu, Fayao, Chunhua Shen, and Guosheng Lin. “Deep convolutional neural fields for depth estimation from a single image.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5162-5170. 2015.

  • [13] Eigen, David, and Rob Fergus. ”Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 2650-2658. 2015.
  • [14] L. Tang, M. K. Garvin, K. Lee, W. L. W. Alward, Y. H. Kwon and M. D. Abramoff, ”Robust Multiscale Stereo Matching from Fundus Images with Radiometric Differences,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2245-2258, Nov. 2011. doi: 10.1109/TPAMI.2011.6
  • [15] Chakravarty, A. and Sivaswamy, J., 2014, September. Coupled sparse dictionary for depth-based cup segmentation from single color fundus image. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 747-754). Springer, Cham.
  • [16] Ramaswamy, Akshaya; Ram, Keerthi; and Sivaprakasam, Mohanasankar. A Depth Based Approach to Glaucoma Detection Using Retinal Fundus Images. In: Chen X, Garvin MK, Liu J, Trucco E, Xu Y editors. Proceedings of the Ophthalmic Medical Image Analysis Third International Workshop, OMIA 2016, Held in Conjunction with MICCAI 2016, Athens, Greece, October 21, 2016. 9–16.
  • [17] Chakravarty, A. and Sivaswamy, J., 2017. Joint optic disc and cup boundary extraction from monocular fundus images. Computer methods and programs in biomedicine, 147, pp.51-61.
  • [18] Shankaranarayana, S.M., Ram, K., Mitra, K. and Sivaprakasam, M., 2019. Fully convolutional networks for monocular retinal depth estimation and optic disc-cup segmentation. IEEE journal of biomedical and health informatics, 23(4), pp.1417-1426.