Analysis of Macula on Color Fundus Images Using Heightmap Reconstruction Through Deep Learning

12/28/2020
by Peyman Tahghighi, et al.
University of Tehran

For medical diagnosis based on retinal images, a clear understanding of the 3D structure is often required, but due to the 2D nature of the captured images, we cannot infer that information directly. However, by utilizing 3D reconstruction methods, we can recover the height information of the macula area on a fundus image, which can be helpful for the diagnosis and screening of macular disorders. Recent approaches have used shading information for heightmap prediction, but their output was not accurate since they relied on shading alone and ignored the dependency between nearby pixels. Additionally, other methods depend on the availability of more than one image of the retina, which is not available in practice. In this paper, motivated by the success of Conditional Generative Adversarial Networks (cGANs) and deeply supervised networks, we propose a novel generator architecture which enhances the details and the quality of the output through progressive refinement and the use of deep supervision to reconstruct the height information of the macula on a color fundus image. Comparisons on our own dataset illustrate that the proposed method outperforms all of the state-of-the-art methods in image translation and medical image translation on this particular task. Additionally, perceptual studies indicate that the proposed method can provide additional information for ophthalmologists for diagnosis.


1 Introduction

Vision is crucial for everyday activities such as driving, watching television, reading and interacting socially, and visual impairment can be a real handicap. Even the slightest visual loss can affect our quality of life considerably, cause depression and, in older adults, lead to accidents, injuries and falls [1, 2]. Blindness is the final stage of many eye diseases, and according to previous studies, the most common causes of visual impairment are cataract, macular degeneration, glaucoma and diabetic retinopathy [3, 4].

Color fundus photography is a 2D imaging modality for the diagnosis of retinal diseases. The 3D structure of the eye provides a considerable amount of crucial information (such as information about elevation in different parts) that ophthalmologists use for diagnosis but that is unavailable in 2D fundus images. Therefore, being able to infer this information from just a 2D image can be helpful. Furthermore, the reconstructed heightmap offers clinicians another means of viewing eye structure, which may help them make better and more accurate diagnoses [59, 60]. Optical Coherence Tomography (OCT) [61] is an expensive but vital tool for evaluating the retinal structure; it provides ophthalmologists with valuable information, enabling them to diagnose most macula diseases. Nevertheless, owing to its cost, OCT devices are not ubiquitous, and fundus imaging remains the most common choice for screening.

Figure 1: The left and right images represent the correspondence between a fundus image and its heightmap. As can be seen, each pixel's color value in the image on the right indicates a height according to the colorbar, which ranges from 0 to 500. Note that all values are in micrometers.

Shape from shading (SFS) [70] is the only method previously applied to reconstructing the height of a single fundus image [60]. However, this method suffers from drawbacks that limit its usage in this particular task. In fact, the success of the SFS method depends entirely on predicting the position of the light source and on assumptions about the surface which do not hold for the eye retina. Furthermore, disparity map estimation, which is one of the common methods for 3D reconstruction, was also applied to this problem [62]. Nonetheless, it depends entirely on the availability of stereo images from both eyes, which is not practical. Hence, devising a method to automatically generate an accurate heightmap image from a given fundus image is crucial.

In recent years, with the advent of Conditional Generative Adversarial Networks (cGANs) [22, 23], many researchers have used this methodology for image generation and transformation tasks [24, 39, 40, 76, 77, 80]. Medical image analysis has also benefited greatly from these powerful models: researchers have used them for translating between different medical imaging modalities, such as between CT and PET images [74, 36], for denoising and motion correction, such as denoising CT images [35] or correcting motion in MR images [36], and for segmenting medical images [37, 78, 79]. Most of these methods build on the U-Net architecture [25], which was first proposed for image segmentation, and its extension U-Net++ [21]. Perceptual loss [39, 40] is another major ingredient of successful methods; it measures the difference between two images using high-level features extracted from different layers of a Deep Neural Network (DNN).

Motivated by the promising results of cGANs on tasks related to the analysis of medical images, in this work we apply this methodology to generate a heightmap image of the macula area from a given color fundus image. Height information is among the crucial information that OCT devices provide to ophthalmologists and is missing in color fundus images due to their 2D nature. Hence, by extracting such information from a fundus image alone, we can ease the diagnosis and management of retinal diseases involving macular thickness changes.

Considering Figure 1, our problem can be seen as an image translation task in which we want to predict a color image containing height data from a fundus image targeted on the macula area, so cGANs can be utilized for this problem. That is to say, each pixel in the right image of Figure 1 has a color value which represents a height from 0 to 500 micrometers, and by predicting red, green and blue color values for each pixel of the left image (the fundus image), we can predict its heightmap. The colorbar below Figure 1 shows the assignment of color values to heights.

In this paper, we use a stack of three U-Nets as our generator network and average their outputs for deep supervision. Furthermore, in order to avoid the problems of traditional GANs, we use the Least Squares Adversarial Loss [65] instead, along with perceptual loss [39] and L2-loss as the pixel reconstruction loss. For the discriminator network, we use an image-level discriminator that classifies the whole image as real or fake. To the best of our knowledge, this is the first research paper on predicting the heightmap of the macula area on fundus images using DNNs. We evaluated our approach qualitatively and quantitatively on our dataset and compared the results with state-of-the-art methods in image translation and medical image translation. Furthermore, we studied the application of our method to real diagnosis cases, which showed that reconstructed heightmaps can provide additional information to ophthalmologists and can be used for the analysis of the macula region.

Our main contributions can be listed as follows:

  • Motivated by deeply supervised networks [69], we propose a novel deep architecture for the generator network based on U-Net [25] and CasNet [36], consisting of a number of connected U-Nets whose individual outputs we utilize for deep supervision.

  • We propose the first method for the reconstruction of the heightmap of the macula image based on DNNs.

  • Finally, the subjective performance of our reconstructed heightmap was investigated from a medical perspective by two experienced ophthalmologists.

2 Methods

2.1 Preprocessing

A DNN can learn from unpreprocessed image data, but it learns more easily and efficiently if appropriate preprocessing is applied to the input. Hence, in this paper we first apply Contrast Limited Adaptive Histogram Equalization (CLAHE) [64], which enhances the foreground and background contrast. Afterward, we normalize the input images (division by 255) to prepare them for feeding into the network. The impact of preprocessing on the input fundus images is depicted in Figure 2.

Figure 2: The effect of CLAHE preprocessing on the quality of fundus images: (a) with CLAHE, (b) original. As can be seen, the details in the preprocessed images are clearer, which can positively affect the learning procedure.
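A minimal sketch of this preprocessing step is given below, assuming OpenCV and NumPy; the clip limit and tile size are illustrative choices, not the values used by the authors.

```python
import cv2
import numpy as np

def preprocess_fundus(bgr_image: np.ndarray) -> np.ndarray:
    """Apply CLAHE to the luminance channel of an 8-bit BGR fundus image,
    then scale pixel values to [0, 1]."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))   # assumed settings
    enhanced = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    return enhanced.astype(np.float32) / 255.0                    # normalization step
```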

2.2 Network structure

In our proposed cGAN setting, the input to our generator is a 128×128×3 image of the macula area on a fundus image, and the generator produces an image of the same size and depth such that each pixel's color indicates a height, as depicted in Figure 1. The discriminator takes this image and outputs a probability between 0 and 1 which indicates how similar the image is to a real heightmap image.

Figure 3: The architecture of the generator and discriminator in our proposed method. As can be seen, the generator consists of a series of connected U-Nets whose outputs we use for deep supervision (red arrows). Moreover, the discriminator network consists of Convolution-BatchNorm-LeakyReLU layers with a final fully connected layer. We utilize four convolutional layers to compute the perceptual loss. The output of the network is a probability and is used to distinguish between real and fake images.

Our proposed generator architecture consists of three stacked U-Nets. Motivated by deeply supervised nets [69], which argue that discriminative features in deep layers contribute to better performance, we use the outputs of the first two U-Nets for deep supervision. In fact, similar to the U-Net++ architecture [21], which uses the outputs of its upsampling layers for deep supervision, we average the outputs of all U-Nets and use the result as the final output of the generator network. By doing so, our network tries to learn a meaningful representation in these deep layers, which directly contributes to the final outcome. Moreover, since the amount of detail is of significance in this work, we decided to use the average operator instead of the max operator for the final output [21]. Another advantage of this architecture is that, by using a stack of U-Nets, each U-Net can add its own level of detail to the final outcome. Furthermore, although our network is deeper than a normal U-Net, owing to the skip-connections and deep supervision involved in the architecture, the vanishing gradient problem does not occur and the loss flows easily to earlier layers through backpropagation. The generator architecture is depicted in the first row of Figure 3.
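The sketch below illustrates the idea of a stacked-U-Net generator with averaged outputs for deep supervision, assuming TensorFlow 2.x/Keras; the depth, filter counts and layer names are illustrative simplifications, not the authors' exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def small_unet(x, base=32):
    """A shallow U-Net block returning a 3-channel map with the same spatial size as x."""
    d1 = layers.Conv2D(base, 3, padding="same", activation="relu")(x)
    p1 = layers.MaxPool2D()(d1)
    d2 = layers.Conv2D(base * 2, 3, padding="same", activation="relu")(p1)
    u1 = layers.Concatenate()([layers.UpSampling2D()(d2), d1])   # skip connection
    return layers.Conv2D(3, 1, activation="tanh")(u1)

def build_generator(input_shape=(128, 128, 3), n_unets=3):
    inp = layers.Input(input_shape)
    x, outputs = inp, []
    for _ in range(n_unets):
        x = small_unet(x)          # each U-Net refines the previous one's output
        outputs.append(x)
    # Deep supervision: average the U-Net outputs to form the final heightmap.
    final = layers.Average()(outputs) if n_unets > 1 else outputs[0]
    return Model(inp, final, name="stacked_unet_generator")
```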

Regarding the discriminator, the judgment can be made at the image level or at the patch level. That is to say, the discriminator can judge the quality of the entire image (ImageGAN) or judge its patches separately (PatchGAN). Since a powerful discriminator is key to successful GAN training and strongly influences its output [22, 27], we explored both options and opted for the image-level discriminator due to the better quality of its images. Furthermore, as can be seen in the second row of Figure 3, we use hidden layers of the discriminator network to compute the perceptual loss [40] between the generated image and the ground-truth image as a supervisory signal, with the aim of producing better output.
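A minimal sketch of an image-level discriminator in the Convolution-BatchNorm-LeakyReLU style described above, assuming TensorFlow 2.x/Keras; the filter counts are assumptions, and the hidden feature maps are returned so that they can be reused for the perceptual loss.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_discriminator(input_shape=(128, 128, 3)):
    inp = layers.Input(input_shape)
    x, feature_maps = inp, []
    for filters in (64, 128, 256, 512):                 # assumed filter progression
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
        feature_maps.append(x)                          # kept for the perceptual loss
    prob = layers.Dense(1, activation="sigmoid")(layers.Flatten()(x))  # whole-image score
    return Model(inp, [prob] + feature_maps, name="image_discriminator")
```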

2.3 Objective functions

Our final loss function is composed of three parts which will be discussed in this section.

2.3.1 Least-squares adversarial loss

cGANs are generative models that learn a mapping from an observed image x and a random noise vector z to an output image y using a generator G: {x, z} → y. The discriminator D then aims to classify the concatenation of the source image and its corresponding ground-truth image, (x, y), as real, while classifying the source image paired with the transformed image, (x, G(x)), as fake.

Despite performing well in many tasks, GANs suffer from problems such as mode collapse and an unstable training procedure [22]. Therefore, to avoid such problems, in this work we adopted Least Squares Generative Adversarial Networks (LSGANs) [65]. The idea of LSGAN is that even samples on the correct side of the decision boundary can provide signals for training. To achieve this, the traditional cross-entropy loss of the normal GAN is replaced with a least-squares loss, which penalizes samples that lie on the correct side of the decision boundary but far from it. Using this simple yet effective idea, we can provide gradients even for samples that are correctly classified by the discriminator. The loss functions of LSGAN for the discriminator and the generator can be written as follows:

\mathcal{L}_{LSGAN}(D) = \tfrac{1}{2}\,\mathbb{E}_{x,y}\big[(D(x,y)-1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{x}\big[D(x,G(x))^2\big]
\mathcal{L}_{LSGAN}(G) = \tfrac{1}{2}\,\mathbb{E}_{x}\big[(D(x,G(x))-1)^2\big]    (1)

These loss functions operate directly on the raw (logit) output of the discriminator, where x denotes the input fundus image, y its ground-truth heightmap and G(x) the generated heightmap.
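A minimal sketch of the least-squares adversarial losses in Equation 1, assuming TensorFlow 2.x and a discriminator that returns a real/fake score per image.

```python
import tensorflow as tf

def lsgan_d_loss(real_score, fake_score):
    """Discriminator: push scores for real pairs toward 1 and for fake pairs toward 0."""
    return 0.5 * (tf.reduce_mean(tf.square(real_score - 1.0))
                  + tf.reduce_mean(tf.square(fake_score)))

def lsgan_g_loss(fake_score):
    """Generator: push the discriminator's score on generated samples toward 1."""
    return 0.5 * tf.reduce_mean(tf.square(fake_score - 1.0))
```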

2.3.2 Pixel reconstruction loss

Image-to-image translation tasks that rely solely on the adversarial loss do not produce consistent results [36]. Therefore, we also use a pixel reconstruction loss, but we opt for the L2-loss rather than the widely used L1-loss, since it performed better at reconstructing details in this specific task. The L2-loss is defined as:

\mathcal{L}_{L2}(G) = \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_2^2\big]    (2)

2.3.3 Perceptual loss

Although the two aforementioned loss functions alone produce plausible results, the generated images are blurry [36], and in medical diagnosis small details are of particular significance, so we additionally use a perceptual loss [40] to improve the final result. As a matter of fact, using only the L2-loss or L1-loss yields outputs that maintain the global structure but exhibit blurriness and distortions [40]. Furthermore, per-pixel losses fail to capture perceptual differences between the output and ground-truth images. For instance, for two identical images shifted relative to each other by a small offset, the per-pixel loss may vary considerably even though the images are quite similar [39]. However, by using high-level features extracted from the layers of a discriminator, we can capture such discrepancies and measure image similarity more robustly [39]. In our work, since the discriminator network also has this capability of perceiving the content of images and the differences between them, and since networks pre-trained on other tasks may perform poorly on unrelated tasks, we use hidden layers of the discriminator network [40, 36] to extract features, as illustrated in the second row of Figure 3. The mean absolute error for the i-th hidden layer between the generated image and the ground-truth image is then calculated as:

P_i(G(x), y) = \frac{1}{W_i H_i D_i} \lVert F_i(G(x)) - F_i(y) \rVert_1    (3)

where W_i, H_i and D_i denote the width, height and depth of the i-th hidden layer, respectively, and F_i(\cdot) denotes the output of the i-th layer of the discriminator network. Finally, the perceptual loss can be formulated as:

\mathcal{L}_{perc}(G) = \sum_i \lambda_{p_i}\, P_i(G(x), y)    (4)

where \lambda_{p_i} in Equation 4 tunes the contribution of the i-th hidden layer to the final loss.
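A minimal sketch of the perceptual loss in Equations 3 and 4, assuming the discriminator exposes a list of hidden feature maps (as in the discriminator sketch above); the per-layer weights are placeholders.

```python
import tensorflow as tf

def perceptual_loss(real_features, fake_features, lambdas):
    """Weighted mean absolute error between corresponding discriminator feature maps."""
    loss = 0.0
    for f_real, f_fake, lam in zip(real_features, fake_features, lambdas):
        # The mean over width, height and depth corresponds to 1 / (W_i * H_i * D_i).
        loss += lam * tf.reduce_mean(tf.abs(f_real - f_fake))
    return loss
```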

Finally, our complete loss function for the generator is as below:

\mathcal{L}_{G} = \lambda_{adv}\,\mathcal{L}_{LSGAN}(G) + \lambda_{pix}\,\mathcal{L}_{L2}(G) + \lambda_{perc}\,\mathcal{L}_{perc}(G)    (5)

where \lambda_{adv}, \lambda_{pix} and \lambda_{perc} are hyperparameters that balance the contribution of each of the different losses. Note that we also used the perceptual loss, with equal contribution, alongside the traditional cGAN loss when training the discriminator.
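Putting the pieces together, a minimal sketch of the full generator objective in Equation 5, reusing the lsgan_g_loss and perceptual_loss helpers sketched earlier; the weight values are placeholders, not the hyperparameters used in the paper.

```python
import tensorflow as tf

LAMBDA_ADV, LAMBDA_PIX, LAMBDA_PERC = 1.0, 10.0, 1.0    # placeholder weights

def total_generator_loss(fake_score, y_true, y_pred, real_features, fake_features, lambdas):
    adv = lsgan_g_loss(fake_score)                                   # Eq. (1), generator part
    pix = tf.reduce_mean(tf.square(y_true - y_pred))                 # Eq. (2), L2 reconstruction
    perc = perceptual_loss(real_features, fake_features, lambdas)    # Eq. (4)
    return LAMBDA_ADV * adv + LAMBDA_PIX * pix + LAMBDA_PERC * perc  # Eq. (5)
```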

3 Experiments

3.1 Dataset

The dataset was gathered from a TopCon DRI OCT Triton device at Negah Eye Hospital. We cropped the macula part of the fundus and heightmap images from the 3D macula report generated by the device to create image pairs. Our dataset contains 3407 color fundus-heightmap image pairs. Since the images in our dataset were insufficient, we used data augmentation for better generalization. Nevertheless, because we are dealing with medical images, we could not rotate images: rotating an image by, for example, 90° would place the vessels in a vertical position, which is impossible in fundus imaging. Moreover, changing the brightness is also not permissible, since it would alter the standard brightness of a fundus image. Hence, the only augmentation we could apply was to flip the images in 3 different ways, generating 4 different samples from every image. Consequently, we obtained 13,628 images, of which we used 80% for training, 10% for validation and 10% for testing. Some examples from the dataset are illustrated in Figure 4.
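A minimal sketch of the flip-only augmentation described above, assuming NumPy arrays: each fundus/heightmap pair yields four samples (original, horizontal flip, vertical flip, both), with the pair flipped identically.

```python
import numpy as np

def augment_pair(fundus: np.ndarray, heightmap: np.ndarray):
    """Return the four flipped versions of an image pair."""
    pairs = [(fundus, heightmap)]                                      # original
    pairs.append((np.fliplr(fundus), np.fliplr(heightmap)))            # horizontal flip
    pairs.append((np.flipud(fundus), np.flipud(heightmap)))            # vertical flip
    pairs.append((np.flipud(np.fliplr(fundus)),
                  np.flipud(np.fliplr(heightmap))))                    # both flips
    return pairs
```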

Figure 4: Examples of macula areas on fundus images and their corresponding heightmaps in our dataset.

3.2 Experimental setup

We used TensorFlow 2.0 [66] to implement our network. We used the Adam optimizer [73] with an initial learning rate that was decayed in steps during training. Moreover, we used a batch size of 8 and trained for 250 epochs to converge. Additionally, we set the weights \lambda_{p_i} in Equation 4 by trial-and-error, considering the contribution of each hidden layer as discussed in [40, 39]. Since we are working in a very high-dimensional parameter space, convergence and finding the optimal weights are difficult, and starting from a random point does not work very well [58]. As a result, in order to ease the training procedure and convergence, we employed a step-by-step training schema. That is to say, we first trained the first U-Net completely, then added the next U-Net and trained the stack of 2 U-Nets with deep supervision, and finally trained the entire network. Note that we loaded the weights of the previous network when training the new one.
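A minimal sketch of the optimizer setup with a step-decay schedule, assuming TensorFlow 2.x; the numeric values are placeholders, not the ones used in the paper, and the staged training is only outlined in comments.

```python
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=2e-4,   # placeholder
    decay_steps=10_000,           # placeholder: decay every N steps
    decay_rate=0.5,               # placeholder: step-decay factor
    staircase=True)               # step decay rather than smooth decay
gen_optimizer = tf.keras.optimizers.Adam(lr_schedule)
disc_optimizer = tf.keras.optimizers.Adam(lr_schedule)

# Staged training as described above: first train a single U-Net, then add the
# second U-Net (initialised from the trained weights) and train the 2-U-Net
# stack with deep supervision, and finally train the full 3-U-Net generator.
```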

3.3 Evaluation metrics

In this work, we utilized a variety of metrics to evaluate our final outcomes quantitatively. We measured the quality of the final image using the Structural Similarity Index (SSIM) [67], Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR). Nevertheless, these measures are insufficient for assessing structured outputs such as images, as they assume pixel-wise independence [68]. Consequently, we also used the Learned Perceptual Image Patch Similarity (LPIPS) [68], which outperforms other measures in comparing the perceptual quality of images. We used features extracted from hidden layers of the discriminator network and calculated the difference between \hat{y}, the generated heightmap, and y, the ground-truth heightmap, for a given input x using the equation below, in which all parameters are as in Equation 3:

\mathrm{LPIPS}(\hat{y}, y) = \sum_i \frac{1}{W_i H_i D_i} \lVert F_i(\hat{y}) - F_i(y) \rVert_2^2    (6)
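A minimal sketch of the pixel-level evaluation metrics (SSIM, PSNR, MSE), assuming TensorFlow 2.x and images scaled to [0, 1]; the LPIPS-style measure of Equation 6 would be computed separately from the discriminator features.

```python
import tensorflow as tf

def pixel_metrics(y_true, y_pred):
    """y_true, y_pred: float32 tensors of shape (batch, H, W, 3) with values in [0, 1]."""
    ssim = tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))
    psnr = tf.reduce_mean(tf.image.psnr(y_true, y_pred, max_val=1.0))
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    return {"SSIM": ssim, "PSNR": psnr, "MSE": mse}
```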

3.4 Analysis of generator architecture

In this part, we explore different numbers of stacked U-Nets in the generator architecture to find the optimal one. We fixed the loss weights in Equation 5 and used ImageGAN for our discriminator. The quantitative comparison is shown in Figure 5. As can be seen, stacking three U-Nets resulted in higher values for SSIM and PSNR and lower values for MSE and LPIPS. Furthermore, the qualitative comparison depicted in Figure 6 also supports our claim that a stack of three U-Nets is the best choice. As a matter of fact, the generator with three U-Nets in this figure did well at predicting the full shape of the red region as well as the correct position and full shape of the intense red spots. Additionally, it seems that adding more U-Nets to the structure makes the results blurrier and details begin to vanish. Therefore, three U-Nets is the optimal number that preserves fine details and produces plausible outcomes.

(a) SSIM and PSNR
(b) MSE and LPIPS
Figure 5: Quantitative comparison between different numbers of stacked U-Nets in the generator architecture.
Figure 6: Qualitative comparison for different numbers of stacked U-Nets in the generator structure: (a) 1 U-Net, (b) 2 U-Nets, (c) 3 U-Nets, (d) 4 U-Nets, (e) 5 U-Nets, (f) ground truth.

3.5 Effectiveness of deep supervision

In this section, we explore the effectiveness of using deep supervision. For this experiment, we trained our network with and without the use of deep supervision. As can be seen in Figure 7, the output of the deep layers when we used deep supervision is meaningful. In fact, by averaging the output of each of the layers, we force the network to output plausible images from these layers and this contributes to the higher quality of the network with deep supervision.

As depicted in Figure 7, the output of the first U-Net is responsible for the overall brightness of the output image, besides vaguely representing the blue and red parts. The output of the second U-Net mostly accounts for abnormal regions overlaid on the green parts; in fact, it is responsible for detecting regions of higher or lower elevation in the macula. Finally, the third U-Net is responsible for adding fine details to the output. If the given image does not contain any abnormalities, the outputs of the second and third deep layers are mostly black (e.g. the third example in Figure 7). However, considering the model without deep supervision, there is clearly no meaningful interpretation of the images produced by the first and second U-Nets, which contributes to the lower overall quality of this model. The quantitative comparison in Table 1 also supports our point that deep supervision contributes to finer output. As can be seen, our model with deep supervision achieved a better score on all evaluation metrics.

Figure 7: The effect of deep supervision on the final output. Columns: (a) first layer, (b) second layer, (c) third layer, (d) output, (e) ground truth. As can be seen, the outputs from the first and second layers of the model with deep supervision (w) are meaningful and show the parts on which each layer focuses. On the other hand, without deep supervision (w/o), the images generated by the deep layers contain no useful information, which lowers the output quality.
SSIM LPIPS MSE PSNR(dB)
w/ supervision 0.8823 1.81e-05 0.0033 34.6733
w/o supervision 0.7570 2.54e-04 0.0059 27.2784
Table 1: Quantitative comparison between model trained with supervision and model trained without supervision.

3.6 L1-Loss vs L2-Loss

Even though in most papers the L1-loss is more common than the L2-loss as the pixel reconstruction loss [75, 24, 36], in this work we chose the L2-loss owing to the emphasis that it puts on large differences between the generated image and the ground truth. As a matter of fact, since the differences in the L2-loss are squared, small differences become minuscule and negligible and the focus falls on large differences. This behavior is well suited to this problem, since our primary goal is to predict the regions that have a red or blue color, and it is acceptable to have inaccurate or blurry green areas or missed vessels. This is because the red and blue regions carry significant diagnostic information: they correspond to regions with elevation changes, which are important in the diagnosis of many retinal diseases, such as PED, which cause different parts of the macula to swell. Our claim is supported by an experiment in which we compared the results of the L1-loss and the L2-loss. Note that in this experiment the contributions of the L2-loss and L1-loss were equal, used along with the LSGAN and perceptual losses. As can be seen in Table 2, the L2-loss performed better on all metrics and the difference is considerable.

SSIM LPIPS MSE PSNR(dB)
L1-Loss 0.8721 3.53e-04 0.0072 33.8351
L2-Loss 0.8823 1.81e-05 0.0033 34.6733
Table 2: Quantitative comparison between L2-Loss and L1-Loss.

Furthermore, regarding qualitative comparison in Figure 8, even though the global structure of images considering green areas is roughly the same, L2-Loss performed better at predicting blue regions which is crucial for diagnosis.

Figure 8: Qualitative comparison between the outputs of the L1-loss and the L2-loss: (a) ground truth, (b) L1, (c) L2.

3.7 Comparison with other techniques

Since this is the first method for reconstructing the heightmap of a color fundus image using DNNs, there was no other method to compare our proposed method with directly. Therefore, we compared the results with popular methods that utilize cGANs: pix2pix [24], PAN [40] and MedGAN [36]. The results are given in Figure 9 and Table 3 for qualitative and quantitative comparison, respectively. Pix2pix achieved the worst results, since it does not use deep supervision and is based on the L1-loss. PAN did slightly better on the SSIM and MSE metrics; moreover, there is a large gap between PAN and pix2pix in terms of LPIPS, since PAN uses a perceptual loss in the training procedure. This difference shows the importance and the impact of using perceptual loss for training cGANs. MedGAN was designed specifically for medical image translation, such as translating between CT and PET images. As a result, it performs better than the previous general-purpose methods, but its results are still inferior to our proposed method. In fact, since we use deep supervision and carefully tuned the parameters for this particular problem, we achieve better values on all metrics. Another reason for the better performance of the proposed method is that all the previous methods use the L1-loss as the pixel reconstruction loss, while we use the L2-loss, which, as stated in the previous section, is the superior choice for this particular problem.

Method SSIM PSNR(dB) LPIPS MSE
pix2pix [24] 0.8596 33.9523 2.25e-03 0.0068
PAN [40] 0.8612 33.8512 2.37e-04 0.0053
MedGAN [36] 0.8659 33.2958 5.61e-05 0.0048
Proposed Method 0.8823 34.6733 1.81e-05 0.0033
Table 3: Quantitative comparison between proposed method and other methods.

Considering the qualitative comparison in Figure 9, our proposed method outperformed the others in terms of reconstructing details. As can be seen, pix2pix missed some of the important details in the first and second examples, such as the bright red spots. PAN performed better at reconstructing the highly elevated parts in the second row; however, it failed to reconstruct the correct shape for the third example. Finally, since MedGAN is specially designed for medical tasks, it outperformed the aforementioned methods in terms of output quality, but it was still outperformed by our method, which generated the best quality images.

Figure 9: Qualitative comparison between the proposed method and other methods: (a) ground truth, (b) pix2pix, (c) PAN, (d) MedGAN, (e) proposed method.

3.8 Perceptual studies

As stated before, the main purpose of reconstructing the heightmap of a fundus image is to infer part of the information that OCT devices provide to ophthalmologists, namely the height information. Hence, to judge the fidelity of the reconstructed heightmap, we conducted an experiment in which two experienced ophthalmologists were presented with a series of trials, each containing a reconstructed heightmap, a fundus image and the ground-truth heightmap from the test set. The main purpose of this study is to investigate whether the reconstructed heightmap and fundus image pair provides more information for the diagnosis of retinal diseases than the fundus image alone.

For this experiment, the two ophthalmologists first classified all images into two classes, positive and negative, where positive means that the reconstructed image provides additional information for diagnosis and negative means that it does not. Additionally, the ophthalmologists rated each image from zero to three according to the level of information it provides, where zero means no added information and three represents the highest amount of information for diagnosis.

As can be seen in Table 4, ophthalmologist 1 classified all images as useful for diagnosis, with a mean score of 1.94 over all images. Additionally, ophthalmologist 2 classified 92% of the outputs as positive, with a mean score of 1.84. This study shows that even though the output of our method may appear blurrier than the original heightmap, it can be used for diagnosis and can provide valuable additional information to ophthalmologists, especially about the height in different regions. For instance, diseases such as age-related macular degeneration depend on the swelling of different regions of the macula, and the reconstructed heightmap contains this information.

We also considered the positive samples in isolation; the results are shown in Table 5. As can be seen, both ophthalmologists placed most of the images in score class 2, and the average score is close to 2 for both. This experiment also indicates that in most cases the reconstructed heightmap can provide useful information for diagnosis from a single fundus image.

                     Score                 Classification
                     Mean      SD          Positive (%)
Ophthalmologist 1    1.94      0.7669      100.00
Ophthalmologist 2    1.84      0.9553      92.00
Table 4: Results of the perceptual study.
Score        Ophthalmologist 1        Ophthalmologist 2
             Frequency (%)            Frequency (%)
1            16 (32.00)               15 (30.00)
2            21 (42.00)               16 (32.00)
3            13 (26.00)               15 (30.00)
Mean (SD)    1.94 (0.7669)            2.00 (0.8164)
Table 5: Score distribution of the positive samples in the perceptual study.

Considering the examples classified as positive in Figure 10, the reconstructed heightmap can indicate the absence of elevation change (top-right example) as well as serious elevation changes in different regions (the other examples), depending on the condition of the fundus image. In fact, both the absence of elevation changes and the presence of highly or lowly elevated parts are types of information that cannot be inferred from a single fundus image alone. However, by utilizing our proposed method, ophthalmologists can access this additional information, which can aid them in a better and easier diagnosis with a single color fundus image. Finally, this figure also supports the point made above: even though the reconstructed heightmaps of positive samples may look blurrier than the ground-truth heightmaps, they were classified as positive owing to the information they provide for diagnosis.

Figure 10: Examples of images classified as positive by both ophthalmologists. Each case shows the fundus image, the heightmap reconstructed by the proposed method and the ground-truth heightmap.

4 Conclusion and Discussion

In this paper, we proposed a novel framework to automatically generate a heightmap image of the macula from a color fundus image. For the generator network, we used a stack of three U-Nets and, motivated by deeply supervised networks, averaged the outputs of these U-Nets for deep supervision. We also utilized LSGAN instead of the traditional GAN loss for a more stable training procedure and better results, along with the L2-loss and a perceptual loss, to generate the final outcome.

The experimental results indicate that our proposed method outperformed other image translation and medical image translation methods in terms of the SSIM, PSNR, MSE and LPIPS metrics, as can be seen in Table 3. Furthermore, as depicted in Figure 7, deep supervision contributed greatly to the quality of the final outcome by producing meaningful outputs from the deep layers. This suggests that when dealing with very deep neural networks, it is better to constrain the deep layers to generate features directed toward the final goal of the network. Finally, considering the applications of our proposed method to real diagnosis, the perceptual studies in Tables 4 and 5 indicate that we can infer more information from the reconstructed heightmap, especially in cases in which there are elevation changes in different regions of the macula. By utilizing this height information, we can provide additional information for the diagnosis of diseases that depend on data about elevations, using only a color fundus image and without the need for OCT imaging.

As stated in Section 3.8, despite slight blurriness in the output of our proposed network, it can still be used for diagnosis. However, this work is not free from limitations, and further improvements are essential for its practical applicability to real diagnosis cases. In fact, in some cases, owing to the poor quality of the fundus image, the system cannot extract useful and meaningful information from it, especially when the fundus image is blurred. This suggests that future work should employ a preprocessing step that properly de-blurs fundus images before feeding them into the network, and study its effectiveness. Furthermore, in future work, we will try to utilize other features of the fundus image, besides the features automatically extracted by CNNs, to improve the overall performance and quality of the proposed method.

In future work, we will also utilize images from other regions of the fundus, such as the Optic Nerve Head (ONH), to reconstruct their heightmaps, as this part of the fundus image has many practical applications, and to develop our method into a general solution for heightmap reconstruction. Finally, considering the perceptual studies, which show that our reconstructed heightmap can provide information for diagnosis, future research can utilize the results generated by our network to automatically detect retinal diseases, which was previously impossible using only a single fundus image.

5 Acknowledgment

The authors are grateful to Dr. Ahmadieh (ophthalmologist) for grading and classifying the images for our experiment.

References