Vision is crucial for everyday activities such as driving, watching television, reading and interacting socially, and visual impairment can be a real handicap. Even slight visual loss can considerably affect quality of life, cause depression and, in older adults, lead to accidents, injuries and falls [1, 2]. Blindness is the final stage of many eye diseases and, according to previous studies, the most common causes of visual impairment are cataract, macular degeneration, glaucoma and diabetic retinopathy [3, 4].
Color fundus photography is a 2D imaging modality for the diagnosis of retinal diseases. The 3D structure of the eye provides ophthalmologists with a considerable amount of crucial diagnostic information (such as the elevation of different parts) that is unavailable in 2D fundus images. Therefore, being able to infer this information from just a 2D image can be helpful. Furthermore, the reconstructed heightmap offers clinicians another means of viewing the eye structure, which may help them reach a better and more accurate diagnosis [59, 60]. Optical Coherence Tomography (OCT) is an expensive but vital tool for evaluating the retinal structure; it provides ophthalmologists with valuable information, enabling them to diagnose most macular diseases. Nevertheless, owing to the cost of these systems, OCT devices are not ubiquitous, and fundus images remain the most common modality for screening.
Shape from shading (SFS) is the only method that has been applied to reconstructing the height of a single fundus image. However, this method suffers from drawbacks that limit its usage in this particular task. In fact, the success of SFS depends entirely on predicting the position of the light source and on assumptions about the surface that do not hold for the retina. Furthermore, disparity map estimation, one of the common methods for 3D reconstruction, has also been applied to this problem. Nonetheless, it depends entirely on the availability of stereo images from both eyes, which is not practical. Hence, devising a method to automatically generate an accurate heightmap image from a given fundus image is crucial.
In recent years, with the advent of Conditional Generative Adversarial Networks (cGANs) [22, 23], many researchers have used this methodology for image generation and transformation tasks [24, 39, 40, 76, 77, 80]. Medical image analysis has also benefited greatly from these powerful models: researchers have used them for translating between different medical images, such as CT and PET images [74, 36], for denoising CT images or correcting motion in MR images, and for segmenting medical images [37, 78, 79]. Most of these methods build on the U-Net architecture, which was first proposed for image segmentation, and its extension U-Net++. Perceptual loss [39, 40]
Motivated by the promising results of cGANs on tasks relating to the analysis of medical images, in this work we applied this method to generate a heightmap image from a given color fundus image targeted on the macula area. In fact, height is one of the crucial pieces of information that OCT devices provide to ophthalmologists, and it is missing in color fundus images due to their 2D nature. Hence, by extracting such information from only a fundus image, we can ease the diagnosis and management of retinal diseases with macular thickness changes.
Considering Figure 1, our problem can be seen as an image translation task in which we want to predict a color image containing height data from a fundus image targeted on the macula area, so cGANs can be utilized for it. That is to say, each pixel in the right image of Figure 1 has a color value that represents a height from 0 to 500, and by predicting red, green and blue color values for each pixel of the left image (the fundus image), we can predict its heightmap. The color bar below Figure 1 shows the assignment of color values to heights.
In this paper, we used a stack of three U-Nets as our generator network and averaged their outputs for deep supervision. Furthermore, in order to avoid the problems of traditional GANs, we used the least-squares adversarial loss instead, along with perceptual loss and L2-loss as the pixel reconstruction loss. For the discriminator network, we used an image-level discriminator that classifies the whole image as real or fake. To the best of our knowledge, this is the first research paper on predicting the heightmap of the macula area on fundus images using DNNs. We evaluated our approach qualitatively and quantitatively on our dataset and compared the results with state-of-the-art methods in image translation and medical image translation. Furthermore, we studied the application of our method to real diagnosis cases, which showed that reconstructed heightmaps can provide additional information to ophthalmologists and can be used for the analysis of the macula region.
Our main contributions can be listed as follows:
Motivated by deeply supervised networks, we propose a novel deep architecture for the generator network based on U-Net and CasNet, which consists of a number of connected U-Nets whose individual outputs are each used for deep supervision.
We propose the first method for the reconstruction of the heightmap of the macula image based on DNNs.
Finally, the subjective performance of our reconstructed heightmap was investigated from a medical perspective by two experienced ophthalmologists.
A DNN is capable of learning from raw image data, but it can learn more easily and efficiently if appropriate preprocessing is applied to the input. Hence, in this paper, we first apply Contrast Limited Adaptive Histogram Equalization (CLAHE), which enhances the contrast between foreground and background. Afterward, we normalize the input images (division by 255) to prepare them for feeding into the network. The impact of preprocessing on the input fundus images is depicted in Figure 3.
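The two preprocessing steps can be illustrated with a small NumPy sketch. It is a simplification rather than the pipeline used in this work: it performs a single global contrast-limited histogram equalization per channel, whereas CLAHE proper equalizes local tiles and interpolates between them. The function names, the per-channel treatment and the `clip_limit` default are assumptions made for illustration.

```python
import numpy as np

def clipped_hist_equalize(channel: np.ndarray, clip_limit: float = 2.0) -> np.ndarray:
    """Global contrast-limited histogram equalization of one uint8 channel.

    Simplification: CLAHE does this per tile and blends neighbouring tiles;
    here a single histogram covers the whole channel.
    """
    hist = np.bincount(channel.ravel(), minlength=256).astype(np.float64)
    # Clip every bin at clip_limit times the mean bin count (assumed default).
    ceiling = clip_limit * hist.mean()
    excess = np.sum(np.maximum(hist - ceiling, 0.0))
    # Redistribute the clipped mass uniformly over all 256 bins.
    hist = np.minimum(hist, ceiling) + excess / 256.0
    cdf = np.cumsum(hist)
    cdf /= cdf[-1]
    lookup = np.round(255.0 * cdf).astype(np.uint8)
    return lookup[channel]

def preprocess(image_u8: np.ndarray) -> np.ndarray:
    """Equalize each channel, then scale to [0, 1] by dividing by 255."""
    channels = [clipped_hist_equalize(image_u8[..., c])
                for c in range(image_u8.shape[-1])]
    return np.stack(channels, axis=-1).astype(np.float32) / 255.0
```

Clipping the histogram before equalization limits how strongly any narrow intensity range can be stretched, which is the "contrast limited" part of the technique.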
2.2 Network structure
In our proposed cGAN setting, the input to our generator is a 128×128×3 image of the macula area of a fundus image, and the generator produces an image of the same size and depth in which each pixel's color indicates a height, as depicted in Figure 1. The discriminator takes this image and outputs a probability between 0 and 1 that indicates the similarity of this image to a real heightmap image.
Our proposed generator architecture consists of three stacked U-Nets. Motivated by deeply supervised nets, which state that discriminative features in deep layers contribute to better performance, we used the output of the first two U-Net layers for deep supervision. In fact, similar to the U-Net++ architecture, which uses the output of the upsampling layers for deep supervision, we averaged the outputs of all U-Net layers and used the result as the final output of the generator network. By doing so, our network tries to learn a meaningful representation for these deep layers, which directly contributes to the final outcome. Moreover, since the amount of detail is significant in this work, we decided to use the average operator instead of the max operator for the final output. Another advantage of this architecture is that, by using a stack of U-Nets, each layer can add its own level of detail to the final outcome. Furthermore, although our network is deeper than a normal U-Net architecture, owing to the skip connections and deep supervision involved in the architecture, the vanishing gradient problem is avoided and the loss gradient flows easily to the earlier layers through backpropagation. The generator architecture is depicted in the first row of Figure 3.
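The averaging of the stacked U-Net outputs can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual network: each U-Net is represented by an arbitrary callable, each stage is assumed to receive the previous stage's output, and the name `stacked_generator` is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

Image = np.ndarray
UNet = Callable[[Image], Image]  # stand-in for a trained U-Net

def stacked_generator(x: Image, unets: Sequence[UNet]) -> Tuple[Image, List[Image]]:
    """Run the U-Nets in sequence and average all of their outputs."""
    outputs: List[Image] = []
    current = x
    for unet in unets:
        # Assumption: each stage takes the previous stage's output as input.
        current = unet(current)
        outputs.append(current)
    # The final heightmap is the mean of every stage's output, so each
    # stage contributes directly to the result (deep supervision).
    final = np.mean(np.stack(outputs, axis=0), axis=0)
    return final, outputs

# Toy stand-ins for three U-Nets acting on a 128x128x3 input.
toy_unets = [lambda t: t + 1.0, lambda t: t * 2.0, lambda t: t - 3.0]
final, stage_outputs = stacked_generator(np.zeros((128, 128, 3)), toy_unets)
```

Because the final output is a mean over stages, a loss on it has a direct path to every stage, which is one way to read the claim that gradients reach the earlier U-Nets easily.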
Regarding the discriminator, the judgment can be made at the image level as well as at the patch level. That is to say, the discriminator can judge the quality of the entire image (ImageGAN) or of its patches (PatchGAN). Since a powerful discriminator is the key to successful training with GANs and strongly influences the output [22, 27], we explored both of these methods and opted for the image-level discriminator because it produced better-quality images. Furthermore, as can be seen in the second row of Figure 3, we used hidden layers of the discriminator network to compute the perceptual loss between the generated image and the ground-truth image as a supervisory signal, with the aim of improving the output.
2.3 Objective functions
Our final loss function is composed of three parts which will be discussed in this section.
2.3.1 Least-squares adversarial loss
cGANs are generative models that learn a mapping from an observed image $x$ and a random noise vector $z$ to an output image $y$ using a generator $G$. The discriminator $D$ then aims to classify the concatenation of the source image $x$ and its corresponding ground-truth image $y$ as real, while classifying $x$ paired with the transformed image $G(x, z)$ as fake.
Despite performing well in many tasks, GANs suffer from problems such as mode collapse and unstable training. Therefore, to avoid such problems, in this work we adopted Least Squares Generative Adversarial Networks (LSGANs). The idea of LSGAN is that even samples on the correct side of the decision boundary can provide signals for training. To this end, we adopted the least-squares loss function instead of the traditional cross-entropy loss used in a normal GAN, in order to penalize data that lie on the correct side of the decision boundary but very far from it. Using this simple yet effective idea, we can provide gradients even for samples that are correctly classified by the discriminator. The loss function of LSGAN for both discriminator and generator can be written as follows:
This loss function operates directly on the logits of the output.
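As a hedged illustration of the least-squares idea, the sketch below follows the standard LSGAN objective with target labels `a` (generated), `b` (real) and `c` (what the generator wants the discriminator to output). The defaults `a=0`, `b=1`, `c=1` are the conventional 0/1 coding and are an assumption here, as are the function names.

```python
import numpy as np

def lsgan_discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray,
                             a: float = 0.0, b: float = 1.0) -> float:
    """Push D's output toward b on real pairs and toward a on generated pairs."""
    return float(0.5 * np.mean((d_real - b) ** 2)
                 + 0.5 * np.mean((d_fake - a) ** 2))

def lsgan_generator_loss(d_fake: np.ndarray, c: float = 1.0) -> float:
    """Push D's output on generated pairs toward c."""
    return float(0.5 * np.mean((d_fake - c) ** 2))

# A generated sample scored 3.0 is already judged "real" by the
# discriminator, yet it is still penalized for lying far from the target.
far_but_correct = lsgan_generator_loss(np.array([3.0]))
on_target = lsgan_generator_loss(np.array([1.0]))
```

With a sigmoid cross-entropy loss, a score of 3.0 would yield an almost vanishing gradient for the generator; the quadratic penalty keeps the gradient proportional to the distance from the target.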
2.3.2 Pixel reconstruction loss
Image-to-image translation methods that rely solely on the adversarial loss function do not produce consistent results. Therefore, we also used a pixel reconstruction loss, but we opted for L2-loss rather than the widely used L1-loss since it performed better in reconstructing details in this specific task. The L2-loss between the ground-truth image $y$ and the generated image $G(x, z)$ is given below:
\[ \mathcal{L}_{L2}(G) = \mathbb{E}_{x,y,z}\left[ \lVert y - G(x, z) \rVert_2^2 \right] \qquad (2) \]
2.3.3 Perceptual loss
Although the two aforementioned loss functions alone produce plausible results, the generated images are blurry, and since small details matter in medical diagnosis, we used perceptual loss to improve the final result. As a matter of fact, using only L2-loss or L1-loss yields outputs that maintain the global structure but show blurriness and distortions. Furthermore, per-pixel losses fail to capture perceptual differences between output and ground-truth images. For instance, for two identical images shifted by a small offset from each other, the per-pixel loss may be large even though the images are nearly the same. However, by using high-level features extracted from the layers of a discriminator, we can capture those discrepancies and measure image similarity more robustly. In our work, since the discriminator network is also able to perceive the content of images and the differences between them, and since networks pre-trained on other tasks may perform poorly on unrelated tasks, we used hidden layers of the discriminator network [40, 36] to extract features, as illustrated in the second row of Figure 3. The mean absolute error for hidden layer $i$ between the generated image $G(x, z)$ and the ground-truth image $y$ is then calculated as:
\[ P_i\big(G(x, z), y\big) = \frac{1}{w_i h_i d_i} \big\lVert D_i\big(x, G(x, z)\big) - D_i(x, y) \big\rVert_1 \qquad (3) \]
where $w_i$, $h_i$ and $d_i$ denote the width, height and depth of hidden layer $i$, respectively, and $D_i$ denotes the output of layer $i$ of the discriminator network. Finally, perceptual loss can be formulated as:
\[ \mathcal{L}_{perceptual} = \sum_{i} \lambda_i P_i\big(G(x, z), y\big) \qquad (4) \]
where $\lambda_i$ in Equation 4 tunes the contribution of hidden layer $i$ to the final loss.
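Finally, our complete loss function for the generator is as below:
\[ \mathcal{L}_{G} = \alpha_1 \mathcal{L}_{LSGAN}(G) + \alpha_2 \mathcal{L}_{perceptual} + \alpha_3 \mathcal{L}_{L2} \qquad (5) \]
where $\alpha_1$, $\alpha_2$ and $\alpha_3$ are the hyperparameters that balance the contribution of each of the losses. Note that, in addition to the traditional cGAN loss, we also used perceptual loss in training the discriminator, with equal contribution.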
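The way the three terms combine can be sketched in NumPy. This is an illustrative reading of the section, not the training code: the feature maps stand in for discriminator hidden-layer outputs, the L2 term is taken as a mean squared error, and all names and weight values (`layer_weights`, `w_adv`, `w_perc`, `w_l2`) are hypothetical.

```python
from typing import Sequence

import numpy as np

def layer_mae(feat_fake: np.ndarray, feat_real: np.ndarray) -> float:
    """Mean absolute error over one hidden layer's w x h x d feature map."""
    return float(np.mean(np.abs(feat_fake - feat_real)))

def perceptual_loss(feats_fake: Sequence[np.ndarray],
                    feats_real: Sequence[np.ndarray],
                    layer_weights: Sequence[float]) -> float:
    """Weighted sum of per-layer feature errors."""
    return sum(w * layer_mae(f, r)
               for w, f, r in zip(layer_weights, feats_fake, feats_real))

def l2_loss(fake: np.ndarray, real: np.ndarray) -> float:
    """Pixel reconstruction loss, here taken as the mean squared error."""
    return float(np.mean((fake - real) ** 2))

def generator_loss(adv: float, perc: float, l2: float,
                   w_adv: float = 1.0, w_perc: float = 1.0,
                   w_l2: float = 1.0) -> float:
    """Weighted combination of adversarial, perceptual and pixel terms."""
    return w_adv * adv + w_perc * perc + w_l2 * l2

# Toy feature maps for two hypothetical discriminator layers.
feats_fake = [np.ones((4, 4, 2)), np.zeros((2, 2, 8))]
feats_real = [np.zeros((4, 4, 2)), np.zeros((2, 2, 8))]
perc = perceptual_loss(feats_fake, feats_real, layer_weights=[0.5, 2.0])
pix = l2_loss(np.full((8, 8, 3), 0.5), np.zeros((8, 8, 3)))
total = generator_loss(adv=2.0, perc=perc, l2=pix)
```

Taking the mean over a feature map is the same as dividing the summed absolute differences by width × height × depth for a single sample.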
The dataset was gathered from a TopCon DRI OCT Triton device at Negah Eye Hospital. We cropped the macula part of the fundus and heightmap images from the 3D macula report generated by the device to create image pairs. Our dataset contains 3407 color fundus-heightmap image pairs. Since this number of images was insufficient, we used data augmentation for better generalization. Nevertheless, because we are dealing with medical images, we could not rotate them: rotating an image by, for example, 90° would place the vessels in a vertical position, which is impossible in fundus imaging. Moreover, changing brightness is also not permissible, since it would alter the standard brightness of a fundus image. Hence, the only augmentation that we could apply was to flip images in 3 different ways, generating 4 different samples from every image. Consequently, we had 13,628 images, of which we used 80% for training, 10% for validation and 10% for testing. Some examples from the dataset are illustrated in Figure 4.
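The flip-only augmentation can be sketched as below. The text states only that images were flipped in 3 different ways; taking these to be the horizontal flip, the vertical flip and both combined is an assumption, and `flip_augment` is a hypothetical name. The same flip is applied to the fundus image and its heightmap so that the pair stays aligned.

```python
from typing import List, Tuple

import numpy as np

def flip_augment(fundus: np.ndarray,
                 heightmap: np.ndarray) -> List[Tuple[np.ndarray, np.ndarray]]:
    """Return the original pair plus three flipped versions of it."""
    flips = [
        lambda im: im,               # original
        lambda im: im[:, ::-1],      # horizontal flip (assumed)
        lambda im: im[::-1, :],      # vertical flip (assumed)
        lambda im: im[::-1, ::-1],   # both flips (assumed)
    ]
    return [(f(fundus), f(heightmap)) for f in flips]

# Toy 128x128x3 pair; real inputs would be the cropped macula images.
fundus = np.arange(128 * 128 * 3, dtype=np.float32).reshape(128, 128, 3)
heightmap = fundus[..., ::-1].copy()
pairs = flip_augment(fundus, heightmap)
total_images = 3407 * len(pairs)
```

Four samples per original pair account for the dataset size quoted above, since 3407 × 4 = 13,628.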
3.2 Experimental setup
We used TensorFlow 2.0 to implement our network. We used the Adam optimizer with a step decay of the learning rate. Moreover, we used a batch size of 8 and trained for 250 epochs until convergence. Additionally, we set the weights in Equation 4 by trial and error, considering the contribution of each of them as discussed in [40, 39]. Since we are working in a very high-dimensional parameter space, convergence and finding the optimal weights are difficult, and starting from a random point does not work very well. As a result, in order to ease training and convergence, we employed a step-by-step training scheme. That is to say, we first trained the first U-Net completely, then added the next U-Net and trained the stack of two U-Nets with deep supervision, and finally trained the entire network. Note that each time we trained a new stack, we initialized it with the weights of the previously trained network.
3.3 Evaluation metrics
In this work, we utilized a variety of metrics to evaluate our final outcomes quantitatively. We measured the quality of the final image using the Structural Similarity Index (SSIM), Mean Squared Error (MSE) and Peak Signal to Noise Ratio (PSNR). Nevertheless, these measures are insufficient for assessing structured outputs such as images, as they assume pixel-wise independence. Consequently, we also used Learned Perceptual Image Patch Similarity (LPIPS), which can outperform other measures in terms of comparing the perceptual quality of images. We extracted features from hidden layers of the discriminator network and calculated the difference between the generated heightmap and the ground-truth heightmap for a given input using the equation below, in which all parameters are as in Equation 3:
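Separately from the feature-based measure, the two simplest pixel-wise metrics named in this section, MSE and PSNR, can be sketched in NumPy. The definitions are the standard ones; the `max_value` of 255 assumes 8-bit images, and the function names are illustrative.

```python
import numpy as np

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error, computed in float64 to avoid uint8 wrap-around."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(pred: np.ndarray, target: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / err))

# Toy example: a uniform error of 16 grey levels on an 8-bit image.
target = np.zeros((128, 128, 3), dtype=np.uint8)
pred = np.full((128, 128, 3), 16, dtype=np.uint8)
example_mse = mse(pred, target)
example_psnr = psnr(pred, target)
```

Lower MSE and higher PSNR indicate a closer pixel-wise match, which is why the two always move in opposite directions in the comparisons that follow.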
3.4 Analysis of generator architecture
In this part, we explore different numbers of stacked U-Nets in the generator architecture to find the optimum. For this experiment, we set the hyperparameters of Equation 5 and used ImageGAN for our discriminator. The quantitative comparison is made in Figure 5. As can be seen, stacking three U-Nets resulted in higher values for SSIM and PSNR and lower values for MSE and LPIPS. Furthermore, the qualitative comparison depicted in Figure 6 also supports our claim that a stack of three U-Nets is the best choice. As a matter of fact, the generator with three U-Nets in this figure did well at predicting the full shape of the red region as well as the correct position and full shape of the intense red spots. Additionally, it seems that adding more U-Nets to the structure makes the results blurrier and causes details to vanish. Therefore, three is the optimum number of U-Nets: it preserves fine details and produces plausible outcomes.
3.5 Effectiveness of deep supervision
In this section, we explore the effectiveness of using deep supervision. For this experiment, we trained our network with and without the use of deep supervision. As can be seen in Figure 7, the output of the deep layers when we used deep supervision is meaningful. In fact, by averaging the output of each of the layers, we force the network to output plausible images from these layers and this contributes to the higher quality of the network with deep supervision.
As depicted in Figure 7, the output of the first U-Net layer is responsible for the overall brightness of the output image, besides vaguely representing the blue and red parts. The output of the second U-Net is mostly used for abnormal regions that are overlaid on the green parts. In fact, it is responsible for detecting higher or lower elevated parts in the macula region. Finally, the third layer is responsible for adding fine details to the output. If the given image does not contain any abnormalities, the output of the second and third deep layers is mostly black (e.g. the third example in Figure 7). However, for the model without deep supervision, there is clearly no meaningful interpretation of the images output by the first and second layers, and this contributed to the lower overall quality of this model. The quantitative comparisons in Table 1 also support our point that deep supervision contributes to finer output. As can be seen, our model with deep supervision achieved a better score in all evaluation metrics.
3.6 L1-Loss vs L2-Loss
Even though L1-loss is more common than L2-loss as the pixel reconstruction loss in most papers [75, 24, 36], in this work we chose L2-loss owing to the emphasis that it puts on large differences between the generated image and the ground truth. As a matter of fact, since L2-loss squares the difference, small differences contribute very little and the focus falls on large differences. This behavior is well suited to this problem, since our main goal is to predict regions that have a red or blue color, and it is acceptable to have inaccurate or blurry green areas or missed vessels. This is because those red or blue regions contain significant information for diagnosis: they correspond to regions with elevation changes, which are important in the diagnosis of many retinal diseases, such as PED, that cause different parts of the macula to swell. Our claim is supported by an experiment in which we compared the results of L1-loss and L2-loss. Note that in this experiment the L2-loss and L1-loss terms had equal weights and were used along with the LSGAN and perceptual losses. As can be seen in Table 2, L2-loss performed better in all metrics, and the difference is considerable.
Furthermore, regarding qualitative comparison in Figure 8, even though the global structure of images considering green areas is roughly the same, L2-Loss performed better at predicting blue regions which is crucial for diagnosis.
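The emphasis that L2-loss places on large differences can be seen in a small worked example. The two error maps below are hypothetical and were chosen to have the same total absolute error: one spreads a small error over every pixel, the other concentrates a large error in a few pixels, as a missed red or blue region would.

```python
import numpy as np

def l1_loss(error: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(error)))

def l2_loss(error: np.ndarray) -> float:
    """Mean squared error."""
    return float(np.mean(error ** 2))

# 100 pixels, total absolute error of 2.0 in both cases.
spread_out = np.full(100, 0.02)   # slightly wrong everywhere
concentrated = np.zeros(100)
concentrated[:4] = 0.5            # badly wrong in a small region

l1_spread, l1_conc = l1_loss(spread_out), l1_loss(concentrated)
l2_spread, l2_conc = l2_loss(spread_out), l2_loss(concentrated)
ratio = l2_conc / l2_spread
```

L1-loss cannot tell the two cases apart (both give 0.02), whereas L2-loss assigns the concentrated error a penalty about 25 times larger (0.01 against 0.0004), so a generator trained with it is pushed harder to fix localized, large deviations.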
3.7 Comparison with other techniques
Since this is the first method for reconstructing the heightmap of color fundus images using DNNs, there was no other method with which to compare our proposed method directly. Therefore, we compared the results with popular methods that utilize cGANs, namely pix2pix, PAN and MedGAN. The results are given in Figure 9 and Table 3 for qualitative and quantitative comparison, respectively. Pix2pix achieved the worst results, since it does not use deep supervision and is based on L1-loss. PAN did slightly better in the SSIM and MSE metrics. However, there is a large difference between PAN and pix2pix in terms of LPIPS, since PAN uses perceptual loss in the training procedure. This difference demonstrates the importance and impact of using perceptual loss for training cGANs. MedGAN was designed especially for medical image translation, such as translating between CT and PET images. As a result, it performs better than the previous general methods, but its results are inferior to those of our proposed method. In fact, since we use deep supervision and carefully tuned the parameters for this particular problem, we achieve better values in all metrics. Another explanation for the better values of the proposed method is that all the previous methods used L1-loss as the pixel reconstruction loss for training, while we used L2-loss, which, as stated in the previous section, is the superior choice for this particular problem.
Considering the qualitative comparison in Figure 9, our proposed method outperformed the others in terms of reconstruction of details. As can be seen, pix2pix missed some of the important details in the first and second examples, such as bright red spots. PAN performed better at reconstructing the highly elevated parts in the second row; however, it failed to reconstruct the correct shape in the third example. Finally, since MedGAN is specially designed for medical tasks, it outperformed the aforementioned methods in terms of output quality, but it was in turn outperformed by the proposed method, which generated the best-quality images.
3.8 Perceptual studies
As stated before, the main purpose of reconstructing the heightmap of fundus images is to be able to infer part of the information that OCT devices provide to ophthalmologists, namely the height information. Hence, to judge the fidelity of the reconstructed heightmap, in this section we conducted an experiment in which two experienced ophthalmologists were presented with a series of trials, each containing the reconstructed heightmap, the fundus image and the ground-truth heightmap from the test set. The main purpose of this study is to investigate whether the pair of reconstructed heightmap and fundus image gives more information for the diagnosis of any retinal disease than the fundus image alone.
For this experiment, the two ophthalmologists first classified all images into two classes, positive and negative, where positive means that the image provides more information for diagnosis and negative means that it does not. Additionally, the ophthalmologists rated each image from zero to three according to the level of information that it provides, such that zero means no added information and three represents the highest amount of information for diagnosis.
As can be seen in Table 4, ophthalmologist 1 classified all images as useful for diagnosis, and the mean score over all of the images is 1.94. Additionally, ophthalmologist 2 classified 92 of the outputs as positive, and the mean score is 1.84. This study shows that even though the output of our method may seem blurrier than the original, these outputs can be used for diagnosis and can provide valuable additional information to ophthalmologists, especially about height in different regions. For instance, diseases such as age-related macular degeneration depend on the swelling of different regions of the macula, and the reconstructed heightmap contains this information.
We also considered the positive samples in isolation and the results are shown in Table 5. As can be seen, both ophthalmologists classified most of the images in class 2 and the average score is near 2 for both ophthalmologists. This experiment also indicates that in most of the cases the reconstructed heightmap can provide useful information for diagnosis from a single fundus image.
Considering the examples classified as positive in Figure 10, the reconstructed heightmap can indicate the lack of elevation change (top right example) as well as serious elevation changes in different regions (other examples), depending on the condition of the fundus image. In fact, both the lack of elevation changes and the presence of high or low elevated parts are types of information that cannot be inferred solely from a single fundus image. However, with our proposed method, ophthalmologists gain access to this additional information, which can aid them in making a better and easier diagnosis from a single color fundus image. Finally, this figure also supports the point made in the previous paragraph: even though the reconstructed heightmaps in positive samples may seem blurrier than the ground-truth heightmaps, they were classified as positive owing to the information that they provide for diagnosis.
4 Conclusion and Discussion
In this paper, we proposed a novel framework to automatically generate a heightmap image of the macula from a color fundus image. For the generator network, we used a stack of three U-Nets and, motivated by deeply supervised networks, we averaged the outputs of these U-Net layers for deep supervision. We also utilized LSGAN instead of the traditional GAN loss for a stable training procedure and better results, along with L2-loss and perceptual loss, to generate the final outcome.
The experimental results indicate that our proposed method outperformed other methods for image translation and medical image translation in terms of the SSIM, PSNR, MSE and LPIPS metrics, as can be seen in Table 3. Furthermore, as depicted in Figure 7, deep supervision contributed greatly to the quality of the final outcome by producing meaningful outputs from deep layers. This suggests that, when dealing with very deep neural networks, it is better to constrain deep layers to generate features that serve the final goal of the network. Finally, considering the applications of our proposed method in real diagnosis, the perceptual studies in Tables 4 and 5 indicate that we can infer more information from the reconstructed heightmap, especially in cases in which there are elevation changes in different regions of the macula. By utilizing this height information, we can provide additional information for the diagnosis of diseases that depend on elevation data, using only a color fundus image and without the need for OCT images.
As stated in Section 3.8, despite slight blurriness in the output of our proposed network, it can still be used for diagnosis. However, this work is not free from limitations, and further improvements are essential for its practical applicability in real diagnosis cases. In fact, in some cases, owing to the poor quality of the fundus image, the system cannot extract useful and meaningful information from it, especially when the fundus image is blurred. This suggests that future work should employ a pre-processing step that properly de-blurs fundus images before feeding them into the network, and should study its effectiveness. Furthermore, in future work we will try to utilize other features of the fundus image, besides the features automatically extracted by CNNs, to improve the overall performance and quality of the proposed method.
In future work, we will also utilize images from other regions of the fundus image, such as the Optic Nerve Head (ONH), to reconstruct their heightmaps, as this part of the fundus image has many practical applications, and to develop our method into a general solution for heightmap reconstruction. Finally, considering the perceptual studies, which show that our reconstructed heightmap can provide information for diagnosis, future research can use the results generated by our network to detect retinal diseases automatically, something that was previously impossible using only a single fundus image.
The authors are grateful to Dr. Ahmadieh (Ophthalmologist) for grading and classifying images for our experiment.