Medical imaging is one of the great pillars of modern diagnostics. Clinicians rely on it to obtain information from inside the patient’s body in a non-invasive way. However, noise in the images erodes their quality and makes their interpretation difficult. Moreover, it can cause algorithms, designed to automatically extract measurements from those images, to be inaccurate or fail outright. In this paper, we focus on the domain of retinal optical coherence tomography (OCT), a standard diagnostic tool in ophthalmology. Retinal OCT produces a series of 2D slices (b-scans) that display the depth profile of the retina, thus enabling clinicians to detect many sight-threatening diseases early in their progression. The dominating type of noise in OCT is called speckle. The speckle noise pattern depends on the imaged tissue and is highly sensitive to its position and orientation. Since signal and speckle noise originate from the same physical process, distinguishing signal from noise is particularly challenging. Interested readers are referred to  for more details.
, neither incorporate knowledge about the OCT process nor about structures of the human eye. We argue that such knowledge should help in this task, given the complex and sample-dependent nature of speckle noise. On the other hand, methods emerging from the field of deep learning[14, 4] have demonstrated precisely this ability, i.e. to learn the semantic characteristics of their input domains. We, therefore, aim to leverage deep learning to create a method that can denoise retinal scans by utilizing knowledge it has gained about this domain.
While writing this paper, we discovered recent work from Halupka et al.  and Huang et al. , in which they investigated a different GAN-based approach to retinal OCT denoising. Their approaches require paired training images, which can lead to problems with inaccurately registered images. Additionally, in their works, the denoised domain is constructed by registering and averaging samples from the noisy domain. Constructing denoised samples in this manner is not always feasible or possible and registration of images from different domains will likely not work well.
Our approach casts denoising as a domain translation problem. We demonstrate that, with some modifications, the cycleGAN model, introduced by Zhu et al. , can learn a mapping between a low and high noise domain from unpaired training data.
Initially, we started by directly applying the cycleGAN model to the problem of learning a mapping between images of a high noise (HN) domain and a low noise (LN) one . However, we soon discovered that this model does not perform well on our problem as is. Therefore, we made some modifications to the existing cycleGAN framework and developed our final model, the Hybrid Discriminator cycleGAN (HDcycleGAN). Fig. 1 shows a pass through our model, starting from an HN image. In the following, we briefly summarize the required knowledge about the cycleGAN and highlight the changes we made and why we made them.
The cycleGAN combines two Generative Adversarial Networks (GANs) 
into one two-way Autoencoder. Here, the generator of each GAN learns the mapping from one image domain to the other. In combination, they act like encoder and decoder of an Autoencoder. This framework allows two directions of traversal; going from domain one to domain two and back to domain one or vice versa. The paper also introduced the cycle consistency loss, which corresponds to the reconstruction loss in the standard Autoencoder setting. The goal of this loss function is to achieve consistency when transforming an image from one domain to the other and back. The generators in the cycleGAN down-sample the input image using strided convolutions, pass it through a series of residual blocks
and finally use fractional-strided convolutions for up-sampling. The discriminators down-sample their inputs through strided convolutions to produce a scalar output.
Using a cycleGAN-based approach allows us to train the network on unpaired images. In this way, registration of images becomes obsolete and we can avoid uncertainties that arise due to interpolation in affine transformation or in cases of mismatch between the images. An additional benefit of this framework is the cycle-consistency loss; although we are primarily interested in the mapping from HN images to LN, this added loss function provides a training signal to the network that is more stable than that of the discriminator alone.
We made three adjustments to the original cycleGAN model:
In our first experiments, the vanilla cycleGAN generated blurry images. The sharpness of the image and clarity of visual features with small spatial extent play a crucial role when it comes to image quality. To address this problem, we added skip connections to the generators, which concatenate the output of each down-sampling layer to the input of the corresponding up-sampling layer.
Even after the first two modifications and testing different hyper-parameters, the model failed to consistently improve image quality when mapping from HN to LN. We then noticed that both discriminators learn the characteristics of real OCT b-scans independently. The two image domains are almost identical in terms of image content. Consequently, the discriminators could not pick up on the subtle differences between the domains (see Fig. 0.A.1 in the appendix). As a remedy, we utilized a single discriminator that is shared between both generators. The discriminator can thus focus on the differences between the two domains instead of the full range of characteristics of each. This change resulted in the most significant improvement in visual quality of the generated images.
This shared discriminator acts as a three-way classifier, outputting the class probabilities for real HN, real LN and fake. As the discriminator now has to discriminate between more samples, we increased its complexity by adding a residual block with two convolutions in between each down-sampling layer.
The loss function of our model can thus be written as follows: Let and denote the generators that learn a mapping from LN to HN and from HN to LN respectively and the discriminator. Let
Here and are hyper-parameters for weighting discriminator and cycle-consistency loss respectively. For our model, we set and following . Our implementation of the described methodology is publicly available at github.com/IljaManakov/HDcycleGAN. We also provide implementation details in Tables 0.A.1 and 0.A.2 in the appendix.
3 Experiments and Results
After training the HDcycleGAN for 245 epochs with an Adam optimizer and a learning rate of, we performed both quantitative and qualitative analyses on the test set, which we explain in sections 3.2 and 3.3. In the quantitative analysis, we compared our approach to popular denoising methods using several measurements of similarity between real LN images and denoised ones. For the qualitative analysis, the similarity between real LN images and images produced by BM3D , wavelet denoising  and our method was assessed by three ophthalmologists independently. Finally, in section 3.4, we inspect the learned feature maps of the LN generator. We start by describing our dataset.
We acquired the data for this task in-house, using a SPECTRALIS OCT+HRA from Heidelberg Engineering, as part of the general diagnostic workflow for macular diseases. We did not select patients based on any further traits. As such, the scans in the dataset show various kinds of diseases in all stages and are representative of the typical imaging data generated at our hospital. To gather the images belonging to the high noise domain, we followed the hospital protocol, using 30°ART Volume acquisition with 12 frames averaged for each b-scan. The low noise domain consists of acquisitions that follow the same protocol except that we set the number of averaged frames to 60. We obtained both HN and LN images from the same patients on the same visit. As the proprietary software of the device manufacturer handles the frame averaging, we did not have access to the individual frames. In total, we gathered 23030 b-scans in 470 volumes from 235 patients for each noise domain. We used 90% of the volumes for training and the remaining 10% for testing. Before passing the images through our model, we scaled the 496x512 images to a pixel intensity range between 0.0 and 1.0.
3.2 Quantitative Evaluation
To asses our model’s performance, we evaluated the similarity between the generated images and the ground truth LN images in the test set. Since we acquired HN and LN scans pairwise, we registered the images employing a registration algorithm based on discrete Fourier-transform. After registration, we calculated the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) between the two images. Additionally, we used the Marching Cubes algorithm  to find the contour of the retina. Inverting the selection yields a background mask, while reapplying Marching Cubes on the retinal layers with a different level finds contours in highly reflective parts of the retina. We designated these regions as signal. This process is illustrated in Fig. 0.A.3
in the appendix. Using the signal and background regions, we then calculated the mean-to-standard-deviation ratio (MSR) and contrast-to-noise ratio (CNR). To better gauge the performance of our approach, we included median filtering, wavelet denoising, bilateral filtering , non-local means  and BM3D  in the comparison. The results are displayed in Table 1.
|raw||3.66 2.21||3.96 1.73||21.99 1.33||0.662 0.055|
|median||3.82 2.36||4.25 1.92||22.32 1.45||0.682 0.051|
|wavelet ||3.81 2.37||4.23 1.86||22.34 1.41||0.690 0.053|
|bilateral ||3.78 2.33||4.28 1.93||22.29 1.40||0.690 0.053|
|nl-means ||3.78 2.33||4.43 2.12||22.32 1.40||0.702 0.051|
|BM3D ||3.87 2.44||4.39 1.97||22.50 1.45||0.708 0.052|
|ours||4.00 2.51||4.73 2.23||22.58 1.41||0.706 0.050|
We can see that our model outperforms the other methods in all measurements except SSIM, where BM3D is slightly ahead. Overall we find that the performance of BM3D and our model is very close in inter-image measurements (SSIM and PSNR). In intra-image measurements (CNR and MSR) the margin between our approach and the others widens. It is also worth noting that our algorithm requires 30% less time to run on CPU than BM3D and beats all other algorithms by almost an order of magnitude on a low-end GPU (see Fig. 0.A.2 in the appendix). Although PSNR, SSIM, MSR and CNR are standard metrics of image quality, there is a caveat to these results; since HN and LN samples stem from independent acquisitions the noise in them is uncorrelated. This might explain why the overall improvement in these metrics is relatively low for all algorithms.
3.3 Qualitative Evaluation
Because of this caveat, we asked three expert ophthalmologists to visually assess the quality of our results. We provided them with 150 real LN images from the test set and images generated from the corresponding real HN images using BM3D, wavelet denoising and our method. For each such sample, the clinicians rated the methods by their similarity to the real LN images. We ordered the images in each sample randomly and did not provide any indication as to which model generated which image. The results of this evaluation, displayed in Fig. 2, confirm the findings of the quantitative evaluation. The experts unanimously agree that our approach outperforms the benchmarks.
3.4 Feature Map Inspection
We attempted to understand how the model is approaching the task of image enhancement by looking at the feature maps that it has learned. We did this by passing a sample through the LN generator and extracting the neuron activations at every layer. Due to the convolutional nature of the generator, these layer outputs are shaped like images with many channels. Hence we can view each channel in the activations of a layer as a gray-scale image which we refer to as a feature map.
By up-scaling the feature maps to the size of the input, we then checked for spatial correlations. For visualization purposes, we show some feature maps from different layers, which highlight distinct regions of the retina, in Fig. 3. Many more can be found in the appendix.
We observed that the feature maps at the output of the deeper residual blocks become increasingly abstract and spatially uncorrelated with the input. The feature maps at the outputs of the first four layers (initial convolution and down-sampling 1 to 3) and shallower residual blocks exhibit a strong spatial correlation with the input. Moreover, the different channels seem to correspond to anatomically distinct regions in the b-scan, although segmentation was never part of the training objective.
We think that this finding is relevant when viewed from two perspectives. Firstly, it shows that the model has gained some domain specific knowledge about the structure of macular OCT scans, which general methods such as BM3D and wavelet denoising are lacking. Secondly, this property can prove useful from the viewpoint of transfer learning, i.e. when applying this model to other tasks. The feature maps themselves can also be used for other tasks.
An example of the second point can be found in Fig. 0.A.11 in the appendix. We discovered a feature map that appears to track the positions of the Inner Limiting Membrane (ILM) and the Retinal Pigment Epithelium (RPE) (the inner- and outermost layers of the retina) (see Fig. 0.A.10
). We then multiplied the feature map with its corresponding b-scan, applied the image mean as a threshold and skeletonized the remainder. The resulting lines can be used to estimate retinal thickness. This method seems to work well even in the presence of pathologies, such as myopia (row 4, col. 4) or vitreous detachment (row 1, col. 4 and row 5, col. 3), which are typical causes for segmentation errors in commercial segmentation algorithms.
In this paper, we applied the HDcycleGAN model to the problem of image enhancement. In medical imaging, reduced image noise typically comes at the cost of increased acquisition time, radiation dose or other detrimental effects. Our model can learn a mapping between domains that correspond to different settings of those costly acquisition parameters.
Additionally, our approach learns the structural characteristics of the medical imaging domain, which further improves its usefulness as it can be leveraged for other tasks in that domain. As part of future work, we wish to study the transferability of our approach to other imaging modalities, such as Ultrasound.
As is the case with all GAN-based methods, the training of this model is not straightforward and the performance does not appear to increase monotonically throughout training. Nevertheless, our approach allows us to pre-train the parts individually; the generators as Autoencoders and the discriminator as a classifier between domains. In the future, we also plan to test if pre-training can improve training stability and model performance.
-  Chang, S.G., , Vetterli, M.: Adaptive wavelet thresholding for image denoising and compression. IEEE Transactions on Image Processing 9(9), 1532–1546 (2000)
-  Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising with block-matching and 3d filtering. vol. 6064 (2006)
-  Darbon, J., Cunha, A., Chan, T.F., Osher, S., Jensen, G.J.: Fast nonlocal filtering applied to electron cryomicroscopy. In: 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. pp. 1331–1334 (2008)
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27, pp. 2672–2680 (2014)
-  Halupka, K.J., Antony, B.J., Lee, M.H., Lucy, K.A., Rai, R.S., Ishikawa, H., Wollstein, G., Schuman, J.S., Garnavi, R.: Retinal optical coherence tomography image enhancement via deep learning. Biomed. Opt. Express 9(12), 6205–6221 (2018)
Huang, Y., Lu, Z., Shao, Z., Ran, M., Zhou, J., Fang, L., Zhang, Y.: Simultaneous denoising and super-resolution of optical coherence tomography images based on generative adversarial network. Opt. Express27(9), 12289–12307 (2019)
-  Joseph M. Schmitt, S. H. Xiang, K.M.Y.: Speckle in optical coherence tomography. Journal of Biomedical Optics 4(1), 95 – 105 – 11 (1999)
-  Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface construction algorithm. In: Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques. pp. 163–169 (1987)
-  Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts. Distill (2016), http://distill.pub/2016/deconv-checkerboard
-  Podoleanu, A.G.: Optical coherence tomography. Journal of Microscopy 247(3), 209–219 (2012)
-  Reddy, B.S., Chatterji, B.N.: An fft-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing 5(8), 1266–1271 (1996)
-  Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Sixth International Conference on Computer Vision. pp. 839–846 (1998)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
Appendix 0.A Appendix
|layer name||type||properties||output size|
|initial convolution||convolution||kernel=7x7x16, stride=1||512x512x16|
|down-sampling 1||strided convolution||kernel=3x3x32, stride=2||256x256x32|
|down-sampling 2||strided convolution||kernel=3x3x64, stride=2||128x128x64|
|down-sampling 3||strided convolution||kernel=3x3x128, stride=2||64x64x128|
|residual block 1||residual block||convolutions=3, kernel=3x3x128, stride=1||64x64x128|
|residual block 2||residual block||convolutions=3, kernel=3x3x128, stride=1||64x64x128|
|residual block 3||residual block||convolutions=3, kernel=3x3x128, stride=1||64x64x128|
|residual block 4||residual block||convolutions=3, kernel=3x3x128, stride=1||64x64x128|
|residual block 5||residual block||convolutions=3, kernel=3x3x128, stride=1||64x64x128|
|residual block 6||residual block||convolutions=3, kernel=3x3x128, stride=1||64x64x128|
|up-sampling 1||bilinear up-scaling + convolution||kernel=3x3x64, scale=2||128x128x64|
|up-sampling 2||bilinear up-scaling + convolution||kernel=3x3x32, scale=2||256x256x32|
|up-sampling 3||bilinear up-scaling + convolution||kernel=3x3x16, scale=2||512x512x16|
|final convolution||convolution||kernel=3x3x1, stride=1||512x512x1|
Architecture of the HDcycleGAN generator. All layers use Instance Norm and ReLU activation. Skip connections link down-sampling and up-sampling layers using concatenation.
|layer name||type||properties||output size|
|down-sampling 1||strided convolution||kernel=3x3x16, stride=2||256x256x16|
|residual block 1||residual block||convolutions=2, kernel=3x3x16, stride=1||256x256x16|
|down-sampling 2||strided convolution||kernel=3x3x32, stride=2||128x128x32|
|residual block 2||residual block||convolutions=2, kernel=3x3x32, stride=1||128x128x32|
|down-sampling 3||strided convolution||kernel=3x3x64, stride=2||64x64x64|
|residual block 3||residual block||convolutions=2, kernel=3x3x64, stride=1||64x64x64|
|down-sampling 4||strided convolution||kernel=3x3x128, stride=2||32x32x128|
|residual block 4||residual block||convolutions=2, kernel=3x3x128, stride=1||32x32x128|
|down-sampling 5||strided convolution||kernel=3x3x256, stride=2||16x16x256|
|residual block 5||residual block||convolutions=2, kernel=3x3x256, stride=1||16x16x256|
|down-sampling 6||strided convolution||kernel=3x3x512, stride=2||8x8x512|
|residual block 6||residual block||convolutions=2, kernel=3x3x512, stride=1||8x8x512|
|down-sampling 7||strided convolution||kernel=3x3x1024, stride=2||4x4x1024|
|residual block 7||residual block||convolutions=2, kernel=3x3x1024, stride=1||4x4x1024|
|average pooling||per channel averaging||1024|