Noise as Domain Shift: Denoising Medical Images by Unpaired Image Translation

10/07/2019 ∙ by Ilja Manakov, et al. ∙ 33

We cast the problem of image denoising as a domain translation problem between high and low noise domains. By modifying the cycleGAN model, we are able to learn a mapping between these domains on unpaired retinal optical coherence tomography images. In quantitative measurements and a qualitative evaluation by ophthalmologists, we show how this approach outperforms other established methods. The results indicate that the network differentiates subtle changes in the level of noise in the image. Further investigation of the model's feature maps reveals that it has learned to distinguish retinal layers and other distinct regions of the images.



There are no comments yet.


page 10

page 11

page 12

page 13

page 15

page 16

page 17

page 18

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Medical imaging is one of the great pillars of modern diagnostics. Clinicians rely on it to obtain information from inside the patient’s body in a non-invasive way. However, noise in the images erodes their quality and makes their interpretation difficult. Moreover, it can cause algorithms, designed to automatically extract measurements from those images, to be inaccurate or fail outright. In this paper, we focus on the domain of retinal optical coherence tomography (OCT)[11], a standard diagnostic tool in ophthalmology. Retinal OCT produces a series of 2D slices (b-scans) that display the depth profile of the retina, thus enabling clinicians to detect many sight-threatening diseases early in their progression. The dominating type of noise in OCT is called speckle. The speckle noise pattern depends on the imaged tissue and is highly sensitive to its position and orientation. Since signal and speckle noise originate from the same physical process, distinguishing signal from noise is particularly challenging. Interested readers are referred to [8] for more details.

Current popular methods for denoising OCT scans, such as BM3D [2] or wavelet denoising [1]

, neither incorporate knowledge about the OCT process nor about structures of the human eye. We argue that such knowledge should help in this task, given the complex and sample-dependent nature of speckle noise. On the other hand, methods emerging from the field of deep learning

[14, 4] have demonstrated precisely this ability, i.e. to learn the semantic characteristics of their input domains. We, therefore, aim to leverage deep learning to create a method that can denoise retinal scans by utilizing knowledge it has gained about this domain.

While writing this paper, we discovered recent work from Halupka et al. [5] and Huang et al. [7], in which they investigated a different GAN-based approach to retinal OCT denoising. Their approaches require paired training images, which can lead to problems with inaccurately registered images. Additionally, in their works, the denoised domain is constructed by registering and averaging samples from the noisy domain. Constructing denoised samples in this manner is not always feasible or possible and registration of images from different domains will likely not work well.

Our approach casts denoising as a domain translation problem. We demonstrate that, with some modifications, the cycleGAN model, introduced by Zhu et al. [14], can learn a mapping between a low and high noise domain from unpaired training data.

We introduce our method, the HDcycleGAN model, in Section 2 and evaluate its performance quantitatively and qualitatively in Section 3.2 and Section 3.3. In Section 3.4, we take a closer look at what our model has learned by inspecting its feature maps.

2 Methodology

Initially, we started by directly applying the cycleGAN model to the problem of learning a mapping between images of a high noise (HN) domain and a low noise (LN) one . However, we soon discovered that this model does not perform well on our problem as is. Therefore, we made some modifications to the existing cycleGAN framework and developed our final model, the Hybrid Discriminator cycleGAN (HDcycleGAN). Fig. 1 shows a pass through our model, starting from an HN image. In the following, we briefly summarize the required knowledge about the cycleGAN and highlight the changes we made and why we made them.

The cycleGAN combines two Generative Adversarial Networks (GANs) [4]

into one two-way Autoencoder. Here, the generator of each GAN learns the mapping from one image domain to the other. In combination, they act like encoder and decoder of an Autoencoder. This framework allows two directions of traversal; going from domain one to domain two and back to domain one or vice versa. The paper also introduced the cycle consistency loss, which corresponds to the reconstruction loss in the standard Autoencoder setting. The goal of this loss function is to achieve consistency when transforming an image from one domain to the other and back. The generators in the cycleGAN down-sample the input image using strided convolutions, pass it through a series of residual blocks


and finally use fractional-strided convolutions for up-sampling. The discriminators down-sample their inputs through strided convolutions to produce a scalar output.

Using a cycleGAN-based approach allows us to train the network on unpaired images. In this way, registration of images becomes obsolete and we can avoid uncertainties that arise due to interpolation in affine transformation or in cases of mismatch between the images. An additional benefit of this framework is the cycle-consistency loss; although we are primarily interested in the mapping from HN images to LN, this added loss function provides a training signal to the network that is more stable than that of the discriminator alone.

Figure 1: Schematic overview of the HDcycleGAN model. The path starting from HN is shown here. Starting from LN works analogously

We made three adjustments to the original cycleGAN model:

Skip Connections

In our first experiments, the vanilla cycleGAN generated blurry images. The sharpness of the image and clarity of visual features with small spatial extent play a crucial role when it comes to image quality. To address this problem, we added skip connections to the generators, which concatenate the output of each down-sampling layer to the input of the corresponding up-sampling layer.


Additionally, we noticed checkerboard-like artifacts in the generated images. Following an investigation by Odena et al. [10]

, we replaced each fractional-strided convolution with a combination of bilinear up-sampling and a padded convolution to remedy this issue.

Shared Discriminator

Even after the first two modifications and testing different hyper-parameters, the model failed to consistently improve image quality when mapping from HN to LN. We then noticed that both discriminators learn the characteristics of real OCT b-scans independently. The two image domains are almost identical in terms of image content. Consequently, the discriminators could not pick up on the subtle differences between the domains (see Fig. 0.A.1 in the appendix). As a remedy, we utilized a single discriminator that is shared between both generators. The discriminator can thus focus on the differences between the two domains instead of the full range of characteristics of each. This change resulted in the most significant improvement in visual quality of the generated images.

This shared discriminator acts as a three-way classifier, outputting the class probabilities for real HN, real LN and fake. As the discriminator now has to discriminate between more samples, we increased its complexity by adding a residual block with two convolutions in between each down-sampling layer.

The loss function of our model can thus be written as follows: Let and denote the generators that learn a mapping from LN to HN and from HN to LN respectively and the discriminator. Let

be the one-hot encoded vectors that represent the classes real HN, real LN and fake. Then the loss of the network is:


Here and are hyper-parameters for weighting discriminator and cycle-consistency loss respectively. For our model, we set and following [14]. Our implementation of the described methodology is publicly available at We also provide implementation details in Tables 0.A.1 and 0.A.2 in the appendix.

3 Experiments and Results

After training the HDcycleGAN for 245 epochs with an Adam optimizer and a learning rate of

, we performed both quantitative and qualitative analyses on the test set, which we explain in sections 3.2 and 3.3. In the quantitative analysis, we compared our approach to popular denoising methods using several measurements of similarity between real LN images and denoised ones. For the qualitative analysis, the similarity between real LN images and images produced by BM3D [2], wavelet denoising [1] and our method was assessed by three ophthalmologists independently. Finally, in section 3.4, we inspect the learned feature maps of the LN generator. We start by describing our dataset.

3.1 Dataset

We acquired the data for this task in-house, using a SPECTRALIS OCT+HRA from Heidelberg Engineering, as part of the general diagnostic workflow for macular diseases. We did not select patients based on any further traits. As such, the scans in the dataset show various kinds of diseases in all stages and are representative of the typical imaging data generated at our hospital. To gather the images belonging to the high noise domain, we followed the hospital protocol, using 30°ART Volume acquisition with 12 frames averaged for each b-scan. The low noise domain consists of acquisitions that follow the same protocol except that we set the number of averaged frames to 60. We obtained both HN and LN images from the same patients on the same visit. As the proprietary software of the device manufacturer handles the frame averaging, we did not have access to the individual frames. In total, we gathered 23030 b-scans in 470 volumes from 235 patients for each noise domain. We used 90% of the volumes for training and the remaining 10% for testing. Before passing the images through our model, we scaled the 496x512 images to a pixel intensity range between 0.0 and 1.0.

3.2 Quantitative Evaluation

To asses our model’s performance, we evaluated the similarity between the generated images and the ground truth LN images in the test set. Since we acquired HN and LN scans pairwise, we registered the images employing a registration algorithm based on discrete Fourier-transform

[12]. After registration, we calculated the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) between the two images. Additionally, we used the Marching Cubes algorithm [9] to find the contour of the retina. Inverting the selection yields a background mask, while reapplying Marching Cubes on the retinal layers with a different level finds contours in highly reflective parts of the retina. We designated these regions as signal. This process is illustrated in Fig. 0.A.3

in the appendix. Using the signal and background regions, we then calculated the mean-to-standard-deviation ratio (MSR) and contrast-to-noise ratio (CNR). To better gauge the performance of our approach, we included median filtering, wavelet denoising

[1], bilateral filtering [13], non-local means [3] and BM3D [2] in the comparison. The results are displayed in Table 1.

raw 3.66 2.21 3.96 1.73 21.99 1.33 0.662 0.055
median 3.82 2.36 4.25 1.92 22.32 1.45 0.682 0.051
wavelet [1] 3.81 2.37 4.23 1.86 22.34 1.41 0.690 0.053
bilateral [13] 3.78 2.33 4.28 1.93 22.29 1.40 0.690 0.053
nl-means [3] 3.78 2.33 4.43 2.12 22.32 1.40 0.702 0.051
BM3D [2] 3.87 2.44 4.39 1.97 22.50 1.45 0.708 0.052
ours 4.00 2.51 4.73 2.23 22.58 1.41 0.706 0.050
Table 1: Results of the quantitative analysis. Values are shown as mean standard deviation.

We can see that our model outperforms the other methods in all measurements except SSIM, where BM3D is slightly ahead. Overall we find that the performance of BM3D and our model is very close in inter-image measurements (SSIM and PSNR). In intra-image measurements (CNR and MSR) the margin between our approach and the others widens. It is also worth noting that our algorithm requires 30% less time to run on CPU than BM3D and beats all other algorithms by almost an order of magnitude on a low-end GPU (see Fig. 0.A.2 in the appendix). Although PSNR, SSIM, MSR and CNR are standard metrics of image quality, there is a caveat to these results; since HN and LN samples stem from independent acquisitions the noise in them is uncorrelated. This might explain why the overall improvement in these metrics is relatively low for all algorithms.

3.3 Qualitative Evaluation

Because of this caveat, we asked three expert ophthalmologists to visually assess the quality of our results. We provided them with 150 real LN images from the test set and images generated from the corresponding real HN images using BM3D, wavelet denoising and our method. For each such sample, the clinicians rated the methods by their similarity to the real LN images. We ordered the images in each sample randomly and did not provide any indication as to which model generated which image. The results of this evaluation, displayed in Fig. 2, confirm the findings of the quantitative evaluation. The experts unanimously agree that our approach outperforms the benchmarks.

Figure 2: Results of the qualitative evaluation by three experts.

3.4 Feature Map Inspection

We attempted to understand how the model is approaching the task of image enhancement by looking at the feature maps that it has learned. We did this by passing a sample through the LN generator and extracting the neuron activations at every layer. Due to the convolutional nature of the generator, these layer outputs are shaped like images with many channels. Hence we can view each channel in the activations of a layer as a gray-scale image which we refer to as a feature map.

Figure 3: Example of the feature maps. On the left, the input image is overlaid with the map. On the right, the feature map is shown by itself.

By up-scaling the feature maps to the size of the input, we then checked for spatial correlations. For visualization purposes, we show some feature maps from different layers, which highlight distinct regions of the retina, in Fig. 3. Many more can be found in the appendix.

We observed that the feature maps at the output of the deeper residual blocks become increasingly abstract and spatially uncorrelated with the input. The feature maps at the outputs of the first four layers (initial convolution and down-sampling 1 to 3) and shallower residual blocks exhibit a strong spatial correlation with the input. Moreover, the different channels seem to correspond to anatomically distinct regions in the b-scan, although segmentation was never part of the training objective.

We think that this finding is relevant when viewed from two perspectives. Firstly, it shows that the model has gained some domain specific knowledge about the structure of macular OCT scans, which general methods such as BM3D and wavelet denoising are lacking. Secondly, this property can prove useful from the viewpoint of transfer learning, i.e. when applying this model to other tasks. The feature maps themselves can also be used for other tasks.

An example of the second point can be found in Fig. 0.A.11 in the appendix. We discovered a feature map that appears to track the positions of the Inner Limiting Membrane (ILM) and the Retinal Pigment Epithelium (RPE) (the inner- and outermost layers of the retina) (see Fig. 0.A.10

). We then multiplied the feature map with its corresponding b-scan, applied the image mean as a threshold and skeletonized the remainder. The resulting lines can be used to estimate retinal thickness. This method seems to work well even in the presence of pathologies, such as myopia (row 4, col. 4) or vitreous detachment (row 1, col. 4 and row 5, col. 3), which are typical causes for segmentation errors in commercial segmentation algorithms.

4 Discussion

In this paper, we applied the HDcycleGAN model to the problem of image enhancement. In medical imaging, reduced image noise typically comes at the cost of increased acquisition time, radiation dose or other detrimental effects. Our model can learn a mapping between domains that correspond to different settings of those costly acquisition parameters.

Additionally, our approach learns the structural characteristics of the medical imaging domain, which further improves its usefulness as it can be leveraged for other tasks in that domain. As part of future work, we wish to study the transferability of our approach to other imaging modalities, such as Ultrasound.

As is the case with all GAN-based methods, the training of this model is not straightforward and the performance does not appear to increase monotonically throughout training. Nevertheless, our approach allows us to pre-train the parts individually; the generators as Autoencoders and the discriminator as a classifier between domains. In the future, we also plan to test if pre-training can improve training stability and model performance.


  • [1] Chang, S.G., , Vetterli, M.: Adaptive wavelet thresholding for image denoising and compression. IEEE Transactions on Image Processing 9(9), 1532–1546 (2000)
  • [2] Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising with block-matching and 3d filtering. vol. 6064 (2006)
  • [3] Darbon, J., Cunha, A., Chan, T.F., Osher, S., Jensen, G.J.: Fast nonlocal filtering applied to electron cryomicroscopy. In: 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. pp. 1331–1334 (2008)
  • [4] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27, pp. 2672–2680 (2014)
  • [5] Halupka, K.J., Antony, B.J., Lee, M.H., Lucy, K.A., Rai, R.S., Ishikawa, H., Wollstein, G., Schuman, J.S., Garnavi, R.: Retinal optical coherence tomography image enhancement via deep learning. Biomed. Opt. Express 9(12), 6205–6221 (2018)
  • [6]

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)

  • [7]

    Huang, Y., Lu, Z., Shao, Z., Ran, M., Zhou, J., Fang, L., Zhang, Y.: Simultaneous denoising and super-resolution of optical coherence tomography images based on generative adversarial network. Opt. Express

    27(9), 12289–12307 (2019)
  • [8] Joseph M. Schmitt, S. H. Xiang, K.M.Y.: Speckle in optical coherence tomography. Journal of Biomedical Optics 4(1), 95 – 105 – 11 (1999)
  • [9] Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface construction algorithm. In: Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques. pp. 163–169 (1987)
  • [10] Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts. Distill (2016),
  • [11] Podoleanu, A.G.: Optical coherence tomography. Journal of Microscopy 247(3), 209–219 (2012)
  • [12] Reddy, B.S., Chatterji, B.N.: An fft-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing 5(8), 1266–1271 (1996)
  • [13] Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Sixth International Conference on Computer Vision. pp. 839–846 (1998)
  • [14]

    Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: The IEEE International Conference on Computer Vision (ICCV) (2017)

Appendix 0.A Appendix

Figure 0.A.1: Mean scores assigned by the discriminators to data samples. The bars correspond to samples being evaluated for a specific target domain. In the case of HDcycleGAN the values are obtained from the class probabilities, while for the cycleGAN they are taken from the respective discriminators. We can see that in the cycleGAN the scores do not change much when we evaluate a real image as either LN or HN.
Figure 0.A.2: Runtimes of the different methods computed over the test set. The GPU used for this calculation was a mobile 2GB GTX 1050.
Figure 0.A.3: Illustration of the process to obtain background and signal masks for calculation of CNR and MSR.
Figure 0.A.4: Sample showing a b-scan with 60 frames (a) and 12 frames (b) averaged. Also shown are the results of denoising using wavelet (c), HDcycleGAN (d) and BM3D (e)
Figure 0.A.5: Sample showing a b-scan with 60 frames (a) and 12 frames (b) averaged. Also shown are the results of denoising using wavelet (c), HDcycleGAN (d) and BM3D (e)

Figure 0.A.6: Demonstration of feature maps, part 1. The feature maps represent individual channels of activations in the LN generator at different layers. On the left, the input image is overlayed with the map. On the right, the feature map is shown by itself. Layer and channel are shown on the sides.

Figure 0.A.7: Demonstration of feature maps, part 2. The feature maps represent individual channels of activations in the LN generator at different layers. On the left, the input image is overlayed with the map. On the right, the feature map is shown by itself. Layer and channel are shown on the sides.

Figure 0.A.8: Demonstration of feature maps, part 3. The feature maps represent individual channels of activations in the LN generator at different layers. On the left, the input image is overlayed with the map. On the right, the feature map is shown by itself. Layer and channel are shown on the sides.

Figure 0.A.9: Feature map 12 of layer 3 appears to have learned to detect the borders of the retinal layers.

Figure 0.A.10: Feature map 114 of layer 6 appears to have learned to track the Inner Limiting Membrane (ILM) and Retinal Pigment Epithelium (RPE).

Figure 0.A.11: Skeletonized version of the feature maps from Fig. 0.A.10 overlayed with their b-scans. The skeletons can be used to estimate retinal thickness, which can help in data selection or serve as an input feature for some other task. The feature maps were multiplied with the b-scan, thresholded by its mean value and median-filtered before skeletonizing.
layer name type properties output size
initial convolution convolution kernel=7x7x16, stride=1 512x512x16
down-sampling 1 strided convolution kernel=3x3x32, stride=2 256x256x32
down-sampling 2 strided convolution kernel=3x3x64, stride=2 128x128x64
down-sampling 3 strided convolution kernel=3x3x128, stride=2 64x64x128
residual block 1 residual block convolutions=3, kernel=3x3x128, stride=1 64x64x128
residual block 2 residual block convolutions=3, kernel=3x3x128, stride=1 64x64x128
residual block 3 residual block convolutions=3, kernel=3x3x128, stride=1 64x64x128
residual block 4 residual block convolutions=3, kernel=3x3x128, stride=1 64x64x128
residual block 5 residual block convolutions=3, kernel=3x3x128, stride=1 64x64x128
residual block 6 residual block convolutions=3, kernel=3x3x128, stride=1 64x64x128
up-sampling 1 bilinear up-scaling + convolution kernel=3x3x64, scale=2 128x128x64
up-sampling 2 bilinear up-scaling + convolution kernel=3x3x32, scale=2 256x256x32
up-sampling 3 bilinear up-scaling + convolution kernel=3x3x16, scale=2 512x512x16
final convolution convolution kernel=3x3x1, stride=1 512x512x1
Table 0.A.1:

Architecture of the HDcycleGAN generator. All layers use Instance Norm and ReLU activation. Skip connections link down-sampling and up-sampling layers using concatenation.

layer name type properties output size
down-sampling 1 strided convolution kernel=3x3x16, stride=2 256x256x16
residual block 1 residual block convolutions=2, kernel=3x3x16, stride=1 256x256x16
down-sampling 2 strided convolution kernel=3x3x32, stride=2 128x128x32
residual block 2 residual block convolutions=2, kernel=3x3x32, stride=1 128x128x32
down-sampling 3 strided convolution kernel=3x3x64, stride=2 64x64x64
residual block 3 residual block convolutions=2, kernel=3x3x64, stride=1 64x64x64
down-sampling 4 strided convolution kernel=3x3x128, stride=2 32x32x128
residual block 4 residual block convolutions=2, kernel=3x3x128, stride=1 32x32x128
down-sampling 5 strided convolution kernel=3x3x256, stride=2 16x16x256
residual block 5 residual block convolutions=2, kernel=3x3x256, stride=1 16x16x256
down-sampling 6 strided convolution kernel=3x3x512, stride=2 8x8x512
residual block 6 residual block convolutions=2, kernel=3x3x512, stride=1 8x8x512
down-sampling 7 strided convolution kernel=3x3x1024, stride=2 4x4x1024
residual block 7 residual block convolutions=2, kernel=3x3x1024, stride=1 4x4x1024
average pooling per channel averaging 1024
logits fully connected 3
Table 0.A.2: Architecture of the HDcycleGAN discriminator. All layers use Instance Norm and ReLU activation, except the final layer which uses softmax and no norm.