## 1 Introduction

Due to limited spatial-frequency bandwidth, optical coherence tomography (OCT) [huang1991optical]has an inherent characteristic of speckle [schmitt1999speckle]. The speckle can severely degrade the image quality by occluding essential anatomical structures such as retinal vessels and thin layers (e.g., the external limiting membrane (ELM)). Hence, despeckling is an urgent pre-processing step for both clinical diagnoses [chiu2015kernel] and further image analysis [xiang2018automatic]. The traditional way to reduce speckle noise is to average multiple noisy b-scan acquisitions at the same location [sander2005enhanced]. This approach requires a large number of repetitions, and the prolonged acquisition time can be problematic for patient comfort; additionally, registration artifacts caused by eye movement can be an issue. Therefore, a denoising algorithm that does not require repeated acquisitions is desirable.

Many deep learning methods have been investigated for the OCT denoising problem [ma2018speckle, devalla2019deep, hu2020retinal, mao2019deep, fan2020oct]. Most of these train the model using low-noise reference images, which is not always available in practice. To tackle this lack of ground truth, some unsupervised approaches including Noise2Noise [lehtinen2018noise2noise] (N2N) and deep image prior [ulyanov2018deep] (DIP) have been applied to OCT image restoration. Mao et al. [mao2019deep] implement N2N on OCT denoising by using two repeated b-scans as input and target for model training. Fan et al. [fan2020oct] demonstrate that DIP can have the drawback of serious over-smoothing in despeckling results.

In recent research, Ho et al. proposed a diffusion probabilistic model [ho2020denoising]

that is used for image synthesis with a performance superior to generative adversarial networks (GANs)

[dhariwal2021diffusion]. The general idea of the diffusion model is straightforward: Given a natural image, a Markov chain can be formed by adding a small amount of Gaussian noise at each step. Given a sufficient number of steps, a complex data distribution will finally be transformed to a Gaussian. Conversely, given a Gaussian noise image, a meaningful image can be synthesized in a reverse process. In this work, we propose to leverage the diffusion probabilistic model to denoise retinal OCT b-scans. Since the model is learning the speckle pattern instead of the retina appearance, the reference image used for training is not required to be the true noise-free image. In our experiments we apply the self-fusion [oguz2020self, hu2021life] method to obtain the clean reference image and we train the parameterized Markov chain by variational inference. As the number of reverse steps is adjustable, the algorithm is able to produce different levels of denoising results. This is an advantage since different tasks may require different levels of fine detail retention in the images.## 2 Method

We leverage our self-fusion method [oguz2020self] as a pre-processing step in the training stage (Fig. 1

a). Self-fusion regards b-scans in a small vicinity of a given target b-scan as ‘atlases’ for that b-scan, because of their structural similarity. After registering the neighbors to the target b-scan, a pixel-wise weighted average of these ‘atlases’ will result in an image with high signal-to-noise ratio (SNR). This approach is easy to implement and robust for retinal layer enhancement, but finer features like vessels and texture can be over-smoothed. Nevertheless, since the diffusion probabilistic model aims to learn the speckle pattern instead of the signal, the self-fusion output

can still be used as the clean image for our training purposes. Fig. 1b shows a Markov chain in forward (diffuse) and reverse (denoise) directions:(1) |

where represents the data distribution while . is the parameters of the model. The diffusion and sampling describe a transition between these two distributions with discretized steps. Our goal is to train a deep model to restore a noisy image with an adjustable parameter . Intuitively, an image with stronger speckle require a larger value that indicates more denoising steps.

The image sequence

is created by gradually adding small Gaussian noise with a variance schedule

, where , .(2) |

Eq. 2 approximates the posterior distribution in the forward process assuming that there is a small mean shift after one step of diffusion. Denote , then is acquired by adding

different Gaussian random variables to

.(3) |

As the sum of Gaussians is a Gaussian with variance . The high order terms with regard to in are negligible because is a small value in range . In practice, Eq. 3 enables sampling of with reparameterization:

(4) |

Similar to Eq. 2, the reverse step is also modeled as a Gaussian since the noise added in each step is small:

(5) |

In this work, we set the variance to be a fixed parameter and only learn to predict the mean.

To perfectly recover the image in the reverse process, the ideal solution is to minimize the distance between and . However, according to Ho et al. [ho2020denoising], the direct KL-divergence between these two distributions, , is not tractable. Alternatively, they introduce a tractable constraint . By leveraging the property of Markov chain, the following equation should hold:

(6) |

From Eq. 2 and Eq. 3, we know that both terms in the product are Gaussian; then should also have the form :

(7) |

We use the negative log likelihood as the objective function. This can be optimized by minimizing its variational upper bound given by the Jensen’s inequality; the detailed derivation of this is included in Appendix A:

(8) |

Because the variance schedule is fixed in this implementation, turns out to be a constant, so we only need to consider and

as loss function. Given Eq.

7 and Eq. 5,is the KL divergence of two Gaussian distributions and can be reduced to Eq.

9. The derivation details can be found in Appendix B.(9) |

Obviously, to minimize we can set the mean prediction equal to which is derived from Eq. 7 and Eq. 4:

(10) |

Combining Eq. 9 and Eq. 10 it is easy to see that the model is learning to predict the sample

drawn from a normal distribution with mean square error (MSE). The expression of

demonstrates that the image restoration is actually done by adding a Gaussian noise.## 3 Experiments

### 3.1 Dataset

We train our model on 6 optic nerve head (ONH) volumes of the human retina, and test it on 6 fovea volumes. Each volume contains 500 b-scans of pixels. To test the performance of the model at different speckle levels, the data is acquired with three different levels of signal-to-noise ratio (92dB, 96dB, 101dB). As mentioned in Sec. 1, the ground truth used for evaluation is often obtained by averaging repeated b-scan frames at the same spatial location. In our dataset, we have 5 repeated acquisitions for each b-scan which are averaged together to form the ground truth.

### 3.2 Implementation details

For self-fusion, we take adjacent b-scans within radius of 3 as candidates to compute the weighted mean. We use the greedy^{1}^{1}1https://github.com/pyushkevich/greedy software for diffeomorphic image registration [yushkevich2016fast]

. Images are padded to

pixels, and the intensity is normalized to .The variance schedule is set to linearly increase from to in steps. The architecture of our model is a simplified version of that in Ho et al. [ho2020denoising]

. It is trained on an NVIDIA RTX 2080TI 11GB GPU for 500 epochs, with a batch size of 2 using Adam optimizer. The starting learning rate is

and decay by half every 5 epochs.SNR=92dB | SNR=96dB | SNR=101dB | |

t=0 |
|||
---|---|---|---|

t=20 |
|||

t=41 |
|||

t=46 |
|||

t=51 |
|||

t=70 |

## 4 Results

### 4.1 Qualitative results

In Fig. 2 we show the denoising results of our model for a range of values. is the original input image for each SNR level. Increasing values indicate more denoising steps. The best result (determined visually) for each noise level, as highlighted by the red box, coincide with the intuition that the noisier images benefit from a larger . For the third column, where the input noise level is relatively low, we can see that as increases from 41 to 51, retinal layers gradually become over-smoothed and fine texture features fade away. As discussed in Sec. 2, Eq. 10 explains that the denoising process is done by adding Gaussian noise to compensate for the speckle pattern. When the is too large (e.g. in Fig. 2), the added noise becomes excessive and produces poor results.

We further note that our proposed method performs well for vessel and layer preservation. For example, in the second column, our result reveals the very thin external limiting membrane (ELM) (marked by red arrows) which is hardly visible in the noisy input.

### 4.2 Comparison to baseline denoising model

Hu et al. [hu2020retinal] present a pseudo-modality fusion network (PMFN) that improves the feature preservation compared to a former method developed by Devalla et al. [devalla2019deep]. We use the PMFN as the baseline in this study. In Fig. 3, we observe that the retinal layers are more homogeneous in our proposed approach than in PMFN for all input SNR levels. Downstream analysis tasks such as layer segmentation would likely benefit from this improvement. We also note that small features like vessels are not sacrificed, even though other regions of the layers become denoised. To quantitatively confirm these observations, we use the average of 5 repeated frames (5-mean) as the reference ground truth image and we report several metrics in Table 1

. Comparing with PMFN, the proposed method improve the denoising performance in terms of SNR, CNR and ENL. The results are significantly different in a paired, two-tailed t-test with a significance threshold of 0.05.

SNR | PSNR | CNR | ENL | |
---|---|---|---|---|

PMFN | ||||

proposed |

SNR=92dB | SNR=96dB | SNR=101dB | |
---|---|---|---|

5-mean |
|||

PMFN |
|||

Proposed |

## 5 Conclusion

We propose an OCT restoration method leveraging a diffusion probabilistic model. The model is unsupervised and requires a small amount of data to train. Moreover, the parameter provides control over the output smoothness level. Our results show that our model can efficiently suppress the speckle; furthermore, the features including layers and vessels are not only preserved but present higher visibility. A limitation of this approach is its Gaussian assumption on the speckle pattern. The generalization to other noise distribution types will be a potential direction of future work for better speckle modeling.

## 6 Acknowledgements

This work is supported by NIH R01EY031769, NIH R01EY030490 and the Vanderbilt University Discovery Grant Program.

## References

## Appendix A variational upper bound

According to the Jensen’s inequality, for a concave function , we should have:

(11) |

Hence the log likelihood has a evidence lower bound (ELBO), as logarithm is a concave function:

The negative log likelihood will then have an upper bound:

Then minimizing the negative log likelihood is equivalent to minimize the upper bound .

Intuitively can be approximated by .

hence the bound can be expressed as:

## Appendix B KL divergence of Gaussian variables

Let and be two Gaussian distributions:

The KL divergence between and can be derived as follows:

Apply this result on , if the ratio of variances is either cancelled or regarded as constant

(12) |

From the reparameterization (Eq. 4) it is easy to get

Then plug this in to get:

Then apply this result in the expression of , we get:

(13) |

To minimize , given fixed , the following should hold:

(14) |