
Rethinking Deep Image Prior for Denoising

Deep image prior (DIP) serves as a good inductive bias for diverse inverse problems. Among them, denoising is known to be particularly challenging for the DIP because the network eventually fits the noise, requiring early stopping. To address this issue, we first analyze the DIP through the notion of effective degrees of freedom (DF) to monitor the optimization progress, and propose a principled criterion for stopping before the network fits the noise, without access to a paired ground-truth image, for Gaussian noise. We also propose the `stochastic temporal ensemble (STE)' method, which incorporates several techniques to further improve DIP's denoising performance, and we extend our method to Poisson noise. Our empirical validations show that, given a single noisy image, our method denoises the image while preserving rich textural details. Further, our approach outperforms prior art in LPIPS by large margins, with comparable PSNR and SSIM, on seven different datasets.





1 Introduction

Deep neural networks have been widely used in many computer vision tasks, yielding significant improvements over conventional approaches since AlexNet [Alexnet_2012_Alex]. However, image denoising has been one of the tasks in which conventional methods such as BM3D [BM3D_2007_Dabov] outperformed many early deep-learning-based ones [SDA2010Pascal, MLP_2012_CVPR, DNNAllgaussian2014wang], until DnCNN [DnCNN_2017_TIP] surpassed BM3D for synthetic Gaussian noise at the expense of a massive amount of noiseless and noisy image pairs.

(a) Image
(b) Comparison on the CSet9 dataset
Figure 1: Comparison of single-image denoising methods. 'L' refers to LPIPS (lower is better); 'P' refers to PSNR (higher is better). Our method denoises an image while preserving rich details, showing the best LPIPS with a PSNR comparable to Self2Self (S2S) [S2S_Quan_2020_CVPR]. Ours shows a much better trade-off between PSNR and LPIPS than all other methods, including all ensembling variants of the state of the art (S2S). (The number in each S2S circle denotes the number of models in the ensemble.)

Requiring no clean and/or noisy image pairs, the deep image prior (DIP) [DIP_2018_CVPR, DIP_2020_IJCV] has shown that a randomly initialized network with an hour-glass structure acts as a prior for several inverse problems, including denoising, super-resolution, and inpainting, given only a single degraded image. Although DIP exhibits remarkable performance on these inverse problems, denoising is the particular task on which DIP does not perform well, i.e., a single run yields far lower PSNR than BM3D even in the synthetic Gaussian noise setup [DIP_2018_CVPR, DIP_2020_IJCV]. Furthermore, for the best performance, one needs to monitor the PSNR (i.e., the ground-truth clean image is required) and stop the iterations before the network fits the noise. Deep Decoder addresses the issue by proposing a strong structural regularization that allows longer iterations for inverse problems including denoising [Deep_heckel_2018_ICLR]. However, it yields worse denoising performance than DIP due to its low model complexity.

To make better use of DIP for denoising without monitoring the PSNR against a clean image, we first analyze the model complexity of the DIP through the notion of effective degrees of freedom (DF) [Tibshirani2014DoF, df2004Efron, dfinDNN2016UAI]. Specifically, the DF quantifies the amount of overfitting (i.e., optimism) of a chosen hypothesis (e.g., a trained neural network model) to the given training data [df2004Efron]. In other words, when overfitting occurs, the DF increases. Therefore, to prevent overfitting of the DIP network to the noise, we want to suppress the DF over the iterations. But obtaining the DF again requires a clean (ground-truth) image. Fortunately, for the Gaussian noise model, there are approximations of the DF that do not use a clean image: Monte-Carlo divergence approximations in Stein's unbiased risk estimator (SURE) (Eqs. 8 and 9).

Leveraging SURE and the improvement techniques in DIP [DIP_2020_IJCV], we propose an objective with 'stochastic temporal ensembling (STE),' which mimics the ensembling of many noise realizations in a single optimization run. With the proposed STE objective, we propose to stop the iteration when the objective function crosses zero. The proposed method leads to much better solutions than DIP and outperforms prior art in single-image denoising. In addition, inspired by the PURE formulation [PURE2011Luisier, PGURE2014Montagner], we extend our objective function to Poisson noise.

We empirically validate our method by comparing against DIP-based prior art on denoising performance with various metrics suggested in the literature [Gu2019ABR], such as PSNR, SSIM, and the learned perceptual image patch similarity (LPIPS) [LPIPS_zhang2018perceptual], on seven different datasets. LPIPS has been widely used in the super-resolution literature to complement PSNR and SSIM in measuring the recovery of details [Ledig2017SRGAN]. Since it is challenging for a denoiser to suppress noise and preserve details at the same time [PDTradeoff2018Blau], we argue that LPIPS is another appropriate metric for evaluating denoisers; note that it has not yet been widely used in the denoising literature. Our method not only denoises the images but also preserves rich textural details, outperforming other methods in LPIPS with comparable classic measures, including PSNR and SSIM.

Our contributions are summarized as follows:

  • Analyzing the DIP for denoising with the effective degrees of freedom (DF) of a network, and proposing a loss-based stopping criterion that requires no ground-truth image.

  • Incorporating noise regularization and exponential moving average by the proposed stochastic temporal ensembling (STE) method.

  • Diverse evaluation with various metrics (LPIPS, PSNR, and SSIM) on seven different datasets.

  • Extending our method to Poisson noise.

2 Related work

2.1 Learning based methods

Learning-based denoising methods use a large number of clean-noisy image pairs to train a denoiser. In an early study, a neural network showed decent performance even when the noise level is unknown, i.e., the blind-noise setup [MLP_2012_CVPR]. Shortly afterwards, however, [DND_Plotz_2017_CVPR] showed that most early learning-based studies often produced worse results than classical techniques such as BM3D [BM3D_2007_Dabov]. More recently, the DnCNN model with residual learning [DnCNN_2017_TIP] outperformed BM3D. Several works have since improved computational efficiency: IRCNN [IRCNNzhang_2017_learning] uses dilated convolutions, and FFDNet [zhang2018ffdnet] uses downsampled sub-images and a noise-level map.

2.2 Model based methods

Conventional model-based methods need no training but rely on an inductive bias given as a prior, so their performance depends on the chosen prior knowledge. There are several image priors, such as total variation (TV) [TV_PRIOR], wavelet-domain processing [waveletsparsityprior], and BM3D [BM3D_2007_Dabov]; these priors assume smoothness, wavelet-domain sparsity, and self-similarity, respectively.

Image prior by deep neural networks.

Ulyanov et al. [DIP_2020_IJCV] show that a randomly initialized convolutional neural network serves as an image prior, name it the deep image prior (DIP), and apply it to several inverse problems.

Despite its broad usage, the performance of denoising with DIP is still disappointing because of "overfitting" to noise (see Sec. 3). There are several remedies for the noise overfitting of DIP [DIP_2018_CVPR, Deep_heckel_2018_ICLR, GPDIP_Cheng_2019_CVPR, DIPRED_Mataev_2019_ICCV]. DIP-RED [DIPRED_Mataev_2019_ICCV] combines a plug-and-play prior with DIP, which changes the convergence point of DIP. GP-DIP [GPDIP_Cheng_2019_CVPR] shows that DIP is asymptotically equivalent to a stationary Gaussian process prior and introduces stochastic gradient Langevin dynamics (SGLD) [SGLD_19]. Deep decoder [Deep_heckel_2018_ICLR] utilizes an under-parameterized network based on the fact that overfitting is related to model complexity. Inspired by this, we systematically analyze the fitting of a network but improve the performance of DIP without sacrificing the network size.

Recently, Self2Self (S2S) [S2S_Quan_2020_CVPR] introduced self-supervised learning based on dropout and ensembling. Owing to the model uncertainty from dropout, S2S generates multiple independent denoised instances and averages the outputs for a low-variance solution. It outperforms existing solutions but needs extensive iterations with a very low learning rate due to dropout. In addition, there is an approach that combines SURE [stein_origin] with DIP [DIPSURE_metzler2020unsupervised] (DIP-SURE). It shares similarity with our work in that both use SURE, but we further extend it with the proposed 'stochastic temporal ensembling,' which deviates from the original SURE formulation. Please find further discussion in Sec. 4.1.

2.3 Effective degrees of freedom

Effective degrees of freedom (DF) [df2004Efron, Tibshirani2014DoF] provides a quantitative analysis of the amount of fitting of a model to the training data. Efron shows that an estimate of the optimism is the difference between the error on test and training data, and relates it to a measure of model complexity deemed the effective degrees of freedom [df2004Efron]. Intuitively, it reflects the effective number of parameters used by a model in producing the fitted output [Tibshirani2014DoF]. We use the notion of DF to analyze and detect the overfitting of a network, and to motivate our method.

2.4 Stein’s unbiased risk estimator (SURE)

Stein's unbiased risk estimator [stein_origin] is a risk estimator for a Gaussian random variable. It is a useful tool for selecting a model or hyper-parameters in denoising problems, since it guarantees an unbiased risk estimate without a target vector [Donoho95adaptingto, Zhang98AdaptiveSURErisk]. The analytic solution for SURE is only available under limited conditions, e.g., non-local means or linear filters [Ville2009Nonlocal, NonlocalSURE2011Ville]. When a closed-form solution is not available, Ramani et al. [MCSURE_Ramani_2008_TIP] proposed a Monte-Carlo SURE (MC-SURE) method to determine near-optimal parameters based on a brute-force search of the parameter space. As SURE-based methods are limited to Gaussian noise [SURE_2018_NIPS], several works extend them to other types of noise, including Poisson [PURE2011Luisier], Poisson-Gaussian [PGURE2014Montagner], the exponential family [GSURE2009Eldar], or non-parametric noise models [BaysianSupervision2007Sch]. We also modify our objective to extend our method to Poisson noise following [PGURE2014Montagner, PURE2011Luisier, PURE2018Soltanayev] (Sec. 4.3).

3 Preliminaries

Deep image prior (DIP).

Let a noisy image $y \in \mathbb{R}^n$ be modeled as

$$y = x + \epsilon, \qquad (1)$$

where $x$ is the noiseless image that one would like to recover and $\epsilon$ is Gaussian noise such that $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, where $I$ is an identity matrix. Denoising can be formulated as the problem of predicting the unknown $x$ from the known noisy observation $y$. Ulyanov et al. [DIP_2020_IJCV] argued that a network architecture itself naturally encourages restoring the original image from a degraded image, and named this the deep image prior (DIP). Specifically, DIP optimizes a convolutional neural network $f_\theta$ with parameters $\theta$ by a simple least-squares loss:

$$\min_\theta \|f_\theta(z) - y\|_2^2, \qquad (2)$$

where $z$ is a random variable that is independent of $y$. If $f_\theta$ has enough capacity (i.e., a sufficiently large number of parameters or architecture size) to fit the noisy image $y$, the output of the model will be equal to $y$, which is not desirable. DIP therefore uses early stopping to obtain the result with the best PSNR, which requires the clean image.
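As a toy illustration of this failure mode, consider an over-parameterized linear "network" trained by gradient descent on the plain least-squares loss: with enough capacity it fits the noisy target exactly, which is why early stopping is needed. This is a minimal sketch of ours (the random feature matrix `Z`, the sizes, and the learning rate are illustrative choices, not the paper's setup):

```python
import numpy as np

# Over-parameterized toy model f_theta(z) = Z @ W, with a fixed random input
# encoded by Z (n observations, p >> n parameters), trained on the noisy y.
rng = np.random.default_rng(3)
n, p, sigma = 30, 200, 0.5

x = np.sin(np.linspace(0, 2 * np.pi, n))       # clean 1-D "image"
y = x + sigma * rng.standard_normal(n)         # noisy observation y = x + eps
Z = rng.standard_normal((n, p)) / np.sqrt(p)   # fixed random input features
W = np.zeros(p)                                # parameters theta

lr = 0.5
for _ in range(500):
    W -= lr * (Z.T @ (Z @ W - y))              # gradient of ||f(z) - y||^2 / 2

train_mse = np.mean((Z @ W - y) ** 2)          # -> essentially 0: fits the noise
clean_mse = np.mean((Z @ W - x) ** 2)          # stays around the noise level
print(train_mse, clean_mse)
```

Because the model has far more parameters than observations, the loss to the noisy target is driven to zero while the error to the clean signal remains at roughly the noise level, mirroring why DIP without early stopping fails for denoising.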

Effective degrees of freedom for DIP.

The effective degrees of freedom [df2004Efron, Tibshirani2014DoF] quantifies the amount of fitting of a model to training data. We analyze the training of DIP by the effective degrees of freedom (DF) in Eq. 3 as a tool for monitoring overfitting to the given noisy image. The DF for an estimator $f(u)$ with input $u$ can be defined as follows [hastie1990generalized]:

$$\mathrm{DF}(f(u)) = \frac{1}{\sigma^2}\sum_{i=1}^{n}\mathrm{Cov}\left(y_i,\, f(u)_i\right), \qquad (3)$$

where $f$ and $y$ are a model (e.g., a neural network) and the noisy image, respectively, and $\sigma$ is the standard deviation of the noise. $y_i$ and $f(u)_i$ indicate the $i$-th element of the corresponding vectors. For example, if the input to $f$ is the noisy image $y$, i.e., $u = y$, it is the DF for DIP. Note that $f$ can take any input, and we use $y$ (instead of $z$) for our formulation.

Interestingly, the DF is closely related to the notion of optimism of an estimator $f$, which is defined by the difference between the test error and the train error [hastie1990generalized, Tibshirani2014DoF] as:

$$\mathrm{Opt}(f) = \mathbb{E}\left[L(y', f(y))\right] - \mathbb{E}\left[L(y, f(y))\right], \qquad (4)$$

where $L$ is the mean squared error (MSE) loss and $y'$ is another realization from the model (i.e., Eq. 1 with a different $\epsilon$) that is independent of $y$. In [Tibshirani2014DoF], it is shown that $\mathrm{Opt}(f) = \frac{2\sigma^2}{n}\,\mathrm{DF}(f(y))$. Thus, combining with Eq. 3, it is straightforward to show that

$$\mathrm{DF}(f(y)) = \frac{n}{2\sigma^2}\left(\mathbb{E}\left[L(y', f(y))\right] - \mathbb{E}\left[L(y, f(y))\right]\right). \qquad (5)$$
It is challenging to compute the covariance since $f$ is nonlinear (i.e., a neural network) and gradually changes during optimization, and the expectation requires many pairs of noisy and clean (ground-truth) images to compute (note that it is an estimate). Here, we introduce a simply approximated degrees of freedom with a single ground truth and call it $\mathrm{DF}_{GT}$, derived as follows:

$$\mathrm{DF}_{GT}(f(y)) = \frac{n}{2\sigma^2}\left(L(x, f(y)) - L(y, f(y)) + \sigma^2\right). \qquad (6)$$

We describe a simple proof of the estimation in the supplementary material.

A large DF implies overfitting to the given noisy image $y$, which is not desirable. If DIP fits to the clean image $x$, the DF becomes close to 0; the more the DIP fits to $y$, the larger the DF. We use $\mathrm{DF}_{GT}$ to analyze the DIP optimization in the empirical studies in Sec. 5.1.
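The single-ground-truth approximation above can be checked numerically. For a linear smoother $f(y) = Sy$, the exact DF is $\mathrm{tr}(S)$, and the approximation should average to that value over noise realizations. A minimal sketch under our own toy choices of $S$, signal, and sizes (not the paper's network):

```python
import numpy as np

# Monte-Carlo check of the DF_GT-style approximation against the exact DF of a
# linear smoother f(y) = S y, which is trace(S).
rng = np.random.default_rng(1)
n, sigma, trials = 50, 0.5, 20000

S = 0.5 * np.eye(n) + 0.25 * (np.eye(n, k=1) + np.eye(n, k=-1))  # trace = 25
x = np.sin(np.linspace(0, 4 * np.pi, n))  # clean signal (toy)

vals = []
for _ in range(trials):
    y = x + sigma * rng.standard_normal(n)
    f = S @ y
    L_x = ((x - f) ** 2).mean()  # MSE to the clean image
    L_y = ((y - f) ** 2).mean()  # MSE to the noisy image
    vals.append(n / (2 * sigma**2) * (L_x - L_y + sigma**2))

print(np.mean(vals), np.trace(S))  # the average is close to trace(S)
```

A single realization gives a noisy estimate; averaging over realizations recovers the true degrees of freedom, consistent with Eq. 5.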

4 Approach

To prevent the overfitting of DIP, we aim to suppress the DF (Eq. 3) during the optimization without access to the ground-truth clean image $x$. In Eq. 3, computing the DF amounts to summing the covariances between each element of the noisy image $y$ and the model output $f(y)$. There are a number of techniques that simply approximate this covariance computation in the statistical learning literature, such as AIC [AIC1973akaike], BIC [BIC1978Schwarz], and Stein's unbiased risk estimator (SURE) [stein_origin]. Both AIC and BIC, however, approximate the DF by counting the number of parameters of a model, so for the usual over-parameterized deep neural networks, approximations based on them could be incorrect [modelselectionNN1999Ulrich]. Note that $\mathrm{DF}_{GT}$ cannot be used for optimizing the model because it needs the ground-truth clean image $x$.

Here, we propose to use SURE to suppress the DF by deriving the DIP formulation using Stein's lemma. Stein's lemma for a multivariate Gaussian vector $y \sim \mathcal{N}(x, \sigma^2 I)$ is [stein_origin]:

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}\mathrm{Cov}\left(y_i, f(y)_i\right) = \mathbb{E}\left[\sum_{i=1}^{n}\frac{\partial f(y)_i}{\partial y_i}\right]. \qquad (7)$$

It simplifies the computation of the DF from the covariances between $y$ and $f(y)$ to the expected partial derivatives at each point, which can be well approximated in a number of computationally efficient ways [MCSURE_Ramani_2008_TIP, NEWDIV_Soltanayev_2020_ICASSP]. The SURE loss, denoted $\mathcal{L}_{\mathrm{SURE}}$, consists of Eq. 7 and the DIP loss (Eq. 2) with a modification of its input (from $z$ to $y$):

$$\mathcal{L}_{\mathrm{SURE}}(\theta) = \|y - f_\theta(y)\|_2^2 - n\sigma^2 + 2\sigma^2\sum_{i=1}^{n}\frac{\partial f_\theta(y)_i}{\partial y_i}. \qquad (8)$$

While the vanilla DIP loss encourages the output of the model to fit the noisy image $y$, Eq. 8 encourages it to approximately fit the clean image $x$ without access to $x$.

However, it is still computationally demanding to use Eq. 8 as a loss for optimization with any gradient-based algorithm due to the divergence term [MCSURE_Ramani_2008_TIP]. A Monte-Carlo approximation of Eq. 8 [MCSURE_Ramani_2008_TIP] can be a remedy for the computation cost, but it introduces a hyper-parameter $\epsilon$ that has to be selected properly for the best performance on different network architectures and/or datasets. To avoid tuning the hyper-parameter $\epsilon$, we employ an alternative Monte-Carlo approximation of the divergence term [NEWDIV_Soltanayev_2020_ICASSP]:

$$\sum_{i=1}^{n}\frac{\partial f(y)_i}{\partial y_i} \approx b^\top J_f(y)\, b, \qquad (9)$$

where $b$ is a standard normal random vector, $b \sim \mathcal{N}(0, I)$, and the $(i, j)$-th element of the Jacobian $J_f(y)$ is $\partial f(y)_i / \partial y_j$. We denote this 'estimated degrees of freedom by Monte-Carlo' by $\mathrm{DF}_{MC}$ and will use it to monitor the DIP optimization without the PSNR against the clean ground-truth image (Sec. 4.2).
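A minimal numerical sketch of this Monte-Carlo divergence estimate: for a linear map $f(y) = Sy$ the true divergence is $\mathrm{tr}(S)$, and a randomized estimate recovers it in expectation. We use the finite-difference form $b^\top (f(y+\epsilon b) - f(y))/\epsilon$ here for a dependency-free illustration; [NEWDIV_Soltanayev_2020_ICASSP] instead computes the Jacobian-vector product by automatic differentiation, removing the $\epsilon$ hyper-parameter (and for a linear $f$ the two coincide):

```python
import numpy as np

# Monte-Carlo divergence: E_b[ b^T (f(y + eps*b) - f(y)) / eps ] ~ trace(J_f).
# Toy linear f(y) = S y, so the exact divergence is trace(S).
rng = np.random.default_rng(1)
n, eps, samples = 64, 1e-3, 4000

S = 0.5 * np.eye(n) + 0.25 * (np.eye(n, k=1) + np.eye(n, k=-1))  # trace = 32
f = lambda v: S @ v
y = rng.standard_normal(n)  # any evaluation point

est = 0.0
for _ in range(samples):
    b = rng.standard_normal(n)                    # b ~ N(0, I)
    est += b @ (f(y + eps * b) - f(y)) / eps      # randomized trace probe
est /= samples

print(est, np.trace(S))  # estimate close to trace(S) = 32
```

In DIP-sized networks a handful of such probes per iteration already gives a usable running estimate of the divergence term, which is what makes the SURE-style loss practical.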

4.1 Stochastic temporal ensembling

To improve the fitting accuracy, DIP suggests several methods, including noise regularization and exponential moving averaging [DIP_2020_IJCV]. We propose 'stochastic temporal ensembling (STE)' for better fitting performance by adapting these methods to our objective.

Noise regularization on DIP.

DIP shows that adding extra temporal noise to the input of the function $f$ at each iteration improves performance for inverse problems, including image denoising [DIP_2020_IJCV]. That is, a noise vector $n_b$ is added to the input of the function at every iteration of the optimization:

$$\min_\theta \|f_\theta(z + n_b) - y\|_2^2, \qquad (10)$$

where $z$ is fixed but $n_b$ is sampled from a Gaussian distribution with zero mean and standard deviation $\sigma_b$ at every iteration. To estimate the DF by Eq. 8, we replace the input of the model $f$, $z$, with the noisy image $y$ (from Eq. 3 to Eq. 7). Interestingly, Eq. 10 then becomes similar to the denoising auto-encoder (DAE), which prevents a model from learning a trivial solution by perturbing the input of $f$ [DAE2008Vincent].

Meanwhile, the contractive auto-encoder (CAE) [CAE2011Rifai] minimizes the Frobenius norm of the Jacobian, while SURE and its variants minimize the trace of the Jacobian (Eq. 9) and thus suppress the DF. Since we assume that the different realizations of noise are independent, the off-diagonal elements of the Jacobian are zero, so the CAE is equivalent to SURE in terms of suppressing the DF. Alain et al. [regautoencoder2014alain] later show that the DAE is a special case of the CAE when $\sigma_b \to 0$. We can rewrite Eq. 10 using the CAE formulation as:

$$\mathbb{E}_{n_b}\|f_\theta(z + n_b) - y\|_2^2 = \|f_\theta(z) - y\|_2^2 + \sigma_b^2 \left\|\frac{\partial f_\theta(z)}{\partial z}\right\|_F^2 + o(\sigma_b^2) \qquad (11)$$

when $\sigma_b \to 0$, where $o(\sigma_b^2)$ is a higher-order error term from the Taylor expansion. Thus, solving this optimization problem is equivalent to penalizing an increase of the DF. Here, the noise level $\sigma_b$ serves as a hyper-parameter determining performance, and DIP improves its performance by using multiple levels of $\sigma_b$ during optimization. Thus, we further propose to model $\sigma_b$ as a uniform random variable instead of an empirically chosen hyper-parameter, such that

$$\sigma_b \sim \mathcal{U}(0, \sigma_{\max}). \qquad (12)$$
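The CAE-style expansion of the noise-regularized loss can be verified numerically: for a linear map the identity is exact (no higher-order terms), with the Jacobian penalty equal to $\sigma_b^2 \|S\|_F^2$. A sketch under our own toy choices of the smoother, inputs, and sizes:

```python
import numpy as np

# Check: E||f(z + n_b) - y||^2 = ||f(z) - y||^2 + sigma_b^2 * ||J||_F^2 (+ h.o.t.)
# For a linear f(v) = S v the relation holds exactly and ||J||_F^2 = ||S||_F^2.
rng = np.random.default_rng(5)
n, sigma_b, trials = 40, 0.1, 50000

S = 0.5 * np.eye(n) + 0.25 * (np.eye(n, k=1) + np.eye(n, k=-1))
z = rng.standard_normal(n)  # fixed input
y = rng.standard_normal(n)  # fixed target

base = ((S @ z - y) ** 2).sum()
penalty = sigma_b**2 * (S**2).sum()  # sigma_b^2 * ||S||_F^2

# Empirical average of the perturbed-input loss over fresh n_b draws
perturbed = np.mean([((S @ (z + sigma_b * rng.standard_normal(n)) - y) ** 2).sum()
                     for _ in range(trials)])
print(perturbed, base + penalty)  # agree up to Monte-Carlo error
```

This makes concrete why input-noise regularization acts as a Jacobian (and hence DF) penalty: the extra term the noise induces is exactly the squared Frobenius norm of the model's Jacobian, scaled by the noise variance.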


Exponential moving average.

DIP further shows that averaging the restored images obtained over the last iterations improves denoising performance [DIP_2020_IJCV], which we refer to as the 'exponential moving average (EMA).' It can be thought of as an analogue of ensembling [datadistill2018].
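A minimal sketch of such an exponential moving average over per-iteration outputs (the decay value 0.99 is an illustrative choice of ours, not the paper's setting):

```python
import numpy as np

# EMA over restored images: the running output is an exponentially weighted
# average of per-iteration outputs, mimicking an ensemble over iterations.
def ema_update(avg, new, alpha=0.99):
    """One EMA step: avg <- alpha * avg + (1 - alpha) * new."""
    return alpha * avg + (1 - alpha) * new

# Fake per-iteration outputs f(z): constant images with values 1..5
outputs = [np.full(4, t, dtype=float) for t in range(1, 6)]
avg = outputs[0]
for out in outputs[1:]:
    avg = ema_update(avg, out)
print(avg[0])  # a value between 1 and 5, weighted toward earlier outputs
```

With a decay this close to 1, the EMA changes slowly and smooths out the iteration-to-iteration jitter of the restored image, which is the low-variance "ensembling" effect referred to above.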

Stochastic temporal ensembling.

Leveraging the noise regularization and the EMA, we propose 'stochastic temporal ensembling (STE)' to improve the fitting performance of the DIP loss. Specifically, we modify our formulation (Eq. 8) to allow two noise observations instead of one: $y$ as the target of the MSE loss, and $\tilde{y} = y + n_b$ as the input of the model:

$$\mathcal{L}_{\mathrm{STE}}(\theta) = \|y - f_\theta(\tilde{y})\|_2^2 - n\sigma^2 + 2\sigma^2\sum_{i=1}^{n}\frac{\partial f_\theta(\tilde{y})_i}{\partial \tilde{y}_i}, \qquad (13)$$

where $\sigma$ is the known noise level of $y$ (same as Eq. 1), and $y_i$ and $f_\theta(\tilde{y})_i$ are the $i$-th elements of the vectors $y$ and $f_\theta(\tilde{y})$, respectively. Interestingly, Eq. 13 is equivalent to the formulation of extended SURE (eSURE) [eSURE2019Neurips], which is shown to be a better unbiased estimator of the MSE with respect to the clean image $x$. But there are a number of critical differences between ours and [eSURE2019Neurips]. First, our method does not require training on a dataset, while Zhussip et al. [eSURE2019Neurips] require training with many noisy images. Second, because Zhussip et al. [eSURE2019Neurips] use a fixed instance of the second realization, there is no regularization effect from $n_b$ (Eq. 10), which gives us a reasonable performance gain (see Sec. 5.2). This is our final objective function for DIP, which stops automatically by a stopping criterion described in the following section.
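The unbiasedness behind Eq. 13 can be sanity-checked on a linear model: with the given noisy image as target and a re-perturbed copy as input, the STE/eSURE-style objective matches the oracle MSE to the clean image in expectation. A hedged sketch with our own toy smoother and noise levels (for a linear $f(v) = Sv$ the divergence term is exactly $\mathrm{tr}(S)$):

```python
import numpy as np

# Compare E[ ||y - f(y+n_b)||^2 - n*sigma^2 + 2*sigma^2*div ]  (STE-style)
# against E[ ||x - f(y+n_b)||^2 ]  (oracle MSE to the clean image).
rng = np.random.default_rng(2)
n, sigma, sigma_b, trials = 64, 0.3, 0.2, 20000

S = 0.5 * np.eye(n) + 0.25 * (np.eye(n, k=1) + np.eye(n, k=-1))
x = np.cos(np.linspace(0, 2 * np.pi, n))  # clean signal (toy)

ste, oracle = [], []
for _ in range(trials):
    y = x + sigma * rng.standard_normal(n)      # given noisy image
    y_b = y + sigma_b * rng.standard_normal(n)  # re-perturbed input
    out = S @ y_b
    ste.append(((y - out) ** 2).sum() - n * sigma**2 + 2 * sigma**2 * np.trace(S))
    oracle.append(((x - out) ** 2).sum())

print(np.mean(ste), np.mean(oracle))  # the two means nearly coincide
```

The key point the sketch demonstrates is that the correction terms ($-n\sigma^2$ plus the divergence term) cancel the bias introduced by using the noisy image as the target, even though the input perturbation $n_b$ is resampled each iteration.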

4.2 Zero-crossing stopping criterion

SURE works well if the model satisfies the smoothness condition, i.e., it admits a well-defined second-order Taylor expansion [SURE_2018_NIPS, DIPSURE_metzler2020unsupervised]. While a typical learning-based denoiser satisfies this smoothness condition [SURE_2018_NIPS, eSURE2019Neurips], the DIP network 'fits' to a target image (a noisy image in [DIP_2018_CVPR, DIP_2020_IJCV] and an approximate clean image in our objective), so there is no guarantee that the smoothness condition is satisfied, especially once it has converged.

We observed that the divergence term in our formulation (Eq. 13) increases at early iterations (i.e., before convergence) while it starts to diverge at later iterations (i.e., after convergence). This observation is consistent across all our experiments. Note that this divergence phenomenon was not reported in [DIPSURE_metzler2020unsupervised], presumably because the DIP network with the SURE loss had not fully converged to recover the fine details within the insufficient number of iterations used there. Based on this observation for our proposed objective, we propose the 'zero-crossing stopping criterion': stop the iteration when our objective function (Eq. 13) crosses zero.
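The stopping rule itself is simple to implement: track the loss and stop at the first iteration where it reaches or crosses zero. A minimal sketch with a synthetic loss trace (illustrative values of ours, not measured ones):

```python
# Zero-crossing stopping: return the first step whose loss value is <= 0;
# if the loss never crosses zero, fall back to the final step.
def stop_at_zero_crossing(losses):
    for t, loss in enumerate(losses):
        if loss <= 0.0:
            return t
    return len(losses) - 1

# Synthetic trace: the MC divergence term eventually drags the loss below zero
loss_trace = [5.0, 2.1, 0.9, 0.3, 0.05, -0.4, -3.0]
print(stop_at_zero_crossing(loss_trace))  # -> 5
```

In practice the check runs once per optimization iteration on the scalar STE loss, so the criterion adds no meaningful overhead.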

Solution trajectory.

Figure 2: Illustration of the solution trajectories of ours and DIP. We consider the problem of reconstructing an image from a degraded measurement. DIP finds its optimal stopping point by early stopping. Ours changes DIP's solution trajectory (from black to orange); its stopping point is defined by a loss value (Sec. 4.2) and is close to the noiseless solution $x$.

To help understand the difference between our method and DIP in the optimization procedure, similar to Fig. 3 in [DIP_2020_IJCV], we illustrate the DIP image restoration trajectory together with that of our method in Fig. 2. DIP degrades the quality of the restored images by overfitting; to obtain a solution close to the clean ground-truth image, DIP uses early stopping. Our formulation has a different training trajectory (orange) from DIP (black) and automatically stops the optimization by the zero-crossing stopping criterion (orange marker). We argue that the resulting image from our formulation is in general closer to the clean image than the solution by DIP and preserves more high-frequency details (Sec. 5.3), thanks to a better target to fit (an approximation of the clean image rather than a noisy one) and our principled stopping criterion that uses no ground-truth image. We empirically analyze this phenomenon with our proposed $\mathrm{DF}_{GT}$ and compare it to $\mathrm{DF}_{MC}$ in Sec. 5.1 and the supplementary material.

4.3 Extension to Poisson noise

As SURE is limited to Gaussian noise [SURE_2018_NIPS], there have been several attempts to extend it to other types of noise [PGURE2014Montagner, GSURE2009Eldar, BaysianSupervision2007Sch]. Here, we extend our formulation to Poisson noise, as it is a useful noise model for low-light conditions. We modify our formulation (Eq. 13) to use the Poisson unbiased risk estimator (PURE) [PURE2011Luisier, PGURE2014Montagner, Kim2020PURECT] for Poisson noise as follows:

$$\mathcal{L}_{\mathrm{PURE}}(\theta) = \|y - f_\theta(\tilde{y})\|_2^2 - \mathbf{1}^\top y + \frac{2}{\delta}\, \tilde{b}^\top \left(y \odot \left(f_\theta(\tilde{y} + \delta \tilde{b}) - f_\theta(\tilde{y})\right)\right), \qquad (14)$$

where $\tilde{b}$ is an $n$-dimensional binary random vector whose elements take $-1$ or $1$ with probability 0.5 each, $\delta$ is a small positive number, and $\odot$ is the Hadamard product. We empirically validate the Poisson extension in Sec. 5.4.
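The Poisson counterpart of Stein's lemma that underlies PURE reduces, in the scalar case, to the identity $\mathbb{E}[y\,g(y)] = x\,\mathbb{E}[g(y+1)]$ for $y \sim \mathrm{Poisson}(x)$. It can be checked by Monte-Carlo; this is our own illustration of the underlying identity, independent of the exact form of Eq. 14:

```python
import numpy as np

# Monte-Carlo check of the Poisson-Stein identity E[y*g(y)] = x*E[g(y+1)],
# which is the ingredient that lets PURE trade covariances for finite
# differences of the estimator (no clean image needed).
rng = np.random.default_rng(4)
x, trials = 3.0, 400000
y = rng.poisson(x, trials)

g = lambda v: np.sqrt(v + 1.0)  # arbitrary bounded-growth test function
lhs = np.mean(y * g(y))
rhs = x * np.mean(g(y + 1.0))
print(lhs, rhs)  # the two sides agree up to Monte-Carlo error
```

The same mechanism, applied coordinate-wise and combined with the random binary probe $\tilde{b}$, is what makes the divergence-like term in the PURE loss computable from the noisy image alone.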

5 Experiments

(a) Convergence analysis
(b) Effect of the STE
(c) Zero-crossing stopping criterion
Figure 3: Learning analysis with $\mathrm{DF}_{GT}$. (a) As optimization progresses, the degrees of freedom of DIP increase as it fits the noisy observation. Ours does not overfit to the noisy observation and shows consistently better performance thanks to Stein's unbiased risk estimation. The green dashed line indicates the intersection between DIP and ours in $\mathrm{DF}_{GT}$. (b) The proposed method makes optimization more stable than the single-instance one, and this tendency is also observed in PSNR to $x$ and PSNR to $x$ (EMA). (c) The Monte-Carlo estimation of eSURE is stable for a considerable number of steps, but then the estimation error soars. This usually happens when the loss is already close to zero (the green dashed line in the plot). Thus, we propose to stop the optimization as soon as the loss reaches zero.

Implementation details.

For the noise level $\sigma$, we use $\sigma \in \{15, 25, 50\}$, following [DnCNN_2017_TIP], and $\sigma = 25$ for in-depth analysis. For $\sigma_{\max}$ in Eq. 12, we set it to the same value as $\sigma$. The RAdam optimizer [liu2019radam] is used for training with a learning rate of 0.1. Details, including network architectures and datasets, are in the supplementary material.

Evaluation metrics.

We use peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and learned perceptual image patch similarity (LPIPS) [LPIPS_zhang2018perceptual]. The PSNR is widely used in the denoising literature [DnCNN_2017_TIP, zhang2018ffdnet, IRCNNzhang_2017_learning, N3Net_Pl_2018, jia2019focnet], but it has recently been argued that it is not an ideal metric, as it favors over-smoothed results [LPIPS_zhang2018perceptual, Ledig2017SRGAN]. For this reason, we also compare the algorithms with LPIPS as a surrogate for a human perceptual study. We use the publicly available pre-trained weights based on AlexNet released by the authors [LPIPS_zhang2018perceptual]. We additionally report the peak PSNR during the optimization of our method as a reference (denoted 'Ours*').

5.1 Convergence analysis by $\mathrm{DF}_{GT}$

Fig. 3(a) shows $\mathrm{DF}_{GT}$, PSNR to $y$, and PSNR to $x$; $\mathrm{DF}_{GT}$ is the effective degrees of freedom with ground truth, and PSNR to $y$ and PSNR to $x$ refer to the PSNR from the model output to $y$ and $x$, respectively. As optimization progresses, the degrees of freedom of DIP increase gradually along with PSNR to $y$, but the PSNR to $x$ of DIP decreases from around iteration 1,300. In contrast, for ours, $\mathrm{DF}_{GT}$ rises at the beginning of the iterations and then stays at a certain value. Interestingly, the best stopping point for DIP is near the intersection between DIP and our method in $\mathrm{DF}_{GT}$. This implies that the converged value of our method is near the optimal solution of DIP (the stopping point of DIP in Fig. 2).

Fig. 3(b) shows the trajectories of two objectives in $\mathrm{DF}_{GT}$: (1) ours w/o STE and (2) ours. As shown in Sec. 4.1, STE suppresses the DF by minimizing the Frobenius norm of the Jacobian, which is similar to the trace of the Jacobian (the DF from Stein's lemma). Accordingly, ours suppresses the DF better than ours w/o STE in $\mathrm{DF}_{GT}$ (Fig. 3(b), top). This tendency is also observed in 'PSNR to $x$' and 'PSNR to $x$ (EMA).'

The optimization progress of $\mathrm{DF}_{GT}$ and $\mathrm{DF}_{MC}$ is shown in Fig. 3(c). $\mathrm{DF}_{MC}$ starts underestimating $\mathrm{DF}_{GT}$ after a certain iteration, at which point Eq. 13 ('Loss') becomes zero and the PSNR reaches its highest value. Thus, our proposed zero-crossing stopping criterion detects when $\mathrm{DF}_{MC}$ fails to estimate $\mathrm{DF}_{GT}$, i.e., when the loss crosses zero.

5.2 Quantitative analysis

Method Overfit Prev. PSNR (↑) SSIM (↑) LPIPS (↓)
DIP [DIP_2020_IJCV] Early stopping 29.96 0.940 0.152
Deep Decoder [Deep_heckel_2018_ICLR] Under-param. 26.94 0.889 0.377
DIP-RED [DIPRED_Mataev_2019_ICCV] Plug-and-play 30.88 0.932 0.197
GP-DIP [GPDIP_Cheng_2019_CVPR] SGLD 29.99 0.948 0.251
DIP-SURE* [DIPSURE_metzler2020unsupervised] ZCSC 30.33 0.941 0.149
Ours w/o STE [eSURE2019Neurips] ZCSC 31.34 0.955 0.108
Ours ZCSC 31.54 0.953 0.107
Table 1: Comparison to DIP variants on the CSet9 dataset ($\sigma = 25$). ↑: higher is better; ↓: lower is better. 'Overfit Prev.' refers to the overfitting prevention method; 'Under-param.' refers to 'under-parameterized'; 'ZCSC' refers to the proposed zero-crossing stopping criterion. (Best values in bold; second best underlined.) 'DIP-SURE*' refers to [DIPSURE_metzler2020unsupervised] with ZCSC, for fair comparison among the methods using the SURE formulation.
Dataset σ | PSNR (↑): BM3D [BM3D_2007_Dabov], DIP [DIP_2020_IJCV], S2S [S2S_Quan_2020_CVPR], Ours (Ours*) | SSIM (↑): same four methods | LPIPS (↓): same four methods
Color Image Datasets
CSet9 15 33.83 31.83 33.24 33.83 (34.07) 0.972 0.960 0.968 0.973 (0.975) 0.111 0.114 0.135 0.070 (0.077)
25 31.68 29.96 31.72 31.54 (31.88) 0.956 0.940 0.956 0.953 (0.960) 0.161 0.152 0.173 0.107 (0.118)
50 28.92 27.42 29.25 28.90 (29.03) 0.922 0.900 0.928 0.923 (0.930) 0.267 0.291 0.235 0.181 (0.200)
CBSD68 15 33.51 31.48 32.78 33.43 (33.56) 0.961 0.941 0.956 0.961 (0.963) 0.081 0.081 0.102 0.060 (0.057)
25 30.70 28.66 30.67 30.67 (30.86) 0.932 0.900 0.932 0.932 (0.936) 0.148 0.156 0.147 0.102 (0.100)
50 27.37 25.70 27.62 27.43 (27.58) 0.871 0.832 0.879 0.873 (0.881) 0.298 0.329 0.244 0.194 (0.198)
Kodak 15 34.41 32.17 33.70 34.35 (34.49) 0.962 0.941 0.958 0.961 (0.963) 0.104 0.105 0.118 0.080 (0.077)
25 31.82 29.68 31.79 31.60 (31.98) 0.938 0.907 0.939 0.932 (0.941) 0.161 0.173 0.159 0.117 (0.118)
50 28.62 26.77 29.08 28.58 (28.76) 0.886 0.843 0.898 0.882 (0.892) 0.287 0.338 0.235 0.203 (0.209)
McM 15 34.05 32.54 33.92 34.13 (34.35) 0.969 0.956 0.968 0.967 (0.970) 0.068 0.067 0.089 0.053 (0.052)
25 31.66 30.09 32.15 31.89 (31.98) 0.950 0.929 0.955 0.950 (0.953) 0.107 0.123 0.117 0.085 (0.085)
50 28.51 27.06 29.29 28.83 (28.82) 0.910 0.882 0.924 0.913 (0.918) 0.207 0.252 0.178 0.151 (0.162)
Gray-scale Image Datasets
BSD68 15 31.07 28.83 30.62 30.98 (31.21) 0.872 0.812 0.858 0.873 (0.882) 0.147 0.163 0.163 0.090 (0.099)
25 28.57 26.59 28.60 28.40 (28.78) 0.801 0.734 0.801 0.800 (0.818) 0.226 0.262 0.197 0.157 (0.159)
50 25.61 24.13 25.70 25.75 (25.81) 0.686 0.625 0.687 0.696 (0.708) 0.363 0.443 0.313 0.262 (0.282)
Set12 15 32.36 30.12 32.07 32.20 (32.26) 0.895 0.837 0.889 0.891 (0.894) 0.117 0.132 0.139 0.084 (0.092)
25 29.93 27.54 30.02 29.79 (29.76) 0.850 0.776 0.849 0.844 (0.848) 0.159 0.218 0.159 0.122 (0.137)
50 26.71 24.67 26.49 26.60 (26.47) 0.768 0.683 0.734 0.755 (0.760) 0.262 0.361 0.232 0.208 (0.228)
Table 2: Comparison to the state of the art in single-image denoising. (↑) denotes higher is better and (↓) denotes lower is better. Best performance in bold; second best underlined. For DIP, we report the peak PSNR scores during the optimization.

Comparison to DIP variants.

Table 1 shows the denoising results of several DIP-based methods. Deep decoder (DD) [Deep_heckel_2018_ICLR] shows the worst performance in all metrics; we believe DD mitigates the overfitting problem with an under-parameterized network at the cost of performance. GP-DIP [GPDIP_Cheng_2019_CVPR] outperforms DIP in PSNR and SSIM. It uses SGLD [SGLD_19] to sample multiple instances from the posterior distribution and averages them, which is similar to Self2Self [S2S_Quan_2020_CVPR]. This strategy may help the PSNR score, but it can lose the texture of images, which leads to a relatively poor LPIPS score (see the next section for more discussion). DIP-RED shows the best result apart from our method and its ablated version; its plug-and-play overfitting prevention uses another denoising method as a prior. A plug-and-play approach might also work with our method, but it is beyond the scope of this paper. Note that all the above methods except ours, DIP-SURE*, and DIP stop optimization at the predefined number of iterations provided in the authors' code.

In particular, both DIP-SURE* [DIPSURE_metzler2020unsupervised] and 'Ours w/o STE' [eSURE2019Neurips] are worse than ours even though they use the SURE formulation. We argue that this is because they use a single noise realization. In addition, the two are quite similar to each other, except that DIP-SURE* depends on $\epsilon$ as a hyper-parameter while 'Ours w/o STE' does not (Sec. 4). Needing no hyper-parameter tuning results in a noticeable gain for 'Ours w/o STE.' Note that the original DIP-SURE depends on early stopping by monitoring the PSNR with a clean image; for fair comparison, we apply our stopping criterion to it and denote it DIP-SURE*.

Comparison to the state of the arts.

Table 2 shows comparative results with other single-image denoising methods on six datasets (four color and two gray-scale). The compared methods include CBM3D [BM3D_2007_Dabov], DIP [DIP_2018_CVPR], and Self2Self (S2S) [S2S_Quan_2020_CVPR]. Except for CBM3D, all remaining methods are based on convolutional neural networks. For the network architecture of DIP, we use the same one as ours for a fair comparison. We use a slightly different network for S2S since it requires dropout instead of batch normalization.

Our method outperforms all other single-image denoising methods in LPIPS while showing comparable PSNR and SSIM. Ours* exhibits the best PSNR, outperforming all compared methods except S2S in the high-noise experiments. But Ours* loses some high-frequency details compared to ours (see LPIPS). We believe that this is due to the exponential moving average (EMA), as it alleviates the instability of training (i.e., a rough solution space) in a way that cannot be captured by PSNR. Ours performs especially well at low noise; we believe that the error of the MC estimation is smaller in the small-noise set-up. Nevertheless, our method exhibits excellent performance in LPIPS and SSIM in almost all setups.
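The output EMA behind the 'Ours*' variant can be sketched in a few lines; the decay value below is an illustrative assumption, not the paper's setting.

```python
import numpy as np

class OutputEMA:
    """Exponential moving average over per-iteration network outputs.

    A minimal sketch of the EMA used to stabilize predictions
    (the 'Ours*' variant); decay=0.99 is an assumed default.
    """
    def __init__(self, decay=0.99):
        self.decay = decay
        self.avg = None

    def update(self, output):
        out = np.asarray(output, dtype=np.float64)
        if self.avg is None:
            self.avg = out.copy()  # initialize with the first output
        else:
            # Blend the running average with the newest output.
            self.avg = self.decay * self.avg + (1.0 - self.decay) * out
        return self.avg
```

During optimization, `update` would be called on each iteration's denoised output, and the running average `avg` would serve as the final prediction.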

Figure 4: Qualitative comparisons (PSNR/LPIPS). First row: CBM3D 31.74/0.096, DIP* 29.46/0.095, S2S 31.59/0.088, Ours 31.61/0.061. Second row: CBM3D 27.82/0.245, DIP* 26.77/0.265, S2S 28.67/0.193, Ours 28.07/0.136. Best performance: bold. Second best: underlined. More results are in the supplement.

It is worth noting that S2S exhibits higher (worse) LPIPS than all other methods, especially at low noise, despite being ahead of DIP in PSNR. Considering that MSE is the sum of the squared bias and the variance, we argue that S2S achieves its impressive PSNR results with significantly reduced variance and increased squared bias (i.e., destroyed textural details). This is clearly observed in Fig. 1-(b), where we show the results of S2S with various numbers of ensembles. As the number of ensembles increases, the PSNR also increases at the cost of the LPIPS score. In contrast, our method achieves a much better trade-off between LPIPS and PSNR without ensembling.
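This bias–variance argument is easy to demonstrate numerically. The toy sketch below, with a box filter standing in for an over-smoothing denoiser (all signal and filter parameters are illustrative assumptions), shows that averaging many estimates shrinks the variance term of the MSE while the smoothing bias, which corresponds to lost texture, is untouched by averaging.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 512))  # clean 1-D "image"

def smooth(v, k=31):
    """Heavy box filter: a stand-in for an over-smoothing denoiser."""
    return np.convolve(v, np.ones(k) / k, mode="same")

# 64 independent smoothed estimates, mimicking an output ensemble.
estimates = np.stack([smooth(x + 0.5 * rng.standard_normal(x.size))
                      for _ in range(64)])

mse_single = np.mean((estimates[0] - x) ** 2)
mse_ensemble = np.mean((estimates.mean(axis=0) - x) ** 2)
bias_sq = np.mean((smooth(x) - x) ** 2)  # error averaging cannot remove

# Averaging cuts the variance term, so MSE (hence PSNR) improves,
# but the smoothing bias -- the lost texture -- remains.
print(mse_single, mse_ensemble, bias_sq)
```

The ensemble MSE drops far below the single-estimate MSE, yet it stays bounded below by the squared bias of the smoother, mirroring how S2S's ensembling improves PSNR while its perceptual quality (LPIPS) suffers.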

Moreover, the inference time of S2S on CSet9 is almost 35 hours without parallel processing, whereas ours takes only 4 hours. Further speed-ups of both S2S and our method are possible via parallel processing [S2S_Quan_2020_CVPR], but the gap would remain.

Although it is not quite fair to compare our method with learning-based ones, including DnCNN [DnCNN_2017_TIP], N2N [N2N_2018_ICML], HQ-N2V [HQN2V2019Neurips], and IRCNN [IRCNNzhang_2017_learning], as we only use a single noisy observation, we additionally compare with them in the supplementary material for the sake of space.

5.3 Qualitative analysis

We present examples of denoised images in Fig. 4. In the first row, we observe that the results of CBM3D and S2S are more over-smoothed (having fewer high-frequency details) than those of our method. DIP preserves textures but is much noisier than ours. Again, we observe that our results achieve a better trade-off between PSNR and LPIPS.

The second row has a higher noise level than the first row. S2S and CBM3D produce clean images with sharp edges, but they also make the English characters in the sign blurry. In contrast, our method preserves sharper details in the characters of the sign while noise is mostly suppressed. More qualitative results are in the supplement.

Noise scale | BM3D-VST [VST2013Makitalo] | DIP [DIP_2020_IJCV] | S2S [S2S_Quan_2020_CVPR] | Ours (Ours*)
 | 30.50 | 30.99 | 32.18 | 32.00 (31.94)
 | 21.57 | 23.54 | 22.84 | 24.87 (24.94)
 | 18.48 | 21.43 | 20.10 | 22.85 (22.90)
Table 3: Comparison to the state of the art on Poisson noise (PSNR (dB)). Best performance: bold. Second best: underlined.

5.4 Extension to Poisson noise

Poisson noise is likely to occur in low-light conditions such as microscopic imaging. In [PURE2018Soltanayev], MNIST images are used to simulate this scenario. We conduct experiments on single-image Poisson denoising and summarize the comparative results with BM3D-VST [VST2013Makitalo], DIP, and S2S in Table 3. Note that BM3D-VST is one of the most popular methods for Poisson denoising.

At low noise levels, the Poisson distribution becomes almost symmetric, similar to a Gaussian, so our method does not stand out there. At higher noise levels, however, DIP shows better results than classic methods such as BM3D with VST [VST2013Makitalo] and the state of the art, S2S, while our method outperforms all compared methods, including BM3D-VST.
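The claim that low-noise Poisson data is nearly Gaussian follows from the Poisson skewness being 1/sqrt(lambda): at a high photon count the distribution is almost symmetric, while at a low count it is strongly skewed. The short check below illustrates this; the rates and sample size are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(samples):
    """Sample skewness: third central moment over variance^1.5."""
    s = samples - samples.mean()
    return np.mean(s ** 3) / np.mean(s ** 2) ** 1.5

# High photon count (low noise): nearly symmetric, Gaussian-like.
high = rng.poisson(lam=100.0, size=200_000)
# Low photon count (high noise): strongly asymmetric.
low = rng.poisson(lam=1.0, size=200_000)

print(skewness(high))  # close to 1/sqrt(100) = 0.1
print(skewness(low))   # close to 1/sqrt(1) = 1.0
```

This is why Gaussian-oriented tools (and the Anscombe-style VSTs used by BM3D-VST) are most competitive in the low-noise regime, while the high-noise regime requires handling the asymmetric Poisson statistics directly.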

Fig. 5 shows qualitative results in the Poisson noise setup. We observe that the BM3D-VST images are considerably blurrier than those of other methods, and S2S also produces blurry images due to overfitting. DIP shows the second-best result thanks to early stopping. In contrast, our method denoises the holes in the images while preserving detailed texture, without requiring early stopping.

Figure 5: Qualitative comparison on Poisson noise (PSNR/LPIPS). Noise 18.81/0.053; BM3D-VST [BM3D_2007_Dabov] 19.87/0.040; DIP [DIP_2020_IJCV] 21.17/0.024; S2S [S2S_Quan_2020_CVPR] 20.09/0.040; Ours 22.99/0.020.

6 Conclusion

We investigate DIP for denoising through the notion of effective degrees of freedom to monitor overfitting to noise, and propose stochastic temporal ensembling (STE) and a zero-crossing stopping criterion to stop the optimization before it overfits, without a clean image. We significantly improve the performance of Gaussian denoising by DIP without manual early stopping and extend the method to Poisson denoising with PURE. Our empirical validation shows that the proposed method outperforms the state of the art in LPIPS by large margins with comparable PSNR and SSIM, evaluated on seven different datasets.


This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.2019R1C1C1009283) and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-01842, Artificial Intelligence Graduate School Program (GIST)), (No.2019-0-01351, Development of Ultra Low-Power Mobile Deep Learning Semiconductor With Compression/Decompression of Activation/Kernel Data, 17%), (No. 2021-0-02068, Artificial Intelligence Innovation Hub) and was conducted by Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD (UD190031RD). The work of SY Chun was supported by Basic Science Research Program through National Research Foundation of Korea (NRF) funded by Ministry of Education (NRF-2017R1D1A1B05035810).