1 Introduction
Deep neural networks have been widely used in many computer vision tasks, yielding significant improvements over conventional approaches since AlexNet
[Alexnet_2012_Alex]. However, image denoising has been one of the tasks in which conventional methods such as BM3D [BM3D_2007_Dabov] outperformed many early deep learning based ones
[SDA2010Pascal, MLP_2012_CVPR, DNNAllgaussian2014wang], until DnCNN [DnCNN_2017_TIP] surpassed it for synthetic Gaussian noise at the expense of a massive amount of noiseless and noisy image pairs. Requiring no clean and/or noisy image pairs, deep image prior (DIP) [DIP_2018_CVPR, DIP_2020_IJCV]
has shown that a randomly initialized network with an hourglass structure acts as a prior for several inverse problems, including denoising, super-resolution, and inpainting, given only a single degraded image. Although DIP exhibits remarkable performance on these inverse problems, denoising is the particular task on which DIP does not perform well; e.g., a single run yields a far lower PSNR than BM3D even in the synthetic Gaussian noise setup
[DIP_2018_CVPR, DIP_2020_IJCV]. Furthermore, for the best performance, one needs to monitor the PSNR (i.e., the ground-truth clean image is required) and stop the iterations before fitting to noise. Deep Decoder addresses this issue by proposing a strong structural regularization that allows longer iterations for inverse problems including denoising [Deep_heckel_2018_ICLR]. However, it yields worse denoising performance than DIP due to its low model complexity.
For better use of DIP for denoising without monitoring the PSNR with a clean image, we first analyze the model complexity of DIP by the notion of effective degrees of freedom (DF) [Tibshirani2014DoF, df2004Efron, dfinDNN2016UAI]. Specifically, the DF quantifies the amount of overfitting (i.e., optimism) of a chosen hypothesis (e.g., a trained neural network model) to the given training data [df2004Efron]. In other words, when overfitting occurs, the DF increases. Therefore, to prevent the overfitting of the DIP network to the noise, we want to suppress the DF over iterations. However, obtaining the DF again requires a clean (ground-truth) image. Fortunately, for the Gaussian noise model, there are approximations of the DF that do not require a clean image: Monte-Carlo divergence approximations in Stein's unbiased risk estimator (SURE) (Eqs. 8, 9).
Leveraging SURE and the improvement techniques in DIP [DIP_2020_IJCV], we propose an objective with 'stochastic temporal ensembling (STE),' which mimics ensembling over many noise realizations in a single optimization run. With this objective, we propose to stop the iterations when the objective function crosses zero. The proposed method leads to much better solutions than DIP and outperforms prior arts for single image denoising. In addition, inspired by the PURE formulation [PURE2011Luisier, PGURE2014Montagner], we extend our objective function to address Poisson noise.
We empirically validate our method by comparing against DIP-based prior arts for denoising performance in various metrics suggested in the literature [Gu2019ABR], such as PSNR, SSIM and learned perceptual image patch similarity (LPIPS) [LPIPS_zhang2018perceptual], on seven different datasets. LPIPS has been widely used in the super-resolution literature to complement PSNR and SSIM in measuring the recovery of details [Ledig2017SRGAN]. Since it is challenging for a denoiser to suppress noise and preserve details at the same time [PDTradeoff2018Blau], we argue that LPIPS is another appropriate metric to evaluate denoisers. Note that it has not yet been widely used in the denoising literature to analyze denoising performance. Our method not only denoises the images but also preserves rich textural details, outperforming other methods in LPIPS with comparable classic measures including PSNR and SSIM.
Our contributions are summarized as follows:

Analyzing DIP for denoising with the effective degrees of freedom (DF) of a network and proposing a loss-based stopping criterion that requires no ground-truth image.

Incorporating noise regularization and exponential moving average by the proposed stochastic temporal ensembling (STE) method.

Diverse evaluation with various metrics such as LPIPS, PSNR and SSIM on seven different datasets.

Extending our method to Poisson noise.
2 Related work
2.1 Learning based methods
Learning-based denoising methods use a large number of clean-noisy image pairs to train a denoiser. In an early study, a neural network showed decent performance even though the noise level was unknown, i.e., the blind noise setup [MLP_2012_CVPR]. Shortly afterwards, however, [DND_Plotz_2017_CVPR] showed that most early learning-based studies often produced worse results than classical techniques such as BM3D [BM3D_2007_Dabov]. More recently, the DnCNN model with residual learning [DnCNN_2017_TIP] outperformed BM3D. Several works have since been proposed to improve computational efficiency: IRCNN [IRCNNzhang_2017_learning] uses dilated convolutions and FFDNet [zhang2018ffdnet] uses downsampled sub-images and a noise level map.
2.2 Model based methods
Conventional model-based methods do not need training but rely on an inductive bias given as a prior, so their performance depends on the chosen prior knowledge. There are several image priors, such as total variation (TV) [TV_PRIOR], wavelet-domain processing [waveletsparsityprior] and BM3D [BM3D_2007_Dabov], which assume smoothness, low rank and self-similarity as the prior, respectively.
Image prior by deep neural networks.
Ulyanov et al. [DIP_2020_IJCV]
show that a randomly initialized convolutional neural network serves as an image prior, name it the deep image prior (DIP), and apply it to several inverse problems.
Despite its broad usage, the performance of denoising with DIP is still disappointing because of "overfitting" to noise (see Sec. 3). There are several remedies for the noise overfitting of DIP [DIP_2018_CVPR, Deep_heckel_2018_ICLR, GPDIP_Cheng_2019_CVPR, DIPRED_Mataev_2019_ICCV]. DIP-RED [DIPRED_Mataev_2019_ICCV] combines a plug-and-play prior with DIP, which changes the convergence point of DIP. GP-DIP [GPDIP_Cheng_2019_CVPR] shows that DIP is asymptotically equivalent to a stationary Gaussian Process prior and introduces stochastic gradient Langevin dynamics (SGLD) [SGLD_19]. Deep Decoder [Deep_heckel_2018_ICLR] utilizes an under-parameterized network, based on the fact that overfitting is related to model complexity. Inspired by this, we systematically analyze the fitting of a network but improve the performance of DIP without sacrificing the network size.
Recently, Self2Self (S2S) [S2S_Quan_2020_CVPR]
introduces self-supervised learning based on dropout and ensembling. Owing to the model uncertainty from dropout, S2S generates multiple independent denoised instances and averages the outputs for a low-variance solution. It outperforms existing solutions but needs extensive iterations with a very low learning rate due to dropout. In addition, there is an approach that combines SURE
[stein_origin] with DIP [DIPSURE_metzler2020unsupervised] (DIP-SURE). It is similar to our work in that both use SURE, but we further extend the formulation to propose 'stochastic temporal ensembling,' which deviates from the original SURE formulation. Please find further discussion in Sec. 4.1.
2.3 Effective degrees of freedom
Effective degrees of freedom (DF) [df2004Efron, Tibshirani2014DoF] provides a quantitative analysis of the amount of fitting of a model to the training data. Efron shows that an estimate of the optimism is the difference between the errors on test and training data and relates it to a measure of model complexity termed the effective degrees of freedom [df2004Efron]. Intuitively, it reflects the effective number of parameters used by a model in producing the fitted output [Tibshirani2014DoF]. We use the notion of DF to analyze and detect the overfitting of a network and to propose our method.
2.4 Stein’s unbiased risk estimator (SURE)
Stein’s unbiased risk estimator [stein_origin]
is a risk estimator for a Gaussian random variable. It is a useful tool for selecting a model or hyperparameters in denoising problems, since it guarantees an unbiased risk estimate without a target vector
[Donoho95adaptingto, Zhang98AdaptiveSURErisk]. The analytic solution for SURE is only available under limited conditions, e.g., non-local means or linear filters [Ville2009Nonlocal, NonlocalSURE2011Ville]. When the closed-form solution is not available, Ramani et al. [MCSURE_Ramani_2008_TIP] proposed a Monte-Carlo-based SURE (MC-SURE) method to determine near-optimal parameters based on a brute-force search of the parameter space. As SURE-based methods are limited to Gaussian noise [SURE_2018_NIPS], several works extend them to other types of noise, including Poisson [PURE2011Luisier], Poisson-Gaussian [PGURE2014Montagner], the exponential family [GSURE2009Eldar], or non-parametric noise models [BaysianSupervision2007Sch]. We also modify our objective to extend our method to Poisson noise following [PGURE2014Montagner, PURE2011Luisier, PURE2018Soltanayev] (Sec. 4.3).
3 Preliminaries
Deep image prior (DIP).
Let a noisy image \(y \in \mathbb{R}^d\) be modeled as
(1) \( y = x + n, \)
where \(x\) is the noiseless image that one would like to recover and \(n\) is Gaussian noise such that \(n \sim \mathcal{N}(0, \sigma^2 I)\), where \(I\)
is an identity matrix. Denoising can be formulated as the problem of predicting the unknown \(x\)
from the known noisy observation \(y\). Ulyanov et al. [DIP_2020_IJCV] argued that a network architecture itself naturally encourages restoring the original image from a degraded image, and named it the deep image prior (DIP). Specifically, DIP optimizes a convolutional neural network \(f_\theta\) with parameters \(\theta\) by a simple least-squares loss:
(2) \( \theta^* = \arg\min_\theta \big\| f_\theta(z) - y \big\|^2, \qquad \hat{x} = f_{\theta^*}(z), \)
where \(z\) is a random input that is independent of \(y\). If \(f_\theta\) has enough capacity (e.g., a sufficiently large number of parameters or architecture size) to fit the noisy image \(y\), the output of the model will eventually be equal to \(y\), which is not desirable. DIP uses early stopping to obtain the result with the best PSNR, computed with the clean image.
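To make this overfitting behavior concrete, here is a minimal toy sketch in plain Python (not the paper's implementation): a per-pixel parameter vector stands in for the over-capacity network, and gradient descent on the least-squares loss of Eq. 2 stands in for the optimizer.

```python
import random

def dip_toy_fit(y, steps, lr=0.1):
    """Toy stand-in for an over-capacity DIP model: one free parameter
    per pixel, trained on the plain least-squares loss sum_i (t_i - y_i)^2."""
    theta = [0.0] * len(y)  # initialization stands in for f_theta(z) at step 0
    for _ in range(steps):
        # gradient of (t_i - y_i)^2 w.r.t. t_i is 2 * (t_i - y_i)
        theta = [t - lr * 2.0 * (t - yi) for t, yi in zip(theta, y)]
    return theta

random.seed(0)
x = [1.0] * 8                                   # clean signal
y = [xi + random.gauss(0.0, 0.3) for xi in x]   # noisy observation (Eq. 1)
out = dip_toy_fit(y, steps=300)
# With enough capacity and iterations, the output reproduces the *noisy*
# image y, i.e., it overfits the noise -- the failure mode described above.
```

The toy model converges to \(y\) exactly, which is why DIP needs some form of early stopping.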
Effective degrees of freedom for DIP.
The effective degrees of freedom [df2004Efron, Tibshirani2014DoF] quantifies the amount of fitting of a model to training data. We analyze the training of DIP by the effective degrees of freedom (DF) in Eq. 3 as a tool for monitoring overfitting to the given noisy image. The DF for an estimator \(f\) of \(x\) with input \(u\) can be defined as follows [hastie1990generalized]:
(3) \( \mathrm{DF}(f) = \frac{1}{\sigma^2} \sum_{i=1}^{d} \mathrm{Cov}\big(f_i(u),\, y_i\big), \)
where \(f\) and \(y\) are a model (e.g., a neural network) and the noisy image, respectively,
\(\sigma\) is the standard deviation of the noise, and
\(f_i(u)\) and \(y_i\) indicate the \(i\)-th elements of the corresponding vectors. For example, if the input to \(f\) is \(z\) and the target is the noisy image \(y\), it is the DF for DIP. Note that \(f\) can take any input, and we use \(y\) (instead of \(z\)) for our formulation. Interestingly, the DF is closely related to the notion of optimism of an estimator \(f\), which is defined by the difference between test error and train error [hastie1990generalized, Tibshirani2014DoF] as:
(4) \( \mathrm{Opt}(f) = \mathbb{E}\big[ L\big(f(u), y'\big) \big] - \mathbb{E}\big[ L\big(f(u), y\big) \big], \)
where \(L\) is the mean squared error (MSE) loss and \(y'\) is another realization from the model (i.e., with a different \(n\) in Eq. 1) that is independent of \(y\). In [Tibshirani2014DoF], it is shown that \(\mathrm{Opt}(f) = \frac{2}{d} \sum_{i=1}^{d} \mathrm{Cov}\big(f_i(u), y_i\big)\). Thus, combining with Eq. 3, it is straightforward to show that
(5) \( \mathrm{DF}(f) = \frac{d}{2\sigma^2}\, \mathrm{Opt}(f). \)
It is challenging to compute the covariance since \(f\) is non-linear (e.g., a neural network) and gradually changes during optimization, and the expectation requires many pairs of noisy and clean (ground-truth) images to compute (note that it is an estimate). Here, we introduce a simply approximated degrees of freedom with a single ground truth and call it \(\mathrm{DF}_{\mathrm{GT}}\). We derive \(\mathrm{DF}_{\mathrm{GT}}\) as follows:
(6) \( \mathrm{DF}_{\mathrm{GT}}(f) = \frac{1}{\sigma^2} \sum_{i=1}^{d} \big(f_i(u) - x_i\big)\big(y_i - x_i\big). \)
We describe a simple proof of the estimation in the supplementary material.
A large DF implies overfitting to the given noisy image \(y\), which is not desirable. If DIP fits to \(x\), \(\mathrm{DF}_{\mathrm{GT}}\) becomes close to 0; the more DIP fits to \(y\), the larger the DF becomes. We use \(\mathrm{DF}_{\mathrm{GT}}\) to analyze the DIP optimization in empirical studies in Sec. 5.1.
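The two extremes of this behavior can be checked numerically. The sketch below (plain Python; the function name and scalar setup are ours, not the paper's code) evaluates the single-realization approximation of Eq. 6 for a "model" that outputs the clean image versus one that reproduces the noisy image:

```python
import random

def df_gt(f_out, y, x, sigma):
    """Single-realization approximation of the effective degrees of freedom
    (Eq. 6): DF_GT = (1 / sigma^2) * sum_i (f_i - x_i) * (y_i - x_i)."""
    return sum((fi - xi) * (yi - xi)
               for fi, yi, xi in zip(f_out, y, x)) / sigma ** 2

random.seed(1)
d, sigma = 10000, 0.5
x = [0.0] * d                                    # clean image
y = [xi + random.gauss(0.0, sigma) for xi in x]  # noisy image

df_clean = df_gt(x, y, x, sigma)  # model outputs x: DF is exactly 0
df_noisy = df_gt(y, y, x, sigma)  # model reproduces y: DF concentrates near d
```

The fully overfit model has DF near the pixel count \(d\) (here, near 10000), matching the intuition that it "uses" one effective parameter per pixel.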
4 Approach
To prevent the overfitting of DIP, we try to suppress the DF (Eq. 3) during optimization without access to the ground-truth clean image \(x\). In Eq. 3, computing the DF amounts to summing the covariances between each element of the noisy image \(y\) and the model output \(f(u)\). There are a number of techniques to simply approximate this covariance computation in the statistical learning literature, such as AIC [AIC1973akaike], BIC [BIC1978Schwarz] and Stein's unbiased risk estimator (SURE) [stein_origin]. Both AIC and BIC, however, approximate the DF by counting the number of parameters of a model, so for heavily over-parameterized deep neural networks, approximations based on them could be incorrect [modelselectionNN1999Ulrich]. Note that \(\mathrm{DF}_{\mathrm{GT}}\) cannot be used for optimizing the model because it needs the ground-truth clean image \(x\).
Here, we propose to use SURE to suppress the DF by deriving the DIP formulation using Stein's lemma. Stein's lemma for a multivariate Gaussian vector \(y \sim \mathcal{N}(x, \sigma^2 I)\) is [stein_origin]:
(7) \( \mathrm{Cov}\big(f_i(y),\, y_i\big) = \sigma^2\, \mathbb{E}\left[ \frac{\partial f_i(y)}{\partial y_i} \right]. \)
It simplifies the computation of the DF from the covariances between \(f(y)\) and \(y\) to the expected partial derivatives at each point, which can be well approximated in a number of computationally efficient ways [MCSURE_Ramani_2008_TIP, NEWDIV_Soltanayev_2020_ICASSP]. Note that the SURE loss, denoted as \(\mathcal{L}_{\mathrm{SURE}}\), consists of Eq. 7 and the DIP loss (Eq. 2) with a modification of its input (from \(z\) to \(y\)) as:
(8) \( \mathcal{L}_{\mathrm{SURE}}(\theta) = \big\| f_\theta(y) - y \big\|^2 - d\sigma^2 + 2\sigma^2 \sum_{i=1}^{d} \frac{\partial f_{\theta,i}(y)}{\partial y_i}. \)
While the vanilla DIP loss encourages the model output to fit the noisy image \(y\), Eq. 8 encourages it to approximately fit the clean image \(x\) without access to \(x\).
However, it is still computationally demanding to use Eq. 8 as a loss for optimization with any gradient based algorithm due to the divergence term [MCSURE_Ramani_2008_TIP]. A Monte-Carlo approximation of Eq. 8 in [MCSURE_Ramani_2008_TIP] can be a remedy to the computation cost, but it introduces a hyperparameter \(\epsilon\) that has to be selected properly for the best performance on different network architectures and/or datasets. To avoid tuning the hyperparameter \(\epsilon\), we employ an alternative Monte-Carlo approximation for the divergence term [NEWDIV_Soltanayev_2020_ICASSP] as:
(9) \( \sum_{i=1}^{d} \frac{\partial f_{\theta,i}(y)}{\partial y_i} \approx \tilde{n}^\top \frac{\partial f_\theta(y)}{\partial y}\, \tilde{n}, \)
where \(\tilde{n}\) is a standard normal random vector, i.e., \(\tilde{n} \sim \mathcal{N}(0, I)\), and the \((i,j)\)-th element of the Jacobian \(\frac{\partial f_\theta(y)}{\partial y}\) is \(\frac{\partial f_{\theta,i}(y)}{\partial y_j}\). We denote this 'estimated degrees of freedom by Monte-Carlo' by \(\mathrm{DF}_{\mathrm{MC}}\) and will use it to monitor the DIP optimization without using the PSNR with the clean ground-truth images (Sec. 4.2).
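The probe-based trace estimate underlying Eq. 9 works because \(\mathbb{E}[\tilde{n}^\top J \tilde{n}] = \mathrm{tr}(J)\) for any probe with identity covariance. Below is a minimal, self-contained sketch (plain Python, our own toy setup): a linear "denoiser" stands in for the network so that the Jacobian is known in closed form, and Rademacher probes stand in for the Gaussian \(\tilde{n}\); in practice the Jacobian-vector product would come from automatic differentiation.

```python
import random

def hutchinson_trace(jac_vec_prod, d, n_probes, rng):
    """Estimate tr(J) = sum_i dF_i/dy_i with random probes b:
    E[b^T J b] = tr(J) whenever E[b b^T] = I (Rademacher probes here)."""
    total = 0.0
    for _ in range(n_probes):
        b = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        jb = jac_vec_prod(b)                       # J @ b
        total += sum(bi * ji for bi, ji in zip(b, jb))
    return total / n_probes

rng = random.Random(0)
d = 4
# Toy linear "denoiser" f(y) = A y, so the Jacobian is A itself and the
# Jacobian-vector product is simply A @ b.
A = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(d)]
jvp = lambda b: [sum(A[i][j] * b[j] for j in range(d)) for i in range(d)]

est = hutchinson_trace(jvp, d, n_probes=4000, rng=rng)
true_trace = sum(A[i][i] for i in range(d))  # the exact divergence
```

With a few thousand probes the estimate is close to the exact trace; a DIP-style training loop would instead use a single probe per iteration, relying on averaging over iterations.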
4.1 Stochastic temporal ensembling
To improve the fitting accuracy, DIP suggests several techniques, including noise regularization and exponential moving average [DIP_2020_IJCV]. We propose 'stochastic temporal ensembling (STE)' for better fitting performance by adapting these techniques to our objective.
Noise regularization on DIP.
DIP shows that adding an extra temporal noise to the input of the function \(f_\theta\) at each iteration improves performance for the inverse problems including image denoising [DIP_2020_IJCV]. That is, a noise vector \(n_p\) with \(n_p \sim \mathcal{N}(0, \sigma_p^2 I)\) is added to the input of the function at every iteration of the optimization as:
(10) \( \min_\theta \big\| f_\theta(z + n_p) - y \big\|^2, \)
where \(z\) is fixed but \(n_p\)
is sampled from a Gaussian distribution with zero mean and standard deviation \(\sigma_p\)
at every iteration. To estimate the DF by Eq. 8, we replace the input of the model, \(z\), with the noisy image \(y\) (from Eq. 3 to Eq. 7). Interestingly, Eq. 10 then becomes similar to the denoising autoencoder (DAE), which prevents a model from learning a trivial solution by perturbing the input of \(f_\theta\) [DAE2008Vincent]. Meanwhile, the contractive autoencoder (CAE)
[CAE2011Rifai] minimizes the Frobenius norm of the Jacobian, while SURE and its variants minimize the trace of the Jacobian (Eq. 9), thus suppressing the DF. Since we assume that different realizations of the noise are independent, the off-diagonal elements of the Jacobian are zero, so the CAE is equivalent to SURE in terms of suppressing the DF. Alain et al. [regautoencoder2014alain] later showed that the DAE is a special case of the CAE when \(\sigma_p \to 0\). We can rewrite Eq. 10 using the CAE formulation as:
(11) \( \mathbb{E}_{n_p} \big\| f_\theta(y + n_p) - y \big\|^2 \approx \big\| f_\theta(y) - y \big\|^2 + \sigma_p^2 \left\| \frac{\partial f_\theta(y)}{\partial y} \right\|_F^2 + o(\sigma_p^2) \)
when \(\sigma_p \to 0\), where \(o(\sigma_p^2)\) is a high-order error term from the Taylor expansion. Thus, solving this optimization problem is equivalent to penalizing the increase of the DF. Here, the noise level \(\sigma_p\) serves as a hyperparameter that determines performance, and using multiple levels of \(\sigma_p\) during optimization improves the performance of DIP. Thus, we further propose to model \(\sigma_p\) as a uniform random variable instead of an empirically chosen hyperparameter, such that
(12) \( \sigma_p \sim \mathcal{U}(0,\, \sigma_u). \)
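Per iteration, the randomized perturbation of Eqs. 10 and 12 amounts to two sampling steps, sketched below in plain Python (names are ours, not the paper's code):

```python
import random

def perturb_input(z, sigma_u, rng):
    """Draw a noise level sigma_p ~ U(0, sigma_u) as in Eq. 12, then add
    N(0, sigma_p^2) noise to the model input as in Eq. 10."""
    sigma_p = rng.uniform(0.0, sigma_u)
    return [zi + rng.gauss(0.0, sigma_p) for zi in z], sigma_p

rng = random.Random(0)
y = [0.5] * 16  # the model input (the noisy image, in our formulation)
# A fresh perturbation is drawn at every optimization iteration:
y_in, sigma_p = perturb_input(y, sigma_u=0.1, rng=rng)
```

Resampling \(\sigma_p\) each iteration exposes the model to many perturbation strengths in a single run, instead of committing to one empirically tuned value.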
Exponential moving average.
DIP further shows that averaging the restored images obtained over the last iterations improves denoising performance [DIP_2020_IJCV], which we refer to as the 'exponential moving average (EMA).' It can be thought of as an analogue to the effect of ensembling [datadistill2018].
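The EMA of restored images can be sketched as follows (plain Python; the decay value is illustrative, not the paper's setting):

```python
def ema_update(avg, new, decay=0.99):
    """Exponential moving average of restored images:
    avg <- decay * avg + (1 - decay) * new, elementwise."""
    if avg is None:
        return list(new)  # initialize from the first restored image
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]

# Averaging successive iterates damps per-iteration fluctuations,
# mimicking an ensemble over optimization steps.
avg = None
for out in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    avg = ema_update(avg, out)
```

A constant sequence is a fixed point of the update, while noisy iterates get smoothed toward their running mean.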
Stochastic temporal ensembling.
Leveraging the noise regularization and the EMA, we propose a method called 'stochastic temporal ensembling (STE)' to improve the fitting performance of the DIP loss. Specifically, we modify our formulation (Eq. 8) by allowing two noise observations instead of one: \(y\) for the target of the MSE loss and \(\tilde{y} = y + n_p\) for the input of the model, as:
(13) \( \mathcal{L}_{\mathrm{STE}}(\theta) = \big\| f_\theta(\tilde{y}) - y \big\|^2 - d\sigma^2 + 2\sigma^2 \sum_{i=1}^{d} \frac{\partial f_{\theta,i}(\tilde{y})}{\partial \tilde{y}_i}, \)
where \(\sigma\) is the known noise level of \(y\) (same as in Eq. 1), and \(f_{\theta,i}(\tilde{y})\) and \(\tilde{y}_i\) are the \(i\)-th elements of the vectors \(f_\theta(\tilde{y})\) and \(\tilde{y}\), respectively. Interestingly, Eq. 13 is equivalent to the formulation of extended SURE (eSURE) [eSURE2019Neurips]
, which is shown to be a better unbiased estimator of the MSE with the clean image
\(x\). But there are a number of critical differences between our method and [eSURE2019Neurips]. First, our method does not require training, while Zhussip et al. [eSURE2019Neurips] require training with many noisy images. Second, because Zhussip et al. [eSURE2019Neurips] use a fixed instance of \(n_p\), there is no regularization effect from Eq. 10, which gives a considerable performance gain (see Sec. 5.2). This is our final objective function of DIP, which stops automatically by a stopping criterion, as described in the following section.
4.2 Zero-crossing stopping criterion
SURE works well if the model satisfies the smoothness condition, i.e., it admits a well-defined second-order Taylor expansion [SURE_2018_NIPS, DIPSURE_metzler2020unsupervised]. While a typical learning based denoiser satisfies this smoothness condition [SURE_2018_NIPS, eSURE2019Neurips], the DIP network 'fits' to a target image (a noisy image in [DIP_2018_CVPR, DIP_2020_IJCV] and an approximate clean image in our objective), and therefore there is no guarantee that the smoothness condition is satisfied, especially after it has converged.
We observed that the divergence term in our formulation (Eq. 13) increases at early iterations (i.e., before convergence) while it starts to diverge at later iterations (i.e., after convergence). This observation is consistent across all our experiments. Note that this divergence phenomenon was not reported in [DIPSURE_metzler2020unsupervised], because the DIP network with the SURE loss did not seem to be fully converged to recover the fine details within the insufficient number of iterations used there. Based on this observation for our proposed objective, we propose the 'zero-crossing stopping criterion' to stop the iterations when our objective function (Eq. 13) crosses zero.
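The stopping rule itself reduces to detecting the first sign change of the monitored objective. A minimal sketch in plain Python (the loss trace below is synthetic, for illustration only):

```python
def zero_crossing_stop(losses):
    """Return the index of the first iteration at which the monitored
    objective crosses zero (positive at t-1, non-positive at t), else None."""
    for t in range(1, len(losses)):
        if losses[t - 1] > 0.0 >= losses[t]:
            return t
    return None

# A synthetic objective trace that decreases toward, then past, zero:
trace = [1.5, 0.9, 0.4, 0.1, -0.2, -1.0]
stop_at = zero_crossing_stop(trace)  # first non-positive value is at index 4
```

In our method, `losses` would be the per-iteration values of Eq. 13; no clean image or PSNR monitoring is involved.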
Solution trajectory.
To help understand the difference between our method and DIP in the optimization procedure, similar to Fig. 3 in [DIP_2020_IJCV], we illustrate the DIP image restoration trajectory together with that of our method in Fig. 2. DIP degrades the quality of the restored images by overfitting; to obtain a solution close to the clean ground-truth image, DIP uses early stopping (blue marker in Fig. 2). Our formulation has a different training trajectory (orange) from DIP (black) and automatically stops the optimization by the zero-crossing criterion (orange marker). We argue that the resulting image from our formulation is in general closer to the clean image than the solution by DIP and preserves more high-frequency details (Sec. 5.3), thanks to a better target to fit (an approximation of the clean image rather than the noisy one) and our proposed principled stopping criterion that does not use the ground-truth image. We empirically analyze this phenomenon with our proposed \(\mathrm{DF}_{\mathrm{MC}}\) and compare it to \(\mathrm{DF}_{\mathrm{GT}}\) in Sec. 5.1 and the supplementary material.
4.3 Extension to Poisson noise
As the SURE is limited to Gaussian noise [SURE_2018_NIPS], there have been several attempts to extend it to other types of noise [PGURE2014Montagner, GSURE2009Eldar, BaysianSupervision2007Sch]. Here, we extend our formulation to Poisson noise, as it is a useful model for noise in low-light conditions. We modify our formulation (Eq. 13) to use the Poisson unbiased risk estimator (PURE) [PURE2011Luisier, PGURE2014Montagner, Kim2020PURECT] for Poisson noise as follows:
(14) \( \mathcal{L}_{\mathrm{PURE}}(\theta) = \frac{1}{d} \big\| f_\theta(\tilde{y}) - y \big\|^2 - \frac{1}{d} \mathbf{1}^\top y + \frac{2}{\epsilon d} \big( \tilde{b} \odot y \big)^\top \Big( f_\theta(\tilde{y} + \epsilon \tilde{b}) - f_\theta(\tilde{y}) \Big), \)
where \(\tilde{b}\) is a \(d\)-dimensional binary random vector whose elements
take \(-1\) or \(1\) with probability 0.5 each,
\(\epsilon\) is a small positive number, and \(\odot\) is the Hadamard product. We empirically validate the Poisson extension in Sec. 5.4.
5 Experiments
Implementation details.
For the noise level \(\sigma\), we use \(\sigma = 15, 25, 50\) following [DnCNN_2017_TIP], and \(\sigma = 25\) for in-depth analysis. For \(\sigma_u\) in Eq. 12, we set it to the same value as \(\sigma\). The RAdam optimizer [liu2019radam] is used for training with a learning rate of 0.1. Details including network architectures and datasets are in the supplementary material.
Evaluation metrics.
We use peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and learned perceptual image patch similarity (LPIPS)
[LPIPS_zhang2018perceptual]. The PSNR is widely used in the denoising literature [DnCNN_2017_TIP, zhang2018ffdnet, IRCNNzhang_2017_learning, N3Net_Pl_2018, jia2019focnet], but it has recently been argued that it is not an ideal metric as it favors over-smoothed results [LPIPS_zhang2018perceptual, Ledig2017SRGAN]. For this reason, we also compare the algorithms with LPIPS as an alternative to a human study. We use the publicly available pretrained weights based on AlexNet released by the authors [LPIPS_zhang2018perceptual]. We additionally report the peak PSNR during the optimization of our method as a reference (denoted as 'Ours*').
5.1 Convergence analysis by \(\mathrm{DF}_{\mathrm{GT}}\)
Fig. 2(a) shows \(\mathrm{DF}_{\mathrm{GT}}\), PSNR to \(y\), and PSNR to \(x\); \(\mathrm{DF}_{\mathrm{GT}}\) is the effective degrees of freedom computed with the ground truth, and PSNR to \(y\) and PSNR to \(x\) refer to the PSNR from the model output to \(y\) and \(x\), respectively. As optimization progresses, the degrees of freedom of DIP increases gradually along with the PSNR to \(y\), but the PSNR to \(x\) of DIP decreases from around iteration 1,300. In contrast, for ours, \(\mathrm{DF}_{\mathrm{GT}}\) rises at the beginning of the iterations and then stays at a certain value. Interestingly, the best stopping point for DIP is near the intersection between DIP and our method in \(\mathrm{DF}_{\mathrm{GT}}\). This implies that the converged value of our method is near the optimal solution of DIP (marked for DIP in Fig. 2).
Fig. 2(b) shows the trajectories of two objectives in terms of \(\mathrm{DF}_{\mathrm{GT}}\): (1) ours w/o STE and (2) ours. As shown in Sec. 4.1, STE suppresses the DF by minimizing the Frobenius norm of the Jacobian, which is similar to the trace of the Jacobian (the DF from Stein's lemma). Accordingly, ours suppresses the DF better than ours w/o STE in \(\mathrm{DF}_{\mathrm{GT}}\) (Fig. 2(b), top). This tendency is also observed in 'PSNR to \(x\)' and 'PSNR to \(x\) (EMA).'
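The opposing trends of 'PSNR to \(y\)' and 'PSNR to \(x\)' can be reproduced with a toy experiment (plain Python, our own construction: interpolation toward the noisy image stands in for the model output as DIP overfits over iterations):

```python
import math
import random

def psnr(ref, img, peak=1.0):
    """Peak signal-to-noise ratio: 10 * log10(peak^2 / MSE)."""
    mse = sum((r - i) ** 2 for r, i in zip(ref, img)) / len(ref)
    return float("inf") if mse == 0.0 else 10.0 * math.log10(peak ** 2 / mse)

rng = random.Random(0)
x = [0.5] * 256                           # clean signal
y = [v + rng.gauss(0.0, 0.1) for v in x]  # noisy observation

# As a model overfits, its output moves from (roughly) x toward y; here
# alpha plays the role of increasing iteration count.
psnr_to_y, psnr_to_x = [], []
for alpha in (0.2, 0.5, 0.8, 0.99):
    out = [(1 - alpha) * xi + alpha * yi for xi, yi in zip(x, y)]
    psnr_to_y.append(psnr(y, out))  # rises monotonically: fitting the noise
    psnr_to_x.append(psnr(x, out))  # falls monotonically: quality degrades
```

The crossover between the two curves is exactly what a ground-truth-free stopping rule must detect without being able to compute PSNR to \(x\).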
5.2 Quantitative analysis
Method  Overfit Prev.  PSNR (↑)  SSIM (↑)  LPIPS (↓)
DIP [DIP_2020_IJCV]  Early stopping  29.96  0.940  0.152
Deep Decoder [Deep_heckel_2018_ICLR]  Underparam.  26.94  0.889  0.377
DIP-RED [DIPRED_Mataev_2019_ICCV]  Plug-and-play  30.88  0.932  0.197
GP-DIP [GPDIP_Cheng_2019_CVPR]  SGLD  29.99  0.948  0.251
DIP-SURE* [DIPSURE_metzler2020unsupervised]  ZCSC  30.33  0.941  0.149
Ours w/o STE [eSURE2019Neurips]  ZCSC  31.34  0.955  0.108
Ours  ZCSC  31.54  0.953  0.107
PSNR (↑)  SSIM (↑)  LPIPS (↓)
Dataset  σ  BM3D [BM3D_2007_Dabov]  DIP [DIP_2020_IJCV]  S2S [S2S_Quan_2020_CVPR]  Ours (Ours*)  BM3D [BM3D_2007_Dabov]  DIP [DIP_2020_IJCV]  S2S [S2S_Quan_2020_CVPR]  Ours (Ours*)  BM3D [BM3D_2007_Dabov]  DIP [DIP_2020_IJCV]  S2S [S2S_Quan_2020_CVPR]  Ours (Ours*)
Color Image Datasets  
CSet9  15  33.83  31.83  33.24  33.83 (34.07)  0.972  0.960  0.968  0.973 (0.975)  0.111  0.114  0.135  0.070 (0.077) 
25  31.68  29.96  31.72  31.54 (31.88)  0.956  0.940  0.956  0.953 (0.960)  0.161  0.152  0.173  0.107 (0.118)  
50  28.92  27.42  29.25  28.90 (29.03)  0.922  0.900  0.928  0.923 (0.930)  0.267  0.291  0.235  0.181 (0.200)  
CBSD68  15  33.51  31.48  32.78  33.43 (33.56)  0.961  0.941  0.956  0.961 (0.963)  0.081  0.081  0.102  0.060 (0.057) 
25  30.70  28.66  30.67  30.67 (30.86)  0.932  0.900  0.932  0.932 (0.936)  0.148  0.156  0.147  0.102 (0.100)  
50  27.37  25.70  27.62  27.43 (27.58)  0.871  0.832  0.879  0.873 (0.881)  0.298  0.329  0.244  0.194 (0.198)  
Kodak  15  34.41  32.17  33.70  34.35 (34.49)  0.962  0.941  0.958  0.961 (0.963)  0.104  0.105  0.118  0.080 (0.077) 
25  31.82  29.68  31.79  31.60 (31.98)  0.938  0.907  0.939  0.932 (0.941)  0.161  0.173  0.159  0.117 (0.118)  
50  28.62  26.77  29.08  28.58 (28.76)  0.886  0.843  0.898  0.882 (0.892)  0.287  0.338  0.235  0.203 (0.209)  
McM  15  34.05  32.54  33.92  34.13 (34.35)  0.969  0.956  0.968  0.967 (0.970)  0.068  0.067  0.089  0.053 (0.052) 
25  31.66  30.09  32.15  31.89 (31.98)  0.950  0.929  0.955  0.950 (0.953)  0.107  0.123  0.117  0.085 (0.085)  
50  28.51  27.06  29.29  28.83 (28.82)  0.910  0.882  0.924  0.913 (0.918)  0.207  0.252  0.178  0.151 (0.162)  
Grayscale Image Datasets  
BSD68  15  31.07  28.83  30.62  30.98 (31.21)  0.872  0.812  0.858  0.873 (0.882)  0.147  0.163  0.163  0.090 (0.099) 
25  28.57  26.59  28.60  28.40 (28.78)  0.801  0.734  0.801  0.800 (0.818)  0.226  0.262  0.197  0.157 (0.159)  
50  25.61  24.13  25.70  25.75 (25.81)  0.686  0.625  0.687  0.696 (0.708)  0.363  0.443  0.313  0.262 (0.282)  
Set12  15  32.36  30.12  32.07  32.20 (32.26)  0.895  0.837  0.889  0.891 (0.894)  0.117  0.132  0.139  0.084 (0.092) 
25  29.93  27.54  30.02  29.79 (29.76)  0.850  0.776  0.849  0.844 (0.848)  0.159  0.218  0.159  0.122 (0.137)  
50  26.71  24.67  26.49  26.60 (26.47)  0.768  0.683  0.734  0.755 (0.760)  0.262  0.361  0.232  0.208 (0.228) 
Comparison to DIP variants.
Table 1 shows the denoising results of several DIP-based methods. Deep Decoder (DD) [Deep_heckel_2018_ICLR] shows the worst performance in all metrics; we believe that DD mitigates the overfitting problem with an under-parameterized network at the cost of performance. GP-DIP [GPDIP_Cheng_2019_CVPR] outperforms DIP in PSNR and SSIM. It uses SGLD [SGLD_19] to sample multiple instances from the posterior distribution and average them, which is similar to Self2Self [S2S_Quan_2020_CVPR]. This strategy may help the PSNR score, but it may lose the texture of images, which leads to a relatively poor LPIPS score (see the next section for more discussion). DIP-RED shows the best result apart from our method and its ablated version; its plug-and-play overfitting prevention uses another denoising method as a prior. The plug-and-play approach might also work with our method, but it is beyond the scope of this paper. Note that all of the above methods except ours, DIP-SURE* and DIP stop the optimization at the predefined number of iterations provided in the authors' codes.
In particular, both DIP-SURE* [DIPSURE_metzler2020unsupervised] and 'Ours w/o STE' [eSURE2019Neurips] are worse than ours even though they use the SURE formulation. We argue that this is because they use a single noise realization. In addition, they are quite similar to each other, except that DIP-SURE* depends on \(\epsilon\)
as a hyperparameter while 'Ours w/o STE' does not have such a hyperparameter (Sec.
4). Not requiring hyperparameter tuning results in a noticeable gain for 'Ours w/o STE.' Note that the original DIP-SURE depends on early stopping by monitoring the PSNR with a clean image; for a fair comparison, we apply our stopping criterion to it and denote it as DIP-SURE*.
Comparison to the state of the arts.
Table 2 shows comparative results with other single image denoising methods on six datasets (four color and two grayscale). The compared methods include CBM3D [BM3D_2007_Dabov], DIP [DIP_2018_CVPR], and Self2Self (S2S) [S2S_Quan_2020_CVPR]. Except for BM3D, all of the methods are based on convolutional neural networks. For the network architecture of DIP, we use the same one as ours for a fair comparison. We use a slightly different network for S2S since it needs dropout instead of batch normalization.
Our method outperforms all other single-image denoising methods in LPIPS while showing comparable PSNR and SSIM. Ours* exhibits the best PSNR performance, outperforming all compared methods except S2S in the high-noise experiments. But Ours* loses some high-frequency details compared to Ours (see LPIPS). We believe this is due to the exponential moving average (EMA), as it alleviates the instability of training (i.e., a rough solution space) that cannot be captured by PSNR. Ours performs especially well at low noise; we believe that the error of the MC estimation is smaller in the small-noise setup. Nevertheless, our method exhibits excellent performance in LPIPS and SSIM in almost all setups.
It is worth noting that S2S exhibits higher (worse) LPIPS, especially at low noise, than all other methods, despite being ahead of DIP in PSNR. Considering that the MSE is the sum of the squared bias and the variance, we argue that S2S achieves its impressive PSNR results with significantly reduced variance and increased squared bias (i.e., destroyed textural details). This is clearly observed in Fig. 1(b), where we show the results of S2S with various numbers of ensembles: as the number of ensembles increases, the PSNR also increases at the cost of the LPIPS score. In contrast, our method achieves a much better trade-off between the LPIPS score and PSNR without ensembling.
Moreover, the inference time of S2S on CSet9 is almost 35 hours without parallel processing, whereas ours only takes 4 hours. Further speed-ups of both S2S and ours are possible by parallel processing [S2S_Quan_2020_CVPR], but the gap would remain.
Although it is not quite fair to compare our method with learning-based ones including DnCNN [DnCNN_2017_TIP], N2N [N2N_2018_ICML], HQN2V [HQN2V2019Neurips], and IRCNN [IRCNNzhang_2017_learning], as we only use a single noisy observation, we additionally compare with them in the supplementary material for space's sake.
5.3 Qualitative analysis
We present examples of denoised images in Fig. 4. In the first row, we observe that the results of CBM3D and S2S are more over-smoothed (having fewer high-frequency details) than those of our method. DIP preserves textures but is much noisier than ours. Again, we observe that our results strike a better trade-off between PSNR and LPIPS.
The second row has a higher noise level than the first row. S2S and CBM3D show clean images with sharp edges, but they also make the English characters on the sign blurry. In contrast, our method preserves sharper details in the characters on the sign while noise is mostly suppressed. More qualitative results are in the supplementary material.
Noise scale  BM3D-VST [VST2013Makitalo]  DIP [DIP_2020_IJCV]  S2S [S2S_Quan_2020_CVPR]  Ours (Ours*)
30.50  30.99  32.18  32.00 (31.94)  
21.57  23.54  22.84  24.87 (24.94)  
18.48  21.43  20.10  22.85 (22.90) 
5.4 Extension to Poisson noise
Poisson noise is likely to occur in low-light conditions such as microscopic imaging. In [PURE2018Soltanayev], MNIST images are used to simulate this scenario. We conduct experiments on single-image Poisson denoising and summarize the comparative results with BM3D-VST
[VST2013Makitalo], DIP, and S2S in Table 3. Note that BM3D-VST is one of the most popular methods for Poisson denoising. At the low noise level, the noise distribution becomes almost symmetric, similar to a Gaussian, so our method does not perform as well. But at higher levels of noise, our method outperforms the other methods: DIP shows better results than the classic BM3D with VST [VST2013Makitalo] and the state of the art, S2S, at higher noise, and our method outperforms all compared methods including BM3D-VST [VST2013Makitalo].
Fig. 5 shows qualitative results for the Poisson noise setup. We observe that the BM3D+VST images are considerably blurrier than those of other methods, and S2S also produces blurry images due to overfitting. DIP shows the second best result thanks to early stopping. In contrast, our method denoises the images with detailed texture preserved, without early stopping.
6 Conclusion
We investigate DIP for denoising through the notion of effective degrees of freedom to monitor overfitting to noise, and propose stochastic temporal ensembling (STE) and the zero-crossing stopping criterion to stop the optimization before it overfits, without a clean image. We significantly improve the performance of Gaussian denoising by DIP without manual early stopping and extend the method to Poisson denoising with PURE. Our empirical validation shows that the proposed method outperforms the state of the art in LPIPS by large margins with comparable PSNR and SSIM, evaluated on seven different datasets.
Acknowledgement.
This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.2019R1C1C1009283) and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019001842, Artificial Intelligence Graduate School Program (GIST)), (No.2019001351, Development of Ultra LowPower Mobile Deep Learning Semiconductor With Compression/Decompression of Activation/Kernel Data, 17%), (No. 2021002068, Artificial Intelligence Innovation Hub) and was conducted by Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD (UD190031RD). The work of SY Chun was supported by Basic Science Research Program through National Research Foundation of Korea (NRF) funded by Ministry of Education (NRF2017R1D1A1B05035810).