1 Introduction
This paper addresses the classic image denoising problem: an ideal image $x$ is measured in the presence of an additive zero-mean, white and homogeneous Gaussian noise $v$ with standard deviation $\sigma$. The measured image is thus $y = x + v$, and our goal is the recovery of $x$ from $y$ with the knowledge of the parameter $\sigma$. This is quite a challenging task due to the need to preserve the fine details in $x$ while rejecting as much noise as possible.

The importance of the image denoising problem cannot be overstated. First and foremost, noise corruption is inevitable in any image sensing process, oftentimes heavily degrading the visual quality of the acquired image. Indeed, today's cellphones all deploy a denoising algorithm of some sort in their camera pipelines [plotz2017benchmarking]. Removing noise from an image is also an essential and popular preliminary step in various image processing and computer vision tasks [katsaggelos2012digital]. Last but not least, many image restoration problems can be addressed effectively by solving a series of denoising subproblems, further broadening the applicability of image denoising algorithms [afonso2010fast, romano2017little]. Due to its practical importance and the fact that it is the simplest inverse problem, image denoising has become the entry point for many new ideas brought over the years to the realm of image processing. Over a period of several decades, many image denoising algorithms have been proposed and tested, forming an evolution of methods with gradually improved performance.

A common and systematic approach for the design of novel denoising algorithms is the Bayesian point of view. This calls for image priors, used as regularizers within the Maximum a Posteriori (MAP) or the Minimum Mean Squared Error (MMSE) estimators. In this paper we concentrate on one specific regularization approach, as introduced in
[elad2006image]: the use of sparse and redundant representation modeling of image patches – this is the KSVD denoising algorithm, which stands at the center of this paper. The authors of [elad2006image] defined a global image prior that forces sparsity on the patches in every location in the image. Their algorithm starts by breaking the image into small, fully overlapping patches, solving their MAP estimate (i.e., finding their sparse representation), and ending with a tiling of the results back together by averaging. As the MAP estimate relies on the availability of the dictionary, this work proposed two approaches, both harnessing the well-known KSVD dictionary learning algorithm [aharon2006k]. The first option is to train offline on a large external corpus of image patches, aiming for a universally good dictionary to serve all test images. The alternative, which was found to be more effective, suggests using the noisy patches themselves in order to learn the dictionary, this way adapting to the denoised image.

KSVD has been widely used and extended, as evidenced by its many follow-up papers. For a short while, this algorithm was considered state-of-the-art, standing at the top in denoising performance (ranking denoising algorithms is typically done by evaluating synthetic denoising performance on agreed-upon image databases, e.g. Set12 or BSD68, measuring Peak Signal-to-Noise Ratio (PSNR) and/or Structural Similarity Index Measure (SSIM) results). However, over the years it has been surpassed by other methods, such as BM3D [dabov2007video], EPLL [zoran2011learning], WNNM [gu2014weighted], and many others. The recent newcomers to this game – supervised deep-learning based denoising methods – are currently in the lead [chen2016trainable, lefkimmiatis2017non, zhang2017beyond, liu2018non, zhang2018ffdnet].
Can KSVD denoising make a comeback and compete favorably with the most recent and best performing denoising algorithms? In this paper we answer this question positively. We show that this algorithm can be brought to perform far better, provided that its parameters are tuned in a supervised manner. By following the exact KSVD computational path, we preserve its global image prior. This includes (i) breaking the image into small, fully overlapping patches, (ii) solving their MAP estimate as a pursuit that aims to get their sparse representation in a learned dictionary, and then (iii) averaging the overlapping patches to restore the clean image. Special care is given to the redesign of all these steps into a differentiable and learnable computational scheme. We therefore end up with a deep architecture that reproduces the exact KSVD operations, and can be trained by backpropagation for best denoising results. Our work shows that, with a small number of parameters to learn and while preserving the original KSVD essence, the proposed machine outperforms the original KSVD and other classical algorithms (e.g. BM3D and WNNM), and gets closer to state-of-the-art learning-based denoising methods.
Our motivation in this paper goes beyond a simple improvement of the KSVD denoising algorithm, aiming higher and broader. What are the lessons to be taken from the derived solution? How should we design novel and well-justified architectures for solving signal and image processing problems? What is the relation between classic (old-fashioned) solutions and novel learning-based ones, in the context of such tasks? How can we further improve the proposed scheme in a principled way? All these and more are central questions, discussed towards the end of this paper. We urge the reader to go through that discussion carefully, as it gives the proper context to this work and to its future prospects.
This paper is organized as follows. Section 2 recalls the KSVD denoising algorithm, serving as the background for our derived alternative. In Section 3 we present the designed architecture, with various modifications and adjustments that enable differentiability, local adaptivity, and more. Section 4 describes a series of experiments that demonstrate the superiority of the proposed learned network over the classic KSVD denoising algorithm, and show that the proposed network is competitive with recent learned methods. We conclude this work in Section 5 with a wide discussion of this work and its contributions, highlighting potential future research directions.
2 The KSVD Denoising Algorithm
In [elad2006image] the authors address the image denoising problem by using local sparsity and redundancy as ingredients in the formation of a global Bayesian objective. In this section we describe this KSVD denoising algorithm by discussing (i) their global prior; (ii) the objective function induced; (iii) its corresponding numerical solver; and (iv) the two approaches for training the corresponding dictionary.
2.1 From the Patch to a Global Objective Function
We start by introducing the local prior as imposed on patches in [elad2006image]. Let $x \in \mathbb{R}^n$ be a small image patch of size $\sqrt{n} \times \sqrt{n}$ pixels, ordered lexicographically as a column vector of length $n$. The sparse representation model assumes that $x$ is built as a linear combination of columns (also referred to as atoms) taken from a pre-specified dictionary $D \in \mathbb{R}^{n \times m}$ (the option $m > n$ implies that the dictionary is redundant). Put formally, $x = D\alpha$, where $\alpha \in \mathbb{R}^m$ is a sparse vector with $k \ll n$ non-zeros (this is denoted by $\|\alpha\|_0 = k$). Consider $y = x + v$, a noisy version of $x$, contaminated by an additive zero-mean white Gaussian noise $v$ with standard deviation $\sigma$. The MAP estimator for denoising this patch is obtained by solving

(1)  $\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad \|D\alpha - y\|_2^2 \le \epsilon^2,$
aiming to recover the sparse representation vector $\hat{\alpha}$ of $x$. This is followed by $\hat{x} = D\hat{\alpha}$, obtaining the denoised result [chen2001atomic, donoho2005stable, tropp2006just]. Note that the above optimization can be changed into a Lagrangian form,

(2)  $\hat{\alpha} = \arg\min_{\alpha} \ \lambda \|\alpha\|_0 + \frac{1}{2} \|D\alpha - y\|_2^2,$

such that the constraint becomes a penalty. With a proper choice of $\lambda$, which is signal (the vector $y$) dependent, the two problems can become equivalent.
Moving now to handle a complete and large image $X$ of size $\sqrt{N} \times \sqrt{N}$ and its noisy version $Y$ (both held as vectors of length $N$), the global image prior proposed in [elad2006image] imposes the above-described local prior on every patch in $X$, considering their extractions with full overlaps. This leads to the following global MAP estimator for the denoising:

(3)  $\left\{ \{\hat{\alpha}_k\}_k, \hat{X} \right\} = \arg\min_{\{\alpha_k\}_k, X} \ \frac{\mu}{2} \|X - Y\|_2^2 + \sum_k \lambda_k \|\alpha_k\|_0 + \frac{1}{2} \sum_k \|D\alpha_k - R_k X\|_2^2.$
In this expression, the first term is the log-likelihood global force that demands a proximity between the measured image $Y$ and its denoised (and unknown) version $X$. Put as a constraint, this penalty would have read $\|X - Y\|_2^2 \le \mathrm{const} \cdot \sigma^2$, which reflects the direct relationship between $\mu$ and $\sigma$.

The second term stands for the image prior that assures that, in the constructed image $X$, every patch of size $\sqrt{n} \times \sqrt{n}$ in every location (thus the summation by $k$; for simplicity and without loss of generality, a single index is used to account for the spatial image location) has a sparse representation with a bounded error. The matrix $R_k$ stands for an operator that extracts the $k$-th block from the image. As to the coefficients $\lambda_k$, those must be spatially dependent, so as to comply with a set of constraints of the form $\|D\alpha_k - R_k X\|_2^2 \le \epsilon^2$.
2.2 Numerical Solution
Assume for the moment that the underlying dictionary $D$ is known. The objective function in Equation (3) has two kinds of unknowns: the sparse representations $\alpha_k$ per each location, and the output image $X$. Instead of addressing both together, the authors of [elad2006image] propose a block-coordinate minimization algorithm that starts with an initialization $X = Y$, and then seeks the optimal $\hat{\alpha}_k$ for all locations $k$. This leads to a decoupling of the minimization task into many smaller pursuit problems of the form

(4)  $\hat{\alpha}_k = \arg\min_{\alpha} \ \lambda_k \|\alpha\|_0 + \frac{1}{2} \|D\alpha - R_k Y\|_2^2,$

each handling a separate patch. This is solved in [elad2006image] using the Orthogonal Matching Pursuit (OMP) [elad2010sparse], which gathers one atom at a time to the solution, and stops when the error $\|D\alpha - R_k Y\|_2$ goes below a threshold $\epsilon$ (in fact, the threshold used in [elad2006image] is $\epsilon = c\sigma\sqrt{n}$, with a noise-gain factor $c$ slightly larger than 1, which was found empirically to perform best). This way, the choice of $\lambda_k$ has been handled implicitly. Thus, this stage works as a sliding-window sparse coding stage, operated on each patch of $\sqrt{n} \times \sqrt{n}$ pixels at a time.
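To make this pursuit stage concrete, the following is a minimal NumPy sketch of error-based OMP as described above. The function name and interface are ours, for illustration only; it greedily adds atoms until the residual norm drops below the threshold $\epsilon$, mirroring the stopping rule that handles $\lambda_k$ implicitly.

```python
import numpy as np

def omp(D, y, eps):
    """Greedy OMP: add one atom at a time until ||D a - y||_2 <= eps.

    D   : (n, m) dictionary with roughly unit-norm columns.
    y   : (n,) noisy patch.
    eps : error threshold, e.g. c * sigma * sqrt(n).
    """
    n, m = D.shape
    support = []
    coef = np.zeros(m)
    cs = np.zeros(0)
    residual = y.astype(float).copy()
    while np.linalg.norm(residual) > eps and len(support) < n:
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:          # no further progress possible
            break
        support.append(k)
        # least-squares refit on the current support (the "orthogonal" step)
        Ds = D[:, support]
        cs, *_ = np.linalg.lstsq(Ds, y, rcond=None)
        residual = y - Ds @ cs
    if support:
        coef[np.array(support)] = cs
    return coef
```

With an orthonormal dictionary this reduces to picking the largest-magnitude coefficients until the residual energy falls below the threshold.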
Given all the sparse representations of the patches, $\{\hat{\alpha}_k\}_k$, we can now fix those and turn to update $X$. Returning to the expression in Equation (3), we need to solve

(5)  $\hat{X} = \arg\min_X \ \frac{\mu}{2} \|X - Y\|_2^2 + \frac{1}{2} \sum_k \|D\hat{\alpha}_k - R_k X\|_2^2.$

This is a simple quadratic term that has a closed-form solution of the form

(6)  $\hat{X} = \left( \mu I + \sum_k R_k^T R_k \right)^{-1} \left( \mu Y + \sum_k R_k^T D \hat{\alpha}_k \right).$

The matrix to invert in the above expression is a diagonal one, and thus the required computation is quite simple. In fact, all that this expression does is to put the patches back in their original locations, and average these with a weighted version of the noisy image itself.
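Since the matrix to invert is diagonal, the closed-form update of Eq. (6) amounts to a per-pixel weighted average. Here is a minimal NumPy sketch for a 2-D image (the interface is our own illustration; the denoised patches $D\hat{\alpha}_k$ are assumed given):

```python
import numpy as np

def aggregate(y, patches, p, mu):
    """Closed-form image update of Eq. (6).

    y       : (H, W) noisy image.
    patches : dict mapping top-left corner (i, j) -> (p, p) denoised patch.
    mu      : weight of the data-fidelity term.
    The matrix to invert is diagonal: per pixel, mu + (number of covering patches).
    """
    num = mu * y.astype(float)       # mu * Y + sum_k R_k^T D a_k
    den = mu * np.ones_like(num)     # mu * I + sum_k R_k^T R_k (diagonal)
    for (i, j), patch in patches.items():
        num[i:i + p, j:j + p] += patch
        den[i:i + p, j:j + p] += 1.0
    return num / den
```

Every pixel is simply the average of all its denoised-patch estimates, blended with the noisy pixel value according to $\mu$.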
All the above stands for a single update of $\{\hat{\alpha}_k\}_k$ and then $\hat{X}$. For an effective block-coordinate minimization of the cost function in Equation (3), we should repeat this pair of updates several times. However, a difficulty with such an approach is the fact that once $X$ has been modified, we no longer know the level of noise in each patch, and thus the stopping criterion for the OMP becomes more challenging to set. The original KSVD denoising algorithm, as proposed in [elad2006image], chose to apply only the first round of updates. The work reported in [sulam2015expected] adopts an EPLL point of view [zoran2011learning], extending the iterative algorithm further for improved results.
2.3 Obtaining the Dictionary
The discussion so far has been based on the assumption that the dictionary $D$ is known. This could be the case if we train it using the KSVD algorithm over a corpus of clean image patches [elad2010sparse]. An interesting alternative is to embed the identification of $D$ within the Bayesian formulation. Returning to the objective function in Eq. (3), the authors of [elad2006image] also considered the case where $D$ is an unknown. In this case, $D$ is learned using all the existing noisy patches taken from $Y$ itself. Put more formally, a block-coordinate minimization is done: Initialize the dictionary $D$ as the overcomplete DCT matrix and set $X = Y$. Then iterate between the OMP over all the patches and an update of $D$ using the KSVD strategy [aharon2006k]. After several such rounds, the dictionary admits a content adapted to the image being treated, and the representations $\{\hat{\alpha}_k\}_k$ are ready for a final stage in which the output image is computed via Eq. (6).
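For illustration, a single sweep of the KSVD atom-update stage can be sketched as follows. This is a simplified NumPy version of the update in [aharon2006k] (function name and interface are ours): each atom and its row of coefficients are refit by a rank-1 SVD of the residual, restricted to the patches that actually use that atom.

```python
import numpy as np

def ksvd_atom_update(D, A, Y):
    """One sweep of K-SVD atom updates; the supports of the sparse codes
    are kept fixed, only the values (and the atoms) are refit.

    D : (n, m) dictionary,  A : (m, N) sparse codes,  Y : (n, N) patches.
    """
    D, A = D.copy(), A.copy()
    for j in range(D.shape[1]):
        users = np.nonzero(A[j])[0]          # patches that use atom j
        if users.size == 0:
            continue
        A[j, users] = 0.0
        # residual of the using patches, with atom j's contribution removed
        E = Y[:, users] - D @ A[:, users]
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]                    # best unit-norm atom (rank-1 fit)
        A[j, users] = s[0] * Vt[0]           # matching coefficients
    return D, A
```

Alternating this sweep with the OMP stage over all patches realizes the block-coordinate dictionary learning described above.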
3 Proposed Architecture
In this work our goal is to design a network that reproduces the KSVD denoising algorithm, while having the capacity to learn its parameters. One of the main difficulties we encounter is the pursuit stage, in which we are supposed to replace the greedy OMP algorithm by an equivalent learnable alternative. This may seem an easy task, as we can use the $\ell_1$-based Iterative Soft-Thresholding Algorithm (ISTA), unfolded appropriately for several iterations [gregor2010learning, daubechies2004iterative]. However, the challenge is the fact that OMP easily adapts the treatment to each patch using a stopping criterion based on the noise level. The equivalent in the ISTA case requires an identification of the appropriate regularization parameter for each patch, which is a non-trivial task. Assuming that this issue has been resolved, our computational process includes a decomposition of the image into its overlapping patches, cleaning each by an appropriate pursuit, and a reconstruction of the overall image by averaging the cleaned patches. We propose to learn the parameters of this network by training over pairs of corrupted and ground-truth images. Next, we describe this overall architecture in detail.
3.1 Patch Denoising
Figure 1 illustrates our end-to-end architecture. We start by describing the three stages that perform the denoising of the individual patches.
Sparse Coding: Given a patch $y$ (held as a column vector of length $n$) corrupted by an additive zero-mean Gaussian noise with standard deviation $\sigma$, we aim to derive its sparse code $\alpha$ according to a known dictionary $D$. This objective can be formulated as in Equation (1). An approximate solution to this problem can be obtained by replacing the $\ell_0$ norm with an $\ell_1$ [donoho2003optimally, donoho2005stable]:

(7)  $\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.} \quad \|D\alpha - y\|_2^2 \le \epsilon^2.$

For a proper choice of $\lambda$, the above can be reformulated as

(8)  $\hat{\alpha} = \arg\min_{\alpha} \ \lambda \|\alpha\|_1 + \frac{1}{2} \|D\alpha - y\|_2^2.$

A popular and effective algorithm for solving the above problem is the Iterative Soft-Thresholding Algorithm (ISTA) [daubechies2004iterative], which is guaranteed to converge to the global optimum:

(9)  $\alpha_{t+1} = S_{\lambda/c} \left( \alpha_t + \frac{1}{c} D^T \left( y - D\alpha_t \right) \right), \qquad \alpha_0 = 0,$

where $c$ is the square spectral norm of $D$ and $S_{\theta}$ is the component-wise soft-thresholding operator,

(10)  $S_{\theta}(z) = \operatorname{sign}(z) \cdot \max\left( |z| - \theta, \, 0 \right).$
The motivation to adopt a proximal gradient descent method, as done above, is the fact that it allows an unrolling of the sparse coding stage into a meaningful and learnable scheme, just as practiced in [gregor2010learning]. Indeed, replacing the $\ell_0$ norm by the $\ell_1$ supports this goal, as it allows differentiating through this scheme. For these reasons, in this work we consider a learnable version of ISTA by keeping exactly the same recursion with a fixed number of iterations $T$, and letting $\lambda$ and $c$ become the learnable parameters.
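As a sketch of this unrolled pursuit, the following NumPy code runs the recursion of Eq. (9) for a fixed number of iterations $T$. In the actual learned network, $c$ and the per-patch $\lambda$ are the quantities exposed to training; the code below only performs the forward computation (no learning), under our own illustrative interface.

```python
import numpy as np

def soft_threshold(z, theta):
    """Component-wise soft-thresholding S_theta of Eq. (10)."""
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def unrolled_ista(D, y, lam, c, T):
    """T fixed ISTA iterations for the Lagrangian pursuit of Eq. (8).

    D   : (n, m) dictionary (learned, shared across patches).
    y   : (n,) noisy patch.
    lam : regularization coefficient (per patch, predicted by the MLP).
    c   : step-size normalizer (>= squared spectral norm of D).
    """
    alpha = np.zeros(D.shape[1])          # alpha_0 = 0
    for _ in range(T):                    # exactly the recursion of Eq. (9)
        alpha = soft_threshold(alpha + (D.T @ (y - D @ alpha)) / c, lam / c)
    return alpha
```

Keeping $T$ fixed turns the iterative solver into a feed-forward computation of fixed depth, through which gradients can flow.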
Evaluation: Referring to the pursuit formulation in Equation (8), an important issue is the need to set the parameter $\lambda$. This regularization coefficient depends not only on $\sigma$ but also on the patch itself. Following the computational path of the KSVD denoising algorithm in [elad2006image], we should set $\lambda$ for each patch so as to yield a sparse representation with a controlled level of error, $\|D\hat{\alpha}_k - R_k Y\|_2^2 \le \epsilon^2$. As there is no closed-form solution to this evaluation of $\lambda$, we propose to learn a regression function from the patches to their corresponding regularization parameters, $\lambda_k = f_{\theta}(R_k Y)$. A Multi-Layer Perceptron (MLP) network is used to represent this function, where $\theta$ is the vector of the parameters of the MLP. Our MLP consists of three hidden layers, each composed of a fully connected linear mapping followed by a ReLU (apart from the last layer). The input layer has $n$ nodes, which is the dimension of the vectorized patch, and the output layer consists of a single node, being the regularization parameter. Only a modest number of parameters is needed for this regression network.

Patch Reconstruction: This stage reconstructs the cleaned version of the patch using $D$ and the sparse code $\hat{\alpha}$, given by $\hat{x} = D\hat{\alpha}$. Note that in our learned network, the dictionary $D$ stands for a set of parameters that are shared in all locations where we multiply by either $D$ or $D^T$.
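A minimal sketch of the forward pass of such a $\lambda$-evaluation MLP is given below. The text above fixes only the input size $n$, the single-node output, and the ReLU placement; the layer widths and the interface here are illustrative assumptions of ours.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def predict_lambda(patch, weights, biases):
    """Forward pass of the lambda-regression MLP.

    patch   : (n,) vectorized noisy patch.
    weights : list of weight matrices, one per fully connected layer.
    biases  : matching list of bias vectors.
    Every layer but the last is followed by a ReLU; the last layer maps
    to a single node, the regularization coefficient for this patch.
    """
    h = patch
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    W, b = weights[-1], biases[-1]
    return float((W @ h + b)[0])    # scalar lambda for this patch
```

The predicted scalar is then fed, per patch, into the unrolled pursuit as its regularization coefficient.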
3.2 End-to-End Architecture
We can now discuss the complete architecture. We start by breaking the input image into fully overlapping patches, then treat each corrupted patch via the above-described patch denoising stage, and conclude by rebuilding the image by averaging the cleaned versions of these patches. In the last stage we slightly deviate from the original KSVD, by allowing a learned weighted combination of the patches. Denoting by $w \in \mathbb{R}^n$ this patch of weights, the reconstructed image is obtained by

(11)  $\hat{X} = \frac{\sum_k R_k^T \left( w \odot D\hat{\alpha}_k \right)}{\sum_k R_k^T w},$

where $\odot$ is the Schur (element-wise) product, and the division is done element-wise. This weighted averaging aligns with Guleryuz's approach as advocated in [guleryuz2007].
To conclude, the proposed network is a parametrized function of $\theta$ (the parameters of the MLP network computing $\lambda$), $c$ (the step-size normalizer in the ISTA algorithm), $D$ (the dictionary) and $w$ (the weights for the patch-averaging). The overall number of parameters is modest; for the default configuration used in our experiments it stands at about 45K (see Table VI).
Given a corrupted image $Y$, the computation $F(Y)$ returns a cleaned version of it. Training is done by minimizing the loss function

$\mathcal{L} = \sum_i \left\| F(Y_i) - X_i \right\|_2^2$

with respect to all the above parameters. In this objective, the set $\{X_i\}_i$ stands for our training images, and $\{Y_i\}_i$ are their synthetically noisy versions, obtained by $Y_i = X_i + v_i$, where $v_i$ is a zero-mean white Gaussian i.i.d. noise vector.

3.3 Extension to Multiple Updates
As already mentioned in the previous section, an EPLL version of the KSVD can be envisioned, in which the process of cleaning the patches is repeated several times. This implies that once the above architecture obtains its output $\hat{X}$, the whole scheme could be applied again (and again). This diffusion process of repeated denoisings has been shown in [sulam2015expected] to improve the KSVD denoising performance. However, the difficulty is in setting the noise level to target in each patch after the first denoising, as it is no longer $\sigma$. In our case, we adopt a crude version of the EPLL scheme, in which we disregard the noise-level problem altogether, and simply assume that the evaluation stage takes care of this challenge, adjusting the MLP in each round to best predict the $\lambda$ values to be used. Thus, our iterated scheme shares the dictionary across all denoising stages, while allowing a different evaluation network for each stage.
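Schematically, this crude EPLL-like iteration can be sketched as follows (illustrative Python of ours; `denoise_once` stands for the full decompose/pursuit/average pipeline described above):

```python
def iterated_denoise(y, D, stage_mlps, denoise_once):
    """EPLL-style repeated denoising: the dictionary D is shared across
    all stages, while each stage gets its own lambda-evaluation network
    (one entry of `stage_mlps` per round).

    denoise_once(x, D, mlp) -> one full patch-decompose / pursuit /
    weighted-average pass over the current image estimate x.
    """
    x = y
    for mlp in stage_mlps:       # T denoising rounds
        x = denoise_once(x, D, mlp)
    return x
```

Each round re-extracts patches from the current estimate, so the per-stage MLP implicitly absorbs the (unknown) residual noise level of that round.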
4 Experimental Results
We turn to present experiments with the proposed Learned KSVD (LKSVD). Our goals are to show that LKSVD is (i) much better than the original KSVD in its two forms – the image-adaptive algorithm and the one using a universal dictionary; (ii) better than other classic denoising algorithms; and (iii) competitive with recent deep-learning based denoisers.
4.1 Training
Dataset: In order to train our model we generate the training data using the Berkeley segmentation dataset (BSDS) [MartinFTM01], which consists of 500 images. We split these images into a training set of 432 images and the validation/test set that consists of the remaining 68 images. We note that these 68 images are exactly the ones used in the standard evaluation dataset of [roth2009fields]. In addition, following [liu2018non, zhang2017beyond], we test our proposed method on the benchmark Set12 – a collection of widelyused testing images. The training and the two test sets are strictly disjoint and all the images are converted to grayscale in each experiment setup. This allows a fair and comprehensive comparison with recent deep learning based methods, as we train and test on the same datasets and benchmarks used in [lefkimmiatis2017non, lefkimmiatis2018universal, zhang2017beyond, liu2018non, chen2016trainable, mao2016image].
Training Settings: During training we randomly sample cropped sub-images from the training set. We add i.i.d. Gaussian noise with zero mean and a specified noise level to each cropped image as the noisy input during training. We train a different model for each noise level, considering $\sigma \in \{15, 25, 50\}$.

We use the SGD optimizer to minimize the loss function, with a fixed learning rate and one cropped image as the mini-batch during training. We use the same initialization as in the KSVD algorithm for the dictionary $D$, i.e., the overcomplete DCT matrix. We also initialize the normalization parameter $c$ of the sparse coding stage using the squared spectral norm of the DCT matrix. The other parameters of the network are randomly initialized using the Kaiming-uniform method. Training a model takes several days on a Titan Xp GPU.
Test Settings: Our network does not depend on the input size of the image. Thus, in order to test our architecture's performance, we simply add white Gaussian noise of a specified power to the original image, and feed it to the learned scheme. The metric used to determine the quality is the standard Peak Signal-to-Noise Ratio (PSNR).
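For completeness, the PSNR measure used throughout the evaluation can be computed as follows (standard definition; the `peak` value of 255 assumes 8-bit grayscale images):

```python
import numpy as np

def psnr(x, x_hat, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a ground-truth image x
    and its estimate x_hat (assumes x != x_hat, i.e. nonzero MSE)."""
    mse = np.mean((np.asarray(x, float) - np.asarray(x_hat, float)) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))
```

For instance, an all-zero image against a constant error of 25.5 gray levels gives an MSE of $25.5^2$ and hence a PSNR of 20dB.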
4.2 Results
In Tables I, II and III we compare LKSVD with the two original KSVD versions (image-adaptive and universal-dictionary) and two leading classic denoising algorithms, BM3D [dabov2007video] and WNNM [gu2014weighted]. (The results in these tables corresponding to BM3D and WNNM have been taken from [liu2018non] and [zhang2018ffdnet], respectively.) Tables I and II refer to the BSD68 test set (one showing PSNR and the other SSIM quality measures), and Table III shows the Set12 results (PSNR only). In this comparison, LKSVD is set to use the same patch and dictionary sizes as in [elad2006image], namely $8 \times 8$ patches and a dictionary of $m = 256$ atoms. Also, LKSVD applies a fixed number of unfolded ISTA iterations and several EPLL-like denoising rounds.
Table I: PSNR results on BSD68.

Dataset  Noise  BM3D  WNNM  KSVD (adaptive)  KSVD (universal)  LKSVD
BSD68  15  31.07  31.37  30.91  30.87  31.48
BSD68  25  28.57  28.83  28.32  28.28  28.96
BSD68  50  25.62  25.87  25.03  25.01  25.97
Table II: SSIM results on BSD68.

Dataset  Noise  BM3D  WNNM  KSVD (adaptive)  KSVD (universal)  LKSVD
BSD68  15  0.8717  0.8766  0.8692  0.8685  0.8835
BSD68  25  0.8013  0.8087  0.7876  0.7894  0.8171
BSD68  50  0.6864  0.6982  0.6322  0.6462  0.7035
Table III: PSNR results on Set12.

Images  C.man  House  Peppers  Starfish  Monarch  Airplane  Parrot  Lena  Barbara  Boat  Man  Couple  Average

Noise level 15:
BM3D  31.91  34.93  32.69  31.14  31.85  31.07  31.37  34.26  33.10  32.13  31.92  32.10  32.37
WNNM  32.17  35.13  32.99  31.82  32.71  31.39  31.62  34.27  33.60  32.27  32.11  32.17  32.70
KSVD (adaptive)  31.43  32.21  34.23  30.80  31.59  30.99  31.64  31.45  30.95  31.83  32.44  33.78  31.95
KSVD (universal)  31.39  32.16  33.85  30.96  31.66  30.96  31.62  31.71  30.99  31.63  30.58  33.49  31.75
LKSVD  32.16  32.92  34.59  31.54  32.11  31.66  32.22  32.78  31.78  32.18  32.22  34.24  32.53

Noise level 25:
BM3D  29.45  32.85  30.16  28.56  29.25  28.42  28.93  32.07  30.71  29.90  29.61  29.71  29.97
WNNM  29.64  33.22  30.42  29.03  29.84  28.69  29.15  32.24  31.24  30.03  29.76  29.82  30.26
KSVD (adaptive)  28.75  29.64  31.86  28.21  29.10  28.42  29.26  28.83  28.27  29.44  29.77  31.37  29.41
KSVD (universal)  28.78  29.74  31.46  28.39  29.04  28.57  29.16  28.85  28.24  29.18  27.61  31.04  29.17
LKSVD  29.70  30.35  32.53  28.92  29.71  29.13  29.85  30.15  28.99  30.07  30.06  31.99  30.12

Noise level 50:
BM3D  26.13  29.69  26.68  25.04  25.82  25.10  25.90  29.05  27.22  26.78  26.81  26.46  26.72
WNNM  26.45  30.33  26.95  25.44  26.32  25.42  26.14  29.25  27.79  26.97  26.94  26.64  27.05
KSVD (adaptive)  25.12  25.93  27.82  24.86  25.56  24.80  26.16  25.11  24.45  25.98  25.78  27.71  25.78
KSVD (universal)  25.29  26.02  27.71  24.85  25.44  25.15  25.98  24.82  24.32  25.93  24.04  27.32  25.58
LKSVD  26.67  26.97  29.37  25.61  26.55  26.00  26.95  26.54  25.38  26.98  27.03  28.86  26.91
A clear conclusion from the above tables is that LKSVD performs much better than the classic KSVD, be it the universal-dictionary approach or the image-adaptive one. Indeed, the BSD68 PSNR results suggest that LKSVD is better than BM3D (by roughly 0.4dB) and WNNM (by roughly 0.1dB) as well. Table III displays a slightly different story, where LKSVD still performs better than BM3D, while being slightly weaker than WNNM. Recall that BM3D and WNNM both leverage non-local self-similarity, which gives them an edge over KSVD. In addition, these two methods have been tuned for best results on this test set (see in particular their exceptional results for Lena and Barbara). As a final note we add that Table II shows that the ordering of the methods remains the same as we move from PSNR to the SSIM quality measure, which explains our choice to use PSNR for the rest of the experiments.
We proceed by exploring the effect of the patch size, the dictionary size, and the number of denoising rounds $T$ on the LKSVD performance. Table IV presents the obtained results for the two benchmarks (BSD68 and Set12) and a noise level of $\sigma = 25$. As can be seen, even with a single denoising round, LKSVD is markedly better than the classic KSVD. As $T$ grows, the performance improves further, by up to 0.2dB per additional denoising round. A boost in performance is also obtained when growing the patch size while preserving the redundancy factor of the dictionary. This also shows that the proposed scheme has the capacity to yield results that go beyond the ones reported in Tables I and III.
Table IV: PSNR of LKSVD configurations (growing $T$ and patch size, left to right) versus the classic KSVD, at noise level 25.

Dataset  Noise  KSVD (adaptive)  KSVD (universal)  LKSVD configurations
BSD68  25  28.32  28.28  28.76  28.96  28.95  29.07
Set12  25  29.41  29.17  29.76  30.12  30.09  30.22
We conclude by comparing LKSVD with recent learning-based denoising competitors: TNRD [chen2016trainable], NLNet [lefkimmiatis2017non], DnCNN [zhang2017beyond] and NLRNet [liu2018non]. The results are shown in Table V, referring to the two benchmarks. As can be seen, our scheme surpasses TNRD [chen2016trainable] and even the non-local deeply-learned denoiser by Lefkimmiatis [lefkimmiatis2018universal, lefkimmiatis2017non]. Still, there is a gap between LKSVD and the best performing denoisers, DnCNN [zhang2017beyond] and NLRNet [liu2018non]. Table VI sheds more light on these results by presenting the model complexities involved in this experiment. As can be seen, our network uses roughly 10% of the number of parameters of the better performing methods.
Table V: PSNR comparison with recent learning-based denoisers.

Dataset  Noise  TNRD  NLNet  DnCNN  NLRNet  LKSVD
BSD68  15  31.42  31.52  31.73  31.88  31.54
BSD68  25  28.92  29.03  29.23  29.41  29.07
BSD68  50  25.97  26.07  26.23  26.47  26.13
Set12  15  32.50  -  32.86  33.16  32.61
Set12  25  30.06  -  30.44  30.80  30.22
Set12  50  26.81  -  27.18  27.64  27.04
Table VI: Model complexities.

  DnCNN  NLRNet  LKSVD
Max effective depth  17  38  21
Parameter sharing  No  Yes  Yes
Parameter no.  554k  330k  45k
Finally, we present visual results of the various methods compared. Figure 2 shows the denoising results of BM3D, WNNM, KSVD, DnCNN, and LKSVD for a noise level of 25, on four images taken from the BSD68 test set. The PSNR values obtained per image are:

Noisy  BM3D  WNNM  KSVD  DnCNN  LKSVD
20.17  27.08  27.25  27.15  27.72  27.43
20.15  32.43  32.59  31.77  26.29  32.57
20.19  28.22  28.34  27.91  30.30  28.56
20.17  25.60  25.85  25.74  30.74  26.16
5 Discussion and Conclusions
5.1 Why bother improving KSVD denoising?
The rationale behind this work goes beyond a simple improvement of the KSVD denoising algorithm. Indeed, our motivation is drawn from the hope to propose systematic ways of designing deeplearning architectures and connecting novel solutions to classical algorithms.
A fundamental question nowadays in computational imaging is whether old/classic methods should be discarded and replaced by their deep-learning alternatives. In the context of image denoising, classical methods focused on data modeling and optimization, and searched for ways to identify and exploit the redundancies existing in the visual data. The recent deep networks, which lead the denoising performance charts today, take an entirely different route, targeting the inference stage directly and learning their parameters for optimized end-to-end performance. Now that these methods are getting close to touching their ceiling, our work comes to argue that the classic methods are still very much relevant, and could become key in breaking such barriers. We believe that classic image processing algorithms will make a comeback for this exact reason.
Adopting a different point of view, this work offers a migration from intuitively chosen architectures, as many recent papers have offered, towards well-justified ones based on domain knowledge of the problem we are trying to solve. That is, the structure of the denoising problem is embedded into the deep-learning architecture, making the overall algorithm enjoy both the flexibility of the deep methods and the structure brought by the more classical approaches. The option of piling convolutions, ReLUs, batch-normalization steps, skip connections, strides and pooling operations, dilated filtering, and many other tricks, and seeking the best performing architectures by trial and error, has been the dominating approach so far. It is time to return to the theoretical foundations of signal and image processing in order to go beyond this point. Relying on sparse representation modeling, the KSVD network we introduce in this work has a clear objective and a concise structure, and yet it works quite well. In fact, we believe that the results shown here stand as yet another testimony to the central role that sparse modeling plays in broad data processing.
And related to the above, here is an interesting question: What is the simplest possible network, in terms of the number of free parameters to learn and the number of computations to apply, for getting state-of-the-art image denoising? In single-image super-resolution it has become common practice in the literature to compare different solutions by considering their complexity as well (e.g., [Timofte_2016_CVPR]). This is done by showing points in a 2D graph of PSNR versus computational cost. Doing the same in image denoising may reveal interesting patterns. The general deep-learning based methods, while showing the best PSNR, tend to be quite heavy and cumbersome. Could much lighter networks perform nearly as well (and perhaps even better)? In this work we offer one such avenue to explore, and we are certain that many others will follow.

5.2 Going Beyond Sparsity?
Why has it been so easy to outperform the original KSVD denoising algorithm in the first place? A possible answer could be that this algorithm builds its cleaning abilities on two prime forces: (i) the spatial redundancy that exists in image patches, exposed by the sparse modeling; and (ii) the patchaveraging effect, which has an MMSE flavor to it [papyan2015multi]. Many of the better performing competitors strengthen their performance by considering several additional ideas:

NonLocality: Nonlocal selfsimilarity can be practiced as an additional prior, as done by BM3D [dabov2007video] and lowrank modeling [gu2014weighted, Yair_2018_CVPR]. Indeed, the paper by Mairal et. al [mairal2009non] extended the KSVD denoising by incorporating joint sparsity on groups of patches, this way introducing nonlocality. Broadly speaking, nonlocal methods are known to be effective in capturing the correlation between farapart patches, leading to improved restoration.

Patch Consensus: Patch based methods must address the disagreement found between overlapping patches. The original KSVD scheme we embark from in this paper proposed an averaging^{6}^{6}6While the original KSVD denoising algorithm has used a plain averaging, we deploy a slightly improved weighted option, due to its simplicity in the context of a learned machine. of these patches. However, the EPLL approach [zoran2011learning] suggests a far better strategy, by imposing the prior on patches taken from the resulting image, rather than ones extracted from the measured one. In the context of sparse modeling, this idea boils down to an iterated KSVD algorithm, as was shown in [sulam2015expected]. In such a scheme the cleaned image is aggregated and broken to patches again for subsequent pursuit. We have deployed this very idea in an elementary way by replicating the filtering process. Closely related alternatives to this strategy are the SOS boosting method [romano2015boosting] and the deployment of the CSC model [papyan2017convolutional].

Multi-Scale: Multi-scale analysis of visual data seems to be a natural strategy to follow, and various papers have shown its benefit for image denoising [papyan2015multi]. More specifically, a multi-scale extension of the KSVD denoising algorithm has been considered in various practical ways [sulam2014image, ophir2011multi, mairal2007multiscale, mairal2008learning].
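One simple way to wrap any single-scale denoiser in a multi-scale scheme is a Laplacian-pyramid-like recursion: denoise a downsampled copy (where noise is attenuated and structures are coarser), upsample it back as a base layer, and let the fine-scale denoiser handle the residual detail. This is a minimal sketch under these assumptions and does not reproduce any of the cited multi-scale KSVD variants:

```python
import numpy as np

def multiscale_denoise(noisy, denoise, levels=2):
    """Coarse-to-fine wrapper around an arbitrary single-scale denoiser
    `denoise` (e.g. a KSVD step): denoise a 2x-downsampled copy, upsample
    it as a base layer, and apply the denoiser to the fine-scale residual.
    Illustrative sketch, not a cited algorithm."""
    if levels == 0 or min(noisy.shape) < 2:
        return denoise(noisy)
    h, w = (noisy.shape[0] // 2) * 2, (noisy.shape[1] // 2) * 2
    # 2x2 mean pooling as a crude downsampling.
    coarse = noisy[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    coarse_hat = multiscale_denoise(coarse, denoise, levels - 1)
    # Nearest-neighbor upsampling back to the fine grid.
    base = np.zeros_like(noisy)
    base[:h, :w] = np.repeat(np.repeat(coarse_hat, 2, axis=0), 2, axis=1)
    detail = noisy - base            # what the coarse scale missed
    return base + denoise(detail)    # fine scale cleans the residual
```

With the identity as `denoise`, the decomposition is exactly invertible, so all of the actual cleaning is delegated to the per-scale denoiser.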
The above suggests that KSVD denoising in its original form carries a built-in weakness. Yet, the results in this paper suggest otherwise. Consider the more recent and better-performing deep-learning based solutions. These alternatives seem to disregard these extra forces (at least explicitly), concentrating instead on capturing intrinsic image properties by a direct supervised learning of the inference process. Recent such convolutional neural networks (CNNs) for image restoration [zhang2017beyond, mao2016image] achieve impressive performance over classical approaches. Do these methods exploit self-similarity? Anything reminiscent of patch-consensus? A multi-scale architecture? One may argue that the answer is, at most, only partially positive, hidden by the wide receptive field and the global treatment that these networks entertain. Note that there are deep-learning methods that explicitly use self-similarity in their processing [lefkimmiatis2017non, liu2018non]; however, those do not necessarily improve over the simpler alternatives.

The conclusion we draw from the above is that there is room for introducing non-locality, patch-consensus and a multi-scale structure into the proposed KSVD scheme, thereby driving the revised architecture towards even better results. Indeed, nothing is sacred in the KSVD computational path, and the same treatment as done in this work could be given to well-performing classical denoising algorithms, such as BM3D [dabov2007video], kernel-based methods [takeda2006kernel] and WNNM [gu2014weighted]. We leave these ideas for future work.
5.3 Could we suggest an unsupervised version of this architecture?
This is perhaps a good time to recall that the denoising work in [elad2006image] offered two strategies for obtaining the dictionary – a globally universal approach that trains the dictionary offline, and an image-adaptive alternative that trains on the noisy image patches themselves. Interestingly, despite the fact that the latter (image-adaptive) approach was found to perform better, the solution we put forward in this paper aligns solely with the first approach. Why? Because the supervised strategy we adopt naturally leads to a single architecture that serves all images via the same set of parameters. Could we offer an unsupervised alternative, more in line with the image-adaptive path? The answer, while tricky, could be positive. A related approach of great relevance is [ulyanov2018deep], in which a chosen network architecture is trained anew on each image. A similar concept could be envisioned, where our own KSVD architecture is used for synthesizing the clean image. However, this raises some difficulties and challenges, which is why we leave this activity for future work.
5.4 Conclusions
This work shows that the good old KSVD denoising algorithm [elad2006image] can make a comeback and perform much better, getting closer to leading deep-learning based denoisers. This is achieved very simply by setting its parameters in a supervised fashion, while preserving its exact original form. Our work has shown how to turn the KSVD denoiser into a learnable architecture that enables backpropagation, and demonstrated the resulting boost in denoising performance. As the discussion above reveals, our story goes beyond KSVD denoising and its improvement, towards more fundamental questions related to the role of deep learning in contemporary image processing.