Deep-K-SVD
This work considers noise removal from images, focusing on the well-known K-SVD denoising algorithm. This sparsity-based method was proposed in 2006, and for a short while it was considered state-of-the-art. However, over the years it has been surpassed by other methods, including the recent deep-learning-based newcomers. The question we address in this paper is whether K-SVD was brought to its peak in its original conception, or whether it can be made competitive again. The approach we take in answering this question is to redesign the algorithm to operate in a supervised manner. More specifically, we propose an end-to-end deep architecture with the exact K-SVD computational path, and train it for optimized denoising. Our work shows how to overcome difficulties arising in turning the K-SVD scheme into a differentiable, and thus learnable, machine. With a small number of parameters to learn and while preserving the original K-SVD essence, the proposed architecture is shown to outperform the classical K-SVD algorithm substantially, and to get closer to recent state-of-the-art learning-based denoising methods. Adopting a broader context, this work touches on themes around the design of deep-learning solutions for image processing tasks, while building a bridge between classic methods and novel deep-learning-based ones.
This paper addresses the classic image denoising problem: an ideal image $\mathbf{X}$ is measured in the presence of an additive zero-mean white and homogeneous Gaussian noise, $\mathbf{V}$, with standard deviation $\sigma$. The measured image is thus $\mathbf{Y} = \mathbf{X} + \mathbf{V}$, and our goal is the recovery of $\mathbf{X}$ from $\mathbf{Y}$ with the knowledge of the parameter $\sigma$. This is quite a challenging task due to the need to preserve the fine details in $\mathbf{X}$ while rejecting as much noise as possible.
The importance of the image denoising problem cannot be overstated. First and foremost, noise corruption is inevitable in any image sensing process, oftentimes heavily degrading the visual quality of the acquired image. Indeed, today’s cell-phones all deploy a denoising algorithm of some sort in their camera pipelines [plotz2017benchmarking]. Removing noise from an image is also an essential and popular pre-step in various image processing and computer vision tasks [katsaggelos2012digital]. Last but not least, many image restoration problems can be addressed effectively by solving a series of denoising sub-problems, further broadening the applicability of image denoising algorithms [afonso2010fast, romano2017little]. Due to its practical importance and the fact that it is the simplest inverse problem, image denoising has become the entry point for many new ideas brought over the years to the realm of image processing. Over a period of several decades, many image denoising algorithms have been proposed and tested, forming an evolution of methods with gradually improved performance.
A common and systematic approach for the design of novel denoising algorithms is the Bayesian point of view. This calls for image priors, used as regularizers within the Maximum a Posteriori (MAP) or the Minimum Mean Squared Error (MMSE) estimators. In this paper we concentrate on one specific regularization approach, as introduced in [elad2006image]: the use of sparse and redundant representation modeling of image patches. This is the K-SVD denoising algorithm, which stands at the center of this paper. The authors of [elad2006image] defined a global image prior that forces sparsity over patches in every location in the image. Their algorithm starts by breaking the image into small fully overlapping patches, solving their MAP estimate (i.e., finding their sparse representation), and ending with a tiling of the results back together by an averaging. As the MAP estimate relies on the availability of the dictionary, this work proposed two approaches, both harnessing the well-known K-SVD dictionary learning algorithm [aharon2006k]. The first option is to train off-line on an external large corpus of image patches, aiming for a universally good dictionary to serve all test images. The alternative, which was found to be more effective, suggests using the noisy patches themselves in order to learn the dictionary, this way adapting to the denoised image.
K-SVD has been widely used and extended, as evidenced by its many followup papers. For a short while, this algorithm was considered state-of-the-art, standing at the top in denoising performance (ranking denoising algorithms is typically done by evaluating synthetic denoising performance on agreed-upon image databases, e.g. Set12 or BSD68, measuring Peak Signal-to-Noise Ratio (PSNR) and/or Structural Similarity Index Measure (SSIM) results). However, over the years it has been surpassed by other methods, such as BM3D [dabov2007video], EPLL [zoran2011learning], WNNM [gu2014weighted], and many others. The recent newcomers to this game, supervised deep-learning based denoising methods, are currently at the lead [chen2016trainable, lefkimmiatis2017non, zhang2017beyond, liu2018non, zhang2018ffdnet].
Can K-SVD denoising make a comeback and compete favorably with the most recent and best performing denoising algorithms? In this paper we answer this question positively. We show that this algorithm can be brought to perform far better, provided that its parameters are tuned in a supervised manner. By following the exact K-SVD computational path, we preserve its global image prior. This includes (i) breaking the image into small fully overlapping patches, (ii) solving their MAP estimate as a pursuit that aims to get their sparse representation in a learned dictionary, and then (iii) averaging the overlapping patches to restore the clean image. Special care is given to the redesign of all these steps into a differentiable and learnable computational scheme. We therefore end up with a deep architecture that reproduces the exact K-SVD operations, and can be trained by back-propagation for best denoising results. Our work shows that with a small number of parameters to learn and while preserving the original K-SVD essence, the proposed machine outperforms the original K-SVD and other classical algorithms (e.g. BM3D and WNNM), and gets closer to state-of-the-art learning-based denoising methods.
Our motivation in this paper goes beyond a simple improvement of the K-SVD denoising algorithm, aiming higher and broader. What are the lessons to be taken from the derived solution? How should we design novel and well-justified architectures for solving signal and image processing problems? What is the relation between classic (old fashioned) solutions and learning-based novel ones, in the context of such tasks? How can we further improve the proposed scheme in a principled way? All these and more are central questions, discussed towards the end of this paper. We urge the readers to go through that discussion carefully, as it gives the proper context to this work, and to its future prospects.
This paper is organized as follows. Section 2 recalls the K-SVD denoising algorithm, serving as the background for our derived alternative. In Section 3 we present the designed architecture with various modifications and adjustments that enable differentiability, local adaptivity, and more. Section 4 describes a series of experiments that demonstrate the superiority of the proposed learned network over the classic K-SVD denoising algorithm, and show that our proposed network tends to be competitive with recent learned methods. We conclude this work in Section 5 with a wide discussion about this work and its contributions, and highlight potential future research directions.
In [elad2006image] the authors address the image denoising problem by using local sparsity and redundancy as ingredients in the formation of a global Bayesian objective. In this section we describe this K-SVD denoising algorithm by discussing (i) their global prior; (ii) the objective function induced; (iii) its corresponding numerical solver; and (iv) the two approaches for training the corresponding dictionary.
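We start by introducing the local prior as imposed on patches in [elad2006image]. Let $\mathbf{x}$ be a small image patch of size $\sqrt{p} \times \sqrt{p}$ pixels, ordered lexicographically as a column vector of length $p$. The sparse representation model assumes that $\mathbf{x}$ is built as a linear combination of $s \ll m$ columns (also referred to as atoms) taken from a pre-specified dictionary $\mathbf{D} \in \mathbb{R}^{p \times m}$ (the option $m > p$ implies that the dictionary is redundant). Put formally, $\mathbf{x} = \mathbf{D}\boldsymbol{\alpha}$, where $\boldsymbol{\alpha} \in \mathbb{R}^{m}$ is a sparse vector with $s$ non-zeros (this is denoted by $\|\boldsymbol{\alpha}\|_0 = s$). Consider $\mathbf{y}$, a noisy version of $\mathbf{x}$, contaminated by an additive zero-mean white Gaussian noise with standard deviation $\sigma$. The MAP estimator for denoising this patch is obtained by solving
$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \|\boldsymbol{\alpha}\|_0 \quad \text{s.t.} \quad \|\mathbf{D}\boldsymbol{\alpha} - \mathbf{y}\|_2^2 \le p\sigma^2, \tag{1}$$
aiming to recover the sparse representation vector of $\mathbf{x}$. This is followed by $\hat{\mathbf{x}} = \mathbf{D}\hat{\boldsymbol{\alpha}}$, obtaining the denoised result [chen2001atomic, donoho2005stable, tropp2006just]. Note that the above optimization can be changed to a Lagrangian form,
$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \lambda\|\boldsymbol{\alpha}\|_0 + \frac{1}{2}\|\mathbf{D}\boldsymbol{\alpha} - \mathbf{y}\|_2^2, \tag{2}$$
such that the constraint becomes a penalty. With a proper choice of $\lambda$, which is signal (the vector $\mathbf{y}$) dependent, the two problems can become equivalent.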
Moving now to handle a complete and large image $\mathbf{X}$ of size $\sqrt{N} \times \sqrt{N}$ and its noisy version $\mathbf{Y}$ (both held as vectors of length $N$), the global image prior proposed in [elad2006image] imposes the above-described local prior on every patch in $\mathbf{X}$, considering their extractions with full overlaps. This leads to the following global MAP estimator for the denoising:
$$\min_{\{\boldsymbol{\alpha}_k\}_k,\, \mathbf{X}} \; \frac{\mu}{2}\|\mathbf{X} - \mathbf{Y}\|_2^2 + \sum_k \left( \lambda_k\|\boldsymbol{\alpha}_k\|_0 + \frac{1}{2}\|\mathbf{D}\boldsymbol{\alpha}_k - \mathbf{R}_k\mathbf{X}\|_2^2 \right). \tag{3}$$
In this expression, the first term is the log-likelihood global force that demands a proximity between the measured image, $\mathbf{Y}$, and its denoised (and unknown) version $\mathbf{X}$. Put as a constraint, this penalty would have read $\|\mathbf{X} - \mathbf{Y}\|_2^2 \le N\sigma^2$, which reflects the direct relationship between $\mu$ and $\sigma$.
The second term stands for the image prior that assures that in the constructed image, $\mathbf{X}$, every patch $\mathbf{R}_k\mathbf{X}$ of size $\sqrt{p} \times \sqrt{p}$ in every location (thus, the summation by $k$; for simplicity and without loss of generality, a single index is used to account for the spatial image location) has a sparse representation with bounded error. The matrix $\mathbf{R}_k$ stands for an operator that extracts the $k$-th block from the image. As to the coefficients $\lambda_k$, those must be spatially dependent, so as to comply with a set of constraints of the form $\|\mathbf{D}\boldsymbol{\alpha}_k - \mathbf{R}_k\mathbf{X}\|_2^2 \le p\sigma^2$.
Assume for the moment that the underlying dictionary $\mathbf{D}$ is known. The objective function in Equation (3) has two kinds of unknowns: the sparse representations $\boldsymbol{\alpha}_k$ per each location, and the output image $\mathbf{X}$. Instead of addressing both together, the authors of [elad2006image] propose a block-coordinate minimization algorithm that starts with an initialization $\mathbf{X} = \mathbf{Y}$, and then seeks the optimal $\hat{\boldsymbol{\alpha}}_k$ for all locations $k$. This leads to a decoupling of the minimization task to many smaller pursuit problems of the form
$$\hat{\boldsymbol{\alpha}}_k = \arg\min_{\boldsymbol{\alpha}} \lambda_k\|\boldsymbol{\alpha}\|_0 + \frac{1}{2}\|\mathbf{D}\boldsymbol{\alpha} - \mathbf{R}_k\mathbf{X}\|_2^2, \tag{4}$$
each handling a separate patch. This is solved in [elad2006image] using the Orthogonal Matching Pursuit (OMP) [elad2010sparse], which gathers one atom at a time to the solution, and stops when the error $\|\mathbf{D}\boldsymbol{\alpha} - \mathbf{R}_k\mathbf{X}\|_2^2$ goes below $p\sigma^2$ (in fact, the threshold used in [elad2006image] is a scaled version of this bound, with the scaling constant found empirically to perform best). This way, the choice of $\lambda_k$ has been handled implicitly. Thus, this stage works as a sliding window sparse coding stage, operated on each patch of size $\sqrt{p} \times \sqrt{p}$ pixels at a time.
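The greedy pursuit described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in [elad2006image]: the function name `omp`, its arguments, and the `max_atoms` safeguard are introduced here for illustration, and `eps2` stands for whatever squared-error threshold is derived from the noise level.

```python
import numpy as np

def omp(D, y, eps2, max_atoms=None):
    """Greedy sparse coding sketch: add one atom at a time until
    the squared residual ||y - D @ alpha||^2 drops to eps2 or below."""
    p, m = D.shape
    max_atoms = p if max_atoms is None else max_atoms  # hypothetical safeguard
    used = np.zeros(m, dtype=bool)       # atoms already selected
    support = []                         # selected atom indices, in order
    coef = np.zeros(0)
    residual = np.asarray(y, dtype=float).copy()
    while residual @ residual > eps2 and len(support) < max_atoms:
        # choose the unused atom most correlated with the residual
        corr = np.abs(D.T @ residual)
        corr[used] = -1.0
        j = int(np.argmax(corr))
        used[j] = True
        support.append(j)
        # least-squares refit over all selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    alpha = np.zeros(m)
    if support:
        alpha[support] = coef
    return alpha
```

Because the stopping rule is expressed through the residual energy, the number of atoms used varies from patch to patch, which is the per-patch adaptivity that the sparsity weight would otherwise have to provide.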
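Given all the sparse representations of the patches, $\{\hat{\boldsymbol{\alpha}}_k\}_k$, we can now fix those and turn to update $\mathbf{X}$. Returning to the expression in Equation (3), we need to solve
$$\hat{\mathbf{X}} = \arg\min_{\mathbf{X}} \frac{\mu}{2}\|\mathbf{X} - \mathbf{Y}\|_2^2 + \frac{1}{2}\sum_k \|\mathbf{D}\hat{\boldsymbol{\alpha}}_k - \mathbf{R}_k\mathbf{X}\|_2^2. \tag{5}$$
This is a simple quadratic term that has a closed-form solution of the form
$$\hat{\mathbf{X}} = \left( \mu\mathbf{I} + \sum_k \mathbf{R}_k^T\mathbf{R}_k \right)^{-1} \left( \mu\mathbf{Y} + \sum_k \mathbf{R}_k^T\mathbf{D}\hat{\boldsymbol{\alpha}}_k \right). \tag{6}$$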
The matrix to invert in the above expression is a diagonal one, and thus the required computation is quite simple. In fact, all that this expression does is to put back the patches to their original locations, and average these with a weighted version of the noisy image itself.
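Since the matrix to invert is diagonal, the update amounts to a per-pixel weighted average. The sketch below illustrates this in NumPy, assuming the noisy image enters with a scalar weight `mu` and that cleaned patches are stored row by row in raster order. The helper names `extract_patches` and `aggregate` are hypothetical and are not taken from the paper.

```python
import numpy as np

def extract_patches(img, p_side):
    """All fully overlapping p_side x p_side patches, vectorized, raster order."""
    H, W = img.shape
    return np.array([img[i:i + p_side, j:j + p_side].ravel()
                     for i in range(H - p_side + 1)
                     for j in range(W - p_side + 1)], dtype=float)

def aggregate(Y, patches, p_side, mu):
    """Put cleaned patches back in place and average them with mu * Y."""
    H, W = Y.shape
    num = mu * Y.astype(float)           # weighted noisy image
    den = np.full((H, W), float(mu))     # diagonal of the matrix to invert
    k = 0
    for i in range(H - p_side + 1):
        for j in range(W - p_side + 1):
            num[i:i + p_side, j:j + p_side] += patches[k].reshape(p_side, p_side)
            den[i:i + p_side, j:j + p_side] += 1.0   # patch count per pixel
            k += 1
    return num / den
```

As a sanity check, feeding back the unmodified patches of `Y` returns `Y` itself for any non-negative `mu`.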
All the above stands for a single update of $\{\hat{\boldsymbol{\alpha}}_k\}_k$ and then $\hat{\mathbf{X}}$. For an effective block-coordinate minimization of the cost function in Equation (3) we should repeat this pair of updates several times. However, a difficulty with such an approach is the fact that once $\mathbf{X}$ has been modified, we no longer know the level of noise in each patch, and thus setting the stopping criterion for the OMP becomes more challenging. The original K-SVD denoising algorithm, as proposed in [elad2006image], chose to apply only the first round of updates. The work reported in [sulam2015expected] adopts an EPLL point of view [zoran2011learning], extending the iterative algorithm further for getting improved results.
The discussion so far has been based on the assumption that the dictionary $\mathbf{D}$ is known. This could be the case if we train it using the K-SVD algorithm over a corpus of clean image patches [elad2010sparse]. An interesting alternative is to embed the identification of $\mathbf{D}$ within the Bayesian formulation. Returning to the objective function in Eq. (3), the authors of [elad2006image] also considered the case where $\mathbf{D}$ is an unknown,
$$\min_{\mathbf{D},\, \{\boldsymbol{\alpha}_k\}_k,\, \mathbf{X}} \; \frac{\mu}{2}\|\mathbf{X} - \mathbf{Y}\|_2^2 + \sum_k \left( \lambda_k\|\boldsymbol{\alpha}_k\|_0 + \frac{1}{2}\|\mathbf{D}\boldsymbol{\alpha}_k - \mathbf{R}_k\mathbf{X}\|_2^2 \right).$$
In this case, $\mathbf{D}$ is learned using all the existing noisy patches taken from $\mathbf{Y}$ itself. Put more formally, a block-coordinate minimization is done: Initialize the dictionary as the overcomplete DCT matrix and set $\mathbf{X} = \mathbf{Y}$. Then iterate between the OMP over all the patches and an update of $\mathbf{D}$ using the K-SVD strategy [aharon2006k]. After several such rounds, the dictionary admits a content adapted to the image being treated, and the representations are ready for a final stage in which the output image is computed via Eq. (6).
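For the initialization step, one common way to build an overcomplete DCT dictionary is as a Kronecker product of an oversampled 1-D cosine basis. The text only names the overcomplete DCT matrix, so the specific construction below is an assumption, as are the function name `overcomplete_dct` and its arguments.

```python
import numpy as np

def overcomplete_dct(p_side, atoms_side):
    """Separable overcomplete DCT sketch: returns a dictionary of shape
    (p_side**2, atoms_side**2) with unit-norm columns."""
    n = np.arange(p_side)[:, None]        # sample index within a 1-D patch
    k = np.arange(atoms_side)[None, :]    # frequency index
    D1 = np.cos(n * k * np.pi / atoms_side)
    # remove the mean from every non-constant atom
    D1[:, 1:] -= D1[:, 1:].mean(axis=0)
    # normalize each 1-D atom to unit length
    D1 /= np.linalg.norm(D1, axis=0)
    # 2-D atoms are outer products of 1-D atoms
    return np.kron(D1, D1)
```

With `p_side = 8` and `atoms_side = 16` this yields a 64 x 256 matrix; these sizes are chosen for illustration only.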
In this work our goal is to design a network that reproduces the K-SVD denoising algorithm, while having the capacity to learn its parameters. One of the main difficulties we encounter is the pursuit stage, in which we are supposed to replace the greedy OMP algorithm by an equivalent learnable alternative. This may seem an easy task, as we can use the $\ell_1$-based Iterated Soft-Thresholding Algorithm (ISTA), unfolded appropriately for several iterations [gregor2010learning, daubechies2004iterative]. However, the challenge is the fact that OMP easily adapts the treatment for each patch using a stopping criterion based on the noise level. The equivalence in the ISTA case requires an identification of the appropriate regularization parameter for each patch, which is a non-trivial task. Assuming that this issue has been resolved, our computational process includes a decomposition of the image into its overlapped patches, cleaning of each by an appropriate pursuit, and a reconstruction of the overall image by averaging the cleaned patches. We propose to learn the parameters of this network by training over pairs of corrupted and ground-truth images. Next, we describe this overall architecture in detail.
Figure 1 illustrates our end-to-end architecture. We start by describing the three stages that perform the denoising of the individual patches.
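Sparse Coding: Given a patch $\mathbf{y}$ (held as a column vector of length $p$) corrupted by an additive zero-mean Gaussian noise with standard deviation $\sigma$, we aim to derive its sparse code according to a known dictionary $\mathbf{D}$. This objective can be formulated as in Equation (1). An approximate solution to this problem can be obtained by replacing the $\ell_0$-norm with an $\ell_1$-norm [donoho2003optimally, donoho2005stable]:
$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \|\boldsymbol{\alpha}\|_1 \quad \text{s.t.} \quad \|\mathbf{D}\boldsymbol{\alpha} - \mathbf{y}\|_2^2 \le p\sigma^2. \tag{7}$$
For a proper choice of $\lambda$, the above can be reformulated as
$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \frac{1}{2}\|\mathbf{D}\boldsymbol{\alpha} - \mathbf{y}\|_2^2 + \lambda\|\boldsymbol{\alpha}\|_1. \tag{8}$$
A popular and effective algorithm for solving the above problem is the Iterative Soft Thresholding Algorithm (ISTA) [daubechies2004iterative], which is guaranteed to converge to the global optimum:
$$\hat{\boldsymbol{\alpha}}_{t+1} = S_{\lambda/c}\left( \hat{\boldsymbol{\alpha}}_t - \frac{1}{c}\mathbf{D}^T(\mathbf{D}\hat{\boldsymbol{\alpha}}_t - \mathbf{y}) \right), \qquad \hat{\boldsymbol{\alpha}}_0 = \mathbf{0}, \tag{9}$$
where $c$ is the squared spectral norm of $\mathbf{D}$ and $S_\theta$ is the component-wise soft-thresholding operator,
$$[S_\theta(\mathbf{v})]_i = \mathrm{sign}(v_i)\,(|v_i| - \theta)_+. \tag{10}$$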
The motivation to adopt a proximal gradient descent method, as done above, is the fact that it allows an unrolling of the sparse coding stage into a meaningful and learnable scheme, just as practiced in [gregor2010learning]. Indeed, replacing the $\ell_0$-norm by the $\ell_1$-norm supports this goal, as it allows differentiating through this scheme. For these reasons, in this work we consider a learnable version of ISTA by keeping exactly the same recursion with a fixed number of iterations $T$, and letting the normalization constant $c$ and the dictionary $\mathbf{D}$ become the learnable parameters.
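The fixed-length recursion can be sketched in NumPy as below. This shows only the forward computation with a gradient step followed by soft thresholding. In the learned network the dictionary and the normalization constant would be trainable tensors in an automatic-differentiation framework, which this sketch does not attempt. The names `soft_threshold`, `ista`, `lam`, `c` and `n_iter` are illustrative.

```python
import numpy as np

def soft_threshold(v, theta):
    """Component-wise shrinkage: sign(v) * max(|v| - theta, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(D, y, lam, c, n_iter):
    """Run n_iter ISTA steps from a zero initialization.
    c should be at least the squared spectral norm of D."""
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - y)                  # gradient of the quadratic term
        alpha = soft_threshold(alpha - grad / c, lam / c)
    return alpha
```

For example, `c = np.linalg.norm(D, 2) ** 2` gives the squared spectral norm, the value the text uses to initialize this parameter.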
$\lambda$ Evaluation: Referring to the pursuit formulation in Equation (8), an important issue is the need to set the parameter $\lambda$. This regularization coefficient depends not only on $\sigma$ but also on the patch $\mathbf{y}$ itself. Following the computational path of the K-SVD denoising algorithm in [elad2006image], we should set $\lambda_k$ for each patch $\mathbf{y}_k$ so as to yield sparse representation with a controlled level of error, $\|\mathbf{D}\hat{\boldsymbol{\alpha}}_k - \mathbf{y}_k\|_2^2 \le p\sigma^2$. As there is no closed-form solution to this evaluation of $\lambda$-s, we propose to learn a regression function from the patches to their corresponding regularization parameters. A Multi-Layer Perceptron (MLP) network is used to represent this function, $\lambda = f_{\boldsymbol{\theta}}(\mathbf{y})$, where $\boldsymbol{\theta}$ is the vector of the parameters of the MLP. Our MLP consists of three hidden layers, each composed of a fully connected linear mapping followed by a ReLU (apart from the last layer). The input layer has $p$ nodes, which is the dimension of the vectorized patch, and the output layer consists of a single node, being the regularization parameter. The overall structure of the network is given by the following expression, in which symbolizes a multiplication by a matrix of that size: . Thus, an overall of nearly parameters are needed for this regression network.
Patch Reconstruction: This stage reconstructs the cleaned version $\hat{\mathbf{x}}$ of the patch $\mathbf{y}$ using $\mathbf{D}$ and the sparse code $\hat{\boldsymbol{\alpha}}$. This is given by $\hat{\mathbf{x}} = \mathbf{D}\hat{\boldsymbol{\alpha}}$. Note that in our learned network, the dictionary stands for a set of parameters that are shared in all locations where we multiply by either $\mathbf{D}$ or $\mathbf{D}^T$.
We can now discuss the complete architecture. We start by breaking the input image into fully overlapping patches, then treat each corrupted patch via the above-described patch denoising stage, and conclude by rebuilding the image by averaging the cleaned version of these patches. In the last stage we slightly deviate from the original K-SVD, by allowing a learned weighted combination of the patches. Denoting by $\mathbf{w}$ this patch of weights, the reconstructed image is obtained by
$$\hat{\mathbf{X}} = \frac{\sum_k \mathbf{R}_k^T(\mathbf{w} \odot \hat{\mathbf{x}}_k)}{\sum_k \mathbf{R}_k^T\mathbf{w}}, \tag{11}$$
where $\odot$ is the Schur product, and the division is done element-wise. This weighted averaging aligns with Guleryuz’ approach as advocated in [guleryuz2007].
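The weighted recombination can be illustrated with a short NumPy sketch. It assumes the weights form a single positive patch-sized array shared by all locations and that patches are stored in raster order. The function names `extract_patches` and `weighted_average` are hypothetical.

```python
import numpy as np

def extract_patches(img, p_side):
    """All fully overlapping p_side x p_side patches, vectorized, raster order."""
    H, W = img.shape
    return np.array([img[i:i + p_side, j:j + p_side].ravel()
                     for i in range(H - p_side + 1)
                     for j in range(W - p_side + 1)], dtype=float)

def weighted_average(patches, w, shape):
    """Recombine patches with per-pixel-in-patch weights w (p_side x p_side)."""
    p_side = w.shape[0]
    H, W = shape
    num = np.zeros(shape)
    den = np.zeros(shape)
    k = 0
    for i in range(H - p_side + 1):
        for j in range(W - p_side + 1):
            patch = patches[k].reshape(p_side, p_side)
            num[i:i + p_side, j:j + p_side] += w * patch   # Schur product
            den[i:i + p_side, j:j + p_side] += w
            k += 1
    return num / den                                       # element-wise division
```

Setting `w` to all ones recovers the plain averaging of the original K-SVD scheme.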
To conclude, the proposed network is a parametrized function of $\boldsymbol{\theta}$ (the parameters of the MLP network computing $\lambda$), $c$ (the step-size in the ISTA algorithm), $\mathbf{D}$ (the dictionary) and $\mathbf{w}$ (the weights for the patch-averaging). The overall number of parameters stands on ; for example, for and , this number is .
Given a corrupted image $\mathbf{Y}$, the computation $\hat{\mathbf{X}} = F(\mathbf{Y})$ returns a cleaned version of it. Training is done by minimizing the loss function $\mathcal{L} = \sum_i \|\mathbf{X}_i - F(\mathbf{Y}_i)\|_2^2$, with respect to all the above parameters. In the above objective, the set $\{\mathbf{X}_i\}_i$ stands for our training images, and $\{\mathbf{Y}_i\}_i$ are their synthetically noisy versions, obtained by $\mathbf{Y}_i = \mathbf{X}_i + \mathbf{V}_i$, where $\mathbf{V}_i$ is a zero-mean white Gaussian i.i.d. noise vector.
As already mentioned in the previous section, an EPLL version of the K-SVD can be envisioned, in which the process of cleaning the patches is repeated several times. This implies that once the above architecture obtains its output $\hat{\mathbf{X}}$, the whole scheme could be applied again (and again). This diffusion process of repeated denoisings has been shown in [sulam2015expected] to improve the K-SVD denoising performance. However, the difficulty is in setting the noise level to target in each patch after the first denoising, as it is no longer $\sigma$. In our case, we adopt a crude version of the EPLL scheme, in which we disregard the noise level problem altogether, and simply assume that the $\lambda$ evaluation stage takes care of this challenge, adjusting the MLP in each round to best predict the $\lambda$ values to be used. Thus, our iterated scheme shares the dictionary across all denoising stages, while allowing a different $\lambda$ evaluation network for each stage.
We turn to present experiments with the proposed Learned K-SVD (LKSVD). Our goals are to show that LKSVD is
Much better than the original K-SVD in its two forms: the image-adaptive algorithm, and the one using a universal dictionary;
Better than other classic denoising algorithms; and
Competitive with recent deep-learning based denoisers.
Dataset: In order to train our model we generate the training data using the Berkeley segmentation dataset (BSDS) [MartinFTM01], which consists of 500 images. We split these images into a training set of 432 images and the validation/test set that consists of the remaining 68 images. We note that these 68 images are exactly the ones used in the standard evaluation dataset of [roth2009fields]. In addition, following [liu2018non, zhang2017beyond], we test our proposed method on the benchmark Set12 – a collection of widely-used testing images. The training and the two test sets are strictly disjoint and all the images are converted to gray-scale in each experiment setup. This allows a fair and comprehensive comparison with recent deep learning based methods, as we train and test on the same datasets and benchmarks used in [lefkimmiatis2017non, lefkimmiatis2018universal, zhang2017beyond, liu2018non, chen2016trainable, mao2016image].
Training Settings: During training we randomly sample cropped images of size from the training set. We add i.i.d. Gaussian noise with zero mean and a specified level of noise $\sigma$ to each cropped image as the noisy input during training. We train a different model for each noise level, considering $\sigma \in \{15, 25, 50\}$.
We use the SGD optimizer to minimize the loss function. We set the learning rate as and consider one cropped image as the minibatch size during training. We use the same initialization as in the K-SVD algorithm to initialize the dictionary, i.e., the overcomplete DCT matrix. We also initialize the normalization parameter of the sparse coding stage using the squared spectral norm of the DCT matrix. The other parameters of the network are randomly initialized using the Kaiming Uniform method. Training a model takes about days with a Titan Xp GPU.
Test Settings: Our network does not depend on the input size of the image. Thus, in order to test our architecture’s performance, we simply add white Gaussian noise with a specified power to the original image, and feed it to the learned scheme. The metric used to determine the quality is the standard Peak Signal-to-Noise Ratio (PSNR).
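For reference, PSNR for 8-bit grayscale images can be computed as in the following NumPy sketch. The peak value of 255 is an assumption about the intensity range, and the function name `psnr` is illustrative.

```python
import numpy as np

def psnr(x, x_hat, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference x and an estimate x_hat."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")       # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Under this convention, an error whose mean squared value is 25 squared gives 20·log10(255/25), about 20.17 dB.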
In Tables I, II and III we compare LKSVD with the two original K-SVD versions (image-adaptive and universal-dictionary) and two leading classic denoising algorithms, BM3D [dabov2007video] and WNNM [gu2014weighted] (the results in these tables corresponding to BM3D and WNNM have been taken from [liu2018non] and [zhang2018ffdnet], respectively). Tables I and II refer to the BSD68 test-set (one showing PSNR and the other SSIM quality measures) and Table III shows the Set12 results (PSNR only). In this comparison, LKSVD is set to use the same patch and dictionary sizes as the two K-SVD versions from [elad2006image], namely and . Also, LKSVD applies unfolded iterations of ISTA, and EPLL-like denoising rounds.
Table I: PSNR (dB) on BSD68.
Dataset | Noise | BM3D | WNNM | K-SVD (image-adaptive) | K-SVD (universal) | LKSVD
---|---|---|---|---|---|---
BSD68 | 15 | 31.07 | 31.37 | 30.91 | 30.87 | 31.48
BSD68 | 25 | 28.57 | 28.83 | 28.32 | 28.28 | 28.96
BSD68 | 50 | 25.62 | 25.87 | 25.03 | 25.01 | 25.97
Table II: SSIM on BSD68.
Dataset | Noise | BM3D | WNNM | K-SVD (image-adaptive) | K-SVD (universal) | LKSVD
---|---|---|---|---|---|---
BSD68 | 15 | 0.8717 | 0.8766 | 0.8692 | 0.8685 | 0.8835
BSD68 | 25 | 0.8013 | 0.8087 | 0.7876 | 0.7894 | 0.8171
BSD68 | 50 | 0.6864 | 0.6982 | 0.6322 | 0.6462 | 0.7035
Table III: PSNR (dB) on Set12.
Noise | Method | C.man | House | Peppers | Starfish | Monarch | Airplane | Parrot | Lena | Barbara | Boat | Man | Couple | Average
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
15 | BM3D | 31.91 | 34.93 | 32.69 | 31.14 | 31.85 | 31.07 | 31.37 | 34.26 | 33.10 | 32.13 | 31.92 | 32.10 | 32.37
15 | WNNM | 32.17 | 35.13 | 32.99 | 31.82 | 32.71 | 31.39 | 31.62 | 34.27 | 33.60 | 32.27 | 32.11 | 32.17 | 32.70
15 | K-SVD (image-adaptive) | 31.43 | 32.21 | 34.23 | 30.80 | 31.59 | 30.99 | 31.64 | 31.45 | 30.95 | 31.83 | 32.44 | 33.78 | 31.95
15 | K-SVD (universal) | 31.39 | 32.16 | 33.85 | 30.96 | 31.66 | 30.96 | 31.62 | 31.71 | 30.99 | 31.63 | 30.58 | 33.49 | 31.75
15 | LKSVD | 32.16 | 32.92 | 34.59 | 31.54 | 32.11 | 31.66 | 32.22 | 32.78 | 31.78 | 32.18 | 32.22 | 34.24 | 32.53
25 | BM3D | 29.45 | 32.85 | 30.16 | 28.56 | 29.25 | 28.42 | 28.93 | 32.07 | 30.71 | 29.90 | 29.61 | 29.71 | 29.97
25 | WNNM | 29.64 | 33.22 | 30.42 | 29.03 | 29.84 | 28.69 | 29.15 | 32.24 | 31.24 | 30.03 | 29.76 | 29.82 | 30.26
25 | K-SVD (image-adaptive) | 28.75 | 29.64 | 31.86 | 28.21 | 29.10 | 28.42 | 29.26 | 28.83 | 28.27 | 29.44 | 29.77 | 31.37 | 29.41
25 | K-SVD (universal) | 28.78 | 29.74 | 31.46 | 28.39 | 29.04 | 28.57 | 29.16 | 28.85 | 28.24 | 29.18 | 27.61 | 31.04 | 29.17
25 | LKSVD | 29.70 | 30.35 | 32.53 | 28.92 | 29.71 | 29.13 | 29.85 | 30.15 | 28.99 | 30.07 | 30.06 | 31.99 | 30.12
50 | BM3D | 26.13 | 29.69 | 26.68 | 25.04 | 25.82 | 25.10 | 25.90 | 29.05 | 27.22 | 26.78 | 26.81 | 26.46 | 26.72
50 | WNNM | 26.45 | 30.33 | 26.95 | 25.44 | 26.32 | 25.42 | 26.14 | 29.25 | 27.79 | 26.97 | 26.94 | 26.64 | 27.05
50 | K-SVD (image-adaptive) | 25.12 | 25.93 | 27.82 | 24.86 | 25.56 | 24.80 | 26.16 | 25.11 | 24.45 | 25.98 | 25.78 | 27.71 | 25.78
50 | K-SVD (universal) | 25.29 | 26.02 | 27.71 | 24.85 | 25.44 | 25.15 | 25.98 | 24.82 | 24.32 | 25.93 | 24.04 | 27.32 | 25.58
50 | LKSVD | 26.67 | 26.97 | 29.37 | 25.61 | 26.55 | 26.00 | 26.95 | 26.54 | 25.38 | 26.98 | 27.03 | 28.86 | 26.91
A clear conclusion from the above tables is that LKSVD performs much better than the classic K-SVD, be it the universal dictionary approach or the image adaptive one. Indeed, the PSNR BSD68 results suggest that LKSVD is better than BM3D (by 0.35 to 0.41 dB in Table I) and WNNM (by 0.10 to 0.13 dB) as well. Table III displays a slightly different story, where LKSVD still performs better than BM3D, while being slightly weaker than WNNM. Recall that BM3D and WNNM both leverage non-local self-similarity, which gives them an edge over K-SVD. In addition, these two methods have been tuned for best results for this test set (see in particular their exceptional results for Lena and Barbara). As a final note we add that Table II shows that the ordering of the methods remains the same as we move from PSNR to the SSIM quality measure, which explains our choice to use PSNR for the rest of the experiments.
We proceed by exploring the effect of $p$ (patch size), $m$ (dictionary size) and $K$ (number of denoising steps) on the LKSVD performance. We denote by the result for the proposed architecture with these specified parameters. Table IV presents the obtained results for the two benchmarks (BSD68 and Set12) and a noise level of $\sigma = 25$. As can be seen, even with , LKSVD is markedly better than the classic K-SVD. As $K$ grows, the performance improves by dB per each additional denoising round. A boost in performance is also obtained when growing the patch-size to while preserving the redundancy factor of the dictionary. This also shows that the proposed scheme has the capacity to yield results that go beyond the ones reported in Tables I and III.
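Table IV: PSNR (dB) for the two classic K-SVD variants and four LKSVD configurations at noise level 25.
Dataset | Noise | K-SVD (image-adaptive) | K-SVD (universal) | LKSVD | LKSVD | LKSVD | LKSVD
---|---|---|---|---|---|---|---
BSD68 | 25 | 28.32 | 28.28 | 28.76 | 28.96 | 28.95 | 29.07
Set12 | 25 | 29.41 | 29.17 | 29.76 | 30.12 | 30.09 | 30.22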
We conclude by comparing LKSVD with recent learning-based denoising competitors: TNRD [chen2016trainable], NLNet [lefkimmiatis2017non], DnCNN [zhang2017beyond] and NLRNet [liu2018non]. The results are shown in Table V, referring to the two benchmarks. As can be seen, our scheme surpasses TNRD [chen2016trainable] and even the non-local deeply-learned denoiser by Lefkimmiatis [lefkimmiatis2018universal, lefkimmiatis2017non]. Still, there is a gap between LKSVD and the best performing denoisers DnCNN [zhang2017beyond] and NLRNet [liu2018non]. Table VI sheds more light on these results by presenting the model complexities involved in this experiment. As can be seen, our network uses about 10% of the overall number of parameters compared to the better performing methods.
Table V: PSNR (dB) of LKSVD and recent learning-based denoisers.
Dataset | Noise | TNRD | NLNet | DnCNN | NLRNet | LKSVD
---|---|---|---|---|---|---
BSD68 | 15 | 31.42 | 31.52 | 31.73 | 31.88 | 31.54
BSD68 | 25 | 28.92 | 29.03 | 29.23 | 29.41 | 29.07
BSD68 | 50 | 25.97 | 26.07 | 26.23 | 26.47 | 26.13
Set12 | 15 | 32.50 | - | 32.86 | 33.16 | 32.61
Set12 | 25 | 30.06 | - | 30.44 | 30.80 | 30.22
Set12 | 50 | 26.81 | - | 27.18 | 27.64 | 27.04
Table VI: Model complexity.
Property | DnCNN | NLRNet | LKSVD
---|---|---|---
Max effective depth | 17 | 38 | 21
Parameter sharing | No | Yes | Yes
Parameter no. | 554k | 330k | 45k
We conclude by presenting visual results of the various methods compared. Figure 2 shows the denoising results of BM3D, WNNM, K-SVD, DnCNN, and LKSVD. The figure refers to a noise level of $\sigma = 25$ and the images used are taken from the BSD68 test set.
Figure 2: Denoising results on four images from the BSD68 test set. PSNR (dB) of each panel:
Noisy | BM3D | WNNM | K-SVD | DnCNN | LKSVD
---|---|---|---|---|---
20.17 | 27.08 | 27.25 | 27.15 | 27.72 | 27.43
20.15 | 32.43 | 32.59 | 31.77 | 26.29 | 32.57
20.19 | 28.22 | 28.34 | 27.91 | 30.30 | 28.56
20.17 | 25.60 | 25.85 | 25.74 | 30.74 | 26.16
The rationale behind this work goes beyond a simple improvement of the K-SVD denoising algorithm. Indeed, our motivation is drawn from the hope to propose systematic ways of designing deep-learning architectures and connecting novel solutions to classical algorithms.
A fundamental question nowadays in computational imaging is whether old/classic methods should be discarded and replaced by their deep-learning alternatives. In the context of image denoising, classical methods focused on data modeling and optimization, and searched for ways to identify and exploit the redundancies existing in the visual data. The recent deep networks, which lead the denoising performance charts today, take an entirely different route, targeting the inference stage directly, and learning their parameters for optimized end-to-end performance. Now that these methods are approaching their performance ceiling, our work comes to argue that the classic methods are still very much relevant, and could become key in breaking such barriers. We believe that classic image processing algorithms will have a comeback for this exact reason.
Adopting a different point of view, this work offers a migration from intuitively chosen architectures, as many recent papers have offered, towards well-justified ones based on domain knowledge of the problem we are trying to solve. That is, the structure of the denoising problem is embedded into the deep learning architecture, making the overall algorithm enjoy both the flexibility of the deep methods, and the structure brought by the more classical approaches. The option of piling convolutions, ReLUs, batch-normalization steps, skip connections, strides and pooling operations, dilated filtering, and many other tricks, and seeking the best performing architectures by trial and error, has been the dominating approach so far. It is time to return to the theoretical foundations of signal and image processing in order to go beyond this point. Relying on sparse representation modeling, the K-SVD network we introduce in this work has a clear objective, a concise structure, and yet it works quite well. In fact, we believe that the results shown here stand as yet another testimony for the central role that sparse modeling plays in broad data processing.
And related to the above, here is an interesting question: What is the simplest possible network, in terms of the number of free parameters to learn and the number of computations to apply, for getting state-of-the-art image denoising? In single-image super-resolution it has become common practice in the literature to compare different solutions by considering their complexity as well (e.g., [Timofte_2016_CVPR]). This is done by showing points in a 2D graph of PSNR versus computational cost. Doing the same in image denoising may reveal interesting patterns. The general deep-learning based methods, while showing the best PSNR, tend to be quite heavy and cumbersome. Could much lighter networks perform nearly as well (and perhaps even better)? In this work we offer one such avenue to explore, and we are certain that many others will follow.
Why has it been so easy to outperform the original K-SVD denoising algorithm in the first place? A possible answer could be that this algorithm builds its cleaning abilities on two prime forces: (i) the spatial redundancy that exists in image patches, exposed by the sparse modeling; and (ii) the patch-averaging effect, which has an MMSE flavor to it [papyan2015multi]. Many of the better performing competitors strengthen their performance by considering several additional ideas:
Non-Locality: Non-local self-similarity can be practiced as an additional prior, as done by BM3D [dabov2007video] and low-rank modeling [gu2014weighted, Yair_2018_CVPR]. Indeed, the paper by Mairal et al. [mairal2009non] extended the K-SVD denoising by incorporating joint sparsity on groups of patches, this way introducing non-locality. Broadly speaking, non-local methods are known to be effective in capturing the correlation between far-apart patches, leading to improved restoration.
Patch Consensus: Patch-based methods must address the disagreement found between overlapping patches. The original K-SVD scheme we embark from in this paper proposed an averaging of these patches (while the original K-SVD denoising algorithm used plain averaging, we deploy a slightly improved weighted option, due to its simplicity in the context of a learned machine). However, the EPLL approach [zoran2011learning] suggests a far better strategy, by imposing the prior on patches taken from the resulting image, rather than ones extracted from the measured one. In the context of sparse modeling, this idea boils down to an iterated K-SVD algorithm, as was shown in [sulam2015expected]. In such a scheme the cleaned image is aggregated and broken into patches again for subsequent pursuit. We have deployed this very idea in an elementary way by replicating the filtering process. Closely related alternatives to this strategy are the SOS boosting method [romano2015boosting] and the deployment of the CSC model [papyan2017convolutional].
Multi-Scale: Multi-scale analysis of visual data seems to be a natural strategy to follow, and various papers have shown the benefit of this for image denoising [papyan2015multi]. More specifically, a multi-scale extension of the K-SVD denoising algorithm has been considered in various practical ways [sulam2014image, ophir2011multi, mairal2007multiscale, mairal2008learning].
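To make the patch-consensus item above concrete, the following is a minimal NumPy sketch of recombining overlapping patches by a weighted average. It is an illustration under stated assumptions, not the implementation used in this work: the helper names `extract_patches` and `weighted_patch_average` are hypothetical, patches are taken with stride 1, and a single positive weight patch is assumed to be shared across all locations. Setting all weights to one gives the plain averaging of the original K-SVD scheme.

```python
import numpy as np


def extract_patches(image, p):
    """Return all overlapping p x p patches (stride 1), in row-major order."""
    H, W = image.shape
    return np.stack([image[i:i + p, j:j + p]
                     for i in range(H - p + 1)
                     for j in range(W - p + 1)])


def weighted_patch_average(patches, weights, shape):
    """Recombine overlapping patches into an image by weighted averaging.

    patches: (N, p, p) array, ordered row-major over top-left corners.
    weights: (p, p) array of positive per-pixel weights (an assumption of
             this sketch; all-ones weights reduce to plain averaging).
    shape:   (H, W) of the output image.
    """
    H, W = shape
    p = patches.shape[1]
    numer = np.zeros(shape)  # accumulates weighted patch values
    denom = np.zeros(shape)  # accumulates the weights themselves
    k = 0
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            numer[i:i + p, j:j + p] += weights * patches[k]
            denom[i:i + p, j:j + p] += weights
            k += 1
    # Every pixel is covered by at least one patch, so denom > 0 everywhere.
    return numer / denom


# Usage: patches that already agree on their overlaps are recombined exactly.
rng = np.random.default_rng(0)
img = rng.random((6, 7))
patches = extract_patches(img, 3)
w = rng.random((3, 3)) + 0.1  # arbitrary positive weights
rec = weighted_patch_average(patches, w, img.shape)
```

When the patches disagree on their overlaps, as they do after each patch has been denoised independently, the weights decide how much each patch position contributes to a given pixel.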
The list above suggests that K-SVD denoising in its original form carries a built-in weakness. Yet, the results in this paper suggest otherwise. Consider the more recent and better performing deep-learning based solutions. These alternatives seem to disregard these extra forces (at least explicitly), concentrating instead on capturing intrinsic image properties by direct supervised learning of the inference process. Recent convolutional neural networks (CNNs) of this kind for image restoration [zhang2017beyond, mao2016image] achieve impressive performance over classical approaches. Do these methods exploit self-similarity? Anything reminiscent of patch-consensus? A multi-scale architecture? One may argue that the answer is, at most, only partially positive, hidden by the wide receptive field and the global treatment that these networks entertain. Note that there are deep learning methods that explicitly use self-similarity in their processing [lefkimmiatis2017non, liu2018non]; however, those do not necessarily improve over the simpler alternatives.
The conclusion we draw from the above is that there is room for introducing non-locality, patch-consensus and a multi-scale structure into the proposed K-SVD scheme, thereby driving the revised architecture towards even better results. Indeed, nothing is sacred in the K-SVD computational path, and the same treatment as done in this work could be given to well-performing classical denoising algorithms, such as BM3D [dabov2007video], kernel-based methods [takeda2006kernel] and WNNM [gu2014weighted]. We leave these ideas for future work.
This is perhaps a good time to recall that the denoising work in [elad2006image] offered two strategies for getting the dictionary: a globally universal approach that trains the dictionary off-line, and an image-adaptive alternative that trains on the noisy image patches themselves. Interestingly, despite the fact that the latter (image-adaptive) approach was found to perform better, the solution we put forward in this paper aligns solely with the first approach. Why? Because the supervised strategy we adopt naturally leads to a single architecture that serves all images via the same set of parameters. Could we offer an unsupervised alternative, more in line with the image-adaptive path? The answer, while tricky, could be positive. A related approach of great relevance is [ulyanov2018deep], in which a chosen network architecture is trained on each image all over again. A similar concept could be envisioned, where our own K-SVD architecture is used for synthesizing the clean image. However, this raises some difficulties and challenges, which is why we leave this activity for future work.
This work shows that the good old K-SVD denoising algorithm [elad2006image] can have a comeback and become much better performing, getting closer to leading deep-learning based denoisers. This is achieved very simply by setting its parameters in a supervised fashion, while preserving its exact original form. Our work has shown how to turn the K-SVD denoiser into a learnable architecture that enables back-propagation, and has demonstrated the achieved boost in denoising performance. As the discussion above reveals, our story goes beyond the K-SVD denoising and its improvement, towards more fundamental questions related to the role of deep-learning in contemporary image processing.