1 Introduction
Research on image restoration originally focused on designing image models by hand in order to address inverse problems, which require good a priori knowledge about the structure of natural images. For that purpose, various regularization functions have been investigated, ranging from linear differential operators enforcing smooth signals [perona1990] to total variation [rudin1992nonlinear] and wavelet sparsity [mallat].
Later, a bit more than ten years ago, image restoration paradigms shifted towards data-driven approaches. For instance, non-local means [buades2005non] is a non-parametric estimator that exploits image self-similarities, following pioneering work on texture synthesis [efros], and many successful approaches have relied on unsupervised learning, such as learned sparse models [aharon2006k, mairal2014sparse], Gaussian scale mixtures [portilla], or fields of experts [roth]. Then, models combining several image priors, in particular self-similarities and sparse representations, have proven to further improve the reconstruction quality for various restoration tasks [dabov2007image, dabov2009bm3d, dong2012nonlocally, gu2014weighted, mairal2009non]. Among these approaches, the most famous one is probably block matching with 3D filtering (BM3D) [dabov2007image].
Only relatively recently has this last class of methods been outperformed by deep learning models, which are able to leverage pairs of corrupted/clean images for training in a supervised fashion. More specifically, deep models have shown great effectiveness on many tasks such as denoising [lefkimmiatis2017non, zhang2017beyond, plotz2018neural, liu2018non], demosaicking [kokkinos2019iterative, zhang2017learning, zhang2019rnan], super-resolution [dong2015image, kim2016accurate], or artefact removal, to name a few. Yet, they also suffer from inherent limitations such as a lack of interpretability, and they often require learning a huge number of parameters, which can be prohibitive for some applications. Improving these two aspects is one of the key motivations of our paper. Our goal is to design algorithms for image restoration that bridge the gap in performance between earlier approaches that are interpretable and parameter-efficient, and current state-of-the-art deep learning models.
Our strategy consists of considering non-local sparse image models, namely the LSSC [mairal2009non] and centralized sparse representation (CSR) [dong2012nonlocally] methods, and using their principles to design a differentiable algorithm; that is, a restoration algorithm that optimizes a well-defined (and thus interpretable) cost function, where both the algorithm and the cost involve parameters that may be learned end-to-end with supervision. Such a principle was introduced for sparse coding problems involving the $\ell_1$ penalty in the LISTA algorithm [gregor2010learning], which was later improved in [chen2018theoretical, liu2018alista]. Such a differentiable approach for sparse coding was recently used for image denoising in [simon2019rethinking] and for super-resolution in [wang2015deep].
Our main contribution is to extend this idea of differentiable algorithms to structured sparse models [jenatton2011structured], which is key to exploiting self-similarities as in the LSSC [mairal2009non] and CSR [dong2012nonlocally] approaches. Groups of similar patches are indeed processed together in order to obtain a joint sparse representation. Empirically, such a joint sparsity principle leads to simple architectures with few parameters that are competitive with the state of the art.
Indeed, we present a model with 68k parameters for image denoising that performs on par with the classical deep learning baseline DnCNN [zhang2017beyond] (556k parameters), and even performs better in low-noise settings. For color image denoising, our model with 112k parameters significantly outperforms the color variant of DnCNN (668k parameters), and for image demosaicking, we obtain slightly better results than the state-of-the-art approach [zhang2019rnan], while reducing the number of parameters by x. Perhaps more importantly than improving the PSNR, we also observe that the principle of non-local sparsity reduces visual artefacts when compared to using sparsity alone (an observation also made in LSSC [mairal2009non]), which is illustrated in Figure 2.
Our models are implemented in PyTorch and our implementation is provided in the supplemental material.
2 Preliminaries and related work
In this section, we introduce non-local sparse coding models for image denoising and present a differentiable algorithm for sparse coding [gregor2010learning], which we extend later.
Sparse coding models on learned dictionaries.
A simple and yet effective approach for image denoising, introduced in [elad2006image], consists of assuming that patches from natural images can often be represented by linear combinations of few dictionary elements. Thus, computing such a sparse approximation for a noisy patch is expected to yield a clean estimate of the signal. Given a noisy image $y$, we denote by $y_1, \dots, y_n$ the set of overlapping patches of size $\sqrt{m} \times \sqrt{m}$, which we represent by vectors in $\mathbb{R}^m$ for grayscale images. Each noisy patch is then approximated by solving the sparse decomposition problem
$$\min_{\alpha_i \in \mathbb{R}^p} \; \frac{1}{2}\|y_i - D\alpha_i\|_2^2 + \lambda\, \psi(\alpha_i), \qquad (1)$$
where $D$ in $\mathbb{R}^{m \times p}$ is the dictionary, which we assume to be given (at the moment) and good at representing image patches, and $\psi$ is a sparsity-inducing penalty that encourages sparse solutions. This is indeed known to be the case when $\psi$ is the $\ell_1$-norm ($\psi(\alpha) = \|\alpha\|_1$), see [mairal2014sparse]; when $\psi(\alpha) = \|\alpha\|_0$, $\psi$ is called the $\ell_0$-penalty and simply counts the number of nonzero elements in a vector. Then, a clean estimate of $y_i$ is simply $D\alpha_i^\star$, which is a sparse linear combination of dictionary elements. Since the patches overlap, we obtain $m$ estimates for each pixel, and the denoised image is obtained by averaging:
$$\hat{x} = \frac{1}{m} \sum_{i=1}^{n} R_i\, D\alpha_i^\star, \qquad (2)$$
where $R_i$ is a linear operator that places the patch $D\alpha_i^\star$ at the adequate position, centered on pixel $i$, in the output image. Note that for simplicity, we neglect here the fact that pixels close to the image border admit fewer estimates, unless zero-padding is used.
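The patch extraction and averaging operators around Eq. (2) can be sketched in a few lines of NumPy. This is our own illustration (the function names are ours, not from the paper's implementation); dividing by the per-pixel estimate counts handles border pixels, which receive fewer estimates, without zero-padding.

```python
import numpy as np

def extract_patches(img, s):
    """Extract all overlapping s x s patches from a 2-D image,
    flattened into the columns of an (s*s, n) matrix."""
    H, W = img.shape
    cols = []
    for i in range(H - s + 1):
        for j in range(W - s + 1):
            cols.append(img[i:i + s, j:j + s].ravel())
    return np.stack(cols, axis=1)

def average_patches(patches, shape, s):
    """Adjoint-like operation: place each patch back at its position
    (the role of R_i) and divide by the number of estimates per pixel."""
    H, W = shape
    out = np.zeros(shape)
    counts = np.zeros(shape)
    k = 0
    for i in range(H - s + 1):
        for j in range(W - s + 1):
            out[i:i + s, j:j + s] += patches[:, k].reshape(s, s)
            counts[i:i + s, j:j + s] += 1.0
            k += 1
    return out / counts
```

As a sanity check, averaging the unmodified patches returns the original image, reflecting the adjoint-like structure of the two operators.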
Whereas we have previously assumed that a good dictionary for natural images is available (e.g., the discrete cosine transform, DCT [ahmed1974discrete]), it has been proposed in [elad2006image] to adapt $D$ to the image at hand, by solving a matrix factorization problem called dictionary learning [field].
Finally, variants of the previous formulation have been shown to improve the results, see [mairal2009non]. In particular, it seems helpful to center each patch (removing its mean intensity) before performing the sparse approximation, and to add the mean intensity back to the final estimate. Instead of (1), it is also possible to minimize $\psi(\alpha_i)$ under the constraint $\|y_i - D\alpha_i\|_2^2 \le \varepsilon$, where $\varepsilon$ is proportional to the noise variance $\sigma^2$, which is assumed to be known.
Differentiable algorithms for sparse coding.
A popular approach to solve (1) when $\psi = \|\cdot\|_1$ is the iterative shrinkage algorithm ISTA [figueiredo2003]. Denoting by $S_\eta[u] = \mathrm{sign}(u)\max(|u| - \eta, 0)$ the soft-thresholding operator, which can be applied pointwise to a vector, and by $L$ an upper bound on the largest eigenvalue of $D^\top D$, ISTA performs the following steps for solving (1):
$$\alpha_i \leftarrow S_{\lambda/L}\!\left[\alpha_i - \frac{1}{L} D^\top (D\alpha_i - y_i)\right]. \qquad (3)$$
Note that such a step performs a linear operation on $\alpha_i$ followed by a pointwise non-linear function $S_{\lambda/L}$. It is thus tempting to consider $K$ steps of the algorithm, see it as a neural network with $K$ layers, and learn the corresponding weights. Following such an insight, the authors of [gregor2010learning] have proposed the LISTA algorithm, which is trained such that the resulting "neural network" with $K$ layers learns to approximate the solution of the sparse coding problem (1). Other variants were then proposed, see [chen2018theoretical, liu2018alista], and the one we have adopted in our paper may be written as
$$\alpha_i \leftarrow S_{\Lambda}\!\left[\alpha_i + C^\top (y_i - D\alpha_i)\right], \qquad (4)$$
where $C$ has the same size as $D$ and $\Lambda$ is a vector in $\mathbb{R}^p$ such that $S_\Lambda[u]_j = S_{\Lambda_j}[u_j]$ for $j$ in $\{1, \dots, p\}$: $S_\Lambda$ performs a soft-thresholding operation with a different threshold for each entry of $u$. Then, the variables $C$, $D$, and $\Lambda$ will be learned for a supervised task, thus allowing to implement efficiently a task-driven dictionary learning method [mairal2011task].
Note that when $C = \frac{1}{L}D$ and $\Lambda = \frac{\lambda}{L}\mathbf{1}$, the recursion recovers the ISTA algorithm. Empirically, it has been observed that allowing $C \neq \frac{1}{L}D$ accelerates convergence and can be interpreted as learning a pre-conditioner for the ISTA method [liu2018alista], while allowing $\Lambda$ to have entries different from $\frac{\lambda}{L}$ corresponds to using a weighted $\ell_1$-norm instead of $\ell_1$ and learning the weights. The concept of differentiable algorithm is interesting and differs from classical machine learning paradigms: it can indeed be seen as a way to learn a cost function and, at the same time, tune an optimization algorithm to minimize it.
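Both the ISTA step (3) and a LISTA-style step (4) can be sketched in a few lines of NumPy. This is our own minimal illustration rather than the paper's implementation; the names `C` and `Lam` mirror the learned quantities described above and are our assumptions.

```python
import numpy as np

def soft_threshold(u, lam):
    # S_lam[u] = sign(u) * max(|u| - lam, 0), applied pointwise;
    # lam may also be a vector of per-entry thresholds (weighted l1).
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def ista(y, D, lam, n_iter=200):
    """Plain ISTA for min_a 0.5*||y - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2  # upper bound on largest eigenvalue of D^T D
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - (1.0 / L) * D.T @ (D @ a - y), lam / L)
    return a

def lista_step(a, y, C, D, Lam):
    """One LISTA-style step: C and the per-entry thresholds Lam would be
    learned end-to-end; with C = D/L and Lam = lam/L this recovers ISTA."""
    return soft_threshold(a + C.T @ (y - D @ a), Lam)
```

With an orthonormal dictionary (e.g., the identity), a single step already reaches the fixed point, which makes the shrinkage behavior easy to inspect.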
There have already been a few attempts to leverage the LISTA algorithm for specific image restoration tasks such as super-resolution [wang2015deep] or denoising [simon2019rethinking], which we extend in our paper with non-local priors and structured sparsity.
Exploiting non-local self-similarities.
The non-local means approach [buades2005non] consists of averaging patches that are similar to each other but corrupted by different, independent zero-mean noise variables, so that averaging reduces the noise variance without corrupting the underlying signal too much. The intuition is relatively simple and relies on the fact that natural images admit many local self-similarities. Non-local means is a non-parametric approach, which may be seen as a Nadaraya-Watson estimator.
Non-local sparse models.
Noting that self-similarities and sparsity are two relatively different image priors, the authors of [mairal2009non] have introduced the LSSC approach, which uses a structured sparsity prior based on image self-similarities. If we denote by $\mathcal{S}_i$ the set of patches similar to $y_i$,
$$\mathcal{S}_i = \left\{ j = 1, \dots, n \;\; \text{s.t.} \;\; \|y_i - y_j\|_2^2 \le \xi \right\}, \qquad (5)$$
for some threshold $\xi$, then we may consider the matrix $A_i$ in $\mathbb{R}^{p \times |\mathcal{S}_i|}$ of coefficients forming a group of similar patches. LSSC then encourages the sparsity patterns (that is, the sets of nonzero coefficients) of the decompositions $\alpha_j$ for $j$ in $\mathcal{S}_i$ to be similar. This can be achieved by using a group-sparsity regularizer [turlach]
$$\psi(A_i) = \sum_{j=1}^{p} \|A_i^j\|_q, \qquad (6)$$
where $A_i^j$ is the $j$-th row in $A_i$, and $q = 2$ (leading to a convex penalty); a nonconvex variant instead counts the number of nonzero rows in $A_i$. The effect of such penalties is to encourage sparsity patterns to be shared across similar patches, as illustrated in Figure 1.
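The group-sparsity penalty (6) with $q = 2$ is a sum of row norms; a small NumPy sketch (ours) makes the "shared support" effect concrete: for a fixed amount of signal, the penalty is smaller when the nonzero coefficients of the group are aligned on the same rows.

```python
import numpy as np

def group_sparsity_penalty(A, q=2):
    """Sum of l_q norms of the rows of A: small when the patches in the
    group (the columns of A) share the same set of nonzero rows."""
    return np.sum(np.linalg.norm(A, ord=q, axis=1))
```

For instance, two codes with their nonzeros on the same row are penalized by $\sqrt{2}$, whereas spreading the same coefficients over two rows costs $2$, so the penalty favors joint supports.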
Deep learning models.
Deep neural networks have been successful over the past years and give state-of-the-art results for many restoration tasks. In particular, successful principles include very deep networks, residual connections, batch norm, and residual learning [lefkimmiatis2018universal, zhang2017beyond, zhang2018ffdnet, zhang2019rnan]. Recent models also use so-called attention mechanisms to model self-similarities, which are in fact pooling operations akin to non-local means. More precisely, a generic non-local module has been proposed in [liu2018non], which performs a weighted average of similar features. In [plotz2018neural], a relaxation of the k-nearest-neighbor selection rule is introduced, which can be used for designing deep neural networks for image restoration.
3 Methods
In this section, we first introduce trainable sparse coding models for image denoising, following [simon2019rethinking] (while bringing two minor improvements to the method), before presenting several approaches to model self-similarities.
3.1 Trainable sparse coding
In [simon2019rethinking], the sparse coding approach (SC) described in Section 2 is combined with the LISTA algorithm to perform denoising tasks.^1 The only modification we introduce here is a centering step for the patches, which empirically yields better results (and thus a stronger baseline).
^1 Specifically, [simon2019rethinking] considers the SC approach as a baseline, and proposes an improved model based on the principle of convolutional sparse coding (CSC). CSC is a variant of SC, where an image is approximated by a sparse linear combination of small dictionary elements placed at all possible positions in the image. Unfortunately, CSC leads to ill-conditioned sparse optimization problems and has been shown to perform poorly for image denoising. For this reason, [simon2019rethinking] introduces strides, which yields a hybrid approach between SC and CSC. In our paper, we have decided to stick to the SC baseline and leave the investigation of CSC models for future work.
SC model: inference.
We now explain how an input image $y$ is represented in the SC model, before discussing how to learn the parameters. Following the classical approach from Section 2, the first step consists of extracting all overlapping patches from $y$, which we denote by $Y = (y_1, \dots, y_n)$, obtained with a linear patch extraction operator.
Then, we perform a centering operation for every patch:
$$y_i^c \triangleq y_i - \mu_i \mathbf{1}_m \quad \text{with} \quad \mu_i \triangleq \frac{1}{m}\mathbf{1}_m^\top y_i. \qquad (7)$$
The mean value $\mu_i$ is recorded and added back after denoising $y_i^c$. Hence, the low-frequency component of the signal does not flow through the model. This observation is related to the residual approach of deep learning methods for denoising and super-resolution [zhang2017beyond], where neural networks learn to predict the corruption noise rather than the full image. The centering step is not used in [simon2019rethinking], but we have found it to provide better reconstruction quality.
The next step consists of sparsely encoding each centered patch $y_i^c$ with $K$ steps of the LISTA variant presented in (4), replacing $y_i$ by $y_i^c$ there, and assuming the parameters $C$, $D$ and $\Lambda$ are given. Here, a minor change compared to [simon2019rethinking] is the use of a varying parameter $\Lambda_k$ at each LISTA step $k$, which leads to a minor increase in the number of parameters in our experiments.
Finally, the denoised image is obtained by averaging the patch estimates as in (2), after adding back the mean value:
$$\hat{x} = \frac{1}{m} \sum_{i=1}^{n} R_i \left( W \alpha_i^\star + \mu_i \mathbf{1}_m \right), \qquad (8)$$
but note that the dictionary $D$ is replaced by another one, $W$, of the same size. The reason for decoupling $W$ from $D$ is that the weighted $\ell_1$ penalty that is implicitly used by the LISTA method is known to shrink the coefficients too much and to provide biased estimates of the signal. For this reason, classical denoising approaches based on sparse coding such as [elad2006image, mairal2009non] use the $\ell_0$-penalty instead, but we have found it ineffective for end-to-end training. Therefore, as in [simon2019rethinking], we have chosen to decouple $W$ from $D$.
In terms of implementation, it is worth noting that all the operations above can be simply expressed in classical frameworks for deep learning. LISTA steps indeed involve convolutions, after representing the $\alpha_i$'s as a traditional feature map akin to that of convolutional neural networks, whereas the averaging step (8) corresponds to the "transpose convolution" operator in TensorFlow or PyTorch.
Training the parameters.
We now show how to train the parameters in a supervised fashion, which differs from the traditional dictionary learning approach where only the noisy image is available [elad2006image, mairal2009non]. Here, we assume that we are given a data distribution of pairs of clean/noisy images $(x, y)$, and we simply minimize the reconstruction loss
$$\min_{\Theta} \; \mathbb{E}_{(x,y)} \left\| x - \hat{x}(y) \right\|_2^2, \qquad (9)$$
where $\hat{x}(y)$ is the denoised image defined in (8), given a noisy image $y$, and $\Theta$ denotes the set of learnable parameters. This is then achieved by using stochastic gradient descent or one of its variants (see the experimental section).
3.2 Embedding nonlocal sparse priors
In this section, we replace the $\ell_1$-norm (or its weighted variant) by structured sparsity-inducing regularization functions that take into account non-local image self-similarities. This idea allows us to turn classical non-local sparse models [dong2012nonlocally, mairal2009non] into differentiable algorithms.
The generic approach is presented in Algorithm 1. The algorithm performs $K$ steps, where it computes pairwise patch similarities $\Sigma$ between patches of a current estimate $\hat{x}$, using various possible metrics that we discuss in Section 3.3. Then, the codes are updated by computing a so-called proximal operator, defined below, for a particular penalty that depends on $\Sigma$ and some parameters $\Lambda$. Practical variants, where the pairwise similarities are only updated once in a while, are discussed in Section 3.4.
Definition 1 (Proximal operator).
Given a convex function $\Psi: \mathbb{R}^p \to \mathbb{R}$, the proximal operator of $\Psi$ is defined as
$$\mathrm{prox}_{\Psi}(z) = \arg\min_{u \in \mathbb{R}^p} \; \frac{1}{2}\|z - u\|_2^2 + \Psi(u). \qquad (10)$$
The proximal operator plays a key role in optimization and admits a closed form for many sparsity-inducing penalties, see [mairal2014sparse]. Indeed, it may be shown that the iterations $\alpha_i \leftarrow \mathrm{prox}_{\frac{\lambda}{L}\psi}\big[\alpha_i - \frac{1}{L} D^\top (D\alpha_i - y_i)\big]$ are instances of the ISTA algorithm [beck2009fast] for minimizing problem (1), and the update of the codes in Algorithm 1 then simply becomes an extension of LISTA to deal with the penalty $\Psi$.
Note that for the weighted $\ell_1$-norm $\Psi(\alpha) = \sum_j \Lambda_j |\alpha_j|$, the proximal operator is the soft-thresholding operator $S_\Lambda$ introduced in Section 2, and we simply recover the SC algorithm from Section 3.1, since $\Psi$ does not depend on the pairwise similarities (which then do not need to be computed). Next, we present different structured sparsity-inducing penalties that yield more effective algorithms.
3.2.1 Group Lasso and LSSC
For each location $i$, the LSSC approach [mairal2009non] defines groups of similar patches; however, for computational reasons, LSSC relaxes the definition (5) in practice and instead implements a simple clustering method such that $\mathcal{S}_i = \mathcal{S}_j$ if $i$ and $j$ belong to the same group. Then, under this clustering assumption and given a dictionary $D$, LSSC minimizes
$$\min_{A} \; \frac{1}{2}\|Y - DA\|_F^2 + \sum_{i=1}^{n} \lambda_i\, \psi(A_i), \qquad (11)$$
where $A$ in $\mathbb{R}^{p \times n}$ represents all codes, $A_i = (\alpha_j)_{j \in \mathcal{S}_i}$, $\psi$ is the group-sparsity regularizer defined in (6), $\|\cdot\|_F$ is the Frobenius norm, and $\lambda_i$ depends on the group size. As explained in Section 2, the role of the Group Lasso penalty is to encourage the codes belonging to the same cluster to share the same sparsity pattern, see Figure 1. For homogeneity reasons, we also consider a normalization factor that makes $\lambda_i$ scale with the group size $|\mathcal{S}_i|$, as in [mairal2009non]. Minimizing (11) when $q = 2$ is easy with the ISTA method (and thus is compatible with LISTA), since we know how to compute its proximal operator, which is described below, see [mairal2014sparse]:
Lemma 1 (Proximal operator for the Group Lasso).
Consider a matrix $U$ in $\mathbb{R}^{p \times k}$ and call $V = \mathrm{prox}_{\lambda \sum_j \|\cdot^j\|_2}(U)$. Then, for every row $V^j$ of $V$,
$$V^j = \max\left(1 - \frac{\lambda}{\|U^j\|_2},\, 0\right) U^j. \qquad (12)$$
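Lemma 1 amounts to row-wise block soft-thresholding, which can be sketched in NumPy (our illustration, with a small constant added to avoid dividing by zero on all-zero rows):

```python
import numpy as np

def prox_group_lasso(U, lam):
    """Row-wise block soft-thresholding:
    V^j = max(1 - lam / ||U^j||_2, 0) * U^j, as in Lemma 1."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return scale * U
```

Each row is either shrunk or zeroed as a whole, which is precisely what couples the sparsity patterns of the codes within a group.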
Unfortunately, the procedure used to design the groups does not yield a differentiable relation between the denoised image and the parameters to learn, which raises a major difficulty. Therefore, we first relax the hard clustering assumption into a soft one, which is able to exploit a similarity matrix representing pairwise relations between patches.
To do so, we first consider a similarity matrix $\Sigma$ in $\{0,1\}^{n \times n}$ that encodes the hard clustering assignment used by LSSC; that is, $\sigma_{ij} = 1$ if $j$ is in $\mathcal{S}_i$ and $0$ otherwise. Second, we note that the rows of $A_i$ coincide, up to removing zero columns, with the rows of $A\,\mathrm{diag}(\sigma_i)$, where $\sigma_i$ is the $i$-th column of $\Sigma$ that encodes the $i$-th cluster membership. Then, we adapt LISTA to problem (11), with a different shrinkage parameter per coordinate and per iteration as in Section 3.1, which yields the iteration
$$B \leftarrow A + C^\top (Y - DA), \qquad \alpha_i^j \leftarrow \beta_i^j \max\left(1 - \frac{\Lambda_j}{\left\|\left(B\,\mathrm{diag}(\sigma_i)\right)^j\right\|_2},\, 0\right), \qquad (13)$$
where the second update is performed for all $i, j$, the superscript $j$ denotes the $j$-th row of a matrix, as above, and $\beta_i^j$ is simply the $j$-th entry of $\beta_i$, the $i$-th column of $B$.
We are now in a position to relax the hard clustering assumption by allowing any similarity matrix $\Sigma$ in (13), and then use this relaxation of the Group Lasso penalty in Algorithm 1. The resulting model is able to encourage similar patches to share similar sparsity patterns, while being trainable by minimizing the cost (9) with back-propagation.
3.2.2 Centralized sparse representations
A different approach to take into account self-similarities in sparse models is the centralized sparse representation (CSR) approach of [dong2012nonlocally]. This approach is easier to turn into a differentiable algorithm than the LSSC method, but we have empirically observed that it does not perform as well. Nevertheless, we believe it to be conceptually interesting, and we provide a brief description below.
The idea is relatively simple, and consists of regularizing each code $\alpha_i$ with the regularization function
$$\Psi(\alpha_i) = \lambda_1 \|\alpha_i\|_1 + \lambda_2 \|\alpha_i - \beta_i\|_1, \qquad (14)$$
where $\beta_i$ is obtained by a weighted average of codes obtained at a previous iteration, in the spirit of non-local means, where the weights involve pairwise distances between patches. Specifically, given some codes $\alpha_j^{(k)}$ obtained at iteration $k$ and a similarity matrix $\Sigma$, we compute
$$\beta_i^{(k)} = \frac{\sum_{j} \sigma_{ij}\, \alpha_j^{(k)}}{\sum_{j} \sigma_{ij}}, \qquad (15)$$
and the anchors $\beta_i^{(k)}$ are used to define the penalty (14) in order to compute the codes at the next iteration. Note that the original CSR method of [dong2012nonlocally] uses similarities based on the distance between two clean estimates of the patches, but other similarity functions may be used, see Section 3.3.
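The averaging step (15) is just a matrix product with row-normalized similarity weights; a minimal NumPy sketch (ours, with `Sigma` assumed nonnegative with nonzero row sums):

```python
import numpy as np

def csr_anchor_codes(A, Sigma):
    """beta_i = sum_j sigma_ij * alpha_j / sum_j sigma_ij:
    a non-local-means-style average of the codes (columns of A),
    weighted by the patch similarities in Sigma."""
    W = Sigma / Sigma.sum(axis=1, keepdims=True)  # row-normalized weights
    return A @ W.T                                 # column i = sum_j W[i, j] * alpha_j
```

With uniform similarities, every anchor reduces to the plain mean of the codes, which is a convenient sanity check.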
Even though [dong2012nonlocally] does not use a proximal gradient descent method to solve the problem regularized with (14), the next proposition shows that the corresponding proximal operator admits a closed form, which is key to turning CSR into a differentiable algorithm.
Proposition 1 (Proximal operator of the CSR penalty).
The proof of this proposition can be found in the appendix. The proximal operator is differentiable almost everywhere, and thus can easily be plugged into Algorithm 1. At each iteration, the similarity matrix $\Sigma$ is updated along with the codes. Note also that a variant with different thresholding parameters per iteration and per coordinate can be used in this model, as before for LSSC and SC.
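For intuition, the scalar case can be worked out by hand; the derivation below is our own sanity check (not the paper's exact statement), assuming a fixed anchor $b \ge 0$ (the case $b < 0$ follows by symmetry). Setting the subgradient of $\frac{1}{2}(u - z)^2 + \lambda_1 |u| + \lambda_2 |u - b|$ to zero on each region of the real line gives a continuous, piecewise-linear map:

```latex
u^\star(z) =
\begin{cases}
z - \lambda_1 - \lambda_2 & \text{if } z > b + \lambda_1 + \lambda_2,\\
b & \text{if } b + \lambda_1 - \lambda_2 \le z \le b + \lambda_1 + \lambda_2,\\
z - \lambda_1 + \lambda_2 & \text{if } \lambda_1 - \lambda_2 < z < b + \lambda_1 - \lambda_2,\\
0 & \text{if } -\lambda_1 - \lambda_2 \le z \le \lambda_1 - \lambda_2,\\
z + \lambda_1 + \lambda_2 & \text{if } z < -\lambda_1 - \lambda_2.
\end{cases}
```

One can check continuity at each breakpoint (e.g., at $z = b + \lambda_1 + \lambda_2$ both branches equal $b$), and that the map both attracts $u^\star$ towards $0$ and towards the anchor $b$, which is exactly the behavior intended by the penalty (14).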
3.3 Practical similarity metrics
We compute the similarities $\Sigma$ in various manners, and have implemented the following practical heuristics.
Semilocal grouping.
As in all methods that exploit non-local self-similarities in images, we restrict the search for similar patches to a window centered around the current patch. This approach is commonly used to reduce the size of the similarity matrix and the global memory cost of the method. This means that we always have $\sigma_{ij} = 0$ if pixels $i$ and $j$ are too far apart.
Learned distance.
We always use a similarity function of the form $\sigma_{ij} = e^{-d_{ij}/\nu_k}$, where $d_{ij}$ is a distance between patches $i$ and $j$, and $\nu_k$ is a parameter used at iteration $k$ of Algorithm 1, which we learn by back-propagation on the objective function. As in classical deep learning models using non-local approaches [zhang2019rnan], we do not directly use a Euclidean distance between patches, but allow a few parameters to be learned. Specifically, we consider
$$d_{ij} = \sum_{l=1}^{m} w_l \left( \hat{x}_i[l] - \hat{x}_j[l] \right)^2, \qquad (16)$$
where $\hat{x}_i$ and $\hat{x}_j$ are the $i$-th and $j$-th patches from the current estimate of the denoised image, respectively, and $w$ in $\mathbb{R}^m$ is a set of weights, which are also learned by back-propagation.
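The learned distance (16) and the resulting similarities can be sketched as follows (our illustration; in the actual model the weights `w` and temperature `nu` would be learned by back-propagation, here they are fixed inputs):

```python
import numpy as np

def weighted_patch_distances(X_hat, w):
    """d_ij = sum_l w[l] * (x_i[l] - x_j[l])^2 for all patch pairs;
    the columns of X_hat are patches, w is one weight per pixel position."""
    diff = X_hat[:, :, None] - X_hat[:, None, :]   # shape (m, n, n)
    return np.einsum('l,lij->ij', w, diff ** 2)

def similarity_matrix(X_hat, w, nu):
    # sigma_ij = exp(-d_ij / nu): small distances give similarities near 1
    return np.exp(-weighted_patch_distances(X_hat, w) / nu)
```

In practice, entries corresponding to pixels outside the semi-local search window would additionally be set to zero, as described above.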
Online averaging of similarity matrices.
As shown in Algorithm 1, we use a convex combination of similarity matrices across iterations (with a mixing parameter in $[0,1]$, also learned by back-propagation), which provides better results than computing the similarity on the current estimate only. This is expected, since the current estimate may have lost too much signal information to compute the similarities accurately.
3.4 Practical variants and implementation
Finally, we conclude this methodological section by discussing other practical variants and implementation details.
Dictionary initialization.
A great benefit of designing an architecture that admits a sparse coding interpretation is that the parameters can be initialized with a classical dictionary learning approach, instead of using random weights, which makes the model more robust to initialization. To do so, we use the online method of [mairal2010online], implemented in the SPAMS toolbox, due to its robustness and speed.
Block processing and dealing with border effects.
The size of the similarity matrix $\Sigma$ grows quadratically with the image size, which requires processing sub-image blocks sequentially rather than the full image directly. Here, the block size is chosen to match the size of the non-local window, which requires taking two important details into account:
(i) Pixels close to the image border belong to fewer image patches than those from the center, and thus receive less estimates in the averaging procedure. When processing images per block, it is thus important to have a small overlap between blocks, such that the number of estimates per pixel is consistent across the image.
(ii) For training, we also process image blocks. It is then important to take border effects into account by rescaling the reconstruction loss by the number of estimates per pixel.
4 Extension to demosaicking
Most modern digital cameras acquire color images by measuring only one color channel per pixel (red, green, or blue) according to a specific pattern called the Bayer pattern. Demosaicking is the processing step that reconstructs a full color image from these incomplete measurements.
Originally addressed with interpolation techniques [gunturk], demosaicking has since been successfully tackled by sparse coding [mairal2009non] and deep learning models. Most of these, such as [zhang2017learning, zhang2019rnan], rely on generic architectures and black-box models that do not encode a priori knowledge about the problem, whereas the authors of [kokkinos2019iterative] propose an iterative algorithm that relies on the physics of the acquisition process. Extending our model to demosaicking (and in fact to other inpainting tasks with small holes) can be achieved by introducing a mask in the formulation for unobserved pixel values. Formally, we define a binary mask $M_i$ in $\{0,1\}^m$ for each patch $i$, and $M$ in $\{0,1\}^{m \times n}$ represents all masks. Then, the sparse coding formulation becomes
$$\min_{A} \; \frac{1}{2}\left\| M \odot (Y - DA) \right\|_F^2 + \sum_{i=1}^{n} \lambda_i\, \psi(A_i), \qquad (17)$$
where $\odot$ denotes the elementwise product between two matrices. The first updating rule of equation (13) is modified accordingly, which leads to the update
$$B \leftarrow A + C^\top \left( M \odot (Y - DA) \right), \qquad (18)$$
which has the effect of discarding the reconstruction error of masked pixels.
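The masked update (18) has a direct ISTA analogue; the sketch below is ours (plain ISTA with an $\ell_1$ penalty rather than the learned non-local model) and only illustrates how the mask discards the reconstruction error of unobserved pixels.

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def masked_ista(y, mask, D, lam, n_iter=300):
    """ISTA on the masked objective 0.5*||mask*(y - D a)||^2 + lam*||a||_1:
    the gradient step multiplies the residual by the mask, so unobserved
    pixels contribute nothing to the data-fitting term."""
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a + (1.0 / L) * D.T @ (mask * (y - D @ a)), lam / L)
    return a
```

With a trivial identity dictionary, the masked entries of `y` are simply ignored: only observed coordinates are fitted (and shrunk by the penalty), while the codes of unobserved ones stay at zero.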
5 Experiments
Training dataset.
In our experiments, we adopt the setting of [zhang2017beyond], which is the most standard one used by recent deep learning methods, allowing a simple and fair comparison. In particular, we use as a training set a subset of the Berkeley Segmentation Dataset (BSD) [martin2001database], commonly called BSD400, even though we believe it to be suboptimal: BSD400 is relatively small, with only 400 medium-resolution images, some of which suffer from compression artefacts. We evaluate the models on four popular benchmarks, called Set12, BSD68 (with no overlap with BSD400), Kodak24, and Urban100, see [zhang2019rnan]. For demosaicking, we evaluate our model on Kodak24 and BSD68.
Training details.
During training, we randomly extract patches whose size equals the neighborhood used for non-local operations. We apply light data augmentation (random rotations and horizontal flips). We optimize the parameters of our models using ADAM [kingma2014adam], with the mini-batch sizes reported in the appendix. All the models are trained for 300 epochs for denoising and 200 epochs for demosaicking. The learning rate is lowered during training by a factor of 0.35 every 80 training steps. Similarly to [simon2019rethinking], we normalize the initial dictionary $D$ by its largest singular value, which helps the LISTA algorithm to converge, and we initialize $C$, $D$, and $W$ with the same value, as in the implementation of [simon2019rethinking] released by the authors.
Since large learning rates can make the model diverge, we have implemented a backtracking strategy that automatically decreases the learning rate by a factor of 0.8 when the loss function increases, and restores a previous snapshot of the model. Divergence is monitored by computing the loss on the training set every 20 epochs. Training the GroupSC model from Section 3.2.1 for color denoising takes about 1.5 days on a Titan RTX GPU, whereas inference for one block of 56x56x3 pixels takes 50 ms.

Method  RNAN  GroupSC (ours)

Parameters  8.96M  192k 
Depth (number of layers)  120  25 
Nr training epochs  1300  200 
Demosaicking
We follow the same experimental setting as IRCNN [zhang2017learning], but we do not crop the output images as in [zhang2017learning, mairal2009non], since [zhang2019rnan] does not seem to perform such an operation according to their code released online. At inference time, we replace each pixel prediction by its corresponding observation when the pixel is observed (i.e., not masked by the Bayer pattern).
We evaluate the three variants SC, CSR, and GroupSC of our proposed framework. We compare our models with state-of-the-art deep learning methods [kokkinos2019iterative, zhang2019rnan]. We also report the performance of LSSC. For the competing methods, we report the numbers given in the corresponding papers, unless specified otherwise. We first observe that our baseline already provides very good results, which is surprising given its simplicity. Compared to RNAN, our model is much smaller and shallower, as shown in Table 1. We report the number of parameters of [zhang2019rnan] based on the implementation of the authors. We also note that CSR performs poorly in comparison with our baseline and GroupSC.
Method  Params  Kodak24  BSD68 
Unsupervised  
LSSC [mairal2009non]    41.39   
Trainable  
IRCNN [zhang2017learning]    40.41   
MMNet [kokkinos2019iterative]  380k  42.0   
RNAN [zhang2019rnan]  8.96M  42.86  42.61 
SC (ours)  192k  42.51  42.33 
CSR (ours)  192k  42.44   
GroupSC (ours)²  192k  42.87  42.71 
² We report here our scores without any cropping, similarly to [zhang2019rnan]. If we crop 10 pixels from the border following [zhang2017learning], we obtain 42.98 dB on Kodak24 and 42.64 dB on BSD68.
Color Image Denoising
For a fair comparison, we train our models under the same setting as [lefkimmiatis2017non, zhang2017beyond]. We corrupt images with synthetic additive Gaussian noise of variance $\sigma^2$ and train a different model for each noise level. We choose a patch size of 7 and set the size of the dictionary to 256. We report the performance of our model in terms of PSNR in Table 3, along with those of competing approaches, and provide results on other datasets in the appendix.
Finally, we compare our model with [zhang2019rnan] in Table 4 for additional noise levels, because we did not manage to run their code for the $\sigma$ values considered in Table 3. Overall, it seems that RNAN performs slightly better than GroupSC, at the cost of using about 80 times more parameters.
Method  Params  Noise level ()  

5  15  25  50  
Unsupervised  
CBM3D [dabov2007image]    40.24  33.49  30.68  27.36 
Trainable  
CSCnet [simon2019rethinking]³  186k    33.83  31.18  28.00 
³ CSCnet has been trained on a larger dataset made of 5214 images (Waterloo + BSD432).
NLNet[lefkimmiatis2017non]      33.69  30.96  27.64 
FFDNet [zhang2018ffdnet]  486k    33.87  31.21  27.96 
CDnCNN [zhang2017beyond]  668k  40.11  33.89  31.22  27.91 
SC (baseline)  112K  40.44  33.75  30.94  27.39 
CSR (ours)  112K  40.53  34.05  31.33  28.01 
GroupSC (ours)  112K  40.61  34.10  31.42  28.03 
Method  Params  

RNAN [zhang2019rnan]  8.96M  36.60  30.73 
GroupSC (ours)  112K  36.42  30.48 
Grayscale Denoising
In order to simplify the comparison, we train our models under the same setting as [zhang2017beyond, lefkimmiatis2017non, liu2018non]. We corrupt images with synthetic additive Gaussian noise of variance $\sigma^2$, train a different model for each noise level, and report the performance in terms of PSNR. For grayscale denoising, we choose a patch size of 9 and a dictionary with 256 atoms. Our method performs on par with DnCNN at the higher noise levels and performs significantly better in low-noise settings.
Method  Params  Noise ()  

5  15  25  50  
Unsupervised  
BM3D [dabov2007image]    37.57  31.07  28.57  25.62 
LSSC [mairal2009non]    37.70  31.28  28.71  25.72 
BM3D PCA [dabov2009bm3d]    37.77  33.38  28.82  25.80 
WNNM [gu2014weighted]    37.76  31.37  28.83  25.87 
Trainable  
CSCnet [simon2019rethinking]  62k  37.84  31.57  29.11  26.24 
CSCnet [simon2019rethinking]⁴  62k  37.69  31.40  28.93  26.04 
⁴ We run here the model with the code provided by the authors online, on the smaller training set BSD400.
TNRD [chen2016trainable]      31.42  28.92  25.97 
NLNet [lefkimmiatis2017non]      31.52  29.03  26.07 
FFDNet [zhang2018ffdnet]  486k    31.63  29.19  26.29 
DnCNN [zhang2017beyond]  556k  37.68  31.73  29.22  26.23 
N3 [plotz2018neural]  706k      29.30  26.39 
NLRN [liu2018non]  330k  37.92  31.88  29.41  26.47 
SC (baseline)  68K  37.84  31.46  28.90  25.84 
CSR (ours)  68K  37.88  31.64  29.16  26.08 
GroupSC (ours)  68K  37.95  31.69  29.19  26.18 
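For reference, the corruption and evaluation protocol behind these tables can be sketched in a few lines of NumPy. This is an illustration only (function names are ours): σ is specified on the usual [0, 255] scale, while images are assumed normalized to [0, 1].

```python
import numpy as np

def add_gaussian_noise(img, sigma, seed=None):
    """Corrupt an image in [0, 1] with synthetic additive Gaussian noise.
    `sigma` is the noise standard deviation on the [0, 255] scale."""
    rng = np.random.default_rng(seed)
    return img + (sigma / 255.0) * rng.standard_normal(img.shape)

def psnr(clean, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between a clean image and an estimate."""
    mse = np.mean((np.asarray(clean, float) - np.asarray(estimate, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

With no denoising at all, the PSNR of a noisy image at σ = 25 is about 20 log₁₀(255/25) ≈ 20.2 dB, which puts the 26–38 dB figures in the tables in perspective.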
6 Conclusion
We have presented a differentiable algorithm based on nonlocal sparse image models, which performs on par with or better than recent deep learning models, while using significantly fewer parameters. We believe that the performance of such approaches (including the simple SC baseline) is surprising given the small model size, and given the fact that the algorithm can be interpreted as a single sparse coding layer operating on fixed-size patches.
This observation paves the way for future work on sparse coding models that should be able to model the local stationarity of natural images at multiple scales, which we expect should perform even better.
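To make the "single sparse coding layer" interpretation concrete, here is a minimal NumPy sketch of one such unrolled layer: plain ISTA iterations on a fixed dictionary. This is a simplification for illustration, omitting the learned parameters and the non-local grouping of the full model.

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_sparse_code(patches, D, lam=0.1, n_iter=24):
    """Unrolled ISTA: sparse-code each column of `patches` on dictionary D
    by minimizing 0.5 * ||x - D a||^2 + lam * ||a||_1.
    With learned D and thresholds, each iteration becomes a network layer."""
    L = np.linalg.norm(D, ord=2) ** 2            # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], patches.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ A - patches)           # gradient of the data-fit term
        A = soft_threshold(A - grad / L, lam / L)
    return A
```

Every operation here is differentiable almost everywhere in D and lam, which is what allows end-to-end supervised training of such layers.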
Acknowledgements
Julien Mairal and Bruno Lecouat were supported by the ERC grant number 714381 (SOLARIS project) and by ANR 3IA MIAI@Grenoble Alpes. Jean Ponce was supported in part by the Louis Vuitton/ENS chair in artificial intelligence and the Inria/NYU collaboration.
References
Appendix A Appendix
a.1 Additional experimental details
In order to accelerate the inference time of the nonlocal models, we update the patch similarities only every f steps, where f controls the frequency of the correlation updates. We summarize in Table 6
the set of hyperparameters that we selected for the experiments reported in the main tables. We selected the same hyperparameters for the baselines, except that we do not compute pairwise patch similarities.
Experiment            Color d.  Gray d.  Demosaicking
Patch size            7         9        9
Dictionary size       256       256      256
Nr of epochs          300       300      200
Batch size            25        25       16
Unrolled iterations   24        24       24

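The periodic correlation update described above can be sketched as follows. The Gaussian kernel and the bandwidth h are assumptions for illustration, not the exact similarity function of the model:

```python
import numpy as np

def patch_similarities(patches, h=0.5):
    """Pairwise similarity weights between patch vectors (one patch per row),
    of the kind used for non-local grouping.  The Gaussian kernel on
    normalized patches and the bandwidth `h` are illustrative choices."""
    P = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    dist2 = np.maximum(2.0 - 2.0 * P @ P.T, 0.0)  # squared distances of unit vectors
    return np.exp(-dist2 / (2.0 * h ** 2))

# Inside the unrolled iterations, the similarities are recomputed only
# every f steps instead of at every step:
#
# for t in range(n_iter):
#     if t % f == 0:
#         S = patch_similarities(current_patch_estimates)
#     ...  # use S for the non-local grouping at step t
```

Since the similarity matrix costs O(N²) to rebuild for N patches, recomputing it every f steps divides that cost by f at a small loss in accuracy.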
a.2 Influence of patch and dictionary size
We investigate in Table 7 the influence of two hyperparameters for grayscale image denoising: the patch size and the dictionary size. For this experiment we run a lighter version of the GroupSC model in order to accelerate training: the batch size was decreased from 25 to 16, the correlation updates were performed less frequently, and the intermediate patches are not approximated with averaging. These changes accelerate training but lead to slightly lower performance than the model trained in the standard setting, which explains the gap between the scores in Table 7 and those in the main grayscale denoising table.
Noise (σ)  Patch size  n=128  n=256  n=512
5          k=7         37.91  37.92  –
           k=9         37.90  37.92  37.96
           k=11        37.89  37.89  –
15         k=7         31.60  31.63  –
           k=9         31.62  31.67  31.71
           k=11        31.63  31.67  –
25         k=7         29.10  29.11  –
           k=9         29.12  29.17  29.20
           k=11        29.13  29.18  –
a.3 Grayscale denoising: evaluation on multiple datasets
We provide in Table 8 additional grayscale denoising results of our model on other datasets, in terms of PSNR.
Dataset    Noise (σ)  BM3D   DnCNN  NLRN   GroupSC
Set12      5          –      –      –      38.40
           15         32.37  32.86  33.16  32.85
           25         29.97  30.44  30.80  30.44
           50         26.72  27.18  27.64  27.14
BSD68      5          37.57  37.68  37.92  37.95
           15         31.07  31.73  31.88  31.70
           25         28.57  29.23  29.41  29.20
           50         25.62  26.23  26.47  26.18
Urban100   5          –      –      –      38.51
           15         32.35  32.68  33.45  32.71
           25         29.70  29.91  30.94  30.05
           50         25.95  26.28  27.49  26.44
a.4 Color denoising: evaluation on multiple datasets
We provide in Table 9 additional color denoising results of our model on other datasets, in terms of PSNR.
Dataset    Noise (σ)  CBM3D  GroupSC
Kodak24    5          –      40.72
           15         33.25  34.98
           25         32.06  32.44
           50         28.75  29.16
CBSD68     5          40.24  40.61
           15         33.49  34.10
           25         30.68  31.42
           50         27.36  28.03
Urban100   5          –      39.74
           15         33.22  34.11
           25         30.59  31.63
           50         26.59  28.20
a.5 Proof of Proposition 1
The proximal operator of the function $\Psi(u) = \lambda_1 \|u\|_1 + \lambda_2 \|u - \beta\|_1$ for $\beta$ in $\mathbb{R}^p$ is defined as
$$\operatorname{prox}_{\Psi}(z) = \operatorname*{argmin}_{u \in \mathbb{R}^p} \; \frac{1}{2}\|z - u\|_2^2 + \lambda_1 \|u\|_1 + \lambda_2 \|u - \beta\|_1.$$
The optimality condition for the previous problem is
$$0 \in u - z + \lambda_1 \partial \|u\|_1 + \lambda_2 \partial \|u - \beta\|_1.$$
We consider each component separately. We suppose that $\beta_i \neq 0$, otherwise $\Psi$ boils down to the $\ell_1$ norm. And we also suppose $\beta_i > 0$.
Let's examine the first case where $u_i = 0$. The subdifferential of the $\ell_1$ norm at $0$ is the interval $[-1, 1]$ and the optimality condition is
$$z_i \in [-\lambda_1 - \lambda_2, \; \lambda_1 - \lambda_2].$$
Similarly, if $u_i = \beta_i$,
$$z_i \in [\beta_i + \lambda_1 - \lambda_2, \; \beta_i + \lambda_1 + \lambda_2].$$
Finally, let's examine the case where $u_i \neq 0$ and $u_i \neq \beta_i$: then, $\partial |u_i| = \{\operatorname{sign}(u_i)\}$ and $\partial |u_i - \beta_i| = \{\operatorname{sign}(u_i - \beta_i)\}$. The minimum is obtained as
$$u_i = z_i - \lambda_1 \operatorname{sign}(u_i) - \lambda_2 \operatorname{sign}(u_i - \beta_i).$$
We study separately the cases where $u_i > \beta_i$, where $0 < u_i < \beta_i$, and where $u_i < 0$, and proceed similarly when $\beta_i < 0$. With elementary operations we can derive the expression of $\operatorname{prox}_{\Psi}(z)_i$ for each case. Putting the cases all together we obtain the formula.
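The case analysis above can also be checked numerically. Assuming the penalty has the form λ₁‖u‖₁ + λ₂‖u − β‖₁, the true componentwise minimizer is always among six candidates: the two kinks (0 and βᵢ) and the four interior stationary points zᵢ − s₁λ₁ − s₂λ₂ with s₁, s₂ ∈ {−1, +1}, so evaluating the objective at all of them and keeping the best gives the exact proximal operator. A NumPy sketch (function name is ours):

```python
import numpy as np

def prox_centered_l1(z, beta, lam1, lam2):
    """Proximal operator of u -> lam1*|u| + lam2*|u - beta|, componentwise.
    Enumerates the candidate minimizers of this piecewise-quadratic convex
    problem: the kinks 0 and beta, and the four interior stationary points
    z - s1*lam1 - s2*lam2 with s1, s2 in {-1, +1}."""
    z = np.asarray(z, dtype=float)
    beta = np.asarray(beta, dtype=float)
    z, beta = np.broadcast_arrays(z, beta)
    cands = np.stack(
        [np.zeros_like(z), beta]
        + [z - s1 * lam1 - s2 * lam2 for s1 in (-1.0, 1.0) for s2 in (-1.0, 1.0)]
    )
    # evaluate the prox objective at every candidate and keep the minimizer
    obj = 0.5 * (cands - z) ** 2 + lam1 * np.abs(cands) + lam2 * np.abs(cands - beta)
    idx = obj.argmin(axis=0)
    return np.take_along_axis(cands, idx[None, ...], axis=0)[0]
```

When β = 0, this reduces to ordinary soft-thresholding with threshold λ₁ + λ₂, which gives a quick sanity check of the closed form.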