Revisiting Non-Local Sparse Models for Image Restoration

12/05/2019, by Bruno Lecouat et al. (Inria)

We propose a differentiable algorithm for image restoration inspired by the success of sparse models and self-similarity priors for natural images. Our approach builds upon the concept of joint sparsity between groups of similar image patches, and we show how this simple idea can be implemented in a differentiable architecture, allowing end-to-end training. The algorithm has the advantage of being interpretable, performing sparse decompositions of image patches, while being more parameter efficient than recent deep learning methods. We evaluate our algorithm on grayscale and color denoising, where we achieve competitive results, and on demosaicking, where we outperform the most recent state-of-the-art deep learning model with 47 times fewer parameters and a much shallower architecture.







1 Introduction

Research on image restoration originally focused on designing image models by hand in order to address inverse problems that require good a priori knowledge about the structure of natural images. For that purpose, various regularization functions have been investigated, ranging from linear differential operators enforcing smooth signals [perona1990], to total variation [rudin1992nonlinear], or wavelet sparsity [mallat].

Later, a bit more than ten years ago, image restoration paradigms shifted towards data-driven approaches. For instance, non-local means [buades2005non] is a non-parametric estimator that exploits image self-similarities, following pioneering work on texture synthesis, and many successful approaches have relied on unsupervised learning, such as learned sparse models [aharon2006k, mairal2014sparse], Gaussian scale mixtures [portilla], or fields of experts [roth]. Then, models combining several image priors, in particular self-similarities and sparse representations, have proven to further improve the reconstruction quality for various restoration tasks [dabov2007image, dabov2009bm3d, dong2012nonlocally, gu2014weighted, mairal2009non].

Among these approaches, the most famous one is probably block matching with 3D filtering (BM3D) [dabov2007image].
Only relatively recently has this last class of methods been outperformed by deep learning models, which are able to leverage pairs of corrupted/clean images for training in a supervised fashion. More specifically, deep models have shown great effectiveness on many tasks such as denoising [lefkimmiatis2017non, zhang2017beyond, plotz2018neural, liu2018non], demosaicking [kokkinos2019iterative, zhang2017learning, zhang2019rnan], super-resolution [dong2015image, kim2016accurate], or artefact removal, to name a few. Yet, they also suffer from inherent limitations, such as a lack of interpretability, and they often require learning a huge number of parameters, which can be prohibitive for some applications. Improving these two aspects is one of the key motivations of our paper. Our goal is to design algorithms for image restoration that bridge the gap in performance between earlier approaches, which are interpretable and parameter-efficient, and current state-of-the-art deep learning models.

Our strategy consists of considering non-local sparse image models, namely the LSSC [mairal2009non] and centralized sparse coding (CSR) [dong2012nonlocally] methods, and using their principles to design a differentiable algorithm: that is, a restoration algorithm that optimizes a well-defined (and thus interpretable) cost function, where both the algorithm and the cost involve parameters that may be learned end-to-end with supervision. Such a principle was introduced for sparse coding problems involving the ℓ1-penalty in the LISTA algorithm [gregor2010learning], which was later improved in [chen2018theoretical, liu2018alista]. Such a differentiable approach for sparse coding was recently used for image denoising in [simon2019rethinking] and for super-resolution in [wang2015deep].

Our main contribution is to extend this idea of differentiable algorithms to structured sparse models [jenatton2011structured], which is key to exploiting self-similarities in the LSSC [mairal2009non] and CSR [dong2012nonlocally] approaches. Groups of similar patches are indeed processed together in order to obtain a joint sparse representation. Empirically, such a joint sparsity principle leads to simple architectures with few parameters that are competitive with the state of the art.

Indeed, we present a model with 68k parameters for image denoising that performs on par with the classical deep learning baseline DnCNN [zhang2017beyond] (556k parameters), and even performs better in low-noise settings. For color image denoising, our model with 112k parameters significantly outperforms the color variant of DnCNN (668k parameters), and for image demosaicking, we obtain slightly better results than the state-of-the-art approach [zhang2019rnan], while reducing the number of parameters by a factor of 47. Perhaps more importantly than improving the PSNR, we also observe that the principle of non-local sparsity reduces visual artefacts when compared to using sparsity alone (an observation also made in LSSC [mairal2009non]), which is illustrated in Figure 2.

Our models are implemented in PyTorch and our implementation is provided in the supplemental material.

2 Preliminaries and related work

In this section, we introduce non-local sparse coding models for image denoising and present a differentiable algorithm for sparse coding [gregor2010learning], which we extend later.

Sparse coding models on learned dictionaries.

A simple and yet effective approach for image denoising, introduced in [elad2006image], consists of assuming that patches from natural images can often be represented by linear combinations of few dictionary elements. Thus, computing such a sparse approximation for a noisy patch is expected to yield a clean estimate of the signal. Then, given a noisy image y, we denote by y_1, …, y_n the set of n overlapping patches of size √m × √m, which we represent by vectors in ℝ^m for grayscale images.

for grayscale images. Each noisy patch is then approximated by solving the sparse decomposition problem


where in

is the dictionary, which we assume to be given (at the moment) and good at representing image patches, and

is a sparsity-inducing penalty that encourages sparse solutions. This is indeed known to be the case when is the -norm (), see [mairal2014sparse]; when , is called the penalty and simply counts the number of nonzero elements in a vector.

Then, a clean estimate of y_i is simply D α_i, which is a sparse linear combination of dictionary elements. Since the patches overlap, we obtain m estimates for each pixel, and the denoised image x̂ is obtained by averaging

    x̂ = (1/m) Σ_{i=1}^n R_i D α_i,        (2)

where R_i is a linear operator that places the patch D α_i at the adequate position centered on pixel i in the output image. Note that for simplicity, we neglect here the fact that pixels close to the image border admit fewer estimates, unless zero padding is used.
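The patch extraction operators and the averaging step (2) can be sketched in a few lines of NumPy; the function names and the row-per-patch convention below are ours, not taken from the paper's code:

```python
import numpy as np

def extract_patches(img, s):
    """Return all overlapping s x s patches of img as rows of an (n, s*s) matrix."""
    H, W = img.shape
    return np.stack([img[i:i + s, j:j + s].ravel()
                     for i in range(H - s + 1)
                     for j in range(W - s + 1)])

def average_patches(patches, shape, s):
    """Place each patch estimate back at its position and average the overlaps."""
    H, W = shape
    acc = np.zeros(shape)
    counts = np.zeros(shape)
    k = 0
    for i in range(H - s + 1):
        for j in range(W - s + 1):
            acc[i:i + s, j:j + s] += patches[k].reshape(s, s)
            counts[i:i + s, j:j + s] += 1
            k += 1
    return acc / counts   # pixels near the border receive fewer estimates
```

Averaging the unmodified patches reconstructs the input exactly, which is a convenient sanity check of the two operators.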

Whereas we have previously assumed that a good dictionary D for natural images is available (e.g., the discrete cosine transform, DCT [ahmed1974discrete]), it has been proposed in [elad2006image] to adapt D to the image, by solving a matrix factorization problem called dictionary learning [field].

Finally, variants of the previous formulation have been shown to improve the results, see [mairal2009non]. In particular, it seems helpful to center each patch (removing the mean intensity) before performing the sparse approximation, and to add back the mean intensity to the final estimate. Instead of (1), it is also possible to minimize ψ(α_i) under the constraint ‖y_i − D α_i‖² ≤ ε, where ε is proportional to the noise variance σ², which is assumed to be known.

Differentiable algorithms for sparse coding.

A popular approach to solve (1) when ψ is the ℓ1-norm is the iterative shrinkage-thresholding algorithm (ISTA) [figueiredo2003]. Denoting by S_λ[x] = sign(x) max(|x| − λ, 0) the soft-thresholding operator, which can be applied pointwise to a vector, and by L an upper bound on the largest eigenvalue of DᵀD, ISTA performs the following steps for solving (1):

    α_i ← S_{λ/L}[ α_i + (1/L) Dᵀ (y_i − D α_i) ].        (3)
Note that such a step performs a linear operation on α_i followed by a pointwise non-linear function. It is thus tempting to consider K steps of the algorithm, see it as a neural network with K layers, and learn the corresponding weights. Following such an insight, the authors of [gregor2010learning] have proposed the LISTA algorithm, which is trained such that the resulting "neural network" with K layers learns to approximate the solution of the sparse coding problem (1).
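As a concrete illustration of the ISTA recursion (3), here is a minimal NumPy sketch; the function names are ours, and the step count and stopping rule are simplifications (no convergence test):

```python
import numpy as np

def soft_threshold(x, t):
    """S_t[x] = sign(x) * max(|x| - t, 0), applied pointwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(y, D, lam, n_steps=100):
    """Minimize 0.5 * ||y - D a||^2 + lam * ||a||_1 with ISTA."""
    L = np.linalg.norm(D, ord=2) ** 2   # upper bound on the largest eigenvalue of D^T D
    a = np.zeros(D.shape[1])
    for _ in range(n_steps):
        a = soft_threshold(a + D.T @ (y - D @ a) / L, lam / L)
    return a
```

With a small λ and a signal that is exactly sparse in D, the iterates approach a code that reconstructs the input while keeping most entries exactly zero.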

Other variants were then proposed, see [chen2018theoretical, liu2018alista], and the one we have adopted in our paper may be written as

    α_i ← S_Λ[ α_i + Cᵀ (y_i − D α_i) ],        (4)

where C has the same size as D and Λ is a vector in ℝ^p such that S_Λ[α], for α in ℝ^p, performs a soft-thresholding operation on α with a different threshold for each entry of α. Then, the variables C, D, and Λ will be learned for a supervised task, thus allowing to efficiently implement a task-driven dictionary learning method [mairal2011task].

Note that when C = D/L and Λ has all entries equal to λ/L, the recursion recovers the ISTA algorithm. Empirically, it has been observed that allowing C ≠ D/L accelerates convergence and can be interpreted as learning a pre-conditioner for the ISTA method [liu2018alista], while allowing Λ to have entries different from λ/L corresponds to using a weighted ℓ1-norm instead of the plain ℓ1-norm and learning the weights. The concept of differentiable algorithm is interesting and differs from classical machine learning paradigms: it can indeed be seen as a way to learn a cost function and, at the same time, tune an optimization algorithm to minimize it.
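The unrolled recursion (4) is a short loop once the learned quantities are given; in this sketch the argument names are ours, and we pass one threshold vector per step, matching the per-iteration parameters used later in the paper:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lista(y, D, C, thresholds):
    """K unrolled steps of a <- S_{Lambda_k}[ a + C^T (y - D a) ].

    `thresholds` is a list of K non-negative vectors, so each entry of the
    code gets its own learned threshold, as in a weighted l1-norm.
    """
    a = np.zeros(D.shape[1])
    for lam_k in thresholds:
        a = soft_threshold(a + C.T @ (y - D @ a), lam_k)
    return a
```

Setting C = D/L and every threshold vector to λ/L recovers plain ISTA steps, which makes this a strict generalization rather than a new algorithm.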

There have been already a few attempts to leverage the LISTA algorithm for specific image restoration tasks such as super-resolution [wang2015deep] or denoising [simon2019rethinking], which we extend in our paper with non-local priors and structured sparsity.

Exploiting non-local self-similarities.

The non-local means approach [buades2005non] consists of averaging patches that are similar to each other, but that are corrupted by different independent zero-mean noise variables, such that averaging reduces the noise variance without corrupting too much the underlying signal. The intuition is relatively simple and relies on the fact that natural images admit many local self-similarities. Non-local means is a non-parametric approach, which may be seen as a Nadaraya-Watson estimator.

Non-local sparse models.

Noting that self-similarities and sparsity are two relatively different image priors, the authors of [mairal2009non] introduced the LSSC approach, which uses a structured sparsity prior based on image self-similarities. If we denote by S_i the set of patches similar to y_i,

    S_i = { j = 1, …, n  such that  ‖y_i − y_j‖² ≤ ξ },        (5)

for some threshold ξ, then we may consider the matrix A_i of coefficients forming a group of similar patches. LSSC then encourages the sparsity pattern—that is, the set of non-zero coefficients—of the decompositions α_j for j in S_i to be similar. This can be achieved by using a group-sparsity regularizer [turlach]

    ‖A_i‖_{1,q} = Σ_j ‖A_i^j‖_q,        (6)

where A_i^j is the j-th row in A_i, and q = 2 (leading to a convex penalty) or q = 0; then, ‖A_i‖_{1,0} simply counts the number of non-zero rows in A_i. The effect of such penalties is to encourage sparsity patterns to be shared across similar patches, as illustrated in Figure 1.
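The group norm (6) is a one-liner worth spelling out, since it is the object all the later proximal computations act on; the function name is ours:

```python
import numpy as np

def group_norm(A, q=2):
    """||A||_{1,q}: sum over rows of the l_q norm; q=0 counts non-zero rows."""
    if q == 0:
        return np.count_nonzero(np.linalg.norm(A, axis=1))
    return np.linalg.norm(A, ord=q, axis=1).sum()
```

A row of zeros contributes nothing for any q, which is exactly what makes the penalty drive entire rows (shared sparsity patterns) to zero.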

Figure 1: (Left) sparsity pattern of independent codes  with grey values representing non-zero entries; (right) group sparsity of codes for a subset of similar patches. Figure courtesy of [mairal2009non].
Deep learning models.

Deep neural networks have been successful over the past years and give state-of-the-art results for many restoration tasks. In particular, successful principles include very deep networks, residual connections, batch norm, and residual learning [lefkimmiatis2018universal, zhang2017beyond, zhang2018ffdnet, zhang2019rnan]. Recent models also use so-called attention mechanisms to model self-similarities, which are in fact pooling operations akin to non-local means. More precisely, a generic non-local module has been proposed in [liu2018non], which performs weighted averages of similar features. In [plotz2018neural], a relaxation of the k-nearest-neighbor selection rule is introduced, which can be used for designing deep neural networks for image restoration.

3 Methods

In this section, we first introduce trainable sparse coding models for image denoising, following [simon2019rethinking] (while introducing two minor improvements to the method), before presenting several approaches to model self-similarities.

3.1 Trainable sparse coding

In [simon2019rethinking], the sparse coding approach (SC) described in Section 2 is combined with the LISTA algorithm to perform denoising tasks.[1] The only modification we introduce here is a centering step for the patches, which empirically yields better results (and thus a stronger baseline).

[1] Specifically, [simon2019rethinking] considers the SC approach as a baseline, and proposes an improved model based on the principle of convolutional sparse coding (CSC). CSC is a variant of SC, where an image is approximated by a sparse linear combination of small dictionary elements placed at all possible positions in the image. Unfortunately, CSC leads to ill-conditioned sparse optimization problems and has been shown to perform poorly for image denoising. For this reason, [simon2019rethinking] introduces strides, which yields a hybrid approach between SC and CSC. In our paper, we have decided to stick to the SC baseline and leave the investigation of CSC models for future work.

SC Model - inference.

We now explain how an input image y is represented in the SC model, before discussing how to learn the parameters. Following the classical approach from Section 2, the first step consists of extracting all overlapping patches from y, which we denote by y_1, …, y_n, each obtained through a linear patch-extraction operator.

Then, we perform the centering operation for every patch:

    y_i^c = y_i − μ_i 1_m,   with   μ_i = (1/m) 1_mᵀ y_i.        (7)

The mean value μ_i is recorded and added back after denoising y_i^c. Hence, the low-frequency component of the signal does not flow through the model. This observation is related to the residual approach of deep learning methods for denoising and super-resolution [zhang2017beyond], where neural networks learn to predict the corruption noise rather than the full image. The centering step is not used in [simon2019rethinking], but we have found it to provide better reconstruction quality.
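The centering step (7) and its inverse amount to two lines of NumPy under our row-per-patch convention:

```python
import numpy as np

def center(patches):
    """Remove each patch's mean intensity; return centered patches and the means."""
    mu = patches.mean(axis=1, keepdims=True)   # per-patch mean, kept for later
    return patches - mu, mu                    # encode the centered patches; add mu back after decoding
```

The recorded means are added back to the decoded patches, so the low-frequency component bypasses the sparse model entirely.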

The next step consists of sparsely encoding each centered patch y_i^c with K steps of the LISTA variant presented in (4), replacing y_i by y_i^c there, assuming the parameters C, D, and Λ are given. Here, a minor change compared to [simon2019rethinking] is the use of parameters Λ_k varying at each LISTA step k, which leads to a minor increase in the number of parameters in our experiments.

Finally, the denoised image is obtained by averaging the patch estimates as in (2), after adding back the mean value:

    x̂ = (1/m) Σ_{i=1}^n R_i ( W α_i + μ_i 1_m ),        (8)

but note that the dictionary D is replaced by another one, W, of the same size. The reason for decoupling W from D is that the weighted ℓ1-penalty implicitly used by the LISTA method is known to shrink the coefficients α_i too much and to provide biased estimates of the signal. For this reason, classical denoising approaches based on sparse coding, such as [elad2006image, mairal2009non], use instead the ℓ0-penalty, but we have found it ineffective for end-to-end training. Therefore, as in [simon2019rethinking], we have chosen to decouple W from D.

In terms of implementation, it is worth noting that all the operations above can be expressed simply in classical frameworks for deep learning. LISTA steps indeed involve convolutions, after representing the α_i's as a traditional feature map akin to that of convolutional neural networks, whereas the averaging step (8) corresponds to the "transpose convolution" operation in TensorFlow or PyTorch.

Training the parameters.

We now show how to train the parameters in a supervised fashion, which differs from the traditional dictionary learning approach where only the noisy image is available [elad2006image, mairal2009non]. Here, we assume that we are given a training distribution of pairs of clean/noisy images (x, y), and we simply minimize the reconstruction loss

    min  E_{(x,y)} ‖ x − x̂(y) ‖²,        (9)

where x̂(y) is the denoised image defined in (8), given a noisy image y. This is then achieved by using stochastic gradient descent or one of its variants (see the experimental section).

3.2 Embedding non-local sparse priors

1:  Extract patches and center them with (7);
2:  Initialize the codes α_i to 0;
3:  Initialize the image estimate x̂ to the noisy input y;
4:  Initialize the pairwise similarities Σ between patches of y;
5:  for k = 1, …, K do
6:      Compute the pairwise patch similarities Σ̂ on x̂;
7:      Update Σ ← ν Σ + (1 − ν) Σ̂;
8:      for i = 1, …, n in parallel do
9:          α_i ← prox_{Ψ_i}[ α_i + Cᵀ (y_i^c − D α_i) ];
10:     end for
11:     Update the denoised image x̂ by averaging (8);
12: end for
Algorithm 1 Pseudo-code for the inference model

In this section, we replace the ℓ1-norm (or its weighted variant) by structured sparsity-inducing regularization functions that take into account non-local image self-similarities. This idea allows us to turn classical non-local sparse models [dong2012nonlocally, mairal2009non] into differentiable algorithms.

The generic approach is presented in Algorithm 1. The algorithm performs K steps, where it computes pairwise patch similarities Σ̂ between patches of a current estimate x̂, using various possible metrics that we discuss in Section 3.3. Then, the codes α_i are updated by computing a so-called proximal operator, defined below, for a particular penalty Ψ_i that depends on the similarities Σ and some learned parameters. Practical variants, where the pairwise similarities are only updated once in a while, are discussed in Section 3.4.

Definition 1 (Proximal operator).

Given a convex function Ψ : ℝ^p → ℝ, the proximal operator of Ψ is defined as

    prox_Ψ[z] = argmin_{u ∈ ℝ^p}  (1/2) ‖z − u‖² + Ψ(u).        (10)

The proximal operator plays a key role in optimization and admits a closed form for many sparsity-inducing penalties, see [mairal2014sparse]. Indeed, given such a penalty Ψ, it may be shown that the iterations α_i ← prox_Ψ[ α_i + Cᵀ (y_i − D α_i) ] are instances of the ISTA algorithm [beck2009fast] for minimizing the problem min_{α_i} (1/2) ‖y_i − D α_i‖² + Ψ(α_i), and then the update of the codes in Algorithm 1 becomes simply an extension of LISTA to deal with the penalty Ψ.

Note that for the weighted ℓ1-norm, the proximal operator is the soft-thresholding operator S_Λ introduced in Section 2, and we simply recover the SC algorithm from Section 3.1, since in that case the penalty does not depend on the pairwise similarities Σ (which then do not need to be computed). Next, we present different structured sparsity-inducing penalties that yield more effective algorithms.

3.2.1 Group Lasso and LSSC

For each location i, the LSSC approach [mairal2009non] defines groups S_i of similar patches; however, for computational reasons, LSSC relaxes the definition (5) in practice and instead implements a simple clustering method such that S_i = S_j if i and j belong to the same group. Then, under this clustering assumption and given a dictionary D, LSSC minimizes

    min_A  (1/2) ‖ Y − D A ‖_F² + Σ_{i=1}^n λ_i ‖ A_i ‖_{1,2},        (11)

where A = [α_1, …, α_n] in ℝ^{p×n} represents all codes, Y = [y_1, …, y_n], A_i = [α_j]_{j ∈ S_i} is the matrix of codes of the group S_i, ‖·‖_{1,2} is the group-sparsity regularizer defined in (6), ‖·‖_F is the Frobenius norm, and λ_i depends on the group size. As explained in Section 2, the role of the Group Lasso penalty is to encourage the codes belonging to the same cluster to share the same sparsity pattern, see Figure 1. For homogeneity reasons, we also consider a normalization factor depending on the group size, as in [mairal2009non]. Minimizing (11) with q = 2 is easy with the ISTA method (and thus compatible with LISTA), since we know how to compute its proximal operator, which is described below, see [mairal2014sparse]:

Lemma 1 (Proximal operator for the Group Lasso).

Consider a matrix U and let Z = prox_{λ‖·‖_{1,2}}[U]. Then, for every row Z^j of Z,

    Z^j = U^j max( 1 − λ / ‖U^j‖₂ , 0 ).        (12)
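Lemma 1 translates directly into a few lines of NumPy; this sketch uses our row-per-group-member convention, with a small constant guarding against division by zero:

```python
import numpy as np

def prox_group_lasso(U, lam):
    """Row-wise shrinkage of Lemma 1: scale each row by max(1 - lam/||row||_2, 0)."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return U * scale
```

Rows whose Euclidean norm is below λ are zeroed out entirely, which is how the operator enforces shared sparsity patterns across a group.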


Unfortunately, the procedure used to design the groups  does not yield a differentiable relation between the denoised image  and the parameters to learn, which raises a major difficulty. Therefore, we first relax the hard clustering assumption into a soft one, which is able to exploit a similarity matrix  representing pairwise relations between patches.

To do so, we first consider a similarity matrix Σ in {0,1}^{n×n} that encodes the hard clustering assignment used by LSSC, that is, Σ_ij = 1 if j is in S_i and 0 otherwise. Second, we note that ‖A_i‖_{1,2} = ‖A diag(Σ_i)‖_{1,2}, where Σ_i is the i-th column of Σ that encodes the i-th cluster membership. Then, we adapt LISTA to problem (11), with a different shrinkage parameter Λ_k[j] per coordinate j and per iteration k as in Section 3.1, which yields the following iteration:

    B ← A + Cᵀ (Y − D A),
    A_i^j ← B_i^j max( 1 − Λ_k[j] / ‖(B diag(Σ_i))^j‖₂ , 0 ),        (13)

where the second update is performed for all i and j, the superscript j denotes the j-th row of a matrix, as above, and Λ_k[j] is simply the j-th entry of Λ_k.

We are now in shape to relax the hard clustering assumption by allowing any similarity matrix Σ in (13), and then use a relaxation of the Group Lasso penalty in Algorithm 1. The resulting model is able to encourage similar patches to share similar sparsity patterns, while being trainable by minimization of the cost (9) with backpropagation.

3.2.2 Centralized sparse coding

A different approach to take into account self similarities in sparse models is the centralized sparse coding approach of [dong2012nonlocally]. This approach is easier to turn into a differentiable algorithm than the LSSC method, but we have empirically observed that it does not perform as well. Nevertheless, we believe it to be conceptually interesting, and we provide a brief description below.

The idea is relatively simple and consists of regularizing each code α_i with the function

    Ψ_i(α_i) = λ₁ ‖α_i‖₁ + λ₂ ‖α_i − β_i‖₁,        (14)

where β_i is obtained by a weighted average of codes obtained from a previous iteration, in the spirit of non-local means, where the weights involve pairwise distances between patches. Specifically, given some codes α_j obtained at a previous iteration and a similarity matrix Σ, we compute

    β_i = Σ_j Σ_ij α_j / Σ_j Σ_ij,        (15)

and the weights β_i are used to define the penalty (14) in order to compute the next codes. Note that the original CSR method of [dong2012nonlocally] uses similarities based on the distance between two clean estimates of the patches, but other similarity functions may be used, see Section 3.3.

Even though [dong2012nonlocally] does not use a proximal gradient descent method to solve the problem regularized with (14), the next proposition shows that it admits a closed form, which is a key to turn CSR into a differentiable algorithm.

Proposition 1 (Proximal operator of the CSR penalty).

Consider Ψ_i defined in (14). Then, for all z in ℝ^p, prox_{Ψ_i}[z] admits a closed form that involves only elementwise applications of the soft-thresholding operator.

The proof of this proposition and the exact expression can be found in the appendix. The proximal operator is then differentiable almost everywhere, and thus can easily be plugged into Algorithm 1. At each iteration, the similarity matrix Σ is updated along with the codes β_i. Note also that a variant with different thresholding parameters per iteration k and coordinate j can be used in this model, as before for LSSC and SC.
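To build intuition for why such a closed form exists, consider the simplified penalty λ‖u − β_i‖₁ alone, dropping the plain ℓ1 term of (14); its proximal operator shrinks toward β_i instead of 0. This is a deliberate simplification for illustration, not the paper's exact Proposition 1:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_centered_l1(z, beta, lam):
    """prox of u -> lam * ||u - beta||_1: shift by beta, soft-threshold, shift back."""
    return beta + soft_threshold(z - beta, lam)
```

Entries of z within λ of β are snapped exactly onto β, mirroring how the full penalty pulls each code toward its non-local estimate.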

3.3 Practical similarity metrics

We have computed the similarities Σ in various manners, and implemented the following practical heuristics.

Semi-local grouping.

As in all methods that exploit non-local self-similarities in images, we restrict the search for similar patches to a window centered around the patch. This approach is commonly used to reduce the size of the similarity matrix and the global memory cost of the method. This means that the similarity between two patches is always zero if the corresponding pixels are too far apart.

Learned distance.

We always use a similarity function of the form Σ_ij = e^{−d_ij / σ_k}, where d_ij is a distance between patches i and j, and σ_k is a parameter used at iteration k of Algorithm 1, which we learn by backpropagation on the objective function. As in classical deep learning models using non-local approaches [zhang2019rnan], we do not directly use the Euclidean distance between patches, but allow to learn a few parameters. Specifically, we consider

    d_ij = ‖ diag(w) ( ŷ_i − ŷ_j ) ‖²,        (16)

where ŷ_i and ŷ_j are the i-th and j-th patches from the current estimate of the denoised image, respectively, and w is a set of weights, which are also learned by backpropagation.
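A dense version of the learned similarity (16) can be sketched as follows; the function name is ours, and the semi-local windowing described above is omitted for brevity:

```python
import numpy as np

def similarity_matrix(patches, w, sigma):
    """Sigma_ij = exp(-d_ij / sigma), with d_ij = ||w * (y_i - y_j)||^2.

    `patches`: (n, m) matrix of patches as rows; `w`: (m,) learned weights;
    `sigma`: learned bandwidth. Both w and sigma would be trained by backprop.
    """
    diff = patches[:, None, :] - patches[None, :, :]   # (n, n, m) pairwise differences
    d = np.sum((diff * w) ** 2, axis=-1)               # weighted squared distances
    return np.exp(-d / sigma)
```

The resulting matrix is symmetric with a unit diagonal, and all entries lie in (0, 1], so it behaves like a soft cluster-membership matrix.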

Online averaging of similarity matrices.

As shown in Algorithm 1, we use a convex combination of similarity matrices (with the mixing parameter ν also learned by backpropagation), which provides better results than computing the similarities on the current estimate only. This is expected, since the current estimate may have lost too much signal information for the similarities to be computed accurately.

3.4 Practical variants and implementation

Finally, we conclude this methodological section by discussing other practical variants and implementation details.

Dictionary initialization.

A great benefit of designing an architecture that admits a sparse coding interpretation is that the dictionary parameters can be initialized with a classical dictionary learning approach, instead of using random weights, which makes the model more robust to initialization. To do so, we use the online method of [mairal2010online], implemented in the SPAMS toolbox, due to its robustness and speed.

Block processing and dealing with border effects.

The size of the similarity matrix grows quadratically with the image size, which requires processing sub-image blocks sequentially rather than the full image directly. Here, the block size is chosen to match the size of the non-local window, which requires taking into account two important details:

(i) Pixels close to the image border belong to fewer image patches than those at the center, and thus receive fewer estimates in the averaging procedure. When processing images per block, it is thus important to have a small overlap between blocks, such that the number of estimates per pixel is consistent across the image.

(ii) For training, we also process image blocks. It is then important to take border effects into account by rescaling the reconstruction loss by the number of estimates per pixel.

4 Extension to demosaicking

Most modern digital cameras acquire color images by measuring only one color channel per pixel, red, green, or blue, according to a specific pattern called the Bayer pattern. Demosaicking is the processing step that reconstructs a full color image given these incomplete measurements.

Originally addressed with interpolation techniques [gunturk], demosaicking has since been successfully tackled by sparse coding [mairal2009non] and deep learning models. Most of them, such as [zhang2017learning, zhang2019rnan], rely on generic architectures and black-box models that do not encode a priori knowledge about the problem, whereas the authors of [kokkinos2019iterative] propose an iterative algorithm that relies on the physics of the acquisition process. Extending our model to demosaicking (and in fact to other inpainting tasks with small holes) can be achieved by introducing a mask M_i in the formulation for unobserved pixel values. Formally, we define the binary mask M_i for patch i, and M represents all masks. Then, the sparse coding formulation becomes

    min_A  (1/2) ‖ M ∘ ( Y − D A ) ‖_F² + Σ_{i=1}^n λ_i ‖ A_i ‖_{1,2},        (17)

where ∘ denotes the elementwise product between two matrices. The first updating rule of equation (13) is modified accordingly, which has the effect of discarding the reconstruction error of masked pixels.
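The mask construction and the masked residual of (17) can be sketched as follows; the RGGB layout below is one common Bayer convention (the paper does not specify which variant it uses), and the function names are ours:

```python
import numpy as np

def bayer_mask(H, W):
    """Binary RGGB Bayer mask of shape (H, W, 3): one observed channel per pixel."""
    M = np.zeros((H, W, 3))
    M[0::2, 0::2, 0] = 1.0   # red
    M[0::2, 1::2, 1] = 1.0   # green
    M[1::2, 0::2, 1] = 1.0   # green
    M[1::2, 1::2, 2] = 1.0   # blue
    return M

def masked_residual(Y, D, A, M):
    """Reconstruction error restricted to observed pixels: M * (Y - D A)."""
    return M * (Y - D @ A)
```

Multiplying the residual by the mask zeroes out the error on unobserved pixels, so the gradient steps of the unrolled algorithm are driven only by the actual measurements.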


5 Experiments

Figure 2: Demosaicking result obtained by our method. Top right: ground truth. Middle: image demosaicked with our sparse coding baseline without the non-local prior. Bottom: demosaicking with sparse coding and the non-local prior. The reconstruction does not exhibit any artefacts on this image, which is notoriously difficult for demosaicking.
Figure 3: Color denoising results for 4 images from the Kodak24 dataset. (a) Original image and close-up region; (b) ground truth; (c) noisy image; (d) CBM3D; (e) CDnCNN; (f) GroupSC (ours). Best seen by zooming on a computer screen.
Training dataset.

In our experiments, we adopt the setting of [zhang2017beyond], which is the most standard one used by recent deep learning methods, allowing a simple and fair comparison. In particular, we use as a training set a subset of the Berkeley Segmentation Dataset (BSD) [martin2001database], commonly called BSD400, even though we believe it to be suboptimal: BSD400 is relatively small, with only 400 medium-resolution images, some of which suffer from compression artefacts. We evaluate the models on 4 popular benchmarks, called Set12, BSD68 (with no overlap with BSD400), Kodak24, and Urban100, see [zhang2019rnan]. For demosaicking, we evaluate our model on Kodak24 and BSD68.

Training details.

During training, we randomly extract patches whose size equals the neighborhood size for non-local operations. We apply light data augmentation (random rotations and horizontal flips). We optimize the parameters of our models using ADAM [kingma2014adam] with a fixed minibatch size. All the models are trained for 300 epochs for denoising and 200 epochs for demosaicking. The learning rate is set at initialization and then sequentially lowered during training by a factor of 0.35 every 80 training steps. Similar to [simon2019rethinking], we normalize the initial dictionary D by its largest singular value, which helps the LISTA algorithm to converge. We initialize the dictionaries C, D, and W with the same value, similarly to the implementation of [simon2019rethinking] released by the authors.

Since large learning rates can make the model diverge, we have implemented a backtracking strategy that automatically decreases the learning rate by a factor of 0.8 when the loss function increases, and restores a previous snapshot of the model. Divergence is monitored by computing the loss on the training set every 20 epochs. Training the GroupSC model from Section 3.2.1 for color denoising takes about 1.5 days on a Titan RTX GPU, whereas inference for one block of 56x56x3 pixels takes 50 ms.

Method                     RNAN    GroupSC (ours)
Parameters                 8.96M   192k
Depth (number of layers)   120     25
Nr training epochs         1300    200
Table 1: Architecture comparison between our model GroupSC and the second best method for trainable demosaicking.

We follow the same experimental setting as IRCNN [zhang2017learning], but we do not crop the output images as done in [zhang2017learning, mairal2009non], since [zhang2019rnan] does not seem to perform such an operation according to their code online. At inference time, we replace the pixel prediction by its corresponding observation when the pixel is not occluded by the Bayer pattern mask.

We evaluate the performance of the three variants SC, CSR, and GroupSC of our proposed framework. We compare our models with state-of-the-art deep learning methods [kokkinos2019iterative, zhang2019rnan]. We also report the performance of LSSC. For the competing methods, we provide the numbers reported in the corresponding papers, unless specified otherwise. We first observe that our baseline already provides very good results, which is surprising given its simplicity. Compared to RNAN, our model is much smaller and shallower, as shown in Table 1. We report the number of parameters of [zhang2019rnan] based on the implementation of the authors. We also note that CSR performs poorly in comparison with our baseline and GroupSC.

Method                          Params   Kodak24   BSD68
LSSC [mairal2009non]            -        41.39     -
IRCNN [zhang2017learning]       -        40.41     -
MMNet [kokkinos2019iterative]   380k     42.0      -
RNAN [zhang2019rnan]            8.96M    42.86     42.61
SC (ours)                       192k     42.51     42.33
CSR (ours)                      192k     42.44     -
GroupSC (ours)*                 192k     42.87     42.71

*We report here our scores without any cropping, similarly to [zhang2019rnan]. If we crop 10 pixels from the border following [zhang2017learning], we obtain 42.98 dB on Kodak24 and 42.64 dB on BSD68.

Table 2: Demosaicking. Training on BSD400. Performance is measured in terms of average PSNR. Best is in bold.
Color Image Denoising

For fair comparaison, we train our models under the same setting of [lefkimmiatis2017non, zhang2017beyond] We corrupt images with synthetic additive gaussian noise with a variance . We train a different model for each variance of noise. We choose a patch size of and a set the size of the dictionary to 256. We report the performance in term of PSNR of our model in Table 3, along with those of competitive approaches, and provide results on other datasets in the appendix.

Finally, we compare our model with [zhang2019rnan] in Table 4 on a separate set of noise levels, because we did not manage to run their code for the σ values considered in Table 3. Overall, RNAN performs slightly better than GroupSC, at the cost of using roughly 80 times more parameters.

Method Params Noise level (σ)
 5 15 25 50
CBM3D [dabov2007image] - 40.24 33.49 30.68 27.36
CSCnet [simon2019rethinking] 186k - 33.83 31.18 28.00 (trained on a larger dataset of 5214 images: Waterloo + BSD432)
NLNet[lefkimmiatis2017non] - - 33.69 30.96 27.64
FFDNET [zhang2018ffdnet] 486k - 33.87 31.21 27.96
CDnCNN [zhang2017beyond] 668k 40.11 33.89 31.22 27.91
SC (baseline) 112K 40.44 33.75 30.94 27.39
CSR (ours) 112K 40.53 34.05 31.33 28.01
GroupSC (ours) 112K 40.61 34.10 31.42 28.03
Table 3: Color denoising on CBSD68. Training on CBSD400 unless specified. Performance is measured in terms of average PSNR (in dB). Best is in bold.
Method Params
RNAN [zhang2019rnan] 8.96M 36.60 30.73
GroupSC (ours) 112K 36.42 30.48
Table 4: Color denoising on CBSD68. Training on BSD400. Performance is measured in terms of average PSNR (in dB). Best is in bold.
Grayscale Denoising

In order to simplify the comparison, we train our models under the same setting as [zhang2017beyond, lefkimmiatis2017non, liu2018non]. We corrupt images with synthetic additive Gaussian noise of standard deviation σ ∈ {5, 15, 25, 50}, train a different model for each σ, and report the performance in terms of PSNR. For grayscale denoising, we choose a patch size of 9 × 9 and a dictionary with 256 atoms. Our method performs on par with DnCNN for σ ≥ 15 and significantly better in the low-noise setting σ = 5.
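The PSNR values reported throughout can be computed as follows (a standard definition; the peak value of 255 assumes 8-bit images):

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Average PSNR in dB between a reference image and an estimate."""
    err = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(err ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```
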

Method Params Noise level (σ)
 5 15 25 50
BM3D [dabov2007image] - 37.57 31.07 28.57 25.62
LSSC [mairal2009non] - 37.70 31.28 28.71 25.72
BM3D PCA [dabov2009bm3d] - 37.77 31.38 28.82 25.80
WNNM [gu2014weighted] - 37.76 31.37 28.83 25.87
CSCnet [simon2019rethinking] 62k 37.84 31.57 29.11 26.24
CSCnet [simon2019rethinking] 62k 37.69 31.40 28.93 26.04 (retrained with the authors' publicly available code on the smaller training set BSD400)
TNRD [chen2016trainable] - - 31.42 28.92 25.97
NLNet [lefkimmiatis2017non] - - 31.52 29.03 26.07
FFDNet [zhang2018ffdnet] 486k - 31.63 29.19 26.29
DnCNN [zhang2017beyond] 556k 37.68 31.73 29.22 26.23
N3 [plotz2018neural] 706k - - 29.30 26.39
NLRN [liu2018non] 330k 37.92 31.88 29.41 26.47
SC (baseline) 68K 37.84 31.46 28.90 25.84
CSR (ours) 68K 37.88 31.64 29.16 26.08
GroupSC (ours) 68K 37.95 31.69 29.19 26.18
Table 5: Grayscale Denoising on BSD68. Training on BSD400 unless specified. Performance is measured in terms of average PSNR (in dB). Best is in bold.

6 Conclusion

We have presented a differentiable algorithm based on non-local sparse image models, which performs on par with or better than recent deep learning models, while using significantly fewer parameters. We believe that the performance of such approaches (including the simple SC baseline) is surprising given the small model size, and given the fact that the algorithm can be interpreted as a single sparse coding layer operating on fixed-size patches.

This observation paves the way for future work for sparse coding models that should be able to model the local stationarity of natural images at multiple scales, which we expect should perform even better.


Acknowledgments

Julien Mairal and Bruno Lecouat were supported by the ERC grant number 714381 (SOLARIS project) and by ANR 3IA MIAI@Grenoble Alpes. Jean Ponce was supported in part by the Louis Vuitton/ENS chair in artificial intelligence and the Inria/NYU collaboration.


Appendix A Appendix

a.1 Additional experimental details

In order to accelerate the inference time of the non-local models, we update the patch similarities only every Δ steps, where Δ denotes the frequency of the correlation updates. We summarize in Table 6 the set of hyper-parameters that we selected for the experiments reported in the main tables. We selected the same hyper-parameters for the baselines, except that we do not compute pairwise patch similarities.

Experiment Color d. Gray d. Demosaicking
Patch size 7 9 9
Dictionary size 256 256 256
Nr epochs 300 300 200
Batch size 25 25 16
iterations 24 24 24
Correlation update
Table 6: Hyper-parameters of our experiments.
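The correlation-update schedule described above can be sketched as follows; the function names are hypothetical, and the loop body stands in for one unrolled iteration of the model:

```python
def restore(x0, n_iters=24, delta=6, step=None, update_similarities=None):
    """Sketch of the unrolled inference loop: pairwise patch
    similarities are recomputed only every `delta` iterations
    instead of at every step, to reduce inference time."""
    x, sims = x0, None
    for t in range(n_iters):
        if t % delta == 0:
            sims = update_similarities(x)  # expensive non-local step
        x = step(x, sims)                  # one sparse-coding iteration
    return x
```

With 24 iterations and Δ = 6, the expensive similarity computation runs only 4 times instead of 24.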

a.2 Influence of patch and dictionary size

We investigate in Table 7 the influence of two hyper-parameters for grayscale image denoising: the patch size and the dictionary size. For this experiment, we run a lighter version of the GroupSC model in order to accelerate training: the batch size was decreased from 25 to 16, the frequency of the correlation updates was decreased, and the intermediate patches are not approximated with averaging. These changes accelerate training but lead to slightly lower performance compared with the model trained in the standard setting, which explains the gap between the scores in Table 7 and in Table 5.

Noise (σ) Patch size n = 128 n = 256 n = 512
σ = 5 k = 7 37.91 37.92 -
 k = 9 37.90 37.92 37.96
 k = 11 37.89 37.89 -
σ = 15 k = 7 31.60 31.63 -
 k = 9 31.62 31.67 31.71
 k = 11 31.63 31.67 -
σ = 25 k = 7 29.10 29.11 -
 k = 9 29.12 29.17 29.20
 k = 11 29.13 29.18 -
Table 7: Influence of the dictionary size and the patch size on the denoising performance. Grayscale denoising on BSD68. Models are trained on BSD400, in a light setting to accelerate the training.

a.3 Grayscale denoising: evaluation on multiple datasets

We provide additional grayscale denoising results of our model on other datasets, in terms of PSNR, in Table 8.

Dataset Noise (σ) BM3D DnCNN NLRN GroupSC
Set12 5 - - - 38.40
15 32.37 32.86 33.16 32.85
25 29.97 30.44 30.80 30.44
50 26.72 27.18 27.64 27.14
BSD68 5 37.57 37.68 37.92 37.95
15 31.07 31.73 31.88 31.70
25 28.57 29.23 29.41 29.20
50 25.62 26.23 26.47 26.18
Urban100 5 - - - 38.51
15 32.35 32.68 33.45 32.71
25 29.70 29.91 30.94 30.05
50 25.95 26.28 27.49 26.44
Table 8: Grayscale denoising on different datasets. Training on BSD400. Performance is measured in terms of average PSNR (in dB).

a.4 Color denoising: evaluation on multiple datasets

We provide additional color denoising results of our model on other datasets, in terms of PSNR, in Table 9.

Dataset Noise (σ) CBM3D GroupSC
Kodak24 5 - 40.72
15 33.25 34.98
25 32.06 32.44
50 28.75 29.16
CBSD68 5 40.24 40.61
15 33.49 34.10
25 30.68 31.42
50 27.36 28.03
Urban100 5 - 39.74
15 33.22 34.11
25 30.59 31.63
50 26.59 28.20
Table 9: Color denoising on different datasets. Training on CBSD400. Performance is measured in terms of average PSNR (in dB).

a.5 Proof of Proposition 1

The proximal operator of the function for in  is defined as

The optimality condition for the previous problem is

We consider each component separately. We suppose that , otherwise boils down to the norm. And we also suppose .

Let us first examine the case where . The subdifferential of the norm is the interval , and the optimality condition is

Similarly if

Finally let’s examine the case where and : then, and . The minimum is obtained as

We study separately the cases where , and when , and proceed similarly when . With elementary operations, we can derive the expression of for each case; putting all the cases together, we obtain the formula.
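As a sanity check for the degenerate case mentioned above, where the penalty reduces to the ℓ1 norm, the proximal operator is the classical soft-thresholding operator, prox_{λ|·|}(u) = sign(u) max(|u| − λ, 0):

```python
import numpy as np

def soft_threshold(u, lam):
    """Proximal operator of lam * |u| (the l1 special case):
    sign(u) * max(|u| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
```

The general formula of Proposition 1 should recover this operator when the additional term vanishes.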