1 Introduction
Deep learning methods have been successfully applied in various computer vision tasks, including image classification
[20, 16] and object detection [11, 29], and have dramatically improved the performance of these systems, setting the new stateoftheart. Recently, very promising results have also been reported for image processing applications such as image restoration [5, 39][18] and optical flow [1].The significant boost in performance achieved by deep networks can be mainly attributed to their advanced modeling capabilities, thanks to their deep structure and the presence of nonlinearities that are combined with discriminative learning on large training datasets. However, most of the current deep learning methods developed for image restoration tasks are based on general network architectures that do not fully exploit problemspecific knowledge. It is thus reasonable to expect that incorporating such information could lead to further improvements in performance. Only very recently, Schmidt and Roth [34] and Chen and Pock [6] introduced deep networks whose architecture is specifically tailored to certain image restoration problems. However, even in these cases, the resulting models are local ones and do not take into account the inherent nonlocal selfsimilarity property of natural images. On the other hand, conventional methods that have exploited this property have been shown to gain significant improvements compared to standard local approaches. A notable example is the Block Matching and 3D Collaborative Filtering (BM3D) method [7] which is a very efficient and highly engineered approach that held the stateoftheart record in image denoising for almost a decade.
(a)  (b) 
In this work, motivated by the recent advances in deep learning and relying on the rich body of algorithmic ideas that have been developed in the past for tackling image reconstruction problems, we study deep network architectures for image denoising. Inspired by nonlocal variational methods and other related approaches, we design a network that performs nonlocal processing and at the same time it significantly benefits from discriminative learning. Specifically, our strategy is instead of manually designing a nonlocal regularization functional, to learn the nonlocal regularization operator and the potential function following a lossbased training approach.
Our contributions in this work can be summarized as follows: (1) We propose a novel deep network architecture that is discriminatively trained for image denoising. As opposed to the existing deeplearning methods for image restoration, which are based on local models, our network explicitly models the nonlocal selfsimilarity property of natural images through a grouping operation of similar patches and a joint filtering. (2) We unroll a proximal gradient method into a deep network and learn the relevant parameters using a simple yet effective backpropagation strategy. (3) In contrast to the majority of recent denoising methods that are designed for processing singlechannel images, we introduce a variation of our network that applies to color images and leads to stateoftheart results. (4) We highlight a direct link of our proposed nonlocal networks with convolutional neural networks (CNNs). This connection allows our models to take full advantage of the latest advances on GPU computing in deep learning and makes them amenable to efficient implementations through their inherent parallelism.
2 Variational Image Restoration Revisited
The goal of image denoising is the restoration of a grayscale or color image from a corrupted observation , with the later obtained according to the observation model
(1) 
In this setting, ,
are the vectorized versions of the observed and latent images, respectively,
is the number of pixels, the number of image channels, andis assumed to be i.i.d Gaussian noise with variance
.Due to the illposedness of the studied problem [38], Eq. (1) that relates the latent image to the observation cannot uniquely characterize the solution. This implies that in order to obtain a physically or statistically meaningful solution, the image evidence must be combined with suitable image priors.
Among the most popular and powerful strategies available in the literature for combining the observation and prior information is the variational approach. In this framework the recovery of from heavily relies on the formation of an objective function
(2) 
whose role is to quantify the quality of the solution. Typically the objective function consists of two terms, namely the data fidelity term , which measures the proximity of the solution to the observation, and the regularizer which constrains the set of plausible solutions by penalizing those that do not exhibit the desired properties. The regularization parameter balances the contributions of the two terms. Then, the restoration task is cast as the minimization of this objective function and the minimizer corresponds to the restored image. Note that for the problem under consideration and since the noise corrupting the observation is i.i.d Gaussian, the data term should be equal to
. This variational restoration approach has also direct links to Bayesian estimation methods and can be interpreted either as a penalized maximum likelihood or a maximum a posteriori (MAP) estimation problem
[2, 13].2.1 Image Regularization
The choice of an appropriate regularizer is very important, since it is one of the main factors that determine the quality of the restored image. For this reason, a lot of effort has been made to design novel regularization functionals that can model important image properties and consequently lead to improved reconstruction results. Most of the existing regularization methods are based either on a synthesis or an analysisbased approach. Synthesisbased regularization takes place in a sparsifyingdomain, such as the wavelet basis, and the restored image is obtained by applying an inverse transform [13]. On the other hand, analysisbased regularization involves regularizers that are directly applied on the image one aims to restore. For general inverse problems, the latter regularization strategy has been reported to lead to better reconstruction results [9, 35] and therefore is mostly preferred.
The analysisbased regularizers are typically defined as:
(3) 
where is the regularization operator ( denotes the Ddimensional th entry of the result obtained by applying to the image ) and is the potential function. Common choices for are differential operators of the first or of higher orders such as the gradient [31, 3]
, the structure tensor
[23], the Laplacian and the Hessian [21, 24], or waveletlike operators such as wavelets, curvelets and ridgelets (see [13] and references therein). For the potential function the most popular choices are vector and matrix norms, but other type of functions are also frequently used such as the pseudonorm and the logarithm. Combinations of the above regularization operators and potential functions lead to existing regularization functionals that have been proven very effective in several inverse problems, including image denoising. A notable representative of the above regularizers is the Total Variation (TV) [31], where the regularization operator corresponds to the gradient and the potential function to the vector norm.TV regularization and similar methods that penalize derivatives are essentially local methods, since they involve operators that act on a restricted region of the image domain. More recently, a different regularization paradigm has been introduced where nonlocal operators are employed to define new regularization functionals [40, 19, 10, 14, 22]. The resulting nonlocal methods are wellsuited for image processing and computervision applications and produce very competitive results. The reason is that they allow longrange dependencies between image points and are able to exploit the inherent nonlocal selfsimilarity property of natural images. This property implies that images often consist of localized patterns that tend to repeat themselves possibly at distant locations in the image domain.
It is worth noting that alternative image denoising methods that do not fall in the category of analysisbased regularization schemes but still exploit the selfsimilarity property have been developed and produce excellent results. A nonexhaustive list of these methods is the nonlocal means filter (NLM) [4], BM3D [7], the Learned Simultaneous Sparse Coding (LSSC) [25], and the Weighted Nuclear Norm Minimization (WNNM) [15].
2.2 Objective Function Minimization
Besides the formulation of the objective function and the proper selection of the regularizer, another important aspect in the variational approach is the minimization strategy that will be employed to obtain the solution. For the case under study, the solution to the image denoising problem can be mathematically formulated as:
(4) 
where is the indicator function of the convex set . The indicator function takes the value 0 if and otherwise. The presence of this additional term in Eq. (4) stems from the fact that these type of constraints on the image intensities arise naturally. For example it is reasonable to require that the intensity of the restored image should either be nonnegative (nonnegativity constraint with ) or its values should lie in a specific range (boxconstraint).
2.3 Proximal Gradient Method
There is a variety of powerful optimization strategies for dealing with Eq. (4). The simplest approach however, which we will follow in this work, is to directly use a gradientdescent algorithm. Since the indicator function is nonsmooth, instead of the classical gradient descent algorithm we employ the proximal gradient method [28]. According to this method, the objective function is split into two terms, one of which is differentiable. Here we assume that the potential function is smooth and therefore we can compute its partial derivatives. In this case, the splitting that we choose for the objective function has the form , where is defined as
(5) 
Note that in the above definition we have gone one step further and we have expressed the multivariable potential function as the sum of singlevariable functions,
(6) 
As it will become clear later, this choice will allows us to reduce significantly the computational cost for training our network and will make the learning process feasible. It is also worth noting that this decoupled formulation of the potential function is met frequently in image regularization, as in wavelet regularization [13], anisotropic TV [12] and FieldofExperts (FoE) [30].
After the splitting of the objective function, the proximal gradient method recovers the solution in an iterative fashion, using the updates
(7) 
where is a step size and is the proximal operator [28] related to the indicator function . The proximal map in this case corresponds to the orthogonal projection of the input onto , and hereafter will be denoted as .
Given that the gradient of is computed as
(8) 
where and , each proximal gradient iteration can be finally rewritten as
(9) 
where .
In order to obtain the solution of the minimization problem in Eq. (4) using this iterative scheme, a large number of iterations is required. In addition, the exact form of the operator and the potential function must be specified. Determining appropriate values for these quantities is in general a very difficult task. This has generated increased research interest and a lot of effort has been made for designing regularization functionals that can lead to good reconstruction results.
3 Proposed NonLocal Network
In this work, we pursue a different approach than conventional regularization methods and instead of handpicking the exact forms of the potential function and the regularization operator, we design a network that has the capacity to learn these quantities directly from training data. The core idea is to unroll the proximal gradient method and use a limited number of the iterations derived in Eq. (9) to construct the graph of the network. Then, we learn the relevant parameters by training the network using pairs of corrupted and groundtruth data.
Next, we describe in detail the overall architecture of the proposed network, which is trained discriminatively for image denoising. First we motivate and derive its structure for processing grayscale images, and then we explain the necessary modifications for processing color images.
3.1 NonLocal Regularization Operator
As mentioned earlier, nonlocal regularization methods have been shown to produce superior reconstruction results than their local counterparts [14, 22] for several inverse problems, including image denoising. Their superiority in performance is mainly attributed to their ability of modeling complex image structures by allowing longrange dependencies between points in the image domain. This fact highly motivates us to explore the design of a network that will exhibit a similar behavior. To this end, our starting point is the definition of a nonlocal operator that will serve as the backbone of our network structure.
Let us consider a singlechannel image of size and let , where , be the vector that is formed by stacking together the columns of . Further, we consider image patches of size and we denote by , with , the vector whose elements correspond to the pixels of the th image patch extracted from . The vector is derived from as , where is a binary matrix that indicates which elements of belong to . For each one of the extracted image patches, its closest neighbors are selected. Let , with , be the set of indices of the most similar patches to the th patch ^{1}^{1}1The convention used here is that the set includes the reference patch, i.e. .. Next, a twodimensional transform is applied to every patch . The patch transform can be represented by a matrixvector multiplication where . Note that if then the patch representation in the transform domain is redundant. In this work, we focus on the nonredundant case where . For the transformed patch , a group is formed using the closest patches. This is denoted as
(10) 
The final step of the nonlocal operator involves collaborating filtering among the group, which can be expressed as , where is a weighting matrix and is constructed by retaining the first rows of a circulant matrix. The first row of this matrix corresponds to the vector , where . This collaborative filtering amounts to performing a weighted sum of the transformed patches in the group, i.e.
(11) 
Based on the above, the nonlocal operator acting on an image patch can be expressed as the composition of three linear operators, that is
(12) 
where and is a block diagonal matrix whose diagonal elements correspond to the patchtransform matrix . The nonlocal operator described above bears strong resemblance to the BM3D analysis operator studied in [8]. The main difference between the two is that for the proposed operator in (12) a weighted average of the transformed patches in the group takes place, as described in Eq. (11), while for the operator of [8] a 1D Haar wavelet transform is applied on the group. Our decision for this particular setup of the nonlocal operator was mainly based on computational considerations and for decreasing the memory requirements of the network that we propose next.
Due to the specific structure of the nonlocal operator (composition of linear operators) it is now easy to derive its adjoint as
(13) 
The adjoint of the nonlocal operator is an important component of our network since it provides a reverse mapping from the transformed patch domain to the original image domain, that is .
3.1.1 Convolutional Implementation of the NonLocal Operator
As we explain next, both the nonlocal operator defined in (12) and its adjoint defined in (13) can be computed using convolution operations and their transpose. Therefore, they can be efficiently implemented using modern software libraries such as OMP and cuDNN that support multithreaded CPU and parallel GPU implementations.
Concretely, the image patch extraction and the 2D patch transform, , can be combined and computed by passing the image from a convolutional layer. In order to obtain the desired output, the filterbank should consist of as many 2D filters as the number of coefficients in the transform domain. In addition, the support of these filters should match the size of the image patches. This implies that in our case filters with a support of
should be used. Also note that based on the desired overlap between consecutive image patches, an appropriate stride for the convolution layer should be chosen. Finally, the nonlocal weighted sum operation of (
11) can also be computed using convolutions. In particular, following the grouping operation of the similar transformed patches, which is completely defined by the set , the desired output can be obtained by convolving the grouped data with a single 3D filter of support . The necessary steps for computing the nonlocal operator using convolutional layers are illustrated in Fig. 2. To compute the adjoint of the nonlocal operator one simply has to follow the opposite direction of the graph shown in Fig. 2 and replace the convolution and patch grouping operations with their transpose operations.3.2 Parameterization of the Potential Function
Besides the nonlocal operator , we further need to model the potential function . We do this indirectly by representing its partial derivatives
as a linear combination of Radial Basis Functions (RBFs), that is
(14) 
where are the expansion coefficients and are the centers of the basis functions . There are a few radial functions to choose from [17], but in this work we use Gaussian RBFs, . For our network we employ Gaussian kernels whose centers are distributed equidistantly and they all share the same precision parameter . The representation of using mixtures of RBFs is very powerful and allow us to approximate with high accuracy arbitrary nonlinear functions. This is an important advantage over conventional regularization methods that mostly rely on a limited set of potential functions such as the ones reported in Section 2.1. Also note that this parameterization of the potential gradient would have been computationally very expensive if we had not adopted the decoupled formulation of Eq. (6) for the potential function.
Having all the pieces of the puzzle in order, the architecture of a single “iteration” of our network, which we will refer to it as stage, is depicted in Fig. 3. We note that our network follows very closely the proximal gradient iteration in Eq. (9). The only difference is that the parameter has been absorbed by the potential gradient , whose representation is learned. We further observe that every stage of the network consists of both convolutional and deconvolutional layers and in between there is a layer of trainable nonlinear functions.
3.3 Color Image Denoising
The architecture of the proposed network as shown in Fig. 3 can only handle grayscale images. To deal with RGB color images, a simple approach would be to use the same network to process each image channel independently. However, this would result to a suboptimal restoration performance since the network would not be able to explore the existing correlations between the different channels.
To circumvent this limitation, we follow a similar strategy as in [7] and before we feed the noisy color image to the network, we apply the same opponent color transformation which results to one luminance and two chrominance channels. Due to the nature of the color transform, the luminance channel contains most of the valuable information about primitive image structures and has a higher signaltonoiseratio (SNR) than the two chroma channels. We take advantage of this fact and since the blockmatching operation can be sensitive to the presence of noise, we perform the grouping of the patches only from the luminance channel. Then, we use exactly the same set of group indices for the other two image channels. Another important modification that we make to the original network is that for every image channel we learn a different RBF mixture. The reason for this is that due to the color transformation the three resulting channels have different SNRs that need to be correctly accounted for. Finally, it is important to note that all the image channels share the same filters of the convolutional and weightedsum layers and their transposes. The reasoning here is that this way the network can better exploit the channel correlations. A byproduct of the specific network design is that the search for similar patches needs to be performed only once compared to the naive implementation that would demand it to be computed independently for each channel. In addition, since this operation is computed only once from the noisy input and then it is reused in all the network stages, the processing of the color channels can take place in a completely decoupled way and therefore the network admits a very efficient parallel implementation.
[width=.165]im_gray_gt  [width=.165]im_gray_noisy_std25  [width=.165]im_gray_NLNet_std25  [width=.165]im_gray_mlp_std25  [width=.165]im_gray_tnrd_std25  [width=.165]im_gray_wnnm_std25 
(a)  (b)  (c)  (d)  (e)  (f) 
[width=.25]im_color_gt  [width=.25]im_color_noisy_std50  [width=.25]im_color_NLNet_std50  [width=.25]im_color_bm3d_std50 
(a)  (b)  (c)  (d) 
Noise  Methods  

(std.)  BM3D [7]  LSSC [25]  EPLL [41]  WNNM [15]  [34]  [6]  [37]  MLP[5]  
15  31.08  31.27  31.19  31.37  31.24  31.42  31.43  –  31.49  31.52 
25  28.56  28.70  28.68  28.83  28.72  28.92  28.89  28.96  28.98  29.03 
50  25.62  25.72  25.67  25.83  –  25.96  –  26.02  25.99  26.07 
4 Discriminative Network Training
We train our network, which consists of stages, for grayscale and color image denoising, where the images are corrupted by i.i.d Gaussian noise. The network parameters , where denotes the set of parameters for the th stage, are learned using a lossminimization strategy given pairs of training data , where is a noisy input and is the corresponding groundtruth image. To achieve an increased capacity for the network, we learn different parameters for each stage. Therefore, the overall architecture of the network does not exactly map to the proximal gradient method but rather to an adaptive version. Nevertheless, in each stage the convolution and deconvolution layers share the same filter parameters and, thus, they correspond to proper proximal gradient iterations.
Since the objective function that we need to minimize is nonconvex, in order to avoid getting stuck in a bad localminima but also to speedup the training, initially we learn the network parameters by following a greedytraining strategy. The same approach has been followed in [34, 6]. In this case, we minimize the cost
(15) 
where is the output of the
th stage and the loss function
corresponds to the negative peak signaltonoiseratio (PSNR). This is computed as(16) 
where is the total number of pixels of the input images and is the maximum intensity level (i.e. for grayscale images and for color images).
Noise  Methods  
(std.)  [6]  MLP [5]  CBM3D [7]  
15  31.37  –  33.50  33.69 
25  28.88  28.92  30.69  30.96 
50  25.94  26.00  27.37  27.64 
To minimize the objective function in Eq. (15) w.r.t the parameters we employ the LBFGS algorithm [27] (we use the available implementation of [33]). The LBFGS is a QuasiNewton method and therefore it requires the gradient of w.r.t
. This can be computed using the chainrule as
(17) 
where and is the Jacobian of the output of the th stage, which can be computed using Eq. (9). We omit the details about the computation of the derivatives w.r.t specific network parameters and we provide their derivations in the appendix. Here, it suffices to say that the gradient of the loss function can be efficiently computed using the backpropagation algorithm [32], which is a clever implementation of the chainrule.
For the greedytraining we run 100 LBFGS iterations to learn the parameters of each stage independently. Then we use the learned parameters as initialization of the network and we train all the stages jointly. The joint training corresponds to minimizing the cost function
(18) 
w.r.t to all the parameters of the network . This cost function does not take into account anymore the intermediate results but only depends on the final output of the network . In this case we run 400 LBFGS iterations to refine the result that we have obtained from the greedytraining. Similarly to the previous case, we still employ the backpropagation algorithm to compute the required gradients.
5 Experiments
To train our grayscale and color nonlocal models we generated the training data using the Berkeley segmentation dataset (BSDS) [26] which consists of 500 images. We split these images in two sets, a training set which consists of 400 images and the validation/test set which consists of the remaining 100 images. All the images were randomly cropped and their resulting size was pixel. We note that the 68 BSDS images of [30] that are used for the comparisons reported in Tables 1 and 2 are strictly excluded from the training set. The proposed models were trained on a NVIDIA Tesla K40 GPU and the software we used for training and testing was built on top of MatConvnet [36].
Grayscale denosing
Following the strategy described in Section 4, we have trained 5 stages of two different variations of our model, which we will refer to as and . The main difference between them is the configuration of the nonlocal operator. For the first network we considered patches of size while for the second one we have considered slightly larger patches of size
. In both cases, the patch stride is one, that is every pixel in the image is considered as the center of a patch. Consequently, the input images at each network stage are padded accordingly, using symmetric boundaries. In addition, a nonredundant patchtransform, which was learned by training, is applied to every imagepatch
^{2}^{2}2Similarly to variational methods, we do not penalize the DC component of the patchtransform. Therefore, the number of the transformdomain coefficients for a patch of size is equal to . and the group is formed using the closest neighbors. The similar patches are searched on the noisy input of the network in a window of centered around each pixel. The same group indices are then used for all the stages of the network.In Table 1 we report comparisons of our proposed and models with several recent stateoftheart denoising methods on the standard evaluation dataset of 68 images [30]. From these results we observe that both our nonlocal models lead to the best overall performance, with the only exception being the case of where the MLP denoising method [5] achieves a slightly better average PSNR compared to that of . It worths noting that while has a lower capacity (it uses approximately half of the parameters) than both and , it still produces better restoration results in all tested cases. This is attributed to the nonlocal information that exploits, as opposed to and which are local models. Representative grayscale denoising results that demonstrate visually the restoration quality of the proposed models are shown in Fig. 4.
Color denoising
Given that in the grayscale case the use of patches did not bring any substantial improvements compared to the use of patches, for the color case we have trained a single configuration of our model, considering only color image patches of size . Besides the standard differences, as they are described in Section 3.3, between the color and the grayscale versions of the model, the rest of the parameters about the size of the patchgroup and the search window remain the same.
An important remark to make here is that most of the denoising methods that were considered previously have been explicitly designed to treat singlechannel images, with the most notable exception being the BM3D, for which it indeed exists a colorversion (CBM3D) [7]. In practice, this means that if we need to restore colorimages then each of these methods should be applied independently on every image channel. In this case however, their denoising performance does not anymore correspond to stateoftheart. The reason is that due to their singlechannel design they fail to capture the existing correlations between the image channels, and this limitation has a direct impact in the final restoration quality. This fact is also verified by the color denoising comparisons reported in Table 2. From these results we observe that the TNRD and MLP models, which outperform BM3D for singlechannel images, fall behind in restoration performance by more than 1.3 dBs. In fact, for low noise levels CBM3D, which currently produces stateoftheart results, leads to PSNR gains that exceed 2 dBs. Comparing the proposed nonlocal model with CBM3D, we observe that manages to provide better restoration results for all the reported noise levels, with the PSNR gain ranging approximately between 0.20.3 dBs. We are not aware of any other colordenoising method that manages to compete with CBM3D on such large set of images. For a visual inspection of the color restoration performance of we refer to Figs. 1 and 9.
6 Conclusions and Future Work
In this work we have proposed a novel network architecture for grayscale and color image denoising. The design of the resulting models has been inspired by nonlocal variational methods and it exploits the nonlocal selfsimilarity property of natural images. We believe that nonlocal modeling coupled with discriminative learning are the key factors of the improved restoration performance that our models achieve compared to several recent stateoftheart methods. Meanwhile, the proposed models have direct links to convolutional neural networks and therefore can take full advantage of all the latest advances on parallel GPU computing in deep learning.
We are confident that image restoration is just one of the many inverse imaging problems that our nonlocal networks can successfully handle. We believe that a very interesting research direction is to investigate the necessary modifications on the design of our current nonlocal models that would allow them to be efficiently applied to other important reconstruction problems. Another very relevant research question is if it is possible to train a single model that can handle all noise levels.
Appendix A Derivative Calculations
In this section we provide the necessary derivations for the gradients of the loss function of the network w.r.t the parameters . We note that for all the derivative calculations we use the denominator layout notation^{3}^{3}3For the details of this notation we refer to https://en.wikipedia.org/wiki/Matrix_calculus#Denominatorlayout_notation.. Further, we recall that in order to learn the parameters of the network, which consists of stages, we use two different strategies, namely greedy and joint training. During greedy training we learn the parameters of each stage of the network independently from the parameters of the other stages by minimizing the loss function of Eq. (15). On the other hand, in joint training the complete set of the network parameters is learned simultaneously by minimizing the loss function given in Eq. (18).
a.1 SingleStage Parameter Learning
First we will consider the greedy training scheme. The results computed here will also be useful in the joint estimation scheme. Since the gradient of the overall loss in Eq. (15) is decomposed as:
(19) 
hereafter we will consider the case of a single training example . In order to retain the notation simplicity, in the following computations we will also drop the superscript from all the variables and use it only when it is necessary.
As we mentioned earlier, to compute the gradients w.r.t the network parameters we rely on the chain rule and we get
(20) 
where
(21) 
is a vector of size . Now we focus on the computation of the Jacobian of the output of the stage, , w.r.t the stage parameters. Before doing so, we recall that the output, , of a stage given an input , is computed according to the mapping
(22) 
Note that the Eq. (22) is just a modified version of Eq. (9), where the variable is absorbed by the function .
The Jacobian of w.r.t the parameters of the stage, , can now be expressed as
(23) 
where
(24) 
Regarding the projection operator , this is applied elementwise to the vector and it is defined as:
(25) 
The derivative of w.r.t is computed as:
(26) 
and therefore the Jacobian corresponds to a binary diagonal matrix of size , whose diagonal elements are nonzero only if the corresponding values of are in the range . Now, let us denote as the vector obtained by the matrix vector product of the Jacobian with the gradient , that is
(27) 
Weight parameter :
Expansion coefficients :
To compute the gradient of the loss function w.r.t to the expansion coefficients of the mixture of Gaussian RBFs, we first express the output of the RBF mixture as a vector inner product. Specifically, it holds that
(30) 
where . We note that in the definition of we have dropped the subscript from the Gaussian RBF , since we use a common precision parameter for all the mixture components, i.e. . Based on this notation we can further express as
(31) 
where , and
(32) 
Now, using Eqs. (24), (30) and (31) we have
(33) 
which directly leads us to compute the Jacobian as
(34) 
(35) 
Weighted sum coefficients :
To simplify the computation of the gradient of the loss function w.r.t , first we obtain an equivalent expression for the nonlocal operator defined in Eq. (12). Indeed, the nonlocal operator can be rewritten as
(36) 
Plugging the new expression of into Eq. (24) we get
(37) 
where . Now, it is straightforward to compute the partial derivative of w.r.t each . Based on Eq. (37), we obtain
(38) 
where . Note that due to the decoupled formulation of , the Jacobian is a diagonal matrix of the form:
(39) 
where
(40) 
Combining Eqs (27) and (38) we obtain:
(41) 
Patchtransform coefficients :
Let us express the matrix in terms of its column vectors, i.e. with . Now, let us also rewrite as
(42) 
Next, we use Eq. (42) to rewrite Eq. (24) as
(43) 
This last reformulation of greatly facilitates the computation of its Jacobian w.r.t . Now, we can show that
(44) 
where
is the identity matrix. Consequently, it holds
(45) 
a.2 Joint Parameter Learning
In the jointtraining scheme the parameters of all the stages of the network are learned simultaneously by minimizing the loss function of Eq. (18) which depends only on the final output of the network . In this case we need to compute the gradient of the loss function w.r.t the parameters of each stage . Using the chainrule this can be computed as
(46) 
where is calculated by combining Eq. (23) and the results of Section A.1, while is given by Eq. (21). Therefore, the only remaining Jacobian that we need to compute is . This quantity can be computed recursively as
(47) 
Consequently, it suffices to derive the expression for the Jacobian where is obtained from according to
(48) 
Using Eq. (48), we finally get
(49) 
where is the identity matrix, and
Appendix B Grayscale and Color Image Denoising Comparisons
In this section we provide additional grayscale and color image denoising results for different noise levels. For grayscale image denoising we compare the performance of our nonlocal models with TNRD [6], MLP [5], EPLL [41] and BM3D [7], while for color image denoising we compare our nonlocal CNN with the stateoftheart CBM3D method [7]. Besides the visual comparisons, in the captions of the figures we provide the PSNR score (in dB) of each method to also allow a quantitative comparison.
(a)  (b) 
(c)  (d) 
(e)  (f) 
(a)  (b) 
(c)  (d) 
(e)  (f) 
(a)  (b)  (c)  (d) 
(a)  (b) 
(c)  (d) 
Appendix C Acknowledgments
The author would like to thank NVIDIA for supporting this work by donating a Tesla K40 GPU.
References
 [1] C. Bailer, B. Taetz, and D. Stricker. Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In Proc. IEEE Int. Conf. on Computer Vision, pages 4015–4023, 2015.
 [2] M. Bertero and P. Boccacci. Introduction to Inverse Problems in Imaging. IOP Publishing, 1998.
 [3] K. Bredies, K. Kunisch, and T. Pock. Total generalized variation. SIAM J. Imaging Sci., 3:492–526, 2010.
 [4] A. Buades, B. Coll, and J.M. Morel. Image denoising methods. A new nonlocal principle. SIAM review, 52:113–147, 2010.

[5]
H. C. Burger, C. J. Schuler, and S. Harmeling.
Image denoising: Can plain neural networks compete with bm3d?
In
Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition
, pages 2392–2399, 2012.  [6] Y. Chen and T. Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell, 2016. to appear.
 [7] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3d transformdomain collaborative filtering. IEEE Trans. Image Process., 16(8):2080–2095, 2007.
 [8] A. Danielyan, V. Katkovnik, and K. Egiazarian. Bm3d frames and variational image deblurring. IEEE Trans. Image Process., 21(4):1715–1728, 2012.
 [9] M. Elad, P. Milanfar, and R. Rubinstein. Analysis versus synthesis in signal priors. Inverse problems, 23(3):947, 2007.
 [10] A. Elmoataz, O. Lezoray, and S. Bougleux. Nonlocal discrete regularization on weighted graphs: a framework for image and manifold processing. IEEE Trans. Image Proces., 17:1047–1060, 2008.
 [11] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pages 2147–2154, 2014.
 [12] S. Esedoglu and S. Osher. Decomposition of images by the anisotropic RudinOsherFatemi model. Communications on pure and applied mathematics, 57(12):1609–1626, 2004.
 [13] M. Figueiredo, J. BioucasDias, and R. Nowak. Majorization–minimization algorithms for waveletbased image restoration. IEEE Trans. Image Process., 16:2980–2991, 2007.
 [14] G. Gilboa and S. Osher. Nonlocal operators with applications to image processing. Multiscale Model. Simul., 7:1005–1028, 2008.
 [15] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2016.
 [17] Y. H. Hu and J.N. Hwang. Handbook of neural network signal processing. CRC press, 2001.
 [18] J. Kim, K. Lee, and K. M. Lee. Accurate image superresolution using very deep convolutional networks. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
 [19] S. Kindermann, S. Osher, and P. W. Jones. Deblurring and denoising of images by nonlocal functionals. Multiscale Model. Simul., 4:1091–1115, 2005.
 [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [21] S. Lefkimmiatis, A. Bourquard, and M. Unser. Hessianbased norm regularization for image restoration with biomedical applications. IEEE Trans. Image Process., 21(3):983–995, 2012.
 [22] S. Lefkimmiatis and S. Osher. Nonlocal Structure Tensor functionals for image regularization. IEEE Trans. Comput. Imaging, 1:16–29, 2015.
 [23] S. Lefkimmiatis, A. Roussos, P. Maragos, and M. Unser. Structure tensor total variation. SIAM J. Imaging Sci., 8:1090–1122, 2015.
 [24] S. Lefkimmiatis, J. Ward, and M. Unser. Hessian Schattennorm regularization for linear inverse problems. IEEE Trans. Image Process., 22(5):1873–1888, 2013.
 [25] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Nonlocal sparse models for image restoration. In Proc. IEEE Int. Conf. Computer Vision, pages 2272–2279, 2009.
 [26] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. IEEE Int. Conf. Computer Vision, pages 416–423, 2001.
 [27] J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
 [28] N. Parikh and S. Boyd. Proximal Algorithms. Now Publishers, 2013.
 [29] S. Ren, K. He, R. Girshick, and J. Sun. Faster rcnn: Towards realtime object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 [30] S. Roth and M. J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009.
 [31] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.
 [32] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by backpropagating errors. Nature, 323(6088):533–536, 1986.
 [33] M. Schmidt. minFunc: unconstrained differentiable multivariate optimization in Matlab. http://www.cs.ubc.ca/~schmidtm/Software/, 2005.
 [34] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pages 2774–2781, 2014.
 [35] I. Selesnick and M. Figueiredo. Signal restoration with overcomplete wavelet transforms: Comparison of analysis and synthesis priors. In SPIE (Wavelets XIII), 2009.
 [36] A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab. In Proceeding of the ACM Int. Conf. on Multimedia, 2015.
 [37] R. Vemulapalli, O. Tuzel, and M.Y. Liu. Deep Gaussian conditional random field network: A modelbased deep network for discriminative denoising. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pages 4801–4809, 2016.
 [38] C. R. Vogel. Computational Methods for Inverse Problems. SIAM, 2002.
 [39] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, pages 341–349, 2012.
 [40] D. Zhou and B. Schölkopf. Regularization on discrete spaces. In Pattern Recognition, pages 361–368. Springer, 2005.
 [41] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In Proc. IEEE Int. Conf. Computer Vision, pages 479–486. IEEE, 2011.
Comments
There are no comments yet.