I Introduction
The sheer diversity of content that can be present in photographs of natural scenes makes them a challenge for algorithms that must model their statistics for image restoration tasks, including the classical task of image denoising: recovering an estimate of a clean image from a noisy observation. A common approach is to rely on image models for local image regions—either explicitly as parametric priors or implicitly as estimators trained via regression—with parameters learned from databases of natural images [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
An important class of methods adopts a different modeling approach, exploiting self-similarity in images by relying on their “internal statistics” [11, 12, 13]. A particularly successful example from this class is the BM3D [12, 13] algorithm, which identifies sets of similar patches in noisy images using the sum of squared distances (SSD) as the matching metric, and then uses the statistics of each set to denoise the patches in that set. Applying this process twice, BM3D produces high-quality estimates that, until recently, represented the state of the art in image denoising performance.
However, recent methods have been able to exceed this performance by using neural networks trained to regress from noisy image patches to clean ones [5, 6, 7]. With carefully chosen architectures, these methods use the powerful expressive capacity of neural networks to better learn and encode image statistics from external databases, and thus exceed the capability of self-similarity-based methods. In this work, we describe a denoising method that brings the expressive capacity of neural networks to the task of identifying and leveraging recurring patterns in a natural image.
We train a neural network that takes in pairs of noisy image patches and produces a set of matching scores corresponding to its estimate of the similarity between their noise-free versions. Each patch is denoised by computing an average of other similar patches in a local neighborhood, weighted by these scores. Instead of simply using a single matching score for each patch pair, we consider wavelet coefficients of each patch in a decorrelated color space. Our network produces distinct matching scores for different sets of coefficient pairs, expressing the fact that two patches might share patterns at some orientations and scales, but not at others. Accordingly, the denoised patch is constructed by separately averaging different coefficients based on their respective scores. This process is applied to all overlapping patches to form an initial estimate of the denoised image.
The matching network is trained with respect to denoising quality, i.e., to ensure that patches formed by weighted averaging based on the network’s outputs are close to the true noise-free patches. We describe a two-step approach to training: a pre-training step where we optimize denoising performance from averaging only pairs of patches, followed by end-to-end training involving averaging the full set of candidate patches for each reference patch. We find that the pre-training step allows the network to converge to a better solution.
Experimental results show that the initial match-averaged estimates produced by our method are significantly more accurate than those produced by BM3D, as well as by other internal denoising approaches. Moreover, we combine the power of internal and external modeling by providing these initial estimates, along with the original noisy image, as input to a standard regression-based denoising network. This leads to a further improvement in quality, and we find that our overall method outperforms external-only denoising approaches to achieve state-of-the-art denoising quality.
II Related Work
Denoising is a classical problem in image restoration. In addition to its practical utility in improving the quality of photographs taken in low light or by cheaper sensors, image denoisers can be used as generic “plug-and-play” priors within iterative approaches to solve a larger variety of image restoration tasks (e.g., [14, 15, 7, 16]).
Many classical approaches to image denoising are based on exploiting statistics of general natural images, using estimators or explicit statistical priors [1, 2, 3, 4] whose parameters are learned from datasets of clean images. A different category of approaches uses patch recurrence or self-similarity [11, 12] to address the fact that there is significant diversity in content across images, while the variation within a specific image is far more limited. There is a natural trade-off between these two approaches: methods based on external statistics have the opportunity to learn them from clean images, but these statistics may be too general for a specific image; while those based on self-similarity work with a more relevant model for each image, but must find a way to derive its parameters from the noisy observation itself. We refer the reader to the work of Mosseri et al. [17] for an insightful discussion.
Until recently, the most successful denoising algorithm was one based on self-similarity: BM3D [12]. It works by organizing similar patches into groups (using SSD as the matching metric, over two rounds of matching), and denoising the patches in each group based on the group’s statistics through collaborative filtering. However, many recent denoising methods [5, 6, 7, 9, 8, 10] have been able to surpass BM3D’s performance using estimators trained on external datasets, leveraging the powerful implicit modeling capacity of deep neural networks.
In this work, we propose a new method that uses neural networks to identify and exploit self-similarity in noisy images. Recent work by Lefkimmiatis [18] and by Yang and Sun [19] shares this goal. They propose interesting approaches based on designing network architectures that “unroll” and carry out the computations of BM3D and non-local means denoising, and then train the parameters of these steps discriminatively through back-propagation, resulting in performance gains over the baseline methods. In contrast, we employ a substantially different approach: denoising in our framework is achieved by weighted averaging of different subband coefficients, with our neural network tasked with finding these matching weights. As our experiments show, this approach leads to better denoising performance.
The primary component of our method is a network that must learn to match patches through noise. Several neural network-based methods have been proposed to solve matching problems [20, 21, 22], with the goal of finding correspondences across images for tasks like stereo. Our method is motivated by their success, and we borrow several design principles from their architectures. However, our matching network has a completely different task: denoising. It is thus trained with a loss optimized for denoising (as opposed to classification or triplet losses), and instead of predicting a single matching score for a pair of patches, it produces a richer description of their commonality, with distinct scores for different subbands.
III Proposed Denoising Algorithm
Our goal is to produce an estimate of an image x given an observation y that is degraded by i.i.d. Gaussian noise, i.e.,

(1)  y = x + ε,  ε ∼ N(0, σ²I),

where σ denotes the noise standard deviation.
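As a concrete illustration (not part of the paper's pipeline), the observation model in (1) can be simulated with a few lines of NumPy; the image size and intensity range here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, sigma):
    """Simulate the observation model y = x + eps, with eps ~ N(0, sigma^2 I)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

# Example: a flat gray image corrupted at sigma = 25 gray levels.
clean = np.full((64, 64, 3), 128.0)
noisy = add_gaussian_noise(clean, 25.0)
```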
Our algorithm leverages the notion that many patterns will occur repeatedly in different regions of the underlying clean image x, while the noise in those regions of y will be uncorrelated and can be attenuated by averaging. In this section, we describe our approach to training and using a deep neural network to identify these recurring patterns in the noisy image, and to forming an initial estimate of x by averaging matched patterns. The full version of our method then uses a second standard network to regress to the final denoised output from a combination of these initial estimates and the original noisy observation.
III-A Denoising by Averaging Recurring Patterns
Our initial estimate is formed by denoising individual patches of the image, computing for each a weighted average over neighboring noisy patches with weights provided by a matching network. Formally, given the noisy observation y of an image x, we consider the set of overlapping patches {P_i y} (corresponding to clean versions {P_i x}), where each P_i is a linear operator that crops out the intensities of a different square patch (of size 8×8 in our implementation) from the image. We then use a decorrelating color space followed by a Haar wavelet transform to obtain coefficients f_i = W P_i y, where W is a unitary matrix representing the color and wavelet transforms. Note that since we assume the noise in y is i.i.d. Gaussian and W is unitary, f_i is also a noisy observation of the clean coefficients W P_i x with i.i.d. noise of the same variance. We group these coefficients into sets {f_i^g}, where each set g includes all coefficients with the same orientation (horizontal, vertical, or diagonal derivative), scale or pyramid level, and color channel.^1 Then, for every patch i we consider a set N_i of candidate matches composed of other noisy patches in the image, typically restricted to some local neighborhood around patch i. As illustrated in Fig. 1, our method produces an estimate of the denoised coefficients as a weighted average of the corresponding coefficients of the candidate patches:

^1 For 8×8 patches, this gives us 30 groups: 27 corresponding to 3 color channels, 3 scales, and 3 derivative orientations; and an additional 3 for the scaling / DC coefficients of the 3 color channels.
(2)  f̂_i^g = ( f_i^g + Σ_{j ∈ N_i} m_{ij}^g f_j^g ) / ( 1 + Σ_{j ∈ N_i} m_{ij}^g ),

where the m_{ij}^g are scalar matching weights that represent a prediction of the similarity between the coefficient sets f_i^g and f_j^g of patches i and j respectively, and the reference patch itself contributes with unit weight.
This gives us a denoised estimate W^T f̂_i for each patch in the image. We then obtain an estimate of the full clean image simply by averaging the overlapping denoised patches, i.e., the denoised estimate of each pixel is computed as the average of its estimates from all patches that contain it.
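As an illustrative sketch of the averaging in (2), the following NumPy snippet averages the coefficients of a reference patch with those of its candidates, one group at a time. The convention that the reference patch contributes with unit weight is an assumption of this sketch, and the array shapes are placeholders:

```python
import numpy as np

def denoise_coeffs(f_ref, f_cand, m, groups):
    """Per-group weighted averaging of patch coefficients (sketch of (2)).

    f_ref:  (D,) wavelet/color coefficients of the reference patch.
    f_cand: (J, D) coefficients of the J candidate patches.
    m:      (J, G) matching scores in [0, 1], one per candidate and group.
    groups: (D,) integer group index (scale/orientation/channel) per coefficient.
    The reference patch itself participates with weight 1 (assumed convention).
    """
    est = np.empty_like(f_ref)
    for g in range(m.shape[1]):
        sel = groups == g
        w = m[:, g]                            # scores for this group
        num = f_ref[sel] + w @ f_cand[:, sel]  # weight-1 reference + weighted candidates
        est[sel] = num / (1.0 + w.sum())
    return est
```

With all scores equal to 1, every group reduces to a plain mean over the reference and candidates; with all scores 0, the noisy reference is returned unchanged.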
III-B Predicting Matches from Noisy Observations
The success of our match-averaging strategy in (2) depends on obtaining optimal values for the matching scores m_{ij}^g. Intuitively, we want m_{ij}^g to be high when the clean versions of the coefficients f_i^g and f_j^g are close, so that the averaging in (2) will attenuate noise and yield an estimate close to the clean coefficients. Conversely, we want m_{ij}^g to be low when the two sets of underlying clean coefficients are not similar, because averaging them would yield poor results, potentially worse than the noisy observation itself. However, note that while the optimal values of these matching scores depend on the characteristics of the clean coefficients, we only have access to their noisy counterparts.
Therefore, we train a neural network to predict the matching scores given a pair of larger noisy patches centered around patches i and j respectively, with the network outputting a vector of matching scores, one for each set of coefficients. We do not require the output of the network to be symmetric (m_{ij}^g need not be the same as m_{ji}^g), and we use the same network model for evaluating all patch pairs, agnostic to their absolute or relative locations.
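To make the mapping from a patch pair to a score vector concrete, here is a toy fully-connected comparison network in NumPy. The feature dimension, layer widths, and random weights are illustrative stand-ins, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

def match_scores(feat_a, feat_b, weights):
    """Sketch of a comparison sub-network: concatenated patch features are
    passed through fully-connected ReLU layers, with a final sigmoid that
    yields one score per coefficient group.
    """
    h = np.concatenate([feat_a, feat_b])
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)             # ReLU hidden layers
    W, b = weights[-1]
    return 1.0 / (1.0 + np.exp(-(W @ h + b)))      # sigmoid -> scores in (0, 1)

# Toy instantiation: 64-dim patch features, 30 output scores (one per group).
dims = [128, 96, 96, 30]
weights = [(0.1 * rng.standard_normal((dims[i + 1], dims[i])),
            np.zeros(dims[i + 1])) for i in range(len(dims) - 1)]
scores = match_scores(rng.standard_normal(64), rng.standard_normal(64), weights)
```

The sigmoid output keeps every score strictly inside (0, 1), so the averaging weights in (2) are always valid.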
The matching network has a Siamese-like architecture, as illustrated in Fig. 2. It begins with a common feature extraction sub-network applied to both input patches to produce a feature vector for each. This sub-network includes a total of fourteen convolutional layers with skip connections [23], including at the final output (see Fig. 2). The computed feature vectors for the two inputs are then concatenated and passed through a comparison sub-network, which comprises a set of five fully-connected layers. All layers have ReLU activations, except for the last, which uses a sigmoid to produce the match scores. These scores are thus constrained to lie in (0, 1). Note that during inference, the feature extraction sub-network needs to be applied only once to compute feature vectors for all patches in a fully-convolutional way. Only the final five fully-connected layers need to be repeatedly applied for different patch pairs.

III-C Training
We train the matching network to produce matching scores that are optimal with respect to the quality of the denoised patches. Specifically, we use an L2 loss between the true and estimated clean patches (equivalently, since W is unitary, between their coefficients):

(3)  L = Σ_i ‖ f̂_i − f̄_i ‖²,
where the denoised coefficients f̂_i are computed using (2) based on matching scores predicted by the network, and f̄_i = W P_i x are the corresponding clean coefficients. Note that the loss for a single patch depends on the matching scores produced by the network for patch i paired with all candidate patches in its neighborhood N_i.
While it is desirable to train the network in this end-to-end manner with our actual denoising approach, we empirically find that training the network with this loss from a random initialization often converges to a suboptimal local minimum. We hypothesize that this is because we compute gradients corresponding to a large number of matching scores (for all candidates in N_i) with respect to supervision from a single denoised patch, which also limits the number of reference patches we are able to fit into a single batch at each iteration.
Therefore, we adopt a pre-training strategy using a loss defined on pairs of patches at a time. Specifically, we consider a simplified loss for denoising patch i by averaging it with a single candidate patch j:

(4)  L_{ij} = ‖ ( f_i + m_{ij} ∘ f_j ) / ( 1 + m_{ij} ) − f̄_i ‖²,

where m_{ij} ∘ f_j denotes applying each group's score m_{ij}^g to the coefficients f_j^g in that group, with the normalization also per group.
This is equivalent to the actual loss in (3) when the averaging in (2) is performed with only one candidate patch j, up to dropping the cross term between the noise in f_i and the difference between the two patches, i.e., assuming the noise is uncorrelated with that difference. It is interesting to note here that if we assume the deviations between the reference and candidate patches are uncorrelated across different candidates j, then the optimal averaging weight for a given candidate is the same whether averaging with one or with multiple candidates. This is why the above modified loss serves as a good initial proxy for pre-training. However, since the uncorrelated-deviation assumption will likely not hold in practice, we follow this with training with the actual loss in (3).
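A minimal sketch of this pairwise pre-training objective, under the assumption that (4) takes the direct form of (2) restricted to a single candidate; the per-group application of the score is collapsed to a single scalar here for brevity:

```python
import numpy as np

def pair_loss(f_i, f_j, m, fbar_i):
    """Squared error of the two-patch average against the clean coefficients.

    f_i, f_j: noisy coefficients of the reference and candidate patch.
    m:        matching score in [0, 1] (one scalar here; per-group in the paper).
    fbar_i:   clean coefficients of the reference patch (known during training).
    """
    est = (f_i + m * f_j) / (1.0 + m)  # weight-1 reference averaged with candidate
    return np.sum((est - fbar_i) ** 2)
```

For a candidate whose coefficients happen to equal the clean reference, this loss decreases monotonically as m grows toward 1, which is the behavior the matching network is pushed toward during pre-training.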
In particular, we pre-train the network for a number of iterations using the modified loss in (4)—constructing the training set by considering all non-overlapping patches in an image, with random shuffling to select a candidate j for each patch i, and training with respect to the loss of matching i to j and vice-versa. This allows us to include a much more diverse set of patches in each training batch, and to make maximal use of the feature extraction computation during training. The pre-training step is followed by training the network with the true loss in (3) until convergence—here, we extract a smaller number of reference patches from each image, along with all their neighboring candidates.
III-D Final Estimates via Regression
The initial estimates produced by our method as described above are already significantly more accurate than those produced by BM3D. Nevertheless, (2) is restricted to expressing every clean patch as a weighted average of observed noisy patches, which limits its denoising ability in certain regions and patterns. To overcome this and achieve further improvements in quality, we use a second network trained via traditional regression to derive our final denoised estimate. Specifically, we adopt the architecture of IRCNN [7], which has seven dilated convolutional layers. In our case, this network takes a six-channel input formed by concatenating the original noisy input and our initial denoised estimate from match-based averaging. The output of the last layer is interpreted as a residual and added to the initial estimate to yield the final denoised image.
After the matching network has been trained, we generate sets of clean, noisy, and initial denoised estimates. These serve as the training set for the second network, which is trained using an L2 regression loss. We find that this step leads to further improvement over our initial estimates, while also outperforming state-of-the-art denoising networks (including IRCNN [7]) that are based only on direct regression.
IV Experiments
Preliminaries.
We train our algorithm on a set of 1600 color images from the Waterloo Exploration dataset [24], and 168 images from the BSD300 [25] train set, using the remaining 32 images for validation and parameter setting. We train different models for different noise levels, generating noisy observations by adding Gaussian noise to the clean images. Unless otherwise specified, we construct the candidate set N_i for a patch by considering all overlapping patches in a square search window around it (i.e., the top-left corners of all candidate patches lie within a window centered on the top-left corner of the reference patch).
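The candidate enumeration described above can be sketched as follows; treating the search window as a square of odd side length centered on the reference patch's top-left corner is an assumption of this sketch:

```python
def candidate_offsets(window):
    """Top-left-corner offsets of candidate patches within a window x window
    square centered on the reference patch's corner, excluding the reference
    itself at offset (0, 0)."""
    r = window // 2
    return [(dy, dx) for dy in range(-r, r + 1)
                     for dx in range(-r, r + 1) if (dy, dx) != (0, 0)]
```

A window of side w yields w*w - 1 candidates per reference patch, which is why run time grows quadratically with window size (see Table III).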
We use Adam [26] to train both the matching and regression networks. We pre-train the matching network for 100k iterations using the modified loss in (4), with batches of 16 cropped images. Pairing all non-overlapping patches with randomly shuffled counterparts, this generates more than 46k ordered matching pairs for training in each batch. We then continue training the matching network with the true loss in (3), in this case forming a batch of 256 unique reference patches from various images and computing matching scores for each with respect to all its candidates. We train for a total of 600k iterations, dropping the learning rate once at 400k iterations and again at 500k. Once the matching network is trained, we store noisy and denoised versions of our training set, and use these to train the refinement network. We again use the same training schedule: a total of 600k iterations with learning rate drops at 400k and 500k. Code and trained models for our method are available at https://projects.ayanc.org/rpcnn/.
Results.
We train and evaluate our method at four different noise levels, corresponding to standard deviations of σ = 25, 35, 50, and 75 gray levels. We begin by evaluating the quality of our initial estimates generated using “internal” statistics alone, i.e., from weighted averaging based on matching network outputs. In Table I, we report their PSNR values for different noise levels on the standard CBSD68 test set [2] of color images. We compare these to results from CBM3D [13], and find that our estimates are much more accurate despite the fact that we only perform one round of matching, and denoise simply by averaging instead of collaborative filtering. Our initial estimates are also better than the results from BM3D-Net [19] and CNLNet [18]—two neural network-based “internal” denoising methods that are designed by unrolling the computational steps of BM3D and non-local means denoising, and training their parameters discriminatively.

Table I: PSNR (dB) of internal-only denoising on CBSD68.

Method            | σ=25  | σ=35  | σ=50  | σ=75
CBM3D [13]        | 30.69 | 28.86 | 27.36 | 25.73
BM3D-Net [19]     | 30.91 |   —   | 27.48 |   —
CNLNet [18]       | 30.96 |   —   | 27.64 |   —
Ours (Match Only) | 31.00 | 29.40 | 27.83 | 26.15
Then, in Table II, we evaluate the performance of our overall method, i.e., based on the regression network applied to our initial estimates. Here, we compare to three state-of-the-art color denoising methods (IRCNN [7], DnCNN [8], and FFDNet [9]) on CBSD68 as well as the McMaster [27] and Kodak24 [28] datasets. We see that our results are consistently better across all datasets, except at the lowest noise level on the McMaster dataset, where FFDNet has essentially equivalent performance (with a slightly higher PSNR of 0.02 dB). Interestingly, our results are better than those of IRCNN—by as large a margin as 0.2 to 0.44 dB at the σ = 50 noise level—despite the fact that our regression network uses an identical architecture. This improvement is therefore due entirely to the fact that, in our setting, the regression network has access to the initial denoised estimates based on our approach to exploiting internal image statistics.
We include examples of denoised images in Fig. 3 for a qualitative evaluation. We see that the initial estimates from our method are often already of high quality, and that the second regression step improves these results by removing certain localized distortions and subtle artifacts. We also find that our overall method is often better at reconstructing textures and image detail than state-of-the-art denoising methods.
Visualizing Matching Scores.
We next take a closer look at the behavior of the matching network in Fig. 4. For a number of reference patches and corresponding search windows, cropped from different training images, we visualize the matching scores predicted by our network. We show the average matching score across all subbands, as well as average weights corresponding to combinations of sets at the same wavelet scale (averaging over color channels and orientations), and at the same orientation (averaging over scales and colors). Our results show that for any pair of patches, the network produces very different averaging weights for different subbands.
We find that the weights tend to be generally higher at the finest scale (indicating more averaging), and lowest for the scaling coefficients. This is likely because the highest-frequency coefficients are close to zero in most patches, and thus close to each other. For lower frequencies and DC values, the network selects only those patches that are close to the reference patch (in the clean image). For different orientations, the high matches are sometimes concentrated at different locations for the same reference, especially when there are strong edges and repeating textures. Thus, free from the restriction of matching patches as a whole with a single score, our algorithm finds different distributions of matches for different subbands in order to achieve optimal denoising.
Table II: PSNR (dB) of our overall method vs. state-of-the-art regression-based denoisers.

Dataset  | Method             | σ=25  | σ=35  | σ=50  | σ=75
CBSD68   | IRCNN [7]          | 31.16 | 29.50 | 27.86 |   —
         | CDnCNN [8]         | 31.23 | 29.58 | 27.92 | 24.47
         | FFDNet [9]         | 31.21 | 29.58 | 27.96 | 26.24
         | Ours (Match+Regr.) | 31.24 | 29.64 | 28.06 | 26.39
McMaster | IRCNN [7]          | 32.18 | 30.59 | 28.91 |   —
         | CDnCNN [8]         | 31.51 | 30.14 | 28.61 | 25.10
         | FFDNet [9]         | 32.35 | 30.81 | 29.18 | 27.33
         | Ours (Match+Regr.) | 32.33 | 30.90 | 29.35 | 27.59
Kodak24  | IRCNN [7]          | 32.03 | 30.43 | 28.81 |   —
         | CDnCNN [8]         | 32.03 | 30.46 | 28.85 | 25.04
         | FFDNet [9]         | 32.13 | 30.57 | 28.98 | 27.27
         | Ours (Match+Regr.) | 32.34 | 30.81 | 29.25 | 27.56
Table III: Denoising quality and run time of our initial estimates for different search window sizes.

Window Size | 15    | 23    | 31    | 31 (No Pre-training)
PSNR (dB)†  | 31.31 | 31.40 | 31.46 | 31.35
Run Time‡   | 1.07s | 2.47s | 4.42s | 4.42s

†: On the validation set of BSD images.
‡: Per input image on a 1080Ti GPU.
Effect of Window Size and Pre-training.
In Table III, we characterize the trade-off between quality and computational cost when choosing different search window sizes over which to match and average patches. For each window size, we report the average PSNR (over our validation set) of our initial estimates at a fixed noise level. We also report the corresponding run time required to compute matching scores and perform the averaging, measured per input image on an NVIDIA 1080Ti GPU. Note that computing the initial estimates takes the majority of the time in our denoising method—the subsequent regression step takes only an additional 0.01 seconds, and is independent of window size.
As expected, run time goes up roughly linearly with the number of candidate matches (i.e., as the square of the search window size), but we find that the drop in PSNR is a modest 0.15 dB when going down from a 31 to a 15 search window. Table III also demonstrates the importance of pre-training, reporting the performance (again, of our initial estimates) achieved by a network initialized with random weights instead of pre-trained weights. We find that this leads to a PSNR drop of about 0.1 dB, highlighting that pre-training is critical for convergence to a good network model.
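As a quick sanity check on the quadratic-scaling claim, the measured run times in Table III can be compared against the ratio predicted from candidate counts alone:

```python
# Measured run times from Table III (seconds per image on a 1080Ti GPU).
times = {15: 1.07, 23: 2.47, 31: 4.42}

def predicted_ratio(w_small, w_large):
    """Expected run-time ratio if cost is proportional to the number of
    candidates, which grows as the square of the window side."""
    return (w_large / w_small) ** 2

measured = times[31] / times[15]     # ~4.13x
predicted = predicted_ratio(15, 31)  # ~4.27x: scaling is roughly quadratic
```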
V Conclusion
In this work, we presented a novel method that employs a neural network to characterize the similarity between pairs of patches from noisy observations, in terms of similarity scores for their corresponding subband components. We showed that this network can be used to identify and exploit recurring patterns in an image for denoising, and that our algorithm recovers high-quality estimates of the clean image from noisy observations. We believe that the potential of using neural networks to discover and leverage self-similarity is still largely untapped. A natural direction for future work is exploring applications to other restoration and estimation tasks for images (such as blind deblurring), and to other image-like signals such as depth maps and motion fields.
Acknowledgments
This work was supported by the National Science Foundation under award no. IIS-1820693.
References
 [1] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Transactions on Image Processing, 2006.
 [2] S. Roth and M. J. Black, “Fields of experts,” IJCV, 2009.
 [3] U. Schmidt and S. Roth, “Shrinkage fields for effective image restoration,” in Proc. CVPR, 2014.
 [4] D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” in Proc. ICCV, 2011.
 [5] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?” in Proc. CVPR, 2012.
 [6] Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” PAMI, 2017.
 [7] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” in Proc. CVPR, 2017.
 [8] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
 [9] K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN-based image denoising,” IEEE Transactions on Image Processing, 2018.

 [10] C. Chen, Z. Xiong, X. Tian, and F. Wu, “Deep boosting for image denoising,” in Proc. ECCV, 2018.
 [11] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in Proc. CVPR, 2005.
 [12] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Transactions on Image Processing, 2007.
 [13] ——, “Color image denoising via sparse 3-D collaborative filtering with grouping constraint in luminance-chrominance space,” in Proc. ICIP, 2007.
 [14] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plugandplay priors for model based reconstruction,” in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2013.
 [15] J.-H. R. Chang, C.-L. Li, B. Poczos, B. V. Kumar, and A. C. Sankaranarayanan, “One network to solve them all—solving linear inverse problems using deep projection models,” in Proc. ICCV, 2017.
 [16] S. A. Bigdeli, M. Zwicker, P. Favaro, and M. Jin, “Deep meanshift priors for image restoration,” in Advances in Neural Information Processing Systems, 2017.
 [17] I. Mosseri, M. Zontak, and M. Irani, “Combining the power of internal and external denoising,” in Proc. ICCP, 2013.

 [18] S. Lefkimmiatis, “Non-local color image denoising with convolutional neural networks,” in Proc. CVPR, 2017.
 [19] D. Yang and J. Sun, “BM3D-Net: A convolutional neural network for transform-domain collaborative filtering,” IEEE Signal Processing Letters, 2018.

 [20] J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, 2016.
 [21] B. Kumar, G. Carneiro, and I. Reid, “Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions,” in Proc. CVPR, 2016.
 [22] S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in Proc. CVPR, 2015.
 [23] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. CVPR, 2017.
 [24] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo exploration database: New challenges for image quality assessment models,” IEEE Transactions on Image Processing, 2017.
 [25] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. ICCV, 2001.
 [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

 [27] L. Zhang, X. Wu, A. Buades, and X. Li, “Color demosaicking by local directional interpolation and nonlocal adaptive thresholding,” Journal of Electronic Imaging, vol. 20, no. 2, p. 023016, 2011.
 [28] R. Franzen, “Kodak lossless true color image suite,” source: http://r0k.us/graphics/kodak, vol. 4, 1999.