The sheer diversity of content that can be present in photographs of natural scenes makes them a challenge for algorithms that must model their statistics for various image restoration tasks, including the classical task of image denoising: recovering an estimate of a clean image from a noisy observation. A common approach is to rely on image models for local image regions—either explicitly as parametric priors or implicitly as estimators trained via regression—with parameters learned on databases of natural images [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
An important class of methods adopts a different modeling approach, exploiting self-similarity in images by relying on their "internal statistics" [11, 12, 13]. A particularly successful example from this class is the BM3D [12, 13] algorithm, which identifies sets of similar patches in noisy images using the sum of squared distances (SSD) as the matching metric, and then uses the statistics of each set to denoise patches in that set. Applying this process twice, BM3D produced high-quality estimates that, until recently, represented the state-of-the-art in image denoising performance.
However, recent methods have been able to exceed this performance by using neural networks trained to regress to clean image patches from noisy ones [5, 6, 7]. With carefully chosen architectures, these methods are able to use the powerful expressive capacity of neural networks to better learn and encode image statistics from external databases, and thus exceed the capability of self-similarity based methods. In this work, we describe a denoising method that brings the expressive capacity of neural networks to the task of identifying and leveraging recurring patterns in a natural image.
We train a neural network that takes in pairs of noisy image patches and provides a set of matching scores corresponding to its estimate of the similarity between their noise-free versions. Each patch is denoised by computing an average of other similar patches in a local neighborhood, weighted by these scores. Instead of simply using a single matching score for each patch-pair, we consider wavelet coefficients of each patch in a de-correlated color space. Our network produces distinct matching scores for different sets of coefficient pairs, expressing the fact that two patches might share patterns at some orientations and scales, but not others. Accordingly, the denoised patch is constructed by separately averaging different coefficients based on their respective scores. This process is applied to all overlapping patches to form an initial estimate of the denoised image.
The matching network is trained with respect to denoising quality, i.e., to ensure that patches formed by weighted averaging based on the network’s outputs are close to the true noise-free patch. We describe a two-step approach to training—a pre-training step where we optimize denoising performance from averaging only pairs of patches, followed by end-to-end training involving averaging the full set of candidate patches for each reference patch. We find that the pre-training step allows the network to converge to a better solution.
Experimental results show that the initial match-averaged estimates produced by our method are significantly more accurate than those produced by BM3D, as well as other internal denoising approaches. Moreover, we combine the power of internal and external modeling by providing these initial estimates, along with the original noisy image, as input to a standard regression-based denoising network. This leads to further improvement in quality, and we find that our overall method outperforms external-only denoising approaches to achieve state-of-the-art denoising quality.
II Related Work
Denoising is a classical problem in image restoration. In addition to their practical utility in improving the quality of photographs taken in low light or with cheaper sensors, image denoisers can be used as generic "plug-and-play" priors within iterative approaches to solve a wider variety of image restoration tasks (e.g., [14, 15, 7, 16]).
Many classical approaches to image denoising are based on exploiting the statistics of general natural images, using estimators or explicit statistical priors [1, 2, 3, 4] whose parameters are learned from datasets of clean images. A different category of approaches uses patch-recurrence or self-similarity [11, 12] to address the fact that there is significant diversity in content across images, while the variation within a specific image is far more limited. There is a natural trade-off between these two approaches: methods based on external statistics have the opportunity to learn them from clean images, but these statistics may be too general for a specific image; those based on self-similarity work with a more relevant model for each image, but must find a way to derive its parameters from the noisy observation itself. We refer the reader to the work of Mosseri et al. [17] for an insightful discussion.
Until recently, the most successful denoising algorithm was one based on self-similarity: BM3D [12, 13]. It works by organizing similar patches into groups (using SSD as the matching metric, over two rounds of matching), and denoising the patches in each group based on the group's statistics through collaborative filtering. However, many recent denoising methods [5, 6, 7, 9, 8, 10] have been able to surpass BM3D's performance using estimators trained on external datasets, leveraging the powerful implicit modeling capacity of deep neural networks.
In this work, we propose a new method that uses neural networks to identify and exploit self-similarity in noisy images. Recent work by Lefkimmiatis [18] and by Yang and Sun [19] shares the same goal. They propose interesting approaches based on designing network architectures that "un-roll" and carry out the computations of BM3D and non-local means denoising, and then train the parameters of these steps discriminatively through back-propagation, resulting in performance gains over the baseline methods. In contrast, we employ a substantially different approach: denoising in our framework is achieved by weighted averaging of different sub-band coefficients, with our neural network tasked with finding these matching weights. As our experiments show, this approach leads to better denoising performance.
The primary component of our method is a network that must learn to match patches through noise. Several neural network-based methods have been proposed to solve matching problems [20, 21, 22], with the goal of finding correspondences across images for tasks like stereo. Our method is motivated by their success, and we borrow several design principles of their architectures. However, our matching network has a completely different task: denoising. Our network is thus trained with a loss optimized for denoising (as opposed to classification or triplet losses), and instead of predicting a single matching score for a pair of patches, it produces a richer description of their commonality, with distinct scores for different sub-bands.
III Proposed Denoising Algorithm
Our goal is to produce an estimate of an image $x$ given an observation $y$ that is degraded by i.i.d. Gaussian noise, i.e.,

$$y = x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I). \qquad (1)$$
Our algorithm leverages the notion that many patterns will occur repeatedly in different regions of the underlying clean image $x$, while the noise in those regions of $y$ will be un-correlated and can be attenuated by averaging. In this section, we describe our approach to training and using a deep neural network to identify these recurring patterns from the noisy image, and to form an initial estimate of $x$ by averaging matched patterns. The full version of our method then uses a second, standard network to regress to the final denoised output from a combination of these initial estimates and the original noisy observation.
III-A Denoising by Averaging Recurring Patterns
Our initial estimate is formed by denoising individual patches in the image, computing a weighted average over neighboring noisy patches with weights provided by a matching network. Formally, given the noisy observation $y$ of an image $x$, we consider the set of overlapping patches $\{P_i y\}$ (corresponding to clean versions $\{P_i x\}$), where each $P_i$ is a linear operator that crops out the intensities of a different square patch (of size $8\times 8$ in our implementation) from the image. We then use a de-correlating color space followed by a Haar wavelet transform to obtain coefficients $z_i = W P_i y$, where $W$ is a unitary matrix representing the color and wavelet transforms. Note that since we assume the noise in $y$ is i.i.d. Gaussian and $W$ is unitary, $z_i$ is also a noisy observation of the clean coefficients $\bar{z}_i = W P_i x$, with i.i.d. noise of the same variance.
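The transform above can be sketched in a few lines of numpy. The snippet below builds a unitary $W$ from an orthonormal Haar wavelet matrix and an orthonormalized opponent-style color matrix (the specific color matrix here is an assumption for illustration, not necessarily the one used in our implementation), and verifies the unitarity that preserves the i.i.d. noise statistics:

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal 1-D Haar wavelet matrix of size n (n a power of 2),
    built recursively; rows hold scaling/wavelet filters at all levels."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0]) / np.sqrt(2.0)                 # averages
    bot = np.kron(np.eye(n // 2), [1.0, -1.0]) / np.sqrt(2.0)   # details
    return np.vstack([top, bot])

# Orthonormalized RGB -> opponent color transform (an assumption here;
# any fixed orthonormal de-correlating color matrix would do).
C = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, -1.0],
              [1.0, -2.0, 1.0]])
C /= np.linalg.norm(C, axis=1, keepdims=True)

H = haar_matrix(8)      # 1-D Haar for 8x8 patches
W2 = np.kron(H, H)      # separable 2-D Haar on 64 pixels
W = np.kron(C, W2)      # full transform on 3 * 64 = 192 values per patch

# W is unitary, so i.i.d. Gaussian noise stays i.i.d. with the same variance.
assert np.allclose(W @ W.T, np.eye(192), atol=1e-10)
```

Because both factors of the Kronecker product are orthonormal, their product is as well, which is what justifies working in the coefficient domain without changing the noise model.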
We group these coefficients into sets $\{z^g_i\}$, where each set includes all coefficients with the same orientation (horizontal, vertical, or diagonal derivative), scale or pyramid level, and color channel. (For $8\times 8$ patches, this gives us 30 groups: 27 corresponding to 3 color channels, 3 scales, and 3 derivative orientations; and an additional 3 groups for the scaling / DC coefficients of the 3 color channels.) Then, for every patch $i$, we consider a set of candidate matches composed of other noisy patches $\{P_j y\}_{j \in \mathcal{N}_i}$ in the image, typically restricted to some local neighborhood $\mathcal{N}_i$ around $i$. As illustrated in Fig. 1, our method produces an estimate of the denoised coefficients as a weighted average of the corresponding coefficients of the candidate patches:

$$\hat{z}^g_i = \frac{z^g_i + \sum_{j \in \mathcal{N}_i} m^g_{ij}\, z^g_j}{1 + \sum_{j \in \mathcal{N}_i} m^g_{ij}}, \qquad (2)$$

where $m^g_{ij} \in [0, 1]$ are scalar matching weights that represent a prediction of the similarity between the sets of coefficients $\bar{z}^g_i$ and $\bar{z}^g_j$ in patches $i$ and $j$ respectively.
This gives us a denoised estimate $\hat{p}_i = W^T \hat{z}_i$ for each patch in the image. We then obtain an estimate of the full clean image simply by averaging the denoised patches: the denoised estimate of each pixel is computed as the average of its estimates from all patches that contain it.
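As a concrete sketch of the averaging in (2), the hypothetical helper `denoise_coeffs` below averages each sub-band group separately, assuming the reference patch enters the average with unit weight; the grouping, weights, and noise level are toy stand-ins:

```python
import numpy as np

def denoise_coeffs(z_ref, z_cands, weights, groups):
    """Per-group weighted averaging of wavelet coefficients (a sketch of
    Eq. (2)): z_ref (D,), z_cands (K, D), weights (K, G) in [0, 1], and
    groups (D,) mapping each coefficient to one of G sub-band groups.
    The reference patch enters the average with unit weight."""
    num = z_ref.copy()
    den = np.ones_like(z_ref)
    for j in range(z_cands.shape[0]):
        w = weights[j, groups]      # broadcast group score to coefficients
        num += w * z_cands[j]
        den += w
    return num / den

# Toy demo: nine noisy observations of the same clean patch, full weights.
rng = np.random.default_rng(0)
clean = rng.normal(size=192)
noisy = clean + 0.5 * rng.normal(size=(9, 192))
groups = np.arange(192) % 30        # stand-in grouping into 30 sub-bands
est = denoise_coeffs(noisy[0], noisy[1:], np.ones((8, 30)), groups)

# Averaging i.i.d. noise across nine patches should cut the error sharply.
assert np.mean((est - clean) ** 2) < 0.5 * np.mean((noisy[0] - clean) ** 2)
```

With per-group weights, two patches that agree only at some orientations or scales can still contribute to each other's denoising in exactly those sub-bands.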
III-B Predicting Matches from Noisy Observations
The success of our match-averaging strategy in (2) depends on obtaining optimal values for the matching scores $m^g_{ij}$. Intuitively, we want $m^g_{ij}$ to be high when the clean coefficients $\bar{z}^g_i$ and $\bar{z}^g_j$ are close, so that the averaging in (2) will attenuate noise and yield $\hat{z}^g_i$ close to $\bar{z}^g_i$. Conversely, we want $m^g_{ij}$ to be low when the two sets of underlying clean coefficients are not similar, because averaging them would yield poor results, potentially worse than the noisy observation itself. However, note that while the optimal values of these matching scores depend on the characteristics of the clean coefficients $\bar{z}^g_i$ and $\bar{z}^g_j$, we only have access to their noisy counterparts $z^g_i$ and $z^g_j$.
Therefore, we train a neural network $\mathcal{M}$ to predict the matching scores from a pair of larger noisy patches $\tilde{p}_i$ and $\tilde{p}_j$ centered around patches $i$ and $j$ respectively: $m_{ij} = \mathcal{M}(\tilde{p}_i, \tilde{p}_j)$, where $m_{ij}$ is a vector of matching scores, one for each set of coefficients. We do not require the output of the network to be symmetric ($m_{ij}$ need not be the same as $m_{ji}$), and we use the same network model for evaluating all patch pairs, agnostic to their absolute or relative locations.
The matching network has a Siamese-like architecture, as illustrated in Fig. 2. It begins with a common feature-extraction sub-network applied to both input patches to produce a feature vector for each. This sub-network has a receptive field that covers the entire input patch, and includes a total of fourteen convolutional layers with skip connections [23], including at the final output (see Fig. 2). The computed feature vectors for the two inputs are then concatenated and passed through a comparison sub-network, which comprises a set of five fully-connected layers. All layers have ReLU activations, except for the last, which uses a sigmoid to produce the match scores. These scores are thus constrained to lie in $(0, 1)$. Note that during inference, the feature-extraction sub-network needs to be applied only once to compute feature vectors for all patches in a fully-convolutional way; only the final five fully-connected layers need to be applied repeatedly for different patch pairs.
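The overall shape of this computation can be sketched as below, with a fixed random projection standing in for the convolutional feature extractor and with assumed, illustrative layer widths and input size; only the Siamese structure, the five fully-connected layers, and the sigmoid output of 30 per-sub-band scores mirror the description above:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def features(patch, P):
    """Stand-in for the shared convolutional feature-extraction
    sub-network (a random projection here; the real network has
    fourteen conv layers with skip connections)."""
    return relu(P @ patch.ravel())

# Illustrative sizes (assumptions): 21x21 color input, 128-d features,
# 256-d hidden layers, 30 sub-band scores.
D_FEAT, D_HID, G, SIDE = 128, 256, 30, 21
P = 0.01 * rng.normal(size=(D_FEAT, 3 * SIDE * SIDE))
Ws = [0.01 * rng.normal(size=(D_HID, 2 * D_FEAT))]
Ws += [0.01 * rng.normal(size=(D_HID, D_HID)) for _ in range(3)]
W_out = 0.01 * rng.normal(size=(G, D_HID))

def match_scores(patch_i, patch_j):
    """Siamese comparison: shared features, concatenation, five
    fully-connected layers, sigmoid output of G per-sub-band scores."""
    h = np.concatenate([features(patch_i, P), features(patch_j, P)])
    for Wk in Ws:                   # four hidden FC layers with ReLU
        h = relu(Wk @ h)
    return sigmoid(W_out @ h)       # fifth FC layer + sigmoid

scores = match_scores(rng.normal(size=(3, SIDE, SIDE)),
                      rng.normal(size=(3, SIDE, SIDE)))
assert scores.shape == (30,) and np.all((scores > 0) & (scores < 1))
```

Because the feature extractor is shared and applied independently to each patch, features can be computed once per image and reused across all patch pairs, as noted above.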
III-C Training the Matching Network

We train the matching network to produce matching scores that are optimal with respect to the quality of the denoised patches. Specifically, we use an $L_2$ loss between the true and estimated clean coefficients:

$$L = \sum_i \left\| \hat{z}_i - \bar{z}_i \right\|^2, \qquad (3)$$

where the denoised coefficients $\hat{z}_i$ are computed using (2), based on the matching scores predicted by the network. Note that the loss for a single patch $i$ will depend on the matching scores produced by the network for $i$ paired with all candidate patches in its neighborhood $\mathcal{N}_i$.
While it is desirable to train the network in this end-to-end manner with our actual denoising approach, we empirically find that training the network with this loss from a random initialization often converges to a sub-optimal local minimum. We hypothesize that this is because we compute gradients corresponding to a large number of matching scores (those for all candidates in $\mathcal{N}_i$) with respect to supervision from a single denoised patch, which also limits the number of reference patches we are able to fit into a single batch at each iteration.
Therefore, we adopt a pre-training strategy using a loss defined on one pair of patches at a time. Specifically, we consider a simplified loss for denoising patch $i$ by averaging it with a single candidate patch $j$:

$$L_{ij} = \sum_g \frac{\left\| z^g_i - \bar{z}^g_i \right\|^2 + (m^g_{ij})^2 \left\| z^g_j - \bar{z}^g_i \right\|^2}{\left(1 + m^g_{ij}\right)^2}. \qquad (4)$$

This is equivalent to the actual loss in (3) when the averaging in (2) is performed with only one candidate patch $j$, after dropping the cross term between the noise $(z^g_i - \bar{z}^g_i)$ and the deviation $(z^g_j - \bar{z}^g_i)$, i.e., after assuming the noise is un-correlated with the difference between the two patches. It is interesting to note here that if we assume the deviations $(z^g_j - \bar{z}^g_i)$ are un-correlated across different candidates $j$, then the optimal averaging weight for a given candidate is the same whether averaging with one or multiple candidates. This is why the above modified loss serves as a good initial proxy for pre-training. However, since the un-correlated deviation assumption will likely not hold in practice, we follow this with training with the actual loss in (3).
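The behavior of this pairwise loss can be checked numerically. Assuming it takes the normalized quadratic form described above (reference patch with unit weight, cross term dropped), minimizing it over a single weight $m$ has the closed form $m^* = \|z_i - \bar{z}_i\|^2 / \|z_j - \bar{z}_i\|^2$, clipped to $[0, 1]$, so noise-dominated deviations earn high weights and dissimilar candidates earn low ones. A small numerical check of this (a sketch, not the actual training code):

```python
import numpy as np

def pair_loss(m, z_i, z_j, z_bar_i):
    """Pairwise pre-training loss for one sub-band (a sketch, with the
    cross term between noise and candidate deviation dropped)."""
    a = np.sum((z_i - z_bar_i) ** 2)    # energy of the reference noise
    b = np.sum((z_j - z_bar_i) ** 2)    # energy of the candidate deviation
    return (a + m ** 2 * b) / (1.0 + m) ** 2

rng = np.random.default_rng(2)
z_bar = rng.normal(size=64)                  # clean reference coefficients
z_i = z_bar + 0.3 * rng.normal(size=64)      # noisy reference
z_near = z_bar + 0.3 * rng.normal(size=64)   # similar noisy candidate
z_far = z_bar + 2.0 * rng.normal(size=64)    # dissimilar noisy candidate

# Setting the derivative to zero gives m* = a / b, clipped to [0, 1]
# since the network's sigmoid bounds its outputs.
a = np.sum((z_i - z_bar) ** 2)
m_near = min(a / np.sum((z_near - z_bar) ** 2), 1.0)
m_far = min(a / np.sum((z_far - z_bar) ** 2), 1.0)

# Verify the closed form against a brute-force grid search.
grid = np.linspace(0.0, 1.0, 10001)
m_grid = grid[np.argmin([pair_loss(m, z_i, z_near, z_bar) for m in grid])]
assert abs(m_grid - m_near) < 1e-2
assert m_far < m_near    # dissimilar candidates get lower weights
```

The closed form also makes the "un-correlated deviations" observation concrete: $m^*$ for a candidate does not depend on which other candidates participate in the average.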
In particular, we pre-train the network for a number of iterations using the modified loss in (4), constructing the training set by considering all non-overlapping patches in an image, with random shuffling to select a candidate $j$ for each patch $i$, and training with respect to the loss of both matching $i$ to $j$ and vice-versa. This allows us to include a much more diverse set of patches in each training batch, and to make maximal use of the feature-extraction computation during training. The pre-training step is followed by training the network with the true loss in (3) until convergence; here, we extract a smaller number of reference patches from each training image, along with all their neighboring candidates.
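The pre-training pair construction might be sketched as follows, with `shuffled_pairs` a hypothetical helper (in this toy version a patch may occasionally be paired with itself, which a real pipeline might exclude):

```python
import numpy as np

def shuffled_pairs(img, p=8, seed=0):
    """Pair every non-overlapping p x p patch of an image with a randomly
    shuffled counterpart, in both orders (a sketch of the pre-training
    batch construction)."""
    rows, cols = img.shape[:2]
    patches = [img[r:r + p, c:c + p]
               for r in range(0, rows - p + 1, p)
               for c in range(0, cols - p + 1, p)]
    idx = np.random.default_rng(seed).permutation(len(patches))
    pairs = [(patches[k], patches[j]) for k, j in enumerate(idx)]
    pairs += [(b, a) for a, b in pairs]   # train both (i, j) and (j, i)
    return pairs

# A 64x64 image yields 8 * 8 = 64 patches, hence 128 ordered pairs.
pairs = shuffled_pairs(np.zeros((64, 64)))
assert len(pairs) == 2 * (64 // 8) ** 2
```

Every patch in the crop contributes to the loss, which is what makes each pre-training batch far more diverse than a batch of reference patches with full candidate neighborhoods.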
III-D Final Estimates via Regression
The initial estimates produced by our method as described above are already significantly more accurate than those produced by BM3D. Nevertheless, (2) is restricted to expressing every clean patch as a weighted average of observed noisy patches, which limits its denoising ability for certain regions and patterns. To overcome this and achieve further improvements in quality, we use a second network, trained via traditional regression, to derive our final denoised estimate. Specifically, we adopt the architecture of IRCNN [7], which has seven dilated convolutional layers. In our case, this network takes a six-channel input formed by concatenating the original noisy input and our initial denoised estimate from match-based averaging. The output of the last layer is interpreted as a residual, and added to the initial estimate to yield the final denoised image output.
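Structurally, this refinement stage reduces to a residual connection around a regression network. A minimal shape-level sketch, with a zero network standing in for the IRCNN-style model (so the output just equals the initial estimate):

```python
import numpy as np

def refine(noisy, initial, net):
    """Second-stage refinement (sketch): the regression network sees the
    six-channel concatenation of the noisy image and the initial
    estimate, and predicts a residual added back to the initial estimate."""
    x = np.concatenate([noisy, initial], axis=0)   # (6, H, W) input
    return initial + net(x)                        # residual connection

# Zero-residual stand-in network for a shape check (not IRCNN itself).
zero_net = lambda x: np.zeros_like(x[:3])
noisy = np.random.default_rng(3).normal(size=(3, 16, 16))
out = refine(noisy, 0.9 * noisy, zero_net)
assert out.shape == (3, 16, 16) and np.allclose(out, 0.9 * noisy)
```

Feeding the network both the noisy image and the match-averaged estimate lets it correct residual artifacts while retaining access to the unprocessed observation.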
After the matching network has been trained, we generate sets of clean, noisy, and initial denoised estimates. These serve as the training set for this second network, which is trained using an $L_2$ regression loss. We find that this step leads to further improvement over our initial estimates, while also outperforming state-of-the-art denoising networks (including IRCNN [7]) that are based only on direct regression.
IV Experiments

We train our algorithm on a set of 1600 color images from the Waterloo Exploration dataset [24], and 168 images from the BSD-300 [25] train set, using the remaining 32 images for validation and parameter setting. We train different models for different noise levels, generating noisy observations by adding Gaussian noise to the clean images. Unless otherwise specified, we construct the candidate set for each patch by considering all overlapping patches in a $31\times 31$ search window around it (i.e., the top-left corners of all candidate patches lie within a $31\times 31$ window around the top-left corner of the reference patch).
We use Adam [26] for training both the matching and regression networks, with separate initial learning rates for each. We pre-train the matching network for 100k iterations based on the modified loss (4), with batches of 16 cropped images. Pairing all non-overlapping patches with randomly shuffled counterparts, this generates more than 46k ordered matching pairs for training in each batch. We then continue training the matching network with the true loss in (3), in this case forming a batch with 256 unique reference patches from various images, and computing matching scores for each with respect to all of its candidates. We train for a total of 600k iterations, dropping the learning rate once at 400k iterations and again at 500k iterations. Once the matching network is trained, we store noisy and denoised versions of our training set, and use these to train the refinement network with the same training schedule: a total of 600k iterations, with learning-rate drops at 400k and 500k. Code and trained models for our method are available at https://projects.ayanc.org/rpcnn/.
We train and evaluate our method at four different noise levels, corresponding to standard deviations of 25, 35, 50, and 75 gray levels. We begin by evaluating the quality of our initial estimates generated using "internal" statistics alone, i.e., from weighted averaging based on matching network outputs. In Table I, we report their PSNR values for the different noise levels on the standard CBSD-68 test set [25] of color images. We compare these to results from CBM3D [13], and find that our estimates are much more accurate despite the fact that we perform only one round of matching, and denoise based simply on averaging instead of collaborative filtering. Our initial estimates are also better than the results from BM3D-Net [19] and CNL-Net [18], two neural-network-based "internal" denoising methods that are designed by unrolling the computational steps of BM3D and non-local means denoising, and training their parameters discriminatively.
PSNR (dB) on CBSD-68:

| Noise level (σ) | 25 | 35 | 50 | 75 |
| Ours (Match Only) | 31.00 | 29.40 | 27.83 | 26.15 |
Then, in Table II, we evaluate the performance of our overall method, i.e., with the regression network applied to our initial estimates. Here, we compare to three state-of-the-art color denoising methods, IRCNN [7], DnCNN [8], and FFDNet [9], on CBSD-68 as well as the McMaster [27] and Kodak-24 [28] datasets. We see that our results are consistently better across all datasets, except at the lowest noise level on the McMaster dataset, where FFDNet has essentially equivalent performance (with a slightly higher PSNR, by 0.02 dB). Interestingly, our results are better than those of IRCNN, by margins as large as 0.2 to 0.44 dB across noise levels, despite the fact that our regression network uses an identical architecture. This improvement is therefore due entirely to the fact that, in our setting, the regression network has access to the initial denoised estimates based on our approach to exploiting internal image statistics.
We include examples of denoised images in Fig. 3 for a qualitative evaluation. We see that the initial estimates from our method are often already of high-quality, and that the second regression step improves these results by removing certain localized distortions and subtle artifacts. We also find that our overall method is often better at reconstructing textures and image detail than state-of-the-art denoising methods.
Visualizing Matching Scores.
We next take a closer look at the behavior of the matching network in Fig. 4. For a number of reference patches, and corresponding search windows, cropped from different training images, we visualize the matching scores predicted by our network. We show the average matching score across all sub-bands, as well as average weights corresponding to combinations of sets at the same wavelet scale (averaging over color channels and orientation), and at the same orientation (averaging over scale and color). Our results show that for any pair of patches, the network produces very different averaging weights for different sub-bands.
We find that the weights tend to be generally higher at the finest scale (indicating more averaging), and lowest for the scaling coefficients. This is likely because the highest-frequency coefficients are close to zero in most patches, and thus close to each other. For lower frequencies and DC values, the network selects only those patches that are close to the reference patch (in the clean image). For different orientations, the high matches are sometimes concentrated at different locations for the same reference, especially when there are strong edges and repeating textures. Thus, free from the restriction of matching patches as a whole with a single score, our algorithm finds different distributions of matches for different sub-bands in order to achieve optimal denoising.
| Window Size | 15 | 23 | 31 | 31 (No Pre-training) |
Effect of Window Size and Pre-training.
In Table III, we characterize the trade-off between quality and computational cost when choosing different search-window sizes over which to match and average patches. For each window size, we report the average PSNR (over our validation set) of our initial estimates at a fixed noise level. We also report the corresponding running time required to compute matching scores and perform the averaging, measured for a single input image on an NVIDIA 1080Ti GPU. Note that computing the initial estimates takes the majority of the time in our denoising method; the subsequent regression step takes only an additional 0.01 seconds, and is independent of window size.
As expected, the running time goes up roughly linearly with the number of candidate matches (i.e., as the square of the search-window size), but we find that the drop in PSNR is modest when going down to a $15\times 15$ window. Table III also demonstrates the importance of pre-training, reporting the performance (again, of our initial estimates) achieved by a network that is initialized with random weights instead of with pre-training. We find that this leads to a PSNR drop of about 0.1 dB, highlighting that pre-training is critical for convergence to a good network model.
V Conclusion

In this work, we presented a novel method that employed a neural network to characterize the similarity between pairs of patches from noisy observations, in terms of similarity scores for their corresponding sub-band components. We showed that this network could be used to identify and exploit recurring patterns in an image for denoising, and that our algorithm was able to recover high-quality estimates of the clean image from noisy observations. We believe that the potential of using neural networks to discover and leverage self-similarity is still largely untapped. A natural direction for future work is exploring applications to other restoration and estimation tasks for images (such as blind deblurring), and to other image-like signals such as depth maps and motion fields.
Acknowledgments

This work was supported by the National Science Foundation under award no. IIS-1820693.
References

-  M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Transactions on Image Processing, 2006.
-  S. Roth and M. J. Black, “Fields of experts,” IJCV, 2009.
-  U. Schmidt and S. Roth, “Shrinkage fields for effective image restoration,” in Proc. CVPR, 2014.
-  D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” in Proc. ICCV, 2011.
-  H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with bm3d?” in Proc. CVPR, 2012.
-  Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” PAMI, 2017.
-  K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” in Proc. CVPR, 2017.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
-  K. Zhang, W. Zuo, and L. Zhang, “Ffdnet: Toward a fast and flexible solution for cnn based image denoising,” IEEE Transactions on Image Processing, 2018.
-  C. Chen, Z. Xiong, X. Tian, and F. Wu, "Deep boosting for image denoising," in Proc. ECCV, 2018.
-  A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in Proc. CVPR, 2005.
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "Image denoising by sparse 3-d transform-domain collaborative filtering," IEEE Transactions on Image Processing, 2007.
-  ——, “Color image denoising via sparse 3d collaborative filtering with grouping constraint in luminance-chrominance space,” in Proc. ICIP, 2007.
-  S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model based reconstruction,” in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2013.
-  J.-H. R. Chang, C.-L. Li, B. Poczos, B. V. Kumar, and A. C. Sankaranarayanan, “One network to solve them all-solving linear inverse problems using deep projection models.” in Proc. ICCV, 2017.
-  S. A. Bigdeli, M. Zwicker, P. Favaro, and M. Jin, “Deep mean-shift priors for image restoration,” in Advances in Neural Information Processing Systems, 2017.
-  I. Mosseri, M. Zontak, and M. Irani, “Combining the power of internal and external denoising,” in Proc. ICCP, 2013.
-  S. Lefkimmiatis, "Non-local color image denoising with convolutional neural networks," in Proc. CVPR, 2017.
-  D. Yang and J. Sun, “Bm3d-net: A convolutional neural network for transform-domain collaborative filtering,” IEEE Signal Processing Letters, 2018.
-  J. Zbontar and Y. LeCun, "Stereo matching by training a convolutional neural network to compare image patches," Journal of Machine Learning Research, 2016.
-  B. Kumar, G. Carneiro, and I. Reid, "Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions," in Proc. CVPR, 2016.
-  S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in Proc. CVPR, 2015.
-  G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. CVPR, 2017.
-  K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo exploration database: New challenges for image quality assessment models,” IEEE Transactions on Image Processing, 2017.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. ICCV, 2001.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  L. Zhang, X. Wu, A. Buades, and X. Li, "Color demosaicking by local directional interpolation and nonlocal adaptive thresholding," Journal of Electronic Imaging, vol. 20, no. 2, p. 023016, 2011.
-  R. Franzen, "Kodak lossless true color image suite," source: http://r0k.us/graphics/kodak, vol. 4, 1999.