The problem of denoising consists of estimating a signal from measurements corrupted by noise, and is a canonical application of statistical estimation that has been studied since the 1950s. Achieving high-quality denoising results requires (at least implicitly) quantifying and exploiting the differences between signals and noise. In the case of natural photographic images, denoising is both an important application and a useful test-bed for our understanding of natural images.
The classical solution to the denoising problem is the Wiener filter wiener1950extrapolation , which assumes a translation-invariant Gaussian signal model. Under this prior, the Wiener filter is the optimal estimator in terms of mean squared error. It operates by mapping the noisy image to the frequency domain, shrinking the amplitude of all components, and mapping back to the signal domain. In the case of natural images, the high-frequency components are shrunk more aggressively than the lower-frequency components because they tend to contain less energy. This is equivalent to convolution with a lowpass filter, implying that each pixel is replaced with a weighted average over a local neighborhood.
In the 1990’s, more powerful solutions were developed based on multi-scale ("wavelet") transforms. These transforms map natural images to a domain where they have sparser representations. This makes it possible to perform denoising by applying nonlinear thresholding operations in order to discard components that are small relative to the noise level Donoho95a ; Simoncelli96c ; chang2000adaptive . From a linear-algebraic perspective, these algorithms operate by projecting the noisy input onto a lower-dimensional subspace that contains plausible signal content. The projection eliminates the orthogonal complement of the subspace, which mostly contains noise. This general methodology laid the foundations for the state-of-the-art models in the 2000’s (e.g. dabov2006image ), some of which added a data-driven perspective, learning sparsifying transforms elad2006image , and nonlinear shrinkage functions directly from natural images HelOr2008 ; Raphan08 .
In the past decade, purely data-driven models based on convolutional neural networks lecun2015deep have come to dominate all previous methods in terms of performance. These models consist of cascades of convolutional filters and rectifying nonlinearities, which are capable of representing a diverse and powerful set of functions. Training such architectures to minimize mean squared error over large databases of noisy natural-image patches achieves current state-of-the-art results zhang2017beyond (see also chen2017trainable for a related approach).
Neural networks have achieved particularly impressive results on the blind denoising problem, in which the noise amplitude is unknown zhang2017beyond ; DURR ; lefkimmiatis2018universal . Despite their success, these solutions are not well understood. We lack intuition about the denoising mechanisms they implement. Network architecture and functional units are often borrowed from the image-recognition literature, and it is unclear which of these aspects contribute to, or limit, the denoising performance. Many authors claim critical importance of specific aspects of architecture (e.g., skip connections, batch normalization, recurrence), but the benefits of these attributes are difficult to isolate and evaluate in the context of the many other elements of the system.
In this work, we show that bias-free CNNs (BF-CNNs), in which the bias terms (additive constants) throughout the network have been eliminated, yield performance matching the state-of-the-art in image denoising, while offering two important advantages. First, BF-CNNs are locally linear, and hence amenable to direct analysis with linear-algebraic tools. In Section 2 we leverage such tools to visualize locally adaptive properties of the denoising map, and to show that the network approximates a projection onto an adaptively-selected low-dimensional subspace. The analysis uncovers direct connections between the denoising mechanism of the neural networks and classical methods based on filtering and projection operations. Second, we show that BF-CNNs generalize robustly to noise levels well beyond the range at which they have been trained, in stark contrast to existing architectures. Section 3.1 provides a theoretical explanation of this phenomenon, based on the insight that removing the bias results in invariance to rescaling. Section 3.2 demonstrates the overall performance and generalization capabilities of BF-CNNs through comprehensive numerical experiments, showing that the advantages of bias removal hold for several popular architectures.
2 Analysis of bias-free neural networks for denoising
We assume a measurement model in which images are corrupted by additive noise: $y = x + n$, where $x \in \mathbb{R}^N$ is the original image, containing $N$ pixels, $n$ is an image of i.i.d. samples of Gaussian noise with variance $\sigma^2$, and $y \in \mathbb{R}^N$ is the noisy observation. The denoising problem consists of finding a function $f: \mathbb{R}^N \to \mathbb{R}^N$ that provides a good estimate of the original image, $x$. Commonly, one minimizes the mean squared error: $f = \arg\min_g \mathrm{E}\,\|x - g(y)\|^2$, where the expectation is taken over some distribution over images, $x$, as well as over the distribution of noise realizations. Finally, if the noise standard deviation, $\sigma$, is unknown, the expectation must also be taken over a distribution of this variable. This problem is often called blind denoising in the literature.
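As a concrete illustration of the measurement model, the following minimal sketch generates a noisy observation and evaluates the mean squared error of the trivial estimator that returns the noisy image unchanged. The ramp "image", the noise level, and the function names are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_gaussian_noise(x, sigma, rng):
    """Return y = x + n, with n i.i.d. Gaussian of standard deviation sigma."""
    return x + sigma * rng.standard_normal(x.shape)


def mse(x, x_hat):
    """Mean squared error between a clean image and an estimate of it."""
    return float(np.mean((x - x_hat) ** 2))


# Hypothetical "clean image": a smooth horizontal ramp (illustration only).
x = np.tile(np.linspace(0.0, 255.0, 64), (64, 1))
sigma = 25.0  # assumed noise standard deviation
y = add_gaussian_noise(x, sigma, rng)

# The identity estimator f(y) = y removes no noise, so its MSE should be
# close to the noise variance sigma**2.
mse_identity = mse(x, y)
```

Any useful denoiser should achieve an error well below `mse_identity`; in the blind setting, `sigma` would additionally be drawn at random for each example.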
Figure: SVD analysis of the Jacobian of a BF-CNN. (a) The orange points show the cosine of the angle between the left and right singular vectors for a single image (orange curve). The larger blue point (and vertical line) indicates the effective dimensionality (sum of squared singular values) for that image. Note that the orange points at dimensionalities less than this are nearly 1, implying that the corresponding left and right singular vectors are nearly identical. (b) Effective dimensionality of the signal subspaces as a function of noise level (the orange curve corresponds to the image highlighted in panel (a)). For comparison, the total dimensionality of the space equals the number of pixels in the image. (c) Histogram of dot products (cosine of angle) between the left and right singular vectors that lie within the signal subspaces.
Feedforward neural networks with rectified linear units (ReLUs) are piecewise affine: for a given activation pattern of the ReLUs, the effect of the network on the input is a cascade of linear transformations (convolutional or fully connected layers), additive constants, and pointwise multiplications by a binary mask corresponding to the fixed activation pattern. Since each of these is affine, the entire cascade implements a single affine transformation. For a fixed noisy input image $y \in \mathbb{R}^N$ with $N$ pixels, the function $f: \mathbb{R}^N \to \mathbb{R}^N$ computed by a denoising neural network may be written

$$f(y) = A_y\, y + b_y, \tag{1}$$

where $A_y \in \mathbb{R}^{N \times N}$ is the Jacobian of $f(\cdot)$ evaluated at input $y$, and $b_y \in \mathbb{R}^N$ represents the net bias. The subscripts on $A_y$ and $b_y$ serve as a reminder that the corresponding matrix and vector, respectively, depend on the ReLU activation patterns, which in turn depend on the input vector.
If we remove all the additive ("bias") terms from every stage of a CNN, the resulting bias-free CNN (BF-CNN) is locally linear rather than affine, and its net action may be expressed as

$$f_{\text{BF}}(y) = A_y\, y, \tag{2}$$

where $A_y$ is again the Jacobian of $f_{\text{BF}}(\cdot)$ evaluated at $y$. In the following two sections, we analyze this local representation to reveal and visualize the noise-removal mechanisms implemented by BF-CNNs. We illustrate our analysis using a BF-CNN based on the architecture of the Denoising CNN (DnCNN, zhang2017beyond ). A detailed description of the architecture is provided in Section 3.2.
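The local affine and local linear representations can be checked numerically on a toy network. The sketch below is a minimal illustration, assuming a small fully connected ReLU network with random weights (not the DnCNN architecture): it builds the Jacobian from the ReLU activation masks, propagates the layer biases to obtain the net bias, and repeats the computation with all biases switched off. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 8, 16  # hypothetical input dimension and hidden width

# Random weights and biases for a toy three-layer fully connected ReLU network.
W1, b1 = rng.standard_normal((H, N)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((H, H)), rng.standard_normal(H)
W3, b3 = rng.standard_normal((N, H)), rng.standard_normal(N)


def forward(y, use_bias=True):
    """Evaluate the network; use_bias=False gives the bias-free variant."""
    s = 1.0 if use_bias else 0.0
    h1 = np.maximum(W1 @ y + s * b1, 0.0)
    h2 = np.maximum(W2 @ h1 + s * b2, 0.0)
    return W3 @ h2 + s * b3


def local_affine_model(y, use_bias=True):
    """Return (A_y, b_y) such that forward(y) == A_y @ y + b_y at this input."""
    s = 1.0 if use_bias else 0.0
    z1 = W1 @ y + s * b1
    m1 = (z1 > 0).astype(float)  # ReLU activation mask, layer 1
    z2 = W2 @ (m1 * z1) + s * b2
    m2 = (z2 > 0).astype(float)  # ReLU activation mask, layer 2
    # Jacobian: weight matrices interleaved with the diagonal masks.
    A = W3 @ (m2[:, None] * W2) @ (m1[:, None] * W1)
    # Net bias: each layer's bias propagated through the later layers.
    b = s * (W3 @ (m2 * (W2 @ (m1 * b1) + b2)) + b3)
    return A, b


y = rng.standard_normal(N)
A, b = local_affine_model(y, use_bias=True)        # affine: A y + b
A_bf, b_bf = local_affine_model(y, use_bias=False)  # linear: A y only
```

With biases, `forward(y)` equals `A @ y + b`; without them the net bias vanishes and `forward(y, use_bias=False)` equals `A_bf @ y`, so the Jacobian alone describes the mapping at that input.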
Visualization approaches based on differentiating neural-network functions with respect to their input have been previously proposed in the context of image classification (see e.g. simonyan2013deep ; montavon2017explaining ). However, architectures used for classification include non-ReLU activation functions, and thus the Jacobian serves only to provide a first-order Taylor approximation of the mapping. A similar complication arises when considering denoising architectures that are not bias-free: the resulting filters do not fully represent the denoising map due to the presence of the net bias, which confounds the analysis.
2.1 Visualization of equivalent filters
The linear representation of the denoising map given by equation (2) implies that the $i$th pixel of the output image is computed as an inner product between the $i$th row of the Jacobian $A_y$, denoted $a_y(i)$, and the input image:

$$f_{\text{BF}}(y)(i) = \sum_{j=1}^{N} A_y(i,j)\, y(j) = a_y(i)^T y.$$

The vectors $a_y(i)$ can be interpreted as adaptive filters that produce an estimate of the denoised pixel via a weighted average of noisy pixels. Examination of these filters reveals their diversity, and their relationship to the underlying image content (Figure 1): They are adapted to the local features of the noisy image, averaging over homogeneous regions of the image without blurring across edges.
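The equivalent filter for a given output pixel can be extracted from any bias-free ReLU network by treating it as a black box. The sketch below is a minimal illustration, assuming a toy two-layer bias-free network with random weights and a central-difference Jacobian (an automatic-differentiation library would normally be used instead); the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, H = 6, 12  # hypothetical number of "pixels" and hidden units
W1 = rng.standard_normal((H, N))
W2 = rng.standard_normal((N, H))


def f_bf(y):
    """Toy bias-free ReLU network (stand-in for a BF-CNN)."""
    return W2 @ np.maximum(W1 @ y, 0.0)


def numerical_jacobian(func, y, eps=1e-6):
    """Central-difference Jacobian of func: R^N -> R^N at the point y."""
    A = np.zeros((y.size, y.size))
    for j in range(y.size):
        e = np.zeros_like(y)
        e[j] = eps
        A[:, j] = (func(y + e) - func(y - e)) / (2.0 * eps)
    return A


y = rng.standard_normal(N)
A_y = numerical_jacobian(f_bf, y)

i = 2                    # output pixel of interest
a_i = A_y[i]             # equivalent filter: i-th row of the Jacobian
pixel_from_filter = float(a_i @ y)
pixel_from_network = float(f_bf(y)[i])
```

The two pixel values agree up to numerical precision, because the network is linear in a neighborhood of `y`. For an image denoiser, reshaping `a_i` to the image dimensions gives the filter as a picture.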
Analyzing BF-CNNs in terms of equivalent filters facilitates a comparison with existing denoising methods. As mentioned in the introduction, classical Wiener filtering denoises images by convolving with a lowpass filter, preserving low-frequency information, while suppressing fine-scale details. Many modern denoising techniques can be interpreted as implementing nonlinear spatially-varying filters designed to preserve fine-scale details such as edges (e.g. tomasi1998bilateral , see also milanfar2012tour for a comprehensive review, and choi2018fast for a recent learning-based approach). Our analysis shows that BF-CNNs can be interpreted as a data-driven method to learn adaptive filters implicitly.
2.2 Analysis via the singular-value decomposition
The local linear structure of a BF-CNN facilitates analysis of its functional capabilities via the singular value decomposition (SVD). For a given input $y$, we compute the SVD of the Jacobian matrix: $A_y = U S V^T$, with $U$ and $V$ orthogonal matrices, and $S$ a diagonal matrix. We can decompose the effect of the network on its input in terms of the left singular vectors $U_i$ (columns of $U$), the singular values $s_i$ (diagonal elements of $S$), and the right singular vectors $V_i$ (columns of $V$):

$$f_{\text{BF}}(y) = A_y\, y = U S V^T y = \sum_{i=1}^{N} s_i\, (V_i^T y)\, U_i.$$
The output is a linear combination of the left singular vectors, each weighted by the projection of the input onto the corresponding right singular vector, and scaled by the corresponding singular value.
Analyzing the SVD of a BF-CNN on a set of ten natural images reveals that most singular values are very close to zero (Figure 1(a)). The network is thus discarding all but a very low-dimensional portion of the input image. We can measure an "effective dimensionality" of this preserved subspace by computing the total noise variance remaining in the denoised image, $f_{\text{BF}}(y)$, relative to the input noise variance $\sigma^2$. Treating the Jacobian $A_y$ as fixed, $\mathrm{E}_n \|A_y n\|^2 = \sigma^2 \sum_i s_i^2$, so this quantity corresponds to the sum of the squared singular values.
We also observe that the left and right singular vectors corresponding to the singular values with non-negligible amplitudes are approximately the same (Figures 1(a) and 1(c)). This means that the Jacobian is (approximately) symmetric, and we can interpret the action of the network as projecting the noisy signal onto a low-dimensional subspace, as is done in wavelet thresholding schemes.
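The quantities used in this analysis are straightforward to compute once a Jacobian is available. The sketch below is a minimal illustration on a synthetic stand-in: a symmetric matrix with three non-negligible singular values, constructed by hand rather than obtained from a trained network. It computes the effective dimensionality as the sum of squared singular values and the cosines between corresponding left and right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20  # hypothetical number of pixels

# Synthetic stand-in for a Jacobian: symmetric, with only three
# non-negligible singular values (an approximate projection).
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
s_true = np.zeros(N)
s_true[:3] = [1.0, 0.9, 0.8]
A = (Q * s_true) @ Q.T

# SVD: A = U diag(s) Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A)

# Effective dimensionality: sum of squared singular values.
effective_dim = float(np.sum(s ** 2))

# Cosine of the angle between the i-th left and right singular vectors.
cosines = np.sum(U * Vt.T, axis=0)

# The network output written as a weighted sum of left singular vectors.
y = rng.standard_normal(N)
output_from_svd = U @ (s * (Vt @ y))
```

For this symmetric example the cosines of the three leading singular-vector pairs equal 1 and the effective dimensionality is 1 + 0.81 + 0.64 = 2.45. The cosines associated with the zero singular values are arbitrary and carry no information.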
For inputs of the form $y = x + n$ (where $x$ is the clean image and $n$ the noise), the subspace spanned by the singular vectors corresponding to the non-negligible singular values contains $x$ almost entirely, in the sense that projecting $x$ onto the subspace preserves most of its energy. This holds for the whole range of noise levels over which the network is trained. Specifically, averaged over 50 example images, the fraction of the energy (squared norm) preserved by the projection falls from 0.999 to 0.986 as the standard deviation of the noise grows from 10 to 100 (relative to the image pixels, which lie in the range $[0, 255]$). The low-dimensional subspace encoded by the Jacobian is therefore tailored to the input image. This is confirmed by visualizing the singular vectors as images. The singular vectors corresponding to non-negligible singular values capture features of the input image; the ones corresponding to near-zero singular values are unstructured (Figure 3). The BF-CNN therefore implements an approximate projection onto an adaptive signal subspace that preserves image structure, while suppressing the noise.
The signal subspace depends on the noise level. We find that for a given clean image corrupted by noise, the effective dimensionality of the signal subspace decreases as the noise level increases (Figure 1(b)). In addition, the signal subspaces are nested: the subspaces corresponding to lower noise levels contain at least 95% of the subspaces corresponding to higher noise levels. At lower noise levels the network detects a richer set of image features, and constructs a larger signal subspace.
In conclusion, the SVD analysis of the BF-CNN reveals that the network learns a prior on natural images that is structured as a union of subspaces. This is reminiscent of sparsity-based techniques, in which images are assumed to lie in unions of subspaces spanned by sparse linear combinations of basis functions, which may be handcrafted (e.g., chang2000adaptive ; portilla2003image ) or learned (e.g., elad2006image ). The BF-CNN implements the prior implicitly: the subspaces are obtained through a concatenation of weight matrices governed by the activation patterns corresponding to specific inputs.
3 Generalization across noise levels
In order to apply learning-based denoising methods on real data, it is crucial to understand their generalization properties. In particular, methods should be robust to deviations between the data used for training and the data encountered at test time. Here, we take a step in this direction by studying generalization across noise levels, i.e. denoising performance for noise levels not included in the training set. We provide theoretical analysis and computational experiments showing that BF-CNNs generalize robustly across noise levels, in stark contrast to architectures that have nonzero net bias.
3.1 Scaling invariance
A key property of bias-free neural networks with ReLU activations is invariance to scaling: rescaling the input by a constant value simply rescales the output by the same amount.
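Lemma 1 (Scaling invariance). Let $f_{\text{BF}}: \mathbb{R}^N \to \mathbb{R}^N$ be a feedforward neural network with ReLU activation functions and no constant terms in any layer. For any input $y \in \mathbb{R}^N$ and any nonnegative constant $\alpha$,

$$f_{\text{BF}}(\alpha y) = \alpha\, f_{\text{BF}}(y).$$

Proof. We can write the action of a bias-free neural network with $L$ layers in terms of the weight matrix $W_i$, $1 \le i \le L$, of each layer and a rectifying operator $R$, which sets to zero any negative entries in its input. Multiplying by a nonnegative constant does not change the sign of the entries of a vector, so for any $z$ with the right dimension and any $\alpha \ge 0$, $R(\alpha z) = \alpha R(z)$, which implies

$$f_{\text{BF}}(\alpha y) = W_L R(W_{L-1} \cdots R(W_1 \alpha y)) = \alpha\, W_L R(W_{L-1} \cdots R(W_1 y)) = \alpha\, f_{\text{BF}}(y).$$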
The proof breaks down if any of the layers in the network contains additive constant parameters because scaling the input may change the activation pattern of the ReLUs. Networks with a nonzero net bias are not scaling invariant.
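Scaling invariance, and its failure in the presence of additive constants, can be verified numerically. The sketch below is a minimal illustration, assuming a toy two-layer fully connected ReLU network with random parameters; the sizes and the scale factor are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 8, 16  # hypothetical input dimension and hidden width
W1, b1 = rng.standard_normal((H, N)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((N, H)), rng.standard_normal(N)


def net(y, use_bias):
    """Two-layer ReLU network, with or without additive constants."""
    s = 1.0 if use_bias else 0.0
    return W2 @ np.maximum(W1 @ y + s * b1, 0.0) + s * b2


y = rng.standard_normal(N)
alpha = 3.0  # any nonnegative scale factor

# Largest deviation from f(alpha * y) == alpha * f(y) in each case.
gap_bias_free = float(np.max(np.abs(net(alpha * y, False) - alpha * net(y, False))))
gap_with_bias = float(np.max(np.abs(net(alpha * y, True) - alpha * net(y, True))))
```

The gap is zero up to rounding for the bias-free network and clearly nonzero once biases are present.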
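Scaling invariance renders BF-CNNs robust to varying noise levels. Consider a clean image $x$ and a noise realization $n$ corresponding to a noise level within the training range. Let err denote the error of a bias-free network $f_{\text{BF}}$ when estimating $x$ from $y = x + n$,

$$\text{err} := f_{\text{BF}}(x + n) - x.$$

By Lemma 1, the network is automatically able to denoise clean images at a different noise level, as long as they are scaled accordingly. For any $\alpha \ge 0$, the error of the network when denoising an image $\alpha x$ corrupted by a noise realization $\alpha n$ scales linearly with $\alpha$,

$$f_{\text{BF}}(\alpha x + \alpha n) - \alpha x = \alpha \left( f_{\text{BF}}(x + n) - x \right) = \alpha\, \text{err}.$$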
A bias-free network therefore generalizes across new noise levels, as long as examples with the same signal-to-noise ratio are present in the training data. Note that this argument applies to image patches of size equal to the field-of-view of the neural network. This theoretical analysis shows that bias-free networks should provide robust generalization across noise levels when trained on a collection of image patches with different intensities, even if the range of noise levels is limited. This is confirmed by the numerical experiments reported in the following section.
3.2 Computational experiments
In this section, we investigate generalization across noise levels computationally, comparing networks with and without net bias. In order to do this, we train the networks using images corrupted by i.i.d. Gaussian noise with a range of standard deviations. This range is the training range of the network. We then evaluate the network for noise levels that are both within and beyond the training range. Our experiments are carried out on natural images from the Berkeley Segmentation Dataset MartinFTM01 . To make our experiments consistent with previous results including schmidt2014shrinkage ; chen2017trainable ; zhang2017beyond , we use a subset of images as a training set. The training set is augmented via downsampling, random flips, and random rotations of patches in these images to produce a total of 541,600 patches zhang2017beyond . We use a separate set of images as the test set.
We implement a BF-CNN based on the architecture of the Denoising CNN (DnCNN) zhang2017beyond , a neural network with convolutional layers, each containing filters and channels. Each intermediate layer in a DnCNN applies a convolution followed by batch normalization ioffe2015batch and a ReLU. There is a skip connection from the initial layer to the final layer, which has no nonlinear units. To construct a BF-CNN from the DnCNN, we remove all sources of additive bias, including the mean parameter of the batch-normalization in every layer (note however that the rescaling parameters are preserved). We train the BF-CNN, and a DnCNN, on natural image patches corrupted by i.i.d. Gaussian noise with varying standard deviation $\sigma$, sampled uniformly at random from a training range specified by minimum and maximum standard deviations, $[\sigma_{\min}, \sigma_{\max}]$. The architectures are trained by minimizing the mean squared denoising error over the entire training set, $\frac{1}{M} \sum_{i=1}^{M} \|f(y_i) - x_i\|^2$. Here $(x_i, y_i)$ denotes the $i$th training pair of clean and noisy images, and $M$ the number of training pairs. The loss is minimized using the Adam optimizer kingma2014adam over epochs with an initial learning rate of and a decay factor of at the and epochs.
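To illustrate what removing the additive terms from batch normalization could look like, the following sketch implements an inference-time normalization that only rescales each channel. It is not the authors' implementation: the use of a stored per-channel running variance, the function name, and the array layout are assumptions made for illustration. The text above states only that the mean parameter is removed and the rescaling parameters are kept.

```python
import numpy as np


def bias_free_batch_norm(x, gamma, running_var, eps=1e-5):
    """Rescale each channel without subtracting a mean or adding an offset.

    x:           array of shape (batch, channels, height, width)
    gamma:       learned per-channel scale, shape (channels,)
    running_var: stored per-channel variance, shape (channels,)  [assumed]
    """
    scale = gamma / np.sqrt(running_var + eps)
    return x * scale[None, :, None, None]


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))
gamma = np.array([0.5, 1.0, 2.0])
running_var = np.array([1.0, 4.0, 0.25])

alpha = 7.0
out = bias_free_batch_norm(x, gamma, running_var)
out_scaled = bias_free_batch_norm(alpha * x, gamma, running_var)
out_zero = bias_free_batch_norm(np.zeros_like(x), gamma, running_var)
```

Because the operation is a fixed per-channel multiplication, it maps a zero input to zero and commutes with rescaling of the input, so it does not introduce a net bias.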
We begin by analyzing the affine local model of a DnCNN trained over different noise levels. In particular, for a neural network $f$ we write the residual corresponding to a fixed noisy image $y$ as $y - f(y) = (I - A_y)\, y - b_y$ (see equation (1), with $A_y$ the Jacobian and $b_y$ the net bias). Figure 4 shows the magnitudes of the residual, the linear term, $(I - A_y)\, y$, and the net bias, $b_y$, as a function of noise level. Over the training range, the linear term dominates, implying that it is responsible for most of the denoising effort. The net bias is significantly smaller, although it does grow systematically with the noise level. However, when the network is evaluated out of the training range, the norm of the bias increases dramatically. This anomalous behavior of the bias term out of the training range coincides with a substantial drop in denoising performance. Figure 5 compares DnCNN and BF-CNN for different noise levels, inside and outside of the training range. Performance is quantified in peak signal-to-noise-ratio (PSNR). In all cases, DnCNN generalizes very poorly to noise levels outside the training range. In contrast, BF-CNN generalizes robustly, as predicted by the theoretical argument in Section 3.1. Appendix A shows that the same holds for the more perceptually-meaningful Structural Similarity Index (SSIM) wang2004ssim . Figure 6 shows an example that demonstrates visually the striking difference in generalization performance. The presence of a net bias in the architecture results in severe overfitting to the training range.
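For reference, PSNR is commonly defined as $10 \log_{10}(\text{peak}^2 / \text{MSE})$. The sketch below is a minimal implementation under the assumption of an 8-bit intensity scale with peak value 255; the function name and the test image are hypothetical.

```python
import numpy as np


def psnr(x, x_hat, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak=255 assumes 8-bit intensities."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))


# A constant error of 25.5 gray levels gives MSE = peak**2 / 100,
# which corresponds to a PSNR of about 20 dB.
x = np.zeros((8, 8))
x_hat = np.full((8, 8), 25.5)
value = psnr(x, x_hat)
```

Under this definition, halving the root-mean-squared error raises the PSNR by roughly 6 dB.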
The theoretical analysis in Section 3.1 holds for any neural network architecture that combines convolutional layers and ReLUs. This suggests that the overfitting phenomenon observed for DnCNN is likely to be true of other networks as well. In order to investigate this, we trained several different CNN architectures with and without net biases. These architectures include popular features of existing neural-network techniques in image processing: recurrence, multiscale filters, and skip connections.
The first architecture is a recurrent neural network, where the basic module is a CNN with 5 layers, filters and channels in the intermediate layers (see Section B.1). The experiment is inspired by Ref. DURR , in which the authors argue that recurrence is responsible for improved generalization across noise levels. Our results are inconsistent with this hypothesis (Figure 6(b)). The recurrent architecture fails to generalize if it has a net bias, but generalizes robustly when this bias is removed, both in terms of PSNR and SSIM. The second network is an example of a popular multiscale architecture known as a UNet ronneberger2015unet (see Section B.2). Figure 6(b) shows that generalization is again governed by the presence or absence of a net bias. Finally, Figure 6(c) shows that the same phenomenon occurs for a network inspired by the DenseNet architecture huang2017densnet ; zhang2018rdnir . The network consists of 4 consecutive CNN modules with 5 layers, filters and channels, where there is a skip connection from the input to the end of each module (see Section B.3). These experiments suggest that the presence of a net bias is the primary impediment to denoising generalization for CNN architectures.
In this work, we show that removing constant terms from CNN architectures boosts generalization performance across noise levels, and also provides interpretability of the denoising method via linear-algebra techniques, such as the SVD. Our linear-algebraic analysis uncovers interesting aspects of the denoising map, but these interpretations are very local: small changes in the input image change the activation patterns of the network, resulting in a change in the corresponding linear mapping. Extending the analysis to reveal global characteristics of the neural-network functionality is a challenging direction for future research. It is also of interest to examine whether bias removal can facilitate generalization in other image-processing tasks, such as single-image superresolution.
- (1) Chang, S. G., Yu, B., and Vetterli, M. Adaptive wavelet thresholding for image denoising and compression. IEEE Trans. Image Processing 9, 9 (2000), 1532–1546.
- (2) Chen, Y., and Pock, T. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Trans. Patt. Analysis and Machine Intelligence 39, 6 (2017), 1256–1272.
- (3) Choi, S., Isidoro, J., Getreuer, P., and Milanfar, P. Fast, trainable, multiscale denoising. In 2018 25th IEEE International Conference on Image Processing (ICIP) (2018), IEEE, pp. 963–967.
- (4) Dabov, K., Foi, A., Katkovnik, V., and Egiazarian, K. Image denoising with block-matching and 3D filtering. In Image Processing: Algorithms and Systems, Neural Networks, and Machine Learning (2006), vol. 6064, International Society for Optics and Photonics, p. 606414.
- (5) Donoho, D., and Johnstone, I. Adapting to unknown smoothness via wavelet shrinkage. J American Stat Assoc 90, 432 (December 1995).
- (6) Elad, M., and Aharon, M. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. on Image processing 15, 12 (2006), 3736–3745.
- (7) Hel-Or, Y., and Shaked, D. A discriminative approach for wavelet denoising. IEEE Trans. Image Processing (2008).
- (8) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 4700–4708.
- (9) Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
- (10) Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- (11) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (2015), 436.
- (12) Lefkimmiatis, S. Universal denoising networks: a novel cnn architecture for image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 3204–3213.
- (13) Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision (July 2001), vol. 2, pp. 416–423.
- (14) Milanfar, P. A tour of modern image filtering: New insights and methods, both practical and theoretical. IEEE signal processing magazine 30, 1 (2012), 106–128.
- (15) Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Rec. 65 (2017), 211–222.
- (16) Portilla, J., Strela, V., Wainwright, M. J., and Simoncelli, E. P. Image denoising using scale mixtures of gaussians in the wavelet domain. IEEE Trans. Image Processing 12, 11 (2003).
- (17) Raphan, M., and Simoncelli, E. P. Optimal denoising in redundant representations. IEEE Trans Image Processing 17, 8 (Aug 2008), 1342–1352.
- (18) Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (2015), Springer, pp. 234–241.
- (19) Schmidt, U., and Roth, S. Shrinkage fields for effective image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 2774–2781.
- (20) Simoncelli, E. P., and Adelson, E. H. Noise removal via Bayesian wavelet coring. In Proc 3rd IEEE Int’l Conf on Image Proc (Lausanne, Sep 16-19 1996), vol. I, IEEE Sig Proc Society, pp. 379–382.
- (21) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).
- (22) Tomasi, C., and Manduchi, R. Bilateral filtering for gray and color images. In ICCV, vol. 98.
- (23) Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P., et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing 13, 4 (2004), 600–612.
- (24) Wiener, N. Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications. Technology Press, 1950.
- (25) Zhang, K., Zuo, W., Chen, Y., Meng, D., and Zhang, L. Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Processing 26, 7 (2017), 3142–3155.
- (26) Zhang, X., Lu, Y., Liu, J., and Dong, B. Dynamically unfolding recurrent restorer: A moving endpoint control method for image restoration. arXiv preprint arXiv:1805.07709 (2018).
- (27) Zhang, Y., Tian, Y., Kong, Y., Zhong, B., and Fu, Y. Residual dense network for image restoration. CoRR abs/1812.10477 (2018).
Appendix A Generalization performance measured by SSIM
Figure 8 compares the performance of DnCNN (red curves) and BF-CNN (blue curves) for the experimental design described in Section 3.2 of the main paper. The performance is quantified by the Structural Similarity Index (SSIM) of the denoised image as a function of the input SSIM.
Appendix B Experiments with different architectures
In this section we describe the computational experiments used to evaluate the performance of the recursive architecture, the UNet, and DenseNet-style architecture in Section 3.2. We train these models using a smaller version of the dataset described in Section 3.2 of the main paper. In more detail, we use patches of size instead of , which yields a total of 22,400 training patches. We train the models using the Adam optimizer kingma2014adam with an initial learning rate of and train for epochs with a learning rate schedule which decreases by a factor of if the validation PSNR decreases from one epoch to the next. We use early stopping and select the model with the best validation PSNR. The following subsections provide details regarding the different architectures.
B.1 Recursive framework
Inspired by Ref. DURR , we consider a recurrent framework that produces a denoised image estimate of the form $\hat{x}_t = f(\hat{x}_{t-1}, y)$ at time $t$, where $f$ is a neural network and $y$ is the noisy image. We use a 5-layer fully convolutional network with filters in all layers and channels in each intermediate layer to implement $f$. We initialize the denoised estimate as the noisy image, i.e. $\hat{x}_0 = y$. For the version of the network with net bias, we add trainable additive constants to every filter in all but the last layer. During training, we run the recurrence for a maximum of $T$ times, with $T$ sampled uniformly at random for each mini-batch. At test time $T$ is fixed.
B.2 UNet
Our UNet model ronneberger2015unet has the following layers:
conv1 - Takes in input image and maps to channels with convolutional kernels.
conv2 - Input: channels. Output: channels. convolutional kernels.
conv3 - Input: channels. Output: channels. convolutional kernels with stride 2.
conv4 - Input: channels. Output: channels. convolutional kernels.
conv5 - Input: channels. Output: channels. convolutional kernels with dilation factor of 2.
conv6 - Input: channels. Output: channels. convolutional kernels with dilation factor of 4.
conv7 - Transpose convolution layer. Input: channels. Output: channels. filters with stride .
conv8 - Input: channels. Output: channels. convolutional kernels. The input to this layer is the concatenation of the outputs of layers conv7 and conv2.
conv9 - Input: channels. Output: channels. convolutional kernels.
B.3 Simple DenseNet
Our simplified version of the DenseNet architecture huang2017densnet has 4 blocks in total. Each block is a fully convolutional 5-layer CNN with filters and channels in the intermediate layers with ReLU nonlinearity. The first three blocks have an output layer with channels while the last block has an output layer with only one channel. The output of the $i$th block is concatenated with the input noisy image and then fed to the $(i+1)$th block, so the last three blocks have input channels. In the version of the network with bias, we add trainable additive parameters to all the layers except for the last layer in the final block.