LIDIA: Lightweight Learned Image Denoising with Instance Adaptation (NTIRE, 2020)
Image denoising is a well-studied problem with extensive activity spanning several decades. Despite the many available denoising algorithms, the quest for simple, powerful and fast denoisers is still an active and vibrant topic of research. Leading classical denoising methods are typically designed to exploit the inner structure in images by modeling local overlapping patches. In contrast, recent newcomers to this arena are supervised neural-network-based methods that bypass this modeling altogether, targeting the inference goal directly and globally, while tending to be very deep and parameter heavy. This work proposes a novel low-weight learnable architecture that embeds in it several of the main concepts from the classical methods, while being trained for best denoising performance. More specifically, our proposed network relies on patch processing, leveraging non-local self-similarity, representation sparsity and a multiscale treatment. The proposed architecture achieves near state-of-the-art denoising results, while using a small fraction of the typical number of parameters. Furthermore, we demonstrate the ability of the proposed network to adapt itself to an incoming image by leveraging similar clean ones.
Image denoising is a well-studied problem, and many successful algorithms have been developed for handling this task over the years, e.g. NLM, K-SVD, BM3D, EPLL, WNNM and others [25, 15, 33, 6, 21, 18, 10, 24, 29, 31, 27, 19]. These classically-oriented algorithms strongly rely on models that exploit properties of natural images, usually employed while operating on small fully overlapped patches. For example, both EPLL and PLE perform denoising using Gaussian Mixture Modeling (GMM) imposed on the image patches. The K-SVD algorithm restores images using a sparse modeling of such patches. BM3D exploits self-similarity by grouping similar patches into 3D blocks and filtering them jointly. The algorithms reported in [27, 19] harness a multi-scale analysis framework on top of the above-mentioned local models.
Recently, supervised deep-learning based methods entered the denoising arena, showing state-of-the-art (SOTA) results in various contexts [2, 3, 30, 34, 16, 35, 23, 28, 14, 36, 13]. In contrast to the above-mentioned classical algorithms, deep-learning based methods tend to bypass the need for an explicit modeling of image redundancies, operating instead by directly learning the inference from the incoming images to their desired outputs. In order to obtain a non-local flavor in their treatment, as self-similarity or multi-scale methods would do, most of these algorithms (with one exception) tend to increase their footprint by utilizing very deep and parameter heavy networks. This reflects badly on their memory consumption, the required number of training images and the time for training and inference.
An interesting recent line of work by Lefkimmiatis proposes a denoising network with a significantly reduced number of parameters, while maintaining near SOTA performance [11, 12]. This method leverages the non-local self-similarity property of images by jointly operating on groups of similar patches. The network’s architecture consists of several repeated stages, each resembling a single step of the proximal gradient descent method under sparse modeling. In comparison with DnCNN, the work reported in [11, 12] shows a reduction by a factor of in the number of parameters, while achieving a denoising PSNR that is only dB lower.
Inspired by Lefkimmiatis’ work, in this paper we continue with his line of low-weight networks and propose a novel, easy, and learnable architecture that harnesses several main concepts from classical methods: (i) Operating on small fully overlapping patches; (ii) Exploiting non-local self-similarity; (iii) Leveraging representation sparsity; and (iv) Employing a multi-scale treatment. Our network resembles the one proposed in [11, 12], with several important differences:
We introduce a multi-scale treatment in our network that combats spatial artifacts, especially noticeable in smooth regions. While this change does not reflect strongly on the PSNR results, it has a clear visual contribution;
Our network is more effective by operating in the residual domain, similar to the approach taken by ;
Our patch fusion operator includes a spatial smoothing, which adds an extra force to our overall filtering; and
Our architecture is trained end-to-end, whereas Lefkimmiatis’ scheme consists of a greedy training of the separate layers, followed by an end-to-end warm-start update.
Our proposed method operates on all the overlapping patches taken from the processed image by augmenting each with its nearest neighbors and filtering these jointly. The patch grouping stage is applied only once before any filtering, and it is not a part of the learnable architecture. Each patch group undergoes a series of trainable steps that aim to predict the noise in the candidate patch, thereby operating in the residual domain. These include several layers of the triplet (i) a linear separable transform, (ii) a ReLU on the features obtained, and (iii) an inverse transform for returning to the image domain. As already mentioned above, our scheme includes a multi-scale treatment, in which we fuse the processing of corresponding patches from different scales. This paper has two key contributions:
Leveraging the low-weight nature of our network, we propose a novel ability of adapting our trained network to the content of the treated image, thereby boosting its denoising performance. This is obtained by denoising the incoming image regularly, seeking a few images similar to the outcome, updating the trained network to their content, and then running the denoising again. We show the tendency of this approach to lead to improved denoising performance, as demonstrated in Figure 1.
This paper is organized as follows. In Section II we describe the proposed scheme and its various ingredients. Section III presents experimental results and compares our method to other recently published methods. In Section IV we introduce the ability of our low-weight network to adapt to an incoming image by finding images similar to it, thereby boosting the denoising performance. Section V concludes this paper and raises directions for future work.
Our proposed method extracts all possible overlapping patches of size from the processed image, and cleans each in a similar way. The final reconstructed image is obtained by combining these restored patches via averaging. The algorithm is shown schematically in Figure 2.
In order to formulate the patch extraction, combination and filtering operations, we introduce some notation. Assume that the processed image is of size . We denote by the noisy and denoised images respectively, both reshaped to a 1D vector. Similarly, the corrupted and restored patches in location are denoted by respectively, where . Note that we handle boundary pixels by padding the processed image using mirror reflection with pixels from each side; thus, the number of extracted patches is equal to the number of pixels in the image. denotes the matrix that extracts a patch centered at the -th location. The patch extraction operation from the noisy image is given by , and the denoised image is obtained by combining the denoised patches using weighted averaging, where smooth patches get higher weights. More precisely, is a sample variance of , and is learned.
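As an illustration of the patch extraction and weighted recombination just described, the following NumPy sketch may be useful. The symbols of the original formulas are not recoverable from this text, so the function names, the odd patch size `p`, the scalar `beta`, and in particular the weight form `exp(-beta * variance)` are assumptions chosen only to be consistent with the statement that smooth (low-variance) patches get higher weights; `np.pad(..., mode="reflect")` is likewise one possible reading of "mirror reflection".

```python
import numpy as np

def extract_patches(img, p):
    """Return one p x p patch per pixel (p odd), using mirror padding."""
    r = p // 2
    padded = np.pad(img, r, mode="reflect")
    h, w = img.shape
    patches = np.empty((h * w, p, p), dtype=float)
    for i in range(h):
        for j in range(w):
            patches[i * w + j] = padded[i:i + p, j:j + p]
    return patches

def combine_patches(patches, shape, p, beta):
    """Weighted average of overlapping patches.

    Assumed weight: exp(-beta * variance of the patch), so that
    smooth patches contribute more. In the paper beta is learned;
    here it is just a hypothetical scalar argument.
    """
    r = p // 2
    h, w = shape
    num = np.zeros((h + 2 * r, w + 2 * r))
    den = np.zeros_like(num)
    variances = patches.reshape(len(patches), -1).var(axis=1)
    weights = np.exp(-beta * variances)
    for i in range(h):
        for j in range(w):
            k = i * w + j
            num[i:i + p, j:j + p] += weights[k] * patches[k]
            den[i:i + p, j:j + p] += weights[k]
    # Crop the padded border to return an image of the original size.
    return (num / den)[r:h + r, r:w + r]
```

Since every pixel contributes exactly one patch, recombining unmodified patches returns the original image regardless of the weights, which is a convenient sanity check for such an implementation.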
Zooming in on the local treatment, it starts by augmenting the patch with a group of its nearest neighbors, forming a matrix of size . The nearest neighbor search is done using a Euclidean metric, , limited to a search window of size around the center of the patch.
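The windowed nearest-neighbor grouping can be sketched as follows. This is a brute-force illustration under stated assumptions, not the authors' implementation: the function name, the argument names (`p` for the patch size, `k` for the group size including the reference patch, `window` for the search window width) and the reflect padding are all hypothetical choices. Patches are stored as columns, with the processed patch first, to match the description of the group matrix given later in the text.

```python
import numpy as np

def nearest_neighbors(img, center, p, k, window):
    """Group the patch at `center` with its closest patches.

    Distances are squared Euclidean; only patch centers inside a
    window x window area around `center` are searched. Returns a
    (p*p, k) matrix whose first column is the reference patch,
    and the k sorted distances.
    """
    r, half = p // 2, window // 2
    padded = np.pad(img, r, mode="reflect")
    h, w = img.shape
    ci, cj = center
    ref = padded[ci:ci + p, cj:cj + p].ravel()
    candidates = []
    for i in range(max(0, ci - half), min(h, ci + half + 1)):
        for j in range(max(0, cj - half), min(w, cj + half + 1)):
            patch = padded[i:i + p, j:j + p].ravel()
            dist = float(np.sum((patch - ref) ** 2))
            # The flag keeps the reference patch first among ties.
            is_other = (i, j) != (ci, cj)
            candidates.append((dist, is_other, i, j))
    candidates.sort()
    chosen = candidates[:k]
    group = np.stack(
        [padded[i:i + p, j:j + p].ravel() for _, _, i, j in chosen], axis=1
    )
    dists = np.array([c[0] for c in chosen])
    return group, dists
```

Because the grouping is computed once on the noisy input and is not learned, a simple search like this can be run as a preprocessing step before any trainable layer.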
The matrix undergoes a series of trainable operations that aim to recover a clean candidate patch . Our filtering network consists of several blocks, each consisting of (i) a forward 2D linear transform; (ii) a non-negative thresholding (ReLU) on the obtained features; and (iii) a transform back to the image domain. All transforms operate separately in the spatial and the similarity domains. In contrast to BM3D and other methods, our filtering predicts the residual rather than the clean patches, just as done in . This means that we estimate the noise using the network. Restored patches are obtained by subtracting the estimated noise from the corrupted patches.
Our scheme includes a multi-scale treatment and a fusion of corresponding patches from the different scales. The adopted strategy borrows its rationale from , which studied the single image super resolution task. In their algorithm, high-resolution patches and their corresponding low-resolution versions are jointly treated by assuming that they share the same sparse representation. The two resolution patches are handled by learning a pair of coupled dictionaries. In a similar fashion, we augment the corresponding patches from the two scales, and learn a joint transform for their fusion. In our experiments, the multi-scale scheme includes only two scales, but the same concept can be applied to a higher pyramid. In our notation, the 1st scale is the original noisy image , and the 2nd scale images are created by convolving with the bilinear low-pass filter of Equation (3) and down-sampling the result by a factor of two. In order to synchronize between patch locations in the two scales, we create four downscaled images by sampling the convolved image at either even or odd locations: (even/odd columns & even/odd rows). For each 1st scale patch, the corresponding 2nd scale patch (of the same size ) is extracted from the appropriate down-sampled image, such that both patches are centered at the same pixel in the original image, as depicted in Figure 3.
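The construction of the four second-scale images can be illustrated with the short NumPy sketch below. The low-pass kernel is not reproduced in this text, so the separable 3x3 bilinear kernel `[1, 2, 1] / 4` applied along rows and columns, the reflect boundary handling, and the function names are assumptions made for illustration only.

```python
import numpy as np

def lowpass(img):
    """Separable 3x3 bilinear smoothing (assumed kernel [1, 2, 1] / 4)."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    padded = np.pad(img, 1, mode="reflect")
    # Filter along rows, then along columns.
    rows = k[0] * padded[:-2, :] + k[1] * padded[1:-1, :] + k[2] * padded[2:, :]
    return k[0] * rows[:, :-2] + k[1] * rows[:, 1:-1] + k[2] * rows[:, 2:]

def four_phase_downsample(img):
    """Return the four downscaled images, keyed by (row parity, column parity)."""
    smooth = lowpass(img)
    return {(a, b): smooth[a::2, b::2] for a in (0, 1) for b in (0, 1)}
```

With this layout, the pixel at position `(i, j)` of the smoothed image appears at index `(i // 2, j // 2)` of the image keyed by `(i % 2, j % 2)`, which is what allows a second-scale patch to be centered at the same pixel as its first-scale counterpart.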
We denote the 2nd scale patch that corresponds to by . This patch is augmented with a group of its nearest neighbors, forming a matrix of size . The nearest neighbor search is performed in the same down-scaled image from which is taken, while limiting the search to a window of size . Both matrices, and , are fed to the filtering network, which fuses the scales using a joint transform. The architecture of this network is described next.
We turn to present the architecture of the filtering network, starting by describing the involved building blocks and then discussing the whole scheme. A basic component within our network is the TRT (Transform–ReLU–Transform) block. This follows the classic thresholding algorithm in sparse approximation theory, in which denoising is obtained by transforming the incoming signal, discarding small entries (thereby obtaining a sparse representation), and applying an inverse transform that returns to the signal domain. Note that the same conceptual structure is employed by the well-known BM3D algorithm. In a similar fashion, our TRT block applies a learned transform, a non-negative thresholding (ReLU) and another transform on the resulting matrix. Both transforms are separable and linear, denoted by the operator and implemented using a Separable Linear (SL) layer, where and operate in the spatial and the similarity domains respectively. Separability of the SL layer allows a substantial reduction in the number of parameters of the network. In fact, computing is equivalent to applying , where is a Kronecker product between and , and is a vectorized version of .
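The equivalence between a separable transform and a single Kronecker-structured matrix can be checked numerically. The sketch below relies on the standard identity vec(A X B) = (Bᵀ ⊗ A) vec(X) with column-major vectorization; the matrix names `W1` (acting from the left, on the spatial dimension) and `W2` (acting from the right, on the similarity dimension) and all sizes are hypothetical, and any bias term is omitted.

```python
import numpy as np

def vec(a):
    """Column-major (Fortran-order) vectorization."""
    return a.reshape(-1, order="F")

rng = np.random.default_rng(0)
n, k = 4, 3          # hypothetical: pixels per patch, patches per group
m1, m2 = 5, 2        # hypothetical output sizes of the two transforms

W1 = rng.standard_normal((m1, n))   # spatial transform (left multiplication)
W2 = rng.standard_normal((k, m2))   # similarity transform (right multiplication)
Y = rng.standard_normal((n, k))     # group matrix, one patch per column

separable = W1 @ Y @ W2             # separable application
W = np.kron(W2.T, W1)               # equivalent full matrix
full = W @ vec(Y)

assert np.allclose(full, vec(separable))

separable_params = W1.size + W2.size   # 20 + 6 = 26
full_params = W.size                   # 10 * 12 = 120
```

Even at these toy sizes the separable form needs 26 parameters where the equivalent dense matrix has 120, which illustrates where the parameter saving comes from.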
Since a concatenation of two layers can be replaced by a single effective , due to their linearity, we remove one layer in any concatenation of two -s, as shown in Figure 4. The component without the second transform is denoted by , and when concatenating -s, the first blocks should be replaced by -s. Another variant we use in our network is , which is a version of with batch normalization added before the ReLU.
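A minimal sketch of a Transform-ReLU-Transform block, together with a numerical check that two consecutive separable linear layers collapse into one, is given below. It assumes a separable layer of the form `W1 @ Y @ W2 + B`; the function names, the presence and shape of the bias `B`, and all sizes are illustrative assumptions rather than the authors' exact definitions.

```python
import numpy as np

def sl(Y, W1, W2, B):
    """Separable linear layer (assumed form): left and right multiplication plus bias."""
    return W1 @ Y @ W2 + B

def trt(Y, first, second):
    """Transform, ReLU, transform; `first` and `second` are (W1, W2, B) tuples."""
    features = np.maximum(sl(Y, *first), 0.0)
    return sl(features, *second)

def merge(inner, outer):
    """Collapse outer(inner(Y)) into a single equivalent separable layer."""
    A1, A2, Ba = inner
    C1, C2, Bc = outer
    return C1 @ A1, A2 @ C2, C1 @ Ba @ C2 + Bc

rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 3))
inner = (rng.standard_normal((5, 4)), rng.standard_normal((3, 2)),
         rng.standard_normal((5, 2)))
outer = (rng.standard_normal((6, 5)), rng.standard_normal((2, 3)),
         rng.standard_normal((6, 3)))

stacked = sl(sl(Y, *inner), *outer)
collapsed = sl(Y, *merge(inner, outer))
assert np.allclose(stacked, collapsed)
```

The check makes the redundancy explicit: with no nonlinearity between them, the closing transform of one block and the opening transform of the next carry no more expressive power than a single layer, so dropping one of them saves parameters at no cost.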
Another component of the filtering network is an Aggregation block (), depicted in Figure 5. This block imposes consistency between overlapping patches by combining them to a temporary image using plain averaging (as described in Eq. (1) but without the weights), and extracting them back from the obtained image by .
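The aggregation step can be sketched as a plain average of overlapping patches followed by re-extraction. The code below is an illustrative reading of the description, assuming one odd-sized `p x p` patch per pixel and reflect padding at the boundaries; the function name and array layout are hypothetical.

```python
import numpy as np

def aggregate(patches, shape, p):
    """Average overlapping patches into a temporary image, then re-extract.

    patches: array of shape (h*w, p, p), one patch per pixel.
    Returns the mutually consistent patches and the temporary image.
    """
    r = p // 2
    h, w = shape
    acc = np.zeros((h + 2 * r, w + 2 * r))
    cnt = np.zeros_like(acc)
    for k in range(h * w):
        i, j = divmod(k, w)
        acc[i:i + p, j:j + p] += patches[k]   # plain (unweighted) sum
        cnt[i:i + p, j:j + p] += 1.0
    temp = (acc / cnt)[r:h + r, r:w + r]      # temporary image, original size
    padded = np.pad(temp, r, mode="reflect")
    out = np.stack([padded[i:i + p, j:j + p]
                    for i in range(h) for j in range(w)])
    return out, temp
```

Applying the function to patches that already agree on their overlaps leaves them unchanged, whereas disagreeing patches are replaced by their consensus; this is the sense in which the block imposes consistency, and also why some information is lost.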
The complete architecture of the filtering network is presented in Figure 6. The network receives as input two sets of matrices, and , and its output is an array of filtered overlapping patches . At first, each of these matrices is multiplied by a diagonal weight matrix . Recall that the columns of (or ) are image patches, where the first is the processed patch and the rest are its neighbors. The weights express the network’s belief regarding the relevance of each of the neighbor patches to the denoising process. These weights are calculated using an auxiliary network denoted as “weight net”, which consists of seven FC (Fully Connected) layers of size with batch normalization and ReLU between each two FC’s. The network gets as input the sample variance of the processed patch and squared distances between the patch and its nearest neighbors.
After multiplication by the matrix undergoes a series of operations that include transforms, ReLUs and , until it gets to the block, as shown in Figure 6. The aggregation block imposes consistency of the matrices, which represent overlapping patches, but also causes loss of some information; therefore, we split the flow into two branches: with and without . Since the output of any or component is in the feature domain, we wrap the block with and , where transforms the features to the image domain, and transforms the output back to the feature space, while imposing sparsity. The 2nd scale matrices, , undergo very similar operations as the 1st scale ones, but with different learned parameters. The only difference in the treatment of the two scales is in the functionality of the aggregation blocks. Since the operates on downsampled patches, combination and extraction is done with . Additionally, applies the bilinear low-pass filter defined in Equation (3) on the temporary image obtained after the patch combination.
The block applies a joint transform that fuses the features coming from four origins: the 1st and 2nd scales with and without aggregation. The columns of all these matrices are concatenated together such that the same spatial transformation is applied on all. Note that the network size can be reduced at the cost of a slight degradation in performance by removing the , and components. We discuss this option in the results section.
This section reports the performance of the proposed scheme, with a comprehensive comparison to recent SOTA denoising algorithms. In particular, we include in these comparisons the classical BM3D due to its resemblance to our network architecture, the TNRD, DnCNN and FFDNet networks, the non-local and high performance NLRN architecture, and the recently published Learned K-SVD (LKSVD) method. We also include comparisons to Lefkimmiatis’ networks, NLNet and UNLNet, which inspired our work. Our algorithm is denoted as Non-Local Multi-Scale (NLMS), and we present two versions of it, NLMS and NLMS-S. The second is a simplified network with slightly weaker performance (see more below).
We start with plain denoising experiments, in which the noise is Gaussian white of known variance. This is the common case covered by all the above mentioned methods.
Our network is trained on 432 images from the BSD500 set , and the evaluation uses the remaining 68 images (BSD68). The network is trained end-to-end using decreasing learning rate over batches of 4 images, using the mean-squared-error loss. We start training with the Adam optimizer with a learning rate of , and switch to SGD at the last part of the training with an initial learning rate of .
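For concreteness, the evaluation setting of additive white Gaussian noise with known standard deviation, scored by PSNR, can be sketched as follows. The peak value of 255 (8-bit images) and the function names are assumptions; the text itself does not state the intensity range.

```python
import numpy as np

def add_gaussian_noise(clean, sigma, rng):
    """Add white Gaussian noise with standard deviation sigma."""
    return clean + rng.normal(0.0, sigma, size=clean.shape)

def psnr(clean, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB (peak=255 assumes 8-bit images)."""
    mse = np.mean((clean - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

As a reference point, the PSNR of the noisy input itself is approximately `20 * log10(255 / sigma)`, so any denoiser should score above that baseline.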
Figure 7 presents a comparison between our algorithm and leading alternative ones by presenting their PSNR performance versus their number of trained parameters. This figure exposes the fact that the performance of denoising networks is heavily influenced by their complexity. As can be seen, the various algorithms can be roughly split into two categories: low-weight architectures with a number of parameters below 100K (TNRD, LKSVD, NLNet and UNLNet, the last being a blind denoising network), and much larger and slightly better performing networks (DnCNN, FFDNet, and NLRN) that use hundreds of thousands of parameters. As we proceed in this section, we emphasize low-weight architectures in our comparisons, a category to which our network belongs. Figure 7 shows that our networks (both NLMS and NLMS-S) achieve the best results within this low-weight category.
Detailed quantitative denoising results per noise level are reported in Table I (NLNet complexity and PSNR are taken from the released code). For each noise level, the best denoising performance is marked in red, and the best performance within the low-weight category is marked in blue. Table II reports the number of trained parameters for each of the competing networks. Figures 8, 9, 10 and 11 present examples of denoising results. Since our architecture is related to both BM3D and NLNet, we focus on qualitative comparisons with these algorithms. For all noise levels our results are significantly sharper, contain fewer artifacts and preserve more details than those of BM3D. In comparison to NLNet, our method is significantly better in high noise levels due to our multi-scale treatment, recovering large and repeating elements, as shown in Figure 7(p). In fact our algorithm manages to recover repeating elements better than all methods presented in Figure 8 except NLRN. In addition, in cases of high noise levels, the multi-scale treatment allows handling smooth areas with fewer artifacts than NLNet, as one can see from the results in Figure 8(p) and 8(o). In medium noise levels, our algorithm recovers more details, while NLNet tends to over-smooth the recovered image. For example, see the elephant skin in Figure 10 and the mountain glacier in Figure 11.
For denoising of color images we use 3D patches of size and increase the size of the matrices from (, , ) to (, , ) accordingly, which increases the total number of our network’s parameters to 94K. The nearest neighbor search is done using the Luminance component, . Quantitative denoising results are reported in Table III, where our network is denoted as CNLMS (Color NLMS). As can be seen, our network is the best within the low-weight category, and gets quite close to the CDnCNN performance . Figures 12 and 13 present examples of denoising results which show that CNLMS handles low frequency noise better than CBM3D and CNLNet due to its multi-scale treatment.
Blind denoising, i.e., denoising with unknown noise level, is a useful feature when it comes to neural networks. This allows using a fixed network for performing image denoising, while serving a range of noise levels. This is a more practical solution, when compared to the one discussed above, in which we have designed a series of networks, each trained for a particular . We report blind denoising performance of our architecture and compare to similar results by DnCNN-b (a version of DnCNN that has been trained for a range of values) and UNLNet. Our blind denoising network (denoted NLMS-b) preserves all its structure, but is simply trained by mixing noise level examples in the range . The evaluation of all three networks is performed on images with . The results of this experiment are reported in Table IV. As can be seen, our method obtains a higher PSNR than UNLNet, while being slightly weaker than DnCNN-b. Considering again the fact that our network has nearly of the parameters of DnCNN-b, we can say that our approach leads to SOTA results in the low-weight category.
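One simple way to mix noise levels during blind training is to draw a separate noise standard deviation for each training example. The sketch below assumes uniform sampling and takes the range bounds as arguments, since the actual range and sampling scheme are not recoverable from this text; the function and argument names are hypothetical.

```python
import numpy as np

def make_blind_batch(clean_batch, sigma_min, sigma_max, rng):
    """Corrupt each image in a (batch, height, width) array with its own noise level.

    The per-example sigma is drawn uniformly from [sigma_min, sigma_max];
    both bounds are placeholders for the range used in training.
    """
    sigmas = rng.uniform(sigma_min, sigma_max, size=len(clean_batch))
    noise = rng.standard_normal(clean_batch.shape) * sigmas[:, None, None]
    return clean_batch + noise, sigmas
```

The network itself is unchanged in this setting; only the way training pairs are generated differs from the fixed-noise-level case.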
Our NLMS denoising network can be further simplified by removing the , and components. The resulting smaller network, denoted by NLMS-S, contains 30% fewer parameters than the original NLMS architecture (see Table II), while achieving slightly weaker performance. Table V shows that for both regular and blind denoising scenarios, NLMS-S achieves an average PSNR that is only dB lower than the full-size NLMS network. Denoising examples are presented in Figure 14, showing that the visual quality gap between NLMS and NLMS-S is marginal.
A network trained on a set of general natural images might fail to obtain high quality results when applied to images that are not well represented in the training set. For example, applying our network on astronomical or text images creates pronounced artifacts, as can be seen in Figures 16(g) and 17(g), as these images contain specific structures that are atypical of natural images. The work reported in [22, 23] suggests training class-aware denoisers, showing that they lead to better performance. However, this approach requires a large number of images for training each class (e.g. [22, 23] train their networks on 900 images per class), and holding many networks for covering the variety of classes to handle.
In this work we propose a different approach: Instead of learning class-aware denoisers, we adapt the above described universal network, which has been trained on natural images, to handle incoming images with special content. Adaptation is obtained by first denoising the input image regularly, then seeking (e.g., using Google image search) a few images closely related to it, and then re-training the network on this small set of clean images. This process concludes by denoising the input image with the updated network. An advantage of low-weight schemes is their ability to be retrained on small amounts of data without overfitting. Indeed, we present experiments in which our network (NLMS) is updated with a single similar image. The training images used for our experiments are shown in Figure 16. In each experiment, except the text one, the network has been trained over 500 batches of 4 cropped sub-images with random offsets, which takes about 6-7 minutes on an Nvidia GeForce GTX 1080 Ti GPU. The adaptation does not require early stopping of the training, i.e. training the network over tens of thousands of batches leads to similar and even better results. Since the statistics of the text images are distant from those of natural images, the adaptation of the network takes more time. In our experiment, training has been done over 6300 batches. In accordance with the long training time, this adaptation gains more than 4dB improvement in PSNR, where an improvement of 2.8dB is achieved with only 400 batches, as shown in Figure 15.
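The data side of this adaptation step can be sketched as a generator that turns a single clean, similar image into training batches. The defaults of 500 batches and 4 crops per batch follow the numbers quoted above; the crop size, the noise level `sigma`, the synthesis of noisy inputs by adding Gaussian noise to the clean crops, and the function name are assumptions made for illustration. The network update itself is not shown.

```python
import numpy as np

def adaptation_batches(similar_image, sigma, crop, n_batches=500, batch_size=4, seed=0):
    """Yield (noisy, clean) batches of random crops from one clean image.

    similar_image: 2D array holding the clean image found for adaptation.
    sigma: noise standard deviation used to synthesize noisy inputs (assumed).
    crop: side length of the square crops (hypothetical parameter).
    """
    rng = np.random.default_rng(seed)
    h, w = similar_image.shape
    for _ in range(n_batches):
        clean = np.empty((batch_size, crop, crop))
        for b in range(batch_size):
            # Random top-left offset such that the crop stays inside the image.
            i = rng.integers(0, h - crop + 1)
            j = rng.integers(0, w - crop + 1)
            clean[b] = similar_image[i:i + crop, j:j + crop]
        noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
        yield noisy, clean
```

Each yielded pair would be fed to a few optimizer steps on the pretrained network, after which the original noisy input is denoised again with the updated weights.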
The first two experiments, presented in Figures 17 and 18, show adaptation examples for non-natural images: astronomical and text images. As can be seen in Figures 16(g) and 17(g), the denoising results achieved by NLMS before adaptation are poor. However, adapting the network by training on a single similar image (a different training image for each experiment) significantly improves both PSNR and the visual quality of the results. When it comes to natural images, our regularly trained network usually achieves satisfactory results that are harder to improve. However, even in such cases, the denoising quality might be boosted by adapting the network using one similar image, as indeed shown in the experiments presented in Figures 19 and 20. Adapting the network leads to a PSNR improvement of more than 0.25dB.
This work presents a low-weight network for supervised image denoising. Our patch-based architecture exploits non-local self-similarity and representation sparsity, augmented by a multiscale treatment. Separable linear layers, combined with non-local neighbor search, allow capturing non-local interrelations between pixels using a small number of learned parameters. The proposed network achieves SOTA results in the low-weight category, and competitive performance overall. In addition, the presented network can be adapted to incoming noisy images by re-training on similar images, leading to boosted denoising performance. In our future work we intend to extend the algorithm to treat color images, and other noise models. In addition, we believe that the proposed architecture could prove effective in more challenging inverse problems.
Non-local color image denoising with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3587–3596.