Non-Local Video Denoising by CNN

November 30, 2018 · Axel Davy, et al.

Non-local patch-based methods were until recently the state of the art for image denoising but are now outperformed by CNNs. Yet they remain the best approach for video denoising, as video redundancy is a key factor for attaining high denoising performance. The problem is that CNN architectures are hardly compatible with the search for self-similarities. In this work we propose a new and efficient way to feed video self-similarities to a CNN. The non-locality is incorporated into the network via a first non-trainable layer which finds, for each patch in the input image, its most similar patches in a search region. The central values of these patches are then gathered in a feature vector which is assigned to each image pixel. This information is presented to a CNN which is trained to predict the clean image. We apply the proposed architecture to image and video denoising. For the latter, patches are searched for in a 3D spatio-temporal volume. The proposed architecture achieves state-of-the-art results, especially in video denoising, where it outperforms most state-of-the-art methods. To the best of our knowledge, this is the first successful application of a CNN to video denoising.


1 Introduction

Advances in image sensor hardware have steadily improved the acquisition quality of image and video cameras. However, a low signal-to-noise ratio is unavoidable in low lighting conditions if the exposure time is limited (for example to avoid motion blur). This results in high levels of noise, which negatively affects the visual quality of the video and hinders its use for many applications. As a consequence, denoising is a crucial component of any camera pipeline. Furthermore, by interpreting denoising algorithms as proximal operators, several inverse problems in image processing can be solved by iteratively applying a denoising algorithm [31]. Hence the need for video denoising algorithms with a low running time.

Literature review on image denoising.

Image denoising has a vast literature in which a wide variety of methods have been used: PDEs and variational methods (including MRF models), transform-domain methods, non-local (or patch-based) methods, etc. In the last two or three years, CNNs have taken over the state of the art. In addition to attaining better results, CNNs are amenable to efficient parallelization on GPUs, potentially enabling real-time performance. We can distinguish two types of CNN approaches: trainable inference networks and black-box networks.

In the first type, the architecture mimics the operations performed by a few iterations of optimization algorithms used for MAP inference with MRF prior models. Some approaches are based on the Field-of-Experts model [32], such as [4, 35, 10]. The architecture of [39] is based on EPLL [44], which models the a priori distribution of image patches as a Gaussian mixture model.

Trainable inference networks reflect the operations of an optimization algorithm, which leads in some cases to unusual architectures and to some restrictions in the network design. For example, in the trainable reaction diffusion network (TRDN) of [10], the even layers must output an image (i.e. have only one feature channel). As pointed out in [21], these architectures have strong similarities with the residual networks of [16].

The black-box approaches treat the denoising problem as a standard regression problem. They don't use much of the domain knowledge acquired during decades of research in denoising. In spite of this, these techniques currently top the list of state-of-the-art algorithms. The first denoising approaches using neural networks were proposed in the mid and late 2000s. Jain and Seung [19] proposed a five layer CNN with 5×5 filters, with 24 features in the hidden layers and sigmoid activation functions. Burger et al. [7] reported the first state-of-the-art results with a multilayer perceptron trained to denoise 17×17 patches, but with a heavy architecture. More recently, DnCNN [42] obtained impressive results with a far lighter 17 layer deep CNN with 3×3 convolutions, ReLU activations and batch normalization [18]. This work also proposes a blind denoising network that can denoise an image with an unknown noise level, and a multi-noise network trained to denoise blindly three types of noise. A faster version of DnCNN, named FFDNet, was proposed in [43], which also allows handling noise with spatially varying variance by adding the noise variance map as an additional input. The architectures of DnCNN and FFDNet keep the same image size throughout the network. Other architectures [27, 34, 8] use pooling or strided convolutions to downscale the image, and then up-convolutional layers to upscale it back. Skip connections link the layers before the pooling with the output of the up-convolution to avoid loss of spatial resolution. Skip connections are used extensively in [38].

Although these architectures produce very good results, for textures formed by repetitive patterns, non-local patch-based methods still perform better [42, 7]. Some works have therefore attempted to incorporate the non-local patch similarity into a CNN framework. Qiao et al. [30] proposed inference networks derived from the non-local FoE MRF model [37]. This can be seen as a non-local version of the TRDN network of [10]. A different non-local TRDN was introduced by [23]. BM3D-net [41] pre-computes for each pixel a stack of similar patches which are fed into a CNN that reproduces the operations done by (the first step of) the BM3D algorithm: a linear transformation of the group of patches, a non-linear shrinkage function and a second linear transform (the inverse of the first). The authors train the linear transformations and the shrinkage function. In [11] the authors propose an iterative approach that can reinforce non-locality in any denoiser. Each iteration consists of the application of the denoiser followed by a non-local filtering step using a fixed image (denoised with BM3D) for computing the non-local correspondences. This approach obtains good results and can be applied to any denoising network. An inconvenience is that the resulting algorithm requires iterating the denoising network.

In summary, existing non-local CNNs are either trainable versions of previous non-local approaches (such as BM3D or the non-local FoE model) or iterative meta denoisers that apply a non-local filtering separately from the denoising network.

Literature review on video denoising.

CNNs have been successfully applied to several video processing tasks such as deblurring [36], video frame synthesis [24] or super-resolution [17, 33], but their application to video denoising has been limited so far. In [9] a recurrent architecture is proposed, but the results are below the state of the art. Recently, [15, 28] focused on the related problem of image burst denoising, reporting very good results.

In terms of output quality the state of the art is achieved by patch-based methods [12, 26, 3, 14, 6, 40]. They heavily exploit the self-similarity of natural images and videos, namely the fact that most patches have several similar patches around them (spatially and temporally). Each patch is denoised using these similar patches, which are searched for in a region around it. The search region is generally a space-time cube, but more sophisticated search strategies involving optical flow have also been used. Because of the use of such broad search neighborhoods these methods are called non-local. While these video denoising algorithms perform very well, they are often computationally costly. Because of their complexity they are usually unfit for high resolution video processing.

Patch-based methods usually follow three steps that can be iterated: (1) search for similar patches, (2) denoise the group of similar patches, (3) aggregate the denoised patches to form the denoised frame; a schematic sketch of this pipeline is given below. VBM3D [12] improves the image denoising algorithm BM3D [13] by searching for similar patches in neighboring frames using a "predictive search" strategy which speeds up the search and gives some temporal consistency. VBM4D [26] generalizes this idea to 3D patches. In VNLB [2], spatio-temporal patches that were not motion compensated are used to improve the temporal consistency. In [14] a generic search method extends every patch-based denoising algorithm into a global video denoising algorithm by extending the patch search to the entire video. SPTWO [6] suggests using optical flow to warp the neighboring frames onto each target frame. Each patch of the target frame is then denoised using the similar patches in this volume with a Bayesian strategy similar to [22]. Recently, [40] proposed to learn an adaptive optimal transform using batches of frames.
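For concreteness, the generic three-step pipeline can be sketched as follows. This is a schematic illustration, not any particular method's implementation; the find_similar and denoise_group callbacks are placeholders standing in for each algorithm's specifics.

```python
import numpy as np

def patch_based_denoise(frame, find_similar, denoise_group, size=8, step=3):
    """Schematic three-step patch-based pipeline: (1) patch search,
    (2) joint denoising of each group, (3) aggregation of the
    overlapping denoised patches by weighted averaging."""
    H, W = frame.shape
    acc = np.zeros_like(frame, dtype=np.float64)
    wgt = np.zeros_like(frame, dtype=np.float64)
    for y in range(0, H - size + 1, step):
        for x in range(0, W - size + 1, step):
            coords = find_similar(frame, y, x, size)              # step 1
            group = np.stack([frame[i:i + size, j:j + size] for i, j in coords])
            clean = denoise_group(group)                          # step 2
            for (i, j), p in zip(coords, clean):                  # step 3
                acc[i:i + size, j:j + size] += p
                wgt[i:i + size, j:j + size] += 1.0
    return acc / np.maximum(wgt, 1.0)
```

The step parameter illustrates the subgrid trick discussed in Section 5: since the denoised patches overlap, processing only a fraction of the positions still covers every pixel.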

Contributions.

In this work we propose a non-local architecture for image and video denoising that does not suffer from the restrictions of trainable inference networks.

The method first computes, for each image patch, its n most similar neighbors in a rectangular spatio-temporal search window and gathers the center pixels of these similar patches, forming a feature vector which is assigned to each image location. This is made possible by a GPU implementation of the patch search that computes the nearest neighbors efficiently. The result is an image with n channels, which is fed to a CNN trained to predict the clean image from this high dimensional vector. We trained our network for grayscale video denoising. The temporal non-locality present in videos enables strong denoising results with our proposal.

To summarize our contributions, in this paper we present a new video denoising CNN method incorporating non-local information in a simple way. To the best of our knowledge, the present work is the first CNN-based video denoising method to attain state-of-the-art results. Compared to other works incorporating non-locality into neural networks for images, our proposal doesn't have the limitations of those derived from variational/MRF models.

Figure 1: The architecture of the proposed method. The first module performs a patch-wise nearest neighbor search across neighboring frames. Then, the current frame and the feature vectors of each pixel (the center pixels of the nearest neighbors) are fed into the network. The first four layers of the network perform 1×1 convolutions with 32 feature maps. The resulting feature maps are the input of a simplified DnCNN [42] network with 15 layers.

2 Proposed method

Let $u$ be a video and $u(x, t)$ denote its value at position $x$ in frame $t$. We observe $v$, a noisy version of $u$ contaminated by additive white Gaussian noise:

$$v = u + n,$$

where $n(x, t) \sim \mathcal{N}(0, \sigma^2)$, independently at each position and frame.
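This degradation model is straightforward to simulate; the following is a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def add_gaussian_noise(u, sigma, seed=None):
    """Simulate the degradation model v = u + n, with n ~ N(0, sigma^2),
    drawn independently at every pixel of every frame."""
    rng = np.random.default_rng(seed)
    return u + rng.normal(0.0, sigma, size=u.shape)
```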

Our video denoising network processes the video frame by frame. Before it is fed to the network, each frame is processed by a non-local patch search module which computes a non-local feature vector at each image position. A diagram of the proposed network is shown in Figure 1.

2.1 Non-local features

Let $P_x$ be the patch centered at pixel $x$ in frame $t$. The patch search module computes the distances between $P_x$ and the patches in a 3D rectangular search region centered at $x$ of size $w_s \times w_s \times w_t$, where $w_s$ and $w_t$ are the spatial and temporal sizes. The positions of the $n$ most similar patches are denoted $x_1, \dots, x_n$. Note that $x_1 = x$, since the patch is its own best match.

The pixel values at those positions are gathered as an $n$-dimensional non-local feature vector. The image of non-local features is considered as a 3D tensor with $n$ channels. This is the input to the network. Note that the first channel of the feature image corresponds to the noisy frame itself.
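A minimal, deliberately brute-force sketch of this non-local feature computation is given below. The actual implementation is a fused GPU kernel (Section 5); the function and parameter names here are illustrative only, and the sketch assumes the search window fits in the video.

```python
import numpy as np

def nonlocal_features(video, t, patch=41, w_s=41, w_t=15, n=15):
    """For every pixel of frame t, search a w_s x w_s x w_t window for the
    n patches most similar (in L2 distance) to the patch centered at that
    pixel, and gather the center pixel values of the matches into an
    n-channel feature image. Slow reference version for illustration."""
    T, H, W = video.shape
    r, hs, ht = patch // 2, w_s // 2, w_t // 2
    pad = np.pad(video, ((0, 0), (r, r), (r, r)), mode='edge')
    feat = np.zeros((n, H, W), dtype=video.dtype)
    for y in range(H):
        for x in range(W):
            ref = pad[t, y:y + patch, x:x + patch]
            cands = []  # (distance, center pixel value)
            for dt in range(max(0, t - ht), min(T, t + ht + 1)):
                for yy in range(max(0, y - hs), min(H, y + hs + 1)):
                    for xx in range(max(0, x - hs), min(W, x + hs + 1)):
                        p = pad[dt, yy:yy + patch, xx:xx + patch]
                        d = np.sum((ref - p) ** 2)
                        cands.append((d, video[dt, yy, xx]))
            cands.sort(key=lambda c: c[0])  # the patch itself has distance 0
            feat[:, y, x] = [c[1] for c in cands[:n]]
    return feat
```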

2.2 Network architecture

Our network consists of two stages: a non-local stage and a local stage. The non-local stage consists of four 1×1 convolution layers with 32 feature maps each. The rationale for these layers is to allow the network to compute pixel-wise features out of the raw non-local features at the input.

The second stage receives the features computed by the first stage. It consists of 14 layers of 3×3 convolutions with 64 feature maps, each followed by batch normalization and a ReLU activation. The output layer is a 3×3 convolution. This architecture is similar to the DnCNN network introduced in [42], although with 15 layers instead of 17 (as in [43]). As for DnCNN, the network outputs a residual image, which has to be subtracted from the noisy image to obtain the denoised one. The training loss is the mean squared error between the residual and the noise.
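The architecture can be sketched in PyTorch as follows; this is a reconstruction from the description above rather than the released code (see the repository linked in Section 3 for the latter), with the layer counts taken from the text.

```python
import torch
import torch.nn as nn

class VNLnetSketch(nn.Module):
    """Sketch of the two-stage architecture described above.
    Stage 1 (non-local): four 1x1 convolutions mixing, pixel-wise, the n
    channels produced by the patch search layer.
    Stage 2 (local): a simplified DnCNN with 15 layers, predicting the
    noise residual."""

    def __init__(self, n_neighbors=15):
        super().__init__()
        stage1, c = [], n_neighbors
        for _ in range(4):                     # non-local stage
            stage1 += [nn.Conv2d(c, 32, kernel_size=1), nn.ReLU(inplace=True)]
            c = 32
        self.nonlocal_stage = nn.Sequential(*stage1)

        stage2, c = [], 32
        for _ in range(14):                    # 14 conv(3x3)-BN-ReLU layers
            stage2 += [nn.Conv2d(c, 64, kernel_size=3, padding=1),
                       nn.BatchNorm2d(64), nn.ReLU(inplace=True)]
            c = 64
        stage2.append(nn.Conv2d(64, 1, kernel_size=3, padding=1))  # output layer
        self.local_stage = nn.Sequential(*stage2)

    def forward(self, features):
        # features: (B, n, H, W); by construction channel 0 is the noisy frame
        residual = self.local_stage(self.nonlocal_stage(features))
        return features[:, :1] - residual      # residual learning, as in DnCNN
```

With n_neighbors set to 15, this matches the default configuration used in the experiments below.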

3 Training and dataset

3.1 Datasets

For the training and validation sets we used a database of short segments of YouTube videos. The videos were selected by searching for 64 keywords. Only HD videos with a Creative Commons license were used. From each video, segments of 16 frames were extracted and downscaled to 540 lines (typically 960×540, but the number of columns depends on the aspect ratio). Segments extracted from the same video are separated by a minimum temporal gap. In total the database consists of around 1120 extracts of 16 frames. We set aside 6.5% of the videos of the database for validation (one for each category).

For training we ignored the first and last frames of each segment, for which the 3D patch search window couldn't fit in the video. The images were converted to grayscale before the synthetic Gaussian noise was added.

During validation we only considered the central frame of each sequence. The resulting validation score is thus computed on 503 sequences (1 frame each). The code to reproduce our results and the database are available at https://github.com/axeldavy/vnlnet.

For testing we used two datasets. The first is a set of seven sequences from the Derf's Test Media collection (https://media.xiph.org/video/derf) used in [1]. We used this set to compare with previous denoising methods. The sequences have a resolution of 960×540 and 100 frames. The original videos are RGB of size 1920×1080; they were converted to grayscale by averaging the channels, and then down-sampled by a factor of two. The second dataset is the test-dev split of the DAVIS video segmentation challenge [29]. It consists of 30 videos having between 25 and 90 frames. The videos are stored as sequences of JPEG images. There are two versions of the dataset: the full resolution (ranging between HD and 4K) and 480p. We used the full resolution set and applied our own downscaling to 540 rows. In this way we reduced the artifacts caused by JPEG compression.

3.2 Epochs

At each training epoch a new realization of the noise is added to generate the noisy samples. To speed the training up, we pre-compute the non-local patch search on every video (after noise generation). A random set of (spatio-temporal) patches is drawn from the dataset to generate the mini-batches.

We only consider patches such that the search window fits in the video (for instance, with the default temporal window we exclude the first and last 7 frames). At testing time, we simply extended the video by padding with black frames at the start and the end of the sequence. An epoch was composed of 14000 batches of size 128, each composed of image patches. We trained for 20 epochs with Adam [20] and reduced the learning rate at epochs 12 and 17.
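The training schedule can be summarized with the following sketch. The initial learning rate and the decay factor are illustrative placeholders, and sample_clean_batch / nonlocal_layer are hypothetical stand-ins for the dataset sampler and the precomputed patch search.

```python
import torch

def train(model, sigma, epochs=20, batches_per_epoch=14000, batch_size=128):
    """Sketch of the training loop: fresh noise every epoch, Adam,
    learning rate drops at epochs 12 and 17 (decay factor is a placeholder)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder lr
    mse = torch.nn.MSELoss()
    for epoch in range(epochs):
        if epoch in (12, 17):                 # reduce the learning rate
            for g in opt.param_groups:
                g['lr'] *= 0.1                # placeholder decay factor
        for _ in range(batches_per_epoch):
            clean = sample_clean_batch(batch_size)            # hypothetical, (B,1,H,W)
            noisy = clean + sigma * torch.randn_like(clean)   # new noise realization
            features = nonlocal_layer(noisy)                  # hypothetical, (B,n,H,W)
            denoised = model(features)
            # MSE between denoised and clean equals the MSE between the
            # predicted residual and the noise described in Section 2.2.
            loss = mse(denoised, clean)
            opt.zero_grad()
            loss.backward()
            opt.step()
```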

4 Experimental results

We will first show some experiments to highlight relevant aspects of the proposed approach. Then we compare with the state-of-the-art.

Method  No patch  Without oracle  With oracle
PSNR    31.24     31.28           31.85

Table 1: PSNR on the CBSD68 dataset (noise standard deviation of 25) for the proposed method on still images. Two variants of our method and a baseline (No patch) are compared. No patch corresponds to a simplified DnCNN with no nearest neighbor information. The other two versions collect 9 neighbors by comparing 9×9 patches; the former searches for them on the noisy image, while the latter determines the patch positions on the noise-free image (oracle). In both cases the pixel values are taken from the noisy image.

Figure 2: Results on a color image (noise standard deviation of 25). Panels: Input, Noisy, Simplified DnCNN (No patch), Ours, Ours + Oracle. The compared methods are the ones introduced in Table 1.

The untapped potential of non-locality.

Although the focus of this work is on video denoising, it is still interesting to study the performance of the proposed non-local CNN on images. Figure 2 shows a comparison of a simplified DnCNN (a standard DnCNN but with 15 layers, as in our network) and our method on a static image. The results with and without non-local information are very similar; this is confirmed in Table 1. The only difference is visible in very self-similar parts like the blinds shown in the detail of Figure 2. The figure and table also show the result of an oracular method: the nearest neighbor search is performed on the noise-free image, though the pixel values are taken from the noisy image. The oracular results show that non-locality has a great potential to improve the results of CNNs, yielding an improvement of 0.6 dB. However, this improvement is hindered by the difficulty of finding accurate matches in the presence of noise. A standard way to reduce the matching errors is to use larger patches. But on images, larger patches have fewer similar patches. In contrast, as we will see below, the temporal redundancy of videos allows using very large patches.

4.1 Parameter tuning

Non-local search has three main parameters: the patch size, the number of retained matches and the number of frames over which we search. Intuitively, we want the matches to be past or future positions of the current patch. Thus we set the number of matches equal to the number of frames over which we search.

Figure 3: Example of denoised results with our method when changing the patch size: no patch search, and 9×9, 15×15, 21×21, 31×31 and 41×41 patches. The depth of the 3D window is 15 frames for these experiments. (Panels: Input, Noisy, No patch, Patch width 9, 15, 21, 31, 41.)
Patch Width  No patch  9      15     21     31     41
PSNR         33.75     35.62  36.40  36.84  37.11  37.22

Table 2: Impact of the patch size on the PSNR computed on the validation set (noise standard deviation of 20). The tested sizes are 9×9, 15×15, 21×21, 31×31 and 41×41. No patch corresponds to the baseline simplified DnCNN.

Figure 4: Example of denoised results with our method when changing the depth of the 3D patch search window, i.e. the number of frames considered in the search (no patch search, 3, 7, 11 and 15). 41×41 patches were used for these experiments. (Panels: Input, Noisy, No Patch, 3 Neighbors, 7 Neighbors, 11 Neighbors, 15 Neighbors.)
Num Neighbors  No patch  3      7      11     15
PSNR           33.75     35.35  36.50  36.97  37.22

Table 3: Impact of the depth of the 3D patch search window, i.e. the number of frames considered in the search (no patch search, 3, 7, 11 and 15), on the PSNR computed on the validation set for a noise standard deviation of 20.

In Table 2, we explore the impact of the patch size used for the matching; Figure 3 shows visual results corresponding to each patch size. Surprisingly, we obtain better and better results by increasing the size of the patches. The main reason for this is that the match precision is improved, as the impact of noise on the patch distance shrinks. The bottom row of Figure 3 shows an area of the ground only affected by slight camera motion, and the top row an area with complex motion (human movement). We can see that the former is clearly better denoised using large patches, while the latter remains unaffected around the motion. This indicates that the network is able to determine when the provided non-local information is not accurate and, in that case, to fall back to a result similar to DnCNN (obtained by single image denoising). Further increasing the patch size would result in more areas being processed as single images. As a result, we see that the performance gain from 31×31 to 41×41 is rather small.

In Table 3 and Figure 4, we see the impact of the number of frames used. One can see that the more frames, the better. However, the small performance increase obtained by going beyond 15 frames (7 past, current, and 7 future) doesn't justify the additional cost.

In the following experiments, we shall use 41×41 patches and 15 frames. Another parameter of the non-local search is the spatial width of the search window, which we set to 41 pixels. We trained three networks for Gaussian noise of standard deviation 10, 20 and 40.

4.2 Comparison with state-of-the-art

σ = 10
Method  crowd          park joy       pedestrians    station        sunflower      touchdown      tractor        average
SPTWO   36.57 / .9651  35.87 / .9570  41.02 / .9725  41.24 / .9697  42.84 / .9824  40.45 / .9557  38.92 / .9701  39.56 / .9675
VBM3D   35.76 / .9589  35.00 / .9469  40.90 / .9674  39.14 / .9651  40.13 / .9770  39.25 / .9466  37.51 / .9575  38.24 / .9599
VBM4D   36.05 / .9535  35.31 / .9354  40.61 / .9712  40.85 / .9466  41.88 / .9696  39.79 / .9440  37.73 / .9533  38.88 / .9534
VNLB    37.24 / .9702  36.48 / .9622  42.23 / .9782  42.14 / .9771  43.70 / .9850  41.23 / .9615  40.20 / .9773  40.57 / .9731
DnCNN   34.39 / .9455  33.82 / .9329  39.46 / .9641  37.89 / .9412  40.20 / .9702  38.28 / .9269  36.91 / .9568  37.28 / .9482
VNLnet  36.90 / .9711  36.19 / .9642  41.66 / .9760  41.94 / .9735  43.58 / .9849  40.75 / .9575  38.78 / .9709  39.97 / .9712

σ = 20
Method  crowd          park joy       pedestrians    station        sunflower      touchdown      tractor        average
SPTWO   32.94 / .9319  32.35 / .9161  37.01 / .9391  38.09 / .9461  38.83 / .9593  37.55 / .9287  35.15 / .9363  35.99 / .9368
VBM3D   32.34 / .9093  31.50 / .8731  37.06 / .9423  35.91 / .9007  36.25 / .9393  36.17 / .9065  33.53 / .8991  34.68 / .9100
VBM4D   32.40 / .9126  31.60 / .8832  36.72 / .9344  36.84 / .9224  37.78 / .9517  36.44 / .9034  33.95 / .9104  35.10 / .9169
VNLB    33.49 / .9335  32.80 / .9154  38.61 / .9583  38.78 / .9470  39.82 / .9698  37.47 / .9220  36.67 / .9536  36.81 / .9428
DnCNN   30.47 / .8890  30.03 / .8625  35.81 / .9302  34.37 / .8832  36.19 / .9361  35.35 / .8782  32.99 / .9019  33.60 / .8973
VNLnet  33.19 / .9367  32.57 / .9207  37.96 / .9528  37.87 / .9379  39.53 / .9667  36.79 / .9040  35.06 / .9367  36.14 / .9365

σ = 40
Method  crowd          park joy       pedestrians    station        sunflower      touchdown      tractor        average
SPTWO   29.02 / .8095  28.79 / .8022  31.32 / .7705  32.37 / .7922  32.61 / .7974  31.80 / .7364  30.61 / .8223  30.93 / .7901
VBM3D   28.73 / .8295  27.93 / .7663  33.00 / .8828  32.57 / .8239  32.39 / .8831  33.38 / .8624  29.80 / .8039  31.11 / .8360
VBM4D   28.72 / .8339  27.99 / .7751  32.62 / .8683  32.93 / .8441  33.66 / .8999  33.68 / .8603  30.20 / .8205  31.40 / .8432
VNLB    29.88 / .8682  29.28 / .8309  34.68 / .9167  34.65 / .8871  35.44 / .9329  34.18 / .8712  32.58 / .8921  32.95 / .8856
DnCNN   26.85 / .7979  26.65 / .7525  32.01 / .8660  30.96 / .7899  32.13 / .8705  32.78 / .8346  29.25 / .7976  30.09 / .8156
VNLnet  29.46 / .8650  28.96 / .8289  33.88 / .9027  33.48 / .8577  34.65 / .9150  33.70 / .8465  31.17 / .8596  32.19 / .8679

Table 4: Quantitative denoising results (PSNR / SSIM) for seven grayscale test sequences of size 960×540 from the Derf's Test Media collection, comparing several state-of-the-art video denoising algorithms with DnCNN and our method. Three noise standard deviations are tested (10, 20 and 40). Compared methods are SPTWO [6], VBM3D [12], VBM4D [25], VNLB [2], DnCNN [42] and VNLnet (ours).

Figure 5: Visual comparison of the denoising results on the seven sequences reported in Table 4. (Panels: Input, Noisy, DnCNN, VBM3D, VNLnet (Ours), VNLB.)

Figure 6: Example of denoised result for several algorithms (noise standard deviation of 20). The two crops highlight the results on a non-moving and a moving part of the video. Non-Local Pixel Mean corresponds to the average of the output of the non-local search layer. (Panels: Input, Noisy, Non-Local Pixel Mean, DnCNN, VNLnet (Ours), VNLB.)
Method  σ = 10  σ = 20  σ = 40
DnCNN   36.80   32.94   28.69
VBM3D   37.43   33.75   30.12
VNLnet  39.08   35.44   31.79

Table 5: Performance (PSNR) of DnCNN, VBM3D and VNLnet (our method) on the DAVIS dataset [29] for several noise levels (10, 20 and 40).

In Table 4, we show a comparison of DnCNN and the proposed method, Video Non-Local Network (VNLnet), with other state-of-the-art video denoising methods [1]. The state-of-the-art methods include SPTWO [6], VBM3D [12], VBM4D [25] and VNLB [2]. We note that overall VNLB (Video Non-Local Bayes) is the best performing method. Yet, as we shall see later, it comes at a high computational cost. Our method comes second and beats the other methods, which makes it a state-of-the-art method for video denoising and the first such method of the neural kind. Figure 5 shows in detail the results for the most relevant methods (VBM3D being the most popular one).

In Figure 6, we show a detail of a sequence to highlight the result on two different types of areas. We include as reference the Non-Local Pixel Mean, which is just the average of the matches presented to the network. Since noise remains in this average, one can see that the network does more than averaging the data on static areas (middle row). Our result has more details than DnCNN and is visually on par with the result of VNLB.

We also tested our method, DnCNN and VBM3D on the DAVIS test-dev dataset [29] with three different noise levels. The results are summarized in Table 5. VNLB wasn't considered due to its computation time. We note that the proposed method outperforms DnCNN, a frame-by-frame denoiser, and VBM3D, a state-of-the-art video denoising method.

A note on running times.

VBM3D  DnCNN  VBM4D  VNLB  SPTWO
1.3s   13s    52s    140s  210s

Table 6: Running time per frame on a video for VBM3D, DnCNN, VBM4D, VNLB and SPTWO on a single CPU core.

Non-local search  Rest of the network  DnCNN
932 ms            80 ms                95 ms

Table 7: Running time per frame on a video on an Nvidia Titan V (41×41 patches at every position, 41×41×15 3D search windows, the default parameters).

In Table 6, we compare the CPU running times of VBM3D, DnCNN and VNLB when denoising a video frame. While we do not have a CPU implementation of the patch search layer, the GPU runtimes of Table 7 indicate that on CPU our method should be about 10 times slower than DnCNN. The non-local search is particularly costly because we search for matches on 15 frames for patches centered at every pixel of the image. The patch search could be made significantly faster by reducing the size of the 3D window using tricks explored in other papers. VBM3D, for example, centers the search on each frame on small windows around the best matches found in the previous frame. A related acceleration is to use a search strategy based on PatchMatch [5].

5 Implementation details

The patch search requires computing the distance between each patch in the image and the patches in its search region. If implemented naïvely, this operation can be prohibitively expensive. To reduce the computational cost, a common approach is to search for the nearest neighbors only for the patches on a subgrid of the image. For example, BM3D processes 1/9th of the patches with its default parameters. Since the processed patches overlap, the aggregation of the denoised patches still covers the whole image.

Our proposed method does not have any aggregation step. We compute the neighbors for all image patches, which is costly. In the case of video, the best results are obtained with large patches and a large search region (both temporally and spatially). Therefore we need a highly efficient patch search algorithm.

Our implementation uses an optimized GPU kernel which searches for the locations in parallel. For each patch, the best distances with respect to all other patches in the search volume are maintained in an ordered table. We split the computation of the distances in two steps: for each candidate displacement $d$ in the search region, we first compute the sum of squared differences across each column of the patch,

$$c(x) = \sum_{j=-\lfloor s/2 \rfloor}^{\lfloor s/2 \rfloor} \big(v(x + j e_v) - v(x + d + j e_v)\big)^2,$$

where $s$ is the patch width and $e_v$ the vertical direction. The patch distances are then obtained by applying a horizontal box filter of size $s$ on the volume computed by the neighboring GPU threads. The resulting implementation has linear complexity in the size of the search region and the patch width.
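The following NumPy sketch illustrates this separable two-step distance computation for a single displacement of the search region. It conveys the idea behind the GPU kernel rather than the kernel itself; the border handling via np.roll is a simplification.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def patch_distances_for_shift(ref_frame, cand_frame, dy, dx, s):
    """Distances between every s x s patch of ref_frame and the patch
    displaced by (dy, dx) in cand_frame, using the separable trick:
    per-pixel squared differences, summed vertically over each column,
    then a horizontal box filter of width s. Cost per pixel is O(1) per
    displacement instead of O(s^2)."""
    shifted = np.roll(cand_frame, shift=(dy, dx), axis=(0, 1))  # simplified borders
    sq = (ref_frame - shifted) ** 2
    col = uniform_filter1d(sq, size=s, axis=0) * s   # vertical column sums
    return uniform_filter1d(col, size=s, axis=1) * s # horizontal box filter
```

Looping this over all displacements of the 3D search region and keeping the n smallest distances per pixel reproduces, slowly, what the fused kernel computes in one pass.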

To optimize the speed of the algorithm we use the GPU shared memory as a cache for the memory accesses, thus reducing bandwidth limitations. In addition, for sorting the distances, the ordered table is stored in GPU registers and written to memory only at the end of the computation. The computation of the L2 distances and the maintenance of the ordered table have about the same computational cost. More details about the implementation can be found in the supplementary material.
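The register-resident ordered table amounts to a small insertion step per candidate; the following sketch shows the update in Python for readability, though the kernel performs it entirely in registers.

```python
def insert_into_topn(dists, values, d, v):
    """Maintain an ordered table of the n best (smallest) distances,
    mirroring what the GPU kernel keeps in registers: if the new distance
    beats the current worst, shift larger entries down and insert."""
    if d >= dists[-1]:
        return  # worse than every retained match: nothing to do
    i = len(dists) - 1
    while i > 0 and dists[i - 1] > d:   # shift larger entries toward the tail
        dists[i] = dists[i - 1]
        values[i] = values[i - 1]
        i -= 1
    dists[i] = d
    values[i] = v
```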

6 Conclusions

We have proposed a simple but efficient way of incorporating non-local information into a neural network architecture for denoising. Our method first computes, for each image patch, the most similar neighbors in a spatio-temporal window and gathers the center pixel of each similar patch, forming a non-local feature vector which is given to a simplified DnCNN. Our method yields a significant gain compared to using DnCNN directly on each video frame, and it also outperforms many state-of-the-art video denoising algorithms, including the popular VBM3D.

Our contribution places neural networks among the best video denoising methods and opens the way for new works in this area.

We have seen the importance of having reliable matches: on the validation set, the best performing method used patches of size 41×41 for the patch search. We have also noticed that on regions with unreliable matches (complex motion), the network reverts to a result similar to single image denoising. Thus we believe future works should focus on improving this aspect, possibly by adapting the size of the patch and passing information about the quality of the matches to the network.

References

  • [1] P. Arias, G. Facciolo, and J.-M. Morel. A comparison of patch-based models in video denoising. In 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), pages 1–5. IEEE, 2018.
  • [2] P. Arias and J.-M. Morel. Towards a Bayesian video denoising method. In Advanced Concepts for Intelligent Vision Systems, LNCS. Springer, 2015.
  • [3] P. Arias and J.-M. Morel. Video denoising via empirical Bayesian estimation of space-time patches. Journal of Mathematical Imaging and Vision, 60(1):70–93, Jan 2018.
  • [4] A. Barbu. Training an active random field for real-time image denoising. IEEE Transactions on Image Processing, 18(11):2451–2462, Nov 2009.
  • [5] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. In ACM SIGGRAPH 2009, New York, NY, USA, 2009. ACM.
  • [6] A. Buades, J.-L. Lisani, and M. Miladinović. Patch-based video denoising with optical flow estimation. IEEE Transactions on Image Processing, 25(6):2573–2586, June 2016.
  • [7] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399, June 2012.
  • [8] C. Chen, Q. Chen, J. Xu, and V. Koltun. Learning to see in the dark. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [9] X. Chen, L. Song, and X. Yang. Deep rnns for video denoising. In Applications of Digital Image Processing, 2016.
  • [10] Y. Chen and T. Pock. Trainable Nonlinear Reaction Diffusion: A Flexible Framework for Fast and Effective Image Restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 6 2017.
  • [11] C. Cruz, A. Foi, V. Katkovnik, and K. Egiazarian. Nonlocality-reinforced convolutional neural networks for image denoising. IEEE Signal Processing Letters, 25(8):1216–1220, Aug 2018.
  • [12] K. Dabov, A. Foi, and K. Egiazarian. Video denoising by sparse 3D transform-domain collaborative filtering. In EUSIPCO, 2007.
  • [13] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing, 2007.
  • [14] T. Ehret, P. Arias, and J.-M. Morel. Global patch search boosts video denoising. In International Conference on Computer Vision Theory and Applications, 2017.
  • [15] C. Godard, K. Matzen, and M. Uyttendaele. Deep burst denoising. arXiv, 2017.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
  • [17] Y. Huang, W. Wang, and L. Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 235–243. Curran Associates, Inc., 2015.
  • [18] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
  • [19] V. Jain and S. Seung. Natural image denoising with convolutional networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 769–776. Curran Associates, Inc., 2009.
  • [20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [21] E. Kobler, T. Klatzer, K. Hammernik, and T. Pock. Variational networks: Connecting variational methods and deep learning. In V. Roth and T. Vetter, editors, Pattern Recognition, pages 281–293, Cham, 2017. Springer International Publishing.
  • [22] M. Lebrun, A. Buades, and J.-M. Morel. A nonlocal bayesian image denoising algorithm. SIAM Journal on Imaging Sciences, 2013.
  • [23] S. Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5882–5891, July 2017.
  • [24] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4473–4481, Oct 2017.
  • [25] M. Maggioni, G. Boracchi, A. Foi, and K. Egiazarian. Video Denoising Using Separable 4D Nonlocal Spatiotemporal Transforms. In Proc. of SPIE, 2011.
  • [26] M. Maggioni, G. Boracchi, A. Foi, and K. Egiazarian. Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms. IEEE Transactions on Image Processing, 2012.
  • [27] X. Mao, C. Shen, and Y.-B. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2802–2810. Curran Associates, Inc., 2016.
  • [28] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll. Burst denoising with kernel prediction networks. In CVPR, 2018.
  • [29] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017.
  • [30] P. Qiao, Y. Dou, W. Feng, R. Li, and Y. Chen. Learning non-local image diffusion for image denoising. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, pages 1847–1855, New York, NY, USA, 2017. ACM.
  • [31] Y. Romano, M. Elad, and P. Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
  • [32] S. Roth and M. J. Black. Fields of experts: a framework for learning image priors. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 860–867 vol. 2, June 2005.
  • [33] M. S. M. Sajjadi, R. Vemulapalli, and M. Brown. Frame-Recurrent Video Super-Resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [34] V. Santhanam, V. I. Morariu, and L. S. Davis. Generalized deep image to image regression. CoRR, abs/1612.03268, 2016.
  • [35] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2781, June 2014.
  • [36] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang. Deep video deblurring for hand-held cameras. In IEEE CVPR, 2017.
  • [37] J. Sun and M. F. Tappen. Learning non-local range markov random field for image restoration. In CVPR 2011, pages 2745–2752, June 2011.
  • [38] Y. Tai, J. Yang, X. Liu, and C. Xu. Memnet: A persistent memory network for image restoration. 2017 IEEE International Conference on Computer Vision (ICCV), pages 4549–4557, 2017.
  • [39] R. Vemulapalli, O. Tuzel, and M. Liu. Deep gaussian conditional random field network: A model-based deep network for discriminative denoising. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4801–4809, June 2016.
  • [40] B. Wen, Y. Li, L. Pfister, and Y. Bresler. Joint adaptive sparsity and low-rankness on the fly: an online tensor reconstruction scheme for video denoising. In IEEE ICCV, 2017.
  • [41] D. Yang and J. Sun. Bm3d-net: A convolutional neural network for transform-domain collaborative filtering. IEEE Signal Processing Letters, 25(1):55–59, Jan 2018.
  • [42] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 7 2017.
  • [43] K. Zhang, W. Zuo, and L. Zhang. FFDNet: Toward a Fast and Flexible Solution for CNN-based Image Denoising. CoRR, abs/1710.0, 2017.
  • [44] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pages 479–486, Nov 2011.