Image capturing systems are inherently degraded by noise, including photon shot noise and sensor read noise. This problem is exacerbated for images and videos captured in low-light scenarios or by the small-aperture cameras of cellphones. It is therefore important to study denoising algorithms that produce high-quality images and video frames [3, 6, 16, 20, 18, 22].
Most traditional denoising methods achieve good results by selecting and averaging pixels in the image, and how suitable pixels are selected and how the averaging weights are computed are the key factors that distinguish different denoising approaches. As a typical example, the bilateral smoothing model samples pixels in a local square region and calculates the weights with Gaussian functions. BM3D searches for relevant pixels by block matching, and the averaging weights are decided by an empirical Wiener filter. However, these methods usually rely on hand-crafted schemes for pixel sampling and weighting, which do not always work well in complex scenarios.
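As a concrete reference point for this selecting-and-averaging view, the following sketch implements a minimal bilateral filter in NumPy; the window radius and the two Gaussian bandwidths are illustrative choices, not values from any particular method:

```python
import numpy as np

def bilateral_denoise(img, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Minimal bilateral filter: sample pixels from a local square window
    and average them with Gaussian spatial and range (intensity) weights."""
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            patch = img[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # spatial weight: closer pixels count more
            spatial = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
            # range weight: pixels with similar intensity count more
            rng = np.exp(-((patch - img[y, x]) ** 2) / (2 * sigma_r ** 2))
            weights = spatial * rng
            out[y, x] = (weights * patch).sum() / weights.sum()
    return out
```

Edge-aware averaging of this kind reduces noise variance while preserving strong intensity boundaries, which is exactly the hand-crafted selecting-and-weighting scheme the learned kernels later replace.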
Different from the traditional methods, deep neural networks have also been used for image denoising [31, 26]. These data-driven models exploit the natural image priors within large amounts of training data to learn the mapping function from the noisy image to the desired clear output, which helps them achieve better results than the traditional methods. However, deep learning based approaches do not explicitly manipulate input pixels with weighted averaging, and directly synthesizing results with deep networks and spatially-invariant convolution kernels can lead to corrupted image textures and over-smoothing artifacts.
To solve the aforementioned problems, we propose to explicitly learn the selecting and averaging process for image denoising in a data-driven manner. Specifically, we use deep convolutional neural networks (CNNs) to estimate a 2D deformable convolution kernel (the patch in the middle of Figure 2(c)) for each pixel in the noisy image. Both the sampling locations and the weights of the kernel are learned, corresponding respectively to the pixel selecting and weighting strategies of the traditional denoising models. The advantage of our deformable kernel is twofold. On one hand, the proposed approach improves the classical averaging process by learning from data, in sharp contrast to the hand-crafted schemes. On the other hand, our model directly filters the noisy input, which constrains the output space and thereby reduces the artifacts of other deep learning based approaches. Note that while one could simply use a normal kernel similar to KPN to sample pixels from a rigid grid (Figure 2(b)), this often leads to a limited receptive field and cannot efficiently exploit the structural information in the images; moreover, irrelevant sampling locations in the rigid kernel may harm the performance. By contrast, our deformable kernel naturally adapts to the image structures and is able to increase the receptive field without sampling more pixels.
Beyond the single image case, we can also use the proposed method for video denoising, and a straightforward way to do so is to apply the 2D deformable kernels on each frame separately, as shown in Figure 2(c). However, this simple 2D strategy has difficulties in handling videos with large motion, where few reliable pixels can be found in the neighboring frames of Figure 2. To overcome this limitation, we need to distribute more sampling locations on the frames with higher reliability (the reference frame) and avoid the frames with severe motion; this requires our algorithm to be able to search for pixels across the spatial-temporal space of the input videos. Thus, instead of predicting 2D kernels for the pixels in the noisy input, we develop 3D deformable kernels (Figure 2(d)) for each pixel in the desired output to adaptively select the most informative pixels in the spatial-temporal space of videos. The proposed kernel naturally solves the large-motion issue by capturing dependencies between 3D locations and sampling on more reliable frames, as illustrated in Figure 2(d). Furthermore, our method effectively deals with the misalignment caused by dynamic scenes and reduces the cluttered boundaries and ghosting artifacts of existing video denoising approaches [20, 22], as shown in Figure 1.
In this paper, we make the following contributions. First, we establish the connection between traditional denoising methods and the deformable convolution, and propose a new method with deformable kernels for image denoising that explicitly learns the classical selecting and averaging process. Second, we extend the proposed 2D kernels to the spatial-temporal space to better deal with large motion in video denoising, which further reduces artifacts and improves performance. We also introduce annealing terms to facilitate training the 3D kernels. In addition, we provide a comprehensive analysis of how the deformable kernel helps improve denoising results. Extensive experiments on both synthetic and real-world data demonstrate that our method compares favorably against the state of the art on both single image and video inputs.
2 Related Work
We discuss the state-of-the-art denoising methods as well as recent works on learning flexible convolutional operations, and put the proposed algorithm in proper context.
Image and video denoising. Most state-of-the-art denoising methods rely on pixel sampling and weighted averaging [9, 29, 3, 6]. Gaussian and bilateral smoothing models use pixels from a local window and obtain averaging weights from Gaussian functions. NLM samples pixels globally and decides the weights with patch similarities. BM3D selects pixels with block matching and uses transform-domain collaborative filtering for averaging. Since video frames can provide more information than a single image, VBM3D and VBM4D extend BM3D to videos by grouping more similar patches in higher dimensions. In addition, optical flow has been exploited in video denoising methods [17, 18]. However, fast and reliable flow estimation remains a challenging problem.
More recently, deep CNNs with residual connections have been used to directly learn a mapping function from the noisy input to the denoised result. To learn the mapping function for multi-frame input, RNNs [5, 8] have also been used to exploit the temporal structure of videos. While these networks are effective in removing noise, the activation functions employed in them can lead to information loss, and directly synthesizing images with deep neural networks tends to cause oversmoothing artifacts. Mildenhall et al. first used a deep CNN to predict normal kernels for denoising. However, normal kernels have a rigid sampling grid and cannot handle misalignment from camera shake and dynamic scenes well. Instead, we propose deformable 2D and 3D kernels which enable free-form pixel sampling and naturally handle these issues.
Learning flexible convolutions. In deep CNNs, the convolutional operation is defined as a weighted summation over a grid of pixels sampled from images or feature maps. Normal convolution kernels usually apply a fixed sampling grid and fixed convolution weights at every location of all inputs. Recently, several approaches have been developed for more flexible convolutional operations [13, 12, 11, 7], whose flexibility comes from either the weighting or the sampling scheme. On one hand, Jia et al. improve the weighting strategy with a dynamic filter network. Following this work, similar ideas have been explored for video interpolation and video denoising. On the other hand, more flexible sampling methods have also been developed for images or videos. However, these works do not allow free-form convolution, as each point in the desired output samples only one location. Dai et al. propose spatially deformable kernels for object detection which consider 2D geometric transformations. But this method cannot sample pixels from the temporal space, and is thereby not suitable for video input. Moreover, it uses fixed convolution weights at different locations, which can lead to oversmoothing artifacts like a Gaussian filter. By contrast, our method enables adaptive sampling in the spatial-temporal space while the kernel weights are also learned, which is consistent with the selecting and averaging process of classical denoising methods.
3 Proposed Algorithm
We propose a novel framework with deformable convolution kernels for image and video denoising. Different from normal kernels, which have a rigid sampling grid and fixed convolution weights, we use deformable grids and dynamic weights for the proposed kernels, which correspond to the pixel selecting and weighting process of classical denoising methods. The deformations of the grids can be represented as offsets added to the rigid sampling locations.
An overview of the proposed algorithm is shown in Figure 3. We first train a deep CNN for estimating the offsets of the proposed kernels. Then we sample pixels from the noisy input according to the predicted offsets, and estimate the kernel weights with the concatenation of the sampled pixels, the noisy input and the features of the offset network. Finally, we can generate the denoised output by convolving the sampled pixels with the learned kernel weights.
|Layer name||offset network||conv layers|
|number of feature maps||64||128||256||512||512||512||256||128||128||3||64|
3.1 Learning to Sample and Average Pixels
For a noisy input $X \in \mathbb{R}^{H \times W}$, where $H$ and $W$ represent the height and width, the weighted averaging process for image denoising can be formulated as:

$$\hat{Y}(p) = \sum_{k=1}^{n} w_k(p)\, X\big(p + r_k(p)\big), \qquad (1)$$

where $p$ is a pixel on the denoised output $\hat{Y}$, $\mathcal{R}(p) = \{r_k(p)\}_{k=1}^{n}$ represents the sampling grid with $n$ sampling locations, and $\{w_k(p)\}_{k=1}^{n}$ are the weights for averaging the sampled pixels. For example,

$$\mathcal{R} = \{(-1,-1),\, (-1,0),\, \ldots,\, (1,1)\} \qquad (2)$$

defines a spatially-invariant rigid grid of size $3 \times 3$ with $n = 9$.
In the proposed deformable kernels, the sampling grid is generated by adding predicted offsets $\Delta r_k(p)$ to the rigid locations:

$$r_k(p) = \hat{r}_k + \Delta r_k(p), \quad k = 1, \ldots, n, \qquad (3)$$

where $\hat{r}_k$ denotes the $k$-th location of the rigid grid. Note that both $w_k$ and $r_k$ are functions of $p$, which indicates that our deformable kernels are spatially-variant. Since the offsets in $\Delta r_k(p)$ are usually fractional, we use bilinear interpolation to sample the pixels:

$$X(s) = \sum_{q} \max\big(0,\, 1 - |s_y - q_y|\big)\,\max\big(0,\, 1 - |s_x - q_x|\big)\, X(q), \qquad (4)$$

where $s = p + r_k(p)$ is a fractional sampling location, $q$ enumerates the integer pixel locations of $X$, and only the four pixels closest to $s$ contribute to the interpolated value.

After the adaptive sampling, we can recover the clear output by convolving the sampled pixels with the learned kernel weights as in (1). Note that the weights $w_k(p)$ are also spatially-variant and depend on the input, in contrast to normal CNNs with fixed, uniform convolution kernels.
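The sampling-then-averaging step for one output pixel can be sketched as below. Here `grid`, `offsets`, and `weights` stand in for the rigid grid, the predicted offsets, and the learned kernel weights; in the actual model the latter two are produced by the CNN, while in this sketch they are plain arrays:

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample img at fractional coordinates (ys, xs) with bilinear interpolation."""
    h, w = img.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    dy, dx = ys - y0, xs - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] +
            (1 - dy) * dx * img[y0, x0 + 1] +
            dy * (1 - dx) * img[y0 + 1, x0] +
            dy * dx * img[y0 + 1, x0 + 1])

def deformable_filter_pixel(img, y, x, grid, offsets, weights):
    """One output pixel: rigid grid + predicted offsets give fractional
    sampling locations; the sampled pixels are averaged with learned weights."""
    ys = y + grid[:, 0] + offsets[:, 0]
    xs = x + grid[:, 1] + offsets[:, 1]
    samples = bilinear_sample(img, ys, xs)
    return (weights * samples).sum()
```

Because the bilinear interpolation is piecewise linear in the offsets, gradients flow back through both the weights and the sampling locations during training.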
3D deformable kernels.
We can also use the proposed method for video denoising. Suppose that we have a noisy video sequence $\{X_{-T}, \ldots, X_0, \ldots, X_{T}\}$, where $X_0$ is the reference frame. A straightforward way to process this input is to apply the above 2D kernels on each frame separately:

$$\hat{Y}(p) = \sum_{t=-T}^{T} \sum_{k=1}^{n} w_{t,k}(p)\, X_t\big(p + r_{t,k}(p)\big). \qquad (5)$$

However, this simple strategy has problems in dealing with large motion, as illustrated in Figure 2. To solve this problem, we develop 3D deformable kernels (Figure 4(a)) which can more efficiently distribute the sampling locations across the spatial-temporal space. The 3D kernel directly takes the concatenated video frames as input, and we can formulate the filtering process as:

$$\hat{Y}(p) = \sum_{k=1}^{N} w_k(p)\, X\big(p + r_k(p),\, \tau_k(p)\big), \qquad (6)$$

where $\tau_k(p)$ denotes the sampling coordinate in the temporal dimension, $X(\cdot,\, t)$ denotes frame $X_t$, and $N$ is the number of pixels of the 3D kernel.
Similar to (3)-(4), we generate the sampling grid by predicting 3D offsets. Furthermore, to sample pixels across the video frames, we introduce trilinear interpolation, which can be computed as:

$$X(s) = \sum_{q} \max\big(0,\, 1 - |s_t - q_t|\big)\,\max\big(0,\, 1 - |s_y - q_y|\big)\,\max\big(0,\, 1 - |s_x - q_x|\big)\, X(q), \qquad (7)$$

where $s$ is a fractional spatial-temporal location and only the eight pixels closest to $s$ in the 3D space of the video contribute to the interpolated result. Since the trilinear sampling mechanism is differentiable, we can learn the deformable 3D kernels with backpropagation in an end-to-end manner. The derivatives of this sampler are shown in Appendix A.
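A minimal NumPy sketch of the trilinear sampler, in which only the eight voxels nearest the fractional location contribute:

```python
import numpy as np

def trilinear_sample(video, t, y, x):
    """Sample a (T, H, W) volume at a fractional (t, y, x) location.
    Only the 8 voxels nearest the point receive nonzero weight."""
    T, H, W = video.shape
    t0 = int(np.clip(np.floor(t), 0, T - 2))
    y0 = int(np.clip(np.floor(y), 0, H - 2))
    x0 = int(np.clip(np.floor(x), 0, W - 2))
    dt, dy, dx = t - t0, y - y0, x - x0
    val = 0.0
    # accumulate the 8 corner voxels, each weighted by its linear distance terms
    for it, wt in ((t0, 1 - dt), (t0 + 1, dt)):
        for iy, wy in ((y0, 1 - dy), (y0 + 1, dy)):
            for ix, wx in ((x0, 1 - dx), (x0 + 1, dx)):
                val += wt * wy * wx * video[it, iy, ix]
    return val
```

At integer coordinates the sampler returns the voxel value exactly, and between voxels it interpolates linearly along each axis, which is what makes the sampling differentiable with respect to the predicted 3D offsets.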
3.2 Network Architecture
The offset network in Figure 3 takes a single image as input for image denoising, and a sequence of neighboring frames for video denoising. As shown in Figure 4(b), we adopt a U-Net architecture, which has been widely used in pixel-wise estimation tasks [4, 30]. The U-Net is an encoder-decoder network where the encoder sequentially transforms the input frames into lower-resolution feature embeddings, and the decoder correspondingly expands the features back to full-resolution estimates. We perform pixel-wise summation through skip connections between congruent layers in the encoder and decoder to jointly use low-level and high-level features for the estimation. Furthermore, we concatenate the sampled pixels, the noisy input, and the features from the last layer of the offset network, and feed them to three convolution layers to estimate the kernel weights (Figure 3). All convolution layers share the same kernel size and stride configuration, and the number of feature maps for each layer of our network is shown in Table 1. We use ReLU as the activation function for all convolution layers except the last one, which is followed by Tanh to output normalized offsets. As the proposed estimation network is fully convolutional, it can handle inputs of arbitrary spatial size during inference.
3.3 Loss Function
With the predicted result $\hat{Y}$ and the ground truth image $Y$ in linear space, we can simply use an $\ell_2$ loss to train our network for single image denoising:

$$\mathcal{L}_s(\hat{Y}, Y) = \big\| \Gamma(\hat{Y}) - \Gamma(Y) \big\|_2^2, \qquad (8)$$

where $\Gamma(\cdot)$ denotes gamma correction, which is performed to emphasize errors in darker regions and generate more perceptually pleasant results.
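A toy version of such a loss, assuming an L2 penalty computed in gamma space with an sRGB-style exponent of 1/2.2 (both the exponent and the choice of norm are assumptions for illustration):

```python
import numpy as np

GAMMA = 1.0 / 2.2  # assumed sRGB-style exponent

def gamma_correct(x, eps=1e-8):
    """Map a linear-space image to gamma space; dark tones are expanded,
    so errors in dark regions are penalized more heavily."""
    return np.clip(x, eps, None) ** GAMMA

def denoise_loss(pred_linear, gt_linear):
    """L2 loss between gamma-corrected prediction and ground truth."""
    diff = gamma_correct(pred_linear) - gamma_correct(gt_linear)
    return np.mean(diff ** 2)
```

Because the gamma curve is steep near zero, the same linear-space error costs more in a dark region than in a bright one, which is the stated motivation for computing the loss after gamma correction.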
Regularization term for video denoising. Since the deformable 3D kernel samples pixels across the video frames, the network may get stuck in a local minimum during training where all the sampling locations lie around the reference frame. To avoid this problem and encourage the network to exploit more temporal information, we introduce a regularization term that has subsets of the sampled pixels individually learn the 3D filtering.
Specifically, we split the $N$ sampling locations of the 3D grid into $G$ groups $\{\mathcal{S}_1, \ldots, \mathcal{S}_G\}$, where each group consists of $N/G$ points. Similar to (6), the filtering result of the $g$-th pixel group can be calculated as:

$$\hat{Y}_g(p) = G \sum_{k \in \mathcal{S}_g} w_k(p)\, X\big(p + r_k(p),\, \tau_k(p)\big),$$

where the multiplier $G$ is used to match the scale of $\hat{Y}$. With the single image loss above, denoted $\mathcal{L}_s$, applied to each $\hat{Y}_g$ for regularization, we set our final loss function for video denoising as:

$$\mathcal{L} = \mathcal{L}_s(\hat{Y}, Y) + \beta\, \alpha^{i} \sum_{g=1}^{G} \mathcal{L}_s(\hat{Y}_g, Y). \qquad (11)$$

The regularization of each $\hat{Y}_g$ is slowly reduced during training, where the hyperparameters $\beta$ and $\alpha < 1$ control the annealing process and $i$ is the iteration number. At the beginning of the network optimization, $\alpha^{i} \approx 1$ and the second term is prominent, which encourages the network to find the most informative pixels for each subset of the 3D kernel. This constraint disappears as $i$ grows, and the whole sampling grid learns to rearrange the sampling locations so that all the filtering groups, i.e., the different parts of the learned 3D kernel, work collaboratively.
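The annealing schedule can be sketched as a multiplicative decay on the per-group term; the concrete values of `beta` and `alpha` below are illustrative placeholders, not the paper's settings:

```python
def annealed_weight(i, beta=100.0, alpha=0.9998):
    """Coefficient on the per-group regularization loss at iteration i.
    Large early on (each kernel subgroup must denoise on its own),
    decaying toward zero so the full kernel takes over."""
    return beta * alpha ** i

def video_loss(final_loss, group_losses, i, beta=100.0, alpha=0.9998):
    """Total training loss: main loss plus the annealed per-group terms."""
    return final_loss + annealed_weight(i, beta, alpha) * sum(group_losses)
```

Early in training the per-group terms dominate, forcing each subset of sampling locations to be useful on its own; as the weight decays, the groups are free to specialize and cooperate.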
|Reference frame||26.75 / 0.6891||28.08 / 0.7333||27.37 / 0.5842||27.96 / 0.7064||27.54 / 0.6782||22.83 / 0.5403||23.94 / 0.5730||23.00 / 0.3746||23.97 / 0.5598||23.43 / 0.5119|
|NLM ||31.04 / 0.8838||31.51 / 0.9025||33.35 / 0.8687||31.71 / 0.8663||31.90 / 0.8803||28.21 / 0.8236||28.57 / 0.8443||30.62 / 0.8076||28.73 / 0.8040||29.03 / 0.8199|
|BM3D ||33.00 / 0.9196||32.63 / 0.9245||35.16 / 0.9172||33.09 / 0.9028||33.47 / 0.9160||29.96 / 0.8793||29.81 / 0.8836||32.30 / 0.8766||30.27 / 0.8609||30.59 / 0.8751|
|DnCNN ||35.30 / 0.9499||34.54 / 0.9498||37.45 / 0.9436||36.22 / 0.9494||35.88 / 0.9482||32.30 / 0.9163||31.54 / 0.9124||34.55 / 0.9048||33.26 / 0.9148||32.91 / 0.9121|
|KPN ,||35.23 / 0.9526||34.38 / 0.9493||37.50 / 0.9451||36.18 / 0.9526||35.82 / 0.9499||32.32 / 0.9198||31.44 / 0.9120||34.74 / 0.9085||33.28 / 0.9200||32.94 / 0.9151|
|KPN ,||35.23 / 0.9534||34.38 / 0.9500||37.54 / 0.9460||36.16 / 0.9536||35.83 / 0.9508||32.36 / 0.9222||31.46 / 0.9136||34.80 / 0.9110||33.30 / 0.9220||32.98 / 0.9172|
|Ours-2D||35.40 / 0.9535||34.57 / 0.9507||37.64 / 0.9465||36.41 / 0.9538||36.01 / 0.9511||32.49 / 0.9226||31.62 / 0.9153||34.89 / 0.9121||33.51 / 0.9232||33.13 / 0.9183|
|KPN , , blind||35.18 / 0.9492||34.20 / 0.9484||37.39 / 0.9438||36.05 / 0.9508||35.71 / 0.9480||32.23 / 0.9182||31.37 / 0.9107||34.63 / 0.9073||33.17 / 0.9183||32.85 / 0.9136|
|DnCNN, blind||35.19 / 0.9500||34.38 / 0.9479||37.28 / 0.9417||36.06 / 0.9491||35.73 / 0.9472||32.19 / 0.9158||31.42 / 0.9105||34.40 / 0.9023||33.08 / 0.9135||32.77 / 0.9105|
|Ours-2D, blind||35.33 / 0.9531||34.55 / 0.9508||37.57 / 0.9458||36.35 / 0.9538||35.95 / 0.9509||32.44 / 0.9224||31.62 / 0.9152||34.81 / 0.9109||33.46 / 0.9215||33.08 / 0.9175|
|Direct average||22.75 / 0.6880||25.70 / 0.7777||25.15 / 0.6701||23.47 / 0.6842||25.27 / 0.7050||21.96 / 0.6071||24.78 / 0.6934||24.34 / 0.5466||22.81 / 0.6055||23.47 / 0.6132|
|VBM4D||33.26 / 0.9326||34.00 / 0.9469||35.83 / 0.9347||34.01 / 0.9327||34.27 / 0.9367||30.34 / 0.8894||31.28 / 0.9089||32.66 / 0.8881||31.33 / 0.8925||31.40 / 0.8947|
|KPN ,||35.61 / 0.9597||35.25 / 0.9637||38.18 / 0.9529||36.45 / 0.9604||36.37 / 0.9592||32.92 / 0.9344||32.56 / 0.9358||35.59 / 0.9223||33.80 / 0.9355||33.72 / 0.9320|
|KPN ,||35.64 / 0.9603||35.23 / 0.9646||38.30 / 0.9542||36.49 / 0.9623||36.41 / 0.9604||32.95 / 0.9336||32.61 / 0.9377||35.72 / 0.9246||33.88 / 0.9374||33.79 / 0.9333|
|Ours-2D,||35.66 / 0.9576||35.82 / 0.9656||38.19 / 0.9518||36.80 / 0.9609||36.62 / 0.9590||32.94 / 0.9309||33.09 / 0.9380||35.59 / 0.9208||34.15 / 0.9365||33.94 / 0.9315|
|Ours-3D,||36.02 / 0.9618||35.80 / 0.9666||38.78 / 0.9580||37.04 / 0.9624||36.91 / 0.9622||33.29 / 0.9372||33.05 / 0.9400||36.17 / 0.9301||34.40 / 0.9390||34.23 / 0.9366|
|KPN , , blind||35.44 / 0.9577||35.03 / 0.9605||38.03 / 0.9506||36.30 / 0.9586||36.20 / 0.9569||32.73 / 0.9302||32.36 / 0.9312||35.39 / 0.9185||33.61 / 0.9309||33.52 / 0.9277|
|Ours-3D, , blind||35.70 / 0.9590||35.47 / 0.9633||38.35 / 0.9538||36.67 / 0.9615||36.55 / 0.9594||33.02 / 0.9327||32.79 / 0.9348||35.78 / 0.9239||34.09 / 0.9361||33.92 / 0.9319|
4 Experimental Results
In this section, we evaluate the proposed method both quantitatively and qualitatively. The source code, data, and the trained models will be made available to the public.
Datasets. For video denoising, we collect high-quality long videos from the Internet with high resolutions and frame rates. We use part of the long videos for training and the rest for testing, and split them into non-overlapping scenes. From the videos of the different scenes, we extract K sequences for training, where each sequence consists of consecutive frames. Our test dataset is composed of subsets, each with sequences sampled from the testing videos; the sequences used for testing do not overlap. In addition, we simply use the middle frame of each sequence from the video datasets for both training and testing in single image denoising.
Similar to previous work, we generate the noisy input for our models by first performing inverse gamma correction and then adding signal-dependent Gaussian noise $n \sim \mathcal{N}\big(0,\, \sigma_r^2 + \sigma_s\, y\big)$, where $y$ represents the intensity of the pixel, and the noise parameters $\sigma_r$ and $\sigma_s$ are randomly sampled from fixed ranges. In our experiments, we train the networks in both blind and non-blind manners. For the non-blind version, the parameters $\sigma_r$ and $\sigma_s$ are assumed to be known, and the noise level is fed into the network as an additional channel of the input. We estimate the noise level as $\hat{\sigma}(x) = \sqrt{\sigma_r^2 + \sigma_s \max(x, 0)}$, where $x$ represents the intensity value of the reference frame in video denoising or of the input image in single frame denoising.
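A small sketch of this signal-dependent noise synthesis and the corresponding per-pixel noise-level estimate; the parameter values in the usage below are illustrative, not the sampling ranges used for training:

```python
import numpy as np

def add_signal_dependent_noise(clean, sigma_r, sigma_s, rng):
    """Synthesize a noisy observation with read noise (sigma_r) and
    signal-dependent shot noise (sigma_s): variance = sigma_r^2 + sigma_s * y."""
    var = sigma_r ** 2 + sigma_s * clean
    return clean + rng.standard_normal(clean.shape) * np.sqrt(var)

def estimate_noise_level(noisy, sigma_r, sigma_s):
    """Per-pixel noise std estimated from the noisy frame itself,
    clamping negative intensities before the square root."""
    return np.sqrt(sigma_r ** 2 + sigma_s * np.maximum(noisy, 0.0))
```

Estimating the noise level from the noisy frame rather than the unknown clean image is what makes the estimate available at inference time for the non-blind models.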
Training and parameter settings. We learn deformable kernels for single image denoising. For video input, we use a smaller kernel size for the deformable 3D kernels to save GPU memory. We set the annealing hyperparameters $\beta$ and $\alpha$ empirically. During training, we use the Adam optimizer with a fixed initial learning rate, which we decrease by a constant factor per epoch until it reaches a minimum value. We randomly crop patches from the original input for training the single image model. In video denoising, we crop at the same location in all the input frames, so that each training sample consists of aligned patches from consecutive frames. We train the denoising networks until convergence, which roughly takes 50 hours.
Comparisons on the synthetic dataset. We compare the proposed algorithm with state-of-the-art image and video denoising methods [22, 20, 6, 3, 31] on the synthetic dataset at different noise levels. We conduct exhaustive hyper-parameter tuning for NLM, BM3D, and VBM4D, including both blind and non-blind versions, and report the best-performing results. We also train KPN and DnCNN on our datasets with the same settings. While KPN is originally designed for multi-frame input, we also adapt it to single images for a more comprehensive evaluation by changing the network input.
As shown in Tables 2 and 3, the proposed algorithm achieves consistently better results on both single image and video denoising in terms of both PSNR and structural similarity (SSIM), compared with the state-of-the-art methods on all the subsets with different noise levels. Even our blind model achieves competitive performance, while the other methods rely on oracle noise parameters for good results. In addition, KPN uses rigid kernels for video denoising and thus cannot effectively learn the pixel selecting process. Due to the inappropriate sampling locations, simply enlarging the kernel size of KPN does not lead to significant improvement, as shown in Tables 2 and 3.
Figure 5 shows examples of image and video denoising on the synthetic dataset. Traditional methods [6, 3, 20] with hand-crafted sampling and weighting strategies do not perform well and generate severe artifacts. In particular, VBM4D selects pixels using an $\ell_2$ norm to measure patch similarities, which tends to produce oversmoothed results, as shown in Figure 5(j). On the other hand, directly synthesizing the results with deep CNNs can lead to corrupted structures and lost details (Figure 5(e)). Moreover, KPN learns rigid kernels for video denoising, which cannot deal with misalignments larger than the rigid sampling grid allows. When the misalignment is beyond this limit, the learned weights either degenerate into a single image filter, which leads to oversmoothed results (Figure 5(k)), or become disoriented, which causes ghosting artifacts around high-contrast boundaries, as shown in Figure 6. By contrast, we learn the classical denoising process in a data-driven manner and achieve clearer results with fewer artifacts (Figure 5(f), (l) and Figure 6(g)).
Temporal consistency. To better evaluate the results on videos, we show temporal comparisons in Figure 7 using the temporal profiles of the 1D slice highlighted by a dashed red line, traced through 60 consecutive frames, which demonstrate the better temporal consistency of the proposed method.
5 Discussion and Analysis
Ablation study. In this section, we conduct an ablation analysis of the different components of our algorithm. We show the PSNR and SSIM for six variants of our model in Table 4, where “our full model” denotes our default setting. On the first row, the “direct” model uses the offset network in Figure 4 to directly synthesize the denoised output and cannot produce high-quality results. To learn the weighting strategies of the classical models, we use dynamic weights for the proposed deformable kernels; as shown in the second row of Table 4, learning the model without dynamic weights significantly degrades the denoising performance. On the third and fourth rows, we learn rigid 3D kernels of different kernel sizes. The results show that learning the pixel sampling strategy is important for learning the classical denoising process and improving denoising performance. In addition, the fifth row shows that the annealing term is also beneficial for model training, and combining all components eventually gives the best results. Note that our kernel can sample pixels from a large receptive field (typically 4-15 pixels in our experiments), and further increasing the size of the deformable 3D kernel only marginally improves the performance. Thus, we choose the smaller kernel size as our default setting to ease the computational load.
|w/o dynamic weights||35.50||0.9449||32.60||0.9058|
|w/o annealing term||36.16||0.9601||33.48||0.9341|
|our full model||36.91||0.9622||34.23||0.9366|
|our full model||36.88||0.9631||34.25||0.9379|
Effectiveness of the deformable 3D kernels. As illustrated in Figure 2, the proposed deformable 3D kernel samples pixels across the spatial-temporal space and thus better handles videos with large motion. To further verify the effectiveness of the 3D kernels on large motion, we compare the performance of our deformable 2D and 3D kernels under different motion levels. Specifically, we sample high-frame-rate video clips with large motion from the Adobe240 dataset. We then temporally downsample the high-frame-rate videos to obtain test subsets at different frame rates, where each subset contains 180 input sequences. Note that different frame rates correspond to different motion levels, and all the subsets use the same reference frames. We use one deformable 3D kernel for each pixel in the output, while applying multiple 2D kernels to process each frame individually. As shown in Figure 8, the performance gap between the 2D and 3D kernels grows as the frame rate decreases, which demonstrates the superiority of spatial-temporal sampling under large motion. We also notice that both kernels achieve better results (smaller MSE) at higher frame rates, which shows the importance of exploiting temporal information in video denoising.
Effectiveness of the regularization term in (11). Figure 9 shows the distributions of the sampling locations along the time dimension on the test dataset. Directly optimizing the loss without the annealing term often leads to undesirable local minima where most of the sampling locations cluster around the reference frame, as shown in Figure 9(b). By adding the regularization term, the network is forced to search for more informative pixels across a larger temporal range, which helps avoid these local minima (Figure 9(a)).
Visualization of the learned kernels. We show an example in Figure 10 to visualize the learned deformable kernels. Compared to the fixed kernels (blue points in Figure 10(a)), our deformable kernel learns to sample pixels along the black edge across different frames, thereby effectively enlarging the receptive field and reducing the interference of inappropriate pixels. This ability to sample both spatially and temporally is crucial for our method to recover clean structures and details.
Generalization to real images. We compare our method with state-of-the-art denoising approaches [6, 31, 22, 20] on real images and video sequences captured by cellphones, shown in Figures 1 and 11. Although trained on synthetic data, our model is able to recover subtle edges from the real-captured noisy input and handles misalignment from large motions well.
6 Conclusions
In this work, we propose to explicitly learn the classical selecting and averaging process for image and video denoising with deep neural networks. The proposed method adaptively selects pixels from the 2D or 3D input, which handles misalignment from dynamic scenes well and enables a large receptive field while preserving details. In addition, we introduce new techniques for better training the proposed model. Our method achieves a running speed of 0.27 megapixels per second on a GTX 1060 GPU, compared to 0.3 megapixels per second for KPN. We will explore more efficient network structures for estimating the denoising kernels in future work.
Appendix A Derivatives of the Deformable Convolution
As introduced in Figure 3, the result of the deformable convolution depends on the learned kernel weights $w_k(p)$ and the sampling offsets $\Delta r_k(p)$. For training the proposed neural network, we need the derivatives with respect to both of them, which can be derived as follows:

$$\frac{\partial \hat{Y}(p)}{\partial w_k(p)} = X\big(p + r_k(p)\big), \qquad \frac{\partial \hat{Y}(p)}{\partial \Delta r_k(p)} = w_k(p)\, \frac{\partial X(s)}{\partial s}\bigg|_{s = p + r_k(p)},$$

where the derivative of the interpolated sample $X(s)$ with respect to the fractional location $s$ follows directly from the piecewise-linear bilinear or trilinear interpolation formula.
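Analytic derivatives of this kind can be checked numerically; the sketch below compares the closed-form gradient of a bilinear sampler with respect to the fractional location against central finite differences:

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinear sample of img at fractional (y, x), assuming an interior cell."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x0 + 1]
            + dy * (1 - dx) * img[y0 + 1, x0] + dy * dx * img[y0 + 1, x0 + 1])

def bilinear_grad(img, y, x):
    """Analytic derivatives of the sampled value with respect to (y, x),
    obtained by differentiating the four-corner interpolation formula."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    d_dy = (-(1 - dx) * img[y0, x0] - dx * img[y0, x0 + 1]
            + (1 - dx) * img[y0 + 1, x0] + dx * img[y0 + 1, x0 + 1])
    d_dx = (-(1 - dy) * img[y0, x0] + (1 - dy) * img[y0, x0 + 1]
            - dy * img[y0 + 1, x0] + dy * img[y0 + 1, x0 + 1])
    return d_dy, d_dx
```

The same check extends to the trilinear case by adding a temporal axis; in both cases the gradient is piecewise constant within each interpolation cell.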
Appendix B Results on Color Videos
Since KPN only considers grayscale videos, we also use grayscale input in our paper for fair comparisons. To further evaluate the proposed deformable kernel on color videos, we process the R, G, and B channels separately with our network and provide a color example in Figure 12 for comparison. Note that our result has fewer color artifacts around edges.
-  M. Anderson, R. Motta, S. Chandrasekar, and M. Stokes. Proposal for a standard default color space for the internet—srgb. In Color and imaging conference, 1996.
-  T. Brooks, B. Mildenhall, T. Xue, J. Chen, D. Sharlet, and J. T. Barron. Unprocessing images for learned raw denoising. arXiv preprint arXiv:1811.11127, 2018.
-  A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In CVPR, 2005.
-  C. Chen, Q. Chen, J. Xu, and V. Koltun. Learning to see in the dark. In CVPR, 2018.
-  X. Chen, L. Song, and X. Yang. Deep rnns for video denoising. In Applications of Digital Image Processing XXXIX, 2016.
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. TIP, 16:2080–2095, 2007.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 2017.
-  C. Godard, K. Matzen, and M. Uyttendaele. Deep burst denoising. arXiv preprint arXiv:1712.05790, 2017.
-  R. C. Gonzalez and R. E. Woods. Digital image processing. Prentice hall New Jersey, 2002.
-  G. E. Healey and R. Kondepudy. Radiometric ccd camera calibration and noise estimation. TIP, 16:267–276, 1994.
-  T. Hyun Kim, M. S. Sajjadi, M. Hirsch, and B. Scholkopf. Spatio-temporal transformer network for video restoration. In ECCV, 2018.
-  M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
-  X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In NIPS, 2016.
-  H. Jiang, D. Sun, V. Jampani, M. Yang, E. G. Learned-Miller, and J. Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. CoRR, abs/1712.00080, 2017.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  K. Dabov, A. Foi, and K. Egiazarian. Video denoising by sparse 3d transform-domain collaborative filtering. In European signal processing conference, 2007.
-  C. Liu and W. T. Freeman. A high-quality video denoising algorithm based on reliable motion estimation. In ECCV, 2010.
-  Z. Liu, L. Yuan, X. Tang, M. Uyttendaele, and J. Sun. Fast burst images denoising. ACM Transactions on Graphics (TOG), 33:232, 2014.
-  M. Lundy and A. Mees. Convergence of an annealing algorithm. Mathematical programming, 34:111–124, 1986.
-  M. Maggioni, G. Boracchi, A. Foi, and K. Egiazarian. Video denoising, deblocking, and enhancement through separable 4-d nonlocal spatiotemporal transforms. TIP, 21:3952–3966, 2012.
-  A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
-  B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll. Burst denoising with kernel prediction networks. In CVPR, 2018.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
-  S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In CVPR, 2017.
-  T. Plotz and S. Roth. Benchmarking denoising algorithms with real photographs. In CVPR, 2017.
-  T. Remez, O. Litany, R. Giryes, and A. M. Bronstein. Deep class-aware image denoising. In Sampling Theory and Applications (SampTA), 2017 International Conference on, 2017.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 2015.
-  S. Su, G. Sapiro, M. Delbracio, J. Wang, W. Heidrich, and O. Wang. Deep video deblurring. In CVPR, 2017.
-  C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In ICCV, 1998.
-  X. Xu, D. Sun, S. Liu, W. Ren, Y.-J. Zhang, M.-H. Yang, and J. Sun. Rendering portraitures from monocular camera and beyond. In ECCV, 2018.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. TIP, 26:3142–3155, 2017.