Deep Parametric 3D Filters for Joint Video Denoising and Illumination Enhancement in Video Super Resolution

07/05/2022
by   Xiaogang Xu, et al.
The Chinese University of Hong Kong

Despite the quality improvement brought by recent methods, video super-resolution (SR) remains very challenging, especially for videos that are low-light and noisy. The current best solution is to sequentially employ the best models for video SR, denoising, and illumination enhancement, but doing so often lowers the image quality due to inconsistencies between the models. This paper presents a new parametric representation called the Deep Parametric 3D Filters (DP3DF), which incorporates local spatiotemporal information to enable simultaneous denoising, illumination enhancement, and SR efficiently in a single encoder-and-decoder network. Also, a dynamic residual frame is jointly learned with the DP3DF via a shared backbone to further boost the SR quality. We performed extensive experiments, including a large-scale user study, to show our method's effectiveness. Our method consistently surpasses the best state-of-the-art methods on all the challenging real datasets with top PSNR and user ratings, yet has a very fast run time.


Introduction

The goal of video super resolution (SR) is to produce high-resolution videos from low-resolution video inputs. While promising results have been demonstrated on general videos, existing approaches typically do not work well on videos that are low-light and noisy. Yet, such a setting is very common in practice, e.g., applying SR to enhance noisy videos taken in a dark and high-contrast environment.

Fundamentally, video denoising and video illumination enhancement are very different tasks from video SR: the former two deal with noise and brightness in videos, whereas the latter deals with the video resolution. Hence, to map a low-resolution, low-light, and noisy (LLN) video to a high-resolution, normal-light, and noise-free (HNN) video, the current best solution is to cascade the best network model of each task in a certain order.

However, doing so has several drawbacks. First, the network complexity is tripled, resulting in slow inference, as we have to run three separate network models for denoising, illumination enhancement, and SR in sequence. Also, as the three networks are trained separately, we cannot ensure their consistency, e.g., artifacts from a preceding denoising or illumination-enhancement network could be amplified by the subsequent SR network; see Fig. 1 (c). Alternatively, one may try to cascade and train all the networks end-to-end. Yet, existing networks for video SR, denoising, and illumination enhancement often take multiple frames as input and output only a single frame, so a subsequent network cannot obtain sufficient inputs from the preceding one. Also, these networks are rather complex, so it is hard to fine-tune them together for high performance.

Figure 1: An example frame (a) from a challenging underexposed video, enhanced by (b) a SOTA SR method Xiang et al. (2020); (c) a cascade of SOTA methods for video denoising Tassano et al. (2020), video illumination enhancement Zhang et al. (2021), and video SR Xiang et al. (2020); and (d) our approach. From the blown-up views, we can see that (d) is sharper, with more distinct contrast, less noise, and less aliasing than (b) & (c). Please zoom in to view the details.

Another approach is to use parallel branches with different purposes in one framework. However, as the branches are separate from one another, their connections are too weak for joint learning. Also, while the input to all branches is identical, the sizes of their outputs are inconsistent: the output of the SR branch is larger than its input, whereas the other branches keep the input size. Hence, it is worth considering how to achieve the various goals with one common branch and representation. Further, all the branches must be run at inference time to produce the final results, which is time-consuming.

In this paper, we present a new solution to map LLN videos to HNN videos within a single end-to-end network. The core of our solution is the Deep Parametric 3D Filter (DP3DF), a novel dynamic-filter representation formulated collectively for video SR, illumination enhancement, and denoising. To our knowledge, this is the first work to explore an efficient architecture for simultaneous video SR, denoising, and illumination enhancement. Beyond existing works with dynamic filters, our DP3DF considers the burst of adjacent frames. Hence, DP3DF can effectively exploit local spatiotemporal neighboring information and complete the mapping from an LLN video to an HNN video in a single encoder-and-decoder network. Also, we show that the general dynamic filters in existing works are just special cases of our DP3DF; see Section 4. Further, we set up an additional branch for learning dynamic residual frames on top of the core encoder-and-decoder network, so the backbone is shared for learning the DP3DF and the residual frames to promote the overall performance.

To demonstrate the quality of our method, we conducted comprehensive experiments to compare our method with a rich set of state-of-the-art methods on two public video datasets, SMID Chen et al. (2019) and SDSD Wang et al. (2021), which provide static and dynamic low- and normal-light video pairs. Through various quantitative and qualitative evaluations, including a large-scale user study with 80 participants, we show the effectiveness of our DP3DF framework over SOTA SR methods and also over different combinations of SOTA video methods for illumination enhancement, denoising, and SR. Our DP3DF framework consistently surpasses the SOTA methods with top PSNR and user ratings. In summary, our contributions are threefold:

  • This is the first exploration of directly mapping LLN to HNN videos within a single-stage end-to-end network.

  • This is the first work we are aware of that simultaneously achieves video SR, denoising, and illumination enhancement via our DP3DF representation.

  • Extensive experiments are conducted on two real-world video datasets, demonstrating our superior performance.

Related work

Video SR.  Video SR aims to reconstruct a high-resolution frame from a low-resolution frame together with the associated adjacent frames. The key problem is how to align the adjacent frames temporally with the center one. Several video SR methods Caballero et al. (2017); Tao et al. (2017); Sajjadi et al. (2018); Wang et al. (2018); Xue et al. (2019) use optical flow for an explicit temporal alignment. However, it is hard to obtain accurate flow, and the flow warping may introduce artifacts in the aligned frames. To leverage the temporal information, recurrent neural networks are adopted in some video SR methods Huang et al. (2017); Lim and Lee (2017), e.g., the convolutional LSTMs Shi et al. (2015). However, without an explicit temporal alignment, these RNN-based networks have limited capability in handling complex motions. Later, dynamic filters and deformable convolutions were exploited for temporal alignment. DUF Jo et al. (2018) utilizes a dynamic filter to implement simple temporal alignment without motion estimation, whereas TDAN Tian et al. (2020) and EDVR Wang et al. (2019c) employ deformable alignment at single- or multi-scale feature levels. Beyond the existing methods, we propose a new strategy that learns to construct dynamic filters, explicitly incorporating multi-frame information while avoiding feature alignment and optical-flow computation.

Video denoising.  Early approaches are mostly patch-based, e.g., V-BM4D Maggioni et al. (2012) and VNLB Arias and Morel (2018), which extend BM3D Dabov et al. (2007). Later, deep neural networks were explored for the task. Chen et al. (2016) propose the first RNN-based attempt at video denoising. Vogels et al. (2018) design a kernel-predicting neural network for denoising Monte-Carlo-rendered sequences. Tassano et al. (2019) propose DVDnet, which separates the denoising of a frame into two stages. More recently, Tassano et al. (2020) propose FastDVDnet to eliminate the dependence on motion estimation. Besides, some recent works focus on blind video denoising, e.g., Ehret et al. (2019) and Michele and Jan (2019).

Video illumination enhancement.  Learning-based low-light image enhancement has gained increasing attention recently Yan et al. (2014, 2016); Lore et al. (2017); Cai et al. (2018); Wang et al. (2019a); Moran et al. (2020); Guo et al. (2020). Wang et al. (2019a) enhance photos by learning to estimate an illumination map. Moran et al. (2020) learn spatial filters of various types for image enhancement. Also, unsupervised learning has been explored, e.g., Guo et al. (2020) train a lightweight network to estimate pixel-wise and high-order curves for dynamic range adjustment. Yet, applying low-light image enhancement methods independently to individual frames will likely cause flickering, which has led to research on methods for low-light videos, e.g., Zhang et al. (2016); Lv et al. (2018); Jiang and Zheng (2019); Xue et al. (2019); Wang et al. (2019b); Chen et al. (2019). Zhang et al. (2016) adopt a perception-driven progressive fusion. Lv et al. (2018) design a multi-branch network to extract multi-level features for stable enhancement.

Method

Figure 2: Overview of our framework. The encoder branch (grey area) extracts deep features from the network input, and the decoder branch (blue area) produces the output for learning the DP3DF and the residual image (the branch in the red area). Further, we learn the DP3DF (the branch in the yellow area) for synthesizing the intermediate HNN frame. Finally, we refine the intermediate frame with the residual image to produce the final output. The branches for learning the DP3DF and the residual image share the same encoder-and-decoder backbone. Also, our DP3DF explicitly exploits adjacent spatiotemporal information around each pixel; see Fig. 3 for the details of how to apply the DP3DF to a video.

Architecture

To start, we denote the input LLN frames and the synthesized HNN frames, indexed by time. Usually, we train the network with inputs downsampled from the ground-truth frames by a given downsampling rate. To obtain realistic and temporally-smooth videos, we consider a window of frames before and after the target time for estimating the target frame:

(1)

Thus, the shape of the network input is determined by the number of input frames and the height, width, and channel size of the input video, while the output SR frame is enlarged by the downsampling rate in each spatial dimension. Fig. 2 illustrates the network input, the synthesized frame, and the various components of our framework. Overall, our framework first synthesizes an intermediate HNN frame, then constructs a residual image to refine it into the final output.
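To make the shapes concrete, below is a minimal sketch of how such an input clip could be assembled; the temporal radius N, downsampling rate r, and frame size are placeholder values chosen for illustration, not the paper's settings.

```python
import torch

# Hypothetical sizes: temporal radius N, downsampling rate r, frame size C x H x W.
N, r, C, H, W = 1, 4, 3, 128, 128

# A clip of 2N+1 consecutive low-resolution LLN frames centred on the target time t.
clip = [torch.rand(C, H, W) for _ in range(2 * N + 1)]

# The network input stacks the clip along the channel axis: ((2N+1)*C, H, W).
net_input = torch.cat(clip, dim=0)
print(net_input.shape)         # torch.Size([9, 128, 128])

# The synthesized HNN frame is r times larger in each spatial dimension.
hnn_shape = (C, r * H, r * W)  # (3, 512, 512)
```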

Figure 3: Illustrating how we apply the learned DP3DF to process an input video clip. For each pixel, we locate its 3D patch (green arrow) and its associated DP3DF kernel (red arrow), and then make use of the kernel components to process the 3D patch to produce the output pixels (yellow and blue arrows).

Rather than directly producing the pixels of the HNN frame, we propose to first learn a new parametric representation called DP3DF. To complete the mapping from the LLN input to the HNN output, the DP3DF at each pixel covers a 3D volume in the network input, whose dimensions are the kernel height, width, and temporal extent, plus an additional "+1" component for illumination enhancement. Each pixel has multiple DP3DF kernels, one for each upsampled output pixel; see Fig. 3. To enhance a pixel, we sample the 3D volume of pixels around it in the input and then use its learned kernels to produce the upsampled pixels for that location. Besides, we normalize the elements in each kernel to sum to one, which promotes smoothness in the results and suppresses noise. Further, for the kernel prediction at each pixel, the additional dimension is predicted for illumination adjustment in dark areas. The details of the kernels are discussed in the next subsections.
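As a rough worked example of the representation's size, assuming hypothetical values for the upsampling factor r, the spatial kernel size k, and the temporal extent T (these numbers are not legible in this copy and are only illustrative), with r² kernels predicted per pixel as in SR dynamic filters:

```python
# Hypothetical DP3DF sizes: r^2 kernels per pixel, each covering a k x k x T
# 3D volume plus one extra illumination term.
r, k, T = 4, 5, 3
params_per_pixel = r * r * (k * k * T + 1)
print(params_per_pixel)   # 16 kernels x (75 smoothing weights + 1 gain) = 1216
```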

DP3DF

Formulation.  Our network predicts the DP3DF from the network input via the filter-learning branch; see Fig. 2. Specifically, each DP3DF kernel is associated with one pixel in the input frame and can be decomposed into a set of kernels, one for each upsampled output pixel. Each of these kernels can further be split into two parts: the weights for SR and denoising, and a weight for luminance adjustment. The upsampled pixels can then be predicted as

(2)

where the indices iterate over the kernels associated with each pixel. Specifically, the SR-and-denoising weights are normalized through a Softmax so that they sum to one, whereas the luminance weight is processed with a Sigmoid activation and we take its reciprocal. The convolution with the normalized weights gives a spatiotemporal smoothing effect and helps achieve denoising, while the multiplication with the reciprocal luminance weight adjusts the illumination and enhances the dark areas in the input frame. Also, the resulting pixels form a high-resolution frame from the low-resolution one. Thus, the DP3DF is generated locally and dynamically, considering the spatiotemporal neighborhood of each pixel.
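The sketch below illustrates how such a per-pixel filter could be applied, following the description above (softmax-normalized smoothing weights, reciprocal of a sigmoid for the illumination gain). It is written for a single-channel clip with hypothetical sizes; the tensor layout, the function name `apply_dp3df`, and the use of `unfold`/`pixel_shuffle` are my assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def apply_dp3df(clip, filters, r=4, k=5):
    """Apply per-pixel DP3DF kernels to a low-resolution clip (illustrative sketch).

    clip:    (B, T, H, W)  single-channel frames centred on the target time.
    filters: (B, H, W, r*r, k*k*T + 1)  raw per-pixel kernel predictions; the last
             entry of each kernel is the illumination term, the rest are the
             spatiotemporal smoothing weights.
    Returns: (B, r*H, r*W)  the upsampled, denoised, brightened frame.
    """
    B, T, H, W = clip.shape
    # Gather the k x k x T neighbourhood of every pixel: (B, H, W, k*k*T).
    patches = F.unfold(clip, kernel_size=k, padding=k // 2)
    patches = patches.view(B, T * k * k, H, W).permute(0, 2, 3, 1)

    # Softmax-normalized smoothing weights (sum to one) and a positive gain.
    weights = F.softmax(filters[..., :-1], dim=-1)        # (B, H, W, r*r, k*k*T)
    gain = 1.0 / torch.sigmoid(filters[..., -1])          # (B, H, W, r*r)

    # Each of the r*r kernels yields one sub-pixel of the upsampled frame.
    subpix = gain * torch.einsum('bhwrk,bhwk->bhwr', weights, patches)
    out = F.pixel_shuffle(subpix.permute(0, 3, 1, 2), r)  # (B, 1, r*H, r*W)
    return out.squeeze(1)

# Tiny usage example with random tensors (shapes only; not real video data).
B, T, H, W, r, k = 1, 3, 16, 16, 4, 5
clip = torch.rand(B, T, H, W)
filters = torch.randn(B, H, W, r * r, k * k * T + 1)
print(apply_dp3df(clip, filters, r, k).shape)             # torch.Size([1, 64, 64])
```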

Our model has several significant advantages. First, different operations can be formulated within a single coherent representation and can be learned jointly to enhance the performance. Second, such a representation allows us to propagate information, both spatially and temporally.

Implementation.  To learn the DP3DF, we adopt a network with an encoder-and-decoder structure. As shown in Fig. 2, the encoder has two downsampling layers, each with several residual blocks He et al. (2016). These residual blocks extract relevant features in each layer and use instance normalization to reduce the gap between different types of videos. Then, we pass the features from the encoder through several residual blocks to produce the input feature of the decoder. Subsequently, the decoder adopts a pixel shuffle Shi et al. (2016) for upsampling.
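A simplified PyTorch sketch of such a shared backbone is given below; the channel widths, number of blocks, and head sizes are illustrative placeholders rather than the paper's exact configuration, and how the residual image reaches the SR resolution is omitted here.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block with instance normalization, as described for the encoder."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class SharedBackbone(nn.Module):
    """Encoder-decoder shared by the DP3DF branch and the residual branch (sketch)."""
    def __init__(self, in_ch, dp3df_ch, res_ch, base=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, base, 3, stride=2, padding=1), ResBlock(base),         # 1/2 resolution
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), ResBlock(base * 2))  # 1/4 resolution
        self.middle = ResBlock(base * 2)
        self.decoder = nn.Sequential(
            nn.Conv2d(base * 2, base * 4, 3, padding=1), nn.PixelShuffle(2),  # back to 1/2
            nn.Conv2d(base, base * 4, 3, padding=1), nn.PixelShuffle(2))      # back to full resolution
        self.dp3df_head = nn.Conv2d(base, dp3df_ch, 3, padding=1)  # per-pixel filter parameters
        self.res_head = nn.Conv2d(base, res_ch, 3, padding=1)      # residual image (input resolution here)

    def forward(self, x):
        feat = self.decoder(self.middle(self.encoder(x)))
        return self.dp3df_head(feat), self.res_head(feat)

# Shapes only: 3 frames x 3 channels stacked as input, hypothetical head widths.
net = SharedBackbone(in_ch=9, dp3df_ch=16 * 76, res_ch=3)
filters, residual = net(torch.rand(1, 9, 64, 64))
print(filters.shape, residual.shape)  # torch.Size([1, 1216, 64, 64]) torch.Size([1, 3, 64, 64])
```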

Figure 4: Illustrating dynamic filters implemented for different tasks. Our DP3DF (d) is a generalized filter that combines the properties of the dynamic filters for various tasks (a)-(c). Our DP3DF simultaneously achieves SR, denoising, and illumination enhancement, and is able to explicitly incorporate adjacent frame information, unlike the existing dynamic filter methods.

Connection with existing dynamic filters.   As illustrated in Fig. 4, we design our DP3DF to be a generalized filter that combines the properties of dynamic filters for different tasks. The dynamic filters in video SR, denoising, and illumination enhancement are special cases of DP3DF. Especially, the dynamic filters in SR Jo et al. (2018) predict multiple kernels per pixel with a 2D kernel; existing denoising methods Mildenhall et al. (2018); Xia et al. (2020, 2021) predict one kernel per pixel to smooth the neighborhood, whereas existing illumination enhancement methods, e.g., Wang et al. (2019a), adopt the Retinex theory to predict one value per pixel. If we reduce the 3D kernel shape to 2D and remove the dimension for illumination enhancement, we trim the DP3DF down to a dynamic filter for regular SR. If we reduce the number of predicted kernels to one and remove the dimension for illumination enhancement, we turn the DP3DF into a filter for denoising. Lastly, if we reduce the number of predicted kernels to one and remove the 3D kernel, we trim the DP3DF down to a filter for regular illumination enhancement.
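To summarize these reductions, the sketch below uses hypothetical parameter names (r for the SR factor, k x k x T for the 3D kernel, plus an illumination flag); the groupings follow the paragraph above, not code from the paper.

```python
# Hypothetical settings showing how DP3DF subsumes existing dynamic filters.
r, k, T = 4, 5, 3
dp3df        = dict(kernels_per_pixel=r * r, kernel_shape=(k, k, T), illumination=True)   # full model
sr_filter    = dict(kernels_per_pixel=r * r, kernel_shape=(k, k, 1), illumination=False)  # e.g. DUF-style SR
denoise      = dict(kernels_per_pixel=1,     kernel_shape=(k, k, T), illumination=False)  # kernel-prediction denoising
illumination = dict(kernels_per_pixel=1,     kernel_shape=(1, 1, 1), illumination=True)   # Retinex-style per-pixel gain
```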

Residual Learning

To further enhance the performance, we adopt a residual learning branch (see the red area in Fig. 2) to learn a residual image that enriches the final output with high-frequency details. Importantly, the residual image is produced from multiple input frames rather than a single input frame, so sharing the same encoder-and-decoder structure with the main branch that predicts the DP3DF allows us to reduce the computational overhead. Finally, we combine the intermediate HNN frame with the learned residual image to produce the final output frame.

Loss Function

The overall loss has the following three parts.

(i) Reconstructing the intermediate frame.  First, we define a reconstruction loss term for obtaining an accurate prediction of the intermediate HNN frame with the DP3DF:

(3)

where the distance is measured with a pixel-wise norm, and all pixel channels in the ground truth and the prediction are clipped to [0, 1]. Such a clip operation is effective for training the illumination enhancement, eliminating invalid colors beyond the gamut and avoiding mistakenly darkening the underexposed regions.
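A minimal sketch of this loss term, assuming an L1 distance (the norm symbol is not legible in this copy) and implementing the clip to [0, 1] with `clamp`:

```python
import torch

def recon_loss(pred, gt):
    # Clip both prediction and ground truth to [0, 1] before measuring the
    # distance; an L1 norm is assumed here for illustration.
    return (pred.clamp(0, 1) - gt.clamp(0, 1)).abs().mean()

# Usage with dummy tensors of matching shape.
print(recon_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)))
```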

(ii) Residual learning branch.  Similarly, we define another reconstruction loss for the residual learning branch, which generates the final output from the intermediate frame and the residual image:

(4)

(iii) Smoothness loss.  Many works employ a smoothness prior for illumination enhancement, e.g., Li and Brown (2014); Wang et al. (2019a), by assuming that the illumination is locally smooth. Harnessing this prior in our framework has two advantages: it helps not only to reduce overfitting and improve the network’s generalizability, but also to enhance the image contrast. For adjacent pixels with similar illumination values in a video frame, their contrast in the enhanced frame should be small, and vice versa. So, we define the smoothness loss on the predicted illumination component as

(5)

where the partial derivatives are taken in the horizontal and vertical directions of the prediction, and the spatially-varying (per-channel) smoothness weights are expressed as

(6)

where the weights are computed from the logarithmic image of the input, and a small constant (set to 0.0001) prevents division by zero.
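Since Eqs. (5)-(6) are not legible in this copy, the sketch below follows a common form of this edge-aware smoothness prior (gradients of the prediction weighted by inverse gradients of the log input); the exact formulation in the paper may differ.

```python
import torch

def smoothness_loss(pred, inp, eps=1e-4):
    """Edge-aware smoothness prior (sketch of one common form, per channel).

    pred: predicted illumination-related map, shape (B, C, H, W).
    inp:  corresponding input frame, shape (B, C, H, W), giving the weights.
    """
    log_inp = torch.log(inp.clamp(min=eps))   # logarithmic image of the input

    def grad_xy(x):
        dx = x[..., :, 1:] - x[..., :, :-1]   # horizontal differences
        dy = x[..., 1:, :] - x[..., :-1, :]   # vertical differences
        return dx, dy

    px, py = grad_xy(pred)
    ix, iy = grad_xy(log_inp)
    wx = 1.0 / (ix.abs() + eps)               # large weight where the input is flat
    wy = 1.0 / (iy.abs() + eps)
    return (wx * px.pow(2)).mean() + (wy * py.pow(2)).mean()

print(smoothness_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)))
```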

Overall loss.  The overall loss is

(7)

where the three terms are balanced by their respective loss weights. We will release the code and trained models upon the publication of this work.
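Since Eq. (7) is not legible in this copy, one plausible form, with generic weight symbols that are my own notation rather than the paper's, is:

```latex
L_{\mathrm{total}} \;=\; \lambda_{1}\, L_{\mathrm{rec}}
                    \;+\; \lambda_{2}\, L_{\mathrm{res}}
                    \;+\; \lambda_{3}\, L_{\mathrm{smooth}}
```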

Experiments

Datasets

We perform our evaluation on two public datasets with indoor and outdoor real-world videos: SMID Chen et al. (2019) and SDSD Wang et al. (2021). The videos in SMID are captured as static videos, in which the ground truths are obtained with a long exposure, and the signal-to-noise ratio of the videos under the dark environment is extremely low. In this work, we explore the mapping from LLN to HNN frames in the sRGB domain. Thus, we follow the script provided by SMID Chen et al. (2019) to convert the low-light videos from the RAW domain to the sRGB domain using rawpy’s default ISP. On the other hand, SDSD is a dynamic video dataset collected with electromechanical equipment, containing indoor and outdoor subsets. Also, we follow the official train-test splits of SMID and SDSD.

Implementation

We empirically set the kernel size and the number of input frames. Experiments on all datasets were conducted with the same network structure, whose backbone is an encoder-and-decoder; see Fig. 2. The encoder has three down-sampling layers with 64, 128, and 256 channels, while the decoder has three up-sampling layers with 256, 128, and 64 channels. The branches for learning the DP3DF and the residual have three and one convolution layers, respectively. In the DP3DF kernel, the leading dimensions contain the parts for denoising and the remaining dimension contains the part for illumination enhancement.

We train all modules end-to-end with the learning rate initialized to 4e-4 for all layers (adapted by a cosine learning-rate scheduler) and a batch size of 16, with training patches cropped randomly from the down-sampled low-resolution frames. We use Kaiming initialization He et al. (2015) to initialize the weights and Adam Kingma and Ba (2014) for training, with momentum set to 0.9. Besides, we perform data augmentation by random rotations and horizontal flips. We implement our method using Python 3.7.7 and PyTorch 1.2.0 Paszke et al. (2019), and ran all experiments on one NVidia TITAN XP GPU. PSNR and SSIM Wang et al. (2004) are adopted for quantitative evaluation.
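A minimal sketch of this training setup in PyTorch; the scheduler horizon, beta2, and the tiny example model are placeholder choices not stated in the paper.

```python
import torch
import torch.nn as nn

def setup_training(model, total_iters=100_000):
    # Kaiming initialization for the convolution weights, as stated in the paper.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    # Adam with lr = 4e-4 and momentum (beta1) = 0.9, adapted by a cosine schedule.
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-4, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters)
    return optimizer, scheduler

optimizer, scheduler = setup_training(nn.Sequential(nn.Conv2d(3, 3, 3, padding=1)))
```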

                       SMID          SDSD Indoor    SDSD Outdoor
Methods                PSNR   SSIM   PSNR   SSIM    PSNR   SSIM
Ours w/o Temporal      23.67  0.69   25.49  0.83    24.98  0.75
Ours w/o Spatial       22.84  0.63   24.87  0.78    24.02  0.71
Ours w/o Residual      25.44  0.71   27.01  0.83    25.69  0.76
Ours                   25.73  0.73   27.11  0.85    25.80  0.77
Table 1: The quantitative evaluation in the ablation study.
Figure 5: Example visual samples in the ablation study: (a) Input (PSNR: 11.06, SSIM: 0.48); (b) Ours w/o Temporal (PSNR: 23.91, SSIM: 0.80); (c) Ours w/o Spatial (PSNR: 23.57, SSIM: 0.79); (d) Ours (PSNR: 26.17, SSIM: 0.83).

Ablation Study

We evaluate the major components in DP3DF on three ablated cases: (i) “w/o Temporal” removes the property of the 3D filters by ignoring the temporal dimension and filtering only in the spatial dimensions; (ii) “w/o Spatial” removes the spatial dimensions in DP3DF and applies filters only in the temporal dimension; and (iii) “w/o Residual” removes the branch of residual learning.

Table 1 summarizes the results, showing that all the ablated cases are weaker than our full method. Especially, "w/o Temporal" cannot incorporate information from the adjacent time frames and "w/o Spatial" cannot obtain information from the adjacent pixels, so both have weaker performance. These two cases show the necessity of considering both the temporal and spatial dimensions in our 3D filter. Though "w/o Residual" leverages multiple frames via the DP3DF, our full model still consistently achieves better results. Further, Fig. 5 shows some visual samples, revealing the apparent degradation caused by removing different components of our 3D filter. Fine details and textures are better reconstructed with our full model.

Figure 6: Qualitative comparison on SMID: (a) Input (PSNR: 9.58, SSIM: 0.50); (b) RBPN (PSNR: 23.29, SSIM: 0.78); (c) Zooming (PSNR: 23.52, SSIM: 0.78); (d) TGA (PSNR: 22.36, SSIM: 0.75); (e) TDAN (PSNR: 22.92, SSIM: 0.75); (f) ToFlow (PSNR: 22.27, SSIM: 0.75); (g) EDVR (PSNR: 23.44, SSIM: 0.77); (h) Ours (PSNR: 24.54, SSIM: 0.79). Our result contains sharper details and more vivid colors. Please zoom to view.
              SMID          SDSD Indoor    SDSD Outdoor
Methods       PSNR   SSIM   PSNR   SSIM    PSNR   SSIM
BasicVSR      21.78  0.62   20.72  0.71    20.91  0.70
BasicVSR++    22.48  0.65   21.02  0.75    21.31  0.72
IconVSR       21.99  0.63   20.94  0.73    20.89  0.71
RBPN          24.87  0.72   23.47  0.80    22.46  0.74
Zooming       24.89  0.71   26.32  0.84    22.05  0.72
TGA           23.40  0.67   23.92  0.76    23.83  0.74
TDAN          24.65  0.70   24.00  0.80    22.57  0.74
PFNL          20.85  0.60   23.19  0.82    23.31  0.72
ToFlow        23.08  0.66   21.82  0.76    22.07  0.71
EDVR          24.50  0.70   25.00  0.83    23.37  0.75
Ours          25.73  0.73   27.11  0.85    25.80  0.77
Table 2: Quantitative comparison with various SOTA SR methods on the SMID and SDSD datasets.
                          SMID          SDSD Indoor    SDSD Outdoor
Methods                   PSNR   SSIM   PSNR   SSIM    PSNR   SSIM
FastDVDnet+Zooming        25.22  0.71   26.82  0.80    22.93  0.72
FastDVDnet+TGA            23.97  0.70   24.15  0.78    24.11  0.74
FastDVDnet+TDAN           24.95  0.71   24.30  0.76    23.55  0.70
TCE+Zooming               24.34  0.67   25.74  0.77    22.15  0.69
TCE+TGA                   23.07  0.66   23.65  0.72    23.48  0.70
TCE+TDAN                  24.12  0.68   23.69  0.71    23.01  0.67
FastDVDnet+TCE+Zooming    23.97  0.74   26.54  0.81    24.41  0.73
FastDVDnet+TCE+TGA        24.57  0.76   25.31  0.78    25.01  0.75
FastDVDnet+TCE+TDAN       24.00  0.70   25.89  0.87    23.81  0.71
TCE+FastDVDnet+Zooming    23.73  0.70   26.01  0.79    23.69  0.74
TCE+FastDVDnet+TGA        24.66  0.68   24.70  0.77    24.88  0.72
TCE+FastDVDnet+TDAN       24.21  0.71   24.88  0.81    23.35  0.70
Ours                      25.73  0.73   27.11  0.85    25.80  0.77
Table 3: Comparison with baselines that combine SOTA video SR, denoising, and illumination enhancement networks.
Figure 7: Quantitative comparison between our framework and existing SOTA video SR methods in terms of run time on input images of 960×512.

Comparison

Figure 8: Qualitative comparison on the indoor videos in the SDSD dataset: (a) Input (PSNR: 7.58, SSIM: 0.35); (b) RBPN (PSNR: 25.41, SSIM: 0.88); (c) Zooming (PSNR: 26.40, SSIM: 0.88); (d) TGA (PSNR: 24.98, SSIM: 0.84); (e) TDAN (PSNR: 25.02, SSIM: 0.87); (f) ToFlow (PSNR: 22.69, SSIM: 0.83); (g) EDVR (PSNR: 26.27, SSIM: 0.88); (h) Ours (PSNR: 27.55, SSIM: 0.89). Please zoom to view.

Baselines.  As far as we are aware, there is no existing work designed for directly mapping LLN videos to HNN videos. So, we choose the following two classes of works for comparison. First, we consider a rich collection of SOTA methods for video SR: BasicVSR Chan et al. (2021), IconVSR Chan et al. (2021), BasicVSR++ Chan et al. (2022), RBPN Haris et al. (2019), Zooming Xiang et al. (2020), TGA Isobe et al. (2020), TDAN Tian et al. (2020), PFNL Yi et al. (2019), ToFlow Xue et al. (2019), and EDVR Wang et al. (2019c). We trained them on each dataset with their released code. Second, we collectively use network models for video denoising, illumination enhancement, and SR in a cascaded manner: illumination enhancement+SR, denoising+SR, illumination enhancement+denoising+SR, and denoising+illumination enhancement+SR, where “+” indicates the order of applying the different networks. Here, we employ FastDVDnet Tassano et al. (2020), a SOTA method for video denoising, and TCE Zhang et al. (2021), a SOTA method for video illumination enhancement, together with various SOTA video SR methods.

Quantitative analysis.  Table 2 shows the comparison results with the SOTA SR methods. From the table, we can see that our method consistently achieves the highest PSNR and SSIM for all the datasets. Especially, our PSNR values are higher than all others by a large margin. This superiority shows that our method has a strong capability of enhancing LLN videos. Also, the right columns show results on the SDSD indoor and outdoor subsets. These videos contain dynamic scenes, so they are very challenging to handle. Yet, our method is able to obtain high-quality results with top PSNR and SSIM for both subsets.

On the other hand, Table 3 summarizes the comparison results with baselines that collectively combine SOTA video denoising, illumination enhancement, and SR networks. Here, we trained each network (video SR, illumination enhancement, and denoising) individually on the associated dataset. From Table 3, we can see that our method always produces the top PSNR values for all three datasets, and our SSIM values stay high compared with the others.

Fig. 7 reports the run time of our method vs. the SOTA video SR methods. We ran all methods on an Intel 2.6GHz CPU and a TITAN XP GPU. From the figure, we can see that our method is efficient, with a very low running time.

Figure 9: Qualitative comparison on the outdoor videos in the SDSD dataset: (a) Input (PSNR: 11.30, SSIM: 0.65); (b) RBPN (PSNR: 24.59, SSIM: 0.83); (c) Zooming (PSNR: 24.24, SSIM: 0.82); (d) TGA (PSNR: 20.64, SSIM: 0.79); (e) TDAN (PSNR: 22.37, SSIM: 0.81); (f) ToFlow (PSNR: 24.57, SSIM: 0.83); (g) EDVR (PSNR: 24.64, SSIM: 0.83); (h) Ours (PSNR: 25.80, SSIM: 0.84). Please zoom to view the sample frames.

Qualitative analysis.  Next, we show visual comparisons with other methods. Fig. 6 shows the comparison on SMID. Overall, the results show two main advantages of our method over others. First, the result from our method has high contrast and clear details, as well as natural color constancy and brightness. Therefore, the frame processed by our method is more realistic than those by the others. Second, in regions with complex textures, it can be observed that our outputs have fewer artifacts. So, our result looks cleaner and sharper than those produced by the others. Further, these results demonstrate that our method can simultaneously achieve video SR, noise reduction, and illumination enhancement.

On the other hand, Figs. 8 and 9, respectively, show the visual comparisons on the SDSD indoor and outdoor subsets that feature dynamic scenes. Compared with the results of the baselines, our results are visually more appealing due to the explicit details, vivid colors, rational contrast, and plausible brightness. These results show the limitations of the existing approaches in converting LLN videos to HNN videos, and the superiority of our framework.

User study.  Further, we conducted a large-scale user study with 80 participants (aged 18 to 52; 32 females and 48 males) to compare the perceptual quality of our method against various SOTA video SR approaches. In detail, we randomly selected 36 videos from the test sets of SMID and SDSD, and compared the results of different methods on these videos using AB tests. For each test video, our result is “Video A” whereas the result from a baseline is “Video B.” In each task, the participant watched videos A and B simultaneously (we avoid bias by randomizing the left-right presentation order when showing videos A and B) and chose among three options: “I think Video A is better”, “I think Video B is better”, and “I cannot decide.” Also, we asked the participants to make decisions based on the natural brightness, rich details, distinct contrast, and vivid color of the videos. For each participant, the number of tasks is 10 (methods) × 2 (videos), and it took around 30 minutes on average for each participant to complete the user study.

Figure 10: “Ours” is the percentage of test cases in which the participant selected our result as better; “Other” is the percentage in which the other method was chosen as better; and “Same” is the percentage in which the participant could not decide which one is better.

Fig. 10 summarizes the results of the user study, demonstrating that our results are preferred by the participants over all the baselines. Also, we performed a statistical analysis using the T-TEST function in MS Excel and found that the p-values in the comparisons with the baseline methods are all smaller than 0.001, showing that the differences are statistically significant at the 0.001 level.

Conclusion

This paper presents a new approach for video super resolution. Our novel parametric representation, the Deep Parametric 3D Filters (DP3DF), enables a direct mapping of LLN videos to HNN videos. It intrinsically incorporates local spatiotemporal information and achieves video SR simultaneously with denoising and illumination enhancement efficiently within a single encoder-and-decoder network. Besides, a dynamic residual frame can be jointly learned with the DP3DF, sharing the backbone and improving the visual quality of the results.

Extensive experiments were conducted on two real-world video datasets, SMID and SDSD, to show the effectiveness of our new approach. Both the quantitative and qualitative comparisons between our approach and current SOTA methods demonstrate our approach’s consistently top performance. Further, an extensive user study with 80 participants was conducted to evaluate and compare the results in terms of human perception. The results also show that our outputs consistently receive higher ratings than those from the baselines.

References

  • P. Arias and J. Morel (2018) Video denoising via empirical Bayesian estimation of space-time patches. Journal of Mathematical Imaging and Vision. Cited by: Related work.
  • J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi (2017) Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Related work.
  • J. Cai, S. Gu, and L. Zhang (2018) Learning a deep single image contrast enhancer from multi-exposure images. IEEE Trans. Image Process.. Cited by: Related work.
  • K. C. Chan, X. Wang, K. Yu, C. Dong, and C. C. Loy (2021) BasicVSR: the search for essential components in video super-resolution and beyond. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Comparison.
  • K. C. Chan, S. Zhou, X. Xu, and C. C. Loy (2022) BasicVSR++: improving video super-resolution with enhanced propagation and alignment. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Comparison.
  • C. Chen, Q. Chen, M. N. Do, and V. Koltun (2019) Seeing motion in the dark. In Int. Conf. Comput. Vis., Cited by: Introduction, Related work, Datasets.
  • X. Chen, L. Song, and X. Yang (2016) Deep RNNs for video denoising. In Applications of Digital Image Processing XXXIX, Cited by: Related work.
  • K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process.. Cited by: Related work.
  • T. Ehret, A. Davy, J. Morel, G. Facciolo, and P. Arias (2019) Model-blind video denoising via frame-to-frame training. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Related work.
  • C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong (2020) Zero-reference deep curve estimation for low-light image enhancement. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Related work.
  • M. Haris, G. Shakhnarovich, and N. Ukita (2019) Recurrent back-projection network for video super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Comparison.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Int. Conf. Comput. Vis., Cited by: Implementation.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: DP3DF.
  • Y. Huang, W. Wang, and L. Wang (2017) Video super-resolution via bidirectional recurrent convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Related work.
  • T. Isobe, S. Li, X. Jia, S. Yuan, G. Slabaugh, C. Xu, Y. Li, S. Wang, and Q. Tian (2020) Video super-resolution with temporal group attention. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Comparison.
  • H. Jiang and Y. Zheng (2019) Learning to see moving objects in the dark. In Int. Conf. Comput. Vis., Cited by: Related work.
  • Y. Jo, S. W. Oh, J. Kang, and S. J. Kim (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Related work, DP3DF.
  • D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. arXiv:1412.6980. Cited by: Implementation.
  • Y. Li and M. S. Brown (2014) Single image layer separation using relative smoothness. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Loss Function.
  • B. Lim and K. M. Lee (2017) Deep recurrent ResNet for video super-resolution. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Cited by: Related work.
  • K. G. Lore, A. Akintayo, and S. Sarkar (2017) LLNet: a deep autoencoder approach to natural low-light image enhancement. Pattern Recognition. Cited by: Related work.
  • F. Lv, F. Lu, J. Wu, and C. Lim (2018) MBLLEN: low-light image/video enhancement using CNNs.. In Brit. Mach. Vis. Conf., Cited by: Related work.
  • M. Maggioni, G. Boracchi, A. Foi, and K. Egiazarian (2012) Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms. IEEE Trans. Image Process.. Cited by: Related work.
  • C. Michele and V. G. Jan (2019) ViDeNN: deep blind video denoising. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., Cited by: Related work.
  • B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll (2018) Burst denoising with kernel prediction networks. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: DP3DF.
  • S. Moran, P. Marza, S. McDonagh, S. Parisot, and G. Slabaugh (2020) DeepLPF: deep local parametric filters for image enhancement. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Related work.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Adv. Neural Inform. Process. Syst., Cited by: Implementation.
  • M. S. Sajjadi, R. Vemulapalli, and M. Brown (2018) Frame-recurrent video super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Related work.
  • W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: DP3DF.
  • X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Adv. Neural Inform. Process. Syst., Cited by: Related work.
  • X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia (2017) Detail-revealing deep video super-resolution. In Int. Conf. Comput. Vis., Cited by: Related work.
  • M. Tassano, J. Delon, and T. Veit (2019) DVDnet: a fast network for deep video denoising. In IEEE International Conference on Image Processing (ICIP), Cited by: Related work.
  • M. Tassano, J. Delon, and T. Veit (2020) FastDVDnet: towards real-time deep video denoising without flow estimation. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Figure 1, Related work, Comparison.
  • Y. Tian, Y. Zhang, Y. Fu, and C. Xu (2020) TDAN: temporally-deformable alignment network for video super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Related work, Comparison.
  • T. Vogels, F. Rousselle, B. McWilliams, G. Röthlin, A. Harvill, D. Adler, M. Meyer, and J. Novák (2018) Denoising with kernel prediction and asymmetric loss functions. ACM Transactions on Graphics. Cited by: Related work.
  • L. Wang, Y. Guo, Z. Lin, X. Deng, and W. An (2018) Learning for video super-resolution through HR optical flow estimation. In ACCV, Cited by: Related work.
  • R. Wang, X. Xu, C. Fu, and J. Jia (2021) Seeing dynamic scene in the dark: high-quality video dataset with mechatronic alignment. In Int. Conf. Comput. Vis., Cited by: Introduction, Datasets.
  • R. Wang, Q. Zhang, C. Fu, X. Shen, W. Zheng, and J. Jia (2019a) Underexposed photo enhancement using deep illumination estimation. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Related work, DP3DF, Loss Function.
  • W. Wang, X. Chen, C. Yang, X. Li, X. Hu, and T. Yue (2019b) Enhancing low light videos by exploring high sensitivity camera noise. In Int. Conf. Comput. Vis., Cited by: Related work.
  • X. Wang, K. C. Chan, K. Yu, C. Dong, and C. L. Chen (2019c) EDVR: video restoration with enhanced deformable convolutional networks. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., Cited by: Related work, Comparison.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process.. Cited by: Implementation.
  • Z. Xia, M. Gharbi, F. Perazzi, K. Sunkavalli, and A. Chakrabarti (2021) Deep Denoising of Flash and No-Flash Pairs for Photography in Low-Light Environments. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: DP3DF.
  • Z. Xia, F. Perazzi, M. Gharbi, K. Sunkavalli, and A. Chakrabarti (2020) Basis prediction networks for effective burst denoising with large kernels. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: DP3DF.
  • X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J. P. Allebach, and C. Xu (2020) Zooming slow-mo: fast and accurate one-stage space-time video super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Figure 1, Comparison.
  • T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2019) Video enhancement with task-oriented flow. Int. J. Comput. Vis.. Cited by: Related work, Related work, Comparison.
  • J. Yan, S. Lin, B. K. Sing, and X. Tang (2014) A learning-to-rank approach for image color enhancement. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Related work.
  • Z. Yan, H. Zhang, B. Wang, S. Paris, and Y. Yu (2016) Automatic photo adjustment using deep neural networks. ACM Trans. Graph.. Cited by: Related work.
  • P. Yi, Z. Wang, K. Jiang, J. Jiang, and J. Ma (2019) Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In Int. Conf. Comput. Vis., Cited by: Comparison.
  • F. Zhang, Y. Li, S. You, and Y. Fu (2021) Learning temporal consistency for low light video enhancement from single images. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: Figure 1, Comparison.
  • Q. Zhang, Y. Nie, L. Zhang, and C. Xiao (2016) Underexposed video enhancement via perception-driven progressive fusion. IEEE Trans. Vis. Comput. Graph.. Cited by: Related work.