VRT: A Video Restoration Transformer

01/28/2022
by   Jingyun Liang, et al.
Facebook
ETH Zurich

Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Different from single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this by exploiting a sliding window strategy or a recurrent architecture, which is either restricted to frame-by-frame restoration or lacks long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. Besides, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on three tasks, including video super-resolution, video deblurring and video denoising, demonstrate that VRT outperforms the state-of-the-art methods by large margins (up to 2.16dB) on nine benchmark datasets.


1 Introduction

Video restoration, which reconstructs high-quality (HQ) frames from multiple low-quality (LQ) frames, has attracted much attention in recent years. Compared with single image restoration, the key challenge of video restoration lies in how to make full use of neighboring highly-related but misaligned supporting frames for the reconstruction of the reference frame.

Existing video restoration methods can be mainly divided into two categories: sliding window-based methods [4, 24, 88, 81, 37, 71, 110, 26, 34] and recurrent methods [23, 66, 18, 22, 25, 27, 7, 9, 45, 59, 109, 70]. As shown in Fig. 1(a), sliding window-based methods generally input multiple frames to generate a single HQ frame and process long video sequences in a sliding window fashion. Each input frame is processed multiple times during inference, leading to inefficient feature utilization and increased computation cost.

Some other methods are based on a recurrent architecture. As shown in Fig. 1(b), recurrent models mainly use previously reconstructed HQ frames for subsequent frame reconstruction. Due to their recurrent nature, they have three disadvantages. First, recurrent methods are limited in parallelization for efficient distributed training and inference. Second, although information is accumulated frame by frame, recurrent models are not good at long-range temporal dependency modelling. One frame may strongly affect the next adjacent frame, but its influence is quickly lost after a few time steps [19, 83]. Third, they suffer from significant performance drops on few-frame videos [5].

Figure 1: Illustrative comparison of sliding window-based models ((a), e.g., [37, 81, 88]), recurrent models ((b), e.g., [18, 23, 25, 7, 9]) and the proposed parallel VRT model ((c)). Green and blue circles denote low-quality (LQ) input frames and high-quality (HQ) output frames, respectively. $t-1$, $t$ and $t+1$ are frame serial numbers. Dashed lines represent information fusion among different frames.

In this paper, we propose a Video Restoration Transformer (VRT) that allows for parallel computation and long-range dependency modelling in video restoration. Based on a multi-scale framework, VRT divides the video sequence into non-overlapping clips and shifts it alternately to enable inter-clip interactions. Specifically, each scale of VRT has several temporal mutual self attention (TMSA) modules followed by a parallel warping module. In TMSA, mutual attention is focused on mutual alignment between neighboring two-frame clips, while self attention is used for feature extraction. At the end of each scale, we further use parallel warping to fuse neighboring frame information into the current frame. After multi-scale feature extraction, alignment and fusion, the HQ frames are individually reconstructed from their corresponding frame features.

Compared with existing video restoration frameworks, VRT has several benefits. First, as shown in Fig. 1(c), VRT is trained and tested on long video sequences in parallel. In contrast, both sliding window-based and recurrent methods are often tested frame by frame. Second, VRT has the ability to model long-range temporal dependencies, utilizing information from multiple neighbouring frames during the reconstruction of each frame. By contrast, sliding window-based methods cannot be easily scaled up to long sequence modelling, while recurrent methods may forget distant information after several time steps. Third, VRT uses mutual attention for joint feature alignment and fusion. It adaptively utilizes features from supporting frames and fuses them into the reference frame, which can be regarded as implicit motion estimation and feature warping.

Our contributions can be summarized as follows:

  • We propose a new framework named Video Restoration Transformer (VRT) that is characterized by parallel computation and long-range dependency modelling. It jointly extracts, aligns, and fuses frame features at multiple scales.

  • We propose the mutual attention for mutual alignment between frames. It is a generalized “soft” version of image warping after implicit motion estimation.

  • VRT achieves state-of-the-art performance on video restoration, including video super-resolution, deblurring and denoising. It outperforms state-of-the-art methods by up to 2.16dB on benchmark datasets.

2 Related Work

2.1 Video Restoration

Similar to image restoration [13, 104, 32, 16, 49, 98, 108, 105, 106, 107, 86, 56, 20, 17, 43, 91, 42, 55, 41, 85, 87, 102, 35, 101, 76, 40], learning-based methods, especially CNN-based methods, have become the primary workhorse for video restoration [48, 88, 110, 97, 93, 92, 62, 84, 63, 96, 7, 54, 82, 33, 95].

Framework design.

From the perspective of architecture design, existing methods can be roughly divided into two categories: sliding window-based and recurrent methods. Sliding window-based methods often take a short sequence of frames as input and merely predict the center frame [4, 24, 88, 79, 81, 37, 71, 110, 26, 80, 68, 34]. Although some works [36] predict multiple frames, they still focus on the reconstruction of the center frame during training and testing. The recurrent framework is another popular choice [23, 66, 18, 22, 25, 27, 7, 9, 45, 59, 109, 70]. Huang et al. [23] propose a bidirectional recurrent convolutional neural network for SR. Sajjadi et al. [66] warp the previous frame prediction onto the current frame and feed it to a restoration network along with the current input frame. This idea is used by Chan et al. [7] in a bidirectional recurrent network and further extended to grid propagation in [9].

Temporal alignment and fusion.

Since supporting frames are often highly related but misaligned, temporal alignment plays a critical role in video restoration [44, 94, 81, 88, 8, 7, 9]. Early methods [44, 29, 4, 47, 77] use traditional flow estimation methods to estimate optical flow and then warp the supporting frames towards the reference frame. To compensate for occlusion and large motion, Xue et al. [94] utilize task-oriented flow by fine-tuning the pre-trained optical flow estimation model SpyNet [64] on different video restoration tasks. Jo et al. [28] use dynamic upsampling filters for implicit motion compensation. Kim et al. [31] propose a spatio-temporal transformer network for multi-frame optical flow estimation and warping. Tian et al. [81] propose TDAN, which utilizes deformable convolution [12] for feature alignment. Based on TDAN, Wang et al. [88] extend it to multi-scale alignment, while Chan et al. [9] incorporate optical flow as a guidance for offset learning.

Attention mechanism.

The attention mechanism has been exploited in video restoration in combination with CNNs [47, 88, 73, 5]. Liu et al. [47] learn different weights for different temporal branches. Wang et al. [88] learn pixel-level attention maps for spatial and temporal feature fusion. To better incorporate temporal information, Isobe et al. [26] divide frames into several groups and design a temporal group attention module. Suin et al. [73] propose a reinforcement learning-based framework with factorized spatio-temporal attention. Cao et al. [5] propose to use self attention among local patches within a video.


Figure 2: The framework of the proposed Video Restoration Transformer (VRT). Given low-quality input frames, VRT reconstructs high-quality frames in parallel. It jointly extracts features, deals with misalignment, and fuses temporal information at multiple scales. On each scale, it has two kinds of modules: temporal mutual self attention (TMSA, see Sec. 3.2) and parallel warping (see Sec. 3.3). The downsampling and upsampling operations between different scales are omitted for clarity.

2.2 Vision Transformer

Recently, Transformer-based models [83, 67, 38, 90] have achieved promising performance in various vision tasks, such as image recognition [14, 6, 39, 90, 52, 21, 75, 52, 51, 50] and image restoration [11, 89, 40]. Some methods have tried to use Transformers for video modelling by extending the attention mechanism to the temporal dimension [3, 2, 60, 53, 38]. However, most of them are designed for visual recognition, which is fundamentally different from restoration tasks: they focus more on feature fusion than on alignment. Cao et al. [5] propose a CNN-Transformer hybrid network for video super-resolution (SR) based on spatial-temporal convolutional self attention. However, it does not make full use of local information within each patch and suffers from border artifacts during testing.

3 Video Restoration Transformer

3.1 Overall Framework

Let $I^{LQ} \in \mathbb{R}^{T\times H\times W\times C_{in}}$ be a sequence of low-quality (LQ) input frames and $I^{HQ} \in \mathbb{R}^{T\times sH\times sW\times C_{out}}$ be a sequence of high-quality (HQ) target frames, where $T$, $H$, $W$, $C_{in}$ and $C_{out}$ are the frame number, height, width, input channel number and output channel number, respectively. $s$ is the upscaling factor, which is larger than 1 (e.g., for video SR) or equal to 1 (e.g., for video deblurring). The proposed Video Restoration Transformer (VRT) aims to restore $T$ HQ frames from $T$ LQ frames in parallel for various video restoration tasks, including video SR, deblurring, denoising, etc. As illustrated in Fig. 2, VRT can be divided into two parts: feature extraction and reconstruction.

Feature extraction.

At the beginning, we extract shallow features by a single spatial 2D convolution from the LQ sequence. After that, based on [65], we propose a multi-scale network that aligns frames at different image resolutions. More specifically, when the total scale number is $S$, we downsample the features $S-1$ times by squeezing each $2\times 2$ spatial neighborhood into the channel dimension and reducing the channel number back to the original number via a linear layer. Then, we gradually upsample the features by unsqueezing them back to their original size. In this way, we can extract features and deal with object or camera motions at different scales by two kinds of modules: temporal mutual self attention (TMSA, see Sec. 3.2) and parallel warping (see Sec. 3.3). Skip connections are added between features of the same scale. Finally, after multi-scale feature extraction, alignment and fusion, we add several TMSA modules for further feature refinement and obtain the deep features.
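
A minimal PyTorch sketch (not the authors' released code) of the scale transition described above, assuming a (B, T, H, W, C) feature layout: downsampling squeezes each 2x2 spatial neighborhood into the channel dimension and a linear layer restores the channel count, while upsampling reverses the operation.

```python
import torch
import torch.nn as nn

class SqueezeDown(nn.Module):
    """Halve H and W by moving 2x2 neighborhoods to channels, then project back to C."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(4 * channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, h, w, c = x.shape
        x = x.reshape(b, t, h // 2, 2, w // 2, 2, c)            # split 2x2 blocks
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(b, t, h // 2, w // 2, 4 * c)
        return self.proj(x)                                      # back to C channels

class UnsqueezeUp(nn.Module):
    """Inverse: expand channels to 4C, then rearrange back to twice the spatial size."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, 4 * channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, h, w, c = x.shape
        x = self.proj(x).reshape(b, t, h, w, 2, 2, c)
        return x.permute(0, 1, 2, 4, 3, 5, 6).reshape(b, t, 2 * h, 2 * w, c)
```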

(a) Mutual attention between a supporting frame and the reference frame.

(b) Stacking of temporal mutual self attention (TMSA) layers over frames.
Figure 3: Illustrations of mutual attention and temporal mutual self attention (TMSA). In (a), we let the orange square (the $i$-th element of the reference frame) query elements in the supporting frame and use their weighted features as a new representation for the orange square. The weights are shown next to the solid arrows (only three examples are shown for clarity). When $A_{i,j}=1$ and the rest $A_{i,k}=0$ ($k\neq j$), the mutual attention equals warping the yellow square to the position of the orange square (illustrated as a dashed arrow). (b) shows a stack of temporal mutual self attention (TMSA) layers. The sequence is partitioned into 2-frame clips at each layer and shifted for every other layer to enable cross-clip interactions. Dashed lines represent information fusion among different frames.

Reconstruction.

After feature extraction, we reconstruct the HQ frames from the addition of the shallow features and the deep features. Different frames are reconstructed independently based on their corresponding features. Besides, to ease the burden of feature learning, we employ global residual learning and only predict the residual between the bilinearly upsampled LQ sequence and the ground-truth HQ sequence. In practice, different reconstruction modules are used for different restoration tasks. For video SR, we use the sub-pixel convolution layer [69] to upsample the features by a scale factor of $s$. For video deblurring, a single convolution layer is enough for reconstruction. Apart from this, the architecture design is kept the same for all tasks.
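
As an illustration of the reconstruction head described above for video SR, the following sketch combines a sub-pixel convolution [69] with global residual learning; module and argument names are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRReconstruction(nn.Module):
    """Sub-pixel convolution head for video SR with global residual learning."""
    def __init__(self, channels: int = 120, out_ch: int = 3, scale: int = 4):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(channels, out_ch * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, feat: torch.Tensor, lq: torch.Tensor) -> torch.Tensor:
        # feat: (B*T, C, H, W) frame features; lq: (B*T, 3, H, W) LQ frames
        residual = self.shuffle(self.conv(feat))                 # upsample by sub-pixel conv
        upsampled = F.interpolate(lq, scale_factor=self.scale,
                                  mode='bilinear', align_corners=False)
        return upsampled + residual                              # predict only the residual
```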

Loss function.

For fair comparison with existing methods, we use the commonly used Charbonnier loss [10] between the reconstructed HQ sequence $\hat{I}^{HQ}$ and the ground-truth HQ sequence $I^{HQ}$ as

$\mathcal{L} = \sqrt{\|\hat{I}^{HQ} - I^{HQ}\|^2 + \epsilon^2},$   (1)

where $\epsilon$ is a constant that is empirically set to $10^{-3}$.
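
Eq. (1) amounts to a short loss function; below is a sketch of the common element-wise formulation, with epsilon set to the value stated above.

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor,
                     eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier penalty sqrt((x - y)^2 + eps^2), averaged over all elements."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```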

3.2 Temporal Mutual Self Attention

In this section, based on the attention mechanism [83, 38, 90], we first introduce the mutual attention and then propose the temporal mutual self attention (TMSA).

Mutual attention.

Given the reference frame feature $X^R \in \mathbb{R}^{N\times C}$ and the supporting frame feature $X^S \in \mathbb{R}^{N\times C}$, where $N$ is the number of feature elements and $C$ is the channel number, we compute the query $Q^R$, key $K^S$ and value $V^S$ from $X^R$ and $X^S$ by linear projections as

$Q^R = X^R P^Q, \quad K^S = X^S P^K, \quad V^S = X^S P^V,$   (2)

where $P^Q, P^K, P^V \in \mathbb{R}^{C\times D}$ are projection matrices and $D$ is the channel number of the projected features. Then, we use $Q^R$ to query $K^S$ in order to generate the attention map $A$, which is then used for a weighted sum of $V^S$. This is formulated as

$\mathrm{MA}(Q^R, K^S, V^S) = \mathrm{SoftMax}(Q^R (K^S)^T / \sqrt{D})\, V^S,$   (3)

where SoftMax denotes the row-wise softmax operation.

Since $Q^R$ and $K^S$ come from $X^R$ and $X^S$, respectively, the attention map $A = \mathrm{SoftMax}(Q^R (K^S)^T / \sqrt{D})$ reflects the correlation between elements in the reference image and the supporting image. For clarity, we rewrite Eq. (3) for the $i$-th element of the reference image as

$Y^R_i = \sum_{j=1}^{N} A_{i,j} V^S_j,$   (4)

where $Y^R_i$ refers to the new feature of the $i$-th element in the reference frame. As shown in Fig. 3(a), when the $j$-th element of the supporting frame (e.g., the yellow square) is the most similar element to the $i$-th element of the reference frame (e.g., the orange square), $A_{i,j} > A_{i,k}$ holds for all $k \neq j$. When all other supporting elements are very dissimilar to the $i$-th reference element, we have

$A_{i,j} \to 1, \quad A_{i,k} \to 0 \;\, (k \neq j).$   (5)

In this extreme case, combining Eq. (4) and (5) gives $Y^R_i = V^S_j$, which moves the $j$-th element in the supporting frame to the position of the $i$-th element in the reference frame (see the dashed red line in Fig. 3(a)). This equals image warping given an optical flow vector. When Eq. (5) does not hold, Eq. (4) can be regarded as a "soft" version of image warping. In practice, the reference frame and the supporting frame can be exchanged, allowing mutual alignment between two frames. Besides, similar to multi-head self attention, we can also perform the attention $h$ times and concatenate the results as multi-head mutual attention (MMA).
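
A single-head sketch of the mutual attention of Eqs. (2)-(4) follows; shapes and the class name are assumptions, and the windowed, multi-head variant used in VRT is omitted.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    def __init__(self, channels: int, dim: int):
        super().__init__()
        self.q = nn.Linear(channels, dim, bias=False)   # query from the reference frame
        self.k = nn.Linear(channels, dim, bias=False)   # key from the supporting frame
        self.v = nn.Linear(channels, dim, bias=False)   # value from the supporting frame
        self.scale = dim ** -0.5

    def forward(self, x_ref: torch.Tensor, x_sup: torch.Tensor) -> torch.Tensor:
        # x_ref, x_sup: (N, C) features of the reference and supporting frames
        q, k, v = self.q(x_ref), self.k(x_sup), self.v(x_sup)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (N, N) map A
        return attn @ v  # each reference element is a weighted sum of supporting elements
```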

Particularly, mutual attention has several benefits over the combination of explicit motion estimation and image warping. First, mutual attention can adaptively preserve information from the supporting frame, unlike image warping, which only focuses on the target pixel. It also avoids black hole artifacts when there are no matched positions. Second, mutual attention does not have the inductive bias of locality that is inherent to most CNN-based motion estimation methods [15, 64, 61, 74] and may lead to performance drops when two neighboring objects move in different directions. Third, mutual attention is equivalent to conducting motion estimation and warping on image features jointly. In contrast, optical flows are often estimated on the input RGB images and then used for warping on features [7, 9]. Besides, flow estimation on RGB images is often not robust to lighting variation, occlusion and blur [94].

Temporal mutual self attention (TMSA).

Mutual attention is proposed for joint feature alignment between two frames. To extract and preserve features from the current frame, we use mutual attention together with self attention. Let $X$ represent the features of two frames, which can be split into $X_1$ and $X_2$. We use multi-head mutual attention (MMA) on $X_1$ and $X_2$ twice: warping $X_1$ towards $X_2$ and warping $X_2$ towards $X_1$. The warped features are combined and then concatenated with the result of multi-head self attention (MSA), followed by a multi-layer perceptron (MLP) for dimension reduction. After that, another MLP is added for further feature transformation. Two LayerNorm (LN) layers and two residual connections are also used, as shown in the green box of Fig. 2. The whole process is formulated as

$X_1, X_2 = \mathrm{Split}_0(\mathrm{LN}(X)),$
$Y_1, Y_2 = \mathrm{MMA}(X_1, X_2),\ \mathrm{MMA}(X_2, X_1),$
$X = \mathrm{MLP}(\mathrm{Concat}_2(\mathrm{Concat}_0(Y_1, Y_2), \mathrm{MSA}(X))) + X,$
$X = \mathrm{MLP}(\mathrm{LN}(X)) + X,$   (6)

where the subscripts of Split and Concat refer to the specified dimensions. However, due to the design of mutual attention, Eq. (6) can only deal with two frames at a time.

One naive way to extend Eq. (6) to $T$ frames is to deal with frame-to-frame pairs exhaustively, resulting in a computational complexity of $\mathcal{O}(T^2)$. Inspired by the shifted window mechanism [52, 53], we propose temporal mutual self attention (TMSA) to remedy the problem. TMSA first partitions the video sequence into non-overlapping 2-frame clips and then applies Eq. (6) to them in parallel. Next, as shown in Fig. 3(b), it shifts the sequence temporally by one frame for every other layer to enable cross-clip connections, reducing the computational complexity to $\mathcal{O}(T)$. The temporal receptive field grows as multiple TMSA modules are stacked together: at layer $l$, one frame can utilize information from up to $2l$ frames.
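
The clip partitioning and temporal shift can be sketched as follows, assuming a (B, T, N, C) layout with even T; the attention applied inside each 2-frame clip is omitted.

```python
import torch

def partition_clips(x: torch.Tensor, shift: bool) -> torch.Tensor:
    """Group frames into non-overlapping 2-frame clips; roll by one frame on shifted layers."""
    if shift:
        x = torch.roll(x, shifts=1, dims=1)   # temporal shift by one frame
    b, t, n, c = x.shape
    return x.reshape(b, t // 2, 2, n, c)      # (B, T//2 clips, 2 frames, N, C)

def merge_clips(clips: torch.Tensor, shift: bool) -> torch.Tensor:
    """Undo partition_clips so the next layer sees the original frame order."""
    b, num_clips, two, n, c = clips.shape
    x = clips.reshape(b, num_clips * two, n, c)
    if shift:
        x = torch.roll(x, shifts=-1, dims=1)
    return x
```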

Discussion.

Video restoration tasks often need to process high-resolution frames. Since the complexity of attention is quadratic in the number of elements within the attention window, global attention on the full image is often impractical. Therefore, following [52, 40], we partition each frame spatially into non-overlapping $8\times 8$ local windows, resulting in $HW/64$ windows. The shifted window mechanism (with a shift of half the window size) is also used spatially to enable cross-window connections. Besides, although stacking multiple TMSA modules allows for long-distance temporal modelling, distant frames are still not directly connected. As will be shown in the ablation study, using only a small temporal window size cannot fully exploit the potential of the model. Therefore, we use a larger temporal window size for the last quarter of TMSA modules to enable direct interactions between distant frames.
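
Spatially, the partitioning follows the windowing of [52]; a small sketch under the assumption that H and W are divisible by the window size M:

```python
import torch

def window_partition(x: torch.Tensor, m: int) -> torch.Tensor:
    """Split (B, H, W, C) frame features into (num_windows*B, M*M, C) non-overlapping windows."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // m, m, w // m, m, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, c)
```

The shifted counterpart rolls the feature map by half the window size before partitioning, as in [52].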

3.3 Parallel Warping

Due to spatial window partitioning, the mutual attention mechanism may not be able to deal with large motions well. Hence, as shown in the orange box of Fig. 2, we use feature warping at the end of each network stage to handle large motions. For any frame feature $X_t$, we calculate the optical flows of its neighbouring frame features $X_{t-1}$ and $X_{t+1}$, and warp them towards frame $t$ as $\hat{X}_{t-1}$ and $\hat{X}_{t+1}$ (i.e., backward and forward warping). Then, we concatenate the warped features with the original feature and use an MLP for feature fusion and dimension reduction. Specifically, following [9], we predict the residual flow with a flow estimation model and use deformable convolution [12] for deformable alignment. More details are provided in the supplementary material.
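
For completeness, here is a hedged sketch of backward feature warping with a given optical flow, the basic operation inside parallel warping; the flow estimator and the deformable-convolution refinement of [9, 12] are omitted, and shapes and normalization follow common practice rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp (B, C, H, W) features with a per-pixel flow (B, 2, H, W) given in pixels (x, y)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # absolute sample positions
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0                   # normalize to [-1, 1]
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                      # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=True)
```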

4 Experiments

4.1 Experimental Setup

For video SR, we use 4 scales in VRT. On each scale, we stack 8 TMSA modules, the last two of which use a temporal window size of 8. The spatial window size, head size and channel size are set to $8\times 8$, 6 and 120, respectively. After 7 multi-scale feature extraction stages, we add 24 TMSA modules (with self attention only) for further feature extraction before reconstruction. More details are provided in the supplementary material.
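
Collected as an illustrative configuration (key names are hypothetical; values follow the setup above):

```python
vrt_video_sr_config = {
    "scales": 4,                    # multi-scale levels with TMSA + parallel warping
    "tmsa_modules_per_scale": 8,    # last two use the larger temporal window
    "temporal_window_last_two": 8,
    "spatial_window_size": (8, 8),
    "head_size": 6,
    "channels": 120,
    "refinement_tmsa_modules": 24,  # self attention only, before reconstruction
}
```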

Method | Training Frames (REDS/Vimeo-90K) | Params (M) | Runtime (ms) | BI degradation: REDS4 [57] (RGB channel), Vimeo-90K-T [94] (Y channel), Vid4 [46] (Y channel) | BD degradation: UDM10 [97] (Y channel), Vimeo-90K-T [94] (Y channel), Vid4 [46] (Y channel)
Bicubic - - - 26.14/0.7292 31.32/0.8684 23.78/0.6347 28.47/0.8253 31.30/0.8687 21.80/0.5246
SwinIR [40] - 11.9 - 29.05/0.8269 35.67/0.9287 25.68/0.7491 35.42/0.9380 34.12/0.9167 25.25/0.7262
SwinIR-ft [40] 1/1 11.9 - 29.24/0.8319 35.89/0.9301 25.69/0.7488 36.76/0.9467 35.70/0.9293 25.62/0.7498
TOFlow [94] 5/7 - - 27.98/0.7990 33.08/0.9054 25.89/0.7651 36.26/0.9438 34.62/0.9212 25.85/0.7659
FRVSR [66] 10/7 5.1 137 - - - 37.09/0.9522 35.64/0.9319 26.69/0.8103
DUF [28] 7/7 5.8 974 28.63/0.8251 - 27.33/0.8319 38.48/0.9605 36.87/0.9447 27.38/0.8329
PFNL [97] 7/7 3.0 295 29.63/0.8502 36.14/0.9363 26.73/0.8029 38.74/0.9627 - 27.16/0.8355
RBPN [22] 7/7 12.2 1507 30.09/0.8590 37.07/0.9435 27.12/0.8180 38.66/0.9596 37.20/0.9458 27.17/0.8205

MuCAN [37] 5/7 - - 30.88/0.8750 37.32/0.9465 - - - -
RLSP [18] -/7 4.2 49 - - - 38.48/0.9606 36.49/0.9403 27.48/0.8388
TGA [26] -/7 5.8 384 - - - 38.74/0.9627 37.59/0.9516 27.63/0.8423
RSDN [25] -/7 6.2 94 - - - 39.35/0.9653 37.23/0.9471 27.92/0.8505
RRN [27] -/7 3.4 45 - - - 38.96/0.9644 - 27.69/0.8488
FDAN [45] -/7 9.0 - - - - 39.91/0.9686 37.75/0.9522 27.88/0.8508
EDVR [88] 5/7 20.6 378 31.09/0.8800 37.61/0.9489 27.35/0.8264 39.89/0.9686 37.81/0.9523 27.85/0.8503
GOVSR [96] -/7 7.1 81 - - - 40.14/0.9713 37.63/0.9503 28.41/0.8724
VSRT [5] 5/7 32.6 - 31.19/0.8815 37.71/0.9494 27.36/0.8258 - - -
VRT (ours) 6/- 30.7 236 31.60/0.8888 - - - - -
BasicVSR [7] 15/14 6.3 63 31.42/0.8909 37.18/0.9450 27.24/0.8251 39.96/0.9694 37.53/0.9498 27.96/0.8553
IconVSR [7] 15/14 8.7 70 31.67/0.8948 37.47/0.9476 27.39/0.8279 40.03/0.9694 37.84/0.9524 28.04/0.8570

BasicVSR++ [9] 30/14 7.3 77 32.39/0.9069 37.79/0.9500 27.79/0.8400 40.72/0.9722 38.21/0.9550 29.04/0.8753
VRT (ours) 16/7 35.6 243 32.19/0.9006 38.20/0.9530 27.93/0.8425 41.05/0.9737 38.72/0.9584 29.42/0.8795
Table 1: Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for video super-resolution (×4) on REDS4 [57], Vimeo-90K-T [94], Vid4 [46] and UDM10 [97]. Best and second best results are in red and blue colors, respectively. We currently do not have enough GPU memory to train the fully parallel model VRT on 30 frames.
Frame 021, Clip 011, REDS [57]: LQ (×4), EDVR [88], VSRT [5], BasicVSR [7], IconVSR [7], BasicVSR++ [9], VRT (ours), GT.
Frame 022, Clip city, Vid4 [46]: LQ (×4), EDVR [88], VSRT [5], BasicVSR [7], IconVSR [7], BasicVSR++ [9], VRT (ours), GT.
Figure 4: Visual comparison of video super-resolution (×4) methods.

4.2 Video SR

As shown in Table 1, we compare VRT with state-of-the-art image and video SR methods. VRT achieves the best performance for both bicubic (BI) and blur-downsampling (BD) degradations. Specifically, when trained on the REDS [57] dataset with short sequences, VRT outperforms VSRT by up to 0.57dB in PSNR. Compared with another representative sliding window-based model, EDVR, VRT has an improvement of 0.50 to 1.57dB on different datasets, showing its good ability to fuse information from multiple frames. Note that VRT outputs all frames simultaneously rather than predicting them frame by frame as EDVR does. On the Vimeo-90K [94] dataset, VRT surpasses BasicVSR++ by up to 0.38dB, although BasicVSR++ and other recurrent models may mirror the 7-frame video for training and testing. When VRT is trained on longer sequences, it shows good potential in temporal modelling and further increases the PSNR by 0.52dB. As indicated in [5], recurrent models often suffer from significant performance drops on short sequences. In contrast, VRT performs well on both short and long sequences. We note that, on REDS4, VRT is slightly lower than the 30-frame model BasicVSR++. This is expected since VRT is only trained on 16 frames.

We also provide comparison on parameter number and runtime in Table 1. As a parallel model, VRT needs to restore all frames at the same time, which leads to relatively larger model size and longer runtime per frame compared with recurrent models. However, VRT has the potential for distributed deployment, which is hard for recurrent models that restore a video clip recursively by design.

Visual results of different methods are shown in Fig. 4. As one can see, in accordance with its significant quantitative improvements, VRT can generate visually pleasing images with sharp edges and fine details, such as horizontal strip patterns of buildings. By contrast, its competitors suffer from either distorted textures or lost details.

Method DeepDeblur [58] SRN [78] DBN [71] DBLRNet [103] STFAN [110] STTN [31] SFE [93] EDVR [88] TSP [62] PVDNet [70] GSTA [73] ARVo [34] VRT (ours)
PSNR 29.85 30.53 30.01 30.08 31.24 31.61 31.71 31.82 32.13 32.31 32.53 32.80 34.27 (+1.47)
SSIM 0.8800 0.8940 0.8877 0.8845 0.9340 0.9160 0.9160 0.9160 0.9268 0.9260 0.9468 0.9352 0.9651 (+0.03)
Table 2: Quantitative comparison (average RGB channel PSNR/SSIM) with state-of-the-art methods for video deblurring on DVD [71]. Following [62, 34], all restored frames instead of randomly selected 30 frames from each test set [71] are used in evaluation. Best and second best results are in red and blue colors, respectively.
Method DeepDeblur [58] SRN [78] DMPHN [100] SAPHN [72] MPRNet [99] SFE [93] IFI-RNN [59] ESTRNN [109] EDVR [88] TSP [62] PVDNet [70] GSTA [73] VRT (ours)
PSNR 29.23 30.26 31.20 31.85 32.66 31.01 31.05 31.07 31.54 31.67 31.98 32.10 34.81 (+2.15)
SSIM 0.9162 0.9342 0.9400 0.9480 0.9590 0.9130 0.9110 0.9023 0.9260 0.9279 0.9280 0.9600 0.9724 (+0.01)
Table 3: Quantitative comparison (average RGB channel PSNR/SSIM) with state-of-the-art methods for video deblurring on GoPro [58]. Best and second best results are in red and blue colors, respectively.
Frame 00034, Clip IMG_0030, DVD [71]: LQ, DBN [71], STFAN [110], TSP [62], PVDNet [70], ARVo [34], VRT (ours), GT.
Frame 000210, Clip GOPRO410_11_00, GoPro [58]: LQ, SRN [78], SAPHN [72], MPRNet [99], TSP [62], PVDNet [70], VRT (ours), GT.
Figure 5: Visual comparison of video deblurring methods. Some of the compared images are derived from [34, 99].
Method DeepDeblur [58] SRN [78] DBN [71] EDVR [88] VRT (ours)
PSNR 26.16 26.98 26.55 34.80 36.79 (+1.99)
SSIM 0.8249 0.8141 0.8066 0.9487 0.9648 (+0.02)
Table 4: Quantitative comparison (average RGB channel PSNR/SSIM) with state-of-the-art methods for video deblurring on REDS [57]. Best and second best results are in red and blue colors, respectively.

4.3 Video Deblurring

We conduct experiments on three different datasets for fair comparison with existing methods. Table 2 shows the results on the DVD [71] dataset. It is clear that VRT achieves the best performance, outperforming the second best method ARVo by remarkable margins of 1.47dB in PSNR and 0.0299 in SSIM. Related to the attention mechanism, GSTA designs a gated spatio-temporal attention mechanism, while ARVo calculates the correlation between pixel pairs for correspondence learning. However, both of them are CNN-based and achieve significantly worse performance than the Transformer-based VRT. We also compare VRT on the GoPro [58] and REDS [57] datasets, where it shows its superiority over other methods with significant PSNR gains of 2.15dB and 1.99dB, respectively. The total number of parameters of VRT is 18.3M, which is slightly smaller than EDVR (23.6M) and PVDNet (23.5M). The runtime is 2.2s per frame on blurred videos. Notably, during evaluation, we do not use any pre-processing techniques such as sequence truncation and image alignment [62, 70].

Fig. 5 shows the visual comparison of different methods. VRT is effective in removing motion blurs and restoring faithful details, such as the pole in the first example and characters in the second one. In comparison, other approaches fail to remove blurs completely and do not produce sharp edges.

Dataset σ VLNB [1] DVDnet [79] FastDVDnet [80] PaCNet [82] VRT (ours)
DAVIS 10 38.85 38.13 38.71 39.97 40.82 (+0.85)
20 35.68 35.70 35.77 36.82 38.15 (+1.33)
30 33.73 34.08 34.04 34.79 36.52 (+1.73)
40 32.32 32.86 32.82 33.34 35.32 (+1.98)
50 31.13 31.85 31.86 32.20 34.36 (+2.16)
Set8 10 37.26 36.08 36.44 37.06 37.88 (+0.82)
20 33.72 33.49 33.43 33.94 35.02 (+1.08)
30 31.74 31.79 31.68 32.05 33.35 (+1.30)
40 30.39 30.55 30.46 30.70 32.15 (+1.45)
50 29.24 29.56 29.53 29.66 31.22 (+1.56)
Table 5: Quantitative comparison (average RGB channel PSNR) with state-of-the-art methods for video denoising on DAVIS [30] and Set8 [79]. σ is the additive white Gaussian noise level. Best and second best results are in red and blue colors, respectively.

4.4 Video Denoising

We also conduct experiments on video denoising to show the effectiveness of VRT. Following [79, 80], we train one non-blind model on the DAVIS [30] dataset and test it on different noise levels. Table 5 shows the superiority of VRT over existing methods on two benchmark datasets. Even though PaCNet [82] trains different models separately for different noise levels, VRT still improves the PSNR by 0.82 to 2.16dB.

4.5 Ablation Study

For the ablation study, we set up a small version of VRT as the baseline model by halving the layer and channel numbers. All models are trained on Vimeo-90K [94] for bicubic video SR (×4) and tested on Vid4 [46].

Impact of multi-scale architecture & parallel warping.

Table 6 shows the ablation study on the multi-scale architecture and parallel warping. When the number of model scales is reduced, the performance drops gradually, even though the computation burden becomes heavier. This is expected because multi-scale processing can help the model utilize information from a larger area and deal with large motions between frames. Besides, parallel warping also helps, bringing an improvement of 0.17dB.

Impact of temporal mutual self attention.

To test the effectiveness of mutual and self attention in TMSA, we conduct an ablation study in Table 7. When we replace mutual attention with self attention (i.e., two self attentions) or only use one self attention, the performance drops by 0.11 to 0.17dB. One possible reason is that the model may focus more on the reference frame than on the supporting frame during the computation of attention maps. In contrast, mutual attention helps the model explicitly attend to the supporting frame and benefit from feature fusion. In addition, we find that only using mutual attention is not enough, because mutual attention cannot preserve information from the reference frame.

Impact of attention window size.

We conduct an ablation study in Table 8 to investigate the impact of the attention window size in the last few TMSAs of each scale. When the temporal window size increases from 1 to 2, the performance only improves slightly, possibly because the previous TMSA layers can already make good use of neighboring two-frame information. When the size is increased to 8, we can see an obvious improvement of 0.18dB. As a result, we use a window size of $8\times 8\times 8$ for those layers.

1 () 2 () 3 () 4 () Parallel warping PSNR
27.13
27.20
27.25
27.11
27.28
Table 6: Ablation study on multi-scale architecture and parallel warping. Given an input of spatial size , the corresponding feature sizes of each scale are shown in brackets. When some scales are removed, we add more layers to the rest scales to keep similar model size.
Attention 1 Self Attn. - Mutual Attn. Mutual Attn.
Attention 2 Self Attn. Self Attn. - Self Attn.
PSNR 27.17 27.11 26.92 27.28
Table 7: Ablation study on temporal mutual self attention.
Window Size
PSNR 27.10 27.13 27.18 27.28
Table 8: Ablation study on attention window size (frame height width).

5 Conclusion

In this paper, we proposed the Video Restoration Transformer (VRT) for video restoration. Based on a multi-scale framework, it jointly extracts, aligns, and fuses information from different frames at multiple resolutions by two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. More specifically, TMSA is composed of mutual and self attention. Mutual attention allows joint implicit flow estimation and feature warping, while self attention is responsible for feature extraction. Parallel warping is also used to further enhance feature alignment and fusion. Extensive experiments on various benchmark datasets show that VRT brings significant performance gains (up to 2.16dB) for video restoration, including video super-resolution, video deblurring and video denoising.

Acknowledgements   This work was partially supported by the ETH Zurich Fund (OK), a Huawei Technologies Oy (Finland) project, the China Scholarship Council and an Amazon AWS grant. Thanks to Dr. Gurkirt Singh for insightful discussions. Special thanks go to Yijue Chen.

References

  • [1] Pablo Arias and Jean-Michel Morel. Video denoising via empirical bayesian estimation of space-time patches. Journal of Mathematical Imaging and Vision, 60(1):70–93, 2018.
  • [2] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691, 2021.
  • [3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095, 2021.
  • [4] Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, and Wenzhe Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4778–4787, 2017.
  • [5] Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. Video super-resolution transformer. arXiv preprint arXiv:2106.06847, 2021.
  • [6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
  • [7] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential components in video super-resolution and beyond. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4947–4956, 2021.
  • [8] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Understanding deformable alignment in video super-resolution. In AAAI Conference on Artificial Intelligence, pages 973–981, 2021.
  • [9] Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. arXiv preprint arXiv:2104.13371, 2021.
  • [10] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In International Conference on Image Processing, volume 2, pages 168–172, 1994.
  • [11] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.
  • [12] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, pages 764–773, 2017.
  • [13] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199, 2014.
  • [14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [15] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
  • [16] Yuchen Fan, Honghui Shi, Jiahui Yu, Ding Liu, Wei Han, Haichao Yu, Zhangyang Wang, Xinchao Wang, and Thomas S Huang. Balanced two-stage residual networks for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 161–168, 2017.
  • [17] Yuchen Fan, Jiahui Yu, Ding Liu, and Thomas S Huang. Scale-wise convolution for image restoration. In AAAI Conference on Artificial Intelligence, pages 10770–10777, 2020.
  • [18] Dario Fuoli, Shuhang Gu, and Radu Timofte. Efficient video super-resolution through recurrent latent space propagation. In IEEE International Conference on Computer Vision Workshop, pages 3476–3485, 2019.
  • [19] Alexander Greaves-Tunnell and Zaid Harchaoui. A statistical investigation of long memory in language and music. In International Conference on Machine Learning, pages 2394–2403, 2019.
  • [20] Yong Guo, Jian Chen, Jingdong Wang, Qi Chen, Jiezhang Cao, Zeshuai Deng, Yanwu Xu, and Mingkui Tan. Closed-loop matters: Dual regression networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5407–5416, 2020.
  • [21] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12):4338–4364, 2021.
  • [22] Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Recurrent back-projection network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3897–3906, 2019.
  • [23] Yan Huang, Wei Wang, and Liang Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. Advances in Neural Information Processing Systems, 28:235–243, 2015.
  • [24] Yan Huang, Wei Wang, and Liang Wang. Video super-resolution via bidirectional recurrent convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):1015–1028, 2017.
  • [25] Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. Video super-resolution with recurrent structure-detail network. In European Conference on Computer Vision, pages 645–660, 2020.
  • [26] Takashi Isobe, Songjiang Li, Xu Jia, Shanxin Yuan, Gregory Slabaugh, Chunjing Xu, Ya-Li Li, Shengjin Wang, and Qi Tian. Video super-resolution with temporal group attention. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8008–8017, 2020.
  • [27] Takashi Isobe, Fang Zhu, Xu Jia, and Shengjin Wang. Revisiting temporal modeling for video super-resolution. arXiv preprint arXiv:2008.05765, 2020.
  • [28] Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3224–3232, 2018.
  • [29] Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
  • [30] Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video object segmentation with language referring expressions. In Asian Conference on Computer Vision, pages 123–141, 2018.
  • [31] Tae Hyun Kim, Mehdi SM Sajjadi, Michael Hirsch, and Bernhard Scholkopf. Spatio-temporal transformer network for video restoration. In European Conference on Computer Vision, pages 106–122, 2018.
  • [32] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
  • [33] Seunghwan Lee, Donghyeon Cho, Jiwon Kim, and Tae Hyun Kim. Restore from restored: Video restoration with pseudo clean video. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3537–3546, 2021.
  • [34] Dongxu Li, Chenchen Xu, Kaihao Zhang, Xin Yu, Yiran Zhong, Wenqi Ren, Hanna Suominen, and Hongdong Li. Arvo: Learning all-range volumetric correspondence for video deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7721–7731, 2021.
  • [35] Juncheng Li, Zehua Pei, and Tieyong Zeng. From beginner to master: A survey for deep learning-based single-image super-resolution. arXiv preprint arXiv:2109.14335, 2021.
  • [36] Sheng Li, Fengxiang He, Bo Du, Lefei Zhang, Yonghao Xu, and Dacheng Tao. Fast spatio-temporal residual network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10522–10531, 2019.
  • [37] Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, and Jiaya Jia. Mucan: Multi-correspondence aggregation network for video super-resolution. In European Conference on Computer Vision, pages 335–351, 2020.
  • [38] Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, and Wanqing Li. Trear: Transformer-based rgb-d egocentric action recognition. IEEE Transactions on Cognitive and Developmental Systems, 2021.
  • [39] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
  • [40] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. SwinIR: Image restoration using swin transformer. In IEEE Conference on International Conference on Computer Vision Workshops, 2021.
  • [41] Jingyun Liang, Andreas Lugmayr, Kai Zhang, Martin Danelljan, Luc Van Gool, and Radu Timofte. Hierarchical conditional flow: A unified framework for image super-resolution and image rescaling. In IEEE Conference on International Conference on Computer Vision, 2021.
  • [42] Jingyun Liang, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Mutual affine network for spatially variant kernel estimation in blind image super-resolution. In IEEE Conference on International Conference on Computer Vision, 2021.
  • [43] Jingyun Liang, Kai Zhang, Shuhang Gu, Luc Van Gool, and Radu Timofte. Flow-based kernel prior with application to blind super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10601–10610, 2021.
  • [44] Renjie Liao, Xin Tao, Ruiyu Li, Ziyang Ma, and Jiaya Jia. Video super-resolution via deep draft-ensemble learning. In IEEE International Conference on Computer Vision, pages 531–539, 2015.
  • [45] Jiayi Lin, Yan Huang, and Liang Wang. Fdan: Flow-guided deformable alignment network for video super-resolution. arXiv preprint arXiv:2105.05640, 2021.
  • [46] Ce Liu and Deqing Sun. On bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):346–360, 2013.
  • [47] Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, and Thomas Huang. Robust video super-resolution with learned temporal dynamics. In IEEE International Conference on Computer Vision, pages 2507–2515, 2017.
  • [48] Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, Xinchao Wang, and Thomas S Huang. Learning temporal dynamics for video super-resolution: A deep learning approach. IEEE Transactions on Image Processing, 27(7):3432–3445, 2018.
  • [49] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. arXiv preprint arXiv:1806.02919, 2018.
  • [50] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. International Journal of Computer Vision, 128(2):261–318, 2020.
  • [51] Yun Liu, Guolei Sun, Yu Qiu, Le Zhang, Ajad Chhatkuli, and Luc Van Gool. Transformer in convolutional neural networks. arXiv preprint arXiv:2106.03180, 2021.
  • [52] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
  • [53] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. arXiv preprint arXiv:2106.13230, 2021.
  • [54] Matteo Maggioni, Yibin Huang, Cheng Li, Shuai Xiao, Zhongqian Fu, and Fenglong Song. Efficient multi-stage video denoising with recurrent spatio-temporal fusion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3466–3475, 2021.
  • [55] Yiqun Mei, Yuchen Fan, and Yuqian Zhou. Image super-resolution with non-local sparse attention. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3517–3526, 2021.
  • [56] Yiqun Mei, Yuchen Fan, Yuqian Zhou, Lichao Huang, Thomas S Huang, and Honghui Shi. Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5690–5699, 2020.
  • [57] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1996–2005, 2019.
  • [58] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3883–3891, 2017.
  • [59] Seungjun Nah, Sanghyun Son, and Kyoung Mu Lee. Recurrent neural networks with intra-frame iterations for video deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8102–8111, 2019.
  • [60] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. arXiv preprint arXiv:2102.00719, 2021.
  • [61] Simon Niklaus. A reimplementation of SPyNet using PyTorch. https://github.com/sniklaus/pytorch-spynet, 2018.
  • [62] Jinshan Pan, Haoran Bai, and Jinhui Tang. Cascaded deep video deblurring using temporal sharpness prior. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3043–3051, 2020.
  • [63] Jinshan Pan, Haoran Bai, and Jinhui Tang. Cascaded deep video deblurring using temporal sharpness prior. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3043–3051, 2020.
  • [64] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4161–4170, 2017.
  • [65] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
  • [66] Mehdi SM Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6626–6634, 2018.
  • [67] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
  • [68] Dev Yashpal Sheth, Sreyas Mohan, Joshua L Vincent, Ramon Manzorro, Peter A Crozier, Mitesh M Khapra, Eero P Simoncelli, and Carlos Fernandez-Granda. Unsupervised deep video denoising. In IEEE International Conference on Computer Vision, pages 1759–1768, 2021.
  • [69] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
  • [70] Hyeongseok Son, Junyong Lee, Jonghyeop Lee, Sunghyun Cho, and Seungyong Lee. Recurrent video deblurring with blur-invariant motion estimation and pixel volumes. ACM Transactions on Graphics, 40(5):1–18, 2021.
  • [71] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1279–1288, 2017.
  • [72] Maitreya Suin, Kuldeep Purohit, and AN Rajagopalan. Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3615, 2020.
  • [73] Maitreya Suin and AN Rajagopalan. Gated spatio-temporal attention-guided video deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7802–7811, 2021.
  • [74] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
  • [75] Guolei Sun, Yun Liu, Thomas Probst, Danda Pani Paudel, Nikola Popovic, and Luc Van Gool. Boosting crowd counting with transformers. arXiv preprint arXiv:2105.10926, 2021.
  • [76] Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Mefnet: Multi-scale event fusion network for motion deblurring. arXiv preprint arXiv:2112.00167, 2021.
  • [77] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In IEEE International Conference on Computer Vision, pages 4472–4480, 2017.
  • [78] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
  • [79] Matias Tassano, Julie Delon, and Thomas Veit. Dvdnet: A fast network for deep video denoising. In IEEE International Conference on Image Processing, pages 1805–1809, 2019.
  • [80] Matias Tassano, Julie Delon, and Thomas Veit. Fastdvdnet: Towards real-time deep video denoising without flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1354–1363, 2020.
  • [81] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3360–3369, 2020.
  • [82] Gregory Vaksman, Michael Elad, and Peyman Milanfar. Patch craft: Video denoising by deep modeling and patch matching. In IEEE International Conference on Computer Vision, pages 1759–1768, 2021.
  • [83] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • [84] Longguang Wang, Yulan Guo, Li Liu, Zaiping Lin, Xinpu Deng, and Wei An. Deep video super-resolution using hr optical flow estimation. IEEE Transactions on Image Processing, 29:4323–4336, 2020.
  • [85] Longguang Wang, Yingqian Wang, Xiaoyu Dong, Qingyu Xu, Jungang Yang, Wei An, and Yulan Guo. Unsupervised degradation representation learning for blind super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10581–10590, 2021.
  • [86] Longguang Wang, Yingqian Wang, Zhengfa Liang, Zaiping Lin, Jungang Yang, Wei An, and Yulan Guo. Learning parallax attention for stereo image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12250–12259, 2019.
  • [87] Longguang Wang, Yingqian Wang, Zaiping Lin, Jungang Yang, Wei An, and Yulan Guo. Learning a single network for scale-arbitrary super-resolution. In IEEE Conference on International Conference on Computer Vision, pages 10581–10590, 2021.
  • [88] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1954–1963, 2019.
  • [89] Zhendong Wang, Xiaodong Cun, Jianmin Bao, and Jianzhuang Liu. Uformer: A general u-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106, 2021.
  • [90] Christoph Wick, Jochen Zöllner, and Tobias Grüning. Transformer for handwritten text recognition using bidirectional post-decoding. In International Conference on Document Analysis and Recognition, pages 112–126, 2021.
  • [91] Xiaoyu Xiang, Qian Lin, and Jan P Allebach. Boosting high-level vision with joint compression artifacts reduction and super-resolution. In International Conference on Pattern Recognition, pages 2390–2397, 2021.
  • [92] Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3370–3379, 2020.
  • [93] Xinguang Xiang, Hao Wei, and Jinshan Pan. Deep video deblurring using sharpness features from exemplars. IEEE Transactions on Image Processing, 29:8976–8987, 2020.
  • [94] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
  • [95] Ren Yang and Radu Timofte. Ntire 2021 challenge on quality enhancement of compressed video: Methods and results. In IEEE Conference on Computer Vision and Pattern Recognition, pages 647–666, 2021.
  • [96] Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, Tao Lu, Xin Tian, and Jiayi Ma. Omniscient video super-resolution. In IEEE International Conference on Computer Vision, pages 4429–4438, 2021.
  • [97] Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In IEEE International Conference on Computer Vision, pages 3106–3115, 2019.
  • [98] Jiahui Yu, Yuchen Fan, Jianchao Yang, Ning Xu, Zhaowen Wang, Xinchao Wang, and Thomas Huang. Wide activation for efficient and accurate image super-resolution. arXiv preprint arXiv:1808.08718, 2018.
  • [99] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 14821–14831, 2021.
  • [100] Hongguang Zhang, Yuchao Dai, Hongdong Li, and Piotr Koniusz. Deep stacked hierarchical multi-patch network for image deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5978–5986, 2019.
  • [101] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [102] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE Conference on International Conference on Computer Vision, 2021.
  • [103] Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Wei Liu, and Hongdong Li. Adversarial spatio-temporal learning for video deblurring. IEEE Transactions on Image Processing, 28(1):291–301, 2018.
  • [104] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
  • [105] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Learning a single convolutional super-resolution network for multiple degradations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3262–3271, 2018.
  • [106] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, pages 286–301, 2018.
  • [107] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. arXiv preprint arXiv:1903.10082, 2019.
  • [108] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
  • [109] Zhihang Zhong, Ye Gao, Yinqiang Zheng, and Bo Zheng. Efficient spatio-temporal recurrent neural network for video deblurring. In European Conference on Computer Vision, pages 191–207, 2020.
  • [110] Shangchen Zhou, Jiawei Zhang, Jinshan Pan, Haozhe Xie, Wangmeng Zuo, and Jimmy Ren. Spatio-temporal filter adaptive network for video deblurring. In IEEE International Conference on Computer Vision, pages 2482–2491, 2019.