Recurrent Video Restoration Transformer with Guided Deformable Attention

06/05/2022
by   Jingyun Liang, et al.

Video restoration aims at restoring multiple high-quality frames from multiple low-quality frames. Existing video restoration methods generally fall into two extreme cases, i.e., they either restore all frames in parallel or restore the video frame by frame in a recurrent way, each of which has its own merits and drawbacks. Typically, the former has the advantage of temporal information fusion, but it suffers from a large model size and intensive memory consumption; the latter has a relatively small model size as it shares parameters across frames, but it lacks long-range dependency modeling ability and parallelizability. In this paper, we attempt to integrate the advantages of the two cases by proposing a recurrent video restoration transformer, namely RVRT. RVRT processes local neighboring frames in parallel within a globally recurrent framework, which achieves a good trade-off between model size, effectiveness, and efficiency. Specifically, RVRT divides the video into multiple clips and uses the previously inferred clip feature to estimate the subsequent clip feature. Within each clip, different frame features are jointly updated with implicit feature aggregation. Across different clips, guided deformable attention is designed for clip-to-clip alignment, which predicts multiple relevant locations from the whole inferred clip and aggregates their features with the attention mechanism. Extensive experiments on video super-resolution, deblurring, and denoising show that the proposed RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory and runtime.


1 Introduction

Video restoration, such as video super-resolution, deblurring, and denoising, has become a hot topic in recent years. It aims to restore a clear and sharp high-quality video from a degraded (e.g., downsampled, blurred, or noisy) low-quality video Wang et al. (2019a); Chan et al. (2021b); Cao et al. (2021); Liang et al. (2022). It has wide applications in live streaming Zhang et al. (2020), video surveillance Liu et al. (2022b), old film restoration Wan et al. (2022), and more.

Parallel methods and recurrent methods have been the dominant strategies for solving various video restoration problems. Typically, these two kinds of methods have their respective merits and demerits. Parallel methods Caballero et al. (2017b); Huang et al. (2017); Wang et al. (2019a); Tian et al. (2020); Li et al. (2020); Su et al. (2017); Zhou et al. (2019); Isobe et al. (2020b); Li et al. (2021a); Cao et al. (2021); Liang et al. (2022) support distributed deployment and achieve good performance by directly fusing information from multiple frames, but they often have a large model size and consume enormous memory for long-sequence videos. Meanwhile, recurrent models Huang et al. (2015); Sajjadi et al. (2018); Fuoli et al. (2019); Haris et al. (2019); Isobe et al. (2020a, c); Chan et al. (2021a, b); Lin et al. (2021); Nah et al. (2019b); Zhong et al. (2020); Son et al. (2021) reuse the same network block to save parameters and predict each new frame feature based on the previously refined frame feature, but the sequential processing strategy inevitably leads to information loss and noise amplification Chu et al. (2020) in long-range dependency modelling and makes the computation hard to parallelize.

Considering the advantages and disadvantages of parallel and recurrent methods, in this paper, we propose a recurrent video restoration transformer (RVRT) that takes the best of both worlds. On the one hand, RVRT introduces the recurrent design into transformer-based models to reduce model parameters and memory usage. On the other hand, it processes neighboring frames together as a clip to reduce video sequence length and alleviate information loss. To be specific, we first divide the video into fixed-length video clips. Then, starting from the first clip, we refine the subsequent clip feature based on the previously inferred clip feature and the old features of the current clip from shallower layers. Within each clip, different frame features are jointly extracted, implicitly aligned and effectively fused by the self-attention mechanism Vaswani et al. (2017); Liu et al. (2021); Liang et al. (2021a). Across different clips, information is accumulated clip by clip with a larger hidden state than previous recurrent methods.

To implement the above RVRT model, one big challenge is how to align different video clips when using the previous clip for feature refinement. Most existing alignment techniques Ranjan and Black (2017); Sun et al. (2018); Sajjadi et al. (2018); Xue et al. (2019); Chan et al. (2021a); Dai et al. (2017); Tian et al. (2020); Wang et al. (2019a); Chan et al. (2021b); Liang et al. (2022) are designed for frame-to-frame alignment. One possible way to apply them to clip-to-clip alignment is to introduce an extra feature fusion stage after aligning all frame pairs. Instead, we propose a one-stage video-to-video alignment method named guided deformable attention (GDA). More specifically, for a reference location in the target clip, we first estimate the coordinates of multiple relevant locations from different frames in the supporting clip under the guidance of optical flow, and then aggregate the features of all locations dynamically by the attention mechanism.

GDA has several advantages over previous alignment methods: 1) Compared with optical flow-based warping that only samples one point from one frame Sajjadi et al. (2018); Xue et al. (2019); Chan et al. (2021a), GDA benefits from multiple relevant locations sampled from the video clip. 2) Unlike mutual attention Liang et al. (2022), GDA utilizes features from arbitrary locations without suffering from the small receptive field in local attention or the huge computation burden in global attention. Besides, GDA allows direct attention on non-integer locations with bilinear interpolation. 3) In contrast to deformable convolution Dai et al. (2017); Zhu et al. (2019); Tian et al. (2020); Wang et al. (2019a); Chan et al. (2021b) that uses a fixed weight in feature aggregation, GDA generates dynamic weights to aggregate features from different locations. It also supports arbitrary location numbers and allows for both frame-to-frame and video-to-video alignment without any modification.

Our contributions can be summarized as follows:

  • We propose the recurrent video restoration transformer (RVRT) that extracts features of local neighboring frames from one clip in a joint and parallel way, and refines clip features by accumulating information from previous clips and previous layers. By reducing the video sequence length and transmitting information with a larger hidden state, RVRT alleviates information loss and noise amplification in recurrent networks, and also makes it possible to partially parallelize the model.

  • We propose the guided deformable attention (GDA) for one-stage video clip-to-clip alignment. It dynamically aggregates information of relevant locations from the supporting clip.

  • Extensive experiments on eight benchmark datasets show that the proposed model achieves state-of-the-art performance in three challenging video restoration tasks: video super-resolution, video deblurring, and video denoising, with balanced model size, memory usage and runtime.

2 Related Work

2.1 Video Restoration

Parallel vs. recurrent methods.

Most existing video restoration methods can be classified as parallel or recurrent methods according to their parallelizability. Parallel methods estimate all frames simultaneously, as the refinement of one frame feature does not depend on the update of other frame features. They can be further divided into sliding window-based methods Caballero et al. (2017b); Huang et al. (2017); Wang et al. (2019a); Tassano et al. (2019); Tian et al. (2020); Wang et al. (2020); Li et al. (2020); Su et al. (2017); Zhou et al. (2019); Isobe et al. (2020b); Tassano et al. (2020); Sheth et al. (2021); Li et al. (2021a) and transformer-based methods Cao et al. (2021); Liang et al. (2022). The former kind of methods typically restores merely the center frame from the neighboring frames and is often tested in a sliding-window fashion rather than in parallel. These methods generally consist of four stages: feature extraction, feature alignment, feature fusion, and frame reconstruction. Particularly, in the feature alignment stage, they often align all frames towards the center frame, which leads to quadratic complexity with respect to the video length and is hard to extend to long-sequence videos. Instead, the latter kind of methods reconstructs all frames at once based on transformer architectures. They jointly extract, align, and fuse features for all frames, achieving significant performance improvements over previous methods. However, current transformer-based methods are burdened with a huge model size and large memory consumption. Different from the above parallel methods, recurrent methods Huang et al. (2015); Sajjadi et al. (2018); Fuoli et al. (2019); Haris et al. (2019); Xiang et al. (2020a); Isobe et al. (2020a, c); Chan et al. (2021a, b); Lin et al. (2021); Nah et al. (2019b); Zhong et al. (2020); Son et al. (2021) propagate latent features from one frame to the next sequentially, where information from previous frames is accumulated for the restoration of later frames. Basically, they are composed of three stages: feature extraction, feature propagation and frame reconstruction. Due to the recurrent nature of feature propagation, recurrent methods suffer from information loss and the inapplicability of distributed deployment.

Alignment in video restoration.

Unlike image restoration that mainly focuses on feature extraction Dong et al. (2014); Zhang et al. (2017, 2018a, 2018b); Liang et al. (2021d, b, c); Sun et al. (2021b); Zhang et al. (2021, 2022), how to align multiple highly-related but misaligned frames is another key problem in video restoration. Traditionally, many methods Liao et al. (2015); Kappeler et al. (2016); Caballero et al. (2017b); Liu et al. (2017); Tao et al. (2017); Caballero et al. (2017a); Xue et al. (2019); Chan et al. (2021a) first estimate the optical flow between neighbouring frames Dosovitskiy et al. (2015); Ranjan and Black (2017); Sun et al. (2018) and then conduct image warping for alignment. Other techniques, such as deformable convolution Dai et al. (2017); Zhu et al. (2019); Tian et al. (2020); Wang et al. (2019a); Chan et al. (2021b); Cao et al. (2021), dynamic filter Jo et al. (2018) and mutual attention Liang et al. (2022), have also been exploited for implicit feature alignment.

2.2 Vision Transformer

Transformer Vaswani et al. (2017) is the de-facto standard architecture in natural language processing. Recently, it has been applied to vision problems by viewing pixels or image patches as tokens Carion et al. (2020); Dosovitskiy et al. (2020), achieving remarkable performance gains in various computer vision tasks, including image classification Dosovitskiy et al. (2020); Li et al. (2021b); Liu et al. (2021), object detection Vaswani et al. (2021); Liu et al. (2020); Xia et al. (2022), semantic segmentation Wu et al. (2020); Dong et al. (2021); Sun et al. (2021a), etc. It also achieves promising results in restoration tasks Chen et al. (2021); Liang et al. (2021a); Wang et al. (2021); Lin et al. (2022); Cao et al. (2021); Liang et al. (2022); Fuoli et al. (2022); Geng et al. (2022); Cao et al. (2022); Yun et al. (2022); Liu et al. (2022a). In particular, for video restoration, Cao et al. (2021) propose the first transformer model for video SR, while Liang et al. (2022) propose a unified framework for video SR, deblurring and denoising.

We note that some transformer-based works Zhu et al. (2020); Xia et al. (2022) have tried to combine the concept of deformation Dai et al. (2017); Zhu et al. (2019) with the attention mechanism Vaswani et al. (2017). Zhu et al. (2020) directly predict the attention weights from the query feature without considering its feature interaction with the supporting locations. Concurrently, Xia et al. (2022) place the supporting points uniformly on the image to make use of global information. Both methods are proposed for recognition tasks such as object detection, which is fundamentally different from video alignment in our model.

[Figure: figures/arch.pdf]

Figure 1: The architecture of the recurrent video restoration transformer (RVRT). From left to right, it consists of shallow feature extraction, recurrent feature refinement and HQ frame reconstruction. In recurrent feature refinement (RFR, see more details in Fig. 2), we divide the video into $N$-frame clips ($N=2$ in this figure) and process the frames within one clip in parallel within a globally recurrent framework in time. Multiple refinement layers are stacked for better performance.

3 Methodology

3.1 Overall Architecture

Given a low-quality (LQ) video sequence $I^{LQ} \in \mathbb{R}^{T \times H \times W \times C_{in}}$, where $T$, $H$, $W$ and $C_{in}$ are the video length, height, width and channel number, respectively, the goal of video restoration is to reconstruct the high-quality (HQ) video $I^{HQ} \in \mathbb{R}^{T \times sH \times sW \times C_{in}}$, where $s$ is the scale factor. To reach this goal, we propose a recurrent video restoration transformer, as illustrated in Fig. 1. The model consists of three parts: shallow feature extraction, recurrent feature refinement and HQ frame reconstruction. More specifically, in shallow feature extraction, we first use a convolution layer to extract features from the LQ video. For deblurring and denoising (i.e., $s=1$), we additionally add two strided convolution layers to downsample the feature and reduce the computation burden in the following layers. After that, several Residual Swin Transformer Blocks (RSTBs) Liang et al. (2021a) are used to extract the shallow feature. Then, we use recurrent feature refinement modules for temporal correspondence modeling and guided deformable attention for video alignment, which are detailed in Sec. 3.2 and Sec. 3.3, respectively. Lastly, we add several RSTBs to generate the final feature and reconstruct the HQ video by a pixel shuffle layer Shi et al. (2016). For training, the Charbonnier loss Charbonnier et al. (1994), $\mathcal{L} = \sqrt{\|\hat{I}^{HQ} - I^{HQ}\|^2 + \epsilon^2}$, is used for all tasks, where $\hat{I}^{HQ}$ denotes the reconstructed HQ video and $\epsilon$ is a small constant (e.g., $10^{-3}$).
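
To make the three-stage pipeline and the training loss concrete, below is a minimal PyTorch-style sketch. The module sizes, the simple convolutional stand-in for the recurrent refinement stage, and the $\epsilon$ value are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class TinyRVRT(nn.Module):
    """Minimal sketch of the three-stage pipeline: shallow feature extraction,
    (placeholder) recurrent feature refinement, and HQ frame reconstruction."""
    def __init__(self, in_ch=3, feat_ch=64, scale=4):
        super().__init__()
        self.scale = scale
        self.shallow = nn.Conv2d(in_ch, feat_ch, 3, 1, 1)            # shallow feature extraction
        self.refine = nn.Sequential(                                  # stand-in for clip-recurrent refinement
            nn.Conv2d(feat_ch, feat_ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, 1, 1))
        self.recon = nn.Sequential(                                   # HQ reconstruction via pixel shuffle
            nn.Conv2d(feat_ch, in_ch * scale ** 2, 3, 1, 1),
            nn.PixelShuffle(scale))

    def forward(self, lq):                                            # lq: (B, T, C, H, W)
        b, t, c, h, w = lq.shape
        x = self.shallow(lq.view(b * t, c, h, w))
        x = x + self.refine(x)                                        # recurrent refinement would go here
        hq = self.recon(x)
        return hq.view(b, t, c, h * self.scale, w * self.scale)

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss used for training; the eps value is an assumption."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

# usage sketch
model = TinyRVRT()
lq = torch.rand(1, 4, 3, 16, 16)
loss = charbonnier_loss(model(lq), torch.rand(1, 4, 3, 64, 64))
```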

[Figure: figures/rfr.pdf]

Figure 2: The illustration of recurrent feature refinement (RFR). The $(t-1)$-th clip feature $F_{t-1}^{i}$ from the $i$-th layer is aligned towards the $t$-th clip as $\hat{F}_{t-1}^{i}$ by guided deformable attention (GDA, see more details in Fig. 3). $\hat{F}_{t-1}^{i}$ and $F_{t}^{i-1}$ are then refined as $F_{t}^{i}$ by several modified residual Swin Transformer blocks (MRSTBs), in which different frames are jointly processed in a parallel way.

3.2 Recurrent Feature Refinement

We stack $L$ recurrent feature refinement modules to refine the video feature by exploiting the temporal correspondence between different frames. To make a trade-off between recurrent and transformer-based methods, we process $N$ frames locally in parallel on the basis of a globally recurrent framework.

Formally, given the video feature $F^{i-1} \in \mathbb{R}^{T \times H \times W \times C}$ from the $(i-1)$-th layer, we first reshape it as a 5-dimensional tensor of shape $\frac{T}{N} \times N \times H \times W \times C$ by dividing it into $\frac{T}{N}$ video clip features $\{F_{t}^{i-1}\}_{t=1}^{T/N}$. Each clip feature $F_{t}^{i-1}$ ($1 \le t \le \frac{T}{N}$) has $N$ neighbouring frame features $\{F_{t,j}^{i-1}\}_{j=1}^{N}$. To utilize information from neighbouring clips, we align the $(t-1)$-th clip feature towards the $t$-th clip based on the optical flow $O_{t-1 \to t}$, clip feature $F_{t-1}^{i}$ and clip feature $F_{t}^{i-1}$. This is formulated as follows:

$\hat{F}_{t-1}^{i} = \mathrm{GDA}\big(F_{t-1}^{i},\, F_{t}^{i-1},\, O_{t-1 \to t}\big),$   (1)

where GDA is the guided deformable attention and $\hat{F}_{t-1}^{i}$ is the aligned clip feature. The details of GDA will be described in Sec. 3.3.

Similar to recurrent neural networks Sajjadi et al. (2018); Chan et al. (2021a, b), as shown in Fig. 2, we update the clip feature of each time step as follows:

$F_{t}^{i} = \mathrm{RFR}\big(F_{t}^{0}, F_{t}^{1}, \ldots, F_{t}^{i-1}, \hat{F}_{t-1}^{i}\big),$   (2)

where $F_{t}^{0}$ is the output of the shallow feature extraction module and $F_{t}^{1}, \ldots, F_{t}^{i-1}$ are from previous recurrent feature refinement modules. RFR is the recurrent feature refinement module that consists of a convolution layer for feature fusion and several modified residual Swin Transformer blocks (MRSTBs) for feature refinement. In MRSTB, we upgrade the original 2D attention window to a 3D attention window, so that every frame in the clip can attend to itself and other frames simultaneously, allowing implicit feature aggregation. In addition, in order to accumulate information forward and backward in time, we reverse the video sequence for all even recurrent feature refinement modules Huang et al. (2015); Chan et al. (2021b).
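
As a rough illustration of the clip-wise propagation in Eq. (2), the sketch below processes the video clip by clip, aligning the previously refined clip with a placeholder gda_align and jointly refining the frames of the current clip. Module names and layer sizes are assumptions, and simple convolutions stand in for GDA and the MRSTBs.

```python
import torch
import torch.nn as nn

class ClipRecurrentRefinement(nn.Module):
    """Sketch of one recurrent feature refinement module (clip length N=2 by default)."""
    def __init__(self, ch=64, clip_len=2):
        super().__init__()
        self.clip_len = clip_len
        self.fuse = nn.Conv2d(2 * ch, ch, 3, 1, 1)       # convolution for feature fusion
        self.mrstb = nn.Sequential(                       # stand-in for the MRSTBs (joint frame processing)
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, 1, 1))

    def gda_align(self, prev_clip, cur_clip, flow):
        # placeholder: the real model aligns prev_clip towards cur_clip with guided deformable attention
        return prev_clip

    def forward(self, feat, flows=None):
        # feat: (T, C, H, W) video feature from the previous layer, T divisible by clip_len
        t, c, h, w = feat.shape
        clips = feat.view(t // self.clip_len, self.clip_len, c, h, w)
        out, prev = [], torch.zeros_like(clips[0])
        for i in range(clips.shape[0]):
            aligned = self.gda_align(prev, clips[i], None if flows is None else flows[i])
            fused = self.fuse(torch.cat([aligned, clips[i]], dim=1))  # (N, C, H, W)
            prev = clips[i] + self.mrstb(fused)                       # jointly refine the N frames in parallel
            out.append(prev)
        return torch.stack(out).view(t, c, h, w)

# usage sketch: refined = ClipRecurrentRefinement()(torch.randn(8, 64, 32, 32))
```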

The above recurrent feature refinement module is the key component of the proposed RVRT model. Globally, features of different video clips are propagated in a recurrent way. Locally, features of different frames are updated jointly in parallel. For an arbitrary single frame, it can make full use of global information accumulated in time and local information extracted together by the self-attention mechanism. As we can see, RVRT is a generalization of both recurrent and transformer models. It becomes a recurrent model when $N=1$ and a transformer model when $N=T$. This is fundamentally different from previous methods that simply adopt transformer blocks to replace CNN blocks within a recurrent architecture Wan et al. (2022). It is also different from existing attempts in natural language processing Wang et al. (2019b); Lei et al. (2020).

[Figure: figures/gda.pdf]

Figure 3: The illustration of guided deformable attention (GDA). We estimate offsets of multiple relevant locations from different frames based on the warped clip, and then aggregate features of different locations dynamically by the attention mechanism. $F_{t-1}^{i}$ is the $(t-1)$-th clip feature from the $i$-th layer, while $\bar{F}_{t-1}^{i}$ and $\hat{F}_{t-1}^{i}$ are the pre-aligned and aligned features of $F_{t-1}^{i}$. $O$ and $o$ denote optical flows and offsets, respectively.

3.3 Guided Deformable Attention for Video Alignment

Different from previous frameworks, the proposed RVRT needs to align neighboring related but misaligned video clips, as indicated in Eq. (1). In this subsection, we propose the guided deformable attention (GDA) for video clip-to-clip alignment.

Given the $(t-1)$-th clip feature $F_{t-1}^{i}$ from the $i$-th layer, our goal is to align it towards the $t$-th clip as a list of features $\hat{F}_{t-1}^{i} = \{\hat{F}_{t-1 \to t,j}^{i}\}_{j=1}^{N}$, where $\hat{F}_{t-1 \to t,j}^{i}$ denotes the aligned clip feature towards the $j$-th frame feature of the $t$-th clip, and $\hat{F}_{t-1,k \to t,j}^{i}$ is the aligned frame feature from the $k$-th frame in the $(t-1)$-th clip to the $j$-th frame in the $t$-th clip. Inspired by optical flow estimation designs Dosovitskiy et al. (2015); Niklaus (2018); Sun et al. (2018); Chan et al. (2021b), we first pre-align $F_{t-1}^{i}$ with the optical flow $O_{t-1 \to t}$ as $\bar{F}_{t-1}^{i} = \mathcal{W}(F_{t-1}^{i}, O_{t-1 \to t})$, where $\mathcal{W}(\cdot)$ denotes the warping operation. For convenience, we summarize the pre-alignments of all "$k$-to-$j$" ($1 \le k, j \le N$) frame pairs between the $(t-1)$-th and $t$-th video clips as follows:

$\bar{F}_{t-1,k \to t,j}^{i} = \mathcal{W}\big(F_{t-1,k}^{i},\, O_{t-1,k \to t,j}\big).$   (3)
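
The pre-alignment in Eq. (3) is standard backward warping with optical flow. A self-contained sketch is given below, assuming flows in pixel units and the common grid_sample-based implementation; it would be applied to every "k-to-j" frame pair of the two clips.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Backward-warp feat (B, C, H, W) with optical flow flow (B, H, W, 2) given in pixels."""
    b, c, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys), dim=-1).float().to(feat.device)   # (H, W, 2) in (x, y) order
    coords = base.unsqueeze(0) + flow                              # displaced sampling locations
    # normalize coordinates to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[..., 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[..., 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                           # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)

# usage sketch: pre_aligned = flow_warp(torch.rand(1, 64, 32, 32), torch.zeros(1, 32, 32, 2))
```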

After that, we predict the optical flow offsets $o_{t-1 \to t}$ from the concatenation of $F_{t}^{i-1}$, $\bar{F}_{t-1}^{i}$ and $O_{t-1 \to t}$ along the channel dimension. A small convolutional neural network (CNN) with several convolutional layers and ReLU layers is used for the prediction. This is formulated as

$o_{t-1 \to t} = \mathrm{CNN}\big(\big[F_{t}^{i-1},\, \bar{F}_{t-1}^{i},\, O_{t-1 \to t}\big]\big),$   (4)

where the current misalignment between the $t$-th clip feature and the warped $(t-1)$-th clip features can reflect the offset required for further alignment. In practice, we initialize $O_{t-1 \to t}$ as the optical flows estimated from the LQ input video via SpyNet Ranjan and Black (2017), and predict $M$ offsets for each frame ($MN$ offsets in total). The optical flows are updated layer by layer as follows:

$O_{t-1,k \to t,j}^{i} = O_{t-1,k \to t,j}^{i-1} + \frac{1}{M}\sum_{m=1}^{M} o_{t-1,k \to t,j}^{(m)},$   (5)

where $o_{t-1,k \to t,j}^{(m)}$ denotes the $m$-th offset in the $M$ predictions from the $k$-th frame to the $j$-th frame.
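
The offset prediction of Eq. (4) and the layer-wise flow update of Eq. (5) can be sketched as follows. The number of convolution layers, the offset count M = 9, and the use of the mean offset for the flow update are assumptions for illustration.

```python
import torch
import torch.nn as nn

class OffsetPredictor(nn.Module):
    """Predict M offsets per location from [current clip feature, warped previous clip feature, flow]."""
    def __init__(self, ch=64, num_offsets=9):
        super().__init__()
        self.m = num_offsets
        self.net = nn.Sequential(
            nn.Conv2d(2 * ch + 2, ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * num_offsets, 3, 1, 1))      # (dx, dy) for each of the M candidate locations

    def forward(self, cur_feat, warped_prev_feat, flow):
        # cur_feat, warped_prev_feat: (B, C, H, W); flow: (B, 2, H, W)
        o = self.net(torch.cat([cur_feat, warped_prev_feat, flow], dim=1))
        o = o.view(flow.shape[0], self.m, 2, flow.shape[-2], flow.shape[-1])  # (B, M, 2, H, W)
        refined_flow = flow + o.mean(dim=1)               # Eq. (5)-style update with the mean offset
        return o, refined_flow

# usage sketch:
# offsets, new_flow = OffsetPredictor()(torch.rand(1, 64, 32, 32),
#                                       torch.rand(1, 64, 32, 32), torch.zeros(1, 2, 32, 32))
```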

Then, for the $j$-th frame of the $t$-th clip, we sample its relevant features from the $(t-1)$-th clip feature according to the predicted locations, which are indicated by the sum of optical flows and offsets, i.e., $O_{t-1,k \to t,j} + o_{t-1,k \to t,j}^{(m)}$, obtained according to the chain relationship of optical flows Chan et al. (2021b); Ranjan and Black (2017). For simplicity, we define the queries $Q$, keys $K$ and values $V$ as follows:

$Q = F_{t,j}^{i-1} P^{Q},$   (6)
$K = \mathcal{S}\big(F_{t-1}^{i} P^{K},\, O_{t-1 \to t,j} + o_{t-1 \to t,j}\big),$   (7)
$V = \mathcal{S}\big(F_{t-1}^{i} P^{V},\, O_{t-1 \to t,j} + o_{t-1 \to t,j}\big),$   (8)

where $F_{t,j}^{i-1} P^{Q}$ is the projected feature from the $j$-th frame of the $t$-th clip, and $\mathcal{S}(\cdot)$ denotes bilinear sampling. $K$ and $V$ are the projected features that are bilinearly sampled from the locations $O_{t-1 \to t,j} + o_{t-1 \to t,j}$ of $F_{t-1}^{i} P^{K}$ and $F_{t-1}^{i} P^{V}$, respectively. $P^{Q}$, $P^{K}$ and $P^{V}$ are the projection matrices. Note that we first project the feature and then do the sampling to reduce redundant computation.

Next, similar to the attention mechanism Vaswani et al. (2017), we calculate the attention weights based on $Q$ and $K$, and then compute the aligned feature as a weighted sum of $V$ as follows:

$\hat{F}_{t-1 \to t,j}^{i} = \mathrm{SoftMax}\big(Q K^{T} / \sqrt{D}\big)\, V,$   (9)

where SoftMax is the softmax operation along the row direction and $\sqrt{D}$ is a scaling factor ($D$ is the channel number of $Q$).
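
A simplified, single-head sketch of the attention in Eqs. (6)-(9) is shown below: M key/value features are bilinearly sampled at the flow-plus-offset locations and aggregated with dynamic softmax weights. The full model additionally samples from all N frames of the supporting clip, uses separate learned projections P^Q, P^K and P^V, and splits channels into groups and heads; those parts are omitted here.

```python
import torch
import torch.nn.functional as F

def gda_attention(query_feat, support_feat, sample_grids, scale=None):
    """query_feat:   (B, C, H, W)    projected feature of the j-th frame in the t-th clip
    support_feat:    (B, C, H, W)    projected feature of one frame in the (t-1)-th clip
    sample_grids:    (B, M, H, W, 2) flow+offset sampling locations, normalized to [-1, 1]"""
    b, c, h, w = query_feat.shape
    m = sample_grids.shape[1]
    # Eqs. (7)-(8): bilinearly sample M key/value features per query location
    kv = torch.stack([F.grid_sample(support_feat, sample_grids[:, i], mode='bilinear',
                                    padding_mode='border', align_corners=True)
                      for i in range(m)], dim=2)                      # (B, C, M, H, W)
    q = query_feat.permute(0, 2, 3, 1).reshape(b * h * w, 1, c)       # (BHW, 1, C)
    kv = kv.permute(0, 3, 4, 2, 1).reshape(b * h * w, m, c)           # (BHW, M, C)
    scale = scale or c ** 0.5
    attn = torch.softmax(q @ kv.transpose(1, 2) / scale, dim=-1)      # (BHW, 1, M), Eq. (9)
    out = (attn @ kv).reshape(b, h, w, c).permute(0, 3, 1, 2)         # dynamic weighted sum of values
    return out

# usage sketch:
# aligned = gda_attention(torch.rand(1, 64, 8, 8), torch.rand(1, 64, 8, 8), torch.zeros(1, 9, 8, 8, 2))
```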

Lastly, since Eq. (9) only aggregates information spatially, we add a multi-layer perceptron (MLP) with two fully-connected layers and a GELU activation function between them to enable channel interaction as follows:

$\hat{F}_{t-1 \to t,j}^{i} \leftarrow \hat{F}_{t-1 \to t,j}^{i} + \mathrm{MLP}\big(\hat{F}_{t-1 \to t,j}^{i}\big),$   (10)

where a residual connection is used to stabilize training. The hidden and output channel numbers of the MLP are $rC$ ($r$ is the channel expansion ratio) and $C$, respectively.

Multi-group multi-head guided deformable attention.

We can divide the channel into several deformable groups and perform the deformable sampling for different groups in parallel. Besides, in the attention mechanism, we can further divide one deformable group into several attention heads and perform the attention operation separately for different heads. All groups and heads are concatenated together before channel interaction.

Connection to deformable convolution.

Deformable convolution Dai et al. (2017); Zhu et al. (2019) uses a learned fixed weight for feature aggregation, which can be seen as a special case of GDA, i.e., using a different projection matrix for each location and then directly averaging the resulting features. Its parameter number and computation complexity grow linearly with the number of sampling locations $M$ (roughly $MC^2$ parameters and $\mathcal{O}(MC^2HW)$ computation for a $C$-channel feature of size $H \times W$). In contrast, GDA uses the same projection matrix for all locations but generates dynamic weights to aggregate them; its parameters are dominated by the shared projections and the MLP, and its cost is comparable to that of deformable convolution when the location number $M$ and the MLP ratio $r$ are chosen properly. It also supports arbitrary location numbers and allows for both frame-to-frame and video-to-video alignment without any modification.

4 Experiments

4.1 Experimental Setup

For shallow feature extraction and HQ frame reconstruction, we use 1 RSTB that has 2 Swin Transformer layers. For recurrent feature refinement, we use 4 refinement modules with a clip size of 2, each of which has 2 MRSTBs with 2 modified Swin Transformer layers. For both RSTB and MRSTB, the spatial attention window size and head number are $8\times 8$ and 6, respectively. We use 144 channels for video SR and 192 channels for deblurring and denoising. In GDA, we use 12 deformable groups and 12 attention heads with 9 candidate locations. We empirically project the query to a higher-dimensional space because we found it improves the performance slightly and the parameter number of GDA is not a bottleneck. In training, we randomly crop HQ patches and use different video lengths for different datasets: 30 frames for REDS Nah et al. (2019a), 14 frames for Vimeo-90K Xue et al. (2019), and 16 frames for DVD Su et al. (2017), GoPro Nah et al. (2017) as well as DAVIS Khoreva et al. (2018). The Adam optimizer Kingma and Ba (2014) with the default setting is used to train the model for 600,000 iterations with a batch size of 8. The learning rate is initialized as $4\times 10^{-4}$ and decreased with the Cosine Annealing scheme Loshchilov and Hutter (2016). To stabilize training, we initialize SpyNet Ranjan and Black (2017); Niklaus (2018) with pretrained weights, fix it for the first 30,000 iterations and reduce its learning rate by 75%.
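
For reference, the optimizer and learning-rate schedule described above can be set up roughly as follows. The module attribute names (model.spynet, model.body) and the stand-in modules are hypothetical; only the schedule logic mirrors the description.

```python
import torch
import torch.nn as nn

class DummyRVRT(nn.Module):            # stand-in: a real model would be RVRT with a SpyNet submodule
    def __init__(self):
        super().__init__()
        self.spynet = nn.Conv2d(6, 2, 3, padding=1)   # placeholder for the pretrained flow estimator
        self.body = nn.Conv2d(3, 3, 3, padding=1)     # placeholder for the rest of the network

model, total_iters, base_lr = DummyRVRT(), 600_000, 4e-4
optimizer = torch.optim.Adam([
    {'params': model.body.parameters(), 'lr': base_lr},
    {'params': model.spynet.parameters(), 'lr': base_lr * 0.25},   # SpyNet learning rate reduced by 75%
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters)

for p in model.spynet.parameters():     # SpyNet is frozen for the first 30,000 iterations
    p.requires_grad_(False)

for it in range(total_iters):
    if it == 30_000:
        for p in model.spynet.parameters():
            p.requires_grad_(True)
    # ... forward pass on a training batch, Charbonnier loss, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```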

Clip Length 1 2 3
PSNR 31.98 32.10 32.07
Table 1: Ablation study on clip length.

Alignment None Warping Xue et al. (2019) TMSA Liang et al. (2022) DCN Tian et al. (2020) GDA (frame-to-frame) GDA (clip-to-clip)
PSNR 26.14 28.88 30.45 31.93 32.00 32.10
Table 2: Ablation study on different video alignment techniques.

[Figure: figures/ablation_clip_hacked.pdf]
Figure 4: Per-frame PSNR drop when all pixels of the 50-th frame are hacked to be zeros. $N$ is the clip length.

Optical Flow Guidance | Optical Flow Update | MLP | PSNR
✗ | ✓ | ✓ | 30.99
✓ | ✗ | ✓ | 32.03
✓ | ✓ | ✗ | 31.83
✓ | ✓ | ✓ | 32.10
Table 3: Ablation study on different GDA components.

Deformable Group 1 6 12 12 12 24
Attention Head 1 6 12 24 36 24
PSNR 31.63 32.03 32.10 32.13 32.03 32.11
Table 4: Ablation study on deformable groups and attention heads.

4.2 Ablation Study

To explore the effectiveness of different components, we conduct ablation studies on REDS Nah et al. (2019a) for video SR. For efficiency, we reduce the MRSTB blocks by half and use 12 frames in training.

The impact of clip length.

In RVRT, we divide the video into $N$-frame clips. As shown in Table 1, the performance rises when the clip length is increased from 1 to 2. However, the performance saturates when $N=3$, possibly due to the fact that long-range optical flow guidance between distant frames is often inaccurate Jabri et al. (2020). Besides, to compare the temporal modelling ability, we hack the input LQ video (Clip 000 from REDS, 100 frames in total) by manually setting all pixels of the 50-th frame to zeros. As indicated in Fig. 4, on the one hand, $N=2$ has a smaller performance drop and all its frames still have higher PSNR than $N=1$ (equal to a recurrent model) after the attack, showing that RVRT can mitigate the noise amplification from the hacked frame to the rest of the frames. On the other hand, the hacked frame of $N=2$ has an impact on more neighbouring frames than $N=1$, which means that RVRT can alleviate information loss and utilize more frames than a recurrent model for restoration.
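
The hacking experiment above can be reproduced with a simple probe like the one below, assuming a model that maps an LQ clip of shape (1, T, C, H, W) to an HQ clip of the same length with pixel values in [0, 1].

```python
import torch

def psnr(x, y, eps=1e-12):
    return -10.0 * torch.log10(torch.mean((x - y) ** 2) + eps)

@torch.no_grad()
def per_frame_psnr_after_hack(model, lq, gt, hacked_idx=49):
    """Zero out one LQ frame (the 50-th, index 49) and report per-frame PSNR of the restored video."""
    lq_hacked = lq.clone()
    lq_hacked[:, hacked_idx] = 0.0          # set all pixels of the attacked frame to zero
    out = model(lq_hacked)
    return [psnr(out[:, t], gt[:, t]).item() for t in range(gt.shape[1])]
```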

The impact of video alignment.

The alignment of video clips plays a key role in our framework. We compare the proposed clip-to-clip guided deformable attention (GDA) with existing frame-to-frame alignment techniques by performing them frame by frame, followed by concatenation and channel reduction. As we can see from Table 2, GDA outperforms all existing methods when it is used for frame-to-frame alignment (denoted as GDA (frame-to-frame) in Table 2), and leads to a further improvement when we aggregate features directly from the whole clip (GDA (clip-to-clip)).

The impact of different components in GDA.

We further conduct an ablation study on GDA in Table 3. As we can see, the optical flow guidance is critical for the model, leading to a PSNR gain of 1.11dB. The update of optical flow in different layers can further improve the result. The channel interaction in MLP also plays an important role, since the attention mechanism only aggregates information spatially.

The impact of deformable group and attention head.

We also conduct experiments on different group and head numbers in GDA. As shown in Table 4, as the number of deformable groups rises, the PSNR first increases and then remains almost unchanged. Besides, doubling the attention heads leads to slightly better results at the expense of higher computation, but using too many heads has an adverse impact, as the head dimension may become too small.

4.3 Video Super-Resolution

For video SR, we consider two settings: bicubic (BI) and blur-downsampling (BD) degradation. For BI degradation, we train the model on two different datasets, REDS Nah et al. (2019a) and Vimeo-90K Xue et al. (2019), and then test it on their corresponding testsets, REDS4 and Vimeo-90K-T. We additionally test on Vid4 Liu and Sun (2013) along with Vimeo-90K-T. For BD degradation, we train the model on Vimeo-90K and test it on Vimeo-90K-T, Vid4, and UDM10 Yi et al. (2019). The comparisons with existing methods are shown in Table 5. As we can see, RVRT achieves the best performance on REDS4 and Vid4 for both degradations. Compared with the representative recurrent model BasicVSR++ Chan et al. (2021b), RVRT improves the PSNR by significant margins of 0.2~0.5dB. Compared with the recent transformer-based model VRT Liang et al. (2022), RVRT outperforms VRT on REDS4 and Vid4 by up to 0.36dB. The visual comparisons of different methods are shown in Fig. 5. It is clear that RVRT generates sharp and clear HQ frames, while other methods fail to restore fine textures and details.

Method | BI degradation: REDS4 Nah et al. (2019a) (RGB channel), Vimeo-90K-T Xue et al. (2019) (Y channel), Vid4 Liu and Sun (2013) (Y channel) | BD degradation: UDM10 Yi et al. (2019) (Y channel), Vimeo-90K-T Xue et al. (2019) (Y channel), Vid4 Liu and Sun (2013) (Y channel)
Bicubic 26.14/0.7292 31.32/0.8684 23.78/0.6347 28.47/0.8253 31.30/0.8687 21.80/0.5246
SwinIR Liang et al. (2021a) 29.05/0.8269 35.67/0.9287 25.68/0.7491 35.42/0.9380 34.12/0.9167 25.25/0.7262
SwinIR-ft Liang et al. (2021a) 29.24/0.8319 35.89/0.9301 25.69/0.7488 36.76/0.9467 35.70/0.9293 25.62/0.7498
TOFlow Xue et al. (2019) 27.98/0.7990 33.08/0.9054 25.89/0.7651 36.26/0.9438 34.62/0.9212 25.85/0.7659
FRVSR Sajjadi et al. (2018) - - - 37.09/0.9522 35.64/0.9319 26.69/0.8103
DUF Jo et al. (2018) 28.63/0.8251 - 27.33/0.8319 38.48/0.9605 36.87/0.9447 27.38/0.8329
PFNL Yi et al. (2019) 29.63/0.8502 36.14/0.9363 26.73/0.8029 38.74/0.9627 - 27.16/0.8355
RBPN Haris et al. (2019) 30.09/0.8590 37.07/0.9435 27.12/0.8180 38.66/0.9596 37.20/0.9458 27.17/0.8205

MuCAN Li et al. (2020) 30.88/0.8750 37.32/0.9465 - - - -
RLSP Fuoli et al. (2019) - - - 38.48/0.9606 36.49/0.9403 27.48/0.8388
TGA Isobe et al. (2020b) - - - 38.74/0.9627 37.59/0.9516 27.63/0.8423
RSDN Isobe et al. (2020a) - - - 39.35/0.9653 37.23/0.9471 27.92/0.8505
RRN Isobe et al. (2020c) - - - 38.96/0.9644 - 27.69/0.8488
FDAN Lin et al. (2021) - - - 39.91/0.9686 37.75/0.9522 27.88/0.8508
EDVR Wang et al. (2019a) 31.09/0.8800 37.61/0.9489 27.35/0.8264 39.89/0.9686 37.81/0.9523 27.85/0.8503
GOVSR Yi et al. (2021) - - - 40.14/0.9713 37.63/0.9503 28.41/0.8724
BasicVSR Chan et al. (2021a) 31.42/0.8909 37.18/0.9450 27.24/0.8251 39.96/0.9694 37.53/0.9498 27.96/0.8553
IconVSR Chan et al. (2021a) 31.67/0.8948 37.47/0.9476 27.39/0.8279 40.03/0.9694 37.84/0.9524 28.04/0.8570
VRT Liang et al. (2022) 32.19/0.9006 38.20/0.9530 27.93/0.8425 41.05/0.9737 38.72/0.9584 29.42/0.8795
BasicVSR++ Chan et al. (2021b) 32.39/0.9069 37.79/0.9500 27.79/0.8400 40.72/0.9722 38.21/0.9550 29.04/0.8753
RVRT (ours) 32.75/0.9113 38.15/0.9527 27.99/0.8462 40.90/0.9729 38.59/0.9576 29.54/0.8810
Table 5: Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for video super-resolution (×4) on REDS4 Nah et al. (2019a), Vimeo-90K-T Xue et al. (2019), Vid4 Liu and Sun (2013) and UDM10 Yi et al. (2019).
Method #Param (M) Memory (MB) Runtime (ms) PSNR (dB)
BasicVSR++ Chan et al. (2021b) 7.3 223 77 32.39
BasicVSR++ Chan et al. (2021b)+RSTB Liang et al. (2021a) 9.3 1021 201 32.61
EDVR Wang et al. (2019a) 20.6 3535 378 31.09
VSRT Cao et al. (2021) 32.6 27487 328 31.19
VRT Liang et al. (2022) 35.6 2149 243 32.19
RVRT (ours) 10.8 1056 183 32.75
Table 6: Comparison of model size, testing memory and runtime for an LQ input.

We compare the model size, testing memory consumption and runtime of different models in Table 6. Compared with the representative parallel methods EDVR Wang et al. (2019a), VSRT Cao et al. (2021) and VRT Liang et al. (2022), RVRT achieves significant performance gains with roughly half or less of their model parameters and testing memory usage. It also reduces the runtime by at least 25%. Compared with the recurrent model BasicVSR++ Chan et al. (2021b), RVRT brings a PSNR improvement of 0.36dB. As for its larger testing memory and runtime, we argue that this is mainly because CNN layers are highly optimized in existing deep learning frameworks. To prove it, we use the transformer-based RSTB blocks in RVRT to replace the CNN blocks in BasicVSR++, in which case it has similar memory usage to and a longer runtime than our model.

In addition, to better understand how guided deformable attention works, we visualize the predicted offsets on the LQ frames and show the attention weight in Fig. 6. As we can see, multiple offsets are predicted to select multiple sampled locations in the neighbourhood of the corresponding pixel. According to the feature similarity between the query feature and the sampled features, features of different locations are aggregated by calculating a dynamic attention weight.

[Figure: zoomed patches from Frame 024, Clip 011, REDS Nah et al. (2019a) and Frame 012, Clip city, Vid4 Liu and Sun (2013), comparing LQ (×4), EDVR Wang et al. (2019a), VSRT Cao et al. (2021), BasicVSR Chan et al. (2021a), BasicVSR++ Chan et al. (2021b), VRT Liang et al. (2022), RVRT (ours) and GT.]
Figure 5: Visual comparison of video super-resolution (×4) methods on REDS Nah et al. (2019a) and Vid4 Liu and Sun (2013).

[Figure: figures/visual_attn_dga.pdf]

Figure 6: Visualization of the offsets and attention weights predicted in guided deformable attention. Although guided deformable attention is conducted on features, we plot the illustrations on the LQ input frames for better understanding. Best viewed by zooming in.

4.4 Video Deblurring

For video deblurring, the model is trained and tested on two different datasets, DVD Su et al. (2017) and GoPro Nah et al. (2017), with their official training/testing splits. As shown in Tables 7 and 8, RVRT shows its superiority over most methods with huge improvements of 1.40~2.27dB on the two datasets. Even though the performance gain over VRT is relatively small, RVRT has a smaller model size and much less runtime: the model size and runtime of RVRT are 13.6M and 0.3s, while VRT has 18.3M parameters and a runtime of 2.2s on an LQ input. The visual comparison is provided in the supplementary material due to the space limit.

4.5 Video Denoising

For video denoising, we train the model on the training set of DAVIS Khoreva et al. (2018) and test it on the corresponding testset and Set8 Tassano et al. (2019). For fairness of comparison, following Tassano et al. (2019, 2020), we train a non-blind additive white Gaussian denoising model that takes the noise level σ as an input. Similar to the case of video deblurring, there is a huge gap (0.60~2.37dB) between RVRT and most methods. Compared with VRT, RVRT has slightly better performance at large noise levels, with a smaller model size (12.8M vs. 18.4M) and less runtime (0.2s vs. 1.5s) on an LQ input. The visual comparison is provided in the supplementary material due to the space limit.

Method SRN Tao et al. (2018) DBN Su et al. (2017) STFAN Zhou et al. (2019) STTN Kim et al. (2018) SFE Xiang et al. (2020b) EDVR Wang et al. (2019a)
PSNR 30.53 30.01 31.24 31.61 31.71 31.82
SSIM 0.8940 0.8877 0.9340 0.9160 0.9160 0.9160
Method TSP Pan et al. (2020) PVDNet Son et al. (2021) GSTA Suin and Rajagopalan (2021) ARVo Li et al. (2021a) VRT Liang et al. (2022) RVRT (ours)
PSNR 32.13 32.31 32.53 32.80 34.24 34.30
SSIM 0.9268 0.9260 0.9468 0.9352 0.9651 0.9655
Table 7: Quantitative comparison (average RGB channel PSNR/SSIM) with state-of-the-art methods for video deblurring on DVD Su et al. (2017).
Method SRN Tao et al. (2018) DMPHN Zhang et al. (2019) SAPHN Suin et al. (2020) MPRNet Zamir et al. (2021) IFI-RNN Nah et al. (2019b) ESTRNN Zhong et al. (2020)
PSNR 30.26 31.20 31.85 32.66 31.05 31.07
SSIM 0.9342 0.9400 0.9480 0.9590 0.9110 0.9023
Method EDVR Wang et al. (2019a) TSP Pan et al. (2020) PVDNet Son et al. (2021) GSTA Suin and Rajagopalan (2021) VRT Liang et al. (2022) RVRT (ours)
PSNR 31.54 31.67 31.98 32.10 34.81 34.92
SSIM 0.9260 0.9279 0.9280 0.9600 0.9724 0.9738
Table 8: Quantitative comparison (average RGB channel PSNR/SSIM) with state-of-the-art methods for video deblurring on GoPro Nah et al. (2017).
Dataset σ VNLB Arias and Morel (2018) DVDNet Tassano et al. (2019) FastDVDNet Tassano et al. (2020) PaCNet Vaksman et al. (2021) VRT Liang et al. (2022) RVRT (ours)
DAVIS 10 38.85 38.13 38.71 39.97 40.82 40.57
20 35.68 35.70 35.77 36.82 38.15 38.05
30 33.73 34.08 34.04 34.79 36.52 36.57
40 32.32 32.86 32.82 33.34 35.32 35.47
50 31.13 31.85 31.86 32.20 34.36 34.57
Set8 10 37.26 36.08 36.44 37.06 37.88 37.53
20 33.72 33.49 33.43 33.94 35.02 34.83
30 31.74 31.79 31.68 32.05 33.35 33.30
40 30.39 30.55 30.46 30.70 32.15 32.21
50 29.24 29.56 29.53 29.66 31.22 31.33
Table 9: Quantitative comparison (average RGB channel PSNR) with state-of-the-art methods for video denoising on DAVIS Khoreva et al. (2018) and Set8 Tassano et al. (2019).

5 Conclusion

In this paper, we proposed a recurrent video restoration transformer with guided deformable attention. It is a globally recurrent model with locally parallel designs, which benefits from the advantages of both parallel and recurrent methods. We also proposed the guided deformable attention module for the special case of video clip-to-clip alignment. Under the guidance of optical flow, it adaptively aggregates information from multiple neighboring locations with the attention mechanism. Extensive experiments on video super-resolution, video deblurring, and video denoising demonstrate the effectiveness of the proposed method.

6 Limitations and Societal Impacts

Although RVRT achieves state-of-the-art performance in video restoration, it still has some limitations. For example, the complexity of pre-alignment by optical flow increases quadratically with respect to the clip length. One possible solution is to develop a video-to-video optical flow estimation model that directly predicts all optical flows. As for societal impacts, similar to other restoration methods, RVRT may bring privacy concerns after restoring blurry videos and lead to misjudgments if used for medical diagnosis.

This work was partially supported by the ETH Zurich Fund (OK), a Huawei Technologies Oy (Finland) project, the China Scholarship Council and an Amazon AWS grant. Special thanks goes to Yijue Chen.

References

  • P. Arias and J. Morel (2018) Video denoising via empirical bayesian estimation of space-time patches. Journal of Mathematical Imaging and Vision 60 (1), pp. 70–93. Cited by: Table 9.
  • J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi (2017a) Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4778–4787. Cited by: §2.1.
  • J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi (2017b) Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4778–4787. Cited by: §1, §2.1, §2.1.
  • J. Cao, Y. Li, K. Zhang, and L. Van Gool (2021) Video super-resolution transformer. arXiv preprint arXiv:2106.06847. Cited by: §1, §1, §2.1, §2.1, §2.2, Figure 5, §4.3, Table 6.
  • M. Cao, Y. Fan, Y. Zhang, J. Wang, and Y. Yang (2022) VDTR: video deblurring with transformer. arXiv preprint arXiv:2204.08023. Cited by: §2.2.
  • N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §2.2.
  • K. C. Chan, X. Wang, K. Yu, C. Dong, and C. C. Loy (2021a) BasicVSR: the search for essential components in video super-resolution and beyond. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4947–4956. Cited by: §1, §1, §1, §2.1, §2.1, §3.2, Figure 5, Table 5.
  • K. C. Chan, S. Zhou, X. Xu, and C. C. Loy (2021b) BasicVSR++: improving video super-resolution with enhanced propagation and alignment. arXiv preprint arXiv:2104.13371. Cited by: §1, §1, §1, §1, §2.1, §2.1, §3.2, §3.3, §3.3, Figure 5, §4.3, §4.3, Table 5, Table 6.
  • P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud (1994) Two deterministic half-quadratic regularization algorithms for computed imaging. In International Conference on Image Processing, pp. 168–172. Cited by: §3.1.
  • H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao (2021) Pre-trained image processing transformer. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 12299–12310. Cited by: §2.2.
  • M. Chu, Y. Xie, J. Mayer, L. Leal-Taixé, and N. Thuerey (2020) Learning temporal coherence via self-supervision for gan-based video generation. ACM Transactions on Graphics (TOG) 39 (4), pp. 75–1. Cited by: §1.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In IEEE International Conference on Computer Vision, pp. 764–773. Cited by: §1, §1, §2.1, §2.2, §3.3.
  • C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pp. 184–199. Cited by: §2.1.
  • X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo (2021) Cswin transformer: a general vision transformer backbone with cross-shaped windows. arXiv preprint arXiv:2107.00652. Cited by: §2.2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.2.
  • A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision, pp. 2758–2766. Cited by: §2.1, §3.3.
  • D. Fuoli, M. Danelljan, R. Timofte, and L. Van Gool (2022) Fast online video super-resolution with deformable attention pyramid. arXiv preprint arXiv:2202.01731. Cited by: §2.2.
  • D. Fuoli, S. Gu, and R. Timofte (2019) Efficient video super-resolution through recurrent latent space propagation. In IEEE International Conference on Computer Vision Workshop, pp. 3476–3485. Cited by: §1, §2.1, Table 5.
  • Z. Geng, L. Liang, T. Ding, and I. Zharkov (2022) RSTT: real-time spatial temporal transformer for space-time video super-resolution. arXiv preprint arXiv:2203.14186. Cited by: §2.2.
  • M. Haris, G. Shakhnarovich, and N. Ukita (2019) Recurrent back-projection network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3897–3906. Cited by: §1, §2.1, Table 5.
  • Y. Huang, W. Wang, and L. Wang (2015) Bidirectional recurrent convolutional networks for multi-frame super-resolution. Advances in Neural Information Processing Systems 28, pp. 235–243. Cited by: §1, §2.1, §3.2.
  • Y. Huang, W. Wang, and L. Wang (2017) Video super-resolution via bidirectional recurrent convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 1015–1028. Cited by: §1, §2.1.
  • T. Isobe, X. Jia, S. Gu, S. Li, S. Wang, and Q. Tian (2020a) Video super-resolution with recurrent structure-detail network. In European Conference on Computer Vision, pp. 645–660. Cited by: §1, §2.1, Table 5.
  • T. Isobe, S. Li, X. Jia, S. Yuan, G. Slabaugh, C. Xu, Y. Li, S. Wang, and Q. Tian (2020b) Video super-resolution with temporal group attention. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8008–8017. Cited by: §1, §2.1, Table 5.
  • T. Isobe, F. Zhu, X. Jia, and S. Wang (2020c) Revisiting temporal modeling for video super-resolution. arXiv preprint arXiv:2008.05765. Cited by: §1, §2.1, Table 5.
  • A. Jabri, A. Owens, and A. Efros (2020) Space-time correspondence as a contrastive random walk. Advances in neural information processing systems 33, pp. 19545–19560. Cited by: §4.2.
  • Y. Jo, S. W. Oh, J. Kang, and S. J. Kim (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3224–3232. Cited by: §2.1, Table 5.
  • A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos (2016) Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2 (2), pp. 109–122. Cited by: §2.1.
  • A. Khoreva, A. Rohrbach, and B. Schiele (2018) Video object segmentation with language referring expressions. In Asian Conference on Computer Vision, pp. 123–141. Cited by: §4.1, §4.5, Table 9.
  • T. H. Kim, M. S. Sajjadi, M. Hirsch, and B. Scholkopf (2018) Spatio-temporal transformer network for video restoration. In European Conference on Computer Vision, pp. 106–122. Cited by: Table 8.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • J. Lei, L. Wang, Y. Shen, D. Yu, T. L. Berg, and M. Bansal (2020) Mart: memory-augmented recurrent transformer for coherent video paragraph captioning. arXiv preprint arXiv:2005.05402. Cited by: §3.2.
  • D. Li, C. Xu, K. Zhang, X. Yu, Y. Zhong, W. Ren, H. Suominen, and H. Li (2021a) Arvo: learning all-range volumetric correspondence for video deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7721–7731. Cited by: §1, §2.1, Table 8.
  • W. Li, X. Tao, T. Guo, L. Qi, J. Lu, and J. Jia (2020) Mucan: multi-correspondence aggregation network for video super-resolution. In European Conference on Computer Vision, pp. 335–351. Cited by: §1, §2.1, Table 5.
  • Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool (2021b) Localvit: bringing locality to vision transformers. arXiv preprint arXiv:2104.05707. Cited by: §2.2.
  • J. Liang, J. Cao, Y. Fan, K. Zhang, R. Ranjan, Y. Li, R. Timofte, and L. Van Gool (2022) VRT: a video restoration transformer. arXiv preprint arXiv:2201.12288. Cited by: §1, §1, §1, §1, §2.1, §2.1, §2.2, Figure 5, §4.3, §4.3, Table 2, Table 5, Table 6, Table 8, Table 9.
  • J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte (2021a) SwinIR: image restoration using swin transformer. In IEEE Conference on International Conference on Computer Vision Workshops, Cited by: §1, §2.2, §3.1, Table 5, Table 6.
  • J. Liang, A. Lugmayr, K. Zhang, M. Danelljan, L. Van Gool, and R. Timofte (2021b) Hierarchical conditional flow: a unified framework for image super-resolution and image rescaling. In IEEE Conference on International Conference on Computer Vision, Cited by: §2.1.
  • J. Liang, G. Sun, K. Zhang, L. Van Gool, and R. Timofte (2021c) Mutual affine network for spatially variant kernel estimation in blind image super-resolution. In IEEE Conference on International Conference on Computer Vision, Cited by: §2.1.
  • J. Liang, K. Zhang, S. Gu, L. Van Gool, and R. Timofte (2021d) Flow-based kernel prior with application to blind super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 10601–10610. Cited by: §2.1.
  • R. Liao, X. Tao, R. Li, Z. Ma, and J. Jia (2015) Video super-resolution via deep draft-ensemble learning. In IEEE International Conference on Computer Vision, pp. 531–539. Cited by: §2.1.
  • J. Lin, Y. Huang, and L. Wang (2021) FDAN: flow-guided deformable alignment network for video super-resolution. arXiv preprint arXiv:2105.05640. Cited by: §1, §2.1, Table 5.
  • J. Lin, Y. Cai, X. Hu, H. Wang, Y. Yan, X. Zou, H. Ding, Y. Zhang, R. Timofte, and L. Van Gool (2022) Flow-guided sparse transformer for video deblurring. arXiv preprint arXiv:2201.01893. Cited by: §2.2.
  • C. Liu and D. Sun (2013) On bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2), pp. 346–360. Cited by: Figure 5, §4.3, Table 5.
  • C. Liu, H. Yang, J. Fu, and X. Qian (2022a) Learning trajectory-aware transformer for video super-resolution. arXiv preprint arXiv:2204.04216. Cited by: §2.2.
  • D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, and T. Huang (2017) Robust video super-resolution with learned temporal dynamics. In IEEE International Conference on Computer Vision, pp. 2507–2515. Cited by: §2.1.
  • H. Liu, Z. Ruan, P. Zhao, C. Dong, F. Shang, Y. Liu, L. Yang, and R. Timofte (2022b) Video super-resolution based on deep learning: a comprehensive survey. Artificial Intelligence Review, pp. 1–55. Cited by: §1.
  • L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen (2020) Deep learning for generic object detection: a survey. International Journal of Computer Vision 128 (2), pp. 261–318. Cited by: §2.2.
  • Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030. Cited by: §1, §2.2.
  • I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.1.
  • S. Nah, S. Baik, S. Hong, G. Moon, S. Son, R. Timofte, and K. Mu Lee (2019a) Ntire 2019 challenge on video deblurring and super-resolution: dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1996–2005. Cited by: Figure 5, §4.1, §4.2, §4.3, Table 5.
  • S. Nah, T. Hyun Kim, and K. Mu Lee (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3883–3891. Cited by: §4.1, §4.4, Table 8.
  • S. Nah, S. Son, and K. M. Lee (2019b) Recurrent neural networks with intra-frame iterations for video deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8102–8111. Cited by: §1, §2.1, Table 8.
  • S. Niklaus (2018) A reimplementation of SPyNet using PyTorch. Note: https://github.com/sniklaus/pytorch-spynet Cited by: §3.3, §4.1.
  • J. Pan, H. Bai, and J. Tang (2020) Cascaded deep video deblurring using temporal sharpness prior. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3043–3051. Cited by: Table 8.
  • A. Ranjan and M. J. Black (2017) Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4161–4170. Cited by: §1, §2.1, §3.3, §3.3, §4.1.
  • M. S. Sajjadi, R. Vemulapalli, and M. Brown (2018) Frame-recurrent video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6626–6634. Cited by: §1, §1, §1, §2.1, §3.2, Table 5.
  • D. Y. Sheth, S. Mohan, J. L. Vincent, R. Manzorro, P. A. Crozier, M. M. Khapra, E. P. Simoncelli, and C. Fernandez-Granda (2021) Unsupervised deep video denoising. In IEEE International Conference on Computer Vision, pp. 1759–1768. Cited by: §2.1.
  • W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883. Cited by: §3.1.
  • H. Son, J. Lee, J. Lee, S. Cho, and S. Lee (2021) Recurrent video deblurring with blur-invariant motion estimation and pixel volumes. ACM Transactions on Graphics 40 (5), pp. 1–18. Cited by: §1, §2.1, Table 8.
  • S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang (2017) Deep video deblurring for hand-held cameras. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1279–1288. Cited by: §1, §2.1, §4.1, §4.4, Table 8.
  • M. Suin, K. Purohit, and A. Rajagopalan (2020) Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3615. Cited by: Table 8.
  • M. Suin and A. Rajagopalan (2021) Gated spatio-temporal attention-guided video deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7802–7811. Cited by: Table 8.
  • D. Sun, X. Yang, M. Liu, and J. Kautz (2018) Pwc-net: cnns for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §1, §2.1, §3.3.
  • G. Sun, Y. Liu, T. Probst, D. P. Paudel, N. Popovic, and L. Van Gool (2021a) Boosting crowd counting with transformers. arXiv preprint arXiv:2105.10926. Cited by: §2.2.
  • L. Sun, C. Sakaridis, J. Liang, Q. Jiang, K. Yang, P. Sun, Y. Ye, K. Wang, and L. Van Gool (2021b) MEFNet: multi-scale event fusion network for motion deblurring. arXiv preprint arXiv:2112.00167. Cited by: §2.1.
  • X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia (2017) Detail-revealing deep video super-resolution. In IEEE International Conference on Computer Vision, pp. 4472–4480. Cited by: §2.1.
  • X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia (2018) Scale-recurrent network for deep image deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8174–8182. Cited by: Table 8.
  • M. Tassano, J. Delon, and T. Veit (2019) Dvdnet: a fast network for deep video denoising. In IEEE International Conference on Image Processing, pp. 1805–1809. Cited by: §2.1, §4.5, Table 9.
  • M. Tassano, J. Delon, and T. Veit (2020) Fastdvdnet: towards real-time deep video denoising without flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1354–1363. Cited by: §2.1, §4.5, Table 9.
  • Y. Tian, Y. Zhang, Y. Fu, and C. Xu (2020) Tdan: temporally-deformable alignment network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3360–3369. Cited by: §1, §1, §1, §2.1, §2.1, Table 2.
  • G. Vaksman, M. Elad, and P. Milanfar (2021) Patch craft: video denoising by deep modeling and patch matching. In IEEE International Conference on Computer Vision, pp. 1759–1768. Cited by: Table 9.
  • A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman, and J. Shlens (2021) Scaling local self-attention for parameter efficient visual backbones. arXiv preprint arXiv:2103.12731. Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §1, §2.2, §2.2, §3.3.
  • Z. Wan, B. Zhang, D. Chen, and J. Liao (2022) Bringing old films back to life. arXiv preprint arXiv:2203.17276. Cited by: §1, §3.2.
  • L. Wang, Y. Guo, L. Liu, Z. Lin, X. Deng, and W. An (2020) Deep video super-resolution using hr optical flow estimation. IEEE Transactions on Image Processing 29, pp. 4323–4336. Cited by: §2.1.
  • X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy (2019a) Edvr: video restoration with enhanced deformable convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1954–1963. Cited by: §1, §1, §1, §1, §2.1, §2.1, Figure 5, §4.3, Table 5, Table 6, Table 8.
  • Z. Wang, X. Cun, J. Bao, and J. Liu (2021) Uformer: a general u-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106. Cited by: §2.2.
  • Z. Wang, Y. Ma, Z. Liu, and J. Tang (2019b) R-transformer: recurrent neural network enhanced transformer. arXiv preprint arXiv:1907.05572. Cited by: §3.2.
  • B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, M. Tomizuka, J. Gonzalez, K. Keutzer, and P. Vajda (2020) Visual transformers: token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677. Cited by: §2.2.
  • Z. Xia, X. Pan, S. Song, L. E. Li, and G. Huang (2022) Vision transformer with deformable attention. arXiv preprint arXiv:2201.00520. Cited by: §2.2, §2.2.
  • X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J. P. Allebach, and C. Xu (2020a) Zooming slow-mo: fast and accurate one-stage space-time video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3370–3379. Cited by: §2.1.
  • X. Xiang, H. Wei, and J. Pan (2020b) Deep video deblurring using sharpness features from exemplars. IEEE Transactions on Image Processing 29, pp. 8976–8987. Cited by: Table 8.
  • T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision 127 (8), pp. 1106–1125. Cited by: §1, §1, §2.1, §4.1, §4.3, Table 2, Table 5.
  • P. Yi, Z. Wang, K. Jiang, J. Jiang, T. Lu, X. Tian, and J. Ma (2021) Omniscient video super-resolution. In IEEE International Conference on Computer Vision, pp. 4429–4438. Cited by: Table 5.
  • P. Yi, Z. Wang, K. Jiang, J. Jiang, and J. Ma (2019) Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In IEEE International Conference on Computer Vision, pp. 3106–3115. Cited by: §4.3, Table 5.
  • W. Yun, M. Qi, C. Wang, H. Fu, and H. Ma (2022) Coarse-to-fine video denoising with dual-stage spatial-channel transformer. arXiv preprint arXiv:2205.00214. Cited by: §2.2.
  • S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M. Yang, and L. Shao (2021) Multi-stage progressive image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 14821–14831. Cited by: Table 8.
  • H. Zhang, Y. Dai, H. Li, and P. Koniusz (2019) Deep stacked hierarchical multi-patch network for image deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5978–5986. Cited by: Table 8.
  • K. Zhang, Y. Li, J. Liang, J. Cao, Y. Zhang, H. Tang, R. Timofte, and L. Van Gool (2022) Practical blind denoising via swin-conv-unet and data synthesis. arXiv preprint arXiv:2203.13278. Cited by: §2.1.
  • K. Zhang, J. Liang, L. Van Gool, and R. Timofte (2021) Designing a practical degradation model for deep blind image super-resolution. In IEEE Conference on International Conference on Computer Vision, Cited by: §2.1.
  • K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. Cited by: §2.1.
  • K. Zhang, W. Zuo, and L. Zhang (2018a) Learning a single convolutional super-resolution network for multiple degradations. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3262–3271. Cited by: §2.1.
  • Y. Zhang, Y. Zhang, Y. Wu, Y. Tao, K. Bian, P. Zhou, L. Song, and H. Tuo (2020) Improving quality of experience by adaptive video streaming with super-resolution. In IEEE Conference on Computer Communications, pp. 1957–1966. Cited by: §1.
  • Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018b) Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, pp. 286–301. Cited by: §2.1.
  • Z. Zhong, Y. Gao, Y. Zheng, and B. Zheng (2020) Efficient spatio-temporal recurrent neural network for video deblurring. In European Conference on Computer Vision, pp. 191–207. Cited by: §1, §2.1, Table 8.
  • S. Zhou, J. Zhang, J. Pan, H. Xie, W. Zuo, and J. Ren (2019) Spatio-temporal filter adaptive network for video deblurring. In IEEE International Conference on Computer Vision, pp. 2482–2491. Cited by: §1, §2.1, Table 8.
  • X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316. Cited by: §1, §2.1, §2.2, §3.3.
  • X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. Cited by: §2.2.