Video restoration, which reconstructs high-quality (HQ) frames from multiple low-quality (LQ) frames, has attracted much attention in recent years. Compared with single image restoration, the key challenge of video restoration lies in how to make full use of neighboring highly-related but misaligned supporting frames for the reconstruction of the reference frame.
Existing video restoration methods can be mainly divided into two categories: sliding window-based methods [4, 24, 88, 81, 37, 71, 110, 26, 34] and recurrent methods [23, 66, 18, 22, 25, 27, 7, 9, 45, 59, 109, 70]. As shown in Fig. 1(a), sliding window-based methods generally take multiple frames as input to generate a single HQ frame and process long video sequences in a sliding window fashion. Each input frame is processed multiple times during inference, leading to inefficient feature utilization and increased computation cost.
Some other methods are based on a recurrent architecture. As shown in Fig. 1(b), recurrent models mainly use previously reconstructed HQ frames for subsequent frame reconstruction. Due to their recurrent nature, they have three disadvantages. First, recurrent methods are limited in parallelization for efficient distributed training and inference. Second, although information is accumulated frame by frame, recurrent models are not good at long-range temporal dependency modelling: one frame may strongly affect the next adjacent frame, but its influence is quickly lost after a few time steps [19, 83]. Third, they suffer from significant performance drops on few-frame videos.
In this paper, we propose a Video Restoration Transformer (VRT) that allows for parallel computation and long-range dependency modelling in video restoration. Based on a multi-scale framework, VRT divides the video sequence into non-overlapping clips and shifts it alternately to enable inter-clip interactions. Specifically, each scale of VRT has several temporal mutual self attention (TMSA) modules followed by a parallel warping module. In TMSA, mutual attention is focused on mutual alignment between neighboring two-frame clips, while self attention is used for feature extraction. At the end of each scale, we further use parallel warping to fuse neighboring frame information into the current frame. After multi-scale feature extraction, alignment and fusion, the HQ frames are individually reconstructed from their corresponding frame features.
Compared with existing video restoration frameworks, VRT has several benefits. First, as shown in Fig. 1(c), VRT is trained and tested on long video sequences in parallel. In contrast, both sliding window-based and recurrent methods are often tested frame by frame. Second, VRT can model long-range temporal dependencies, utilizing information from multiple neighboring frames during the reconstruction of each frame. By contrast, sliding window-based methods cannot easily be scaled up to long-sequence modelling, while recurrent methods may forget distant information after several time steps. Third, VRT proposes to use mutual attention for joint feature alignment and fusion. It adaptively utilizes features from supporting frames and fuses them into the reference frame, which can be regarded as implicit motion estimation and feature warping.
Our contributions can be summarized as follows:
We propose a new framework named Video Restoration Transformer (VRT) that is characterized by parallel computation and long-range dependency modelling. It jointly extracts, aligns, and fuses frame features at multiple scales.
We propose mutual attention for mutual alignment between frames. It is a generalized “soft” version of image warping after implicit motion estimation.
VRT achieves state-of-the-art performance on video restoration, including video super-resolution, deblurring and denoising. It outperforms state-of-the-art methods by up to 2.16dB on benchmark datasets.
2 Related Work
2.1 Video Restoration
Similar to image restoration [13, 104, 32, 16, 49, 98, 108, 105, 106, 107, 86, 56, 20, 17, 43, 91, 42, 55, 41, 85, 87, 102, 35, 101, 76, 40], learning-based methods, especially CNN-based methods, have become the primary workhorse for video restoration [48, 88, 110, 97, 93, 92, 62, 84, 63, 96, 7, 54, 82, 33, 95].
From the perspective of architecture design, existing methods can be roughly divided into two categories: sliding window-based and recurrent methods. Sliding window-based methods often take a short sequence of frames as input and merely predict the center frame [4, 24, 88, 79, 81, 37, 71, 110, 26, 80, 68, 34]. Although some works predict multiple frames, they still focus on the reconstruction of the center frame during training and testing. The recurrent framework is another popular choice [23, 66, 18, 22, 25, 27, 7, 9, 45, 59, 109, 70]. Huang et al. propose a bidirectional recurrent convolutional neural network for SR. Sajjadi et al. warp the previous frame prediction onto the current frame and feed it to a restoration network along with the current input frame. This idea is used by Chan et al. for a bidirectional recurrent network, and further extended as grid propagation.
Temporal alignment and fusion.
Since supporting frames are often highly related but misaligned, temporal alignment plays a critical role in video restoration [44, 94, 81, 88, 8, 7, 9]. Early methods [44, 29, 4, 47, 77] use traditional flow estimation methods to estimate optical flow and then warp the supporting frames towards the reference frame. To compensate for occlusion and large motion, Xue et al. utilize task-oriented flow by fine-tuning the pre-trained optical flow estimation model SpyNet on different video restoration tasks. Jo et al. use dynamic upsampling filters for implicit motion compensation. Kim et al. propose a spatio-temporal transformer network for multi-frame optical flow estimation and warping. Tian et al. propose TDAN, which utilizes deformable convolution for feature alignment. Based on TDAN, Wang et al. extend it to multi-scale alignment, while Chan et al. incorporate optical flow as guidance for offset learning.
The attention mechanism has been exploited in video restoration in combination with CNNs [47, 88, 73, 5]. Liu et al. learn different weights for different temporal branches. Wang et al. learn pixel-level attention maps for spatial and temporal feature fusion. To better incorporate temporal information, Isobe et al. divide frames into several groups and design a temporal group attention module. Suin et al. propose a reinforcement learning-based framework with factorized spatio-temporal attention. Cao et al. propose to use self attention among local patches within a video.
2.2 Vision Transformer
Recently, Transformer-based models [83, 67, 38, 90] have achieved promising performance in various vision tasks, such as image recognition [14, 6, 39, 90, 52, 21, 75, 52, 51, 50] and image restoration [11, 89, 40]. Some methods have tried to use Transformers for video modelling by extending the attention mechanism to the temporal dimension [3, 2, 60, 53, 38]. However, most of them are designed for visual recognition, which is fundamentally different from restoration tasks: they focus more on feature fusion than on alignment. Cao et al. propose a CNN-Transformer hybrid network for video super-resolution (SR) based on spatial-temporal convolutional self attention. However, it does not make full use of local information within each patch and suffers from border artifacts during testing.
3 Video Restoration Transformer
3.1 Overall Framework
Let $I^{LQ} \in \mathbb{R}^{T\times H\times W\times C_{in}}$ be a sequence of low-quality (LQ) input frames and $I^{HQ} \in \mathbb{R}^{T\times sH\times sW\times C_{out}}$ be a sequence of high-quality (HQ) target frames, where $T$, $H$, $W$, $C_{in}$ and $C_{out}$ are the frame number, height, width, input channel number and output channel number, respectively. $s$ is the upscaling factor, which is larger than 1 (e.g., for video SR) or equal to 1 (e.g., for video deblurring). The proposed Video Restoration Transformer (VRT) aims to restore $T$ HQ frames from $T$ LQ frames in parallel for various video restoration tasks, including video SR, deblurring, denoising, etc. As illustrated in Fig. 2, VRT can be divided into two parts: feature extraction and reconstruction.
At the beginning, we extract shallow features $X^{SF}$ by a single spatial 2D convolution from the LQ sequence $I^{LQ}$. After that, based on $X^{SF}$, we propose a multi-scale network that aligns frames at different image resolutions. More specifically, when the total scale number is $S$, we downsample the feature $S-1$ times by squeezing each $2\times 2$ spatial neighborhood into the channel dimension and reducing the channel number back to the original number via a linear layer. Then, we gradually upsample the feature by unsqueezing the channels back to the spatial dimensions. In this way, we can extract features and deal with object or camera motions at different scales by two kinds of modules: temporal mutual self attention (TMSA, see Sec. 3.2) and parallel warping (see Sec. 3.3). Skip connections are added between features of the same scale. Finally, after multi-scale feature extraction, alignment and fusion, we add several TMSA modules for further feature refinement and obtain the deep feature $X^{DF}$.
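The squeeze-to-channel downsampling and its inverse can be sketched as below. This is a minimal NumPy sketch under our own naming; the linear layer that reduces the squeezed channel number back to the original is omitted.

```python
import numpy as np

def downsample_squeeze(x):
    """Squeeze each 2x2 spatial neighborhood into the channel dimension.

    x: (T, H, W, C) feature -> (T, H//2, W//2, 4*C). In the full model, a
    linear layer would then reduce 4*C back to C (omitted here).
    """
    t, h, w, c = x.shape
    x = x.reshape(t, h // 2, 2, w // 2, 2, c)
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(t, h // 2, w // 2, 4 * c)

def upsample_unsqueeze(x):
    """Inverse operation: unfold channels back onto a 2x larger spatial grid."""
    t, h, w, c = x.shape
    x = x.reshape(t, h, w, 2, 2, c // 4)
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(t, 2 * h, 2 * w, c // 4)
```

The two functions are exact inverses, so no information is lost across scales before the channel-reducing linear layer.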
After feature extraction, we reconstruct the HQ frames from the addition of the shallow feature $X^{SF}$ and the deep feature $X^{DF}$. Different frames are reconstructed independently based on their corresponding features. Besides, to ease the burden of feature learning, we employ global residual learning and only predict the residual between the bilinearly upsampled LQ sequence and the ground-truth HQ sequence. In practice, different reconstruction modules are used for different restoration tasks. For video SR, we use the sub-pixel convolution layer to upsample the feature by a scale factor of $s$. For video deblurring, a single convolution layer is enough for reconstruction. Apart from this, the architecture designs are kept the same for all tasks.
For fair comparison with existing methods, we use the commonly used Charbonnier loss between the reconstructed HQ sequence $\hat{I}^{HQ}$ and the ground-truth HQ sequence $I^{HQ}$ as
$$\mathcal{L} = \sqrt{\|\hat{I}^{HQ} - I^{HQ}\|^2 + \epsilon^2}, \qquad (1)$$
where $\epsilon$ is a constant that is empirically set as $10^{-3}$.
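As a concrete illustration, the Charbonnier loss takes only a few lines (NumPy sketch; the function name is ours):

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth, robust approximation of the L1 loss.

    Near zero it behaves like a scaled L2 loss; for large residuals it
    approaches |pred - target|, making it less sensitive to outliers.
    """
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))
```

Note that the loss is bounded below by `eps` rather than 0, which keeps its gradient well-defined everywhere.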
3.2 Temporal Mutual Self Attention
Given the reference frame feature $X^R \in \mathbb{R}^{N\times C}$ and the supporting frame feature $X^S \in \mathbb{R}^{N\times C}$, where $N$ is the number of feature elements and $C$ is the channel number, we compute the query $Q^R$, key $K^S$ and value $V^S$ from $X^R$ and $X^S$ by linear projections as
$$Q^R = X^R P^Q, \quad K^S = X^S P^K, \quad V^S = X^S P^V, \qquad (2)$$
where $P^Q, P^K, P^V \in \mathbb{R}^{C\times D}$ are projection matrices and $D$ is the channel number of the projected features. Then, we use $Q^R$ to query $K^S$ in order to generate the attention map $A = \mathrm{SoftMax}(Q^R (K^S)^T / \sqrt{D})$, which is then used for a weighted sum of $V^S$. This is formulated as
$$\mathrm{MA}(Q^R, K^S, V^S) = \mathrm{SoftMax}(Q^R (K^S)^T / \sqrt{D})\, V^S, \qquad (3)$$
where SoftMax denotes the row-wise softmax operation.
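A minimal single-head sketch of this mutual attention in NumPy (the projection matrices are passed in explicitly; the function names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_attention(x_r, x_s, p_q, p_k, p_v):
    """Mutual attention: queries come from the reference frame feature x_r,
    keys and values from the supporting frame feature x_s (both (N, C)).

    Returns the (N, D) aligned features and the (N, N) attention map A,
    whose row i weights the supporting-frame elements for reference element i.
    """
    q = x_r @ p_q                                  # queries: reference frame
    k = x_s @ p_k                                  # keys: supporting frame
    v = x_s @ p_v                                  # values: supporting frame
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # row-wise softmax
    return a @ v, a
```

Self attention is recovered by passing the same feature for `x_r` and `x_s`.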
Since $Q^R$ and $K^S$ come from $X^R$ and $X^S$, respectively, the attention map $A$ reflects the correlation between elements in the reference image and the supporting image. For clarity, we rewrite Eq. (3) for the $i$-th element of the reference image as
$$Y^R_{i,:} = \sum_{j=1}^{N} A_{i,j} V^S_{j,:}, \qquad (4)$$
where $Y^R_{i,:}$ refers to the new feature of the $i$-th element in the reference frame. As shown in Fig. 2(a), when the $j$-th supporting-frame element (e.g., the yellow square) is the most similar one to the $i$-th reference-frame element (e.g., the orange square), and all other supporting-frame elements are very dissimilar to it, we have
$$A_{i,j} \approx 1 \quad \text{and} \quad A_{i,k} \approx 0 \;\; \text{for all } k \neq j. \qquad (5)$$
In this extreme case, by combining Eqs. (4) and (5), we have $Y^R_{i,:} = V^S_{j,:}$, which moves the $j$-th element in the supporting frame to the position of the $i$-th element in the reference frame (see the dashed red line in Fig. 2(a)). This equals image warping given an optical flow vector. When Eq. (5) does not hold exactly, Eq. (4) can be regarded as a “soft” version of image warping. In practice, the reference frame and the supporting frame can be exchanged, allowing mutual alignment between the two frames. Besides, similar to multi-head self attention, we can also perform the attention $h$ times in parallel and concatenate the results as multi-head mutual attention (MMA).
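The warping interpretation can be checked directly: when a row of the attention map is one-hot, the weighted sum simply moves one supporting-frame element to the corresponding reference position, i.e. hard warping.

```python
import numpy as np

# A one-hot attention map acts as a permutation of the supporting frame:
# row i selects exactly one supporting element j, i.e. hard image warping.
a = np.array([[0.0, 1.0, 0.0],   # reference element 0 <- supporting element 1
              [0.0, 0.0, 1.0],   # reference element 1 <- supporting element 2
              [1.0, 0.0, 0.0]])  # reference element 2 <- supporting element 0
v = np.array([[10.0], [20.0], [30.0]])  # supporting-frame values
warped = a @ v                           # -> [[20.], [30.], [10.]]
```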
Notably, mutual attention has several benefits over the combination of explicit motion estimation and image warping. First, mutual attention can adaptively preserve more information from the supporting frame than image warping, which only focuses on the target pixel. It also avoids black-hole artifacts when there is no matched position. Second, mutual attention does not have the inductive bias of locality, which is inherent to most CNN-based motion estimation methods [15, 64, 61, 74] and may lead to a performance drop when two neighboring objects move in different directions. Third, mutual attention amounts to conducting motion estimation and warping jointly on image features. In contrast, optical flows are often estimated on the input RGB images and then used for warping on features [7, 9]. Besides, flow estimation on RGB images is often not robust to lighting variation, occlusion and blur.
Temporal mutual self attention (TMSA).
Mutual attention is proposed for joint feature alignment between two frames. To extract and preserve features from the current frame, we use mutual attention together with self attention. Let $X \in \mathbb{R}^{2N\times C}$ represent the features of two frames, which can be split into $X_1$ and $X_2$. We use multi-head mutual attention (MMA) on $X_1$ and $X_2$ twice: warping $X_1$ towards $X_2$ and warping $X_2$ towards $X_1$. The warped features are combined and then concatenated with the result of multi-head self attention (MSA), followed by a multi-layer perceptron (MLP) for dimension reduction. After that, another MLP is added for further feature transformation. Two LayerNorm (LN) layers and two residual connections are also used, as shown in the green box of Fig. 2. The whole process is formulated as
$$\begin{aligned}
X_1, X_2 &= \mathrm{Split}_0(\mathrm{LN}(X)),\\
Y_1, Y_2 &= \mathrm{MMA}(X_1, X_2),\; \mathrm{MMA}(X_2, X_1),\\
X &= \mathrm{MLP}\big(\mathrm{Concat}_2(\mathrm{Concat}_0(Y_1, Y_2), \mathrm{MSA}(\mathrm{LN}(X)))\big) + X,\\
X &= \mathrm{MLP}(\mathrm{LN}(X)) + X, \qquad (6)
\end{aligned}$$
where the subscripts of Split and Concat refer to the specified dimensions. However, due to the design of mutual attention, Eq. (6) can only deal with two frames at a time.
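Under the structure described above, one simplified TMSA step on a 2-frame clip might be sketched as follows. This is a NumPy sketch under our own naming: single-head attention, weights as plain matrices, and the LayerNorm layers and second MLP omitted for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attend(x_q, x_kv, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (mutual when x_q != x_kv)."""
    q, k, v = x_q @ w_q, x_kv @ w_k, x_kv @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def tmsa_two_frames(x, w_mma, w_msa, w_mlp):
    """Simplified TMSA on a 2-frame clip x of shape (2N, C).

    Mutual attention warps each frame towards the other; self attention runs
    on the whole clip; the concatenated result is reduced back to C channels
    by an MLP (here a single matrix) and added residually.
    """
    n = x.shape[0] // 2
    x1, x2 = x[:n], x[n:]                                 # Split
    y = np.concatenate([attend(x1, x2, *w_mma),           # warp x2 -> x1
                        attend(x2, x1, *w_mma)], axis=0)  # warp x1 -> x2
    y = np.concatenate([y, attend(x, x, *w_msa)], axis=-1)  # Concat with MSA
    return x + y @ w_mlp                                  # residual connection
```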
A naive way to extend Eq. (6) to $T$ frames is to deal with frame-to-frame pairs exhaustively, resulting in a computational complexity of $\mathcal{O}(T^2)$. Inspired by the shifted window mechanism [52, 53], we propose temporal mutual self attention (TMSA) to remedy the problem. TMSA first partitions the video sequence into non-overlapping 2-frame clips and then applies Eq. (6) to them in parallel. Next, as shown in Fig. 2(b), it shifts the sequence temporally by 1 frame for every other layer to enable cross-clip connections, reducing the computational complexity to $\mathcal{O}(T)$. The temporal receptive field is enlarged when multiple TMSA modules are stacked together: it grows linearly with the number of stacked layers, so each frame can gradually utilize information from increasingly distant frames.
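The alternating clip partition can be illustrated with plain frame indices. This is a simplified sketch under our own naming: the actual model shifts feature tensors, and border handling may differ; here leftover border frames simply form 1-frame clips.

```python
def partition_clips(frames, layer_index):
    """Partition a frame sequence into non-overlapping 2-frame clips,
    shifting by 1 frame on every other layer so that clips of consecutive
    layers straddle each other and information propagates across clips.
    """
    shift = layer_index % 2                    # alternate: 0, 1, 0, 1, ...
    clips = [frames[:1]] if shift else []      # leftover border frame
    for i in range(shift, len(frames), 2):
        clips.append(frames[i:i + 2])
    return clips
```

Since every layer still attends within constant-size clips, the total cost over all clips is linear in the sequence length.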
Video restoration tasks often need to process high-resolution frames. Since the complexity of attention is quadratic in the number of elements within the attention window, global attention over the full image is often impractical. Therefore, following [52, 40], we partition each frame spatially into non-overlapping $M\times M$ local windows, resulting in $\frac{HW}{M^2}$ windows. The shifted window mechanism (with a shift of $\lfloor M/2 \rfloor$ pixels) is also used spatially to enable cross-window connections. Besides, although stacking multiple TMSA modules allows for long-distance temporal modelling, distant frames are not directly connected. As we will show in the ablation study, using only a small temporal window size cannot fully exploit the potential of the model. Therefore, we use a larger temporal window size for the last quarter of TMSA modules to enable direct interactions between distant frames.
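Spatial window partitioning and the half-window shift can be sketched as below (NumPy; function names are ours, and the cyclic shift here omits the attention masking used in practice for wrapped-around pixels):

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) frame into non-overlapping m x m windows.

    Returns (H*W / m^2, m*m, C). Attention cost is then quadratic in m*m
    per window instead of quadratic in H*W over the whole frame.
    """
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, m * m, c)

def shift_frame(x, m):
    """Cyclically shift by m//2 pixels so the next layer's windows straddle
    the previous layer's window borders (shifted window mechanism)."""
    return np.roll(x, shift=(-(m // 2), -(m // 2)), axis=(0, 1))
```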
3.3 Parallel Warping
Due to spatial window partitioning, the mutual attention mechanism may not be able to deal with large motions well. Hence, as shown in the orange box of Fig. 2, we use feature warping at the end of each network stage to handle large motions. For any frame feature $X_t$, we calculate the optical flows of its neighbouring frame features $X_{t-1}$ and $X_{t+1}$, and warp them towards the frame as $\hat{X}_{t-1}$ and $\hat{X}_{t+1}$ (i.e., backward and forward warping). Then, we concatenate them with the original feature $X_t$ and use an MLP for feature fusion and dimension reduction. Specifically, we predict the residual flow with a flow estimation model and use deformable convolution for deformable alignment. More details are provided in the supplementary material.
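As an illustration of the warping step only (not the paper's actual implementation, which uses bilinear sampling plus deformable convolution), a nearest-neighbor backward warp with an integer-valued flow looks like this:

```python
import numpy as np

def backward_warp(feat, flow):
    """Warp a neighboring frame feature towards the current frame.

    feat: (H, W, C) supporting-frame feature.
    flow: (H, W, 2) integer displacements (dy, dx); output pixel (y, x)
          samples feat[y + dy, x + dx], clamped at the image border.
    """
    h, w, _ = feat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(ys + flow[..., 0].astype(int), 0, h - 1)
    sx = np.clip(xs + flow[..., 1].astype(int), 0, w - 1)
    return feat[sy, sx]
```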
4 Experiments
4.1 Experimental Setup
For video SR, we use 4 scales in VRT. At each scale, we stack 8 TMSA modules, the last two of which use a temporal window size of 8. The spatial window size, head number, and channel number are set to $8\times 8$, 6 and 120, respectively. After 7 multi-scale feature extraction stages, we add 24 TMSA modules (with self attention only) for further feature extraction before reconstruction. More details are provided in the supplementary material.
[Table 1: video SR comparison under BI and BD degradations, with different numbers of training frames on REDS/Vimeo-90K; table contents omitted.]
4.2 Video SR
As shown in Table 1, we compare VRT with state-of-the-art image and video SR methods. VRT achieves the best performance for both bicubic (BI) and blur-downsampling (BD) degradations. Specifically, when trained on the REDS dataset with short sequences, VRT outperforms VSRT by up to 0.57dB in PSNR. Compared with another representative sliding window-based model, EDVR, VRT achieves an improvement of 0.50–1.57dB on different datasets, showing its good ability to fuse information from multiple frames. Note that VRT outputs all frames simultaneously rather than predicting them frame by frame as EDVR does. On the Vimeo-90K dataset, VRT surpasses BasicVSR++ by up to 0.38dB, although BasicVSR++ and other recurrent models may mirror the 7-frame video for training and testing. When VRT is trained on longer sequences, it shows good potential in temporal modelling and further increases the PSNR by 0.52dB. As indicated in prior work, recurrent models often suffer from significant performance drops on short sequences. In contrast, VRT performs well on both short and long sequences. We note that VRT is slightly worse than the 32-frame model BasicVSR++ in this setting. This is expected, since VRT is only trained on 16 frames.
We also provide comparison on parameter number and runtime in Table 1. As a parallel model, VRT needs to restore all frames at the same time, which leads to relatively larger model size and longer runtime per frame compared with recurrent models. However, VRT has the potential for distributed deployment, which is hard for recurrent models that restore a video clip recursively by design.
Visual results of different methods are shown in Fig. 4. As one can see, in accordance with its significant quantitative improvements, VRT can generate visually pleasing images with sharp edges and fine details, such as horizontal strip patterns of buildings. By contrast, its competitors suffer from either distorted textures or lost details.
[Tables 2–4: video deblurring comparisons on DVD, GoPro and REDS against DeepDeblur, SRN, DBLRNet, STFAN, STTN, SFE, DMPHN, SAPHN, MPRNet, IFI-RNN, ESTRNN, DBN, EDVR, TSP, PVDNet, GSTA and ARVo; table contents omitted.]
4.3 Video Deblurring
We conduct experiments on three different datasets for fair comparison with existing methods. Table 2 shows the results on the DVD dataset. It is clear that VRT achieves the best performance, outperforming the second-best method ARVo by remarkable margins of 1.47dB in PSNR and 0.0299 in SSIM. Related to the attention mechanism, GSTA designs a gated spatio-temporal attention mechanism, while ARVo calculates the correlation between pixel pairs for correspondence learning. However, both of them are CNN-based and achieve significantly worse performance than the Transformer-based VRT. We also compare VRT on the GoPro and REDS datasets, where it shows its superiority over other methods with significant PSNR gains of 2.15dB and 1.99dB, respectively. The total number of parameters of VRT is 18.3M, which is slightly smaller than EDVR (23.6M) and PVDNet (23.5M). The runtime is 2.2s per frame on blurred videos. Notably, during evaluation, we do not use any pre-processing techniques such as sequence truncation and image alignment [62, 70].
Fig. 5 shows the visual comparison of different methods. VRT is effective in removing motion blurs and restoring faithful details, such as the pole in the first example and characters in the second one. In comparison, other approaches fail to remove blurs completely and do not produce sharp edges.
[Table 5: video denoising comparison against VLNB, DVDnet, FastDVDnet and PaCNet; table contents omitted.]
4.4 Video Denoising
We also conduct experiments on video denoising to show the effectiveness of VRT. Following [79, 80], we train a single non-blind model on the DAVIS dataset and test it on different noise levels. Table 5 shows the superiority of VRT over existing methods on two benchmark datasets. Even though PaCNet trains different models separately for different noise levels, VRT still improves the PSNR by 0.82–2.16dB.
4.5 Ablation Study
For the ablation study, we set up a small version of VRT as the baseline model by halving the layer and channel numbers. All models are trained on Vimeo-90K for bicubic video SR and tested on Vid4.
Impact of multi-scale architecture & parallel warping.
Table 6 shows the ablation study on the multi-scale architecture and parallel warping. When the number of model scales is reduced, the performance drops gradually, even though the computation burden becomes heavier. This is expected because multi-scale processing can help the model utilize information from a larger area and deal with large motions between frames. Besides, parallel warping also helps, bringing an improvement of 0.17dB.
Impact of temporal mutual self attention.
To test the effectiveness of mutual and self attention in TMSA, we conduct an ablation study in Table 7. When we replace mutual attention with self attention (i.e., two self attentions) or only use one self attention, the performance drops by 0.11–0.17dB. One possible reason is that the model may focus more on the reference frame than on the supporting frame during the computation of attention maps. In contrast, using mutual attention helps the model explicitly attend to the supporting frame and benefit from feature fusion. In addition, we find that using mutual attention alone is not enough, because mutual attention cannot preserve information from the reference frame.
Impact of attention window size.
We conduct an ablation study in Table 8 to investigate the impact of attention window size in the last few TMSA modules of each scale. When the temporal window size increases from 1 to 2, the performance only improves slightly, possibly because the preceding TMSA layers can already make good use of neighboring two-frame information. When the size is increased to 8, we see an obvious improvement of 0.18dB. As a result, we use a temporal window size of 8 for those layers.
[Tables 6–7: ablations on the number of scales with and without parallel warping (Table 6), and on attention configurations comparing self attention only, mutual attention only, and combined mutual + self attention (Table 7); table contents omitted.]
5 Conclusion
In this paper, we proposed the Video Restoration Transformer (VRT) for video restoration. Based on a multi-scale framework, it jointly extracts, aligns, and fuses information from different frames at multiple resolutions by two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. More specifically, TMSA is composed of mutual and self attention. Mutual attention allows joint implicit flow estimation and feature warping, while self attention is responsible for feature extraction. Parallel warping is also used to further enhance feature alignment and fusion. Extensive experiments on various benchmark datasets show that VRT brings significant performance gains (up to 2.16dB) for video restoration, including video super-resolution, video deblurring and video denoising.
Acknowledgements This work was partially supported by the ETH Zurich Fund (OK), a Huawei Technologies Oy (Finland) project, the China Scholarship Council and an Amazon AWS grant. Thanks Dr. Gurkirt Singh for insightful discussions. Special thanks goes to Yijue Chen.
-  Pablo Arias and Jean-Michel Morel. Video denoising via empirical bayesian estimation of space-time patches. Journal of Mathematical Imaging and Vision, 60(1):70–93, 2018.
-  Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691, 2021.
-  Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095, 2021.
-  Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, and Wenzhe Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4778–4787, 2017.
-  Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. Video super-resolution transformer. arXiv preprint arXiv:2106.06847, 2021.
-  Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
-  Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential components in video super-resolution and beyond. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4947–4956, 2021.
-  Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Understanding deformable alignment in video super-resolution. In AAAI Conference on Artificial Intelligence, pages 973–981, 2021.
-  Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. arXiv preprint arXiv:2104.13371, 2021.
-  Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In International Conference on Image Processing, volume 2, pages 168–172, 1994.
-  Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.
-  Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, pages 764–773, 2017.
-  Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199, 2014.
-  Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
-  Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
-  Yuchen Fan, Honghui Shi, Jiahui Yu, Ding Liu, Wei Han, Haichao Yu, Zhangyang Wang, Xinchao Wang, and Thomas S Huang. Balanced two-stage residual networks for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 161–168, 2017.
-  Yuchen Fan, Jiahui Yu, Ding Liu, and Thomas S Huang. Scale-wise convolution for image restoration. In AAAI Conference on Artificial Intelligence, pages 10770–10777, 2020.
-  Dario Fuoli, Shuhang Gu, and Radu Timofte. Efficient video super-resolution through recurrent latent space propagation. In IEEE International Conference on Computer Vision Workshop, pages 3476–3485, 2019.
-  Alexander Greaves-Tunnell and Zaid Harchaoui. A statistical investigation of long memory in language and music. In International Conference on Machine Learning, pages 2394–2403, 2019.
-  Yong Guo, Jian Chen, Jingdong Wang, Qi Chen, Jiezhang Cao, Zeshuai Deng, Yanwu Xu, and Mingkui Tan. Closed-loop matters: Dual regression networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5407–5416, 2020.
-  Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12):4338–4364, 2021.
-  Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Recurrent back-projection network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3897–3906, 2019.
-  Yan Huang, Wei Wang, and Liang Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. Advances in Neural Information Processing Systems, 28:235–243, 2015.
-  Yan Huang, Wei Wang, and Liang Wang. Video super-resolution via bidirectional recurrent convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):1015–1028, 2017.
-  Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. Video super-resolution with recurrent structure-detail network. In European Conference on Computer Vision, pages 645–660, 2020.
-  Takashi Isobe, Songjiang Li, Xu Jia, Shanxin Yuan, Gregory Slabaugh, Chunjing Xu, Ya-Li Li, Shengjin Wang, and Qi Tian. Video super-resolution with temporal group attention. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8008–8017, 2020.
-  Takashi Isobe, Fang Zhu, Xu Jia, and Shengjin Wang. Revisiting temporal modeling for video super-resolution. arXiv preprint arXiv:2008.05765, 2020.
-  Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3224–3232, 2018.
-  Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
-  Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video object segmentation with language referring expressions. In Asian Conference on Computer Vision, pages 123–141, 2018.
-  Tae Hyun Kim, Mehdi SM Sajjadi, Michael Hirsch, and Bernhard Scholkopf. Spatio-temporal transformer network for video restoration. In European Conference on Computer Vision, pages 106–122, 2018.
-  Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
-  Seunghwan Lee, Donghyeon Cho, Jiwon Kim, and Tae Hyun Kim. Restore from restored: Video restoration with pseudo clean video. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3537–3546, 2021.
-  Dongxu Li, Chenchen Xu, Kaihao Zhang, Xin Yu, Yiran Zhong, Wenqi Ren, Hanna Suominen, and Hongdong Li. Arvo: Learning all-range volumetric correspondence for video deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7721–7731, 2021.
-  Juncheng Li, Zehua Pei, and Tieyong Zeng. From beginner to master: A survey for deep learning-based single-image super-resolution. arXiv preprint arXiv:2109.14335, 2021.
-  Sheng Li, Fengxiang He, Bo Du, Lefei Zhang, Yonghao Xu, and Dacheng Tao. Fast spatio-temporal residual network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10522–10531, 2019.
-  Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, and Jiaya Jia. Mucan: Multi-correspondence aggregation network for video super-resolution. In European Conference on Computer Vision, pages 335–351, 2020.
-  Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, and Wanqing Li. Trear: Transformer-based rgb-d egocentric action recognition. IEEE Transactions on Cognitive and Developmental Systems, 2021.
-  Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
-  Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. SwinIR: Image restoration using swin transformer. In IEEE International Conference on Computer Vision Workshops, 2021.
-  Jingyun Liang, Andreas Lugmayr, Kai Zhang, Martin Danelljan, Luc Van Gool, and Radu Timofte. Hierarchical conditional flow: A unified framework for image super-resolution and image rescaling. In IEEE International Conference on Computer Vision, 2021.
-  Jingyun Liang, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Mutual affine network for spatially variant kernel estimation in blind image super-resolution. In IEEE International Conference on Computer Vision, 2021.
-  Jingyun Liang, Kai Zhang, Shuhang Gu, Luc Van Gool, and Radu Timofte. Flow-based kernel prior with application to blind super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10601–10610, 2021.
-  Renjie Liao, Xin Tao, Ruiyu Li, Ziyang Ma, and Jiaya Jia. Video super-resolution via deep draft-ensemble learning. In IEEE International Conference on Computer Vision, pages 531–539, 2015.
-  Jiayi Lin, Yan Huang, and Liang Wang. Fdan: Flow-guided deformable alignment network for video super-resolution. arXiv preprint arXiv:2105.05640, 2021.
-  Ce Liu and Deqing Sun. On bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):346–360, 2013.
-  Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, and Thomas Huang. Robust video super-resolution with learned temporal dynamics. In IEEE International Conference on Computer Vision, pages 2507–2515, 2017.
-  Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, Xinchao Wang, and Thomas S Huang. Learning temporal dynamics for video super-resolution: A deep learning approach. IEEE Transactions on Image Processing, 27(7):3432–3445, 2018.
-  Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. arXiv preprint arXiv:1806.02919, 2018.
-  Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. International Journal of Computer Vision, 128(2):261–318, 2020.
-  Yun Liu, Guolei Sun, Yu Qiu, Le Zhang, Ajad Chhatkuli, and Luc Van Gool. Transformer in convolutional neural networks. arXiv preprint arXiv:2106.03180, 2021.
-  Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
-  Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. arXiv preprint arXiv:2106.13230, 2021.
-  Matteo Maggioni, Yibin Huang, Cheng Li, Shuai Xiao, Zhongqian Fu, and Fenglong Song. Efficient multi-stage video denoising with recurrent spatio-temporal fusion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3466–3475, 2021.
-  Yiqun Mei, Yuchen Fan, and Yuqian Zhou. Image super-resolution with non-local sparse attention. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3517–3526, 2021.
-  Yiqun Mei, Yuchen Fan, Yuqian Zhou, Lichao Huang, Thomas S Huang, and Honghui Shi. Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5690–5699, 2020.
-  Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1996–2005, 2019.
-  Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3883–3891, 2017.
-  Seungjun Nah, Sanghyun Son, and Kyoung Mu Lee. Recurrent neural networks with intra-frame iterations for video deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8102–8111, 2019.
-  Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. arXiv preprint arXiv:2102.00719, 2021.
-  Simon Niklaus. A reimplementation of SPyNet using PyTorch. https://github.com/sniklaus/pytorch-spynet, 2018.
-  Jinshan Pan, Haoran Bai, and Jinhui Tang. Cascaded deep video deblurring using temporal sharpness prior. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3043–3051, 2020.
-  Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4161–4170, 2017.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
-  Mehdi SM Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6626–6634, 2018.
-  Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
-  Dev Yashpal Sheth, Sreyas Mohan, Joshua L Vincent, Ramon Manzorro, Peter A Crozier, Mitesh M Khapra, Eero P Simoncelli, and Carlos Fernandez-Granda. Unsupervised deep video denoising. In IEEE International Conference on Computer Vision, pages 1759–1768, 2021.
-  Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
-  Hyeongseok Son, Junyong Lee, Jonghyeop Lee, Sunghyun Cho, and Seungyong Lee. Recurrent video deblurring with blur-invariant motion estimation and pixel volumes. ACM Transactions on Graphics, 40(5):1–18, 2021.
-  Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1279–1288, 2017.
-  Maitreya Suin, Kuldeep Purohit, and AN Rajagopalan. Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3615, 2020.
-  Maitreya Suin and AN Rajagopalan. Gated spatio-temporal attention-guided video deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7802–7811, 2021.
-  Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
-  Guolei Sun, Yun Liu, Thomas Probst, Danda Pani Paudel, Nikola Popovic, and Luc Van Gool. Boosting crowd counting with transformers. arXiv preprint arXiv:2105.10926, 2021.
-  Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Mefnet: Multi-scale event fusion network for motion deblurring. arXiv preprint arXiv:2112.00167, 2021.
-  Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In IEEE International Conference on Computer Vision, pages 4472–4480, 2017.
-  Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
-  Matias Tassano, Julie Delon, and Thomas Veit. Dvdnet: A fast network for deep video denoising. In IEEE International Conference on Image Processing, pages 1805–1809, 2019.
-  Matias Tassano, Julie Delon, and Thomas Veit. Fastdvdnet: Towards real-time deep video denoising without flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1354–1363, 2020.
-  Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3360–3369, 2020.
-  Gregory Vaksman, Michael Elad, and Peyman Milanfar. Patch craft: Video denoising by deep modeling and patch matching. In IEEE International Conference on Computer Vision, pages 1759–1768, 2021.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
-  Longguang Wang, Yulan Guo, Li Liu, Zaiping Lin, Xinpu Deng, and Wei An. Deep video super-resolution using hr optical flow estimation. IEEE Transactions on Image Processing, 29:4323–4336, 2020.
-  Longguang Wang, Yingqian Wang, Xiaoyu Dong, Qingyu Xu, Jungang Yang, Wei An, and Yulan Guo. Unsupervised degradation representation learning for blind super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10581–10590, 2021.
-  Longguang Wang, Yingqian Wang, Zhengfa Liang, Zaiping Lin, Jungang Yang, Wei An, and Yulan Guo. Learning parallax attention for stereo image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12250–12259, 2019.
-  Longguang Wang, Yingqian Wang, Zaiping Lin, Jungang Yang, Wei An, and Yulan Guo. Learning a single network for scale-arbitrary super-resolution. In IEEE International Conference on Computer Vision, pages 10581–10590, 2021.
-  Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1954–1963, 2019.
-  Zhendong Wang, Xiaodong Cun, Jianmin Bao, and Jianzhuang Liu. Uformer: A general u-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106, 2021.
-  Christoph Wick, Jochen Zöllner, and Tobias Grüning. Transformer for handwritten text recognition using bidirectional post-decoding. In International Conference on Document Analysis and Recognition, pages 112–126, 2021.
-  Xiaoyu Xiang, Qian Lin, and Jan P Allebach. Boosting high-level vision with joint compression artifacts reduction and super-resolution. In International Conference on Pattern Recognition, pages 2390–2397, 2021.
-  Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3370–3379, 2020.
-  Xinguang Xiang, Hao Wei, and Jinshan Pan. Deep video deblurring using sharpness features from exemplars. IEEE Transactions on Image Processing, 29:8976–8987, 2020.
-  Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
-  Ren Yang and Radu Timofte. Ntire 2021 challenge on quality enhancement of compressed video: Methods and results. In IEEE Conference on Computer Vision and Pattern Recognition, pages 647–666, 2021.
-  Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, Tao Lu, Xin Tian, and Jiayi Ma. Omniscient video super-resolution. In IEEE International Conference on Computer Vision, pages 4429–4438, 2021.
-  Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In IEEE International Conference on Computer Vision, pages 3106–3115, 2019.
-  Jiahui Yu, Yuchen Fan, Jianchao Yang, Ning Xu, Zhaowen Wang, Xinchao Wang, and Thomas Huang. Wide activation for efficient and accurate image super-resolution. arXiv preprint arXiv:1808.08718, 2018.
-  Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 14821–14831, 2021.
-  Hongguang Zhang, Yuchao Dai, Hongdong Li, and Piotr Koniusz. Deep stacked hierarchical multi-patch network for image deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5978–5986, 2019.
-  Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
-  Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE International Conference on Computer Vision, 2021.
-  Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Wei Liu, and Hongdong Li. Adversarial spatio-temporal learning for video deblurring. IEEE Transactions on Image Processing, 28(1):291–301, 2018.
-  Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
-  Kai Zhang, Wangmeng Zuo, and Lei Zhang. Learning a single convolutional super-resolution network for multiple degradations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3262–3271, 2018.
-  Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, pages 286–301, 2018.
-  Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. arXiv preprint arXiv:1903.10082, 2019.
-  Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
-  Zhihang Zhong, Ye Gao, Yinqiang Zheng, and Bo Zheng. Efficient spatio-temporal recurrent neural network for video deblurring. In European Conference on Computer Vision, pages 191–207, 2020.
-  Shangchen Zhou, Jiawei Zhang, Jinshan Pan, Haozhe Xie, Wangmeng Zuo, and Jimmy Ren. Spatio-temporal filter adaptive network for video deblurring. In IEEE International Conference on Computer Vision, pages 2482–2491, 2019.