Deformable 3D Convolution for Video Super-Resolution

04/06/2020
by Xinyi Ying, et al.

The spatio-temporal information among video sequences is significant for video super-resolution (SR). However, the spatio-temporal information cannot be fully used by existing video SR methods since spatial feature extraction and temporal motion compensation are usually performed sequentially. In this paper, we propose a deformable 3D convolution network (D3Dnet) to incorporate spatio-temporal information from both spatial and temporal dimensions for video SR. Specifically, we introduce deformable 3D convolutions (D3D) to integrate 2D spatial deformable convolutions with 3D convolutions (C3D), obtaining both superior spatio-temporal modeling capability and motion-aware modeling flexibility. Extensive experiments have demonstrated the effectiveness of our proposed D3D in exploiting spatio-temporal information. Comparative results show that our network outperforms the state-of-the-art methods. Code is available at: https://github.com/XinyiYing/D3Dnet.


I Introduction

Video super-resolution (SR) aims at recovering high-resolution (HR) images from low-resolution (LR) video sequences. This technique has been widely employed in many applications such as video surveillance [6] and high-definition devices [2, 4]. Since multiple images provide additional information in the temporal dimension, it is important to fully exploit the spatio-temporal dependency to enhance video SR performance.

Current video SR methods commonly follow a three-step pipeline consisting of feature extraction, motion compensation and reconstruction. Existing video SR methods generally focus on the motion compensation step and propose different approaches to handle this problem. Specifically, Liao et al. [10] first achieved motion compensation using several optical flow algorithms to generate SR drafts and then ensembled these drafts with a CNN. Liu et al. [11, 12] first performed rectified optical flow alignment and then fed the aligned LR frames to a temporal adaptive neural network to reconstruct an SR frame at an optimal temporal scale. Wang et al. [19, 20] proposed the SOF-VSR network to obtain temporally consistent details through HR optical flow estimation. Caballero et al. [1] proposed a spatial transformer network that employs the spatio-temporal ESPCN [15] to recover an HR frame from a compensated consecutive sequence in an end-to-end manner. Tao et al. [16] integrated a sub-pixel motion compensation (SPMC) layer into CNNs to achieve improved performance. All these methods perform motion compensation in two separate stages: motion estimation by optical flow approaches and frame alignment by warping, which results in ambiguous and duplicated results [13].
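To make the two separate stages concrete, the sketch below shows the second stage (frame alignment by warping), assuming the first stage has already produced a dense optical flow field. The function name and tensor conventions are illustrative only and are not the code of any of the methods cited above.

```python
import torch
import torch.nn.functional as F

def warp_to_reference(neighbor, flow):
    """Warp a neighboring frame towards the reference frame using a pre-computed
    optical flow field (flow[:, 0] = horizontal, flow[:, 1] = vertical, in pixels).
    neighbor: (B, C, H, W), flow: (B, 2, H, W)."""
    b, _, h, w = neighbor.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(neighbor.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                                 # (B, 2, H, W)
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                  # (B, H, W, 2)
    # Bilinear warping: the "frame alignment" stage of two-stage methods.
    return F.grid_sample(neighbor, grid, mode="bilinear", align_corners=True)
```

Errors made in the flow-estimation stage propagate directly into this warping step, which is one source of the ambiguous results mentioned above.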

To achieve motion compensation in a unified stage, Tian et al. [17] proposed a temporally deformable alignment network (TDAN) for video SR using deformable convolutions. Specifically, neighboring frames are first aligned to the reference frame by deformable convolutions. Afterwards, these aligned frames are fed to CNNs to generate SR results. Wang et al. [22] proposed an enhanced deformable video restoration network, namely EDVR. The pyramid, cascading and deformable (PCD) alignment module of EDVR can handle complicated and long-range motion and therefore improves the performance of video SR. Xiang et al. [23] proposed a deformable ConvLSTM method to exploit superior temporal information for video sequences with large motion. However, these aforementioned methods share a common issue. That is, feature extraction is performed within the spatial domain and motion compensation is performed within the temporal domain. As shown in Fig. 1 (a), these two steps are orthogonal. Consequently, the spatio-temporal information within a video sequence cannot be fully used and the coherence of SR video sequences is reduced.

Since 3D convolutions (C3D) [18] can model appearance and motion simultaneously, it is straightforward to apply C3D for video SR. Li et al. [9] proposed a fast spatio-temporal residual network (FSTRN) to perform feature extraction and motion compensation jointly. However, due to its fixed receptive field, C3D [18] cannot model large motion effectively. To obtain both spatio-temporal modeling capability and motion-aware modeling flexibility, we integrate 2D spatial deformable convolutions [3] with C3D [18] to achieve deformable 3D convolutions (D3D). D3D enhances the transformation modeling capability in the spatial dimension and achieves superior motion compensation performance along the temporal dimension.

In this paper, we propose a deformable 3D convolution network (D3Dnet) for video SR. Specifically, LR video sequences are first sent to C3D to generate features, which are further fed to 5 cascaded residual D3D (resD3D) blocks to exploit spatio-temporal information and achieve motion compensation, as shown in Fig. 1 (b). Then, the compensated LR features are fused by a bottleneck layer. Finally, the fused LR feature is fed to 6 residual blocks to reconstruct the SR reference frame. Extensive experiments have demonstrated the superiority of our D3Dnet.

Fig. 1: Difference between existing two-stage video SR methods and our one-stage D3Dnet. (a) The architecture of existing two-stage video SR methods, which achieve feature extraction and motion compensation along the spatial and temporal dimensions separately. (b) The architecture of our one-stage D3Dnet, which integrates the two steps to fully exploit the spatio-temporal information for video SR.

II Methodology

II-A Deformable 3D Convolution

The plain C3D [18] is achieved in two steps: 1) sampling the input feature x with a 3D convolution kernel; 2) computing a weighted summation of the sampled values with the kernel weights w. Specifically, for a regular 3×3×3 convolution kernel with a dilation of 1, the output feature can be formulated as:

y(p_0) = \sum_{p_n \in \mathcal{G}} w(p_n) \cdot x(p_0 + p_n),   (1)

where p_0 represents a location in the output feature y and p_n enumerates the locations of the 3×3×3 convolution sampling grid \mathcal{G} = \{(-1,-1,-1), (-1,-1,0), \ldots, (1,1,1)\}. Here, N = |\mathcal{G}| = 27 is the size of the sampling grid. As shown in Fig. 2, the 3×3×3 light orange cubes in the input feature can be considered as the plain C3D sampling grid, which is used to generate the dark orange cube in the output feature.
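For reference, Eq. (1) is exactly what a standard 3D convolution layer computes; a minimal PyTorch equivalent is shown below, with illustrative channel sizes that are not specified in the text.

```python
import torch
import torch.nn as nn

# Plain C3D of Eq. (1): a 3x3x3 kernel with dilation 1 slides over the
# (channel, time, height, width) volume and computes a weighted sum of the
# N = 27 sampled values at every output location.
c3d = nn.Conv3d(in_channels=64, out_channels=64, kernel_size=3, padding=1, dilation=1)

x = torch.randn(1, 64, 7, 32, 32)   # (batch, channels, frames, height, width)
y = c3d(x)                           # same spatio-temporal size due to padding=1
```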

Modified from C3D, D3D enlarges the spatial receptive field with learnable offsets, which improves the appearance and motion modeling capability. As illustrated in Fig. 2, the input feature of size C×T×H×W is first fed to a C3D layer to generate offsets of size 2N×T×H×W. These offsets have the same spatio-temporal resolution as the input feature, while the channel dimension is set to 2N for 2D spatial deformable sampling. More specifically, the light cube in the middle of the offset field lies at the same location as the core of the plain C3D kernel. Numerically, the offset cube has 2N channels, whose values represent the deformations Δx and Δy of the convolution sampling grid in width and height. Then, the learned offsets are used to guide the deformation of the plain C3D sampling grid (i.e., the light orange cubes in the input feature) to generate a D3D sampling grid (i.e., the dark orange cubes in the input feature). Finally, the D3D sampling grid is employed to produce the output feature. In summary, D3D is formulated as:

y(p_0) = \sum_{p_n \in \mathcal{G}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n),   (2)

where \Delta p_n = (\Delta x_n, \Delta y_n, 0) denotes the learned offset of the n-th sampling location, i.e., the grid is deformed only in width and height while the temporal sampling positions remain fixed (the temporal length of the sampling grid is 3). Since the learned offsets are generally fractional, we follow [3, 26] and use bilinear interpolation (over the neighboring integer positions obtained by rounding down and up) to generate exact sampling values.
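As a concrete illustration of Eq. (2), the sketch below implements a naive (non-optimized) deformable 3D convolution in PyTorch: an offset generator predicts 2N spatial offsets per location, each of the N = 27 grid points is read out at its deformed fractional position by bilinear interpolation, and the results are weight-summed. The channel width, initialization and offset-generator design are assumptions for illustration; this is not the implementation from the authors' repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveD3D(nn.Module):
    """Readable (but slow) sketch of Eq. (2): each of the N = 27 points of the
    3x3x3 grid keeps its fixed temporal position but is shifted by a learned
    (dx, dy) offset and read out with bilinear interpolation."""

    def __init__(self, channels=64):
        super().__init__()
        self.grid = [(dt, dy, dx) for dt in (-1, 0, 1)
                                  for dy in (-1, 0, 1)
                                  for dx in (-1, 0, 1)]            # sampling grid G, N = 27
        n = len(self.grid)
        # Offset generator: 2N output channels, i.e. one (dx, dy) pair per grid point.
        self.offset_conv = nn.Conv3d(channels, 2 * n, kernel_size=3, padding=1)
        # One (C_out x C_in) weight matrix per grid point; applying it with a 1x1x1
        # convolution and summing over n reproduces the weighted sum of Eq. (2).
        self.weight = nn.Parameter(torch.randn(n, channels, channels) * 0.01)

    def forward(self, x):                                          # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        offsets = self.offset_conv(x)                              # (B, 2N, T, H, W)
        ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                                torch.arange(w, device=x.device), indexing="ij")
        padded = F.pad(x, (0, 0, 0, 0, 1, 1))                      # zero-pad the temporal ends
        out = 0.0
        for i, (dt, dy, dx) in enumerate(self.grid):
            shifted = padded[:, :, 1 + dt:1 + dt + t]              # x at temporal position t + dt
            # Fractional sampling positions p_0 + p_n + Δp_n (spatial part only).
            sample_x = xs + dx + offsets[:, 2 * i]                 # (B, T, H, W)
            sample_y = ys + dy + offsets[:, 2 * i + 1]
            grid = torch.stack((2 * sample_x / max(w - 1, 1) - 1,
                                2 * sample_y / max(h - 1, 1) - 1), dim=-1)
            # Bilinear read-out of the fractional positions, one temporal slice at a time.
            flat = shifted.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
            sampled = F.grid_sample(flat, grid.reshape(b * t, h, w, 2),
                                    mode="bilinear", align_corners=True)
            sampled = sampled.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
            # w(p_n) · x(p_0 + p_n + Δp_n), accumulated over the grid.
            out = out + F.conv3d(sampled, self.weight[i][:, :, None, None, None])
        return out
```

Decomposing the kernel into one 1×1×1 convolution per sampling point keeps the correspondence to Eq. (2) explicit, at the cost of looping over the 27 grid points; efficient implementations fuse this into a single custom operator.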

II-B Overall Framework

Fig. 2: Toy example of deformable 3D convolution (D3D). In the input feature of size C×T×H×W, the light orange cubes represent the plain 3×3×3 convolution sampling grid, and the dark orange cubes represent the deformed grid. The deformations are caused by extra offsets, which are generated by an offset generator (the orange box of 3×3×3 convolution). The cube in the offset field of size 2N×T×H×W lies at the same location as the core of the convolution sampling grid and has 2N values along its channel dimension. Here, N is the size of the sampling grid and is set to 27. The values of the offset cube represent the deformation values of the convolution sampling grid (Δx in width and Δy in height). Finally, the offset is used to guide the convolution for the generation of the dark orange cube in the output feature.
Fig. 3: An illustration of our deformable 3D convolution network (D3Dnet). (a) The overall framework. (b) The residual deformable 3D convolution (resD3D) block for simultaneous appearance and motion modeling. (c) The residual block for the reconstruction of SR results.

The overall framework is shown in Fig. 3 (a). Video sequences are used to reconstruct the SR reference frames. Specifically, a video sequence with 7 frames is first fed to a C3D [18] layer to generate features, which are then fed to 5 resD3D blocks (shown in Fig. 3 (b)) to achieve motion compensation while capturing spatial information. Then, a bottleneck layer is employed to fuse the compensated video sequence into a reference feature. Finally, the fused feature is processed by 6 cascaded residual blocks (shown in Fig. 3 (c)) to reconstruct the SR reference frame. We use the mean square error (MSE) between the SR reference frame and the ground-truth reference frame as the loss function of our network.
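The description above maps onto a compact module skeleton; a hedged sketch follows. The block counts (one C3D layer, 5 resD3D blocks, a bottleneck fusion layer, 6 residual blocks) follow Fig. 3 (a), while the channel width, the internal layout of the residual blocks, the bottleneck design, the sub-pixel upsampling layer and the ×4 scale factor are assumptions that go beyond the text.

```python
import torch
import torch.nn as nn

class ResD3DBlockStub(nn.Module):
    """Stand-in for the resD3D block of Fig. 3 (b); in the real network the 3D
    convolutions here would be deformable (e.g., the NaiveD3D sketch above)."""
    def __init__(self, c=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv3d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv3d(c, c, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class ResBlock2D(nn.Module):
    """Residual block used for reconstruction (Fig. 3 (c)); layout is an assumption."""
    def __init__(self, c=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(c, c, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class D3DnetSketch(nn.Module):
    """Skeleton of the pipeline in Fig. 3 (a): C3D feature extraction ->
    5 resD3D blocks -> bottleneck fusion -> 6 residual blocks -> upsampling."""
    def __init__(self, c=64, frames=7, scale=4):
        super().__init__()
        self.feat = nn.Conv3d(3, c, 3, padding=1)                        # C3D feature extraction
        self.align = nn.Sequential(*[ResD3DBlockStub(c) for _ in range(5)])
        self.fuse = nn.Conv2d(c * frames, c, 1)                          # bottleneck fusion layer
        self.recon = nn.Sequential(*[ResBlock2D(c) for _ in range(6)])
        self.up = nn.Sequential(nn.Conv2d(c, 3 * scale ** 2, 3, padding=1),
                                nn.PixelShuffle(scale))                  # sub-pixel upsampling

    def forward(self, lr_seq):                                           # (B, 3, 7, H, W)
        f = self.align(self.feat(lr_seq))                                # (B, C, 7, H, W)
        f = self.fuse(f.flatten(1, 2))                                   # (B, C, H, W)
        return self.up(self.recon(f))                                    # (B, 3, sH, sW)

# Training objective from the text: MSE between the SR frame and its ground truth.
criterion = nn.MSELoss()
```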

III Experiments

III-A Implementation Details

For training, we employed the Vimeo-90k dataset [24] as the training set, whose sequences have a fixed resolution of 448×256. To generate training data, all video sequences were downsampled (by the SR scale factor) to produce their LR counterparts. Then, we randomly cropped these LR images into patches of a fixed size and stride, and cropped their corresponding HR images accordingly. We followed [20, 19] to augment the training data by random flipping and rotation.
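A minimal sketch of this flip-and-rotate augmentation, applied consistently to an LR/HR patch pair, is given below; the exact transform set and probabilities are assumptions following common practice in [19, 20].

```python
import random
import torch

def augment(lr, hr):
    """Randomly flip/rotate an LR sequence and its HR frame consistently.
    lr: (C, T, H, W), hr: (C, sH, sW); square patches are assumed so that the
    90-degree rotation keeps the shapes valid."""
    if random.random() < 0.5:                                   # horizontal flip
        lr, hr = lr.flip(-1), hr.flip(-1)
    if random.random() < 0.5:                                   # vertical flip
        lr, hr = lr.flip(-2), hr.flip(-2)
    if random.random() < 0.5:                                   # 90-degree rotation
        lr, hr = torch.rot90(lr, 1, dims=(-2, -1)), torch.rot90(hr, 1, dims=(-2, -1))
    return lr, hr
```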

For testing, we employed the Vid4 dataset [1]. Following [17, 21], we used peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) as quantitative metrics to evaluate the SR performance. In addition, we used the motion-based video integrity evaluation index (MOVIE) and temporal MOVIE (T-MOVIE) [14] to evaluate temporal consistency.
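For reference, a minimal PSNR computation is sketched below; it assumes images scaled to [0, 1] and omits the luminance-channel conversion and border cropping that SR evaluation protocols typically apply.

```python
import torch

def psnr(sr, hr, max_val=1.0):
    """Peak signal-to-noise ratio between an SR frame and its ground truth,
    both with values in [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```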

Fig. 4: Comparative results of the two-stage method and two one-stage methods (C3D and D3D) with respect to the number of blocks. “params.” represents the number of parameters.

All experiments were implemented in PyTorch on an Nvidia RTX 2080Ti GPU. The networks were trained using the Adam method [8]. The learning rate was halved every 6 epochs, and we stopped training after 35 epochs.
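This schedule maps directly onto PyTorch's Adam optimizer and a StepLR scheduler; the initial learning rate and the placeholder model below are assumptions, since the value is not given in the text above.

```python
import torch

# `model` stands for the video SR network being trained (e.g., the D3DnetSketch above).
model = torch.nn.Conv2d(3, 3, 3, padding=1)                  # placeholder module for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # initial LR is a placeholder value
# Halve the learning rate every 6 epochs, as described in the text.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.5)

for epoch in range(35):                                      # training stops after 35 epochs
    # ... one epoch of training over the Vimeo-90k patches ...
    scheduler.step()
```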

III-B Ablation Study

III-B1 One-stage vs. Two-stage

We designed two variants to test the performance improvement introduced by integrating feature extraction and motion compensation. For the two-stage variant, we replaced the resD3D blocks with residual blocks followed by deformable alignment [17] to sequentially perform spatial feature extraction and temporal motion compensation (see two-stage in Fig. 4). For the one-stage variant, we replaced the resD3D blocks with residual C3D blocks to integrate the two steps without deformation (see C3D in Fig. 4). It can be observed that the PSNR and SSIM values of the two-stage variant are lower than those of the one-stage variant by 0.10 dB and 0.006 on average, while the two-stage variant has more parameters than the one-stage variant. This demonstrates that the one-stage method can fully exploit the spatio-temporal information for video SR with fewer parameters.

Fig. 5: Qualitative results achieved by different methods. Blue boxes represent the temporal profiles among different frames.
fra.    City         Walk         Calendar     Foliage      Average
3       27.00/0.765  29.31/0.889  22.98/0.762  25.61/0.729  26.22/0.786
5       27.16/0.776  29.63/0.895  23.19/0.773  25.79/0.740  26.44/0.796
7       27.23/0.780  29.72/0.896  23.26/0.775  25.88/0.745  26.52/0.799
TABLE I: Results (PSNR/SSIM) achieved by D3Dnet models trained with different numbers of input frames. “fra.” represents the number of input frames.

III-B2 C3D vs. D3D

To test the performance improvement introduced by the deformation operation, we compare the results of our network with different numbers of residual C3D blocks and resD3D blocks (see C3D and D3D in Fig. 4). It can be observed that D3D achieves a significant improvement in SR performance over C3D [18]. Specifically, our network with 5 resD3D blocks achieves an improvement of 0.40 dB in PSNR and 0.017 in SSIM as compared to the network with 5 residual C3D blocks. This demonstrates that D3D is more effective than C3D in appearance and motion modeling. Note that, due to the additional offset learning branch, each resD3D block introduces 0.50M extra parameters.

Methods VSRnet [7] VESPCN [1] DBPN [5] RCAN [25] TDVSR [11, 12] DRVSR [16] SOF-VSR [20, 19] TDAN [17] D3Dnet
City 25.62/0.654 26.17/0.696 25.80/0.682 26.10/0.696 26.41/0.719 26.88/0.752 26.78/0.747 26.99/0.757 27.23/0.780
Walk 27.54/0.844 28.31/0.861 28.64/0.872 28.65/0.872 28.24/0.746 28.80/0.783 28.99/0.878 29.50/0.890 29.72/0.896
Calendar 21.34/0.644 21.98/0.691 22.29/0.715 22.33/0.725 22.15/0.704 22.76/0.744 22.76/0.744 22.98/0.756 23.26/0.775
Foliage 24.41/0.645 24.91/0.673 24.73/0.661 24.74/0.665 25.16/0.700 25.56/0.721 25.55/0.718 25.51/0.717 25.88/0.745
Average 24.73/0.697 25.34/0.730 25.37/0.732 25.46/0.740 25.49/0.746 25.99/0.773 26.02/0.771 26.24/0.780 26.52/0.799
TABLE II: Quantitative results (PSNR/SSIM) achieved by different methods. Best results are shown in boldface.

III-B3 Context Length

The results of our D3Dnet with different numbers (i.e., 3, 5 and 7) of input frames are shown in Table I. It can be observed that the performance improves as the number of input frames increases. Specifically, the average PSNR/SSIM improves from 26.22/0.786 to 26.52/0.799 when the number of input frames increases from 3 to 7. This is because more input frames introduce additional temporal information, which improves the performance of video SR.

III-C Comparison to State-of-the-Art Methods

We compare our D3Dnet with two single-image SR methods (i.e., DBPN [5] and RCAN [25]) and six video SR methods (i.e., VSRnet [7], VESPCN [1], TDVSR [11, 12], DRVSR [16], SOF-VSR [20, 19], and TDAN [17]). For a fair comparison, the first and last 2 frames of the video sequences were not used for performance evaluation.

Methods     VSRnet [7]  TDVSR [11]  SOF-VSR [20]  DRVSR [16]  D3Dnet
T-MOVIE     26.05       23.23       19.35         18.28       15.45
MOVIE       6.01        4.92        4.25          4.00        3.38
TABLE III: Quantitative results of T-MOVIE and MOVIE (lower is better) achieved by different video SR methods. Best results are shown in boldface.

Quantitative results are listed in Tables II and III. D3Dnet achieves the highest PSNR and SSIM scores among all these methods, which means that our network can recover accurate spatial dependency details. This is because D3D improves the exploitation of spatial information and performs motion compensation effectively. In addition, D3Dnet outperforms existing methods in terms of T-MOVIE and MOVIE by a notable margin, which means that our results are temporally more consistent. This is because the one-stage D3Dnet can recover more accurate temporal dependency details.

Qualitative results are shown in Fig. 5. It can be observed from the zoom-in regions that D3Dnet recovers finer details (e.g., the sharp edge of the word ‘MAREE’ and the clear and smooth roof pattern). In addition, the temporal profiles of D3Dnet are clearer and smoother than those of other methods. In conclusion, our network achieves high-quality and temporally consistent SR results.

IV Conclusion

In this paper, we have proposed a deformable 3D convolution network (D3Dnet) to exploit spatio-temporal information for video SR. Our network introduces deformable 3D convolutions (D3D) to model appearance and motion simultaneously. Experimental results have demonstrated that our D3Dnet can effectively use the additional temporal information for video SR and outperforms the state-of-the-art SR methods.

References

• [1] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi (2017) Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4778–4787.
• [2] J. Chen, J. Nunez-Yanez, and A. Achim (2011) Video super-resolution using generalized Gaussian Markov random fields. IEEE Signal Processing Letters 19 (2), pp. 63–66.
• [3] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.
• [4] B. Gunturk, Y. Altunbasak, and R. Mersereau (2002) Multiframe resolution-enhancement methods for compressed video. IEEE Signal Processing Letters 9 (6), pp. 170–174.
• [5] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1664–1673.
• [6] K. Jiang, Z. Wang, P. Yi, and J. Jiang (2018) A progressively enhanced network for video satellite imagery super-resolution. IEEE Signal Processing Letters 25 (11), pp. 1630–1634.
• [7] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos (2016) Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2 (2), pp. 109–122.
• [8] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
• [9] S. Li, F. He, B. Du, L. Zhang, Y. Xu, and D. Tao (2019) Fast spatio-temporal residual network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
• [10] R. Liao, X. Tao, R. Li, Z. Ma, and J. Jia (2015) Video super-resolution via deep draft-ensemble learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 531–539.
• [11] D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, and T. Huang (2017) Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2507–2515.
• [12] D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, X. Wang, and T. S. Huang (2018) Learning temporal dynamics for video super-resolution: a deep learning approach. IEEE Transactions on Image Processing 27 (7), pp. 3432–3445.
• [13] Y. Lu, J. Valmadre, H. Wang, J. Kannala, M. Harandi, and P. Torr (2020) Devon: deformable volume network for learning optical flow. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 2705–2713.
• [14] K. Seshadrinathan and A. C. Bovik (2010) Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing 19 (2), pp. 335–350.
• [15] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883.
• [16] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia (2017) Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4472–4480.
• [17] Y. Tian, Y. Zhang, Y. Fu, and C. Xu (2020) TDAN: temporally deformable alignment network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
• [18] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.
• [19] L. Wang, Y. Guo, Z. Lin, X. Deng, and W. An (2018) Learning for video super-resolution through HR optical flow estimation. In Asian Conference on Computer Vision, pp. 514–529.
• [20] L. Wang, Y. Guo, L. Liu, Z. Lin, X. Deng, and W. An (2020) Deep video super-resolution using HR optical flow estimation. IEEE Transactions on Image Processing.
• [21] L. Wang, Y. Wang, Z. Liang, Z. Lin, J. Yang, W. An, and Y. Guo (2019) Learning parallax attention for stereo image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12250–12259.
• [22] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. C. Loy (2019) EDVR: video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
• [23] X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J. P. Allebach, and C. Xu (2020) Zooming Slow-Mo: fast and accurate one-stage space-time video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
• [24] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2017) Video enhancement with task-oriented flow. International Journal of Computer Vision.
• [25] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision, pp. 286–301.
• [26] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable ConvNets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316.