Repository for "Deformable 3D Convolution for Video Super-Resolution", arXiv, 2020
The spatio-temporal information among video sequences is significant for video super-resolution (SR). However, the spatio-temporal information cannot be fully used by existing video SR methods since spatial feature extraction and temporal motion compensation are usually performed sequentially. In this paper, we propose a deformable 3D convolution network (D3Dnet) to incorporate spatio-temporal information from both spatial and temporal dimensions for video SR. Specifically, we introduce deformable 3D convolutions (D3D) to integrate 2D spatial deformable convolutions with 3D convolutions (C3D), obtaining both superior spatio-temporal modeling capability and motion-aware modeling flexibility. Extensive experiments have demonstrated the effectiveness of our proposed D3D in exploiting spatio-temporal information. Comparative results show that our network outperforms the state-of-the-art methods. Code is available at: https://github.com/XinyiYing/D3Dnet.
Video super-resolution (SR) aims at recovering high-resolution (HR) images from low-resolution (LR) video sequences. This technique has been widely employed in many applications such as video surveillance and high-definition devices [2, 4]. Since multiple images provide additional information in the temporal dimension, it is important to fully exploit the spatio-temporal dependency to enhance video SR performance.
Current video SR methods commonly follow a three-step pipeline consisting of feature extraction, motion compensation, and reconstruction. Existing video SR methods generally focus on the motion compensation step and propose different approaches to handle this problem. Specifically, Liao et al. first achieved motion compensation by using several optical flow algorithms to generate SR drafts, and then ensembled these SR drafts with a CNN. Liu et al. [11, 12] first performed rectified optical flow alignment and then fed the aligned LR frames to a temporal adaptive neural network to reconstruct an SR frame at an optimal temporal scale. Wang et al. [19, 20] proposed an SOF-VSR network to obtain temporally consistent details through HR optical flow estimation. Caballero et al. proposed a spatial transformer network employing a spatio-temporal ESPCN to recover an HR frame from a compensated consecutive sequence in an end-to-end manner. Tao et al. integrated a sub-pixel motion compensation (SPMC) layer into CNNs to achieve improved performance. All these methods perform motion compensation in two separate stages: motion estimation by optical flow approaches and frame alignment by warping, which can result in ambiguous and duplicated results.
To achieve motion compensation in a unified stage, Tian et al.  proposed a temporally deformable alignment network (TDAN) for video SR using deformable convolutions. Specifically, neighboring frames are first aligned to the reference frame by deformable convolutions. Afterwards, these aligned frames are fed to CNNs to generate SR results. Wang et al.  proposed an enhanced deformable video restoration network, namely EDVR. The pyramid, cascading and deformable (PCD) alignment module of EDVR can handle complicated and long-range motion and therefore improves the performance of video SR. Xiang et al.  proposed a deformable ConvLSTM method to exploit superior temporal information for video sequences with large motion. However, these aforementioned methods share a common issue. That is, feature extraction is performed within the spatial domain and motion compensation is performed within the temporal domain. As shown in Fig. 1 (a), these two steps are orthogonal. Consequently, the spatio-temporal information within a video sequence cannot be fully used and the coherence of SR video sequences is reduced.
Since 3D convolutions (C3D) can model appearance and motion simultaneously, it is straightforward to apply C3D for video SR. Li et al. proposed a fast spatio-temporal residual network (FSTRN) to perform feature extraction and motion compensation jointly. However, due to its fixed receptive field, C3D cannot model large motion effectively. To obtain both spatio-temporal modeling capability and motion-aware modeling flexibility, we integrate 2D spatial deformable convolutions with C3D to obtain deformable 3D convolutions (D3D). D3D can enhance the transformation modeling capability in the spatial dimension and achieve superior motion compensation performance along the temporal dimension.
In this paper, we propose a deformable 3D convolution network (D3Dnet) for video SR. Specifically, LR video sequences are first sent to C3D to generate features, which are further fed to 5 cascaded residual D3D (resD3D) blocks to exploit spatio-temporal information and achieve motion compensation, as shown in Fig. 1 (b). Then, the compensated LR features are fused by a bottleneck layer. Finally, the fused LR feature is fed to 6 residual blocks to reconstruct the SR reference frame. Extensive experiments have demonstrated the superiority of our D3Dnet.
The plain C3D is achieved in the following two steps: 1) 3D convolution kernel sampling on the input feature $\mathbf{x}$; 2) weighted summation of the sampled values with the kernel function $w$. To be specific, the feature passed through a regular $3\times3\times3$ convolution kernel with a dilation of 1 can be formulated as:

$$\mathbf{y}(\mathbf{p}_0)=\sum_{\mathbf{p}_n\in\mathcal{G}}w(\mathbf{p}_n)\,\mathbf{x}(\mathbf{p}_0+\mathbf{p}_n),$$

where $\mathbf{p}_0$ represents a location in the output feature and $\mathcal{G}$ represents the $3\times3\times3$ convolution sampling grid. Here, $N=|\mathcal{G}|=27$ is the size of the sampling grid. As shown in Fig. 2, the $3\times3\times3$ light orange cubes in the input feature can be considered as the plain C3D sampling grid, which is used to generate the dark orange cube in the output feature.
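As a concrete illustration, the weighted summation above can be sketched in a few lines. This is a toy scalar implementation (feature stored as a Python dict, zero padding at the borders); the names `c3d_at`, `x`, and `w` are illustrative and not taken from the released code.

```python
def c3d_at(x, w, p0):
    """Weighted sum over the 3x3x3 sampling grid centered at p0.

    x  -- input feature as a dict mapping (t, h, w) -> value
    w  -- kernel as a dict mapping offsets (dt, dh, dw) in {-1,0,1}^3 -> weight
    p0 -- center location (t, h, w) in the output feature
    """
    t0, h0, w0 = p0
    total = 0.0
    for dt in (-1, 0, 1):
        for dh in (-1, 0, 1):
            for dw in (-1, 0, 1):
                # zero padding: missing locations contribute 0
                total += w[(dt, dh, dw)] * x.get((t0 + dt, h0 + dh, w0 + dw), 0.0)
    return total
```

With an all-ones kernel and an all-ones input, an interior output location simply counts the 27 grid samples.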
Modified from C3D, D3D can enlarge the spatial receptive field with learnable offsets, which improves the appearance and motion modeling capability. As illustrated in Fig. 2, the input feature of size $C\times T\times H\times W$ is first fed to C3D to generate offsets of size $2N\times T\times H\times W$. These offsets have the same spatio-temporal resolution as the input feature, while the channel dimension is set to $2N$ for 2D spatial deformable convolutions. More specifically, the light cube in the middle of the offset field lies at the same location as the core of the plain C3D kernel. Numerically, the offset cube has $2N$ channels and its values represent the deformations $\Delta\mathbf{p}_n$ of the convolution sampling grid in width and height. Then, the learned offsets are used to guide the deformation of the plain C3D sampling grid (i.e., the light orange cubes in the input feature) to generate a D3D sampling grid (i.e., the dark orange cubes in the input feature). Finally, the D3D sampling grid is employed to produce the output feature. In summary, D3D is formulated as:

$$\mathbf{y}(\mathbf{p}_0)=\sum_{\mathbf{p}_n\in\mathcal{G}}w(\mathbf{p}_n)\,\mathbf{x}(\mathbf{p}_0+\mathbf{p}_n+\Delta\mathbf{p}_n).$$

Since the learned offsets $\Delta\mathbf{p}_n$ are generally fractional, we use bilinear interpolation to generate exact sampling values.
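Because the offsets are fractional, each deformed sample is read with bilinear interpolation over the four nearest integer positions within a frame (the deformation is 2D spatial, so interpolation is per frame). A minimal pure-Python sketch, assuming zero padding outside the frame; the function name `bilinear` is ours, not the authors':

```python
import math

def bilinear(frame, h, w):
    """Sample frame (a list of rows) at the fractional location (h, w)."""
    h0, w0 = math.floor(h), math.floor(w)
    dh, dw = h - h0, w - w0

    def px(i, j):
        # zero padding outside the frame boundary
        if 0 <= i < len(frame) and 0 <= j < len(frame[0]):
            return frame[i][j]
        return 0.0

    return ((1 - dh) * (1 - dw) * px(h0, w0)
            + (1 - dh) * dw * px(h0, w0 + 1)
            + dh * (1 - dw) * px(h0 + 1, w0)
            + dh * dw * px(h0 + 1, w0 + 1))
```

At integer locations this reduces to an ordinary lookup, so D3D degenerates to plain C3D when all offsets are zero.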
The overall framework is shown in Fig. 3 (a). Video sequences are used to reconstruct the SR reference frames. Specifically, a video sequence with 7 frames is first fed to a C3D layer to generate features, which are then fed to 5 resD3D blocks (shown in Fig. 3 (b)) to achieve motion compensation while capturing spatial information. Then, a bottleneck layer is employed to fuse the compensated video sequence into a reference feature. Finally, the fused feature is processed by 6 cascaded residual blocks (shown in Fig. 3 (c)) to reconstruct the SR reference frame. We use the mean square error (MSE) between the SR reference frame and the ground-truth reference frame as the loss function of our network.
For training, we employed the Vimeo-90k dataset as the training set, with a fixed resolution of 448×256. To generate training data, all video sequences were downsampled to produce their LR counterparts for SR. Then, we randomly cropped these LR images into patches, and their corresponding HR images were cropped into patches accordingly. We followed [20, 19] to augment the training data by random flipping and rotation.
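The patch sampling described above can be sketched as follows. The patch size and scale factor are illustrative defaults (the paper's exact crop size and stride are not reproduced here), the function name `sample_patch_pair` is ours, and only a horizontal flip is shown for augmentation:

```python
import random

def sample_patch_pair(lr, hr, patch=32, scale=4, rng=random):
    """Crop a random LR patch and the aligned HR patch (scale x larger)."""
    lr_h, lr_w = len(lr), len(lr[0])
    top = rng.randrange(lr_h - patch + 1)
    left = rng.randrange(lr_w - patch + 1)
    lr_patch = [row[left:left + patch] for row in lr[top:top + patch]]
    # the HR crop starts at scale * (top, left) and is scale x as large
    hr_patch = [row[left * scale:(left + patch) * scale]
                for row in hr[top * scale:(top + patch) * scale]]
    if rng.random() < 0.5:  # random horizontal flip, applied to both patches
        lr_patch = [row[::-1] for row in lr_patch]
        hr_patch = [row[::-1] for row in hr_patch]
    return lr_patch, hr_patch
```

The key invariant is that both crops cover the same scene region, so every LR pixel aligns with a scale×scale block of HR pixels.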
For evaluation, we used peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) as quantitative metrics of SR performance. In addition, we used the motion-based video integrity evaluation index (MOVIE) and temporal MOVIE (T-MOVIE) to evaluate temporal consistency.
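For reference, PSNR is just a log-scaled MSE; a minimal sketch for images with values in [0, 1] (SSIM, MOVIE, and T-MOVIE require windowed statistics and motion estimation, so they are omitted here):

```python
import math

def psnr(sr, hr, peak=1.0):
    """Peak signal-to-noise ratio between two equally sized images (lists of rows)."""
    n = 0
    se = 0.0
    for row_sr, row_hr in zip(sr, hr):
        for a, b in zip(row_sr, row_hr):
            se += (a - b) ** 2
            n += 1
    mse = se / n
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```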
We designed two variants to test the performance improvement introduced by integrating feature extraction and motion compensation. For the two-stage variant, we replaced the resD3D blocks with residual blocks and deformable alignment to sequentially perform spatial feature extraction and temporal motion compensation (see two-stage in Fig. 4). For the one-stage variant, we replaced the resD3D blocks with residual C3D blocks to integrate the two steps without deformation (see C3D in Fig. 4). It can be observed that the PSNR and SSIM values of the two-stage variant are lower than those of the one-stage variant by 0.10 and 0.006 on average. Moreover, the two-stage variant has more parameters than the one-stage variant. This demonstrates that the one-stage method can fully exploit the spatio-temporal information for video SR with fewer parameters.
To test the performance improvement brought by the deformation operation, we compare the results of our network with different numbers of residual C3D blocks and resD3D blocks (see C3D and D3D in Fig. 4). It can be observed that D3D achieves a significant improvement in SR performance over C3D. Specifically, our network with 5 resD3D blocks achieves an improvement of 0.40 in PSNR and 0.017 in SSIM as compared to the network with 5 resC3D blocks. This demonstrates that D3D is more effective than C3D in appearance and motion modeling. Note that, due to the additional offset learning branch, each resD3D block introduces 0.50M parameters.
[Table II: quantitative comparison of VSRnet, VESPCN, DBPN, RCAN, TDVSR [11, 12], DRVSR, SOF-VSR [20, 19], TDAN, and D3Dnet.]
The results of our D3Dnet with different numbers (i.e., 3, 5 and 7) of input frames are shown in Table I. It can be observed that the performance improves as the number of input frames increases. Specifically, the PSNR/SSIM improves from 26.22/0.786 to 26.52/0.799 when the number of input frames increases from 3 to 7. That is because more input frames introduce additional temporal information, which improves the performance of video SR.
We compare our D3Dnet with 2 single image SR methods (i.e., DBPN and RCAN) and 6 video SR methods (i.e., VSRnet, VESPCN, TDVSR [11, 12], DRVSR, SOF-VSR [20, 19], and TDAN). For a fair comparison, the first and last 2 frames of each video sequence were not used for performance evaluation.
[Table III: temporal consistency comparison of VSRnet, TDVSR, SOF-VSR, DRVSR, and D3Dnet.]
Quantitative results are listed in Tables II and III. D3Dnet achieves the highest PSNR and SSIM scores among all these methods, which means that our network can recover accurate spatial details. That is because D3D improves the exploitation of spatial information and performs motion compensation effectively. In addition, D3Dnet outperforms existing methods in terms of T-MOVIE and MOVIE by a notable margin, which means that our results are temporally more consistent. That is because the one-stage D3Dnet can recover more accurate temporal dependency details.
Qualitative results are shown in Fig. 5. It can be observed from the zoom-in regions that D3Dnet can recover finer details (e.g., the sharp edge of the word ‘MAREE’ and the clear and smooth roof pattern). In addition, the temporal profiles of D3Dnet are clearer and smoother than those of other methods. In conclusion, our network can achieve high-quality and temporally consistent SR results.
In this paper, we have proposed a deformable 3D convolution network (D3Dnet) to exploit spatio-temporal information for video SR. Our network introduces deformable 3D convolutions (D3D) to model appearance and motion simultaneously. Experimental results have demonstrated that our D3Dnet can effectively use the additional temporal information for video SR and outperforms the state-of-the-art SR methods.
Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2(2), pp. 109–122.
Learning temporal dynamics for video super-resolution: a deep learning approach. IEEE Transactions on Image Processing 27(7), pp. 3432–3445.