Convolutional neural networks have enabled accurate image super-resolution in real-time. However, recent attempts to benefit from temporal correlations in video super-resolution have been limited to naive or inefficient architectures. In this paper, we introduce spatio-temporal sub-pixel convolution networks that effectively exploit temporal redundancies and improve reconstruction accuracy while maintaining real-time speed. Specifically, we discuss the use of early fusion, slow fusion and 3D convolutions for the joint processing of multiple consecutive video frames. We also propose a novel joint motion compensation and video super-resolution algorithm that is orders of magnitude more efficient than competing methods, relying on a fast multi-resolution spatial transformer module that is end-to-end trainable. These contributions provide both higher accuracy and temporally more consistent videos, which we confirm qualitatively and quantitatively. Relative to single-frame models, spatio-temporal networks can either reduce the computational cost by 30% while maintaining similar quality, or provide a 0.2dB gain for a similar computational cost. Results on publicly available datasets demonstrate that the proposed algorithms surpass current state-of-the-art performance in both accuracy and efficiency.
Image and video SR are long-standing challenges of signal processing. SR aims at recovering an HR image or video from its LR version, and finds direct applications ranging from medical imaging [38, 34] to satellite imaging, as well as facilitating tasks such as face recognition. The reconstruction of HR data from a LR input is however a highly ill-posed problem that requires additional constraints to be solved. While those constraints are often application-dependent, they usually rely on data redundancy.
In single image SR, where only one LR image is provided, methods exploit inherent image redundancy in the form of local correlations to recover lost high-frequency details by imposing sparsity constraints  or assuming other types of image statistics such as multi-scale patch recurrence . In multi-image SR  it is assumed that different observations of the same scene are available, hence the shared explicit redundancy can be used to constrain the problem and attempt to invert the downscaling process directly. Transitioning from images to videos implies an additional data dimension (time) with a high degree of correlation that can also be exploited to improve performance in terms of accuracy as well as efficiency.
Video SR methods have mainly emerged as adaptations of image SR techniques. Kernel regression methods  have been shown to be applicable to videos using 3D kernels instead of 2D ones . Dictionary learning approaches, which define LR images as a sparse linear combination of dictionary atoms coupled to a HR dictionary, have also been adapted from images  to videos . Another approach is example-based patch recurrence, which assumes patches in a single image or video obey multi-scale relationships, and therefore missing high-frequency content at a given scale can be inferred from coarser scale patches. This was successfully presented by Glasner et al.  for image SR and has later been extended to videos .
When adapting a method from images to videos it is usually beneficial to incorporate the prior knowledge that frames of the same scene of a video can be approximated by a single image and a motion pattern. Estimating and compensating motion is a powerful mechanism to further constrain the problem and expose temporal correlations. It is therefore very common to find video SR methods that explicitly model motion through frames. A natural choice has been to preprocess input frames by compensating inter-frame motion using displacement fields obtained from off-the-shelf optical flow algorithms . This nevertheless requires frame preprocessing and is usually expensive. Alternatively, motion compensation can also be performed jointly with the SR task, as done in the Bayesian approach of Liu et al.  by iteratively estimating motion as part of its wider modeling of the downscaling process.
The advent of neural network techniques that can be trained from data to approximate complex nonlinear functions has set new performance standards in many applications including SR. Dong et al.  proposed to use a CNN architecture for single image SR that was later extended by Kappeler et al.  in a video SR network (VSRnet) which jointly processes multiple input frames. Additionally, compensating the motion of input images with a TV-based optical flow algorithm showed an improved accuracy. Joint motion compensation for SR with neural networks has also been studied through recurrent bidirectional networks .
The common paradigm for CNN based approaches has been to upscale the LR image with bicubic interpolation before attempting to solve the SR problem [6, 22]. However, increasing input image size through interpolation considerably impacts the computational burden for CNN processing. A solution was proposed by Shi et al. with an efficient sub-pixel convolution network (ESPCN) , where an upscaling operation directly mapping from LR to HR space is learnt by the network. This technique reduces runtime by an order of magnitude and enables real-time video SR by independently processing frames with a single frame model. Similar solutions to improve efficiency have also been proposed based on transposed convolutions [7, 20].
Existing solutions for HD video SR have not been able to effectively exploit temporal correlations while performing in real-time. On the one hand, ESPCN  leverages sub-pixel convolution for a very efficient operation, but its naive extension to videos treating frames independently fails to exploit inter-frame redundancies and does not enforce a temporally consistent result. VSRnet , on the other hand, can improve reconstruction quality by jointly processing multiple input frames. However, the preprocessing of LR images with bicubic upscaling and the use of an inefficient motion compensation mechanism slow its runtime far below real-time even on videos smaller than standard definition resolution.
Spatial transformer networks  provide a means to infer parameters for a spatial mapping between two images. These are differentiable networks that can be seamlessly combined and jointly trained with networks targeting other objectives to enhance their performance. For instance, spatial transformer networks were initially shown to facilitate image classification by transforming images onto the same frame of reference . Recently, it has been shown how spatial transformers can encode optical flow features with unsupervised training [11, 1, 29, 14], but they have nevertheless not yet been investigated for video motion compensation. Related approaches have emerged for view synthesis assuming rigid transformations .
In this paper, we combine the efficiency of sub-pixel convolution with the performance of spatio-temporal networks and motion compensation to obtain a fast and accurate video SR algorithm. We study different treatments of the temporal dimension with early fusion, slow fusion and 3D convolutions, which have been previously suggested to extend classification from images to videos [23, 37]. Additionally, we build a motion compensation scheme based on spatial transformers, which is combined with spatio-temporal models to lead to a very efficient solution for video SR with motion compensation that is end-to-end trainable. A high-level diagram of the proposed approach is shown in Fig. 1.
The main contributions of this paper are:
Presenting a real-time approach for video SR based on sub-pixel convolution and spatio-temporal networks that improves accuracy and temporal consistency.
Comparing early fusion, slow fusion and 3D convolutions as alternative architectures for discovering spatio-temporal correlations.
Proposing an efficient method for dense inter-frame motion compensation based on a multi-scale spatial transformer network.
Combining the proposed motion compensation technique with spatio-temporal models to provide an efficient, end-to-end trainable motion compensated video SR algorithm.
Our starting point is the real-time image SR method ESPCN . We restrict our analysis to standard architectural choices and do not further investigate potentially beneficial extensions such as recurrence [15, 16] or training networks based on perceptual loss functions [20, 26, 3, 8]. Throughout the paper we assume all image processing is performed on the y-channel of YCbCr colour space, and thus we represent all images as 2D matrices.
For a given LR image $I^{LR}$, assumed to be the result of low-pass filtering and downscaling by a factor $r$ the HR image $I^{HR}$, the CNN super-resolved solution can be expressed as

$$I^{SR} = f\left(I^{LR}; \theta\right). \qquad (1)$$

Here, $\theta$ are model parameters and $f$ represents the mapping function from LR to HR. A convolutional network models this function as a concatenation of $L$ layers defined by sets of weights and biases $\theta_l = (W_l, b_l)$, each followed by a non-linearity $\phi_l$, with $l \in \{1, \ldots, L\}$. Formally, the output of each layer is written as

$$f^{l}\left(I^{LR}; \theta_{1:l}\right) = \phi_l\left(W_l \ast f^{l-1}\left(I^{LR}; \theta_{1:l-1}\right) + b_l\right), \qquad (2)$$

with $f^{0}(I^{LR}) = I^{LR}$. We assume the shape of the filtering weights to be $n_{l-1} \times n_l \times k_l \times k_l$, where $n_l$ and $k_l$ represent the number and size of filters in layer $l$, with the single frame input meaning $n_0 = 1$. Model parameters are optimised by minimising a loss over a set of LR and HR example image pairs, commonly the MSE:

$$\theta^{\ast} = \operatorname*{arg\,min}_{\theta} \left\| I^{HR} - f\left(I^{LR}; \theta\right) \right\|_2^2. \qquad (3)$$
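As a minimal sketch of the layer recursion and MSE objective above (dependency-free: it uses 1×1 channel-mixing weights in place of $k_l \times k_l$ convolutions, and `forward` and `mse_loss` are illustrative names, not code from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mse_loss(sr, hr):
    """Pixel-wise mean squared error used as the training objective."""
    return np.mean((sr - hr) ** 2)

def forward(x, weights, biases):
    """Layer recursion f^l = phi(W_l * f^{l-1} + b_l).

    x: (H, W, n_0) input; weights[l]: (n_{l-1}, n_l); biases[l]: (n_l,).
    1x1 channel mixing stands in for spatial convolution; the final
    layer uses a linear activation, as is common for SR networks.
    """
    for l, (Wl, bl) in enumerate(zip(weights, biases)):
        x = np.tensordot(x, Wl, axes=([-1], [0])) + bl  # (H, W, n_{l-1}) -> (H, W, n_l)
        if l < len(weights) - 1:
            x = relu(x)
    return x

# A single toy layer: scale by 2 and add a bias of 0.5.
x = np.ones((2, 2, 1), dtype=np.float32)
out = forward(x, [np.array([[2.0]])], [np.array([0.5])])
print(mse_loss(out, np.zeros_like(out)))  # 6.25
```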
Methods preprocessing with bicubic upsampling before mapping from LR to HR impose that the network produces a single output feature map at HR size, $n_L = 1$ [6, 22]. Using sub-pixel convolution allows the network to process directly in LR space and then use $r^2$ output filters to obtain an HR output tensor of shape $H \times W \times r^2$ that can be reordered into $I^{SR}$ of size $rH \times rW$. This implies that if there exists an upscaling operation that is better suited for the problem than bicubic upsampling, the network can learn it. Moreover, and most importantly, all convolutional processing is performed in LR space, making this approach very efficient.
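The reordering step of sub-pixel convolution (often called pixel shuffle or depth-to-space) can be sketched as follows. This is only the final rearrangement, not the learned convolution, and `pixel_shuffle` is an illustrative name:

```python
import numpy as np

def pixel_shuffle(tensor, r):
    """Rearrange an (H, W, C*r^2) tensor into (H*r, W*r, C).

    Each group of r^2 channels at one LR position fills an r x r
    block of HR pixels, matching the usual depth-to-space ordering.
    """
    H, W, Cr2 = tensor.shape
    C = Cr2 // (r * r)
    assert C * r * r == Cr2, "channel count must be divisible by r^2"
    out = tensor.reshape(H, W, r, r, C)   # split channels into an r x r block
    out = out.transpose(0, 2, 1, 3, 4)    # interleave block rows/cols with H, W
    return out.reshape(H * r, W * r, C)

# Example: a 2x3 LR feature map with r = 2 and one output channel.
lr = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
hr = pixel_shuffle(lr, 2)
print(hr.shape)  # (4, 6, 1)
```

The four channels at LR position (0, 0) become the 2×2 HR block at the top-left corner, which is exactly the mapping the $r^2$ output filters are trained to exploit.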
Spatio-temporal networks assume the input data to be a block of spatio-temporal information, such that instead of a single input frame $I^{LR}_t$, a sequence of consecutive frames is considered. This can be represented in the network by introducing an additional dimension for temporal depth $D_l$, with the input depth $D_0$ representing an odd number of consecutive input frames. If we denote the temporal radius of a spatio-temporal block to be $R = (D_0 - 1)/2$, we define the group of input frames centred at time $t$ as $I^{LR}_{[t-R:t+R]}$, and the problem in Eq. 1 becomes

$$I^{SR}_t = f\left(I^{LR}_{[t-R:t+R]}; \theta\right). \qquad (4)$$

The shape of the weighting filters is also extended by their temporal size $d_l$, and their tensor shape becomes $d_l \times n_{l-1} \times n_l \times k_l \times k_l$. We note that it is possible to consider solutions that aim to jointly reconstruct more than a single output frame, which could have advantages at least in terms of computational efficiency. However, in this work we focus on the reconstruction of only a single output frame.
One of the most straightforward approaches for a CNN to process videos is to match the temporal depth of the first layer's filters to the number of input frames, $d_1 = D_0$. This collapses all temporal information in the first layer, and the remaining operations are identical to those in a single image SR network, meaning $D_l = 1$ for $l \geq 1$. An illustration of early fusion is shown in Fig. 1(a), where the temporal dimension has been colour coded and the output mapping to 2D space is omitted. This design has been studied for video classification and action recognition [23, 37], and was also one of the architectures proposed in VSRnet . However, VSRnet requires bicubic upsampling as opposed to sub-pixel convolution, making the framework computationally much less efficient in comparison.
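A hedged sketch of how early fusion presents its input: the consecutive frames are simply stacked along the channel axis, so the first 2D convolution collapses the temporal dimension (`early_fusion_input` is an illustrative name, not the paper's code):

```python
import numpy as np

def early_fusion_input(frames):
    """Stack 2R+1 consecutive LR frames along the channel axis.

    frames: list of (H, W) single-channel images centred on time t.
    Returns an (H, W, 2R+1) tensor that a standard 2D conv layer can
    consume, so all temporal information is merged in its first layer.
    """
    return np.stack(frames, axis=-1)

# Five 4x4 frames (R = 2), each filled with its time index for clarity.
frames = [np.full((4, 4), t, dtype=np.float32) for t in range(5)]
x = early_fusion_input(frames)
print(x.shape)  # (4, 4, 5)
```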
Another option is to partially merge temporal information in a hierarchical structure, so it is slowly fused as information progresses through the network. In this case, the temporal depth of network layers is configured to decrease gradually from $D_0$ to $1$, and therefore some layers also have a temporal filter extent $d_l > 1$ until all information has been merged and the depth of the network reduces to $1$. This architecture, termed slow fusion, has shown better performance than early fusion for video classification . In Fig. 1(b) we show a slow fusion network where $D_0 = 5$ and the rate of fusion is given by $d_l = 2$ whenever $D_{l-1} > 1$ and $d_l = 1$ otherwise, meaning that at each layer only two consecutive frames or filter activations are merged until the network's temporal depth shrinks to $1$. Note that early fusion is a special case of slow fusion.
Another variation of slow fusion is to force layer weights to be shared across the temporal dimension, which has computational advantages. Assuming an online processing of frames, when a new frame becomes available the result of some layers for the previous frame can be reused. For instance, referring to the diagram in Fig. 1(b) and assuming the bottom frame to be the latest frame received, all activations above the dashed line are readily available because they were required for processing the previous frame. This architecture is equivalent to using 3D convolutions, initially proposed as an effective tool to learn spatio-temporal features that can help for video action recognition . An illustration of this design from a 3D convolution perspective is shown in Fig. 1(c), where the arrangement of the temporal and filter features is swapped relative to Fig. 1(b).
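The shrinking temporal depth of slow fusion can be sketched numerically, assuming each layer with temporal extent $d_l$ merges consecutive activations like a valid convolution along time ($D_l = D_{l-1} - d_l + 1$); the function name and defaults are illustrative:

```python
def temporal_depths(D0, fuse=2, layers=9):
    """Temporal depth after each layer when a layer merges `fuse`
    consecutive activations (valid convolution along time) until the
    depth shrinks to 1, after which layers are purely spatial."""
    depths = [D0]
    for _ in range(layers):
        d = fuse if depths[-1] > 1 else 1
        depths.append(max(depths[-1] - d + 1, 1))
    return depths

# A 5-frame input fused two activations at a time, as in the slow
# fusion example described in the text.
print(temporal_depths(5))  # [5, 4, 3, 2, 1, 1, 1, 1, 1, 1]
```

With shared weights (the 3D convolution view), every layer whose depth is still above 1 produces activations that can be cached and reused when the temporal window slides forward by one frame.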
We propose the use of an efficient spatial transformer network to compensate the motion between frames fed to the SR network. It has been shown how spatial transformers can effectively encode optical flow to describe motion [29, 1, 14], and are therefore suitable for motion compensation. We will compensate blocks of three consecutive frames to combine the compensation module with the SR network as shown in Fig. 1, but for simplicity we first introduce motion compensation between two frames. Notice that the data used contains inherent motion blur and (dis)occlusions, and even though an explicit modelling for these effects is not used it could potentially improve results.
The task is to find the best optical flow representation relating a new frame $I_{t+1}$ with a reference current frame $I_t$. The flow is assumed pixel-wise dense, allowing each pixel to be displaced to a new position, and the resulting pixel arrangement requires interpolation back onto a regular grid. We use bilinear interpolation as it is much more efficient than the thin-plate spline interpolation originally proposed in . Optical flow is a function of parameters $\theta_\Delta$ and is represented with two feature maps corresponding to displacements $(\Delta x, \Delta y)$ for the $x$ and $y$ dimensions, thus a compensated image can be expressed as

$$I'_{t+1}(x, y) = I_{t+1}(x + \Delta x, y + \Delta y). \qquad (5)$$
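A minimal numpy sketch of the dense backward warping with bilinear interpolation that the spatial transformer applies. Displacements here are in pixels rather than the normalised space used by the network, and `warp_bilinear` is an illustrative name:

```python
import numpy as np

def warp_bilinear(image, flow):
    """Warp an (H, W) image by a dense flow of shape (H, W, 2).

    flow[..., 0] / flow[..., 1] hold per-pixel x / y displacements
    in pixels. The image is resampled at the displaced positions with
    bilinear interpolation, clamping coordinates at the border.
    """
    H, W = image.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bot = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bot

# Shifting a horizontal ramp by one pixel in x reproduces the ramp
# shifted by one, clamped at the right border.
img = np.tile(np.arange(5, dtype=np.float32), (5, 1))
flow = np.zeros((5, 5, 2), dtype=np.float32)
flow[..., 0] = 1.0
out = warp_bilinear(img, flow)
print(out[0])  # [1. 2. 3. 4. 4.]
```

Because every step is differentiable with respect to the flow, gradients can propagate from the SR loss back into the flow estimation modules.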
We adopt a multi-scale design to represent the flow, which has been shown to be effective in classical methods [10, 2] and also in more recently proposed spatial transformer techniques [11, 1, 9]. A schematic of the design is shown in Fig. 3 and the flow estimation modules are detailed in Table 1. First, a coarse estimate of the flow is obtained by early fusing the two input frames and downscaling spatial dimensions with strided convolutions. The estimated flow is upscaled with sub-pixel convolution and the result is applied to warp the target frame, producing a coarsely compensated frame. The warped image is then processed together with the coarse flow and the original images through a fine flow estimation module. This uses a single strided convolution with stride 2 and a final upscaling stage to obtain a finer flow map. The final motion compensated frame is obtained by warping the target frame with the total flow, the sum of the coarse and fine flow maps. Output activations use tanh to represent pixel displacement in normalised space, such that a displacement of ±1 means maximum displacement from the centre to the border of the image.
To train the spatial transformer to perform motion compensation we optimise its parameters to minimise the MSE between the transformed frame and the reference frame. Similarly to classical optical flow methods, we found that it is generally helpful to constrain the flow to behave smoothly in space, and so we penalise the Huber loss of the flow map gradients, namely

$$\theta_\Delta^{\ast} = \operatorname*{arg\,min}_{\theta_\Delta} \left\| I_t - I'_{t+1} \right\|_2^2 + \lambda \, \mathcal{H}\left(\partial_x \Delta, \partial_y \Delta\right). \qquad (6)$$

In practice we approximate the Huber loss with $\mathcal{H}(\mathbf{x}) = \sqrt{\epsilon + \sum_i x_i^2}$, where $\epsilon = 0.01$. This function has a smooth behaviour near the origin and is sparsity promoting far from it.
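A sketch of the smoothness penalty under this approximation, assuming forward differences for the flow gradients and a representative ε = 0.01 (the exact value used in the paper is not preserved in this copy; `huber_penalty` is an illustrative name):

```python
import numpy as np

def huber_penalty(flow, eps=0.01):
    """Smooth approximation sqrt(eps + sum of squared flow gradients).

    flow: (H, W, 2) displacement maps. Forward differences along x
    and y are taken for both components; eps keeps the square root
    differentiable at the origin while the function grows like an L1
    norm far from it, promoting sparse (piecewise-smooth) gradients.
    """
    gx = np.diff(flow, axis=1)  # horizontal gradients
    gy = np.diff(flow, axis=0)  # vertical gradients
    return np.sqrt(eps + np.sum(gx ** 2) + np.sum(gy ** 2))

# A perfectly smooth (constant) flow pays only the floor sqrt(eps).
flat = np.zeros((8, 8, 2))
print(huber_penalty(flat))  # 0.1
```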
The spatial transformer module is advantageous relative to other motion compensation mechanisms as it is straightforward to combine with a SR network to perform joint motion compensation and video SR. Referring to Fig. 1, the same parameters $\theta_\Delta$ can be used to model the motion of the outer two frames relative to the central frame. The spatial transformer and SR modules are both differentiable and therefore end-to-end trainable. As a result, they can be jointly optimised to minimise a composite loss combining the accuracy of the reconstruction in Eq. 3 with the fidelity of motion compensation in Eq. 6, namely

$$\theta^{\ast}, \theta_\Delta^{\ast} = \operatorname*{arg\,min}_{\theta, \theta_\Delta} \left\| I^{HR}_t - f\left(I'^{LR}_{[t-1:t+1]}; \theta\right) \right\|_2^2 + \beta \sum_{i \in \{\pm 1\}} \left[ \left\| I^{LR}_t - I'^{LR}_{t+i} \right\|_2^2 + \lambda \, \mathcal{H}\left(\partial_x \Delta^{i}, \partial_y \Delta^{i}\right) \right]. \qquad (7)$$
| Layer | Coarse flow | Fine flow |
|-------|-------------|-----------|
| 1 | Conv k5-n24-s2 / ReLU | Conv k5-n24-s2 / ReLU |
| 2 | Conv k3-n24-s1 / ReLU | Conv k3-n24-s1 / ReLU |
| 3 | Conv k5-n24-s2 / ReLU | Conv k3-n24-s1 / ReLU |
| 4 | Conv k3-n24-s1 / ReLU | Conv k3-n24-s1 / ReLU |
| 5 | Conv k3-n32-s1 / tanh | Conv k3-n8-s1 / tanh |
| 6 | Sub-pixel upscale | Sub-pixel upscale |
In this section, we first analyse spatio-temporal networks for video SR in isolation and later evaluate the benefits of introducing motion compensation. We restrict our experiments to ×3 and ×4 upscaling of full HD video resolution (1080p), and no compression is applied. To ensure a fair comparison of methods, the number of network parameters needs to be comparable so that gains in performance can be attributed to specific choices of network resource allocation and not to a trivial increase in capacity. For a layer $l$, the number of floating-point operations to reconstruct a frame is approximated by

$$\text{ops}_l = 2 \, k_l^2 \, n_{l-1} n_l \, d_l D_l \, H W, \qquad (8)$$

counting one multiplication and one addition per filter weight at each of the $H \times W$ output positions in LR space.
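The per-layer operation count can be sketched as follows, assuming one multiply-add per filter weight per LR output position (a common approximation; the exact bookkeeping in the paper may differ, and `layer_gops` is an illustrative name):

```python
def layer_gops(H, W, k, n_in, n_out, d=1, D=1):
    """Approximate cost in GOps of one convolutional layer applied in
    LR space: 2 * H * W * D * k^2 * d * n_in * n_out multiply-adds,
    where d and D are the filter and layer temporal extents.
    """
    return 2.0 * H * W * D * k**2 * d * n_in * n_out / 1e9

# Rough cost of a 3x3 layer with 24 input and 24 output features on a
# 640x360 LR grid (a 1080p frame downscaled by 3).
print(round(layer_gops(360, 640, 3, 24, 24), 2))  # 2.39
```

Summing this estimate over all layers, and reusing cached activations for slow fusion with shared weights, gives the complexity figures compared across architectures.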
In measuring the complexity of slow fusion networks with weight sharing we look at steady-state operation where the output of some layers is reused from one frame to the following. We note that the analysis of VSRnet variants in  does not take into account model complexity.
We use the CDVL database , which contains uncompressed full HD videos (excluding repeated videos), and choose a subset for training. The videos are downscaled and random samples are extracted from each HR-LR video pair to obtain training samples, a fraction of which is held out for validation. Depending on the network architecture, we refer to a sample as a single input-output frame pair for single frame networks, or as a block of consecutive LR input frames and the corresponding central HR frame for spatio-temporal networks. The remaining videos are used for testing. Although the total number of training frames is large, we foresee that the methods presented could benefit from a richer, more diverse set of videos. Additionally, we present a benchmark against various SR methods on publicly available videos that are recurrently used in the literature and that we refer to as Vid4 (composed of walk, city, calendar and foliage; the sequence city is cropped slightly so its dimensions are divisible by the upscaling factor). Results on Vid4 can be downloaded from https://twitter.box.com/v/vespcn-vid4.
All SR models are trained following the same protocol and share similar hyperparameters. Filter sizes are fixed across experiments, and all non-linearities are rectified linear units except for the output layer, which uses a linear activation. Biases are initialised to a constant and weights use orthogonal initialisation with the gain recommended in . All hidden layers are set to have the same number of features. Video samples are broken into non-overlapping sub-samples of fixed spatial dimensions, which are randomly grouped into batches for stochastic optimisation. We employ Adam  with a fixed learning rate and a small initial batch size that is periodically doubled during training until it reaches a maximum size.
We choose a common number of features for layers where the network temporal depth has collapsed to $1$ (layers in gray in Figs. 1(a), 1(b) and 1(c)), and to maintain comparable network sizes we reduce the number of features in earlier layers that still carry a temporal extent. This ensures that the number of features per hidden layer in early and slow fusion networks is always the same. For instance, in the network shown in Fig. 1(b), the number of features per layer for SR would be 6, 8, 12, 24, 24, …
First, we investigate the impact of the number of input frames on complexity and accuracy without motion compensation. We compare single frame models (SF) against early fusion spatio-temporal models using 3, 5 and 7 input frames (E3, E5 and E7). PSNR results on the CDVL dataset for networks of 6 to 11 layers are plotted in Fig. 4. Exploiting spatio-temporal correlations provides a more accurate result relative to an independent processing of frames. The increase in complexity from early fusion is marginal because only the first layer contributes to an increase of operations.
Although the accuracy of spatio-temporal models is relatively similar, we find that E7 slightly underperforms. It is likely that temporal dependencies beyond 5 frames become too complex for networks to learn useful information and act as noise degrading their performance. Notice also that, whereas the performance increase from network depth is minimal after 8 layers for single frame networks, this increase is more consistent for spatio-temporal models.
Here we compare the different treatments of the temporal dimension discussed in Section 2.2. We assume networks with an input of 5 frames and slow fusion models with filter temporal depths as in Fig. 2. Using SF, E5, S5 and S5-SW to refer to single frame networks and 5 frame input networks using early fusion, slow fusion, and slow fusion with shared weights, we show in Table 2 results for 7 and 9 layer networks.
As seen previously, early fusion networks attain a higher accuracy at a marginal 3% increase in operations relative to the single frame models, and as expected, slow fusion architectures provide efficiency advantages. Slow fusion is faster than early fusion because it uses fewer features in the initial layers. Referring to Eq. 8, slow fusion keeps the feature counts $n_l$ small in the first layers while the temporal dimension is still being merged, which results in fewer operations than running the full feature width from the first layer onwards, as done in early fusion.
While the 7 layer network sees a considerable decrease in accuracy using slow fusion relative to early fusion, the 9 layer network can benefit from the same accuracy while reducing its complexity with slow fusion by about 30%. This suggests that in shallow networks the best use of network resources is to utilise the full network capacity to jointly process all temporal information as done by early fusion, but that in deeper networks slowly fusing the temporal dimension is beneficial, which is in line with the results presented by  for video classification.
Additionally, weight sharing decreases accuracy because of the reduction in network parameters, but the reusability of network features means fewer operations are needed per frame. For instance, the 7 layer S5-SW network shows a reduction of almost 30% of operations with a minimal decrease in accuracy relative to SF. Using 7 layers with E5 nevertheless shows better performance and faster operation than S5-SW with 9 layers, and in all cases we found that early or slow fusion consistently outperformed slow fusion with shared weights in this performance and efficiency trade-off. Convolutions in the spatio-temporal domain were shown in  to work well for video action recognition, but with larger capacity and many more frames processed jointly. We speculate this could be the reason why the conclusions drawn from this high-level vision task do not extrapolate to the SR problem.
In this section, the proposed frame motion compensation is combined with an early fusion network of temporal depth $D_0 = 3$. First, the motion compensation module is trained independently using Eq. 7, where the reconstruction term is ignored. This results in a network that will compensate the motion of three consecutive frames by estimating the flow maps of the outer frames relative to the middle frame. An example of a flow map obtained for one frame is shown in Fig. 5, where we also show the effect the motion compensation module has on three consecutive frames.
| GOps / 1080p frame | Bicubic | SRCNN | ESPCN | VSRnet | VESPCN 5L-E3 | VESPCN 9L-E3-MC |
|---|---|---|---|---|---|---|
| ×3 | – | 233.11 | 9.92 | 1108.73* | 7.96 | 24.23 |
| ×4 | – | 233.11 | 6.08 | 1108.73* | 4.85 | 14.00 |
The early fusion motion compensated SR network (E3-MC) is initialised with a compensation and a SR network pretrained separately, and the full model is then jointly optimised with Eq. 7. Results for SR on CDVL are compared in Table 3 against a single frame (SF) model and early fusion without motion compensation (E3). E3-MC yields a PSNR improvement over SF that is sometimes almost twice that of E3, which we attribute to the fact that the network adapts the SR input to maximise temporal redundancy. In Fig. 6 we show how this improvement is reflected in better structure preservation.
We show in Table 4 the performance on Vid4 for SRCNN , ESPCN , VSRnet  and the proposed method, which we refer to as video ESPCN (VESPCN). To demonstrate its benefits in efficiency and quality we evaluate two early fusion models: a 5 layer 3 frame network (5L-E3) and a 9 layer 3 frame network with motion compensation (9L-E3-MC). The metrics compared are the PSNR, SSIM  and MOVIE  indices. The MOVIE index was designed as a metric measuring video quality that correlates with human perception and incorporates a notion of temporal consistency. We also directly compare the number of operations per frame of all CNN-based approaches for upscaling a generic 1080p frame.
Reconstructions for SRCNN, ESPCN and VSRnet use models provided by the authors. SRCNN, ESPCN and VESPCN were tested on Theano and Lasagne, and for VSRnet we used the available Caffe Matlab code. We crop spatial borders as well as initial and final frames on all reconstructions for a fair comparison against VSRnet. (We used our own implementation of SSIM and use video PSNR instead of averaging individual frame PSNRs as done in , thus values may slightly deviate from those reported in the original papers.)
An example of visual differences against the motion compensated network is shown in Fig. 7. From the close-up images, we see how the structural detail of the original video is better recovered by the proposed VESPCN method. This is reflected in Table 4, where it surpasses all other methods in PSNR and SSIM by a large margin. Figure 7 also shows temporal profiles on the row highlighted by a dashed line through 25 consecutive frames, demonstrating a better temporal coherence of the proposed reconstruction. The improved temporal coherence of VESPCN also explains the significant reduction in the MOVIE index.
The complexity of the methods in Table 4 is determined by network and input image sizes. SRCNN and VSRnet upsample LR images before attempting to super-resolve them, which considerably increases the required number of operations. VSRnet is particularly expensive because it jointly processes multiple bicubically upscaled input frames through its feature layers, whereas sub-pixel convolution greatly reduces the number of operations required in ESPCN and VESPCN. As a reference, ESPCN runs in real-time on a K2 GPU . The enhanced capabilities of spatio-temporal networks allow VESPCN to reduce its network operations relative to ESPCN while still matching its accuracy. As an example we show VESPCN with 5L-E3, which reduces the number of operations by about 20% relative to ESPCN while maintaining a similar performance in all evaluated quality metrics.
The operations for motion compensation in VESPCN with 9L-E3-MC, included in the Table 4 results, represent only a small fraction of the totals shown for ×3 and ×4 upscaling, and are applied twice for each input frame requiring motion compensation. This makes the proposed motion compensated video SR very efficient relative to other approaches. For example, motion compensation in VSRnet is said to require 55 seconds per frame and is the computational bottleneck . This is not accounted for in Table 4 but is orders of magnitude slower than VESPCN with 9L-E3-MC. The optical flow method in VSRnet was originally shown to run in milliseconds on GPU per frame of small dimensions, but this is still considerably slower than the proposed solution considering that motion compensation is required for more than a single frame of HD dimensions.
In this paper we combine the efficiency advantages of sub-pixel convolutions with temporal fusion strategies to present real-time spatio-temporal models for video SR. The spatio-temporal models used are shown to facilitate an improvement in reconstruction accuracy and temporal consistency or reduce computational complexity relative to independent single frame processing. The models investigated are extended with a motion compensation mechanism based on spatial transformer networks that is efficient and jointly trainable for video SR. Results obtained with approaches that incorporate explicit motion compensation are demonstrated to be superior in terms of PSNR and temporal consistency compared to spatio-temporal models alone, and outperform the current state of the art in video SR.
European Conference on Computer Vision (ECCV), 4:25–36, 2004.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Spatio-temporal video autoencoder with differentiable memory. International Conference On Learning Representations (ICLR) Workshop, 2016.