Photorealistic Video Super Resolution

07/20/2018 · by Eduardo Pérez-Pellitero et al.

With the advent of perceptual loss functions, new possibilities in super-resolution have emerged, and we currently have models that successfully generate near-photorealistic high-resolution images from their low-resolution observations. Up to now, however, such approaches have been exclusively limited to single image super-resolution. The application of perceptual loss functions to video processing still entails several challenges, mostly related to the lack of temporal consistency of the generated images, i.e., flickering artifacts. In this work, we present a novel adversarial recurrent network for video upscaling that is able to produce realistic textures in a temporally consistent way. The proposed architecture naturally leverages information from previous frames due to its recurrent design: the input to the generator is composed of the low-resolution image and, additionally, the warped output of the network at the previous step. We also propose additional loss functions to further reinforce temporal consistency in the generated sequences. The experimental validation of our algorithm shows the effectiveness of our approach, which obtains samples that are competitive in terms of perceptual quality while having improved temporal consistency.




1 Introduction

Advances in convolutional neural networks have revolutionized computer vision, and the popular field of super-resolution (SR) has been no exception to this rule: in recent years numerous publications have made great strides towards better reconstructions of high-resolution pictures. One of the most promising new trends in SR is the application of perceptual loss functions rather than the previously ubiquitous optimization of the mean squared error. This paradigm shift has enabled the leap from images with blurred textures to near-photorealistic results in terms of perceived image quality using deep neural networks. Notwithstanding the recent success in single image SR, perceptual losses have not yet been successfully utilized in the video super-resolution (VSR) domain, as they typically introduce artifacts that, while unobtrusive in a single image, emerge as spurious flickering artifacts in video.

In this paper we propose a neural network model that is able to produce sharp videos with fine details while improving its behavior in terms of temporal consistency. The contributions of the paper are: (1) A recurrent generative adversarial model with a video discriminator, (2) a multi-image warping that improves image alignment between adjacent frames, and (3) two novel loss terms that reinforce temporal coherency for consecutive frames.

2 Related work

The task of SR can be split into the groups of single image SR and multi-frame or video SR methods.

Single image SR is one of the most relevant inverse problems in the field of generative image processing tasks [30, 32]. Since the initial work by Dong et al. [5], which applied small convolutional neural networks to the task of single image SR, several better neural network architectures have been proposed that achieve a significantly higher PSNR across various datasets [38, 19, 6, 41, 21, 25, 3]. Generally, advances in network architectures for image detection tasks have also helped in SR, e.g., adding residual connections [13] enables the use of much deeper networks and speeds up training [18]. We refer the reader to Agustsson and Timofte [1] for a survey of the state of the art in single image SR.

Since maximizing for PSNR leads to generally blurry images [36], another line of research has investigated alternative loss functions. Johnson et al. [16] and Dosovitskiy and Brox [8] replace the mean squared error (MSE) in the image space with an MSE measurement in the feature space of large pre-trained image recognition networks. Ledig et al. [23] extend this idea by adding an adversarial loss, and Sajjadi et al. [36] combine perceptual, adversarial and texture synthesis loss terms to produce sharper images with hallucinated details. Although these methods produce detailed images, they typically contain small artifacts that are visible upon close inspection. While such artifacts are bearable in images, they lead to flickering in super-resolved videos. For this reason, applying these perceptual loss functions to the problem of video SR is more involved.

Amongst classical video SR methods, Liu et al. [26] have achieved notable image quality using Bayesian optimization methods, but the computational complexity of the approach prohibits its use in real-time applications. Neural network based approaches include Huang et al. [15], who use a bidirectional recurrent architecture with comparably shallow networks and without explicit motion compensation. More recently, neural network based methods operate on a sliding window of input frames. The main idea of Kappeler et al. [17] is to align and warp neighboring frames to the current frame before all images are fed into a SR network which combines details from all frames into a single image. Inspired by this idea, Caballero et al. [2] take a similar approach but employ a flow estimation network for the frame alignment. Similarly, Makansi et al. [29] use a sliding window approach but combine the frame alignment and SR steps. Tao et al. [42] also propose a method which operates on a stack of video frames. They estimate the motion in the frames and subsequently map them into high-resolution space before another SR network combines the information from all frames. Liu et al. [27] operate on varying numbers of frames at the same time to generate different high-resolution images and then condense the results into a single image in a final step.

For generative video processing methods, temporal consistency of the output is crucial. Since most recent methods operate on a sliding window [2, 42, 27, 29], it is hard to optimize the networks to produce temporally consistent results, as no information about the previously super-resolved frame is directly included in the next step. To account for this, Sajjadi et al. [37] use a frame-recurrent approach where the estimated high-resolution frame of the previous step is fed into the network for the following step. This encourages more temporally consistent results; however, the authors do not explicitly employ a loss term for the temporal consistency of the output.

To the best of our knowledge, VSR methods have so far been restricted to MSE optimization, and recent advances in perceptual image quality in single image SR have not yet been successfully transferred to VSR. A possible explanation is that perceptual losses lead to sharper images, which makes temporal inconsistencies significantly more evident in the results, leading to unpleasant flickering in the high-resolution videos [36].

The style transfer community has faced similar problems in its transition from single-image to video processing. Single-image style-transfer networks may produce very different stylizations for adjacent frames [10], creating strong transitions from frame to frame. Several recent works have overcome this problem by including a temporal-consistency loss that ensures that stylized consecutive frames are similar to each other when warped with the optical flow of the scene [12, 35, 14].

In this work, inspired by the contributions above, we explore the application of perceptual losses for VSR using adversarial training and temporal consistency objectives.

3 Proposed method

Figure 1: Network architectures for generator and discriminator. The previous output frame is warped onto the current frame and mapped to LR with the space to depth transformation before being concatenated to the current LR input frame. The generator follows a ResNet architecture with skip connections around the residual blocks and around the whole network. The discriminator follows the common design pattern of decreasing the spatial dimension of the images while increasing the number of channels after each block.

3.1 Notation and problem statement

VSR aims at upscaling a given LR image sequence {y_t} by a factor of s, so that the estimated sequence {x̂_t} resembles the original sequence {x_t} by some metric. We denote images in the low-resolution domain by y, and ground-truth images in the high-resolution domain by x, whose resolution is s times that of y in each spatial dimension for a given magnification factor s. An estimate of a high-resolution image x is denoted by x̂. We discern within a temporal sequence by a subindex to the image variable, e.g., x_t. We use a superscript w, e.g., x̂^w_{t−1}, to denote an image that has been warped from its time step t−1 to the following frame t.

The proposed architecture is summarized in Figure 1 and will be explained in detail in the following sections. We define an architecture that naturally leverages not only single-image but also inter-frame details present in video sequences by using a recurrent neural network architecture. The previous output frame is aligned through an image alignment network. By including a video discriminator that is only needed at the training stage, we further enable adversarial training, which has proved to be a powerful tool for generating sharper and more realistic images [36, 23].

To the best of our knowledge, the use of perceptual loss functions (i.e., adversarial training) in recurrent architectures for VSR is novel.

3.2 Recurrent generator and video discriminator

Following recent state-of-the-art SR methods for both classical and perceptual loss functions [20, 25, 36, 23], we use deep convolutional neural networks with residual connections. This class of networks facilitates learning the identity mapping and leads to better gradient flow through deep networks. Specifically, we adopt a ResNet architecture for our recurrent generator that is similar to the ones introduced in [23, 36], with some modifications.

Each of the residual blocks is composed of a convolution, a Rectified Linear Unit (ReLU) activation and another convolutional layer following the activation. Previous approaches have applied batch normalization layers in the residual blocks [23], but we choose not to add batch normalization to the generator due to the comparably small batch size, to avoid potential color shift problems, and also taking into account recent evidence hinting that it might be problematic for generative image models [45]. In order to further accelerate and stabilize training, we create an additional skip connection over the whole generator. This means that the network only needs to learn the residual between the nearest neighbor interpolation of the input and the high-resolution ground-truth image, rather than having to pass through all low frequencies as well [36, 41].

We perform most of our convolutions in low-resolution space for a higher receptive field and higher efficiency. Since the input image has a lower dimension than the output image, the generator needs a module that increases the resolution towards the end. There are several ways to do so within a neural network, e.g., transposed convolution layers, interpolation or depth-to-space units (pixelshuffle). In order to avoid potential grid artifacts when introducing the adversarial loss, we perform the upscaling via nearest neighbor interpolation. The upscaling unit is divided into two stages with an intermediate magnification step (e.g., two ×2 stages for a magnification factor of 4). Each of the upscaling stages is composed of a nearest neighbor interpolation, a convolutional layer and a ReLU activation.
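As an illustration, the interpolation step of one such upscaling stage can be sketched in plain Python (the convolution and ReLU that follow it are omitted; the function name is ours):

```python
def nearest_neighbor_upsample(img, factor=2):
    """Upscale a 2-D image (list of rows) by repeating each pixel
    `factor` times horizontally and vertically."""
    out = []
    for row in img:
        up_row = []
        for px in row:
            up_row.extend([px] * factor)
        for _ in range(factor):
            out.append(list(up_row))  # copy per output row
    return out
```

For a ×4 upscale, two such stages with factor=2 are chained, each followed by its convolution and ReLU.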

In contrast to general generative adversarial networks, the input to the proposed generative network is not a random variable but is composed of the low-resolution image y_t (corresponding to the current frame t) and, additionally, the warped output of the network at the previous step, x̂^w_{t−1}. The difference in resolution of these two images is adapted through a space-to-channel layer which decreases the spatial resolution of x̂^w_{t−1} without loss of information. As for the previous-image warping approach, we propose using an n-dimensional optical flow field that is estimated with a separate, non-recurrent network (refer to Section 3.3).
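The space-to-channel (space-to-depth) rearrangement can be sketched as follows, assuming a single-channel image stored as nested lists (the function name is illustrative):

```python
def space_to_depth(img, r=2):
    """Rearrange an H x W image into an (H/r) x (W/r) grid of
    r*r-channel vectors: spatial resolution drops by r in each
    dimension, but no pixel value is discarded."""
    h, w = len(img), len(img[0])
    assert h % r == 0 and w % r == 0
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            # each r x r block becomes one multi-channel "pixel"
            block = [img[i + di][j + dj] for di in range(r) for dj in range(r)]
            row.append(block)
        out.append(row)
    return out
```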

Our discriminator follows common design choices and is composed of strided convolutions, batch normalization and leaky ReLU activations that progressively decrease the spatial resolution of the activations while increasing the channel count [23, 33, 36]. The last stage of the discriminator is composed of two dense layers and a sigmoid activation function. Differently from single-image SR approaches, we let the discriminator see the stream of images produced by the unfolded generator, e.g., 10 images. This enables the discriminator to also evaluate temporal consistency in the classification of fake and real sequences.

3.3 Resampling and alignment between frames

Many classic vision tasks do not deal with single images but rather with streams of images that expand the temporal dimension. Optical flow has been widely used as a representation that describes temporal relationships within images, and thus enables temporal cues in learning. Optical flow represents the motion perceived by the camera by two image fields (u, v), where each element describes the pixel-wise vertical and horizontal displacement. Recently, there have been several contributions on optical flow estimation via deep neural networks, e.g., FlowNet [9] and SPyNet [34]. The latter method uses the resampling operation (i.e., bilinear warping) in order to progressively refine its flow estimates. Within VSR, most of the motion-aware methods [37, 24] use optical flow fields to warp and align images. In VSR, however, motion compensation networks are generally trained on the image warping error, and the decisive factor is not so much how accurate the flow fields are but rather how accurate the warped image is. Also, often only the warped and aligned image is used for inference and the information of the flow fields is otherwise discarded, even though it might carry meaningful motion information.
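For reference, bilinear warping of an image by a dense flow field, as used by these resampling-based methods, can be sketched in plain Python with border clamping (function names are ours):

```python
import math

def bilinear_sample(img, y, x):
    """Sample img at the fractional location (y, x) via bilinear
    interpolation; out-of-bounds indices are clamped to the border."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    dy, dx = y - y0, x - x0
    def at(i, j):
        return img[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]
    return ((1 - dy) * (1 - dx) * at(y0, x0)
            + (1 - dy) * dx * at(y0, x0 + 1)
            + dy * (1 - dx) * at(y0 + 1, x0)
            + dy * dx * at(y0 + 1, x0 + 1))

def warp(img, flow_y, flow_x):
    """Warp img by per-pixel displacements (flow_y, flow_x)."""
    h, w = len(img), len(img[0])
    return [[bilinear_sample(img, i + flow_y[i][j], j + flow_x[i][j])
             for j in range(w)] for i in range(h)]
```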

In this paper we propose a new paradigm for warping and aligning images. Traditional optical flow algorithms estimate just one coordinate per input pixel. Instead, we propose estimating n coordinates c_1, …, c_n per pixel, whose corresponding pixel values are then linearly combined via a set of corresponding weights w_1, …, w_n. The warped image is obtained as follows:

    x̂^w = Σ_{i=1}^{n} w_i ⊙ S(x̂, c_i),    (1)

where the operator ⊙ performs element-wise multiplication, S(·, c_i) samples the image at the spatial locations c_i and w_i is a matrix of weights matching the image size. The proposed multi-image warping scheme allows each warped pixel to be composed as a non-rigid linear combination of several pixels, thus being more expressive and also effectively bypassing the limiting factor of the bilinear interpolation as the last step of the image warping, i.e., the network can improve the final image by combining the intermediate images. We show warping accuracy with respect to n in Figure 2.
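A minimal sketch of this linear combination, assuming the n sampled (pre-warped) images and the per-pixel weight maps have already been produced by the alignment network (names are illustrative):

```python
def multi_warp(sampled, weights):
    """Linearly combine n pre-warped images with per-pixel weight maps:
    each output pixel is the sum over i of weights[i] * sampled[i],
    applied element-wise."""
    h, w = len(sampled[0]), len(sampled[0][0])
    out = [[0.0] * w for _ in range(h)]
    for img, wmap in zip(sampled, weights):
        for i in range(h):
            for j in range(w):
                out[i][j] += wmap[i][j] * img[i][j]
    return out
```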

In our architecture (see Figure 1) we also bridge the last feature activation of our motion compensation network to the image generator in order to provide motion-aware features to the generator and to improve information flow (i.e. as opposed to only passing the warped image).

Figure 2: Warping error PSNR vs. number of coordinates n used for the image alignment (see Equation 1) for the seq12 dataset. Previous works only estimate and apply n = 1, which corresponds to traditional optical flow. By increasing the number of estimated coordinates and linearly combining their corresponding pixel values, the warping error is greatly reduced.

3.4 Losses

Upscaling video sequences has the additional challenge of respecting the original temporal consistency between adjacent frames so that the estimated video does not present unpleasant flickering artifacts.

When minimizing only MSE, such artifacts are less noticeable for two main reasons: (1) MSE minimization often converges to the mean in textured regions, and thus flickering is reduced, and (2) the pixel-wise MSE with respect to the ground truth (GT) is, up to a certain point, enforcing the inter-frame consistency present in the training images. However, when adding an adversarial loss term, the difficulty of maintaining temporal consistency increases. Adversarial training aims at generating samples that lie in the manifold of images, and thus it generates high-frequency content that will hardly be pixel-wise accurate to any ground-truth image.

The architecture presented in Section 3.2 is naturally able to learn temporal dependencies thanks to its recurrent design and its multi-frame discriminator. We train it with L1, texture and adversarial losses. Additionally, we introduce two novel losses in order to further reinforce temporal consistency.

3.4.1 L1 distance

MSE is by far the most common loss in the SR literature as it is well-understood and easy to compute. It accurately captures sharp edges and contours, but it leads to over-smooth and flat textures, as the reconstruction of high-frequency areas falls to the local mean rather than to a realistic mode [36]. Recently, the L1 distance (absolute error) has also been used for image restoration, as it behaves similarly to the well-known L2 distance and has been reported to perform slightly better.

The pixel-wise L1 distance is defined as follows:

    L_{L1} = ‖x̂_t − x_t‖_1,

where x̂_t denotes the estimated image of the generator for frame t and x_t denotes the ground-truth HR frame.
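A minimal sketch of this pixel-wise distance, here averaged over all pixels (the mean normalization is an assumption; the excerpt does not specify the reduction):

```python
def l1_loss(est, gt):
    """Mean absolute error between an estimated frame and its
    ground-truth frame, both stored as nested lists."""
    total, count = 0.0, 0
    for row_e, row_g in zip(est, gt):
        for e, g in zip(row_e, row_g):
            total += abs(e - g)
            count += 1
    return total / count
```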

3.4.2 Adversarial Loss

Figure 3: Unfolded recurrent generator and discriminator during training for 3 temporal steps. The output of the previous time step is fed into the generator for the next iteration. Note that the network weights are shared across the different time steps. Gradients of all losses during training pass through the whole unrolled configuration of network instances. The discriminator receives as input as many images as there are temporal steps during training.

Generative Adversarial Networks (GANs) [11] and their characteristic adversarial training scheme have been a very active research field in recent years, defining a wide landscape of applications. In GANs, a generative model is obtained by simultaneously training an additional network. A generative model G (i.e., generator) that learns to produce samples close to the data distribution of the training set is trained along with a discriminative model D (i.e., discriminator) that estimates the probability of a given sample belonging to the training set or, instead, having been generated by G. The objective of G is to maximize the errors committed by D, whereas the training of D should minimize its own errors, leading to a two-player minimax game.

Similar to previous single-image SR approaches [23, 36], the input to the generator G is not a random vector but an LR image (in our case, with an additional recurrent input), and thus the generator minimizes the following loss:

    L_A = −log D(x̂_1 ⊕ x̂_2 ⊕ ⋯ ⊕ x̂_T),

where the operator ⊕ denotes concatenation over the T temporal steps. The discriminator minimizes:

    L_D = −log D(x_1 ⊕ ⋯ ⊕ x_T) − log(1 − D(x̂_1 ⊕ ⋯ ⊕ x̂_T)).
In this specific discriminator set-up, we would like to remark that the adversarial loss enforces temporal consistency as well: differently from single-image SR, in our architecture the discriminator has access to multiple frames and can thus leverage information contained along the temporal dimension in order to classify its inputs.
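The two adversarial terms can be sketched for a single discriminator score; here d_real and d_fake stand for the discriminator's scalar output on the concatenated real and generated sequences (names are ours):

```python
import math

def generator_loss(d_fake):
    """Non-saturating generator loss: push the discriminator's score
    on the fake sequence toward 1. d_fake must lie in (0, 1]."""
    return -math.log(d_fake)

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for the discriminator: score real
    sequences as 1 and generated sequences as 0."""
    return -math.log(d_real) - math.log(1.0 - d_fake)
```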

3.4.3 Texture Loss

For the purpose of image style transfer, Gatys et al. [10] found that feature correlations from pre-trained convolutional neural networks capture the style of images, which can then be used for realistic texture synthesis. Sajjadi et al. [36] propose to apply this method to SR in order to match the texture of the generated images to the original high-resolution textures at training time. To this end, we compute VGG [39] features φ(·) for both ground-truth and generated images and compute the corresponding gram matrices G(φ) = φφᵀ. The final loss term reads:

    L_T = ‖G(φ(x̂_t)) − G(φ(x_t))‖_1.
The texture loss term encourages sharper and more accurate textures in the generated videos and furthermore stabilizes training, which is important given the recurrence loop.
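A sketch of the gram-matrix computation and the resulting texture distance, on feature maps flattened to per-channel activation lists (the mean-absolute reduction is an assumption; names are ours):

```python
def gram_matrix(features):
    """Channel-by-channel inner products of a feature map.
    features: list of C channels, each a flat list of N activations."""
    c, n = len(features), len(features[0])
    return [[sum(features[a][k] * features[b][k] for k in range(n)) / n
             for b in range(c)] for a in range(c)]

def texture_loss(feat_est, feat_gt):
    """Mean absolute difference between the two gram matrices."""
    g1, g2 = gram_matrix(feat_est), gram_matrix(feat_gt)
    c = len(g1)
    return sum(abs(g1[a][b] - g2[a][b])
               for a in range(c) for b in range(c)) / (c * c)
```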

3.4.4 Static Temporal Loss

When warping an image to compensate for its motion, most of its high-frequency content is filtered out (the warping operation behaves as a low-pass filter in the frequency domain), and thus comparing warped images is not effective for evaluating or avoiding flickering artifacts. Additionally, flickering artifacts are most noticeable when they occur in regions of the video that are still (i.e., there is no motion across frames). For that purpose, we propose the static temporal loss L_ST. This loss computes the difference across frames (without warping) only for regions where there is no pixel-value variation in the ground-truth images. First, we compute the following mask from the ground-truth images:

    μ_t = exp(−α |x_t − x_{t−1}|),

where the exponential is applied element-wise and α is sufficiently large to obtain fast transitions from 1 to 0 whenever the frame difference is non-zero (we fix α to a large constant). We then compute the distance between consecutive estimated images and apply μ_t to it, thus obtaining an L1 loss signal for those regions that should remain static:

    L_ST = ‖μ_t ⊙ (x̂_t − x̂_{t−1})‖_1.
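The mask and the masked L1 term can be sketched as follows; the value alpha=100.0 is an illustrative default, not a value taken from the paper:

```python
import math

def static_mask(gt_prev, gt_cur, alpha=100.0):
    """Per-pixel mask that is ~1 where the ground truth is static and
    decays quickly toward 0 where consecutive frames differ."""
    return [[math.exp(-alpha * abs(a - b)) for a, b in zip(r1, r2)]
            for r1, r2 in zip(gt_prev, gt_cur)]

def static_temporal_loss(est_prev, est_cur, gt_prev, gt_cur, alpha=100.0):
    """Mean L1 difference of consecutive estimates, counted only in
    regions where the ground truth does not move."""
    mask = static_mask(gt_prev, gt_cur, alpha)
    h, w = len(mask), len(mask[0])
    return sum(mask[i][j] * abs(est_cur[i][j] - est_prev[i][j])
               for i in range(h) for j in range(w)) / (h * w)
```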
3.4.5 Temporal Statistics Loss

In order to reproduce the temporal characteristics of the original sequence without any direct pixel-wise comparison, we compute the variance over the temporal dimension and match the statistics of the estimated images to those of the original sequence.

We compute at each image location the variance across time, both for the GT images, σ²(x), and for the estimated images, σ²(x̂). These statistics represent how much variation there is at a given image location across time, and are thus representative of the temporal consistency at each pixel location. We compute the loss term as follows:

    L_TS = ‖σ²(x) − σ²(x̂)‖_1.
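A sketch of the per-pixel temporal variance and the resulting statistics-matching term (the mean-absolute reduction is an assumption; names are ours):

```python
def temporal_variance(frames):
    """Per-pixel variance across the time dimension of a frame stack
    (a list of H x W frames)."""
    t = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [f[i][j] for f in frames]
            mean = sum(vals) / t
            out[i][j] = sum((v - mean) ** 2 for v in vals) / t
    return out

def temporal_stats_loss(est_frames, gt_frames):
    """Match the per-pixel temporal variance of the estimates to the GT."""
    ve, vg = temporal_variance(est_frames), temporal_variance(gt_frames)
    h, w = len(ve), len(ve[0])
    return sum(abs(ve[i][j] - vg[i][j])
               for i in range(h) for j in range(w)) / (h * w)
```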
4 Results

4.1 Training and parameters

Pixel-error objective Perceptual objective
PSNR 22.443 23.898 24.389 25.210 25.226 20.886 19.783 22.079 22.534 22.235 22.699
SSIM 0.741 0.808 0.831 0.869 0.865 0.683 0.642 0.758 0.775 0.761 0.784
LPIPS 0.489 0.345 0.323 0.251 0.248 0.253 0.277 0.225 0.220 0.225 0.202
NIQE 11.338 7.850 7.770 6.311 6.499 6.470 3.916 5.959 5.986 5.745 5.055
Static Loss 32.781 30.974 31.032 30.824 31.085 25.415 24.303 26.323 27.699 26.541 27.702
Var. Dist. 21.762 23.383 23.730 24.342 24.146 22.792 22.324 23.368 23.478 23.423 23.587
Warping err. 25.956 23.003 22.624 21.856 22.207 19.873 19.156 20.194 20.915 20.308 20.787
tLPIPS 0.143 0.125 0.136 0.094 0.098 0.504 0.866 0.658 0.496 0.607 0.459
Table 1: Experimental validation of our proposed architecture on the vid4 dataset. The table is separated into pixel-error objective and perceptual objective methods. Best in bold and runner-ups in blue (per category).
bicubic ENet SRGAN
PSNR 27.131 29.750 25.131 23.918 26.411 26.833
SSIM 0.864 0.921 0.815 0.781 0.860 0.869
LPIPS 0.384 0.204 0.213 0.243 0.154 0.146
NIQE 3.895 3.814 4.970 4.141 3.397 3.332
Static Loss 32.046 31.309 27.300 26.341 27.941 28.722
Var. Dist. 24.590 26.877 25.448 24.692 25.977 26.112
Warping err. 21.825 20.296 19.243 19.059 19.319 19.664
tLPIPS 0.145 0.082 0.383 0.622 0.525 0.417
Table 2: Experimental evaluation of our proposed architecture on the seq12 dataset. Best in bold and runner-ups in blue.

Our model falls into the category of recurrent neural networks and must thus be trained via Back-Propagation Through Time (BPTT) [44], which is a finite approximation of the infinite recurrent loop created in the model. In practice, BPTT unfolds the network into several temporal steps, where each of those steps is a copy of the network sharing the same parameters. The back-propagation algorithm is then used to obtain gradients of the loss with respect to the parameters. We show an example of an unfolded recurrent generator and discriminator in Figure 3. We select 10 temporal steps for our training approximation and fix the number of warping coordinates n for our image alignment network. We choose a depth of 10 residual blocks for the image alignment and generator networks.
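The unrolling itself reduces to feeding each step's output back in as the next step's recurrent input; a toy sketch (the warping of the previous output is omitted for brevity, and `generator` stands for any callable):

```python
def unfold(generator, lr_frames, init_prev):
    """Unroll a recurrent generator over a training clip: each step
    receives the current LR frame and the previous step's output.
    In the real model the previous output is first warped with the
    estimated optical flow before being fed back in."""
    outputs, prev = [], init_prev
    for lr in lr_frames:
        prev = generator(lr, prev)
        outputs.append(prev)
    return outputs
```

With weight sharing across the copies, gradients of all losses flow through every step of this unrolled chain.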

Our training set is composed of downloaded 4k videos that are first downscaled, and from which we extract around 3M HR crops that serve as ground-truth images; these are further downsampled by the magnification factor to obtain the LR inputs. The training dataset is thus composed of around 300k sequences of 10 frames each (i.e., around 300k data points for the recurrent network). We also compile a testing set, larger than previous testing sets in the literature, favoring sharp 4k content that is further downsampled for the GT and LR inputs. This dataset contains 12 diverse sequences (e.g., landscapes, natural wildlife, urban scenes) ranging from very little to fast motion. Each sequence contains 100 to 150 frames (1281 frames in total).

We train on batches of sequences, i.e., each batch contains the batch size times 10 training images. All models are pre-trained with the L1 loss for about 2 epochs and then trained with the full set of losses for about 4 epochs more. The individual loss terms are combined with fixed weights.
Training was performed on Nvidia Tesla P100 and V100 GPUs, both of which have 16 GB of memory.

4.2 Evaluation

Figure 4: Image close-ups for visual inspection and qualitative assessment. The close-ups have been extracted from the following sequences (left to right): newyork, mountain, tikal2, newyork, monkey.

Models: We performed exhaustive evaluation of intra-frame quality and temporal consistency on the vid4 and seq12 datasets. In Table 1 (right side) and Table 2 we compare perceptual image quality and temporal consistency metrics against other generative SR methods, namely SRGAN [23] (pretrained model obtained from [7]) and Enhancenet [36] (code and pre-trained network weights obtained from the authors' website). To the best of our knowledge, no perceptually driven VSR methods have been published to date. In Table 1 (left side) we also compare our method with VSR methods based on MSE optimization: Frame-Recurrent Video Super-Resolution (FRVSR) [37], Detail-Revealing Deep VSR (DRDVSR) [43], and Robust VSR with learned temporal dynamics [28]. For these methods, we compute the evaluation metrics on the vid4 image results collected from the authors' websites.

Intra-frame quality: Even though it is trivial for humans to evaluate the perceived similarity between two images, the underlying principles of human perception are still not well understood. Traditional metrics such as PSNR and Structural Similarity (SSIM) still rely on well-aligned, pixel-wise accurate estimates. In order to evaluate image samples from models that deviate from the MSE minimization scheme, other metrics need to be considered.

Mittal et al. [31] introduced the no-reference Natural Image Quality Evaluator (NIQE), which quantifies perceptual quality by the deviation from natural image statistics in the spatial, wavelet and DCT domains. Zhang et al. [46] recently proposed the Learned Perceptual Image Patch Similarity (LPIPS), which explores the capabilities of deep architectures to capture perceptual features that are meaningful for similarity assessment. In their exhaustive evaluation they show how deep features of different architectures outperform previous metrics by substantial margins and correlate very well with subjective human scores. They conclude that deep networks, regardless of the specific architecture, capture important perceptual features that are well-aligned with those of the human visual system.

We evaluate our testing sets with PSNR, SSIM, NIQE (we fit the NIQE model to the GT image statistics of vid4 and seq12 separately) and LPIPS using the AlexNet architecture with an additional linear calibration layer, as the authors propose in their manuscript. We show these scores in Table 1 and Table 2, and we show some image crops in Figure 4 for qualitative evaluation. Our method trained with the full loss obtains the best LPIPS scores for both vid4 and seq12, and qualitative inspection of the image crops suggests that even though SRGAN and Enhancenet do generate fine textures, they also tend to deviate to a higher degree from plausible texture patterns (e.g., over-sharpening in Figure 4, monkey).

Temporal Consistency: Evaluating the temporal consistency over adjacent frames in a sequence where the ground-truth optical flow is not known is an open problem. It is common in the literature to compute the warping error across consecutive frames with a flow estimator [12, 14, 22]. We compute the warping error PSNR with flow estimates from PWC-Net [40]. Please refer to the supplementary material for further discussion on this metric.

We also include tLPIPS as proposed by [4], which computes the LPIPS distance between consecutive estimated frames and references it to that of the corresponding ground-truth frames. Additionally, we evaluate temporal consistency with the static loss L_ST and the variance distance L_TS, both of them relating to reliable ground-truth data and with a simple interpretation (motionless flickering and differences in pixel statistics across time). For a clearer comparison, we present these two temporal metrics in logarithmic scale. We show the results in Tables 1 and 2. As expected, there is a gap between methods optimized with pixel distances (e.g., FRVSR) and generative algorithms. Within the latter, all the configurations of our model perform well in all temporal metrics, even when we do not minimize any of the proposed temporal losses directly, which supports the effectiveness of the video discriminator and the recurrent generator. Our full model is the best performer among the perceptual methods. In contrast, models that are not aware of the temporal dimension (such as Enhancenet or SRGAN) obtain worse scores, with SRGAN being the worst performer (in line with what qualitative inspection of the video sequences suggests).

Ablation Study: We show in Table 1 an ablation study of our proposed architecture trained with different combinations of the proposed loss terms. Firstly, we would like to remark that our adversarial configuration includes a video discriminator, and thus its temporal consistency is already improved when compared to SRGAN or Enhancenet. When moving from the pixel-wise to the adversarial objective, we observe how perceptual quality metrics improve while all temporal consistency metrics degrade (i.e., temporal consistency in perceptually driven methods is challenging). When separately adding the static loss and the temporal statistics loss to the adversarial objective, we observe that both loss terms improve temporal consistency as well as quality metrics, which suggests that temporal consistency also helps to obtain better intra-frame quality. The static loss has a higher impact on the temporal consistency metrics than the temporal statistics loss. Finally, when optimizing our full proposed loss function we further improve temporal consistency and quality over both individual configurations.

5 Conclusions

We present a novel generative adversarial model for video upscaling. Differently from previous approaches to video super-resolution based on MSE minimization, we use an adversarial loss function in order to recover videos with photorealistic textures. To the best of our knowledge, this is the first work that applies perceptual loss functions to the task of video super-resolution.

In order to tackle the problem of lacking temporal consistency, we propose three contributions: (1) a recurrent generative adversarial model with a video discriminator, (2) a multi-image warping that improves image alignment between adjacent frames, and (3) two novel loss terms that reinforce temporal coherency for consecutive frames. We conducted exhaustive quantitative evaluation of both intra-frame quality and temporal consistency on the vid4 and seq12 (1281 frames) datasets. Our method obtains state-of-the-art results in terms of LPIPS and NIQE scores, and it improves temporal consistency when compared to other generative models such as SRGAN or Enhancenet.


  • [1] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPR workshop, 2017.
  • [2] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In CVPR, 2017.
  • [3] C. Chen, X. Tian, F. Wu, and Z. Xiong. UDNet: Up-down network for compact and efficient feature representation in image super-resolution. In ICCV, 2017.
  • [4] M. Chu, Y. Xie, L. Leal-Taixé, and N. Thuerey. Temporally coherent GANs for video super-resolution (TecoGAN), 2018.
  • [5] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
  • [6] C. Dong, C. C. Loy, and X. Tang. Accelerating the super-resolution convolutional neural network. In ECCV, 2016.
  • [7] H. Dong, A. Supratak, L. Mai, F. Liu, A. Oehmichen, S. Yu, and Y. Guo. SRGAN implementation, TensorLayer.
  • [8] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.
  • [9] A. Dosovitskiy, P. Fischery, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
  • [10] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [12] A. Gupta, J. Johnson, A. Alahi, and L. Fei-Fei. Characterizing and improving stability in neural style transfer. In ICCV, 2017.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [14] H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and W. Liu. Real-time neural style transfer for videos. In CVPR, 2017.
  • [15] Y. Huang, W. Wang, and L. Wang. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In NIPS, 2015.
  • [16] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [17] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos. Video super-resolution with convolutional neural networks. In IEEE Transactions on Computational Imaging, 2016.
  • [18] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
  • [19] J. Kim, J. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, 2016.
  • [20] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
  • [21] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, 2017.
  • [22] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-H. Yang. Learning blind video temporal consistency. In ECCV, 2018.
  • [23] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [24] D. Li, Y. Liu, and Z. Wang. Video super-resolution using motion compensation and residual bidirectional recurrent convolutional network. In ICIP, 2017.
  • [25] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In CVPR workshop, 2017.
  • [26] C. Liu and D. Sun. A Bayesian approach to adaptive video super resolution. In CVPR, 2011.
  • [27] D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, and T. Huang. Robust video super-resolution with learned temporal dynamics. In CVPR, 2017.
  • [28] D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, and T. Huang. Robust video super-resolution with learned temporal dynamics. In CVPR, pages 2507–2515, 2017.
  • [29] O. Makansi, E. Ilg, and T. Brox. End-to-end learning of video super-resolution with motion compensation. In GCPR, 2017.
  • [30] P. Milanfar. Super-resolution Imaging. CRC press, 2010.
  • [31] A. Mittal, R. Soundararajan, and A. C. Bovik. Making a completely blind image quality analyzer. IEEE Signal Processing Letters, 20(3), 2013.
  • [32] K. Nasrollahi and T. B. Moeslund. Super-resolution: A comprehensive survey. Machine Vision and Applications, 2014.
  • [33] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [34] A. Ranjan and M. Black. Optical flow estimation using a spatial pyramid network. In CVPR, 2017.
  • [35] M. Ruder, A. Dosovitskiy, and T. Brox. Artistic style transfer for videos. GCPR, 2016.
  • [36] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In ICCV, 2017.
  • [37] M. S. M. Sajjadi, R. Vemulapalli, and M. Brown. Frame-Recurrent Video Super-Resolution. In CVPR, 2018.
  • [38] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
  • [39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
  • [41] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In CVPR, 2017.
  • [42] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia. Detail-revealing deep video super-resolution. In ICCV, 2017.
  • [43] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia. Detail-revealing deep video super-resolution. In ICCV, 2017.
  • [44] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, pages 1550–1560, 1990.
  • [45] S. Xiang and H. Li. On the effects of batch and weight normalization in generative adversarial networks. arXiv:1704.03971v4, 2017.
  • [46] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv:1801.03924, 2018.