Official repository of FISR (AAAI 2020).
Super-resolution (SR) has been widely used to convert low-resolution legacy videos to high-resolution (HR) ones, to suit the increasing resolution of displays (e.g. UHD TVs). However, it becomes easier for humans to notice motion artifacts (e.g. motion judder) in HR videos being rendered on larger-sized display devices. Thus, broadcasting standards support higher frame rates for UHD (Ultra High Definition) videos (4K@60 fps, 8K@120 fps), meaning that applying SR only is insufficient to produce genuine high quality videos. Hence, to up-convert legacy videos for realistic applications, not only SR but also video frame interpolation (VFI) is necessitated. In this paper, we first propose a joint VFI-SR framework for up-scaling the spatio-temporal resolution of videos from 2K 30 fps to 4K 60 fps. For this, we propose a novel training scheme with a multi-scale temporal loss that imposes temporal regularization on the input video sequence, which can be applied to any general video-related task. The proposed structure is analyzed in depth with extensive experiments.
With the prevalence of high resolution (HR) displays such as UHD TVs or 4K monitors, the demand for higher resolution visual content (videos) is also increasing, with 8K UHD (7680×4320) video services already being supported. Super-resolution (SR) technologies are closely related to this trend, as they can enlarge the spatial resolution of legacy low resolution (LR) videos to higher resolution ones. However, the increase in spatial resolution necessarily entails an increase in temporal resolution, or frame rate, for videos to be properly rendered on larger-sized displays from a perceptual quality perspective. The human visual system (HVS) becomes more sensitive to the temporal distortion of videos as the spatial resolution increases, and tends to easily perceive motion judder (discontinuous motion) artifacts in HR videos, which deteriorates the perceptual quality [Daly2001]. In this regard, the frame rate must be increased from low frame rate (LFR) to high frame rate (HFR) for HR videos to be visually pleasing. This is the reason behind UHD (Ultra High Definition) broadcast standards specifying 60 fps and 120 fps (frames per second) for 4K (3840×2160) and 8K (7680×4320) UHD videos [ETSI2019], compared to the 30 fps of conventional 2K (FHD, 1920×1080) videos.
Therefore, in order to convert legacy 2K 30 fps videos to genuine 4K 60 fps videos that can be viewed on 4K UHD displays, video frame interpolation (VFI) is essential along with SR. Nevertheless, VFI and SR have been intensively but separately studied in low level vision tasks. None of the existing methods have jointly handled both VFI and SR problems, which is a complex task where both the spatial and temporal resolutions must be increased. In this paper, we first propose a joint VFI-SR method, called FISR, that enables the direct conversion of 2K 30 fps videos to 4K 60 fps. We employ a novel training strategy that handles multiple consecutive samples of video frames per each iteration, with a novel temporal loss that exerts temporal regularization across these consecutive samples. This scheme is general and can be applied to any video-related task. To handle the high resolution of 4K UHD, we propose a multi-scale structure trained with the novel temporal loss applied across all scale levels.
Our contributions can be summarized as follows:
We first propose a joint VFI-SR method that can simultaneously increase the spatio-temporal resolution of video sequences.
We propose a novel multi-scale temporal loss that can effectively regularize the spatio-temporal resolution enhancement of video frames with high prediction accuracy.
All our experiments are based on 4K 60 fps video data to account for realistic application scenarios.
The purpose of SR is to recover the lost details of the LR image to reconstruct its HR version. SR is widely used in diverse areas such as medical imaging [Yang et al.2012], satellite imaging [Cao et al.2016], and as pre-processing in person re-identification [Jiao et al.2018]. Various single image SR (SISR) methods have been proposed [Dong et al.2015, Lim et al.2017, Lai et al.2017, Zhang et al.2019], which enhance the spatial resolution by focusing only on the spatial information of the given LR image as shown in Fig. 2 (a).
On the other hand, video SR (VSR) can additionally utilize the temporal information of the consecutive LR frames to enhance the performance. If SISR is independently applied to each of the single frames to generate the VSR results, the output HR videos tend to lack temporal consistency, which may cause flickering artifacts [Shi et al.2016]. Therefore, VSR methods exploit the additional temporal relationships as in Fig. 2 (b), and popular ways to achieve this include simply concatenating the sequential input frames, or adopting 3D convolution filters [Caballero et al.2017, Huang, Wang, and Wang2017, Jo et al.2018, Li et al.2019, Kim et al.2019]. However, these methods tend to fail to capture large motion, where the absolute motion displacements are large, or multiple local motions, due to the simple concatenation of inputs where many frames are processed simultaneously in the earlier part of the network. Furthermore, the use of 3D convolution filters leads to high computational complexity, which may degrade VSR performance when the overall network capacity is restricted. To overcome this issue, various methods have utilized motion information [Makansi, Ilg, and Brox2017, Wang et al.2018, Kalarot and Porikli2019], especially optical flow, to improve the prediction accuracy. While using motion information, Haris et al. [Haris, Shakhnarovich, and Ukita2019] proposed an iterative refinement framework to combine the spatio-temporal information of LR frames by using a recurrent encoder-decoder module. It is worth pointing out that although Vimeo-90K [Xue et al.2017] is a relatively high resolution benchmark dataset used in VSR, it is still insufficient to represent the characteristics of recent UHD video data. Furthermore, none of the aforementioned VSR methods generate HFR frames simultaneously.
The goal of VFI is to generate high quality non-existent middle frames by appropriately combining two original consecutive input frames as in Fig. 2 (c). VFI is highly important in video processing as viewers tend to feel visually comfortable towards HFR videos [Mackin, Zhang, and Bull2015]. VFI has been applied to various applications such as slow motion generation [Jiang et al.2018], frame rate up conversion (FRUC) [Yu and Jeong2019], novel view synthesis [Flynn et al.2016], and frame recovery in video streaming [Wu et al.2015]. The main difficulties in VFI are the consideration of fast object motion and the occlusion problem. Fortunately, with various deep-learning-based methods, VFI has been actively studied and has shown impressive results on LR benchmarks [Niklaus and Liu2018, Liu et al.2019, Bao et al.2019]. Niklaus et al. [Niklaus and Liu2018] proposed a context-aware frame synthesis method where per-pixel context maps are extracted and warped prior to entering a GridNet architecture for enhanced frame interpolation. Liu et al. [Liu et al.2019] proposed a cycle consistency loss that not only forces the network to enhance the interpolation performance, but also makes better use of the training data. Bao et al. [Bao et al.2019] proposed DAIN, which jointly optimizes five different network components to produce a high quality intermediate frame by exploiting depth information.
However, these methods face difficulties with higher resolution videos, where the absolute motion tends to be large, often exceeding the receptive field of the networks, resulting in performance degradation of the interpolated frames. Meyer et al. [Meyer et al.2015] first noticed the weakness of VFI methods for HR videos, and employed a hand-crafted phase-based method. Among deep-learning-based methods, a deep CNN was proposed in IM-Net [Peleg et al.2019] to cover fast motions so that it can handle VFI for higher resolution inputs. However, their testing scenarios were still limited to spatial resolutions lower than 2K videos, which is not adequate for 4K/8K TV displays.
On the other hand, Ahn et al. [Ahn, Jeong, and Kim2019] first proposed a hybrid task-based network for fast and accurate VFI of 4K videos based on a coarse-to-fine approach. To reduce the computational complexity, they first down-sample two HR input frames prior to temporal interpolation (TI), and generate an LR version of the interpolated frame. Then, a spatial interpolation (SI) network takes in the bicubic up-sampled version of the LR interpolated frame concatenated with the original two HR input frames to synthesize the final VFI output. Although their network performs a two-step spatio-temporal resolution enhancement, it should be noted that they take advantage of the original 4K input frames, and their final goal is VFI (not joint VFI-SR) of 4K videos. This is different from our problem of jointly optimizing VFI-SR, which generates the HR-HFR outputs directly from the LR-LFR inputs.
In this paper, we handle the joint VFI-SR, especially for FRUC applications, to generate high quality middle frames with higher spatial resolutions, which enables the direct conversion of 2K 30 fps videos to 4K 60 fps videos, named as frame interpolation and super-resolution (FISR). This is a novel problem, which has not been previously considered.
A common VFI framework involves the prediction of a single middle frame from the input of two consecutive frames as in Fig. 2 (c). In this case, the final HFR video consists of the original input frames alternating with the interpolated frames. However, this scheme cannot be directly applied to joint VFI-SR since the spatial resolutions of the original input frames (LR) and the predicted frames (HR) are different, and there is a resolution mismatch if we wish to insert the input frames among the predicted frames. Therefore, we propose a novel input/output framework as shown in Fig. 2 (d), where three consecutive HR HFR frames are predicted from three consecutive LR LFR frames. That is, for every three consecutive LR input frames, only SR is performed to produce the middle HR output frame ($HR_t$), while joint VFI-SR is performed to synthesize the other two end-frames ($HR_{t-1}$ and $HR_{t+1}$). With the per-frame shift of a sliding temporal window, the frames $HR_{t-1}$ and $HR_{t+1}$ in the current temporal window will overlap with $HR_{t+1}$ from the previous temporal window and $HR_{t-1}$ from the next temporal window, respectively. As blurry frames are produced if the two overlapping frames are averaged, the frame from the later sliding window is used for simplicity.
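The windowing rule above can be sketched at the level of frame indices. This is an illustrative Python sketch rather than the authors' code: the function name `assemble_output_times` and the convention that the LR input frames sit at even time instances 0, 2, 4, ... are assumptions made here.

```python
def assemble_output_times(num_inputs):
    """Map each HR output time to the index of the window that supplies it.

    Assumption: LR input frames sit at even times 0, 2, ..., 2*(num_inputs-1).
    Window k is centered on input k (1 <= k <= num_inputs-2) and predicts the
    HR frames at times 2k-1 (VFI-SR), 2k (SR) and 2k+1 (VFI-SR).
    """
    if num_inputs < 3:
        raise ValueError("at least three input frames are needed")
    source = {}
    for k in range(1, num_inputs - 1):
        center = 2 * k
        for t in (center - 1, center, center + 1):
            # On overlapping frames, the later window overwrites the earlier one.
            source[t] = k
    return dict(sorted(source.items()))


if __name__ == "__main__":
    for t, k in assemble_output_times(5).items():
        print(f"output time {t} <- window {k}")
```

With five input frames, three windows produce seven distinct output frames, and the two overlapping frames are taken from the later window.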
We propose a novel temporal loss for regularization in network training with video sequences. Instead of back-propagating the error at each mini-batch of data samples of three input/predicted frames, a training sample of FISR is composed of five consecutive input frames, thus containing three consecutive data samples with temporal stride 1 and one data sample with temporal stride 2. By considering the relations among these multiple data samples, more temporal regularization is imposed on network training for a more stable prediction. A detailed schema of this multiple data sample training strategy is illustrated in Fig. 3.
As shown in Fig. 3, we let the input frames be $x_t$, where $t$ is the time instance. Then, one training sample consists of five frames, $\{x_{-4}, x_{-2}, x_{0}, x_{+2}, x_{+4}\}$, and each training sample includes three data samples with temporal stride 1, one for each temporal window centered at $t = -2$, $0$, and $+2$, respectively. Their corresponding predictions are denoted by $p_t^w$, where $w$ indicates the $w$-th temporal window, and their ground truth frames are given by $y_t$.
Due to the sliding temporal window within each training sample, there exist two time instances, $t = -1$ and $t = +1$, where the predicted frames overlap across different temporal windows $w$. The temporal matching loss for stride 1 enforces these overlapping frames to be similar to each other.
We also consider an additional data sample with temporal stride 2 within the training sample, centered at 0, which in turn produces three predictions at $t = -2$, $0$, and $+2$, as shown in the yellow boxes in Fig. 3. With the stride 2 predictions, there are three overlapping time instances with the predictions from the stride 1 data samples. Accordingly, the temporal matching loss for stride 2 enforces the stride 2 prediction and the stride 1 prediction at each of these time instances to be similar.
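The structure of one training sample can be enumerated with a short Python sketch. It reflects one reading of the description above (stride 1 inputs spaced 2 apart, stride 2 inputs spaced 4 apart, predictions at the window center and midway to each neighbor); the names `data_samples` and `overlaps` are hypothetical.

```python
def data_samples(center=0):
    """List the four data samples of one five-frame training sample.

    Each value is (input_times, predicted_times). The timing convention is an
    assumption read off the text: predictions sit at the window center and at
    the midpoints between the center and its two neighboring inputs.
    """
    samples = {}
    for w, c in enumerate((center - 2, center, center + 2), start=1):
        samples[f"stride1_w{w}"] = ((c - 2, c, c + 2), (c - 1, c, c + 1))
    samples["stride2"] = (
        (center - 4, center, center + 4),
        (center - 2, center, center + 2),
    )
    return samples


def overlaps(pred_a, pred_b):
    """Time instances predicted by both data samples."""
    return sorted(set(pred_a) & set(pred_b))


if __name__ == "__main__":
    s = data_samples()
    print(overlaps(s["stride1_w1"][1], s["stride1_w2"][1]))
    print(overlaps(s["stride1_w2"][1], s["stride1_w3"][1]))
    for w in (1, 2, 3):
        print(overlaps(s["stride2"][1], s[f"stride1_w{w}"][1]))
```

The stride 1 windows share one predicted frame with each neighbor, and the stride 2 sample shares one predicted frame with each stride 1 window, which is where the two temporal matching losses are applied.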
To further regularize the predictions, we also impose the L2 loss between the mean of the overlapping frames of stride 1 and the corresponding ground truth at each overlapping time instance.
In order to enforce temporal coherence in the predicted frames, we design a simple temporal difference loss, applied to all sets of predictions, where the difference between consecutive predicted frames must be similar to the difference between the corresponding consecutive ground truth frames. The loss is computed for the predictions from the data samples of temporal stride 1, and likewise for the stride 2 predictions.
Lastly, the reconstruction loss is the L2 loss between all predicted frames and the corresponding ground truths. It is computed for the predictions from the data samples of temporal stride 1, and likewise for the stride 2 predictions.
Finally, the total loss is the weighted sum of the losses above, where the weighting parameters $\lambda$ for the corresponding losses are determined empirically. The CNN parameters are updated once for every mini-batch of training samples, each consisting of four data samples (three stride 1 samples and one stride 2 sample).
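The loss terms described in this section can be collected in one place. The equations below are a reconstruction from the prose, not a copy of the typeset formulas: $p_t^w$ denotes the frame predicted at time $t$ by the $w$-th stride 1 window (centered at $c_w = 2w - 4$), $p_t^{s2}$ the stride 2 prediction, and $y_t$ the ground truth. The loss names, the use of an unsquared L2 norm, and the grouping of the seven terms under five weights in the last line are assumptions.

```latex
\begin{align*}
L_{TM}^{1} &= \| p_{-1}^{1} - p_{-1}^{2} \|_2 + \| p_{+1}^{2} - p_{+1}^{3} \|_2 \\
L_{TM}^{2} &= \| p_{-2}^{s2} - p_{-2}^{1} \|_2 + \| p_{0}^{s2} - p_{0}^{2} \|_2
              + \| p_{+2}^{s2} - p_{+2}^{3} \|_2 \\
L_{TMM}    &= \| \tfrac{1}{2}(p_{-1}^{1} + p_{-1}^{2}) - y_{-1} \|_2
              + \| \tfrac{1}{2}(p_{+1}^{2} + p_{+1}^{3}) - y_{+1} \|_2 \\
L_{TD}^{1} &= \sum_{w=1}^{3} \sum_{t \in \{c_w - 1,\, c_w\}}
              \| (p_{t+1}^{w} - p_{t}^{w}) - (y_{t+1} - y_{t}) \|_2 \\
L_{TD}^{2} &= \sum_{t \in \{-2,\, 0\}}
              \| (p_{t+2}^{s2} - p_{t}^{s2}) - (y_{t+2} - y_{t}) \|_2 \\
L_{R}^{1}  &= \sum_{w=1}^{3} \sum_{t = c_w - 1}^{c_w + 1} \| p_{t}^{w} - y_{t} \|_2 \\
L_{R}^{2}  &= \sum_{t \in \{-2,\, 0,\, +2\}} \| p_{t}^{s2} - y_{t} \|_2 \\
L_{T}      &= \lambda_{R} (L_{R}^{1} + L_{R}^{2})
              + \lambda_{TM}^{1} L_{TM}^{1} + \lambda_{TM}^{2} L_{TM}^{2}
              + \lambda_{TMM} L_{TMM}
              + \lambda_{TD} (L_{TD}^{1} + L_{TD}^{2})
\end{align*}
```

The last line plays the role of the total loss that the text refers to as Eq. (8).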
We design a 3-level multi-scale network as shown in Fig. 4, which is beneficial in handling large motion in the HR frames with the enlarged effective receptive fields in the lower scale levels. In levels 1 and 2, the input frames are down-scaled by 4 and 2, respectively, from level 3 of the original scale using a bicubic filter, and all scales employ the same U-Net-based architecture. With the multi-scale structure, a coarse prediction is generated at the lowest scale level, which is then concatenated and progressively refined at the subsequent scale levels. The total loss of Eq. (8) is computed at each of the three scale levels and combined with weighting parameters $\lambda_l$ as $L = \sum_{l=1}^{3} \lambda_l L_T^l$, where $L_T^l$ is the total loss at scale level $l$.
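As a small illustration of the three-level structure, the following Python sketch computes the frame size at each scale level and a weighted sum of per-level losses. The function names are hypothetical, and the weights used in the example are placeholders, not the values used in the paper.

```python
def pyramid_sizes(width, height, factors=(4, 2, 1)):
    """Return (width, height) at scale levels 1, 2, 3 for the given down-scale factors."""
    sizes = []
    for f in factors:
        if width % f or height % f:
            raise ValueError("frame size must be divisible by every factor")
        sizes.append((width // f, height // f))
    return sizes


def multi_scale_loss(level_losses, weights):
    """Weighted sum of the per-level total losses."""
    if len(level_losses) != len(weights):
        raise ValueError("one weight per scale level is required")
    return sum(w * l for w, l in zip(weights, level_losses))


if __name__ == "__main__":
    # A full 2K input frame at the three scale levels.
    print(pyramid_sizes(1920, 1080))
    # Placeholder losses and weights, for illustration only.
    print(multi_scale_loss([0.5, 0.25, 0.125], [1.0, 1.0, 1.0]))
```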
Furthermore, to effectively handle large motion and occlusions, the bidirectional optical flow maps and the corresponding warped frames are stacked with the input frames. We use the pre-trained PWC-Net [Sun et al.2018] to obtain the four bidirectional optical flows between adjacent input frames, and the four flow maps that are concatenated with the input are approximated from the respective flow maps under the linear motion assumption. The corresponding four backward warped frames are estimated from the approximated flows, and are also concatenated along with the input frames.
All convolution filters have the same kernel size, and in the U-Net architecture, the number of output channels is set to 64. The final output channels are 6 for the two VFI-SR frames and 3 for the single SR frame. As PWC-Net was trained on RGB frames, the flows and the warped frames were obtained in the RGB domain, and the warped frames were converted back to YUV for concatenation with the input frames. For all experiments, the scale factor is 2 for both the spatial resolution and the frame rate, to target 2K 30 fps to 4K 60 fps applications, and we use TensorFlow 1.13 in our implementations.
We collected 4K 60 fps videos totaling 21,288 frames that contain 112 scenes with diverse object and camera motions. Among the collected scenes, we selected 88 scenes for training and 10 scenes for testing, both of which contain large object or camera motions. In the 10 scenes for testing, the frame-to-frame pixel displacement ranges up to [-124, 109] pixels, and all 10 scenes contain at least [-103, 94] pixel displacement within the input 2K video frame, quantitatively demonstrating the large motion contained in the data. Additionally, the average motion magnitude in each scene of the 2K frames ranges from 5.61 to 11.40 pixels frame-to-frame, with the total average for all 10 scenes being 7.64 pixels.
To create one training sample, we randomly cropped a series of HR patches at the same location throughout 9 consecutive frames. With the 5-frame input setting as shown in Fig. 3, the 2nd ($t = -3$) to the 8th ($t = +3$) frames were used as the 4K ground truth HR HFR frames, and the five odd-positioned frames ($t = -4, -2, 0, +2, +4$) were bicubic down-scaled to be used as the LR LFR input frames for training, as shown in green and blue boxes, respectively, in Fig. 3. To obtain diverse training samples, each training sample was extracted with a frame stride of 10. By doing so, we constructed 10,086 training samples in total before starting the training process, to avoid the heavy training time required for loading 4K frames at every iteration.
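The frame bookkeeping for one training sample can be sketched as follows. This Python sketch is illustrative only: the function names are hypothetical, and reading "frame stride of 10" as the offset between the first frames of successive samples is an assumption.

```python
def training_sample_indices(start):
    """Frame indices for one training sample taken from 9 consecutive frames.

    `start` is the index of the first of the 9 frames in the 4K 60 fps sequence.
    Returns (input_indices, ground_truth_indices).
    """
    frames = list(range(start, start + 9))
    ground_truth = frames[1:8]  # 2nd to 8th frames: 7 HR HFR targets
    inputs = frames[0::2]       # 1st, 3rd, 5th, 7th, 9th frames: down-scaled to LR LFR
    return inputs, ground_truth


def sample_starts(num_frames, frame_stride=10):
    """Start indices of successive training samples within one scene (assumed rule)."""
    return list(range(0, num_frames - 9 + 1, frame_stride))


if __name__ == "__main__":
    inputs, targets = training_sample_indices(0)
    print("inputs:", inputs)
    print("targets:", targets)
    print("starts in a 30-frame scene:", sample_starts(30))
```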
During the test phase, the test set composed of 10 different scenes with 5 consecutive LR (2K) LFR (30 fps) frames was used, where the full 2K frames were entered as a whole, and the average PSNR and SSIM were measured for a total of 90 predicted frames (3 frames per window (two VFI-SR and one SR) × 3 sliding windows in five consecutive input frames × 10 scenes). The input and the ground truth frames are in YUV channels, and the performance was also measured in the YUV color space.
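The frame count and the PSNR metric can be made concrete with a short Python sketch. It uses the standard PSNR definition on flat sequences of 8-bit pixel values; how the YUV channels are pooled in the reported numbers is not stated here, so the `psnr` helper below is a generic stand-in rather than the exact evaluation code.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


frames_per_window = 3   # two VFI-SR frames and one SR frame
windows_per_scene = 3   # sliding windows over five consecutive input frames
num_scenes = 10
total_frames = frames_per_window * windows_per_scene * num_scenes

if __name__ == "__main__":
    print("predicted frames evaluated:", total_frames)
    print("example PSNR (dB):", psnr([100, 100, 100, 100], [101, 99, 101, 99]))
```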
For training, we adopted the Adam optimizer [Kingma and Ba2015] with an initial learning rate that was reduced by a factor of 10 at the 80th and 90th epochs out of 100 epochs in total. The weights were initialized with Xavier initialization [Glorot and Bengio2010] and the mini-batch size was set to 8. The five weighting parameters for the total temporal loss in Eq. (8) were set empirically. The weighting parameters for the multi-scale loss had to be set carefully, since with certain combinations, a performance drop was observed compared to a single-scale architecture. We found that more emphasis must be imposed on the lower levels, which is consistent with previous work and gave the best results empirically. This is because having an accurate reconstruction to start with is important for the later levels.
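The step schedule described above can be sketched in Python. The initial learning rate is left as a parameter because its value is not given in this text, and counting epochs from 0 is an assumption; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial_lr, milestones=(80, 90), factor=10.0):
    """Step schedule: divide the rate by `factor` at each milestone epoch.

    Epochs are assumed to be counted from 0, so the first drop applies from
    epoch 80 onward and the second from epoch 90 onward.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr / (factor ** drops)


if __name__ == "__main__":
    base = 1.0  # placeholder value, for illustration only
    for epoch in (0, 79, 80, 89, 90, 99):
        print(epoch, learning_rate(epoch, base))
```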
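[Table 1: ablation study on the temporal loss with the multi-scale architecture. Legend: VS: VFI-SR, S: SR; P: PSNR (dB), S: SSIM.]
[Table 2: ablation study on the temporal loss with the single-scale architecture. Legend: VS: VFI-SR, S: SR; P: PSNR (dB).]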
We conducted an ablation study on the components of the temporal loss to analyze their effect. Table 1 shows the average PSNR/SSIM performance of the predicted VFI-SR frames and the SR frames for in-depth analysis. This experiment was performed with the multi-scale architecture without using the optical flow and warped frame inputs, to solely examine the effect of the temporal loss without any additional motion cues. Additionally, we conducted another ablation study on the temporal loss with a single-scale (not multi-scale) U-Net architecture in Table 2 to investigate the effect of the temporal loss in simpler CNN architectures. In this experiment as well, the optical flow and warped frame inputs were not used.
Firstly, the overall PSNR/SSIM values for the SR frames are higher than those of the VFI-SR frames in both Table 1 and Table 2, because VFI-SR is a more complex joint task where spatio-temporal up-scaling must be performed simultaneously, whereas for SR, only the spatial resolution is up-scaled. Secondly, the usage of the two losses related to the sample with temporal stride 2 in column (e) of both tables forces the FISRnet to produce improved reconstruction accuracy by effectively regularizing the temporal relations, with 0.48 dB and 0.53 dB gain over column (d) in Table 1 and Table 2, respectively. Considerable performance gains in SR can also be observed, with 0.38 dB gain from column (d) to (e) in both tables. With the temporal matching loss for the temporal stride 2 frames additionally included as in column (f) of Table 1 and Table 2, gains of 0.51 dB and 0.59 dB in PSNR are obtained for joint VFI-SR, respectively, compared to column (d).
Although the final temporal loss improves the prediction accuracy of the VFI-SR frames by enforcing temporal regularization, there exists a performance trade-off between the predicted VFI-SR and SR frames in both cases. The temporal loss adds regularization in the temporal sense at the cost of lowered accuracy for SR predictions. However, we focus on enhancing the joint VFI-SR performance to increase the overall temporal coherence of the final video results. Fig. 5 shows the visual comparison of the VFI-SR frames with and without the temporal loss in both architectures. It is clear that incorporating the temporal loss helps to enhance the edge details and structural construction of objects in both cases with and without the multi-scale structure.
[Table 3: ablation study on the architecture components. Legend: VS: VFI-SR, S: SR; P: PSNR (dB), S: SSIM.]
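Table 4: Quantitative comparison of FISRnet and the cascaded VFI and SISR methods.

| Order | SR → VFI | SR → VFI | VFI → SR | VFI → SR | Joint | Joint |
|---|---|---|---|---|---|---|
| VFI Method | CyclicGen | CyclicGen | CyclicGen | CyclicGen | FISR-Baseline | FISRnet (Ours) |
| SR Method | EDSR | LapSRN | EDSR | LapSRN | - | - |
| VFI-SR PSNR (dB) | 36.15 | 36.13 | 36.24 | 36.23 | 36.34 | 37.66 |
| SR PSNR (dB) | 49.01 | 48.88 | 49.01 | 48.88 | 49.93 | 47.74 |
| Total PSNR (dB) | 41.66 | 41.60 | 41.71 | 41.65 | 42.16 | 42.00 |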
Another ablation study was conducted on the architecture components as shown in Table 3. We set the baseline network by excluding the multi-scale feature, optical flows and warped frames from the final FISRnet trained without the temporal loss. Each component is cumulatively added from the top row to the bottom row, starting from the baseline network. The temporal loss is again effective, showing 0.44 dB performance gain in PSNR. The multi-scale component also helps to boost the performance, since large motions can be effectively handled in the lower levels of the U-Net with larger receptive fields, guiding the upper level network to learn more efficiently from the coarsely predicted results in this structure. Although the optical flow information results in a marginal performance gain of 0.06 dB in PSNR, additionally providing the warped images as motion information is highly beneficial for VFI-SR, yielding 0.67 dB gain in PSNR if both the optical flow and the warped images are stacked with the input frames in the multi-scale architecture with the temporal loss. Moreover, the components of FISRnet boost the qualitative performance of the VFI-SR frames as shown in Fig. 6. The final FISRnet with all components is able to restore the small letters on the boat, and capture the shapes and patterns of the balls and fingers in Fig. 6.
Since there are no existing joint VFI-SR methods, we conduct an experiment with the cascade of existing VFI and SISR methods with our 4K 60 fps test set. There can be two variations of the cascade connections as shown in Fig. 7. In the first variation as shown in Fig. 7 (a), SR can be performed first to enlarge the spatial resolution of the LR frames, resulting in HR-LFR frames, then VFI can be performed on the up-scaled frames to obtain the HR middle frames for finally generating the HR-HFR video outputs. As for the second variation shown in Fig. 7 (b), the LR middle frames can be produced first to increase the temporal resolution, resulting in LR-HFR frames, and then SR can be performed on all LR frames to generate the final HR-HFR video outputs. For the compared methods, we select the recent CyclicGen [Liu et al.2019] for the VFI method, and cascade EDSR [Lim et al.2017] or LapSRN [Lai et al.2017] as the SR method. For all methods, we used the official codes provided by the authors.
The quantitative comparison between the FISRnet and the cascaded methods is given in Table 4. For the cascade orders, performing VFI followed by SR (VFI → SR) seems to generally show better performance, since from the perspective of the VFI method, it is easier to capture the motion along the temporal evolution of the 2K LR frames (VFI → SR) than along the up-scaled 4K HR frames (SR → VFI), where the absolute motion displacement is larger. Our proposed FISRnet outperforms the four cascaded combinations for the VFI-SR frames with at least 1.42 dB gain in terms of PSNR. Due to the trade-off between VFI-SR and SR performance, the baseline architecture of FISR (FISR-Baseline) shows better performance for SR, outperforming EDSR by 0.92 dB.
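The two gains quoted above can be checked against the Table 4 values with a few lines of Python. The assignment of EDSR and LapSRN to the cascade columns is inferred from the text (EDSR is the SR method that FISR-Baseline is compared against), so the dictionary keys below are an assumption; the PSNR values themselves are copied from Table 4.

```python
# VFI-SR PSNR (dB) of the four cascades, copied from Table 4.
# The EDSR/LapSRN column assignment is inferred, not stated in the table.
vfi_sr_psnr = {
    ("SR->VFI", "EDSR"): 36.15,
    ("SR->VFI", "LapSRN"): 36.13,
    ("VFI->SR", "EDSR"): 36.24,
    ("VFI->SR", "LapSRN"): 36.23,
}
fisrnet_vfi_sr = 37.66    # FISRnet, VFI-SR PSNR (dB)
edsr_sr = 49.01           # cascade with EDSR, SR PSNR (dB)
fisr_baseline_sr = 49.93  # FISR-Baseline, SR PSNR (dB)

# Smallest margin of FISRnet over any cascade on the VFI-SR frames.
best_cascade = max(vfi_sr_psnr.values())
min_vfi_sr_gain = round(fisrnet_vfi_sr - best_cascade, 2)

# Margin of FISR-Baseline over EDSR on the SR frames.
sr_gain_over_edsr = round(fisr_baseline_sr - edsr_sr, 2)

if __name__ == "__main__":
    print(min_vfi_sr_gain, sr_gain_over_edsr)
```

The smallest VFI-SR margin is taken against the best cascade (VFI followed by SR at 36.24 dB), which is why the gain is stated as "at least" 1.42 dB.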
The qualitative comparison of the VFI-SR frames is given in Fig. 1 and Fig. 8. FISRnet accurately reconstructs the objects with realistic textures and sharp edges. Our method is able to capture the texture of the water waves and reconstruct small letters on the ball and the boat in Fig. 1 and Fig. 8. Furthermore, performing VFI followed by SR generates better structural context but often produces blurry edges, while SR followed by VFI restores sharper edge details at the cost of less accurate structural reconstructions (see the corresponding columns in Fig. 8). In the latter case, the motion displacement seems to have exceeded the maximum motion that the network [Liu et al.2019] can handle, due to the large resolution (4K) inputs.
The testing runtime of FISRnet is 2.73 seconds on average for one input test sample of three 2K (1920×1080) resolution frames, which generates two 4K (3840×2160) VFI-SR frames and one 4K SR frame at once, on an NVIDIA TITAN Xp GPU.
In this paper, we first defined a novel problem of joint VFI-SR to directly synthesize high quality HR HFR frames from LR LFR input frames, which can be applied for the direct conversion of 2K 30 fps videos to 4K 60 fps videos. This is a very useful means to generate high quality visual content for premium displays. However, joint VFI-SR is a difficult task, where the spatio-temporal up-scaling must be performed simultaneously to produce non-existent up-scaled frames. We proposed a three-level multi-scale U-Net-based network, called FISRnet, to handle the large motion present in the high resolution data of 2K resolution inputs, trained via the proposed temporal loss with the multiple data sample training strategy that allows for a more stable temporal regularization. Applying the temporal loss exploits the temporal relations existing across the multiple data samples, helping the FISRnet to sharpen the edges and construct the correct shapes of diverse objects. Besides, the temporal loss and the multiple data sample training can be applied to any video-related vision task. We analyzed the effect of the temporal loss and the components of the network architecture with various ablation studies in the Experiment Section, and also demonstrated that our FISRnet outperforms the cascades of existing state-of-the-art VFI and SISR methods. The official Tensorflow code is available at https://github.com/JihyongOh/FISR.
This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00419, Intelligent High Realistic Visual Processing for Smart Broadcasting Media).
[Ahn, Jeong, and Kim2019] A fast 4K video frame interpolation using a hybrid task-based convolutional neural network. Symmetry 11(5):619.