Zoom-In-to-Check: Boosting Video Interpolation via Instance-level Discrimination

12/04/2018 ∙ by Liangzhe Yuan, et al. ∙ 0

We propose a lightweight video frame interpolation algorithm. Our key innovation is an instance-level supervision that allows information to be learned from high-resolution versions of similar objects. Our experiments show that the proposed method generates state-of-the-art results across different datasets while using a fraction of the computation resources (time and memory) of competing methods. Given two image frames, a cascade network creates an intermediate frame with 1) a flow-warping module that computes large bi-directional optical flow and creates an interpolated image via flow-based warping, followed by 2) an image synthesis module that makes fine-scale corrections. In the learning stage, object detection proposals are generated on the interpolated image. Lower-resolution objects are zoomed into, and an adversarial loss trained on high-resolution objects guides the system towards instance-level refinement that corrects details of object shape and boundaries. As all our proposed network modules are fully convolutional, the system can be trained end-to-end.




1 Introduction

High-fidelity video frame interpolation has uses in novel-view rendering, video compression, and frame rate conversion. Flow-based synthesis algorithms [1][9][18][21] generate realistic colors and patterns, as the pixels are explicitly copied from the given frames. However, for challenging scenes with heavy occlusion, complex deformation, or fast motion, flow-based interpolation suffers from inaccurate optical flow estimation. To compensate for optical flow error, one can add another network to refine the interpolation results [21][18], at a much higher computational cost. Alternatively, kernel-based interpolation approaches seek to learn a pixel-level transformation from the input images to the target frame without requiring precise per-pixel flow estimation. The size of the blending kernels in these methods directly restricts the motion that the network is able to capture. To allow larger motion, a large kernel is used in [19], which results in heavy memory and computation usage.

Generative Adversarial Networks can hallucinate pixels on occluded objects [14] or sharpen blurry images [12]. However, without accurate motion estimation, such generative models are susceptible to mode collapse, resulting in a severe over-fitting issue: when an object is blurry, the model favors removing the object altogether.

We propose a lightweight video synthesis system that combines the benefits of each of the approaches above. Our system consists of a two-stage interpolation network: a cascade design with a flow-based module followed by a kernel-based module. In the first module, we learn a bi-directional optical flow estimation and a sampling map that captures figure-ground (dis)occlusion. This flow-based interpolation module can deal with large motion displacement and generate a roughly correct interpolated image by flow-based warping.

An image synthesis module combines the warped image and its CNN feature map to make fine-scale corrections. Our cascade network design substantially reduces the computational resources needed at the inference stage, as it requires neither 1) the large network size needed to estimate accurate optical flow, nor 2) the large kernel sizes needed to preserve clean boundaries and capture large motion.

We use a combination of a synthesis loss and an adversarial loss to train the network. Image-level supervision has a tendency to remove object details, particularly when the optical flow is blurry. We propose a region-of-interest (ROI) discriminator to focus our system on the fine details of individual moving objects.

Simply zooming in is not sufficient: if the 'ground-truth' reference images also lack detail due to low resolution or motion blur, the network does not receive enough feedback on how to correct its mistakes.

Our key observation is that a video often contains similar objects that appear at high resolution with greater detail. This allows the algorithm to learn not just from the current reference frame, but also from semantically similar objects at higher resolution. This design allows our network to leverage instance-level attention in learning, and thus perform better in challenging scenes with large-displacement motion and fine-grained details.

2 Related Work

Optical flow estimation is a basic building block for video frame interpolation [29][31][9][18], and frame interpolation has been used to evaluate the accuracy of optical flow estimation [1]. With the rapidly improving quality of optical flow estimation, state-of-the-art optical flow methods [26][7][3] can serve as a strong baseline for video interpolation. However, flow-based video interpolation has two drawbacks: 1) it produces artifacts around object boundaries due to a lack of occlusion reasoning, and 2) the approaches are not end-to-end trainable.

To correct this problem, Liu et al. [31] develop a network to extract per-pixel 3D optical flow vectors across space and time in the input video. The intermediate image is generated by trilinear interpolation across the input video volume. The method obtains high-quality results in frame interpolation, and its unsupervised flow estimation results are also promising compared to the state-of-the-art. Notably, the deep voxel flow method tends to fail when the scene contains repetitive patterns.

Similarly, a recent work by Jiang et al. [9] interpolates frames at any arbitrary time step between two frames. They adopt the U-Net architecture [23] for bidirectional flow and visibility mask estimation, followed by a flow refinement network.

Niklaus et al. [18] warp both the input frames and pixel-wise contextual information extracted from ResNet-18[6], and employ a synthesis network with a GridNet [4] architecture to generate the interpolated frame.

Alternatively, several methods interpolate frames without explicit motion estimation [16][20][19]. Niklaus et al. [19] combine motion information and pixel synthesis in one step of local convolution over two input frames. They employ a network to estimate a spatially-adaptive convolution kernel for each pixel. Although this method enables high-quality video frame interpolation, it is difficult to estimate all the kernels at once and the interpolation is very memory intensive. Niklaus et al. [20] improve the efficiency by approximating a 2D kernel with a pair of 1D kernels. This work relieves the intensive memory requirement, but a fundamental limitation remains: the ability to capture large motion and the flexibility in frame resolution are still limited by the kernel size, which can be prohibitively expensive to increase. Phase-based methods encode the motion in the phase shift between input frames. A recent work on video interpolation by Meyer et al. [16] propagates predicted phase information across oriented multi-scale pyramid levels to cope with large motions.

A related problem is video frame extrapolation. Earlier approaches focused on developing latent variable models that can represent the inherent underlying uncertainty in prediction. Mathieu et al. [15] develop a multi-scale conditional GAN architecture to improve the prediction; however, it still suffers from blurriness and contains artifacts for large motion. Vondrick et al. [27] use a network with two independent streams: a moving foreground and a static background. A recent work by Lee et al. [13] uses a stochastic video prediction model, based on VAE-GANs, which combines VAE-based latent variable models with an adversarial loss.

Several authors have developed methods that seek to learn a transformation from past pixels to the future. Vondrick et al. [28] untangle the memory of the past from the prediction of the future by learning to predict sampling kernels. The work of [9] can be extended to video extrapolation since it learns offset vectors for sampling. Reda et al. [21] combine the merits of both flow-based and kernel-based approaches by learning a model that predicts a motion vector and a kernel for each pixel.

3 Method

We first introduce our forward image interpolation model: the flow estimation module in Sec.3.1 and the image synthesis module in Sec.3.2. We discuss our region-of-interest (ROI) discriminator, which is used to facilitate training our synthesis model, in Sec.3.3. In Sec.3.4 we describe the overall losses used to train our network. Finally, we give our training details in Sec.3.5.

3.1 Coarse optical flow estimation

To compensate for large-displacement motion, we first estimate coarse optical flow to generate an initial interpolated frame from two consecutive video frames. We use a U-Net-like network to estimate bidirectional optical flows, which are used to warp the two input frames to the designated intermediate time. At the same time, our network predicts a per-pixel weighting mask to blend the two warped images into one. The blending mask can be seen as a confidence mask and is designed to deal with occlusion. Meanwhile, inspired by [18], we employ a pre-trained feature extractor to extract high-level features from both input frames. Note that, empirically, flow-based methods perform satisfactorily over large areas but often fail to cope with fine-grained details and complex motions. Our flow estimation module therefore serves only as an initial step for the video interpolation task.

3.2 Image synthesis module

We perform both pixel-level and semantic feature-level warping, as shown in Fig.2. In detail, we feed the two input images, their corresponding deep feature maps, the estimated flows, and the blending mask into the later module for further refinement. In the image synthesis module, we use the estimated bi-directional flows and the blending mask to warp both images and features to the intermediate time by bi-linear interpolation [8].

Here the warping function takes a warping map and an image tensor and produces the warped tensor, and the blending mask is applied by element-wise multiplication. We then concatenate the warped features and image and feed them into the image synthesis layers. Unlike [18], in which the authors use a large GridNet [4] to refine the image, we simply use three convolutional layers with kernel size 9 to approximate a large receptive field. We will show that this approximation is sufficient for good performance with our proposed instance-level adversarial loss.
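The warping and blending step described above can be sketched in a few lines of plain Python. This is a minimal illustration on single-channel images stored as nested lists, not the network's actual implementation: the names `bilinear_sample`, `warp`, and `blend`, the per-pixel `(u, v)` flow layout, the border clamping, and the convex-combination form of the blend are all assumptions made for the example.

```python
import math


def bilinear_sample(img, x, y):
    """Sample a 2-D image (list of rows) at real-valued (x, y), clamping to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bot = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bot


def warp(img, flow):
    """Backward warp: output pixel (x, y) reads img at (x + u, y + v), with flow[y][x] = (u, v)."""
    h, w = len(img), len(img[0])
    return [[bilinear_sample(img, x + flow[y][x][0], y + flow[y][x][1])
             for x in range(w)] for y in range(h)]


def blend(warped0, warped1, mask):
    """Assumed blend: per-pixel mask * warped0 + (1 - mask) * warped1."""
    return [[m * a + (1 - m) * b for a, b, m in zip(r0, r1, rm)]
            for r0, r1, rm in zip(warped0, warped1, mask)]


# Hypothetical toy input: a horizontal ramp shifted by half a pixel.
frame0 = [[0, 10, 20], [0, 10, 20]]
half_pixel_right = [[(0.5, 0.0)] * 3 for _ in range(2)]
warped = warp(frame0, half_pixel_right)
```

Because bilinear sampling is piecewise linear in the sampling coordinates, gradients can flow back to the predicted flow and mask, which is what makes this kind of warping usable inside an end-to-end trained network.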

Figure 2: An overview of our model. The flow estimation module (left) takes two frames as input. It predicts the bidirectional optical flows for coarse motion estimation, and a blending mask that serves as a confidence mask for occlusion reasoning. The image synthesis module (right) takes the input frames, their corresponding features, the estimated optical flows, and the blending mask to synthesize the target frame. Instance-level adversarial discrimination is further added to preserve sharper image details. The adversarial learning structure is described in Fig.3.

3.3 Instance level Discriminator

Figure 3: Image-level adversarial learning vs. the proposed instance-level adversarial learning. We crop ROIs from high-resolution images and resize them into constant-size patches, which are used to supervise the synthesis of low-resolution images. This forces the system to focus on refining details and boundaries of instances.

Directly using the flow-guided warped image as the synthesized image has two problems: (a) as the optical flow is trained on the whole image, the warped image is likely to have twisted and blurry boundaries, as shown in Fig.1; (b) the bi-linear interpolation result is less realistic than the original target image. We employ adversarial learning [5] to solve these problems. Adversarial learning has shown promising performance in static image generation, activity prediction, and other computer vision problems. In this paper, we explore two algorithm variations for the video interpolation problem: (a) applying adversarial learning directly to the whole image, and (b) applying adversarial learning conditioned on the instance areas, as shown in Fig.3. Applying adversarial learning directly to the whole image makes the generated image more realistic with respect to the real image. However, since the majority of the image is usually background, the adversarial loss (Sec.3.4) is biased towards these areas, which hurts the optimization of foreground areas (see Fig.1, rows 2 and 3). To make the model focus more on the foreground, we further employ an instance-level discriminator. This instance-centered learning forces the model to pay more attention to instances, especially small-scale objects.

Here we introduce instance-level adversarial learning in more detail. Given the real image, we use a region proposal method to generate several regions of interest (ROIs). The ROIs on the real image and on the synthesized image are fed into a bi-linear interpolation reshaping layer and are reshaped into patches of a fixed size. The bi-linear interpolation reshaping layer achieves two effects: a) through bi-linear interpolation, the estimated loss can be back-propagated to the exact pixel locations and to previous modules, so the whole network can be updated end-to-end; b) the reshaping operation naturally realizes a zoom-in effect, balancing the network's focus between near and far, large and small objects. The reshaped ROI results for different objects are illustrated in Fig.1. The number of generated ROIs varies between images, which makes it hard to construct data batches for the network. To solve this issue, a fixed number of region proposals is assigned to each image, and the empty proposal slots left in each image are filled with zeros. An extra number recording the count of valid ROIs in each image is fed into the network as well, to guarantee that only valid ROIs are processed in the later discrimination step. A discriminator with spectral normalization [17] is employed to examine only the specific ROIs instead of the whole image. The details of the adversarial loss computation are described in the next section.
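The zoom-in reshaping and fixed-size ROI batching described above might look roughly like the following. This is a sketch under stated assumptions rather than the authors' code: `crop_and_resize` and `pad_rois` are hypothetical helper names, boxes are assumed to be `(x0, y0, x1, y1)` in pixel coordinates, the resize is corner-aligned, and images are single-channel nested lists.

```python
def bilinear(img, x, y):
    """Bilinear lookup in a 2-D image (list of rows), clamped to the image border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bot = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bot


def crop_and_resize(img, box, out_size):
    """Resample box = (x0, y0, x1, y1) to an out_size x out_size patch (out_size >= 2)."""
    x0, y0, x1, y1 = box
    # Output index 0 maps to the box's first corner, out_size - 1 to its last.
    step_x = (x1 - x0) / (out_size - 1)
    step_y = (y1 - y0) / (out_size - 1)
    return [[bilinear(img, x0 + j * step_x, y0 + i * step_y)
             for j in range(out_size)] for i in range(out_size)]


def pad_rois(boxes, max_rois):
    """Truncate or zero-pad the proposal list to max_rois and report how many are valid."""
    kept = list(boxes[:max_rois])
    num_valid = len(kept)
    kept += [(0, 0, 0, 0)] * (max_rois - num_valid)
    return kept, num_valid


# Hypothetical toy input: zoom a 2x2 region up to a 3x3 patch.
image = [[0, 2], [4, 6]]
patch = crop_and_resize(image, (0, 0, 1, 1), 3)
boxes, num_valid = pad_rois([(1, 1, 3, 3)], max_rois=3)
```

Small boxes are upsampled and large boxes are downsampled to the same patch size, so every instance contributes a comparable number of pixels to the discriminator regardless of its size in the frame.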

3.4 Training Objectives

We use two losses to constrain the network: a synthesis loss and an adversarial loss.


Synthesis Loss. For the synthesis loss, we first calculate the robust norm [25] on the per-pixel color difference, which is commonly used in recent self-supervised optical flow estimation work [30]. Additionally, we penalize the difference of the first-order gradient of per-pixel color, which further improves the reconstruction quality [15]. The overall photometric loss combines these two terms, where the robust norm is also known as the Charbonnier norm.
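As a rough illustration of a photometric loss of this kind, the sketch below applies the Charbonnier penalty in its common form, sqrt(x^2 + eps^2), to both the color residual and its first-order differences. The value of `eps`, the equal weighting of the two terms, the use of the same penalty on the gradient term, and the function names are assumptions for the example, not values taken from the paper.

```python
import math


def charbonnier(x, eps=1e-3):
    """Charbonnier penalty: a smooth, robust approximation of |x|."""
    return math.sqrt(x * x + eps * eps)


def photometric_loss(pred, target, eps=1e-3):
    """Mean Charbonnier penalty on color differences plus on gradient differences."""
    h, w = len(pred), len(pred[0])
    diff = [[pred[y][x] - target[y][x] for x in range(w)] for y in range(h)]
    color = sum(charbonnier(d, eps) for row in diff for d in row) / (h * w)
    # Forward differences of the residual equal the difference of the image gradients.
    gx = [charbonnier(diff[y][x + 1] - diff[y][x], eps)
          for y in range(h) for x in range(w - 1)]
    gy = [charbonnier(diff[y + 1][x] - diff[y][x], eps)
          for y in range(h - 1) for x in range(w)]
    grads = gx + gy
    grad = sum(grads) / len(grads) if grads else 0.0
    return color + grad


# Hypothetical toy input: a uniform brightness offset has zero gradient error.
flat_loss = photometric_loss([[0, 0], [0, 0]], [[1, 1], [1, 1]], eps=0.0)
```

Unlike a squared error, the penalty grows only linearly for large residuals, so a few badly warped pixels do not dominate the loss.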

The second term of the synthesis loss is the perceptual loss [10]. It quantifies the network's higher-level feature reconstruction quality and thus yields more visually plausible interpolation results. Our results show that the perceptual loss enables the network to learn to reconstruct crisp images. The perceptual loss is computed on the outputs of a feature extraction function; in our work, we use the latent-space features from the RPN in Faster R-CNN [22]. We apply the photometric loss and perceptual loss to both the initial interpolated image and the final synthesized image to stabilize training of the flow estimation network. We also constrain the first-order gradients of the bi-directional optical flow and the corresponding blending mask to be locally smooth, resulting in a smoothness loss.

The above loss functions mainly guide the initial synthesis of our network, and we group them as the synthesis loss.


Adversarial Loss. Importantly, in order to deal with complex scenarios and enlarge model capacity, we utilize another network to discriminate the synthesized images. The adversarial loss consists of two parts, namely the generator loss and the discriminator loss. As illustrated in Fig.3, a typical practice is to compute adversarial losses on the whole image, which leads to image-level adversarial training. This training provides a uniform gradient across the whole image, so that semantic details and fine-grained differences are ignored. In contrast, we propose instance-level adversarial learning. For each pair of synthesized and ground-truth images, we use a region proposal method to propose N regions and resize them into fixed-size patches, resulting in N ROI pairs. As shown in Fig.3, if we have access to high-resolution images for training, we crop the ROIs from the high-resolution images and use them to guide the synthesis of low-resolution results. Each ROI pair consists of a synthesized ROI and its ground-truth counterpart, and the discriminator examines each of them. The generator and discriminator losses are computed over these ROI pairs.
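As an illustration of how a per-ROI adversarial loss can be averaged over only the valid proposals, the sketch below uses the standard binary cross-entropy GAN objective of [5]. The loss form is an assumption made for the example, as are the function names and the convention that the discriminator outputs a probability per ROI slot; the paper's own formulation may differ.

```python
import math


def bce(prob, label, tiny=1e-12):
    """Binary cross-entropy for a single probability and a 0/1 label."""
    return -(label * math.log(prob + tiny) + (1 - label) * math.log(1 - prob + tiny))


def roi_adversarial_losses(d_real, d_fake, num_valid):
    """Average GAN losses over the first num_valid ROI slots; padded slots are ignored.

    d_real / d_fake hold the discriminator's probabilities for ground-truth and
    synthesized ROI patches, one entry per (possibly padded) proposal slot.
    """
    if num_valid == 0:
        return 0.0, 0.0
    real, fake = d_real[:num_valid], d_fake[:num_valid]
    # Discriminator: push real ROIs towards 1 and synthesized ROIs towards 0.
    d_loss = sum(bce(r, 1.0) + bce(f, 0.0) for r, f in zip(real, fake)) / num_valid
    # Generator: push the discriminator's output on synthesized ROIs towards 1.
    g_loss = sum(bce(f, 1.0) for f in fake) / num_valid
    return d_loss, g_loss


# Hypothetical scores: one valid ROI followed by one zero-padded slot.
d_loss, g_loss = roi_adversarial_losses([0.5, 0.9], [0.5, 0.1], num_valid=1)
```

Slicing by `num_valid` keeps zero-padded proposal slots from contributing gradient, which is the role of the extra valid-ROI count fed to the network.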


3.5 Training

The network is trained on a mixture of the UCF101 and cityscapes datasets. We randomly pick four triplets from every video clip of UCF101 and one triplet from every sequence of the cityscapes training set, which gives us 3198 triplets in all. In practice, as our proposed network is self-contained and does not need labels, any collection of video clips is sufficient to train it. We keep the original UCF101 image size and downsample the cityscapes images for training inputs. Notably, we use the high-resolution version of the images in the cityscapes dataset to supervise adversarial learning. Forming high-resolution and low-resolution training pairs is the key to our learning algorithm. During training, we select a triplet with a duration of 80ms and randomly crop a region of the input triplet. We also randomly flip images for data augmentation. The Adam optimizer [11] is used with an initial learning rate of 0.0001, which is decayed exponentially by a factor of 0.1 every 10 epochs and clipped at 1e-8 during training. Each loss term is weighted by a fixed coefficient. We trained the model with batch size 2 on an 11GB NVIDIA GeForce GTX 1080 Ti GPU for around 12 hours.
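One reading of the learning rate schedule described above is sketched below. The continuous (non-staircase) form of the decay and the function and parameter names are assumptions; only the initial rate, the decay factor, the 10-epoch period, and the 1e-8 floor come from the text.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.1, decay_epochs=10, floor=1e-8):
    """Exponential decay by `decay` per `decay_epochs` epochs, clipped from below at `floor`."""
    return max(base_lr * decay ** (epoch / decay_epochs), floor)


# Learning rates at a few hypothetical epochs.
schedule = [learning_rate(e) for e in (0, 10, 20, 100)]
```

Under this reading the rate falls from 1e-4 to roughly 1e-5 after 10 epochs and reaches the floor after 40 epochs, so the clipping only matters for long runs.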

4 Experiments

Method            IE      SSIM    PSNR
DVF [31]          17.49   0.72    23.88
SepConv [20]      7.85    0.92    30.92
SuperSloMo [9]    -       -       -
Ours              9.38    0.89    29.31
Ours              9.04    0.90    29.93
Ours              8.39    0.92    30.43
Table 1: Quantitative evaluation of different methods on cityscapes dataset, including interpolation error (IE) [1], structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR). Lower IE and higher values of SSIM and PSNR indicate better quality. SuperSloMo [9] is not open-sourced, so we do not have its results on cityscapes dataset.
Method            IE      SSIM    PSNR
DVF [31]          11.54   0.86    29.70
SepConv [20]      11.28   0.87    30.29
SuperSloMo [9]    10.87   0.88    30.48
Ours              11.23   0.88    30.08
Ours              11.66   0.87    29.85
Ours              11.30   0.88    30.14
Table 2: Quantitative evaluation of different methods on UCF101 dataset, including interpolation error (IE) [1], structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR). Lower IE and higher values of SSIM and PSNR indicate better quality. We re-run the evaluation on the synthesis images provided by [9].
Figure 4: Trained with our proposed region adversarial loss, our model has the fewest parameters and the shortest run time on a single HD-resolution image compared with competing methods, while achieving state-of-the-art results. SDC-Net [21] has more than 160 million parameters.
Figure 5: Qualitative results from different methods on cityscapes dataset (panels: Ground Truth, Ours, Deep Voxel Flow, Separable Convolution). Best viewed in color.
Figure 6: Qualitative results from different methods on UCF101 dataset (panels: Ground Truth, Ground Truth Enlarged, Ours, Deep Voxel Flow, Separable Convolution, Super SloMo). Best viewed in color.

To evaluate our method, we compare it quantitatively and qualitatively with several state-of-the-art video frame interpolation methods. DVF [31] is a classic work that uses flow warping for video interpolation. We also select SepConv [20], which is based on adaptive separable convolutions, and SuperSloMo [9], which is currently the state-of-the-art method for video interpolation. We compare the results on two different datasets, cityscapes [2] and UCF101 [24]. The cityscapes dataset contains objects of many different sizes, such as cars, people, and traffic lights, at both near and far distances, which makes it well suited to differentiating algorithms' interpolation abilities on small objects. UCF101 contains human activities such as boating and applying makeup, which makes it well suited to showing results on non-rigid body motion. In ablation experiments, we compare several variations of our proposed approach: the basic flow warping structure, adversarial learning on the entire image, and the instance-level adversarial learning network.

4.1 Ablation studies

Baseline. In this section, we evaluate the effectiveness of the different modules in our proposed system for video interpolation and show both quantitative and qualitative results on the cityscapes dataset. We compare three variants of our model: one trained with the proposed ROI adversarial loss, one trained with the adversarial loss on the whole image, and one trained without any adversarial loss. Note that without a discriminator in the training stage, our model mostly degenerates to deep voxel flow [31]. For all studies in this section, we train the networks on a mixed dataset consisting of cityscapes and UCF101 for 300k iterations and report the interpolation error (IE) [1], peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM).

Adversarial training. We verify the advantages of using adversarial learning to improve video interpolation performance. Experiments on both datasets show that training with an adversarial loss gives sharper boundaries in images. In Tables 1 and 2, we find that our model with adversarial loss consistently outperforms the baseline model. In Fig.1, we show an example of the effectiveness of the adversarial loss. From the zoomed-in figures, we can see that the adversarial loss helps preserve edges. This can be attributed to the adversarial loss better facilitating the image synthesis module's learning and potentially correcting inaccurate optical flow estimation.

Interestingly, we also find that training the model with an adversarial loss on the whole image sometimes leads to a local minimum. This phenomenon is especially noticeable when testing on the cityscapes dataset. The network tends to erase uncertain objects in the scene and recover the background instead. This is because the discriminator is much stronger than the generator during training and the data distribution is dominated by rigid objects and background, leading to a biased learning result. In the next part, we discuss our proposed ROI discriminator, which can potentially fix this issue.

ROI discriminator. We further verify the advantages of introducing a focus mechanism in adversarial training, which greatly improves video interpolation performance. Experiments show that training with the ROI zoom-in method gives sharper boundaries on small, thin objects and on image details. With the adversarial loss, both rigid moving objects and non-rigid human body shapes are preserved better than by the baseline method with only the flow warping loss, as we can see in Fig.1. Table 1 shows that the variant trained with the ROI adversarial loss outperforms the image-level and no-adversarial variants on all three metrics. One typical failure case for applying adversarial losses to the whole image is that it is prone to erasing large non-rigid moving body parts, as mentioned above. From Table 2, we can see that when the ROI size is close to the entire image size, the instance-level discriminator model and the image-level discriminator model perform quite similarly. By introducing the region proposal method, we make the network focus more on instance-level semantics in the image. By formulating the video interpolation problem as perturbing semantic objects in image space, the pixel-level motion estimation can be better grouped and updated.

Training with high resolution patches. We also study the effects of training with different image resolutions. Because of data augmentation and concerns about training speed, researchers usually down-sample high-resolution images or crop parts of images for training. However, high-resolution images often provide fine-grained information that can potentially improve an algorithm's performance. We train our proposed model with the ROI discriminator on real image patches taken from high-resolution images. More specifically, based on the region proposals, we crop "fake" ROIs from the synthesized images and crop the corresponding "real" patches from their high-resolution counterparts, forming low-resolution/high-resolution pairs. The high-resolution patches ultimately force the generator to pay much more attention to details in low-resolution images. Table 1 shows that using high-resolution patches to train the network with our method boosts performance beyond both the baseline model and the model using full-image adversarial training.

4.2 Quantitative evaluation

We compare our approach with state-of-the-art video interpolation methods, including separable adaptive convolution (SepConv) [20] and deep voxel flow (DVF) [31], on both the UCF101 and cityscapes datasets. As shown in Table 1, our method achieves the best SSIM score. Table 2 gives the quantitative results on the UCF101 dataset, where we also compare with SuperSloMo [9]. We re-run the evaluation on the images provided by [31] and [9], and on images generated from [20]. All metrics are computed under the motion masks provided by [31], which highlight the ability to cope with regions of motion and occlusion. Our method achieves the highest SSIM score among the lightweight models and performs comparably to the heavy model, SuperSloMo [9].
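For reference, the masked error metrics can be sketched as follows. The sketch takes IE to be the root-mean-square pixel difference and PSNR to be computed against a peak value of 255; both conventions, the nested-list image format, and the function names are assumptions for the example, and SSIM is omitted because it needs windowed statistics.

```python
import math


def masked_mse(pred, target, mask=None):
    """Mean squared error over pixels where mask is truthy (all pixels if mask is None)."""
    total, count = 0.0, 0
    for y, row in enumerate(pred):
        for x, p in enumerate(row):
            if mask is None or mask[y][x]:
                total += (p - target[y][x]) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count


def interpolation_error(pred, target, mask=None):
    """IE, taken here as the root-mean-square difference inside the mask."""
    return math.sqrt(masked_mse(pred, target, mask))


def psnr(pred, target, mask=None, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for a perfect match."""
    mse = masked_mse(pred, target, mask)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)


# Hypothetical toy images that differ in a single pixel.
pred = [[10, 20], [30, 40]]
target = [[10, 20], [30, 50]]
motion_mask = [[0, 0], [0, 1]]
```

Restricting the average to a motion mask keeps large static regions, which every method reproduces well, from diluting the differences between methods.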

4.3 Qualitative results

In this section, we present an intuitive comparison with state-of-the-art video frame interpolation methods, using multiple examples across scenes and datasets. In Fig.5, we present the comparison on the cityscapes dataset for different street scenes under different lighting conditions. DVF [31] clearly generates the most artifacts, such as distortion of the whole scene, unrealistic deformation of cars and buildings, and misalignment of white lines. SepConv [20] is able to deal with motion within its kernel size, but it consistently produces severe blur and artifacts near the image boundary, as shown in all of our examples. Our proposed approach is particularly good at recovering fine-grained details, for example the traffic sign in the first example. It also fills in occluded regions in a natural and realistic way, such as the white lines on the road in the fourth example. Fig.6 shows a qualitative comparison on UCF101. DVF [31] has difficulty handling occlusion, as shown in the second example, although it was trained on UCF101. SepConv [20] frequently produces duplication artifacts, such as split horse legs and a split pole-vault pole. SuperSloMo [9] performs well in most scenes but sometimes fails to refine small-scale details, such as the chin of the boxer and the legs of running horses. Our proposed method reconstructs fine-grained details and is thus capable of interpolating the most challenging fast-motion scenes, such as running horses and boxing.

4.4 Discussion

Figure 7: Failure cases of our method on cityscapes dataset. Best viewed in color. (left) The legs of the father in the right part of the image are cut while the legs of the son are fully reconstructed. (right) One leg of the pedestrian is partially erased.

Our network achieves state-of-the-art video interpolation accuracy while using the fewest model parameters and running fastest at inference time. Only the image synthesis module is needed at inference time, which is very lightweight compared to SuperSloMo [9], which uses two-stage U-Nets, SDC-Net [21], which uses FlowNet2, and Context-aware Synthesis [18], which uses an expensive GridNet. Our network takes 0.36s per image, whereas SDC-Net needs 1.66s and Context-aware Synthesis needs 0.77s per image. In terms of visual comparison, besides maintaining good interpolation results on rigid moving objects and background, our network works well at preserving sharp instance details such as bicycle wheels and pen tips. The network also recovers non-rigid body movement remarkably well, keeping objects realistic when interpolating human faces, bodies, horse legs, etc. However, our network still has several limitations. For large non-rigid body movements, as with other state-of-the-art methods, it is still hard for the model to recover body shape during interpolation, and the interpolated objects are slightly distorted. Also, as the left image in Fig.7 shows, cluttered scenes lead to failure. This is due to heavily overlapping instances and severe occlusion, which make it hard for the system to distinguish individual objects. Finally, larger models with mature optical flow estimation still generate better results than ours on clean, texture-rich areas such as the ground, where they benefit from accurate flow estimation.

5 Conclusions

We have demonstrated a lightweight video interpolation framework that can retain object details in image synthesis. We use a flow estimation module to synthesize the intermediate frame, followed by a simple image synthesis module to correct detailed shape errors. The network is trained with a region-based discriminator that uses high-resolution image patches to supervise the synthesis of low-resolution ROIs, constraining instances in images to look realistic. Our proposed adversarial training strategy can be used as a general training block to improve algorithm performance. In the future, we hope to improve the model design to compensate for some of its drawbacks, for example by employing deformable convolutions to tackle large motions and deformations. We also want to extend our work to the video prediction task.