Deep Reference Generation with Multi-Domain Hierarchical Constraints for Inter Prediction

05/16/2019 ∙ by Jiaying Liu, et al. ∙ 0

Inter prediction is an important module in video coding for temporal redundancy removal, where similar reference blocks are searched from previously coded frames and employed to predict the block to be coded. Although traditional video codecs can estimate and compensate for block-level motions, their inter prediction performance is still heavily affected by the remaining inconsistent pixel-wise displacement caused by irregular rotation and deformation. In this paper, we address the problem by proposing a deep frame interpolation network to generate additional reference frames in coding scenarios. First, we summarize the previous adaptive convolutions used for frame interpolation and propose a factorized kernel convolutional network that improves the modeling capacity while keeping a compact form. Second, to better train this network, multi-domain hierarchical constraints are introduced to regularize its training. For the spatial domain, we use a gradually down-sampled and up-sampled auto-encoder to generate the factorized kernels for frame interpolation at different scales. For the quality domain, considering the inconsistent quality of the input frames, the factorized kernel convolution is modulated with quality-related features so that it learns to exploit more information from high-quality frames. For the frequency domain, a sum of absolute transformed difference loss that performs frequency transformation is utilized to facilitate network optimization from the view of coding performance. With the frame interpolation network regularized by multi-domain hierarchical constraints, our method surpasses HEVC with an average 6.1% BD-rate saving for the luma component under the random access configuration.




I Introduction

With the booming multimedia social networking and consumer electronics markets, a rapidly increasing number of images and videos are uploaded to the community every day. This trend calls for new coding techniques to further improve compression efficiency. Successive video frames are usually continuous in the temporal dimension and capture the same scene. Therefore, video codecs like MPEG-4 AVC/H.264 [1] and High Efficiency Video Coding (HEVC) [2] improve video coding performance with inter prediction, which removes temporal redundancy between video frames. Specifically, in the inter prediction module, for a block that is to be coded (the to-be-coded block), motion estimation is first used to search for reference blocks among the reconstructed frames. Based on the motion estimation results, motion compensation then predicts the to-be-coded block from the reference blocks. After that, only the block-level motion information and the prediction residue between the predicted result and the original to-be-coded block need to be coded. Consequently, temporal redundancies are largely removed and many bits are saved.
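The motion estimation and compensation steps described above can be illustrated with a toy full-search block matcher. This is a minimal sketch, not HEVC's actual search: it assumes integer-pel motion, a sum of absolute differences (SAD) cost, and a single reference frame, and all function names are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def motion_search(cur, ref, top, left, size, search_range):
    """Full search: return the motion vector (dy, dx) with the lowest SAD."""
    target = get_block(cur, top, left, size)
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate falls outside the reference frame
            cost = sad(target, get_block(ref, y, x, size))
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost


def residue(cur, ref, top, left, size, mv):
    """Prediction residue left after motion compensation with vector mv."""
    target = get_block(cur, top, left, size)
    pred = get_block(ref, top + mv[0], left + mv[1], size)
    return [[t - p for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(target, pred)]


# Toy frames: the current frame is the reference shifted right by one pixel.
ref = [[8 * r + c for c in range(8)] for r in range(8)]
cur = [[row[0]] + row[:-1] for row in ref]
mv, cost = motion_search(cur, ref, top=2, left=2, size=4, search_range=2)
res = residue(cur, ref, 2, 2, 4, mv)
```

For this purely translational toy case the search recovers the shift exactly and the residue is all zeros, so only the motion vector would need to be coded. Rotation or deformation inside the block would leave a non-zero residue, which is the situation the rest of the paper targets.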

However, inter prediction faces many obstacles. Even for continuous frames, content changes and complex local motions are quite common, which lead to large residues. Thus, many bits are used to code these residues between the prediction and the to-be-coded block. Much research has been conducted to better estimate global and local motions, namely to capture inter-frame correspondences, for better motion compensation and temporal redundancy removal. The early works [1, 2] perform block-level motion estimation and compensation. In these methods, the prediction is derived directly from one individual reference block or from a linear combination of the reference blocks. In real videos, besides block-level translational motion between reference blocks and the to-be-coded block, there exist complex local motions caused by non-translational camera and object movements, such as rotation and deformation between the matched blocks. We refer to these as inconsistent pixel-wise displacement. This kind of displacement cannot be modeled by block-level motion estimation and compensation alone, so the residues remain large and cost many bits to code.

Therefore, some methods [3, 4, 5] turn to applying pixel-wise refinement for inter prediction with the bi-directional optical flow (BIO) between reference frames. Alshin et al. [3, 4] calculated BIO between reference blocks and operated pixel-wise motion refinement for the bi-directional motion compensation. In [5], with the BIO estimated from reference frames, a co-located reference frame is interpolated as the additional reference for motion compensation. Although alleviating pixel-wise displacement to some extent, the performance of these methods heavily relies on the accuracy of optical flow estimation. However, the optical flow estimation process used by these methods is manually designed, which will inevitably lead to the inaccurate estimation of the complex pixel-wise inconsistent displacement.

Recently, with the rise of deep learning-based image processing, researchers have begun to apply deep learning techniques to motion-related problems, e.g., optical flow estimation [6, 7, 8] and frame interpolation [9, 10, 11]. Besides significantly improving performance in these tasks, these works bring new insights and methodologies for pixel-level motion modeling, which provide new foundations for subsequent works. Meanwhile, more and more works [12, 13, 14, 15] introduce deep learning techniques into the video coding scenario and offer significant improvements in coding performance. Owing to the powerful capacity of representation learning, deep learning techniques can flexibly handle various kinds of video signals and construct the non-linear mapping from input signals to the target domain.

We follow both trends, deep learning-based motion modeling and deep learning-based video coding optimization, and offer optimized video coding techniques to better model the pixel-wise inconsistent displacement. Specifically, we use deep learning techniques to interpolate a pixel-wise closer frame (PC-frame) from existing reconstructed frames. Here, “pixel-wise closer” means that the inconsistent pixel-wise displacement between the interpolated frame and the frame which is to be coded (the to-be-coded frame) is smaller than that between the reconstructed frames and the to-be-coded frame. The PC-frame is then utilized as an additional reference frame for the to-be-coded frame, so that reference blocks with smaller pixel-wise displacement may be retrieved for the to-be-coded blocks in inter prediction. Compared to the standalone video frame interpolation task, frame interpolation in video coding faces additional issues. 1) In lossy compression, reconstructed reference frames are heavily degraded, so less reference information can be used for interpolation. Moreover, the interpolation of details is disturbed by compression artifacts in the prediction process. 2) Coding performance should be the metric for coding-oriented frame interpolation. However, the existing coding pipeline is very complex and not end-to-end trainable, so it is a great challenge to introduce proper objective functions to train the coding-oriented frame interpolation network. 3) There are various kinds of dependencies in different domains which can be utilized for PC-frame interpolation, e.g., in the spatial domain and the frequency domain. There is no unified framework that considers these dependencies and their potential interactions jointly.

In our work, we tackle the above issues by building a multi-scale quality attentive factorized kernel convolutional neural network (MSQ-FKCNN). The network exploits an encoder-decoder convolutional neural network (CNN) to generate factorized kernels for synthesizing the target frame from compressed frames. Compared with a single large kernel or separable kernels, the proposed factorized kernels are both flexible and economical for modeling video frame signals. Meanwhile, we introduce multi-domain hierarchical constraints to train the network. 1) To reduce the disturbance of compression noise, we introduce a quality attentive mechanism which guides the network to make choices in the quality domain, so that it uses more information from high quality frames for inter prediction. 2) For the metric to train such a network, inspired by HEVC, a sum of absolute transformed difference (SATD) loss function that integrates measurements in both spatial and frequency domains is used. 3) To better utilize dependencies in the spatial domain and model the joint interdependencies among different domains, our network takes a multi-scale structure to exploit the spatial dependencies and model the multi-domain dependencies in a unified way. Benefiting from the factorized kernel CNN and the multi-domain hierarchical constraints, the proposed network can be trained not only for better interpolation quality but also for better coding performance.

Our contributions are summarized as follows:

  • We propose to utilize deep frame interpolation to generate an additional pixel-wise closer reference frame for inter prediction. A coding-oriented frame interpolation network MSQ-FKCNN is specially designed to flexibly synthesize the target frame from input frames with factorized kernels. MSQ-FKCNN significantly alleviates the inconsistent pixel-wise displacement between existing reference frames and the to-be-coded frame.

  • To better train our network, multi-domain hierarchical constraints are designed for the coding-oriented frame interpolation. The hierarchical dependencies in spatial, quality and frequency domains are considered jointly to obtain more abundant reference information and achieve better interpolation results.

  • To additionally deal with compression artifacts, the multi-scale quality attentive mechanism is designed to make the network pick up more information from high quality frames and further exploit spatial dependencies for prediction, which further improves the interpolation accuracy.

  • In order to improve the modeling capacity of our network in the video coding scenario, a multi-scale SATD loss function is implemented to guide the network optimization in the joint spatial and frequency domain, which can better indicate the coding cost of the prediction residue and lead to better coding performance.

The rest of the paper is organized as follows. Sec. II introduces recently proposed deep learning-based methods which solve motion-related problems. Some recent works that use deep learning techniques to improve video coding performance are also presented. Our proposed coding-oriented frame interpolation method will be introduced in Sec. III. Implementation details about the training data preparation and how to integrate generated PC-frames into HEVC are shared in Sec. IV. Experimental results and analyses are shown in Sec. V and concluding remarks are given in Sec. VI.

II Related Works

II-A Deep Learning-Based Motion-Related Works

Recently, deep learning-based motion estimation works have been widely proposed and show impressive results compared with traditional methods. In [16], an end-to-end optical flow estimation network, FlowNet, is first proposed and achieves estimation accuracy comparable to traditional methods. A succeeding network, FlowNet 2.0, is later designed to progressively estimate the optical flow and performs on par with state-of-the-art methods at higher frame rates. Recently, Sun et al. [8, 17] proposed a compact but effective PWC-Net that integrates pyramid processing, warping, and the cost volume. PWC-Net outperforms previous works on the KITTI benchmark [18].

Meanwhile, some motion-related applications like frame interpolation are also greatly facilitated by deep learning techniques. Niklaus et al. [10] formulated video frame interpolation as two steps, i.e., motion estimation and pixel synthesis, and proposed an end-to-end deep learning framework to solve these two tasks. Spatially adaptive kernels are estimated for synthesizing target frames. In [11], adaptive separable kernels are subsequently proposed to largely reduce the number of model parameters. Liu et al. [19] chose to directly synthesize the target frame from the input by learning pixel displacement with the network. In [20], flows between the target frame and two input frames are also estimated and utilized for warping. The warped contextual information, which is extracted from the response of ResNet-18 [21], is additionally used for blending intermediate frames warped from two-sided input frames. In [22], bi-directional optical flows between input frames are inferred by U-Net [23] and then linearly combined at each time step for interpolating target frames at arbitrary time points.

The successes of the above deep learning-based methods have demonstrated the ability of deep learning techniques to handle motion-related problems. Thus, building on the experience of previous methods, we explore more deeply how to estimate the pixel-wise displacement between compressed video frames and generate better temporal reference samples for inter prediction in video coding with deep neural networks.

II-B Deep Learning-Based Video Coding

Many works exploit deep learning techniques to improve video coding performance by optimizing modules in the coding structure, e.g., loop filtering, mode decision, rate control, and intra and inter prediction.

CNNs have brought significant performance gains to many image restoration tasks like super-resolution [24, 25], denoising [26] and compression artifacts removal [27, 28]. The success of CNNs on these image restoration tasks has promoted the development of deep learning-based loop filtering methods. In [12], Kang et al. proposed a multi-modal/multi-scale convolutional neural network to replace the existing deblocking filter and sample adaptive offset for loop filtering, which obtains considerable gains over HEVC. A content-aware mechanism [29] is designed to use different CNN models for the adaptive loop filtering in different regions. In addition to improving loop filtering performance, Laude and Ostermann [30] proposed to replace the conventional Rate Distortion Optimization (RDO) with a CNN for the intra prediction mode decision. Furthermore, the coding unit (CU) partition mode decision can also be predicted by a CNN [31] and a Long Short-Term Memory (LSTM) network [32].

Considering the strong nonlinear mapping ability of deep learning techniques, it is also very promising to predict more reference signals from existing reconstructed signals for intra and inter prediction by deep learning. Li et al. [14] first adopted a fully connected network (FCN) to learn an end-to-end mapping from neighboring reconstructed pixels to the to-be-coded block in the intra coding of HEVC. Moreover, Hu et al. [33] used a recurrent neural network to explore the correlations between reconstructed reference pixels and predict the to-be-coded block in a progressive manner. As for inter prediction, Wang et al. [34] additionally used spatially neighboring pixels of both reference blocks and current to-be-coded blocks to refine initial predicted blocks with an FCN and a CNN. In [15] and [35], CNNs are used for fractional interpolation in the motion compensation process, which provide better sub-pixel level reference samples for inter prediction. Zhao et al. [36] first tried to apply deep frame interpolation to video coding by directly using interpolated blocks as the reconstructed blocks at the coding tree unit (CTU) level. They directly applied a pre-trained video frame interpolation model without any specific optimization for the video coding scenario. In contrast, in our work, we exploit multi-domain hierarchical constraints for the additional reference generation to overcome the new challenges faced in the video coding scenario.

III Pixel-Wise Closer Reference Generation with Hierarchical Constraints in Multiple Domains

In this section, we first illustrate frame interpolation in HEVC and analyze several issues faced in the coding scenario. Then, we build an MSQ-FKCNN for deep frame interpolation. Finally, we present several constraints that regularize the training of our MSQ-FKCNN to address these issues for better interpolation.

III-A Frame Interpolation in HEVC

We implement and test our method on the HEVC reference software HM-16.15 under the random access (RA) configuration. For a to-be-coded frame, the two-sided frames (one preceding and one following it) are previously coded and used as the input of MSQ-FKCNN. The PC-frame is interpolated from them to facilitate inter prediction. In the video coding scenario, only the reconstructed versions of these two reference frames are available for reference. Compared to frame interpolation of high-quality videos, frame interpolation in the coding scenario inevitably faces four issues:

  1. High frequency detail loss of reference frames caused by the quantization operation leads to difficulty in the implicit motion estimation and inaccuracy of local details inference.

  2. Compression artifacts, e.g., blockiness, in reference frames originate from block-based quantization. These artifacts are easily carried over into the interpolation results.

  3. Inconsistent quality of input frames. Due to the design of coding configurations, the two reference frames may be coded with different quantization parameters (QPs). A desirable frame interpolation model should consider the quality of the input frames and utilize this information adaptively.

  4. The purpose of video coding is to maintain the quality of decoded frames with fewer bits. Thus, training a coding oriented frame interpolation model should pay attention to both distortion and bit cost.

III-B Overview of the Proposed Method

To tackle the above-mentioned issues, we explore possible models and potential constraints to effectively infer temporally intermediate frames from noisy and inconsistent input frames. To build an effective interpolation model, we start from the raw adaptive kernel CNN and the separable kernel CNN [11], analyze their relationship from a unified viewpoint, and develop the proposed MSQ-FKCNN for better interpolation. To better train this model, hierarchical constraints in several domains are introduced:

  1. Spatial Domain. Our feature extraction network takes an encoder-decoder structure that first down-samples features and then up-samples features. In this network, the kernels and interpolation results are inferred progressively from small to large scales. At a small scale, the motion information is easier to learn, and details can be better inferred in this progressive way even with the high frequency detail loss. Furthermore, at the small scale, compression artifacts are suppressed, and cleaner and more accurate interpolation results are obtained.

  2. Quality Domain. To handle the inconsistent quality of input frames, we make our model aware of the quality differences. The factorized kernel is modulated with quality-related features, which guides the model to utilize more information from the high quality input frame.

  3. Frequency Domain. To better regularize the training of our frame interpolation network for better video coding performance, a loss that considers both distortion and bit cost is implemented by frequency transformation.

In the following sections, we introduce our MSQ-FKCNN and the multi-domain hierarchical constraints in detail.

Fig. 1: Architecture of different adaptive kernel models for frame interpolation. (a) Raw adaptive convolution and its factorized form. (b) Adaptive separable convolution. (c) Factorized convolution. (d) Multi-scale factorized convolution. (e) Multi-scale quality attentive factorized convolution.

Fig. 2: Architecture of MSQ-FKCNN. Numbers below the feature maps indicate channel numbers, and the scale labels indicate the scales of the images. The feature extraction part predicts bi-directional motion information between input frames. In the multi-scale frame interpolation component, intermediate frames of different scales are interpolated with the estimated factorized convolution kernels and quality attentive maps. QA-FKC denotes quality attentive factorized kernel convolution. The convolution is modulated with quality-related features so that it uses more information from high quality frames. The SATD loss measures the difference in both spatial and frequency domains. The whole network is regularized by the constraints in the spatial, frequency and quality domains.

III-C Multi-Scale Quality Attentive Factorized Kernel CNN

For a to-be-coded frame, the frame interpolation method based on adaptive convolutions uses the two-sided reference frames as input. The bi-directional motion feature is first extracted and then used for inferring adaptive kernels to reconstruct the temporally intermediate frame. The adaptive convolutions used in previous works, the related variants and our newly proposed one are discussed as follows.

Adaptive Convolution. Adaptive convolution works as follows. To predict a pixel in the target frame, two adaptive 2D kernels are first estimated, one for each of the two-sided input reference frames. The target pixel is then interpolated via local adaptive convolution on the two reference frames as follows:


where the two patches are taken from the two reference frames and are centered at the position of the target pixel. For better illustration in the following parts, we first introduce a factorized form of the adaptive convolution, as shown in Fig. 1 (a). We concatenate the two 2D kernels to form a 3D adaptive kernel for each pixel. Its three dimensions are height, width and time: the height and width of the adaptive kernel are set to the same value, and the temporal dimension corresponds to the number of input reference frames. The kernel can then be factorized, slice by slice along the temporal dimension, as follows:


where the frame index identifies the input reference frame and the number of summed terms is the rank of the adaptive kernel. The two factorized kernels in each term are one-dimensional.
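One compact way to write the two formulas described in this paragraph is given below. The notation is assumed for this sketch and is not taken from the paper: $\hat{I}$ is the interpolated frame, $K_i$ and $P_i$ are the adaptive kernel and the reference patch of frame $i$, $*$ denotes elementwise multiplication followed by summation, $k^{v}_{i,r}$ and $k^{h}_{i,r}$ are the vertical and horizontal one-dimensional factors, and $R$ is the rank.

```latex
\hat{I}(x, y) = K_1(x, y) * P_1(x, y) + K_2(x, y) * P_2(x, y)

K_i(x, y) = \sum_{r=1}^{R} k^{v}_{i,r}(x, y)\, \bigl(k^{h}_{i,r}(x, y)\bigr)^{\top},
\qquad i \in \{1, 2\}
```

Under this reading, each temporal slice of the 3D kernel is a sum of $R$ outer products, so the rank $R$ directly controls how many coefficients must be estimated per pixel.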

Adaptive Separable Convolution. The raw adaptive convolution with a large kernel size leads to a huge number of parameters, which makes the model difficult to train. The adaptive separable convolution [11] addresses the problem by estimating a separable form of the convolutions. In fact, the adaptive separable convolution can be viewed as the special case of the factorized form of the adaptive convolution in which the rank is 1, as shown in Fig. 1 (b). For each pixel in the target frame, four one-dimensional kernels are first estimated: a vertical and a horizontal kernel for each of the two reference frames. Then, the two adaptive 2D kernels are obtained as the outer products of the vertical and horizontal kernels of each frame. Promising frame interpolation results can be achieved by estimating the adaptive separable kernels.
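The parameter saving of the separable form can be made concrete with a small sketch. The helper names below are hypothetical, and the kernel size used in the example is chosen only for illustration.

```python
def outer(vertical, horizontal):
    """Outer product of two 1D kernels, giving a 2D kernel."""
    return [[v * h for h in horizontal] for v in vertical]


def apply_kernel(kernel, patch):
    """Per-pixel adaptive convolution: multiply elementwise and sum."""
    return sum(k * p for k_row, p_row in zip(kernel, patch)
               for k, p in zip(k_row, p_row))


def params_per_pixel(n, frames=2, rank=None):
    """Coefficients estimated per output pixel.

    rank=None counts full n x n kernels; otherwise each kernel is a sum of
    `rank` outer products of two length-n vectors.
    """
    per_frame = n * n if rank is None else 2 * n * rank
    return frames * per_frame


k_v = [0.25, 0.5, 0.25]   # vertical 1D kernel
k_h = [0.0, 1.0, 0.0]     # horizontal 1D kernel
patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
value = apply_kernel(outer(k_v, k_h), patch)  # 0.25*2 + 0.5*5 + 0.25*8
```

For a kernel of side n, the full form needs n * n coefficients per frame while the separable form needs 2 * n, so the gap widens quickly as the kernel grows.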

Factorized Kernel Convolution. When we relax the rank constraint and set the rank to an intermediate value, we obtain convolutions with different numbers of model parameters and different modeling capacities, as shown in Fig. 1 (c).
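The effect of raising the rank can be sketched as follows. The functions are hypothetical illustrations: the first builds a kernel as a sum of outer products, and the second applies the same kernel to a patch without ever forming it.

```python
def rank_r_kernel(vertical_factors, horizontal_factors):
    """Sum of R outer products; R is the number of factor pairs."""
    n = len(vertical_factors[0])
    kernel = [[0.0] * n for _ in range(n)]
    for k_v, k_h in zip(vertical_factors, horizontal_factors):
        for i in range(n):
            for j in range(n):
                kernel[i][j] += k_v[i] * k_h[j]
    return kernel


def apply_factors(vertical_factors, horizontal_factors, patch):
    """Apply the kernel without forming it: sum over r of k_v^T P k_h."""
    total = 0.0
    for k_v, k_h in zip(vertical_factors, horizontal_factors):
        for i, row in enumerate(patch):
            total += k_v[i] * sum(h * p for h, p in zip(k_h, row))
    return total


# A 2 x 2 identity kernel has rank 2, so no single outer product can
# represent it, but two factor pairs can.
v_factors = [[1.0, 0.0], [0.0, 1.0]]
h_factors = [[1.0, 0.0], [0.0, 1.0]]
identity = rank_r_kernel(v_factors, h_factors)
response = apply_factors(v_factors, h_factors, [[1, 2], [3, 4]])
```

The identity example shows the extra modeling capacity in the smallest possible case: a rank of 1 is insufficient, while a rank of 2 reproduces the kernel exactly at a cost of 2 * n * R coefficients per frame.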

Multi-Scale Factorized Kernel Convolution. We assume that coding artifacts can be alleviated by the down-scaling operation. Thus, more accurate synthesis results can be obtained at small scales. By constraining the interpolation process at small scales, the main structure of the target frame is better learned and the frame interpolation quality can be further improved.

In conjunction with multi-scale frame interpolation, we project the factorized kernels to different scales and build a multi-scale factorized kernel convolution as shown in Fig. 1 (d) as follows:


where the scale index denotes the scale of the factorized kernel. Here, the representation of the adaptive kernel is not realized by kernel-wise summation but is equivalently injected into the frame generation process. That is, the separate kernels are directly used to interpolate the target frame successively at different scales, and they are combined through the fusion of the frames synthesized at different scales.

Specifically, for each scale, the target pixel is synthesized by:


where the target frame is synthesized at each of the three scales in turn, and the reference patch at each scale is centered at the position of the target pixel. The up-sampled term is obtained by up-sampling the frame interpolated at the previous scale by a factor of two, and it is set to 0 at the first (smallest) scale.
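A possible form of the per-scale synthesis is sketched below. It rests on two assumptions that are not confirmed by the surviving text: that the up-sampled coarser result is fused additively (suggested by its being set to 0 at the first scale), and that each scale uses one separable kernel pair per reference frame. The symbols $s$, $\mathrm{Up}(\cdot)$, $k^{v,s}_{i}$, $k^{h,s}_{i}$ and $P^{s}_{i}$ are notation chosen for this sketch.

```latex
\hat{I}^{s}(x, y) = \mathrm{Up}\bigl(\hat{I}^{s-1}\bigr)(x, y)
  + \sum_{i=1}^{2} \bigl(k^{v,s}_{i}(x, y)\bigr)^{\top} P^{s}_{i}(x, y)\, k^{h,s}_{i}(x, y),
\qquad \mathrm{Up}\bigl(\hat{I}^{0}\bigr) \equiv 0
```

Read this way, the sum over scales plays the role that the sum over rank terms plays in the single-scale factorized kernel.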

Multi-Scale Quality Attentive Factorized Kernel Convolution. As mentioned above, under the RA configuration, the two-sided reference frames are of different quality since they are coded with different QPs. It is meaningful to pay more attention to the reference frame of higher quality. Consequently, the quality attentive mechanism is introduced to the factorized kernel convolution and a new quality attentive kernel is added. The quality attentive modulation, shown in Fig. 1 (e), is formulated as follows:


From the view of frame interpolation, an illustration of the target frame synthesis that uses quality attentive factorized kernel convolution is shown as the QA-FKC component in Fig. 2. We generate normalized quantization parameter (QP) maps of the two-sided reference frames as an additional input to make our network more aware of the quality differences between reference frames. Two-sided quality attentive kernels and factorized kernels are estimated for synthesizing the target frame.

The target pixel is obtained by:


III-D Architecture of MSQ-FKCNN

The architecture of MSQ-FKCNN is shown in Fig. 2. The whole pipeline is described in detail as follows.

Bi-Directional Motion Feature Extraction. An encoder-decoder structure is employed to extract the bi-directional motion feature. The progressive down-sampling and up-sampling operations effectively enlarge the receptive fields, so large-scale motion can also be captured by MSQ-FKCNN. All convolutional layers use the same kernel size, and the rectified linear unit (ReLU) is utilized as the activation function. At the encoder side, average pooling is used for down-sampling. Bilinear interpolation is used for up-sampling at the decoder side. Skip connections pass low-level information from the encoder side to the decoder side.
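The two resampling operations named above can be sketched in plain Python. This is only an illustration of 2x2 average pooling and bilinear up-sampling by a factor of two. The half-pixel sampling convention with clamped borders is an assumption of this sketch, since the paper does not state which bilinear variant it uses.

```python
def avg_pool_2x2(img):
    """Down-sample by 2 with non-overlapping 2x2 average pooling."""
    return [[(img[y][x] + img[y][x + 1]
              + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]) - 1, 2)]
            for y in range(0, len(img) - 1, 2)]


def _source_coord(out_index, src_len):
    """Map an output index to a clamped source coordinate (half-pixel centers)."""
    coord = (out_index + 0.5) / 2.0 - 0.5
    return min(max(coord, 0.0), src_len - 1.0)


def bilinear_up_2x(img):
    """Up-sample by 2 with bilinear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for oy in range(2 * h):
        sy = _source_coord(oy, h)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for ox in range(2 * w):
            sx = _source_coord(ox, w)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


image = [[4 * r + c for c in range(4)] for r in range(4)]
pooled = avg_pool_2x2(image)        # 2 x 2 result
restored = bilinear_up_2x(pooled)   # back to 4 x 4, smoothed
```

Each pooling step halves the resolution, so a fixed-size convolution applied afterwards covers twice as much of the original frame in each direction, which is how the encoder enlarges the receptive field.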

Multi-Scale Frame Interpolation. With the extracted bi-directional motion feature, the multi-scale frame interpolation part generates target intermediate frames of different scales, from small to large, at the decoder side. At each scale, the target intermediate frame is interpolated by quality attentive factorized kernel convolution.

Quality Attentive Factorized Kernel Convolution. Details of the factorized kernel estimation have been described in Sec. III-C. At each scale, two-sided factorized kernels and quality attentive maps are estimated for interpolation. Feature maps of the corresponding scale extracted in the feature extraction part are used as input. For a target frame, four factorized kernel maps are inferred by four layers of convolutions, and the kernel length is set separately for each of the three scales. Thereby, each pixel in the target frame can find its four corresponding factorized kernels at the same position of the four factorized kernel maps.

As for the quality attentive map estimation, normalized QP maps of the reference frames are generated and used as the input together with the extracted bi-directional motion feature. The normalized QP maps are derived by dividing the QPs of the reference frames by 51. Then, a quality attentive map is estimated by four layers of convolutions. The interpolated result is obtained by quality attentive factorized kernel convolution on the reference frames, as formulated in Sec. III-C.
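The QP normalization is simple enough to sketch directly. The function name is hypothetical, and the sketch assumes one QP per reference frame, so that each map is constant over the frame.

```python
MAX_QP = 51  # normalization constant stated in the text


def normalized_qp_map(qp, height, width):
    """Constant map filled with qp / 51, one value per pixel."""
    if not 0 <= qp <= MAX_QP:
        raise ValueError("QP is expected to lie in [0, 51]")
    value = qp / MAX_QP
    return [[value] * width for _ in range(height)]


# Two reference frames coded at different QPs give two different maps,
# which the network receives alongside the motion features.
map_a = normalized_qp_map(51, 2, 3)
map_b = normalized_qp_map(0, 2, 3)
```

Dividing by 51 maps the HEVC QP range onto [0, 1], which keeps this side input on a scale comparable to normalized pixel values.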

After multi-scale frame interpolation, the multi-scale SATD loss function is used to measure the prediction error and guide optimization of network parameters, which is illustrated in Sec. III-E. The corresponding components in our network which implement the multi-domain hierarchical constraints are summarized in Table I.

III-E Multi-Scale SATD Loss Function

In the training process, the parameters of the network are optimized by back-propagating the gradient of the loss calculated between the interpolated frame and the ground truth. In deep video frame interpolation methods, the pixel-wise ℓ1 loss function is commonly adopted [19, 11, 20, 22] to train the model for better objective performance:


However, the ℓ1 loss function cannot fully measure the modeling capacity from the view of video coding performance. It treats each pixel independently and thus cannot measure the bits needed for coding the prediction residue, which is the other important factor that affects the final coding performance.

Module | Constrained Domain | Explanation
Bi-directional motion feature extraction | Spatial | Multi-scale encoder-decoder
Multi-scale frame interpolation | Spatial, quality, frequency | It integrates all parts to get results
Factorized kernel | Spatial | As shown in Fig. 1 (e)
Quality attentive maps estimation | Quality | Make the network aware of the quality differences
SATD loss | Spatial, frequency | The signal difference after a frequency transformation

TABLE I: Summary of modules and the corresponding constraints.

Fig. 3: Two example residue blocks, (a) and (b), with the same ℓ1 loss but different SATD losses. Intuitively, the SATD loss is superior in measuring the redundancy of the residual signal after transform.

In the fractional motion estimation process of HEVC, SATD is adopted as a matching criterion because it better indicates the number of bits required for coding residual signals. It has been shown empirically that the SATD value after frequency transformation is more consistent with the number of bits spent on residual signal coding. Two example residue blocks and the corresponding Hadamard transformed blocks are shown in Fig. 3. Though their ℓ1 losses are the same, the residue block (a) will intuitively cost less in the subsequent coding process since there are higher spatial similarities among the residual signals in the block. Compared with the ℓ1 loss, SATD successfully reflects the difference in coding cost.
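The contrast between the two measures can be checked numerically. The sketch below uses an unnormalized Hadamard matrix built by the Sylvester construction and two hand-made 4x4 residue blocks. These blocks are illustrative and are not the ones shown in Fig. 3.

```python
def hadamard(n):
    """Unnormalized Hadamard matrix of order n (n must be a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def l1(block):
    """Sum of absolute values of all entries."""
    return sum(abs(v) for row in block for v in row)


def satd(block):
    """Sum of absolute values of H * block * H^T."""
    h = hadamard(len(block))
    h_t = [list(col) for col in zip(*h)]
    return l1(matmul(matmul(h, block), h_t))


smooth = [[1] * 4 for _ in range(4)]                    # flat residue
spiky = [[16, 0, 0, 0]] + [[0] * 4 for _ in range(3)]   # one isolated error

l1_smooth, l1_spiky = l1(smooth), l1(spiky)
satd_smooth, satd_spiky = satd(smooth), satd(spiky)
```

Both blocks have the same ℓ1 value of 16, but the flat block concentrates all of its energy in a single transform coefficient (SATD 16), whereas the isolated error spreads across every coefficient (SATD 256). A codec's normalization of SATD may differ from this unnormalized form, but the ordering of the two blocks is unaffected.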

Consequently, we adopt SATD as the loss function to apply constraints to MSQ-FKCNN in the frequency domain for better coding performance. In conjunction with the hierarchical prediction architecture, multi-scale SATD loss function is further calculated to constrain the prediction process from coarse to fine in the frequency domain.

We calculate the loss by blocks. The Hadamard transformation matrix is defined as follows:


By dividing the residue into non-overlapping residue blocks whose size matches the Hadamard matrix, we transform each residue block by:


where the result is the transformed residue block. Then, the SATD loss is obtained as the sum of the absolute values of all the transformed residual signals:


The final multi-scale loss is calculated by:


where the weighting parameters, one per scale, are set empirically. The down-scaled ground-truth images are derived from the full-resolution ground truth with bilinear interpolation.
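The loss construction described in this subsection can be summarized as follows. The notation is assumed for this sketch rather than taken from the paper: $H$ is built by the standard Sylvester recursion, $R_b$ is the $b$-th block of the residue $\hat{I} - I$, $T_b$ is its transform, and $\lambda_s$ is the weight of scale $s$. Any normalization factor the paper may apply to $H$ is omitted.

```latex
H_{1} = \begin{pmatrix} 1 \end{pmatrix}, \qquad
H_{2m} = \begin{pmatrix} H_{m} & H_{m} \\ H_{m} & -H_{m} \end{pmatrix}

T_{b} = H\, R_{b}\, H^{\top}

\mathrm{SATD}(\hat{I}, I) = \sum_{b} \sum_{u, v} \bigl| T_{b}(u, v) \bigr|

\mathcal{L} = \sum_{s} \lambda_{s}\, \mathrm{SATD}\bigl(\hat{I}^{s}, I^{s}\bigr)
```

Because $H$ contains only entries of $\pm 1$, the transform is linear in the residue and the loss remains differentiable almost everywhere, in the same way as the ℓ1 loss.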

IV Training and Integration Details of MSQ-FKCNN

IV-A Training Data Preparation

For training data preparation, we use video clips to form the training samples. Each video clip consists of three consecutive frames, where the first and last frames are the two-sided reference frames and the middle frame is the ground truth. In the video coding scenario, two-sided reference frames are reconstructed frames which suffer from coding artifacts. The frame quality may be low, especially for high QPs. To make the network work well in this condition, we code the two reference frames and use the reconstructed frames as the input in training data generation. The reference frames are coded with HM-16.15 under the all intra configuration with a randomly selected QP value.

Besides, frames are coded with different QPs in the RA configuration, which means the two-sided reference frames usually differ in quality. To further simulate the real application scenario, we set the QPs of the two-sided reference frames to differ by a random value from 0 to 10. The QPs of the reference frames are also saved in the training set as side information for training. With the quality attentive mechanism, our MSQ-FKCNN can be more aware of the quality difference between reference frames and learn to interpolate higher quality frames.
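One way to draw such QP pairs is sketched below. The function name `sample_reference_qps` and the parameters `qp_min` and `qp_max` are hypothetical; the actual QP range is not reproduced here, so the bounds in the example call are placeholders. Clamping the second QP to the range is also an assumption of this sketch and slightly skews the distribution of differences near the bounds.

```python
import random


def sample_reference_qps(qp_min, qp_max, max_diff=10, rng=random):
    """Draw QPs for the two reference frames so that they differ by 0..max_diff."""
    qp_first = rng.randint(qp_min, qp_max)
    diff = rng.randint(0, max_diff)       # random quality gap, 0 to 10 by default
    sign = rng.choice((-1, 1))            # either frame may be the better one
    qp_second = min(qp_max, max(qp_min, qp_first + sign * diff))
    return qp_first, qp_second


# Placeholder bounds, chosen only for illustration.
rng = random.Random(0)
pairs = [sample_reference_qps(20, 45, rng=rng) for _ in range(5)]
print(pairs)
```

Both QPs would then be stored with the sample as side information.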

Next, we randomly extract co-located blocks of a fixed size from the two-sided coded reference frames and the ground truth frame to form the training data. SpyNet [7], a deep learning-based optical flow estimation method, is used to select candidate blocks: blocks whose mean optical flow values are large are not added to the training set.

In the training process, we follow [11] for training data augmentation. Smaller blocks are randomly cropped from the extracted blocks for training. The cropped blocks are further augmented by randomly swapping the order of the two-sided reference blocks and by flipping all the blocks horizontally or vertically.
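The augmentation steps above can be sketched as follows. This is an illustrative sketch only: blocks are represented as lists of rows, the crop size is a caller-supplied parameter, each random operation is applied with probability 0.5 (an assumption), and the helper names are hypothetical.

```python
import random


def crop(block, top, left, size):
    """Cut a size x size window out of a 2-D block."""
    return [row[left:left + size] for row in block[top:top + size]]


def augment(ref_a, ref_b, target, crop_size, rng=random):
    """Random crop, random reference swap, and random flips of a training triplet."""
    height, width = len(target), len(target[0])
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    # The same window is used for all three blocks so they stay co-located.
    blocks = [crop(b, top, left, crop_size) for b in (ref_a, ref_b, target)]
    if rng.random() < 0.5:   # swap the order of the two reference blocks
        blocks[0], blocks[1] = blocks[1], blocks[0]
    if rng.random() < 0.5:   # horizontal flip of all blocks
        blocks = [[row[::-1] for row in b] for b in blocks]
    if rng.random() < 0.5:   # vertical flip of all blocks
        blocks = [b[::-1] for b in blocks]
    return tuple(blocks)


frame = [[r * 6 + c for c in range(6)] for r in range(6)]
a, b, t = augment(frame, frame, frame, crop_size=4, rng=random.Random(1))
print(len(t), len(t[0]))
```

Swapping the references is valid here because the target lies midway between them, so the interpolation task is symmetric in time.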

Fig. 4: Illustration of the hierarchical B coding structure in HM-16.15.

IV-B Integration into HEVC

We implement and test our method on HM-16.15 under the RA configuration, where frames are coded in the hierarchical B coding structure. Frames are allocated to different groups of pictures (GOPs), and frames of different GOPs are coded successively. In HM-16.15, each GOP consists of 16 frames. The coding order of frames in the same GOP is not determined by their picture order count (POC) values but by the hierarchical structure. As shown in Fig. 4, frames are assigned to different temporal layers and are coded successively according to these layers. Frames in higher layers can use the reconstructed frames in lower layers for inter prediction. Moreover, in addition to frames of the same GOP, coded frames in previous GOPs can also be used as references.

In this paper, we generate the PC-frame for frames whose temporal layers are greater than 1. Specifically, for a to-be-coded frame, MSQ-FKCNN infers the desired PC-frame from the two-sided reconstructed reference frames, which are selected according to the temporal layer of the to-be-coded frame.
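To make the role of the temporal layer concrete, the sketch below computes the layer of a frame in a dyadic hierarchical B structure and the nearest pair of lower-layer frames on either side of it. It assumes a power-of-two GOP size with layer 0 at the GOP boundary, and it assumes that the two-sided references are the symmetric nearest lower-layer frames; both are assumptions of this example, and the function names are hypothetical.

```python
def temporal_layer(poc, gop_size=16):
    """Temporal layer of a frame in a dyadic hierarchical-B GOP (0 is the lowest)."""
    depth = gop_size.bit_length() - 1          # log2(gop_size) for powers of two
    pos = poc % gop_size
    if pos == 0:
        return 0
    trailing_zeros = (pos & -pos).bit_length() - 1
    return depth - trailing_zeros


def two_sided_references(poc, gop_size=16):
    """POCs of the nearest lower-layer frames before and after the given frame."""
    layer = temporal_layer(poc, gop_size)
    if layer == 0:
        raise ValueError("layer-0 frames have no two-sided references in this sketch")
    depth = gop_size.bit_length() - 1
    distance = 1 << (depth - layer)            # 8, 4, 2, 1 for layers 1..4 at GOP 16
    return poc - distance, poc + distance


def needs_pc_frame(poc, gop_size=16, min_layer=2):
    """PC-frames are generated only for temporal layers greater than 1."""
    return temporal_layer(poc, gop_size) >= min_layer


for poc in (8, 4, 13):
    print(poc, temporal_layer(poc), two_sided_references(poc), needs_pc_frame(poc))
```

With a GOP size of 16 this gives five layers (0 to 4); with a GOP size of 8 it gives four layers (0 to 3).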

In the coding process, two reference picture lists are maintained. For most frames, two forward frames in one list and two backward frames in the other are available as references for inter prediction. Each reference frame is assigned a reference frame index that indicates its position in the reference picture list. Prediction units (PUs) at the decoder side find their reference frames through the decoded reference frame indexes. To add the interpolated PC-frame to the reference picture lists, we choose the existing reference frame in the lists that is farthest from the to-be-coded frame and use its reference index to access the PC-frame at the decoder side.

The PC-frame and the chosen reference frame share the same reference index in inter prediction. Specifically, we implement a CU-level RDO to decide which of the two frames is accessed through the shared reference index. Two passes of encoding, which respectively use the PC-frame and the original reference frame for inter prediction, are performed at the encoder side. A flag is set based on the rate-distortion costs of the two passes to indicate which of the two frames is used when the shared reference index is chosen. The flag is coded with one bit and signaled at the CU level, and all PUs in a CU share the same flag. Moreover, if no PU in a CU chooses the shared reference index after the two passes of encoding, we do not code the flag, since there is no need to indicate which frame the shared reference index points to if it is never used.
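The CU-level decision can be illustrated with a small sketch. It is not HM code: `PassResult` and `decide_cu_flag` are hypothetical names, each encoding pass is reduced to its rate-distortion cost plus a boolean saying whether any PU in the CU used the shared reference index, and the rule that the flag is coded only when the winning pass uses the shared index is this example's reading of the scheme.

```python
from dataclasses import dataclass


@dataclass
class PassResult:
    """Outcome of encoding one CU with one candidate behind the shared index."""
    rd_cost: float              # rate-distortion cost of the pass
    uses_shared_index: bool     # True if at least one PU picked the shared index


def decide_cu_flag(pass_reference, pass_pc_frame):
    """Return (use_pc_frame, flag_is_coded) for one CU.

    use_pc_frame: the shared index should resolve to the PC-frame.
    flag_is_coded: a one-bit flag has to be written for this CU.
    """
    use_pc_frame = pass_pc_frame.rd_cost < pass_reference.rd_cost
    winner = pass_pc_frame if use_pc_frame else pass_reference
    # If no PU in the CU touches the shared index, the flag carries no information.
    flag_is_coded = winner.uses_shared_index
    return use_pc_frame, flag_is_coded


print(decide_cu_flag(PassResult(100.0, True), PassResult(90.0, True)))
print(decide_cu_flag(PassResult(80.0, False), PassResult(90.0, True)))
```

A fuller model would also add the cost of the flag bit itself to the rate term of each pass; that is omitted here for brevity.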

Class           Sequence          BD-rate (Y)   BD-rate (U)   BD-rate (V)
Class A         Traffic           -6.1%         -6.0%         -4.6%
Class A         PeopleOnStreet    -11.0%        -14.6%        -12.6%
Class A         Average           -8.6%         -10.3%        -8.6%
Class B         Kimono            -3.8%         -5.7%         -3.7%
Class B         BQTerrace         -0.6%         -0.8%         0.1%
Class B         BasketballDrive   -2.5%         -4.5%         -3.5%
Class B         ParkScene         -5.1%         -5.5%         -4.1%
Class B         Cactus            -5.4%         -8.4%         -7.4%
Class B         Average           -3.5%         -5.0%         -3.7%
Class C         BasketballDrill   -5.2%         -10.3%        -9.9%
Class C         BQMall            -10.7%        -13.9%        -12.9%
Class C         PartyScene        -7.4%         -11.8%        -9.6%
Class C         RaceHorsesC       -2.4%         -4.7%         -4.7%
Class C         Average           -6.4%         -10.2%        -9.3%
Class D         BasketballPass    -8.8%         -11.0%        -13.5%
Class D         BlowingBubbles    -6.5%         -8.0%         -7.6%
Class D         BQSquare          -10.5%        -6.4%         -9.1%
Class D         RaceHorses        -5.5%         -8.2%         -7.8%
Class D         Average           -7.8%         -8.4%         -9.5%
All Sequences   Overall           -6.1%         -8.0%         -7.4%
TABLE II: BD-rate reduction of the proposed method compared to HEVC.

V Experimental Results

V-A Experimental Settings

We use the Vimeo-90K dataset [37] to generate training data. The dataset consists of video clips with a fixed resolution, resized from high-quality video frames. Training samples are generated from the dataset. The network is implemented in PyTorch, and AdaMax [38] is used as the optimizer. The learning rate is initially set to 0.001 and changed to 0.0001 after 30 epochs. We end the training process when 70 epochs are reached. The batch size is set to 16. We train our network on a Titan X GPU.

The proposed method is tested in HM-16.15 under the RA configuration with the intra period set to -1. BD-rate is used to measure the coding performance. HEVC common test sequences are adopted for testing. The number of encoded frames is set to twice the frame rate. Four QP values, 27, 32, 37 and 42, are employed in the experiment. It should be noted that we only need to train one model for all QPs. Luma and chroma components share the same interpolation model. During testing, the chroma components are first up-sampled and concatenated with the luma component to form a three-channel YUV image, which is then transformed to an RGB image to form the input. We also compare with the method proposed in [36], which also introduces deep frame interpolation to video coding but directly uses the interpolated block as the reconstruction block. For simplicity, we call it DVRF.
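The input preparation for 4:2:0 content can be sketched as below. Several details are assumptions of this example and are not taken from the paper: nearest-neighbour chroma up-sampling, 8-bit samples, and the full-range BT.601 conversion coefficients. The function names are hypothetical.

```python
def upsample2x(plane):
    """Nearest-neighbour 2x up-sampling of a 2-D list (assumed up-sampling filter)."""
    out = []
    for row in plane:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def clamp8(x):
    """Round to the nearest integer and clip to the 8-bit range."""
    return max(0, min(255, int(round(x))))


def yuv420_to_rgb(y, u, v):
    """Convert 8-bit YUV 4:2:0 planes to rows of (R, G, B) tuples.

    Uses full-range BT.601 coefficients, which is an assumption of this sketch.
    """
    u_full, v_full = upsample2x(u), upsample2x(v)
    rgb = []
    for y_row, u_row, v_row in zip(y, u_full, v_full):
        row = []
        for yy, uu, vv in zip(y_row, u_row, v_row):
            cb, cr = uu - 128, vv - 128
            row.append((clamp8(yy + 1.402 * cr),
                        clamp8(yy - 0.344136 * cb - 0.714136 * cr),
                        clamp8(yy + 1.772 * cb)))
        rgb.append(row)
    return rgb


gray = yuv420_to_rgb([[128, 128], [128, 128]], [[128]], [[128]])
print(gray[0][0])
```

After interpolation, the inverse conversion and chroma down-sampling would be needed to return to the codec's native format.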

(a) PeopleOnStreet
(b) BQMall
(c) BasketballPass
(d) BQSquare
Fig. 5: Four example R-D curves of the sequences PeopleOnStreet, BQMall, BasketballPass and BQSquare for the luma component under RA configuration.

V-B Experimental Results and Analysis

V-B1 Overall Performance

Table II shows the overall performance of our method for classes A, B, C and D. Our method obtains on average 6.1%, 8.0% and 7.4% BD-rate savings for the Y, U and V components, respectively. For the test sequence PeopleOnStreet, up to 11.0% BD-rate saving is obtained for the luma component. For further verification, some example rate-distortion (R-D) curves are shown in Fig. 5. It can be seen that our method is superior to HEVC under most QPs.

Class           Sequence          DVRF     Ours
Class B         Kimono            -1.7%    -4.7%
Class B         BQTerrace         -0.2%    -0.3%
Class B         BasketballDrive   -1.1%    -2.7%
Class B         ParkScene         -2.6%    -5.3%
Class B         Cactus            -4.6%    -6.1%
Class B         Average           -2.0%    -3.8%
Class C         BasketballDrill   -3.2%    -5.6%
Class C         BQMall            -6.0%    -10.1%
Class C         PartyScene        -3.0%    -6.3%
Class C         RaceHorsesC       -0.8%    -2.0%
Class C         Average           -3.2%    -6.0%
Class D         BasketballPass    -5.4%    -9.9%
Class D         BlowingBubbles    -4.1%    -6.0%
Class D         BQSquare          -7.1%    -9.0%
Class D         RaceHorses        -2.2%    -6.0%
Class D         Average           -4.7%    -7.7%
All Sequences   Overall           -3.2%    -5.7%
TABLE III: BD-rate reduction comparison between DVRF and MSQ-FKCNN.
(a) Target frame: Racehorses POC 1, left reference QP: 29, right reference QP: 40
(b) Target frame: Racehorses POC 13, left reference QP: 38, right reference QP: 40
Fig. 6: Visualization examples of the weighting maps which indicate the proportion different reference frames take in the target frame interpolation. Brighter pixels mean higher weightings.

V-B2 Comparison with the Existing Method

Furthermore, we compare our MSQ-FKCNN with DVRF [36], which introduces a deep frame interpolation method to video coding. DVRF is implemented on HM-16.6. For a fair comparison, we also implement our method on HM-16.6 and test our method under the same conditions as DVRF. In the RA configuration of HM-16.6, the GOP size is 8 and the frames are divided into four temporal layers. Following DVRF, we also only deal with frames of layer 2 and layer 3 and directly replace the temporally farthest reference frame without CU level RDO.

As shown in Table III, although DVRF obtains a gain over HEVC, it uses a pre-trained model that does not take the video coding scenario into account, which limits its performance. Moreover, directly using generated blocks as the reconstructed blocks cannot fully exploit the benefits of frame interpolation and introduces prediction errors into the subsequent coding process. In contrast, by designing our model specifically for the video coding scenario and integrating the generated PC-frame into inter prediction, our method obtains on average 2.5% more BD-rate saving for the luma component than DVRF.

V-B3 Verification of Multi-Domain Hierarchical Constraints

The effectiveness of multi-domain hierarchical constraints is also verified. A network named FKCNN is first implemented without the quality attentive mechanism and multi-scale frame interpolation. Q-FKCNN is later trained by adding the quality attentive mechanism to FKCNN to verify the quality attentive mechanism. Both FKCNN and Q-FKCNN are trained with the same settings as MSQ-FKCNN. The effectiveness of the hierarchical constraints can be proven by comparing between Q-FKCNN and MSQ-FKCNN.

Class           Sequence          FKCNN    Q-FKCNN   MSQ-FKCNN
Class C         BasketballDrill   -3.3%    -3.4%     -3.6%
Class C         BQMall            -8.4%    -8.7%     -9.2%
Class C         PartyScene        -5.1%    -5.1%     -5.3%
Class C         RaceHorsesC       -1.8%    -2.1%     -2.2%
Class C         Average           -4.7%    -4.8%     -5.1%
Class D         BasketballPass    -6.0%    -6.7%     -8.5%
Class D         BlowingBubbles    -5.3%    -5.6%     -6.6%
Class D         BQSquare          -6.8%    -5.4%     -7.4%
Class D         RaceHorses        -3.9%    -4.6%     -5.0%
Class D         Average           -5.5%    -5.6%     -6.9%
All Sequences   Overall           -5.1%    -5.2%     -6.0%
TABLE IV: BD-rate reduction comparison for the verification of multi-domain hierarchical constraints.

Note that in the following comparison, we test all the sequences with 32 frames and only the first frame is coded under the all intra configuration. The comparison results of different networks on classes C and D for the luma component are shown in Table IV.

It can be seen that, for most test sequences, adding the quality attentive mechanism to FKCNN brings a further BD-rate reduction. We additionally visualize the fusion weighting maps of the two-sided synthesized results to further verify our quality attentive mechanism. The weighting maps are generated by dividing the results synthesized from the left and right reference frames by the interpolated frame, which indicates the proportion each reference frame takes in the final result. The visualization results are shown in Fig. 6. As we can see, reference frames of higher quality usually take a larger proportion in the final results. Moreover, the greater the quality difference is, the larger the proportion taken by the higher quality frame.
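A weighting map of this kind can be computed as in the sketch below. It assumes that the interpolated frame is the per-pixel sum of the two one-sided synthesized results, which is an assumption of this example; the function name `weighting_maps` and the handling of zero-valued pixels are hypothetical as well.

```python
def weighting_maps(left_part, right_part):
    """Per-pixel share of the interpolated frame contributed by each reference.

    left_part and right_part are the results synthesized from the left and right
    reference frames; the interpolated frame is assumed to be their sum.
    """
    w_left, w_right = [], []
    for left_row, right_row in zip(left_part, right_part):
        wl_row, wr_row = [], []
        for l, r in zip(left_row, right_row):
            total = l + r                      # interpolated pixel value
            # A zero-valued pixel has no defined share; split it evenly.
            wl_row.append(l / total if total else 0.5)
            wr_row.append(r / total if total else 0.5)
        w_left.append(wl_row)
        w_right.append(wr_row)
    return w_left, w_right


wl, wr = weighting_maps([[30, 20]], [[10, 20]])
print(wl, wr)
```

Brighter values in the left map then mark pixels where the left reference dominates the interpolation.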

Comparing Q-FKCNN with MSQ-FKCNN shows that the hierarchical constraints bring on average 0.8% and up to 2.0% BD-rate reduction, since they provide more multi-domain dependencies for reference and lead to more accurate prediction results.

V-B4 Verification of SATD Loss Function

The superiority of the SATD loss function is also verified by experiments. Table V shows the BD-rate reduction obtained by two MSQ-FKCNN models trained with different loss functions, one of which is the SATD loss. By additionally constraining the interpolation in the frequency domain, we obtain on average 0.8% more BD-rate reduction for the luma component on the sequences of classes C and D.

(a) BasketballPass
(b) BlowingBubbles
(c) BQSquare
(d) RaceHorses
Fig. 7: Changes of the CU partition before and after using generated PC-frames for inter prediction. In each set, the left one shows ratios of different types of pixels coded by HM and the right one shows ratios of the pixels coded by our proposed method.
Class     Without SATD loss             With SATD loss
          Y        U        V           Y        U        V
Class C   -4.6%    -8.0%    -7.8%       -5.1%    -8.0%    -7.6%
Class D   -5.7%    -7.2%    -8.2%       -6.9%    -7.3%    -7.7%
All       -5.2%    -7.6%    -8.0%       -6.0%    -7.6%    -7.6%
TABLE V: BD-rate reduction comparison between models trained with different loss functions.
Sequence         Choosing ratio                          BD-rate
                 QP 27    QP 32    QP 37    QP 42
BasketballPass   40.4%    48.0%    51.7%    52.7%        -8.8%
BlowingBubbles   51.6%    60.3%    61.8%    44.1%        -6.5%
BQSquare         64.5%    70.8%    71.9%    40.2%        -10.5%
RaceHorses       22.3%    26.9%    34.8%    40.9%        -5.5%
TABLE VI: Ratios of CUs that choose PC-frames for inter prediction under different QPs.
Class           Sequence          BD-rate (Y)   BD-rate (U)   BD-rate (V)
Class C         BasketballDrill   -2.1%         -9.4%         -6.9%
Class C         BQMall            -6.5%         -11.1%        -10.7%
Class C         PartyScene        -3.3%         -9.9%         -7.8%
Class C         RaceHorsesC       -0.6%         -1.1%         -0.8%
Class C         Average           -3.1%         -7.9%         -6.6%
Class D         BasketballPass    -4.0%         -8.5%         -6.7%
Class D         BlowingBubbles    -3.1%         -7.2%         -9.4%
Class D         BQSquare          -3.3%         -6.6%         -3.4%
Class D         RaceHorses        -0.6%         -1.2%         -1.3%
Class D         Average           -2.7%         -5.9%         -5.2%
Class E         FourPeople        -6.9%         -8.5%         -5.9%
Class E         Johnny            -3.7%         -3.0%         -2.6%
Class E         KristenAndSara    -4.9%         -7.3%         -5.0%
Class E         Average           -5.2%         -6.3%         -4.5%
All Sequences   Overall           -2.9%         -6.9%         -5.9%
TABLE VII: BD-rate reduction under the LDB configuration.

V-B5 Rate Distortion Optimization and CU Partition Results Analysis

For further verification of the proposed method, we analyze the RDO and CU partition results on the sequences of class D. It should be noted that only frames whose temporal layers are greater than 1 are covered in our analysis. We first calculate the ratio of the CUs that choose PC-frames for inter prediction. The ratios are shown in Table VI. It can be seen that PC-frames generated by our MSQ-FKCNN are adopted by a considerable number of CUs for inter prediction.

Intuitively, more large CUs will be used if we successfully alleviate the inconsistent pixel-wise displacement, since there is no need to further divide the CUs to handle the local differences caused by pixel-wise displacement. We therefore analyze how the CU partition results change before and after using the generated PC-frames. We divide pixels into four types according to the sizes of the CUs they belong to. The ratios of the different types of pixels are then calculated and shown in Fig. 7. It can be seen that more large CUs are used for inter prediction after adding PC-frames to the reference lists.
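The two statistics used in this analysis can be computed as in the following sketch. The input formats are assumptions of this example: one boolean per CU for the PC-frame choice, and one integer side length per square CU (64, 32, 16 or 8 in HEVC) for the partition statistics. The function names are hypothetical.

```python
from collections import Counter


def choosing_ratio(cu_uses_pc_frame):
    """Fraction of CUs that choose the PC-frame for inter prediction."""
    flags = list(cu_uses_pc_frame)
    return sum(1 for f in flags if f) / len(flags)


def pixel_ratio_by_cu_size(cu_sizes):
    """Fraction of pixels covered by CUs of each size (square CUs, side length given)."""
    pixels = Counter()
    for size in cu_sizes:
        pixels[size] += size * size            # a CU of side s covers s*s pixels
    total = sum(pixels.values())
    return {size: count / total for size, count in sorted(pixels.items(), reverse=True)}


# One 64x64 CU and four 32x32 CUs cover the same number of pixels.
print(pixel_ratio_by_cu_size([64, 32, 32, 32, 32]))
print(choosing_ratio([True, False, True, True]))
```

Counting pixels rather than CUs keeps the comparison fair, because a single 64x64 CU covers as much area as sixty-four 8x8 CUs.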

V-B6 Results under LD Configuration

Furthermore, to test the generality of the proposed method, we also evaluate it under the low delay B (LDB) configuration. We additionally train MSQ-FKCNN for the LDB configuration on newly prepared training data. Video clips containing three consecutive frames are used to form the training samples. In each clip, the first two frames form the input and the third frame is used as the target. Testing results on sequences of classes C, D and E are shown in Table VII. On average, a 2.9% BD-rate reduction is obtained for the luma component under the LDB configuration.

VI Conclusion

In this paper, we propose a deep learning based frame interpolation method to improve the inter prediction performance of HEVC. We analyze the difficulties that frame interpolation encounters in the video coding scenario and, to address them, propose MSQ-FKCNN based frame interpolation regularized by multi-domain hierarchical constraints. The multi-scale quality attentive factorized kernel convolution is implemented to interpolate the target frame from small to large scales with quality attention. For the training of MSQ-FKCNN, a multi-scale SATD loss function is employed to guide the network optimization in both the spatial and frequency domains, which further improves the coding performance. After adding the generated PC-frames under the hierarchical B coding structure, a significant BD-rate reduction is obtained. Extensive experiments verify the effectiveness of each component of MSQ-FKCNN and demonstrate its superiority to the previous method.


  • [1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
  • [2] G. J. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
  • [3] A. Alshin, E. Alshina, and T. Lee, “Bi-directional optical flow for improving motion compensation,” in Proc. Picture Coding Symposium, 2010.
  • [4] A. Alshin and E. Alshina, “Bi-directional optical flow for future video codec,” in Proc. Data Compression Conference, 2016.
  • [5] B. Li, J. Han, and Y. Xu, “Co-located reference frame interpolation using optical flow estimation for video compression,” in Proc. Data Compression Conference, 2018.
  • [6] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2017.
  • [7] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2017.
  • [8] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2018.
  • [9] S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A. Sorkine-Hornung, “Phase-based frame interpolation for video,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2015.
  • [10] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via adaptive convolution,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2017.
  • [11] ——, “Video frame interpolation via adaptive separable convolution,” in Proc. IEEE Int’l Conf. Computer Vision, 2017.
  • [12] J. Kang, S. Kim, and K. M. Lee, “Multi-modal/multi-scale convolutional neural network based in-loop filter design for next generation video codec,” in Proc. IEEE Int’l Conf. Image Processing, 2017.
  • [13] C. Jia, S. Wang, X. Zhang, S. Wang, J. Liu, S. Pu, and S. Ma, “Content-aware convolutional neural network for in-loop filtering in high efficiency video coding,” IEEE Transactions on Image Processing, 2019.
  • [14] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, “Fully connected network-based intra prediction for image coding,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3236–3247, 2018.
  • [15] N. Yan, D. Liu, H. Li, and F. Wu, “A convolutional neural network approach for half-pel interpolation in video coding,” in Proc. IEEE Int’l Symposium on Circuits and Systems, 2017.
  • [16] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Proc. IEEE Int’l Conf. Computer Vision, 2015.
  • [17] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Models matter, so does training: An empirical study of CNNs for optical flow estimation,” arXiv preprint arXiv:1809.05571, 2018.
  • [18] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2012.
  • [19] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, “Video frame synthesis using deep voxel flow,” in Proc. IEEE Int’l Conf. Computer Vision, 2017.
  • [20] S. Niklaus and F. Liu, “Context-aware synthesis for video frame interpolation,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2018.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2016.
  • [22] H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and J. Kautz, “Super slomo: High quality estimation of multiple intermediate frames for video interpolation,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2018.
  • [23] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. Int’l Conf. Medical Image Computing and Computer Assisted Intervention, 2015.
  • [24] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. European Conf. Computer Vision, 2014.
  • [25] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2016.
  • [26] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [27] C. Dong, Y. Deng, C. C. Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in Proc. IEEE Int’l Conf. Computer Vision, 2015.
  • [28] Y. Dai, D. Liu, and F. Wu, “A convolutional neural network approach for post-processing in HEVC intra coding,” in Proc. Int’l Conf. Multimedia Modeling, 2017.
  • [29] C. Jia, S. Wang, X. Zhang, S. Wang, J. Liu, S. Pu, and S. Ma, “Content-aware convolutional neural network for in-loop filtering in high efficiency video coding,” IEEE Transactions on Image Processing, 2019.
  • [30] T. Laude and J. Ostermann, “Deep learning-based intra prediction mode decision for HEVC,” in Picture Coding Symposium, 2016, pp. 1–5.
  • [31] Z. Liu, X. Yu, Y. Gao, S. Chen, X. Ji, and D. Wang, “CU partition mode decision for HEVC hardwired intra encoder using convolution neural network,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5088–5103, 2016.
  • [32] M. Xu, T. Li, Z. Wang, X. Deng, and Z. Guan, “Reducing complexity of HEVC: A deep learning approach,” arXiv preprint arXiv:1710.01218, 2017.
  • [33] Y. Hu, W. Yang, S. Xia, and J. Liu, “Optimized recurrent network for intra prediction in video coding,” in Proc. IEEE Visual Communication and Image Processing, 2018.
  • [34] Y. Wang, X. Fan, C. Jia, D. Zhao, and W. Gao, “Neural network based inter prediction for HEVC,” in Proc. IEEE Int’l Conf. Multimedia and Expo, 2018.
  • [35] J. Liu, S. Xia, W. Yang, M. Li, and D. Liu, “One-for-all: Grouped variation network-based fractional interpolation in video coding,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2140–2151, 2019.
  • [36] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, “Enhanced CTU-level inter prediction with deep frame rate up-conversion for high efficiency video coding,” in Proc. IEEE Int’l Conf. Image Processing, 2018.
  • [37] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhancement with task-oriented flow,” arXiv preprint arXiv:1711.09078, 2017.
  • [38] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int’l Conf. Learning Representations, 2015.