
Flexible-Rate Learned Hierarchical Bi-Directional Video Compression With Motion Refinement and Frame-Level Bit Allocation

by Eren Cetin, et al.

This paper presents improvements and novel additions to our recent work on end-to-end optimized hierarchical bi-directional video compression to further advance the state-of-the-art in learned video compression. As an improvement, we combine motion estimation and prediction modules and compress refined residual motion vectors for improved rate-distortion performance. As a novel addition, we adapt the gain unit proposed for image compression to flexible-rate video compression in two ways: first, the gain unit enables a single encoder model to operate at multiple rate-distortion operating points; second, we exploit the gain unit to control bit allocation among intra-coded vs. bi-directionally coded frames by fine-tuning the corresponding models for truly flexible-rate learned video coding. Experimental results demonstrate that we obtain state-of-the-art rate-distortion performance exceeding that of all prior art in learned video coding.



1 Introduction

Following the pioneering work [balle2017endtoend] on variational autoencoder-based end-to-end optimized image compression, significant rate-distortion (R-D) performance improvements have been obtained to achieve state-of-the-art results in image compression by using better entropy modeling [balle2018variational, minnen_joint, variational_low, cheng2020image, minnen2020channelwise].

End-to-end learned video compression frameworks employ learned image compression models for intra-frame coding, and typically replace the functional blocks used in P and/or B-picture mode of standards-based video codecs, such as motion estimation, motion compensation, motion vector compression and residual compression, with their learned counterparts. Sub-networks corresponding to these blocks are jointly trained using a single R-D loss. One of the first low-latency video coding models is DVC [lu2019dvc], which performs optical flow-based backwarping for motion compensation and employs a post-processing network to alleviate motion compensation errors. Scale-space flow (SSF) [agustsson_scale] introduces the scale channel, which applies blurring to regions where flow estimates are unreliable. In contrast to DVC and SSF, which use a single past reference frame, RLVC [rlvc] and ELF-VC [elfvc] utilize a recurrent architecture to exploit long-term temporal correlations. A detailed review on end-to-end video compression frameworks can be found in [video_review].

This paper addresses the combination of bi-directional hierarchical video coding and flexible-rate coding, both of which are less studied in the learned compression community. The literature on these topics is reviewed in Section 2. The proposed method is introduced in Section 3. Our main contributions are to improve the compression efficiency of our previous work [lhbdc] by motion refinement and residual motion coding, and to employ a gain unit for flexible-rate coding using a single end-to-end optimized model. Experimental results are presented in Section 4. Finally, Section 5 concludes the paper.

2 Related work and Contributions

2.1 Bi-directional Video Compression

Hierarchical Learned Video Compression (HLVC) [hlvc] is a bi-directional codec that employs three hierarchical quality layers and a recurrent enhancement network. LHBDC [lhbdc], proposed concurrently, performs bi-directional motion compensation and obtains superior R-D performance compared to HLVC. Racape et al. [spie_bidir] later followed a similar approach with conditional convolutions for video compression in the YCbCr 4:2:0 color space. Ladune et al. [ladune2021conditional] implemented a conditional coder to process I, P and B frames using a single network, which ignores unavailable elements. In this paper, we extend our previous work [lhbdc] for improved performance and add flexible-rate functionality, as detailed in Section 2.3.

2.2 Flexible Rate Image/Video Compression

Most learned image/video compression frameworks require training separate models for different rate-distortion points, which increases training cost and memory requirements. As a promising solution to this problem, AG-VAE [Cui_2021_CVPR] proposed learned gain units that scale the latent representation in the channel dimension with gain vectors prior to quantization for flexible-rate image compression. After the gain vectors are learned for different bitrates, exponential interpolation is used to achieve continuous rate adaptation during inference. ELF-VC [elfvc] proposes a different flexible-rate framework that allows a single model to cover a large and dense range of bitrates at a negligible increase in computation and parameter count. We adopt the former approach in this paper and extend it to the video compression setting.

2.3 Contributions

Our previous work [lhbdc] presented an end-to-end optimized (for a particular R-D operating point) B-frame encoder, called LHBDC, which relied on a pre-trained optical flow estimation model. This paper proposes a truly end-to-end optimized flexible-rate video encoder with the following contributions:
1) The proposed new B-frame encoder model does not rely on a pretrained motion estimation network. Instead, we propose an autoencoder model for direct flow prediction and a flow refinement approach for flow field compression.
2) We adapt the gain unit [Cui_2021_CVPR] proposed for flexible-rate image coding to achieve a continuous R-D curve in video coding.
3) We fine-tune the bitrate allocation between intra-coded and bi-directionally coded frames to achieve a frame-level bitrate allocation scheme that yields superior results in the R-D sense. To the best of our knowledge, this is the first paper to exploit the gain unit for frame-level bitrate allocation.

3 Flexible-rate Bi-directional Video Compression with Motion Refinement

Figure 1: Overview of the proposed learned bi-directional video encoder network. The motion prediction and learned frame fusion modules have 5 and 4 layers of stacked convolution blocks (yellow layers), respectively, with leaky ReLUs as activation functions. The motion compression and residual compression modules utilize a variational autoencoder with residual blocks (orange layers) and gain units (green blocks) to learn quantization parameters while achieving low entropy latent representations.

We propose a network composed of four modules, depicted in Figure 1, for bi-directional B-frame compression, which does not depend on any pre-trained component, given a past and a future reference frame. We introduce motion prediction in Section 3.1, motion compression in Section 3.2, learned frame fusion in Section 3.3, residual compression in Section 3.4, and gain unit in Section 3.5.

The first frame of each group-of-pictures (GoP) is compressed as a keyframe using the model proposed by Minnen et al. [minnen_joint] without the context model, while the rest of the frames are compressed bi-directionally using the proposed network. Benefiting from the gain unit, we fine-tune the I-frame model and the proposed B-frame model jointly to achieve optimal frame-level bitrate allocation in the R-D sense.
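The hierarchical bi-directional coding order within a GoP can be sketched as follows. This is a minimal illustration, assuming a dyadic hierarchy in which each B-frame lies midway between two already-decoded frames and the keyframe of the next GoP serves as the future reference; the function name `hierarchical_order` and the level numbering are hypothetical and are not taken from the paper.

```python
def hierarchical_order(gop_size=16):
    """Return (frame, past_ref, future_ref, level) tuples in coding order.

    Assumption: frames 0 and gop_size are intra-coded keyframes, and every
    B-frame sits midway between two already-decoded reference frames.
    """
    order = []
    # Each pending interval is (past_ref, future_ref, hierarchical_level).
    pending = [(0, gop_size, 1)]
    while pending:
        past, future, level = pending.pop(0)
        if future - past < 2:
            continue  # no frame left between the two references
        mid = (past + future) // 2
        order.append((mid, past, future, level))
        # The newly decoded frame becomes a reference for both halves.
        pending.append((past, mid, level + 1))
        pending.append((mid, future, level + 1))
    return order


for frame, past, future, level in hierarchical_order(16)[:3]:
    print(f"frame {frame}: refs ({past}, {future}), level {level}")
```

Under these assumptions a GoP of size 16 contains 15 B-frames spread over four hierarchical levels, which is the granularity at which frame-level bit allocation would operate.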

3.1 Motion Prediction

We utilize the U-Net architecture with 5 layers, depicted in Figure 1 (upper left), to predict bi-directional motion vectors. This module takes the past and future decoded frames, $\hat{x}_p$ and $\hat{x}_f$, respectively, and yields predicted motion vectors, $\bar{v}_{t \to p}$ and $\bar{v}_{t \to f}$, from the current frame towards the past and future decoded frames. We then compute two predictions for the current frame by bilinearly backwarping the past decoded frame, $\hat{x}_p$, using $\bar{v}_{t \to p}$, and the future decoded frame, $\hat{x}_f$, using $\bar{v}_{t \to f}$, to generate the predicted current frames, $\bar{x}_t^{p}$ and $\bar{x}_t^{f}$, given by

$$\bar{x}_t^{p} = \mathcal{W}(\hat{x}_p, \bar{v}_{t \to p}), \tag{1}$$

$$\bar{x}_t^{f} = \mathcal{W}(\hat{x}_f, \bar{v}_{t \to f}), \tag{2}$$

where $\mathcal{W}$ denotes the backwarping operator.
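To make the backwarping step concrete, the sketch below bilinearly samples a reference frame at positions displaced by a motion field. It is a minimal single-channel illustration in plain Python, not the authors' implementation; the function name `backwarp`, the `(dx, dy)` flow layout and the border clamping are assumptions.

```python
import math


def backwarp(ref, flow):
    """Bilinearly sample `ref` at positions displaced by `flow`.

    ref:  H x W list of lists (one channel, for brevity)
    flow: H x W list of (dx, dy) pairs pointing from the current frame
          towards the reference frame
    """
    h, w = len(ref), len(ref[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Sampling position in the reference, clamped to the border
            # (the border handling is an assumption).
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            ax, ay = sx - x0, sy - y0
            # Interpolate horizontally on the two rows, then vertically.
            top = (1 - ax) * ref[y0][x0] + ax * ref[y0][x1]
            bottom = (1 - ax) * ref[y1][x0] + ax * ref[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bottom
    return out


reference = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
half_pixel_right = [[(0.5, 0.0)] * 4 for _ in range(2)]
warped = backwarp(reference, half_pixel_right)
```

In a PyTorch model the same operation would typically be expressed with a differentiable sampling routine so that gradients flow back to the motion prediction network.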

3.2 Motion Refinement and Compression

The motion compression module, depicted in Figure 1 (upper right), learns bi-directional residual motion vectors between the current frame and the two predicted current frames, one from the past and one from the future reference, and encodes them. We learn the residual motion vectors as the output of an autoencoder, similar to that used in the scale-space flow model [agustsson_scale]. This module employs residual blocks with $N$ filters each. The novelties compared to [agustsson_scale] are: i) we adapt the model in [agustsson_scale] to bi-directional motion compensation; ii) we estimate residual (refined) flow vectors, unlike [agustsson_scale], which estimates the flow field in one shot; iii) we do not use a scale channel, as occlusions are handled by the frame fusion module. To compress the residual motion vectors, we employ a hyperprior network similar to [minnen_joint] to learn entropy parameters.

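The inputs to this module are the predicted motion vectors, $\bar{v}_{t \to p}$ and $\bar{v}_{t \to f}$, the predicted current frames, $\bar{x}_t^{p}$ and $\bar{x}_t^{f}$, the decoded reference frames, $\hat{x}_p$ and $\hat{x}_f$, and the ground-truth current frame, $x_t$. The decoder part of the autoencoder generates bi-directional motion residual vectors, $\Delta\hat{v}_{t \to p}$ and $\Delta\hat{v}_{t \to f}$, as outputs. These residual motion vectors are added to the predicted motion vectors to compute the final bi-directional motion vectors given by

$$\hat{v}_{t \to p} = \bar{v}_{t \to p} + \Delta\hat{v}_{t \to p}, \tag{3}$$

$$\hat{v}_{t \to f} = \bar{v}_{t \to f} + \Delta\hat{v}_{t \to f}. \tag{4}$$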


3.3 Motion Compensation by Learned Frame Fusion

The refined bi-directional motion vectors (3)-(4) are used to backwarp the past and future reference frames, which yields forward and backward motion compensated frames of higher quality than those obtained with one-shot motion estimation. These frames are input to a learned mask generator network similar to the one used in [lhbdc].

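The learned frame fusion module, which is a U-Net with 4 layers, generates two masks, $m_p$ and $m_f$, with the same dimensions as the video frames. The inputs to the module are the motion compensated frames, $\tilde{x}_t^{p}$ and $\tilde{x}_t^{f}$, the refined motion vectors, $\hat{v}_{t \to p}$ and $\hat{v}_{t \to f}$, and the reference frames, $\hat{x}_p$ and $\hat{x}_f$. Given the masks, $m_p$ and $m_f$, a single motion compensated frame is computed as

$$\tilde{x}_t = m_p \odot \tilde{x}_t^{p} + m_f \odot \tilde{x}_t^{f}, \tag{5}$$

where $\odot$ denotes element-wise multiplication.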
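The mask-based fusion can be illustrated with a short sketch. It assumes the fused frame is a per-pixel weighted sum of the two motion compensated frames and that the two masks sum to one at every pixel; the function name `fuse` and the toy values are hypothetical.

```python
def fuse(pred_past, pred_future, mask_past, mask_future):
    """Blend two motion compensated frames with per-pixel masks."""
    h, w = len(pred_past), len(pred_past[0])
    return [
        [
            mask_past[y][x] * pred_past[y][x]
            + mask_future[y][x] * pred_future[y][x]
            for x in range(w)
        ]
        for y in range(h)
    ]


# Toy 1 x 3 frame: the last pixel is occluded in the past reference,
# so its mask gives all the weight to the future prediction.
from_past = [[10.0, 20.0, 0.0]]
from_future = [[12.0, 20.0, 30.0]]
mask_p = [[0.5, 0.5, 0.0]]
mask_f = [[0.5, 0.5, 1.0]]
fused = fuse(from_past, from_future, mask_p, mask_f)
```

Here the masks let the network fall back on a single reference wherever the other one is unreliable, which is how occlusions can be handled without a scale channel.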


3.4 Motion-compensated Residual Frame Compression

The motion-compensated residual frame, $r_t = x_t - \tilde{x}_t$, is computed by subtracting the motion compensated frame, $\tilde{x}_t$, from the ground-truth current frame, $x_t$, and is compressed by an autoencoder network that is similar to the motion compression module, with $N$ filters at each layer. After decoding the encoded latent representation of the residual frame, we compute the reconstructed current frame, $\hat{x}_t$, by adding the motion compensated frame, $\tilde{x}_t$, and the reconstructed residual frame, $\hat{r}_t$.

3.5 Gain Unit for Flexible-Rate Coding

The gained network architecture [Cui_2021_CVPR] enables a single model to operate at multiple R-D points by learning scaling parameters per latent channel for discrete rate-distortion levels during training. The learned scaling parameters for each discrete level are stored in vectors called “gain” and “inverse gain” units. As the latent representations for the motion residuals and frame residuals both have $N$ channels, we scale the latent representations with two separate vectors of length $N$ in the channel dimension. This way, we effectively change the quantization bin sizes before rounding the latent representation to the nearest integer at inference time. The inverse quantization operation is performed on the quantized latent representation using separate learned vectors of length $N$ at the decoder. We also employ gain and inverse gain units in the respective hyper-prior networks, which learn and compress the entropy parameters in a separate bottleneck. Using the gain units, we can control the bitrate allocated to motion and frame residuals separately. In addition, we can control the bitrates allocated to individual intra-coded and bi-directionally coded frames.
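The effect of a gain unit on the quantization bin size can be sketched as below. This is a simplified illustration, assuming the inverse gain is the exact reciprocal of the gain, whereas in the described method both vectors are learned; the function names and the numeric values are hypothetical.

```python
def encode_latent(latent, gain):
    """Scale each channel by its gain, then round to the nearest integer."""
    return [[round(v * g) for v in channel] for channel, g in zip(latent, gain)]


def decode_latent(symbols, inverse_gain):
    """Rescale the quantized symbols with the per-channel inverse gain."""
    return [[s * ig for s in channel] for channel, ig in zip(symbols, inverse_gain)]


# Two latent channels with three values each.
latent = [[0.4, 1.3, -2.2], [0.9, -0.6, 3.1]]

# A larger gain shrinks the effective bin size (1 / gain): finer
# reconstruction, but a wider symbol range to entropy-code.
fine = decode_latent(encode_latent(latent, [4.0, 4.0]), [0.25, 0.25])
coarse = decode_latent(encode_latent(latent, [1.0, 1.0]), [1.0, 1.0])
```

With a gain of 4 the reconstruction error stays within 0.125 per value, while a gain of 1 allows errors of up to 0.5, which is the mechanism that trades rate against distortion with a single set of network weights.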

4 Evaluation

The proposed network is implemented using the PyTorch [pytorch] framework and is employed to compress bi-directionally predicted frames. The convolution layers in the compression modules have $N$ channels, while the numbers of channels in the motion prediction and frame fusion modules are 64 and 32, respectively. The keyframes are compressed with the pretrained model [minnen_joint] from the CompressAI library [compressai]. The details of the training process are explained in Section 4.1. Experimental results and comparison with the state of the art are presented in Section 4.2.

4.1 Training Details

The training is performed on the Vimeo-90k dataset [vimeo]. The overall network shown in Figure 1 is trained jointly using a single R-D loss with 4 discrete R-D levels,

$$L = \sum_{n=1}^{4} \left( R_n + \lambda_n D_n \right), \tag{6}$$

where $R_n$ and $D_n$ denote the rate and distortion at level $n$, and $\lambda_n$ are the Lagrange multipliers. The gain and inverse gain vectors for both motion and frame residual compression modules as well as for their hyper-prior networks are paired with the respective Lagrange multipliers. Hence, we have 4 gain and 4 inverse gain vectors for each compression module as well as its hyper-prior network.
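A multi-level R-D objective of this kind can be sketched as follows, assuming the loss adds the rate and the Lagrangian-weighted distortion over the discrete levels. The multiplier values in `LAMBDAS` are placeholders chosen for illustration only, and the function name `rd_loss` is hypothetical.

```python
# Hypothetical Lagrange multipliers, one per R-D level (not the paper's values).
LAMBDAS = [0.002, 0.005, 0.012, 0.030]


def rd_loss(rates_bpp, distortions, lambdas=LAMBDAS):
    """Sum rate + lambda * distortion over the discrete R-D levels.

    rates_bpp:   estimated rate at each level (motion plus residual bits)
    distortions: distortion (e.g. MSE) of the reconstruction at each level
    """
    if not (len(rates_bpp) == len(distortions) == len(lambdas)):
        raise ValueError("one rate, distortion and multiplier per level")
    return sum(r + lam * d for r, d, lam in zip(rates_bpp, distortions, lambdas))


# Level n is evaluated with the n-th gain / inverse gain pair switched on.
loss = rd_loss([0.05, 0.09, 0.16, 0.30], [90.0, 55.0, 30.0, 15.0])
```

A larger multiplier penalizes distortion more heavily, so the gain vector paired with it is pushed towards finer quantization and a higher bitrate.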

For training, we crop random patches from the Vimeo-90k frames of size $448 \times 256$ for data augmentation. The mini-batch size is set to 4 and the model is optimized with respect to mean squared error using the Adam optimizer [adam] for 1M iterations. To prevent overshooting, we clip the norm of the gradients so that the maximum norm value is 1. The initial learning rate is set to 1e-4, and a learning rate scheduler halves the learning rate if there is no improvement for 100K iterations.
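The two stabilization measures, gradient-norm clipping and halving the learning rate on a plateau, can be sketched in plain Python as below. This is an illustration of the mechanics only, assuming the plateau check is made once per iteration; the names `clip_by_global_norm` and `HalveOnPlateau` are hypothetical, and an actual PyTorch training loop would rely on the framework's own utilities.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so that their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]


class HalveOnPlateau:
    """Halve the learning rate after `patience` steps without improvement."""

    def __init__(self, lr=1e-4, patience=100_000):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def step(self, loss):
        if loss < self.best:
            self.best, self.stale = loss, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= 0.5
                self.stale = 0
        return self.lr


clipped = clip_by_global_norm([3.0, 4.0])  # norm 5 is scaled down to about 1
scheduler = HalveOnPlateau(lr=1e-4, patience=100_000)
```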

Figure 2: Comparison of R-D performance of the proposed model with prior works in terms of PSNR.

Figure 3: Average percent BD-BR improvements (RGB bpp) for the proposed model and other video codecs vs. the anchor x265 encoder (veryslow preset) on the UVG dataset.

4.2 Experimental Results

The proposed model with GoP size 16 is evaluated on the UVG dataset [uvg]. The rate-distortion (R-D) performance of our model, i.e., the peak signal-to-noise ratio (PSNR) vs. bits per RGB pixel, is compared with those of other leading learned video compression models and traditional H.265/HEVC video codecs in Figure 2.
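For reference, the two axes of such an R-D plot can be computed as in the sketch below. It assumes 8-bit frames (peak value 255) and a rate normalized by the number of pixels in the sequence; the function names and the example numbers are hypothetical.

```python
import math


def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)


def bits_per_pixel(total_bits, width, height, num_frames):
    """Average number of coded bits per pixel over a sequence."""
    return total_bits / (width * height * num_frames)


# Example: a 1920 x 1080 sequence of 16 frames coded with 3,317,760 bits.
rate = bits_per_pixel(3_317_760, 1920, 1080, 16)
quality = psnr(mse=65.025)
```

Each operating point of a codec contributes one (rate, PSNR) pair, and the R-D curve connects these pairs across bitrates.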


R-D points for the proposed model are obtained by exponential interpolation between adjacent gain vectors. For example, to determine an R-D point between the gain vectors $g_n$ and $g_{n+1}$, we interpolate an intermediate gain vector, $g_v$, using

$$g_v = (g_n)^{l} \cdot (g_{n+1})^{1-l}, \tag{7}$$

where $l \in [0, 1]$ corresponds to the interpolation coefficient. As our model can achieve arbitrarily many R-D points using this process with varying $g_n$, $g_{n+1}$ and $l$, we searched for the bitrate allocation among hierarchical levels that yields the best R-D curve. Our experiments revealed that the best R-D curve was achieved by lowering the quality level of bi-directionally coded frames in steps of 0.33 per hierarchical level, so that $|l_{h+1} - l_h| = 0.33$, where $l_h$ is the interpolation coefficient of hierarchical level $h$.
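Exponential interpolation between two gain vectors amounts to a channel-wise weighted geometric mean, as in the sketch below. The function name `interpolate_gain`, the convention that a coefficient of 1 returns the first vector, and the numeric values are assumptions made for illustration.

```python
def interpolate_gain(gain_a, gain_b, coeff):
    """Channel-wise geometric blend: gain_a**coeff * gain_b**(1 - coeff)."""
    if not 0.0 <= coeff <= 1.0:
        raise ValueError("interpolation coefficient must lie in [0, 1]")
    return [(a ** coeff) * (b ** (1.0 - coeff)) for a, b in zip(gain_a, gain_b)]


# Two learned gain vectors for adjacent R-D levels (toy values, 3 channels).
gain_high = [4.0, 9.0, 16.0]
gain_low = [1.0, 1.0, 4.0]

# Sweeping the coefficient from 1 to 0 moves continuously from the first
# vector to the second, giving intermediate operating points.
midpoint = interpolate_gain(gain_high, gain_low, 0.5)
```

Because each frame can be coded with its own coefficient, the same mechanism provides a handle for assigning different quality levels to different hierarchical levels.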

When we compare the resulting R-D performance in terms of PSNR, our proposed network achieves superior results at higher bitrates and similar performance at lower bitrates compared to ELF-VC [elfvc]. It also clearly outperforms the other learned and traditional video codecs shown in Figure 2.

BD-BR improvements [bdrate] over other leading learned and standards-based video codecs are presented in Figure 3. The proposed model achieves a 43.99% BD-BR improvement over the anchor x265 encoder (veryslow preset), while the other codecs achieve lower gains.

Finally, ablation studies reveal the significance of the proposed motion prediction and learned frame fusion improvements on the R-D performance. The motion prediction module reduces the entropy of motion vectors to be encoded while the learned frame fusion alleviates occlusion artifacts. The improvements due to both modules are presented in Table 1.

Property               Option   BD-BR   Total # of Param
Motion Predictor       Yes      0%      35M
                       No       37%     23M (-34%)
Learned Frame Fusion   Yes      0%      35M
                       No       23%     32M (-9%)
Table 1: Ablation study on motion prediction and learned frame fusion modules on the UVG dataset.

5 Conclusion

We propose an end-to-end optimized flexible-rate hierarchical bi-directional video compression network that is trained once to operate at multiple bitrates. The resulting single learned model yields superior R-D performance in terms of PSNR vs. RGB bits/sec from very low to high bitrates.

The proposed network demonstrates improvements by employing motion prediction and learned frame fusion while achieving arbitrarily many R-D points using the gain and inverse gain units that scale latent representations in the channel dimension effectively. Compared to other learned and traditional video codecs, the proposed network also achieves the best BD-BR reduction on the UVG dataset.

To the best of our knowledge, this is the first study on frame-level bitrate allocation for learned codecs. We aim to further investigate optimal rate allocation among intra-coded and bi-directional coded frames jointly as future work.