Video content accounts for more than 70% of Internet traffic, and its explosive growth makes transmission and storage a major challenge. Researchers and engineers continuously pursue next-generation, high-efficiency video compression to enable wider applications and larger market adoption. Conventional video compression approaches have followed the hybrid coding framework for decades, with hand-crafted tools for individual components. Despite the great success of H.264/AVC and H.265/HEVC, such a framework is difficult to optimize jointly in an end-to-end manner, especially when exploring the efficiency of inter coding tools.
Image compression methods based on machine learning have shown great superiority in coding efficiency for spatial redundancy removal, compared with conventional codecs. These methods benefit from non-linear transforms, deep neural network (DNN) based conditional entropy models, and joint rate-distortion optimization (RDO), under an end-to-end learning strategy. Learned video compression can be extended from image compression by further exploiting the temporal redundancy or correlation.
We propose an end-to-end video compression framework using joint spatial-temporal priors to generate compact latent feature representations for intra texture, inter motion and sparse inter residual signals. Intra textures are well represented by spatial priors for both reconstruction and entropy context modeling using a variational autoencoder (VAE) structure [19, 15]. We directly use the NLAIC method for intra texture compression because of its state-of-the-art efficiency, and primarily investigate learned inter coding, with a focus on efficient temporal motion representation, in this paper.
We represent temporal information or correlation using both its first-order and second-order statistics. The first-order temporal information refers to the motion fields (e.g., intensity, orientation) between consecutive frames. Motion fields can be described by either optical flow or block-based motion vectors. Here, we suggest a one-stage unsupervised flow learning approach, where first-order motions are quantized temporal features learned directly from the consecutive frames. Our unsupervised flow learning does not rely on a well pre-trained optical flow estimation network, such as FlowNet2 [11, 28], and can derive the compressed optical flow directly from quantized features.
The second-order temporal information is flow-to-flow correlations, describing the object acceleration. Flow can be further predicted for energy reduction. Thus, we fuse priors from spatial, temporal and hyper information to predict flow elements and compress the predictive difference. It turns out that this can be realized by entropy coding with adaptive contexts conditioned on fused priors.
The inter residual is the difference between the original frame and the flow-warped prediction. We directly reuse the intra texture coding network for residual coding. It is worth pointing out that VAE structures are applied for all components (e.g., intra, inter, residual) in this paper.
1) We propose an end-to-end video compression method that offers state-of-the-art performance compared with traditional video codecs and recent learned approaches, with consistent gains across a variety of common test sequences;
2) Highly efficient inter coding is achieved by representing temporal correlation using both first-order optical flow and second-order flow predictive difference;
3) First-order flow is obtained using one-stage unsupervised learning and represented by quantized features derived from consecutive frames;
Learned Image Compression
DNN based image compression approaches generally rely on autoencoders. Recurrent autoencoders were first proposed to progressively encode bits for image compression. In recent years, convolutional autoencoders have been studied extensively, including non-linear transforms (e.g., generalized divisive normalization and non-local attention transforms (NLAIC)), differentiable quantization (e.g., soft-to-hard quantization and uniform noise approximation), and adaptive entropy models using Bayesian generative rules (e.g., PixelCNNs and variational autoencoders). RDO is applied by minimizing the Lagrangian cost $J = R + \lambda D$ in end-to-end training. Here, $R$ is the entropy rate, and $D$ is the distortion measured by either mean squared error (MSE) or multiscale structural similarity (MS-SSIM).
These approaches have demonstrated better coding efficiency than traditional image coders, both objectively and subjectively. In addition, extreme compression is under exploration with adversarial training (e.g., conditional GANs, multi-scale discriminators) for satisfactory subjective quality at very low bit rates.
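The rate-distortion trade-off can be illustrated with a toy example. The sketch below assumes the common Lagrangian form J = R + λD with MSE as the distortion; the candidate rates and reconstructions are made-up numbers, and `pick_best` is a hypothetical helper, not part of any codec.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rd_cost(rate_bits, distortion, lam):
    """Lagrangian rate-distortion cost J = R + lambda * D (assumed form)."""
    return rate_bits + lam * distortion


def pick_best(original, candidates, lam):
    """Return the (rate, reconstruction) candidate with the lowest cost."""
    return min(candidates, key=lambda c: rd_cost(c[0], mse(original, c[1]), lam))


original = [10.0, 12.0, 11.0, 9.0]
candidates = [
    (8.0, [10.0, 10.0, 10.0, 10.0]),   # cheap but coarse (made-up numbers)
    (20.0, [10.0, 12.0, 11.0, 9.0]),   # expensive but exact (made-up numbers)
]
cheap = pick_best(original, candidates, lam=1.0)     # small lambda favors low rate
exact = pick_best(original, candidates, lam=100.0)   # large lambda favors low distortion
```

A small λ selects the low-rate candidate, while a large λ selects the low-distortion one; in end-to-end training the same cost is minimized by gradient descent rather than by enumeration.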
Learned Video Compression
DeepCoder was first proposed with DNNs used for intra texture and inter residual coding, while block motions were obtained using traditional motion estimation for temporal information representation. Inspired by temporal interpolation and prediction [22, 21, 10], an RNN based video compression framework through frame interpolation was introduced, offering performance comparable to H.264/AVC. However, interpolation typically brings structural delay. Recently, unsupervised flow estimation methods [12, 17] have been introduced that use end-to-end learning to predict optical flow between two frames without ground-truth flow. Their brightness constancy, motion smoothness and bidirectional census losses have proven effective for flow generation. Robust and reliable optical flow derivation methods have also emerged, such as FlowNet2 and PWC-Net. FlowNet2 stacks multiple simple FlowNets and applies small displacements to correct the flow. PWC-Net extends traditional pyramid flow reconstruction rules to learn and generate precise flow faster.
Later, block based motion estimation was replaced with a pre-trained FlowNet2 followed by a cascaded autoencoder for flow compression. A U-Net-like processing network was used to enhance the quality of the predicted frame. In addition, existing learned image compression models were directly reused for intra and residual coding. The entire framework is called DVC. DVC outperformed H.265/HEVC mainly at high bit rates, but its coding efficiency dropped unexpectedly at low bit rates, as reported.
All these attempts in learned video compression were trying to represent the temporal information more efficiently.
For the second frame $X_2$, we first learn the first-order flow for the predicted frame $\hat{X}_2^p$ using the reconstructed first frame $\hat{X}_1$. The corresponding residual $r_2 = X_2 - \hat{X}_2^p$ is encoded using residual coding, which shares the same architecture as the NLAIC-based intra coding, and reconstructed as $\hat{r}_2$. The first-order flow is represented using quantized features that are entropy-coded with adaptive contexts conditioned on priors, exploiting the second-order temporal correlation. The final reconstruction is given by $\hat{X}_2 = \hat{X}_2^p + \hat{r}_2$. Subsequent frames follow the same process as $X_2$.
Adaptive entropy models for rate estimation and arithmetic encoding are embedded for intra, inter and residual components. Thanks to the VAE structures, intra and residual coding use the joint autoregressive spatial and hyper priors for the probability estimation; and inter coding applies priors from spatial, hyper and temporal information.
We directly apply the NLAIC approach for our intra coding. Its state-of-the-art coding efficiency in image compression comes from the non-local attention transforms that are embedded in both the main and hyper encoder-decoder networks in Fig. 3. Note that the main network is used to obtain the reconstructed frame and the hyperprior network is used for context modeling of adaptive entropy coding.
In the NLAIC method, non-local attention modules (NLAM) are embedded to capture joint local and global correlations for both reconstruction and context probability modeling, inheriting the advantages of both nonlocal processing and the attention mechanism. NLAM applies joint spatial-channel attention masks for more compact feature representation. Masked 3D convolutions are used to fuse hyperpriors and autoregressive priors for accurate context estimation in adaptive entropy coding,

$$(\mu_i, \sigma_i) = \mathbb{F}(\hat{x}_1, \ldots, \hat{x}_{i-1}, \hat{z}_t). \tag{1}$$

Here $\mathbb{F}$ represents cascaded 3D $1\times1\times1$ convolutions that fuse the priors, $\hat{x}_1, \ldots, \hat{x}_{i-1}$ denote the causal (and possibly previously reconstructed) pixels prior to the current pixel $\hat{x}_i$, obtained by a 3D $5\times5\times5$ masked convolution, and $\hat{z}_t$ are the hyperpriors. The probability of each pixel symbol in $\hat{x}$ can be simply derived using

$$p(\hat{x}_i \mid \hat{x}_1, \ldots, \hat{x}_{i-1}, \hat{z}_t) = \left(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\hat{x}_i), \tag{2}$$

with a Gaussian distribution assumption with mean ($\mu_i$) and variance ($\sigma_i^2$).
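To make the Gaussian entropy model concrete, the following sketch estimates the code length of integer symbols from a predicted mean and standard deviation. It assumes the usual construction in learned compression, where the probability mass of a quantized symbol is the Gaussian density integrated over a unit-width bin; the means and deviations here are made-up values standing in for the outputs of a context network.

```python
import math


def gaussian_cdf(x, mean, sigma):
    """Cumulative distribution function of N(mean, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def symbol_probability(symbol, mean, sigma):
    """Probability mass of an integer symbol: the Gaussian integrated
    over the bin [symbol - 0.5, symbol + 0.5]."""
    return gaussian_cdf(symbol + 0.5, mean, sigma) - gaussian_cdf(symbol - 0.5, mean, sigma)


def estimated_bits(symbols, means, sigmas, eps=1e-9):
    """Sum of -log2 p over all symbols, i.e. the estimated code length."""
    return sum(
        -math.log2(max(symbol_probability(s, m, sd), eps))
        for s, m, sd in zip(symbols, means, sigmas)
    )


symbols = [3, -1, 0, 2]
# Accurate, confident context (made-up values): means close to the symbols.
good_context = estimated_bits(symbols, [2.8, -1.1, 0.2, 2.1], [0.5] * 4)
# Uninformative context: zero mean, wide spread.
poor_context = estimated_bits(symbols, [0.0] * 4, [2.0] * 4)
```

The better the predicted mean and variance match the actual symbols, the fewer bits the arithmetic coder needs, which is why richer priors translate directly into rate savings.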
Video coding performance relies heavily on an efficient representation of temporal information. This is twofold: one needs the most accurate first-order flow for compensation, and one needs to devise second-order statistics for flow prediction.
One-stage Unsupervised Flow Learning
Previous work obtained the decoded optical flow using the typical two-stage method shown in Fig. 4. It relied on a well pre-trained flow network to generate an uncompressed optical flow, which was then compressed using a cascaded autoencoder. In our work, by contrast, we use the quantized features between consecutive frames as compact motion representations and directly decode the compressed features for subsequent compensation. There is no need to explicitly derive a raw, uncompressed flow in the encoding process with supervised guidance, as in FlowNet2 and PWC-Net. Thus, it is a one-stage unsupervised flow learning and compensation approach, as shown in Fig. 4.
We concatenate two consecutive frames as the input for feature fusion. The network consists of stacked NLAMs (nonlocal attention is used to capture local and global correlations) and downsampling layers (e.g., $5\times5$ convolutions with stride 2), generating the fused features. Quantization is then applied to obtain the quantized features for entropy coding. The decoder mirrors the feature fusion network with stacked NLAMs and upsampling (e.g., $5\times5$ deconvolutions with stride 2), and derives the decoded flow (at a size of $H \times W \times 2$ for separable horizontal and vertical orientations) for compensation. Here, $H$ and $W$ denote the height and width of the original frame, respectively.
To avoid quantization-induced motion noise, we first pre-train the network with uncompressed consecutive frames $X_{t-1}$ and $X_t$. Then we replace $X_{t-1}$ with its decoded counterpart $\hat{X}_{t-1}$, as described in Eq. (3) and (4). Note that in practice only $\hat{X}_{t-1}$, not $X_{t-1}$, is available for inter coding. We directly use the decoded flow for end-to-end training and do not need an explicit flow at encoding (i.e., it is implicitly represented by quantized features).
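A compressed representation of the flow $\hat{f}_t$, i.e., the quantized features $\hat{\mathcal{F}}_t$, is encoded into the bitstream for delivery. $\hat{f}_t$ is then used to warp the reference frame $\hat{X}_{t-1}$ to obtain $\hat{X}_t^p$ for compensation, i.e.,

$$\hat{\mathcal{F}}_t = \mathcal{E}_f(X_t, \hat{X}_{t-1}), \tag{3}$$

$$\hat{X}_t^p = \mathrm{warping}(\hat{X}_{t-1}, \mathcal{D}_f(\hat{\mathcal{F}}_t)). \tag{4}$$

Here $\mathcal{E}_f$ and $\mathcal{D}_f$ represent the feature fusion network with quantization and the decoder network, respectively. Note that $\hat{f}_t = \mathcal{D}_f(\hat{\mathcal{F}}_t)$.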
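The warping step used for compensation can be sketched in a few lines. This is a minimal illustration, assuming backward warping with bilinear interpolation and border clamping, which is a common choice but is not confirmed by the text; the flow is stored as per-pixel `(dx, dy)` pairs, mirroring the two-channel horizontal and vertical layout. The function names are hypothetical.

```python
def bilinear_sample(frame, y, x):
    """Sample a 2-D frame (list of rows) at a fractional position,
    clamping coordinates to the frame border."""
    h, w = len(frame), len(frame[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * frame[y0][x0] + wx * frame[y0][x1]
    bottom = (1 - wx) * frame[y1][x0] + wx * frame[y1][x1]
    return (1 - wy) * top + wy * bottom


def warp(reference, flow):
    """Backward-warp `reference` with a per-pixel flow of (dx, dy) pairs:
    predicted[y][x] = reference[y + dy][x + dx]."""
    h, w = len(reference), len(reference[0])
    return [
        [bilinear_sample(reference, y + flow[y][x][1], x + flow[y][x][0])
         for x in range(w)]
        for y in range(h)
    ]


reference = [[0.0, 1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0, 7.0]]
# Every pixel looks one sample to its left in the reference frame.
flow = [[(-1.0, 0.0)] * 4 for _ in range(2)]
predicted = warp(reference, flow)
```

Because bilinear sampling is differentiable with respect to the flow values, a warping layer of this kind lets the prediction error propagate back into the flow decoder during end-to-end training.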
Context Adaptive Flow Compression
The previous section introduced one-stage unsupervised flow learning, targeting an accurate first-order motion representation for compensation, where flow is represented implicitly using quantized features $\hat{\mathcal{F}}_t$. An efficient representation of $\hat{\mathcal{F}}_t$ is highly desirable.
Generally, as shown in Fig. 1, there is not only the first-order (frame-to-frame) correlation that can be exploited by the flow, but also a second-order correlation, both spatially and temporally. Predicting flow efficiently could lead to much better compression performance. Ideally, a flow element can be estimated from its spatial neighbors, its temporally co-located element, and hyper priors for energy compaction. The dual of such neighbor-based flow prediction is entropy coding of flow elements using adaptive contexts.
Adaptive context modeling can leverage priors from spatial autoregressive neighbors, temporal information and hyper information, as shown in Fig. 5. For the spatial autoregressive prior, we propose to apply 3D masked convolutions on the quantized features $\hat{\mathcal{F}}_t$; for the hyper priors, a hyper decoder is used to decode the corresponding information. Hyperpriors are widely used in VAE structured compression approaches. Note that temporal correlations are exhibited in video sequences. Instead of relying only on the pixel domain frame buffer of traditional video codecs, we propose to embed and propagate the flow representation in a frame-recurrent way using a ConvLSTM, which is also referred to as the temporal prior buffer. Temporal priors have the same dimension as the current quantized features $\hat{\mathcal{F}}_t$.
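These priors are fused together for context adaptive flow coding, i.e.,

$$(\mu_{\mathcal{F}}, \sigma_{\mathcal{F}}) = \mathbb{F}(\hat{\mathcal{F}}_1, \ldots, \hat{\mathcal{F}}_{i-1}, \hat{z}_t, h_{t-1}), \tag{5}$$

$$p(\hat{\mathcal{F}}_i \mid \hat{\mathcal{F}}_1, \ldots, \hat{\mathcal{F}}_{i-1}, \hat{z}_t, h_{t-1}) = \left(\mathcal{N}(\mu_{\mathcal{F}}, \sigma_{\mathcal{F}}^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\hat{\mathcal{F}}_i). \tag{6}$$

Here $\hat{\mathcal{F}}_i$, $i = 1, 2, \ldots$, are elements of the quantized features $\hat{\mathcal{F}}_t$ for implicit flow representation, $\hat{z}_t$ are the hyperpriors, and $h_{t-1}$ is the aggregated temporal prior from previous flow representations, which is updated using a standard ConvLSTM:

$$(h_t, c_t) = \mathrm{ConvLSTM}(\hat{\mathcal{F}}_t, h_{t-1}, c_{t-1}), \tag{7}$$

where $h_t$ and $c_t$ are the updated states at time $t$, prepared for the next time step, with $c_t$ as a memory gate.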
For the sake of simplicity, we encode the residual signals using networks identical to those of the NLAIC-based intra coding. The residual $r_t$ is obtained by $r_t = X_t - \hat{X}_t^p$. Here, we do not calculate the loss between $r_t$ and its reconstruction $\hat{r}_t$, but directly target the overall reconstruction loss when optimizing the residual coding.
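The closed-loop relationship between prediction, residual and reconstruction can be illustrated with a toy example. In this sketch a uniform scalar quantizer stands in for the learned residual codec (an assumption made purely for illustration), and `code_inter_frame` is a hypothetical helper.

```python
def quantize(values, step):
    """Uniform scalar quantization, standing in for the learned residual codec."""
    return [round(v / step) * step for v in values]


def code_inter_frame(original, prediction, step):
    """Form the residual, code it lossily, and rebuild the frame as the decoder would."""
    residual = [o - p for o, p in zip(original, prediction)]
    decoded_residual = quantize(residual, step)
    reconstruction = [p + r for p, r in zip(prediction, decoded_residual)]
    return residual, decoded_residual, reconstruction


original = [10.0, 12.5, 11.0, 9.0]
prediction = [9.6, 12.0, 11.3, 9.1]   # e.g. a flow-warped reference frame (made-up values)
residual, decoded_residual, reconstruction = code_inter_frame(original, prediction, step=0.5)
max_error = max(abs(o - r) for o, r in zip(original, reconstruction))
```

Since the reconstruction is the prediction plus the decoded residual, the error of the final frame equals the error of the decoded residual, so a loss placed on the final reconstruction also supervises the residual codec.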
We proceed to the details about training strategy and evaluation in this section. More ablation studies are given to verify the effectiveness of our work.
Training & Testing Datasets. We choose the COCO dataset to pre-train the NLAIC-based intra coding network. We then jointly train the video compression framework on Vimeo 90k, a widely used dataset for low-level video processing tasks. All images from these datasets are randomly cropped into $192\times192$ patches for training.
We evaluate NLAIC on the Kodak dataset in ablation studies to understand the efficiency of nonlocal attention transforms. We then evaluate our video compression approach on the standard HEVC dataset and the ultra video group (UVG) dataset, which cover different classes, resolutions and frame rates.
Loss Function & Training Strategy. It is difficult to train multiple networks from scratch in one shot. Thus, we first pre-train the intra coding network and the flow learning and coding networks, and then jointly train them, starting from the pre-trained models, for an overall optimization, i.e.,

$$L = \lambda_1 \cdot D_1(\hat{X}, X) + \lambda_2 \cdot D_2(\hat{X}^p, X) + R_s + R_t, \tag{8}$$

where $D_1$ is measured using MS-SSIM, and $D_2$ is the warping loss evaluated using the $\ell_1$ norm and total variation loss. $R_s$ represents the bit rate of the intra frame and $R_t$ is the bit rate of inter frames, including bits for residual and flow. Currently, $\lambda_1$ and $\lambda_2$ are adapted according to the specified overall bit rate and the bit consumption percentage of flow information in inter coding.
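A warping loss that combines an l1 term with total variation can be sketched as follows. This is only an illustration under stated assumptions: the total variation is taken to act on the two flow components as a smoothness penalty, and `tv_weight` is a hypothetical weighting that the text does not specify.

```python
def l1_loss(predicted, target):
    """Mean absolute difference between two 2-D frames."""
    h, w = len(target), len(target[0])
    return sum(abs(predicted[y][x] - target[y][x])
               for y in range(h) for x in range(w)) / (h * w)


def total_variation(field):
    """Anisotropic total variation: sum of absolute differences between
    horizontally and vertically adjacent samples of a 2-D field."""
    h, w = len(field), len(field[0])
    horizontal = sum(abs(field[y][x + 1] - field[y][x])
                     for y in range(h) for x in range(w - 1))
    vertical = sum(abs(field[y + 1][x] - field[y][x])
                   for y in range(h - 1) for x in range(w))
    return horizontal + vertical


def warping_loss(predicted, target, flow_x, flow_y, tv_weight=0.1):
    """l1 prediction error plus a smoothness penalty on both flow components.
    tv_weight is a hypothetical value chosen for illustration."""
    smoothness = total_variation(flow_x) + total_variation(flow_y)
    return l1_loss(predicted, target) + tv_weight * smoothness


target = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[1.0, 2.5], [3.0, 3.5]]
smooth_flow = [[1.0, 1.0], [1.0, 1.0]]
noisy_flow = [[1.0, -1.0], [-1.0, 1.0]]
loss_smooth = warping_loss(predicted, target, smooth_flow, smooth_flow)
loss_noisy = warping_loss(predicted, target, noisy_flow, smooth_flow)
```

With identical prediction error, the noisy flow field is penalized more heavily, which reflects the motivation for a smoothness term in unsupervised flow learning.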
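Besides, the entropy rate loss is approximated by the conditional probability as in Eq. (9), with the main payload for context adaptive feature elements, and payload overhead for image height ($H$), width ($W$), number of frames and GOP length, e.g.,

$$R = \sum_i -\log_2 p_i + R_{\text{overhead}}, \tag{9}$$

where $p_i$ is the conditional probability of the $i$-th quantized feature element given its adaptive context.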
Note that bit consumption for “overhead” is less than 0.1% of the entire compressed bitstream, according to our extensive simulations.
To balance the efficiency of temporal information learning against training memory consumption, we use 5 frames to train the video compression framework and share the weights for subsequent frames. The initial learning rate (LR) is set to 10e-4 and is halved every 10 epochs. The final models are obtained using an LR of 10e-5. We apply distributed training on 4 GPUs (Titan Xp) for 5 days.
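The step-decay schedule can be written as a small helper. This sketch assumes that "10e-4" and "10e-5" denote $10^{-4}$ and $10^{-5}$, and that the final learning rate acts as a lower bound on the decay; both are interpretations rather than details confirmed by the text, and `learning_rate` is a hypothetical function.

```python
def learning_rate(epoch, initial_lr, halve_every=10, floor_lr=None):
    """Step decay: halve the learning rate every `halve_every` epochs,
    optionally never dropping below `floor_lr`."""
    lr = initial_lr * (0.5 ** (epoch // halve_every))
    if floor_lr is not None:
        lr = max(lr, floor_lr)
    return lr


# Assumed values: initial LR 1e-4, floor 1e-5 (see the caveats above).
schedule = [learning_rate(e, initial_lr=1e-4, floor_lr=1e-5) for e in (0, 9, 10, 25, 100)]
```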
Evaluation Criteria. For a fair comparison, we apply the same settings as DVC for our method and for the traditional H.264/AVC and HEVC codecs. We use a GOP of 10 and encode 100 frames on the HEVC test sequences, and use a GOP of 12 with 600 frames on the UVG dataset. The reconstruction quality is measured in the RGB domain using MS-SSIM. Bits per pixel (bpp) is used as the bit rate measure, which can be easily translated to kbps by scaling it with $\frac{H \times W \times N}{1000 \times T}$, where $N$ is the number of frames and $T$ is the video duration in seconds.
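The conversion from bits per pixel to kbps is simple arithmetic. The sketch below assumes that bpp is averaged over all pixels of all coded frames; `bpp_to_kbps` and the example numbers are hypothetical.

```python
def bpp_to_kbps(bpp, height, width, num_frames, duration_s):
    """Convert average bits per pixel into kilobits per second."""
    total_bits = bpp * height * width * num_frames
    return total_bits / duration_s / 1000.0


# Hypothetical example: 600 frames of 1920x1080 video lasting 5 seconds at 0.1 bpp.
rate = bpp_to_kbps(0.1, 1080, 1920, 600, 5.0)
```

For this example the result is roughly 24,883 kbps, i.e. about 24.9 Mbps.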
Rate-distortion Performance. Our approach outperforms all the existing methods, as shown in Fig. 6. Here, the distortion is measured by MS-SSIM, which has proven to be more relevant to the human visual system and is widely used in learned compression methodologies.
To the best of our knowledge, our work is the first end-to-end method that outperforms H.265/HEVC consistently across a variety of bit rates for all test sequences. In contrast, the interpolation-based algorithm only achieves performance similar to H.264/AVC. DVC improves on it with better coding efficiency than HEVC at high bit rates. However, DVC exhibits a sharp performance drop at low bit rates (e.g., at some rates its performance is even worse than H.264/AVC). We have also observed that DVC's performance varies across test sequences. Our approach, by contrast, shows consistent gains across contents and bit rates, leading to the conclusion that our model generalizes better for practical applications.
We use H.264/AVC as the anchor for BD-rate calculation, as shown in Table 1. Our approach reports 50.36% and 51.67% BD-rate reduction over H.264/AVC on the HEVC test sequences and the UVG dataset, respectively, a significantly larger margin than that achieved by HEVC or DVC over H.264/AVC.
Visual Comparison. We provide a visual quality comparison with H.264/AVC and H.265/HEVC in Fig. 8.
Traditional codecs usually suffer from blocky artifacts, especially at low bit rates, because of their block based coding strategy. Our results eliminate this phenomenon and provide more visually satisfying reconstructed frames. Meanwhile, we need fewer bits for similar visual quality.
Non-local Attention Transforms. Most existing image compression methods apply Generalized Divisive Normalization (GDN) as the non-linear transform to de-correlate spatial-channel redundancy [19, 4]. The alternative non-local attention transform uses NLAM to capture both local and global correlations, leading to state-of-the-art efficiency, as reported. The GDN-based algorithm ranks second in coding efficiency. Both methods apply VAE structures. Fig. 7 examines the efficiency of NLAM, revealing that performance close to that of the second-ranked method can be retained even when all nonlocal operations are removed.
Second-order Flow Correlations. In Fig. 1, we have shown that the second-order motion representations imply further temporal redundancy between optical flows. Thus, we introduce a recurrent state (e.g., ConvLSTM) to aggregate temporal priors for inter coding, which can effectively reduce the bits for flow compression.
Temporal priors are fused with autoregressive and hyper priors to improve the context modeling of flow elements. We then use the ConvLSTM to combine the temporal priors with the current quantized features, updating the priors in a recurrent way. Flow prediction provides an effective means to exploit the redundancy between complex motion behaviors. Fig. 9 shows the efficiency variation when the ConvLSTM is removed from our framework, i.e., when temporal correlations are not exploited; a 2% to 10% quality loss is observed.
Generally, the bits consumed by motion information vary across content and bit rates, and so does their share of the total bits. More bit saving is observed at low bit rates and for motion intensive content. For stationary content, such as HEVC Class E, spatial and hyper priors already give a good reference, so temporal priors are used less.
Conclusions & Future Work
In this paper, we present an end-to-end video compression framework that fully exploits spatial and temporal redundancies. The key novelty lies in the accurate motion representation for exploiting temporal correlation, via both first-order optical flow learning and second-order flow predictive coding. A one-stage unsupervised flow learning is applied, with implicit flow representation using quantized features. These features are then compressed using joint spatial-temporal priors on which the probability model is adaptively conditioned.
We evaluate our method and report state-of-the-art performance among existing video compression approaches, including traditional H.264/AVC, H.265/HEVC, and learning-based DVC. Our approach offers consistent gains over existing methods across a variety of contents and bit rates.
As for future study, an interesting topic is to devise implicit flow without actual bit consumption, such as decoder-side flow derivation, or frame interpolation and extrapolation. Currently, the residual shares the same network with intra coding, which may be worth deeper investigation for network simplification. It is also important to generalize the whole system to more complex video data sets such as spectral video and 3D video [20, 5].
This work was supported in part by the National Natural Science Foundation of China under Grant 61571215.
- (2018) Generative adversarial networks for extreme learned image compression. arXiv preprint arXiv:1804.02958.
- (2016) End-to-end optimized image compression. arXiv preprint arXiv:1611.01704.
- (2018) Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436.
- (2018) Efficient nonlinear transforms for lossy image compression. arXiv preprint arXiv:1802.00847.
- (2011) Semi-automatic 2d-to-3d conversion using disparity propagation. IEEE Transactions on Broadcasting 57 (2), pp. 491–499.
- (2016) Computational snapshot multispectral cameras: toward dynamic capture of the spectral world. IEEE Signal Processing Magazine 33 (5), pp. 95–108.
- (2017) DeepCoder: a deep neural network based video compression. In Visual Communications and Image Processing (VCIP), 2017 IEEE, pp. 1–4.
- Neural inter-frame compression for video coding. Proceedings of the IEEE International Conference on Computer Vision, pp. 6421–6429.
- (2019) Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042.
- Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 235–243.
- Flownet 2.0: evolution of optical flow estimation with deep networks. 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp. 1647–1655.
- (2016) Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In European Conference on Computer Vision, pp. 3–10.
- (2017) Learning convolutional networks for content-weighted image compression. arXiv preprint arXiv:1703.10553.
- (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755.
- (2019) Non-local attention optimized deep image compression. arXiv preprint arXiv:1904.09757.
- (2019-06) DVC: an end-to-end deep video compression framework. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- UnFlow: unsupervised learning of optical flow with a bidirectional census loss. Thirty-Second AAAI Conference on Artificial Intelligence.
- (2018) Conditional probability models for deep image compression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 3.
- (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10794–10803.
- (2013) 3D high-efficiency video coding for multi-view video and depth data. IEEE Transactions on Image Processing 22 (9), pp. 3366–3378.
- (2018) Context-aware synthesis for video frame interpolation. arXiv preprint arXiv:1803.10967.
- (2017) Video frame interpolation via adaptive convolution. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 3.
- (2016) arXiv preprint arXiv:1601.06759.
- (2017) Real-time adaptive image compression. arXiv preprint arXiv:1705.05823.
- (1998-11) Rate-distortion optimization for video compression. IEEE Signal Processing Magazine 15 (6), pp. 74–90.
- (2005-01) Video compression - from concepts to the h.264/avc standard. Proceedings of the IEEE 93 (1), pp. 18–31.
- (2012) Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on circuits and systems for video technology 22 (12), pp. 1649–1668.
- (2018) Pwc-net: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943.
- (2016) Full resolution image compression with recurrent neural networks. CoRR abs/1608.05148.
- (2003) Overview of the h. 264/avc video coding standard. IEEE Transactions on circuits and systems for video technology 13 (7), pp. 560–576.
- (2018) Video compression through image interpolation. In ECCV.
- (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810.
- (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision 127 (8), pp. 1106–1125.