Hierarchical Autoregressive Modeling for Neural Video Compression

Recent work by Marino et al. (2020) showed improved performance in sequential density estimation by combining masked autoregressive flows with hierarchical latent variable models. We draw a connection between such autoregressive generative models and the task of lossy video compression. Specifically, we view recent neural video compression methods (Lu et al., 2019; Yang et al., 2020b; Agustsson et al., 2020) as instances of a generalized stochastic temporal autoregressive transform, and propose avenues for enhancement based on this insight. Comprehensive evaluations on large-scale video data show improved rate-distortion performance over both state-of-the-art neural and conventional video compression methods.




1 Introduction

Recent advances in deep generative modeling have enabled a surge in applications, including learning-based compression. Generative models have already demonstrated empirical improvements in image compression, outperforming classical codecs (Minnen et al., 2018; Yang et al., 2020d), such as BPG (Bellard, 2014). In contrast, the less developed area of neural video compression remains challenging due to complex temporal dependencies operating at multiple scales. Nevertheless, recent neural video codecs have shown promising performance gains (Agustsson et al., 2020), in some cases on par with current hand-designed, classical codecs, e.g., HEVC. Compared to hand-designed codecs, learnable codecs are not limited to a specific data modality and offer a promising approach for streaming specialized content, such as sports or video chats. Therefore, improving neural video compression is vital for dealing with the ever-growing amount of video content being created.

Source compression fundamentally involves decorrelation, i.e., transforming input data into white noise distributions that can be easily modeled and entropy-coded. Thus, improving a model's capability to decorrelate data automatically improves its compression performance. Likewise, we can improve the associated entropy model (i.e., the model's prior) to capture any remaining dependencies. Just as compression techniques attempt to remove structure, generative models attempt to model structure. One family of models, autoregressive flows, maps between less structured distributions, e.g., uncorrelated noise, and more structured distributions, e.g., images or video (Dinh et al., 2014, 2016). The inverse mapping can remove dependencies in the data, making it more amenable to compression. Thus, a natural question to ask is how autoregressive flows can best be utilized in compression, and whether mechanisms in existing compression schemes can be interpreted as flows.

This paper draws on recent insights into hierarchical sequential latent variable models with autoregressive flows (Marino et al., 2020). In particular, we identify connections between this family of models and recently proposed neural video codecs based on motion estimation (Lu et al., 2019; Agustsson et al., 2020). By interpreting this technique as an instantiation of a type of autoregressive flow transform, we propose various alternatives and improvements based on insights from generative modeling.

In more detail, our main contributions are as follows:

  1. A new framework. We interpret existing video compression methods through the more general framework of generative modeling, variational inference, and autoregressive flows, allowing us to readily investigate extensions and ablations. In particular, we compare fully data-driven approaches with motion-estimation-based neural compression schemes. This framework also provides directions for future work.

  2. A new model. Our main proposed model is an improved version of the Scale-Space Flow (SSF) model (Agustsson et al., 2020). Following the classical predictive coding approach to video compression, this model uses motion estimation to predict the frame being compressed and further compresses the residual obtained by subtraction. We show that incorporating a learnable scaling transform improves performance. Augmenting the shift transform to a scale-then-shift transform is inspired by the performance gains obtained when extending NICE (Dinh et al., 2014) to RealNVP (Dinh et al., 2016). Structured priors can improve performance further.

  3. A new dataset. The neural video compression community lacks large, high-resolution benchmark datasets. While we performed extensive experiments on the publicly available Vimeo-90k dataset (Xue et al., 2019), we also collected and utilized a larger dataset, YouTube-NT (https://github.com/privateyoung/Youtube-NT), which will be made available through executable scripts. As previous works (Agustsson et al., 2020) did not provide their training datasets, YouTube-NT will be a useful resource for making further progress and comparing methods in this field.

2 Related Work

We divide relevant related work into three categories: neural image compression, neural video compression, and sequential generative models.

Neural Image Compression.

Considerable progress has been made by applying neural networks to image compression. Early works by Toderici et al. (2017) and Johnston et al. (2018) leveraged LSTMs to model spatial correlations of the pixels within an image. Theis et al. (2017) first proposed an autoencoder architecture for image compression and used the straight-through estimator (Bengio et al., 2013) for learning a discrete latent representation. The connection to probabilistic generative models was drawn by Ballé et al. (2017), who first applied variational autoencoders (VAEs) (Kingma and Welling, 2013) to image compression. In subsequent work, Ballé et al. (2018) encoded images with a two-level VAE architecture involving a scale hyper-prior, which can be further improved by autoregressive modeling (Minnen et al., 2018; Minnen and Singh, 2020) or by optimization at encoding time (Yang et al., 2020d). Yang et al. (2020e) and Flamich et al. (2019) demonstrated competitive image compression performance without a pre-defined quantization grid.

Neural Video Compression.

Compared to image compression, video compression is a significantly more challenging problem, as statistical redundancies exist not only within each video frame (exploited by intra-frame compression) but also along the temporal dimension. Early works by Wu et al. (2018), Djelouah et al. (2019), and Lombardo et al. (2019) performed video compression by predicting future frames using a recurrent neural network, whereas Chen et al. (2019) and Chen et al. (2017) used convolutional architectures within a traditional block-based motion estimation approach. These early approaches did not outperform the traditional H.264 codec and barely surpassed the MPEG-2 codec. Lu et al. (2019) adopted a hybrid architecture that combines a pre-trained Flownet (Dosovitskiy et al., 2015) with residual compression, which leads to an elaborate training scheme. Habibian et al. (2019) and Liu et al. (2019) combined 3D convolutions for dimensionality reduction with expressive autoregressive priors for better entropy modeling, at the expense of parallelism and runtime efficiency. Our method extends a low-latency model proposed by Agustsson et al. (2020), which allows for end-to-end training, efficient online encoding and decoding, and parallelism.

Sequential Deep Generative Models.

We drew inspiration from a body of work on sequential generative modeling. Early deep learning architectures for dynamics forecasting involved RNNs (Chung et al., 2015). Denton and Fergus (2018) and Babaeizadeh et al. (2018) used VAE-based stochastic models in conjunction with LSTMs to model dynamics. Yingzhen and Mandt (2018) introduced both local and global latent variables for learning disentangled representations in videos. Other video generation models used generative adversarial networks (GANs) (Vondrick et al., 2016; Lee et al., 2018) or autoregressive models and normalizing flows (Rezende and Mohamed, 2015; Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018; Kingma et al., 2016; Papamakarios et al., 2017). Recently, Marino et al. (2020) proposed to combine latent variable models with autoregressive flows for modeling dynamics at different levels of abstraction, which inspired our models and viewpoints.

3 Video Compression through Deep Autoregressive Modeling

We identify commonalities between hierarchical autoregressive flow models (Marino et al., 2020) and state-of-the-art neural video compression architectures (Agustsson et al., 2020), and will use this viewpoint to propose improvements on existing models.

3.1 Background

We first review VAE-based compression schemes (Ballé et al., 2017; Theis et al., 2017) and formulate existing low-latency video codecs in this framework; we then review autoregressive flows.

Generative Modeling and Source Compression.

Let x = x_{1:T} be a sequence of T video frames. Lossy compression seeks to find a compact description of x that simultaneously minimizes the description length R and the information loss D. The distortion D(x, x̂) measures how much reconstruction error accrues due to encoding x into a latent representation z̄ and subsequently decoding it back to x̂, while R measures the bit rate (file size). In learned compression methods (Ballé et al., 2017; Theis et al., 2017), the above process is parameterized by flexible functions f ("encoder") and g ("decoder") that map between the input video and its latent representation:

    z̄ = f(x),  x̂ = g(⌊z̄⌉),   (1)

and minimize a rate-distortion loss with hyperparameter β:

    L = D(x, x̂) + β R(⌊z̄⌉),

where ⌊·⌉ denotes rounding to the nearest integer. We adopt the end-to-end compression approach of Ballé et al. (2017); Lombardo et al. (2019), which approximates the rounding operation ⌊z̄⌉ by uniform noise injection, z = z̄ + u with u ~ U(−½, ½), to enable gradient-based optimization during training. With an appropriate choice of probability model p(z), the approximated version of the above R-D (rate-distortion) objective then corresponds to the VAE objective:

    L̃ = E_{q(z|x)} [ −log p(x|z) − log p(z) ].

In this model, the likelihood p(x|z) follows a Gaussian distribution with mean x̂ = g(z) and fixed diagonal covariance; the approximate posterior q(z|x) is chosen to be a unit-width uniform (thus zero-differential-entropy) distribution whose mean z̄ is predicted by an amortized inference network f(x). The prior density p(z) interpolates its discretized version, so that it measures the code length of the discretized z̄ after entropy coding.
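To make the noise-relaxed rate-distortion objective concrete, here is a minimal NumPy sketch. The linear encoder/decoder and the discretized logistic prior are hypothetical toy stand-ins for the learned networks and entropy model; the point is the switch between uniform noise injection at training time and actual rounding at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def rate_bits(z, scale=2.0):
    # Probability mass of each latent under a discretized logistic prior:
    # the continuous density integrated over a unit-width bin, which is what
    # entropy coding of the rounded latent would pay for.
    p = sigmoid((z + 0.5) / scale) - sigmoid((z - 0.5) / scale)
    return float(-np.log2(p + 1e-12).sum())

def rd_loss(x, W, beta=0.01, training=True):
    z_bar = x @ W                              # z_bar = f(x), a toy linear encoder
    if training:                               # uniform noise stands in for rounding
        z = z_bar + rng.uniform(-0.5, 0.5, size=z_bar.shape)
    else:                                      # actual quantization at test time
        z = np.round(z_bar)
    x_hat = z @ W.T                            # x_hat = g(z), a toy linear decoder
    D = float(((x - x_hat) ** 2).mean())       # distortion
    R = rate_bits(z)                           # rate
    return D + beta * R

x = rng.normal(size=(8, 16))
W = rng.normal(scale=0.1, size=(16, 4))
train_loss = rd_loss(x, W, training=True)
test_loss = rd_loss(x, W, training=False)
```

The training loss is differentiable in W (the noise is independent of the parameters), whereas the test-time loss is the one an actual entropy coder would realize.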

Low-Latency Sequential Compression.

We specialize Eq. 1 to make it suitable for low-latency video compression, widely used in both conventional and recent neural codecs (Rippel et al., 2019; Agustsson et al., 2020). To this end, we encode and decode individual frames in sequence. Given the ground truth current frame x_t and the previously reconstructed frames x̂_{<t}, the encoder is restricted to be of the form z̄_t = f(x_t, x̂_{<t}), and similarly the decoder computes each reconstruction sequentially based on previous reconstructions and the current encoding, x̂_t = g(⌊z̄_t⌉, x̂_{<t}). Existing codecs usually condition on a single reconstructed frame, substituting x̂_{<t} by x̂_{t−1} in favour of efficiency. In the language of variational inference, the sequential encoder corresponds to a variational posterior of the form q(z_t | x_t, z_{<t}), i.e., filtering, and the sequential decoder corresponds to the likelihood p(x_t | z_t, x̂_{<t}); in both distributions, the probabilistic conditioning on x̂_{<t} is justified by the observation that x̂_{<t} is a deterministic function of z_{<t}. As we show, all sequential compression approaches considered in this work follow this paradigm, and the form of the reconstruction transform determines the lowest hierarchy of the corresponding generative process of the video x.
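The closed-loop structure can be sketched in a few lines, with hypothetical toy choices for f and g rather than the paper's networks. The key property is that both encoder and decoder condition on the previous reconstruction x̂_{t−1}, never the ground-truth frame, so their states cannot drift apart.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x_t, x_hat_prev):
    # Sequential encoder: sees the current frame and the previous reconstruction.
    return 4.0 * (x_t - 0.9 * x_hat_prev)     # toy residual-style latent

def g(z_t, x_hat_prev):
    # Sequential decoder: mirrors the encoder using only decoded quantities.
    return 0.9 * x_hat_prev + z_t / 4.0

frames = np.cumsum(rng.normal(scale=0.1, size=(5, 8)), axis=0)  # smooth toy "video"
x_hat_prev = np.zeros(8)
recons = []
for x_t in frames:
    z_int = np.round(f(x_t, x_hat_prev))      # quantized symbols to entropy-code
    x_hat = g(z_int, x_hat_prev)              # decoder sees only z_int, x_hat_prev
    recons.append(x_hat)
    x_hat_prev = x_hat                        # closed loop: no encoder/decoder drift

err = np.max(np.abs(np.stack(recons) - frames))
```

Because the encoder is run against the decoder's own reconstruction, the per-frame error stays bounded by the quantization step and does not accumulate over time.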

Masked Autoregressive Flow (MAF).

As a final component in neural sequence modeling, we discuss MAF (Papamakarios et al., 2017), which models the joint distribution of a sequence x_{1:T} in terms of a simpler distribution of its underlying noise variables y_{1:T} through the following autoregressive transform and its inverse:

    x_t = h_μ(x_{<t}) + h_σ(x_{<t}) ⊙ y_t;   y_t = (x_t − h_μ(x_{<t})) / h_σ(x_{<t}).   (2)

The noise variable y_t usually comes from a standard normal distribution. While the forward MAF transforms a sequence of standard normal noises into a data sequence, the inverse flow "whitens" the data sequence and removes temporal correlations. Due to its invertible nature, MAF allows for exact likelihood computations, but as we will explain in Section 3.3, we will not exploit this aspect in compression but rather draw on its expressiveness in modeling conditional likelihoods.
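A minimal NumPy sketch of Eq. 2, with hypothetical toy shift/scale networks that condition only on the previous step, makes the exact invertibility of the transform explicit:

```python
import numpy as np

rng = np.random.default_rng(2)

def h_mu(x_prev):
    # Toy shift "network": predicts the current step from the previous one.
    return 0.8 * x_prev

def h_sigma(x_prev):
    # Toy scale "network": strictly positive, as required for invertibility.
    return 0.5 + 0.1 * np.abs(x_prev)

def maf_forward(y):
    """Map i.i.d. noise y_1..y_T to a temporally correlated sequence x_1..x_T."""
    x = np.zeros_like(y)
    x_prev = np.zeros(y.shape[1])
    for t in range(len(y)):
        x[t] = h_mu(x_prev) + h_sigma(x_prev) * y[t]
        x_prev = x[t]
    return x

def maf_inverse(x):
    """Whitening direction: strip the temporal structure back out of x."""
    y = np.zeros_like(x)
    x_prev = np.zeros(x.shape[1])
    for t in range(len(x)):
        y[t] = (x[t] - h_mu(x_prev)) / h_sigma(x_prev)
        x_prev = x[t]
    return y

y = rng.normal(size=(10, 4))
x = maf_forward(y)
y_rec = maf_inverse(x)
```

The inverse direction is the "whitening" an encoder would apply; in the compression models below, the same algebra reappears with reconstructions x̂ in place of x.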

3.2 A General Framework for Generative Video Coding

We now describe a general framework that captures several existing low-latency neural compression methods as specific instances and gives rise to the exploration of new models. To this end, we combine latent variable models with autoregressive flows into a joint framework. We consider a sequential decoding procedure of the following form:

    x̂_t = h_μ(x̂_{t−1}, w_t) + h_σ(x̂_{t−1}, w_t) ⊙ g_v(v_t, w_t).   (3)

Eq. 3 resembles the definition of the MAF in Eq. 2, but augments this transform with two sets of latent variables w_t and v_t. Above, h_μ and h_σ are functions that transform the previous reconstructed data frame x̂_{t−1} along with w_t into a shift and scale parameter, respectively. The function g_v converts these latent variables into a noise variable v̂_t = g_v(v_t, w_t) that encodes residuals with respect to the mean next-frame prediction h_μ(x̂_{t−1}, w_t).
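The gating role of the scale parameter in Eq. 3 can be seen in a few lines. The parameterizations of h_μ, h_σ, and g_v below are hypothetical toys (in the real model these are neural networks); what matters is that a saturated scale shuts off the residual channel entirely.

```python
import numpy as np

def reconstruct(x_hat_prev, w_t, v_t):
    mu = np.tanh(x_hat_prev + w_t)            # shift: a prediction steered by w_t
    sigma = 1.0 / (1.0 + np.exp(-w_t))        # scale in (0, 1): a per-dim gate
    v_hat = 0.5 * v_t                          # decoded residual noise g_v(v_t, w_t)
    return mu + sigma * v_hat

prev = np.zeros(4)
# Where the scale saturates near zero, the residual contribution is switched off
# and the reconstruction falls back entirely on the prediction term.
gated = reconstruct(prev, np.full(4, -20.0), np.full(4, 50.0))
open_ = reconstruct(prev, np.zeros(4), np.full(4, 50.0))
```

With w_t pushed far negative, the (large) residual is ignored; at w_t = 0, half of it passes through.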

This stochastic decoder model has several advantages over existing generative models for compression, such as simpler flows or sequential VAEs. First, the stochastic autoregressive transform involves a latent variable w_t and is therefore more expressive than a deterministic transform (Schmidt and Hofmann, 2018; Schmidt et al., 2019). Second, compared to MAF, the additional nonlinear transform g_v enables more expressive residual noise, reducing the burden on the entropy model. Finally, as visualized in Figure 2, the scale parameter h_σ effectively acts as a gating mechanism, determining how much variance is explained in terms of the autoregressive transform and the residual noise distribution. This provides an added degree of flexibility, in a similar fashion to how RealNVP improves over NICE (Dinh et al., 2014, 2016).

Our approach is inspired by Marino et al. (2020) who analyzed a restricted version of the model in Eq. 3, aiming to hybridize autoregressive flows and sequential latent variable models for video prediction. In contrast to Eq. 3, their model involved deterministic transforms as well as residual noise that came from a sequential VAE.

3.3 Example Models and Extensions

Next, we will show that the general framework expressed by Eq. 3 captures a variety of state-of-the-art neural video compression schemes and gives rise to extensions and new models.

Temporal Autoregressive Transform (TAT).

The first special case among the class of models captured by Eq. 3 is the autoregressive neural video compression model by Yang et al. (2020b), which we denote as temporal autoregressive transform (TAT). Shown in Figure 1(a), the decoder implements a deterministic scale-shift autoregressive transform of decoded noise ẑ_t:

    x̂_t = h_μ(x̂_{t−1}) + h_σ(x̂_{t−1}) ⊙ ẑ_t,   ẑ_t = g_z(z_t).   (4)

The encoder inverts the transform to decorrelate the input frame x_t into y_t = (x_t − h_μ(x̂_{t−1})) / h_σ(x̂_{t−1}) and encodes the result as z̄_t = f_z(y_t). The shift h_μ and scale h_σ transforms are parameterized by neural networks, f_z is a convolutional neural network (CNN), and g_z is a deconvolutional neural network (DNN) that approximately inverts f_z.

The TAT decoder is a simple version of the more general stochastic autoregressive transform in Eq. 3, where h_μ and h_σ lack latent variables. Indeed, interpreting the probabilistic generative process of x̂_{1:T}, TAT implements the model proposed by Marino et al. (2020), as the transform from ẑ_{1:T} to x̂_{1:T} is a MAF. However, the generative process corresponding to compression (reviewed in Section 3.1) adds additional white noise to x̂_t to form the likelihood of x_t. Thus, the generative process from z_{1:T} to x_{1:T} is no longer an autoregressive flow. Regardless, TAT was shown to better capture the low-level dynamics of video frames than the autoencoder alone, and the inverse transform decorrelates raw video frames to simplify the input to the encoder (Yang et al., 2020b).
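The TAT round trip can be sketched as follows, with hypothetical toy stand-ins for h_μ, h_σ, f_z, and g_z. The point is the ordering: the encoder applies the inverse transform before quantization, and the decoder re-applies the forward transform to the decoded noise.

```python
import numpy as np

def h_mu(x_prev):
    return 0.8 * x_prev                        # toy shift network

def h_sig(x_prev):
    return 0.5 + 0.1 * np.abs(x_prev)          # toy scale network, strictly positive

def tat_encode(x_t, x_hat_prev):
    # Whiten the frame against the previous reconstruction, then quantize
    # (rounding on a fixed grid stands in for the CNN encoder f_z).
    y_t = (x_t - h_mu(x_hat_prev)) / h_sig(x_hat_prev)
    return np.round(4.0 * y_t)

def tat_decode(code, x_hat_prev):
    y_hat = code / 4.0                         # stands in for the DNN g_z
    return h_mu(x_hat_prev) + h_sig(x_hat_prev) * y_hat

rng = np.random.default_rng(4)
frames = np.clip(np.cumsum(rng.normal(scale=0.2, size=(6, 8)), axis=0), -1, 1)
x_hat_prev = np.zeros(8)
errs = []
for x_t in frames:
    x_hat = tat_decode(tat_encode(x_t, x_hat_prev), x_hat_prev)
    errs.append(float(np.max(np.abs(x_hat - x_t))))
    x_hat_prev = x_hat
```

Because the whitened signal is what gets quantized, the reconstruction error per frame scales with the local value of h_σ, consistent with its gating interpretation.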


Figure 1: Model Diagrams. Graphical models underlying the generative and inference procedures for the current frame x_t of various neural video compression methods. Random variables are shown in circles; all other quantities are deterministically computed. Solid and dashed arrows describe computational dependencies during generation (decoding) and inference (encoding), respectively. Purple nodes correspond to neural encoders (CNNs) and decoders (DNNs), and green nodes implement the temporal autoregressive transform. (a) TAT; (b) SSF; (c) STAT or STAT-SSF; the optional structured prior is shown by the red arrow from w_t to v_t. Hyper latent variables in (b) and (c) are left out for clarity.

DVC (Lu et al., 2019) and Scale-Space Flow (SSF; Agustsson et al., 2020).

The second class of models captured by Eq. 3 belongs to the conventional video compression framework based on predictive coding (Cutler, 1952; Wiegand et al., 2003; Sullivan et al., 2012). Both models make use of two sets of latent variables z_t = (w_t, v_t) to capture different aspects of the information being compressed: w_t captures estimated motion information used in the warping prediction, and v_t helps capture residual error not predicted by warping.

Like most classical approaches to video compression by predictive coding, the reconstruction transform in the above models has the form of a prediction shifted by residual error (decoded noise), and lacks the scaling factor h_σ of the autoregressive transform in Eq. 3:

    x̂_t = warp(x̂_{t−1}, g_w(w_t)) + g_v(v_t, w_t),   (5)

where g_w and g_v are DNNs, g_w(w_t) has the interpretation of an estimated optical flow (motion) field, warp is the computer vision technique of warping, and the residual v̂_t = g_v(v_t, w_t) represents the prediction error unaccounted for by warping. Lu et al. (2019) only make use of v_t in the residual decoder g_v, and perform simple 2D warping by bilinear interpolation; SSF (Agustsson et al., 2020) augments the optical flow (motion) field with an additional scale field, and applies scale-space warping to progressively blurred versions of x̂_{t−1} to allow for uncertainty in the warping prediction. The encoding procedure in the above models computes the variational mean parameters as (w̄_t, v̄_t) = f(x_t, x̂_{t−1}), corresponding to a structured posterior. We illustrate the above generative and inference procedures in Figure 1(b).
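A self-contained NumPy sketch of the 2D bilinear warping used in the prediction step (the real models use a GPU warping op, and SSF additionally samples across a blur/scale dimension; edge clamping here is one of several possible boundary conventions):

```python
import numpy as np

def bilinear_warp(frame, flow):
    """Warp an (H, W) frame by per-pixel displacements flow[..., (dy, dx)],
    sampling the source positions with bilinear interpolation (edge clamping)."""
    H, W = frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sy = np.clip(ys + flow[..., 0], 0, H - 1)
    sx = np.clip(xs + flow[..., 1], 0, W - 1)
    y0 = np.floor(sy).astype(int); x0 = np.floor(sx).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = sy - y0; wx = sx - x0
    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bot = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return top * (1 - wy) + bot * wy

# Prediction step of the reconstruction transform: warp the previous
# reconstruction by the decoded flow field; the decoded residual is then added.
prev_recon = np.arange(16, dtype=float).reshape(4, 4)
identity = bilinear_warp(prev_recon, np.zeros((4, 4, 2)))
```

A zero flow field reproduces the frame exactly; fractional displacements blend the four neighboring pixels, which is what makes the prediction differentiable in the flow.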

Proposed: models based on Stochastic Temporal Autoregressive Transform.

Finally, we consider the most general models as described by the stochastic autoregressive transform in Eq. 3, shown in Figure 1(c). We study two main variants, categorized by how they implement h_μ and h_σ:

  1. STAT uses DNNs for h_μ and h_σ as in Yang et al. (2020b), but complements them with the latent variable w_t that characterizes the transform. In principle, more flexible transform functions h_μ and h_σ should lead to better rate-distortion results; however, we find the following variant more performant and parameter-efficient in practice:

  2. STAT-SSF: a less data-driven variant of the above that still uses scale-space warping (Agustsson et al., 2020) in the shift transform, i.e., h_μ(x̂_{t−1}, w_t) = scale-space-warp(x̂_{t−1}, g_w(w_t)). This can also be seen as an extended version of the SSF model, whose shift transform h_μ is preceded by a new learned scale transform h_σ.

Besides increasing the flexibility of the reconstruction transform (and hence of the likelihood model for x_t), we also consider improving the topmost generative hierarchy in the form of a more expressive latent prior, corresponding to improving the entropy model for compression. Specifically, we note that the prior in the SSF model assumes the factorization p(w_t, v_t) = p(w_t) p(v_t), which can be restrictive. We propose a structured prior by introducing conditional dependence between w_t and v_t, so that p(w_t, v_t) = p(w_t) p(v_t | w_t). This results in variants of the above models, STAT-SP and STAT-SSF-SP, where the structured prior is applied on top of the proposed STAT and STAT-SSF models.
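The benefit of a structured prior can be illustrated with a toy Gaussian entropy-model calculation (all numbers hypothetical): when the residual latent v is correlated with the motion latent w, coding v under a conditional model p(v | w) is cheaper than under a factorized p(v), even though both models match the marginal statistics.

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_bits(x, mean, std):
    # Idealized code length under a Gaussian entropy model, in bits; a
    # stand-in for the discretized model an actual entropy coder would use.
    nats = 0.5 * np.log(2 * np.pi * std ** 2) + (x - mean) ** 2 / (2 * std ** 2)
    return float(nats.sum() / np.log(2))

# Correlated latents: the residual carries information about the motion.
w = rng.normal(size=10_000)
v = 0.8 * w + 0.6 * rng.normal(size=10_000)   # Var[v] = 1, Corr(w, v) = 0.8

bits_factorized = gaussian_bits(w, 0.0, 1.0) + gaussian_bits(v, 0.0, 1.0)       # p(w)p(v)
bits_structured = gaussian_bits(w, 0.0, 1.0) + gaussian_bits(v, 0.8 * w, 0.6)   # p(w)p(v|w)
```

The conditional model saves roughly half a nat per residual dimension here; the exact saving depends on how strongly the two latents are correlated.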


Figure 2: Qualitative visualization of the proposed stochastic autoregressive transform on "ShakeNDry" from the UVG dataset. The proposed scale parameter h_σ acts as a gating mechanism in the reconstruction transform, switching off the contribution of the residual noise where it would be costly and instead relying on the warping prediction decoded from w_t.

4 Experiments

In this section, we present performance comparisons of our model with neural and classical codecs (see Table 2) across multiple training and evaluation datasets, including our new dataset, YouTube-NT. We demonstrate that our proposed model improvements yield state-of-the-art performance at high bitrates, and that our new dataset improves evaluation performance relative to publicly available training datasets. Our model architecture and training scheme largely follow those of Agustsson et al. (2020), incorporating the components described in the previous section. We refer to Appendix A.4 for further details.

Differing from Agustsson et al. (2020), we implement a more efficient version of the scale-space module. Agustsson et al. (2020) leverage multiple Gaussian kernels of increasing size, which can be inefficient, especially for the larger kernels. Our implementation instead uses a Gaussian pyramid and interpolation to avoid this issue. Pseudocode is available in Appendix A.3.
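As a hedged sketch of the underlying idea (not the pseudocode from Appendix A.3): build a stack of progressively blurred copies of the previous reconstruction, which scale-space warping then samples at fractional (y, x, scale) coordinates, so the decoded scale field selects, per pixel, how blurry (uncertain) the prediction should be. For simplicity this sketch blurs at full resolution rather than using a downsampled pyramid with interpolation.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with a truncated kernel and edge padding."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    out = np.pad(img, ((radius, radius), (0, 0)), mode="edge")
    out = sum(k[i] * out[i:i + img.shape[0]] for i in range(len(k)))   # blur rows
    out = np.pad(out, ((0, 0), (radius, radius)), mode="edge")
    return sum(k[i] * out[:, i:i + img.shape[1]] for i in range(len(k)))  # blur cols

def scale_space_volume(frame, num_levels=4, base_sigma=1.0):
    """Stack [frame, blur(sigma), blur(2*sigma), ...]: the volume that
    scale-space warping samples at fractional (y, x, scale) coordinates."""
    levels = [frame]
    for i in range(1, num_levels):
        levels.append(gaussian_blur(frame, base_sigma * 2.0 ** (i - 1)))
    return np.stack(levels)

vol = scale_space_volume(np.ones((8, 8)))
```

Sampling between adjacent levels interpolates between blur strengths, which is what lets the model express a continuum from a sharp motion-compensated prediction to a heavily smoothed one.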

4.1 Training Datasets

Vimeo-90k (Xue et al., 2019) consists of 90,000 clips of 7 frames at 448x256 resolution, collected from vimeo.com. As this dataset has been used in previous works (Lu et al., 2019; Yang et al., 2020a; Liu et al., 2019), it provides a benchmark for comparing models. While other publicly available video datasets exist, e.g., Kinetics (Carreira and Zisserman, 2017), such datasets are lower-resolution and primarily intended for human action recognition. Accordingly, previous works that trained on Kinetics generally report sub-par PSNR-bitrate performance (Wu et al., 2018; Habibian et al., 2019; Golinski et al., 2020). Agustsson et al. (2020) use a significantly larger and higher-resolution dataset collected from youtube.com; however, it is not publicly available.

YouTube-NT. This is our new dataset. We collected 8,000 nature videos and movie/video-game trailers from youtube.com and processed them into 300k high-resolution (720p) clips, which we refer to as YouTube-NT. In contrast to existing datasets (Carreira and Zisserman, 2017; Xue et al., 2019), we provide YouTube-NT in the form of customizable scripts to enable future compression research. Table 1 compares the current version of YouTube-NT with Vimeo-90k (Xue et al., 2019) and with Google's proprietary training dataset (Agustsson et al., 2020). In Figure 4(b), we display the evaluation performance of the SSF model architecture after training on each dataset.

Dataset name          | Clip length | Resolution | # of clips | # of videos | Public | Configurable
Vimeo-90k             | 7 frames    | 448x256    | 90,000     | 5,000       | Yes    | No
YouTube-NT (ours)     | 6-10 frames | 1280x720   | 300,000    | 8,000       | Yes    | Yes
Agustsson et al. 2020 | 60 frames   | 1280x720   | 700,000    | 700,000     | No     | No

Table 1: Overview of Training Datasets.

4.2 Training and Evaluation Scheme

Training. All models are trained on three consecutive frames at a time, randomly selected from each clip and randomly cropped to 256x256, with a batch size of 8. We train with an MSE distortion loss, following a procedure similar to that of Agustsson et al. (2020) (see Appendix A.2 for details).

Evaluation. We evaluate compression performance on the UVG (Mercat et al., 2020) and MCL_JCV (Wang et al., 2016) datasets, which both consist of RAW videos in YUV420 format. UVG is widely used for testing the HEVC codec and contains seven 1080p videos at 120fps with smooth, mild motion or stable camera movement. MCL_JCV contains thirty 1080p videos at 30fps, which are generally more diverse, with a higher degree of motion and camera movement.

We compute the bit rate (bits-per-pixel, BPP) and the reconstruction quality as measured by PSNR, averaged across all frames. We note that PSNR is a more challenging metric than MS-SSIM (Wang et al., 2003) for learned codecs (Lu et al., 2019; Agustsson et al., 2020; Habibian et al., 2019; Yang et al., 2020a, c). We also note that current neural compression methods take video input in 8-bit RGB format (24 bits/pixel), whereas FFmpeg also supports the YUV420 format (12 bits/pixel), allowing classical codecs to effectively halve the input bitrate on the test data. For a fair comparison, we report rate-distortion performance for HEVC in RGB mode (Appendix A.1).
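For reference, the two evaluation quantities reduce to a few lines (assuming 8-bit content; the helper names are ours, not from any particular library):

```python
import numpy as np

def bits_per_pixel(total_bits, num_frames, height, width):
    # Rate: total bitstream length divided by the number of pixels coded.
    return total_bits / (num_frames * height * width)

def psnr(x, x_hat, max_val=255.0):
    # Peak signal-to-noise ratio in dB, against the 8-bit peak value.
    mse = np.mean((np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For example, a single 1920x1080 frame coded with 2,073,600 bits costs exactly 1 BPP, which is why halving the effective input bit depth (YUV420 vs. RGB) matters for cross-codec comparisons.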

Model name  | Category | Remark
STAT-SSF    | Proposed | Proposed autoregressive transform with efficient scale-space flow model
STAT-SSF-SP | Proposed | Proposed autoregressive transform with efficient scale-space flow model and structured prior
SSF         | Baseline | Agustsson et al. 2020, CVPR
DVC         | Baseline | Lu et al. 2019, CVPR
VCII        | Baseline | Wu et al. 2018, ECCV (trained on the Kinetics dataset (Carreira and Zisserman, 2017))
DGVC        | Baseline | Han et al. 2019, NeurIPS; modified to not use future frames
TAT         | Baseline | Yang et al. 2020b, ICML Workshop
HEVC (RGB)  | Baseline | State-of-the-art conventional codec with RGB color format
STAT        | Ablation | Replaces scale-space flow in STAT-SSF with a neural network
SSF-SP      | Ablation | Scale-space flow model with structured prior

Table 2: Overview of models and codecs.




Figure 3: Rate-Distortion Performance of various models and ablations. Results are evaluated on (a) UVG and (b) MCL_JCV datasets. All the neural-based models (except VCII (Wu et al., 2018)) are trained on Vimeo-90k (Xue et al., 2019). STAT-SSF (proposed) achieves the best performance.

4.3 Baseline Analysis

To provide a benchmark comparison with the baseline models listed in Table 2, we primarily focus on the Vimeo-90k (Xue et al., 2019) training dataset. In Figure 3(a), we compare our proposed models (STAT-SSF, STAT-SSF-SP) with previous neural codecs and the classical codec HEVC on the UVG evaluation dataset. Our models provide superior performance at higher bitrates, outperforming the state-of-the-art SSF model (Agustsson et al., 2020), as well as DVC (Lu et al., 2019), which leverages a more complicated model and a multi-stage training procedure. The classical codec HEVC (RGB) barely outperforms the model proposed by Wu et al. (2018). Finally, we note that, as expected, our proposed STAT model improves over TAT (Yang et al., 2020b) by adding stochasticity to the autoregressive transform.

The diversity of the MCL_JCV dataset provides a more challenging evaluation benchmark. Nevertheless, our STAT-SSF model still maintains a competitive edge over the other baseline methods. Our structured prior model, STAT-SSF-SP, provides further performance improvements over STAT-SSF, outperforming all other baselines at higher bitrates. This suggests that our structured prior may further help capture intense motion and camera movement.




Figure 4: Ablations & Comparisons. (a) An ablation study on our proposed components. (b) Performance of SSF (Agustsson et al., 2020) trained on different datasets. Both sets of results are evaluated on UVG.

4.4 Ablation Analysis

Using the baseline SSF (Agustsson et al., 2020) model and the YouTube-NT training dataset, we demonstrate the improvements from our proposed components, the stochastic temporal autoregressive transform (STAT) and the structured prior (SP), evaluated on UVG. As shown in Figure 4(a), STAT improves performance to a greater degree than SP, consistent with the results in Section 4.3, Figure 3(a).

To quantify the effect of the training dataset on performance, we compare performance on UVG for the SSF model architecture after training on Vimeo-90k (Xue et al., 2019) and on YouTube-NT. We also compare with the results reported by Agustsson et al. (2020), whose model was trained on a larger proprietary dataset. This is shown in Figure 4(b), where we see that training on YouTube-NT improves evaluation performance over Vimeo-90k, in some cases bridging the gap with the performance obtained from the larger proprietary training dataset of Agustsson et al. (2020). At higher bitrates, the model trained on Vimeo-90k (Xue et al., 2019) performs similarly to the one trained on YouTube-NT. This is likely because YouTube-NT currently covers only 8,000 videos, limiting the diversity of the short clips.

5 Discussion

We provide a unifying perspective on sequential video compression and temporal autoregressive flows (Marino et al., 2020), and elucidate the relationship between the two in terms of their underlying generative hierarchy. From this perspective, we consider several video compression methods, particularly the state-of-the-art Scale-Space Flow method (Agustsson et al., 2020), as instantiations of a more general stochastic temporal autoregressive transform, which allows us to naturally extend the Scale-Space Flow model and obtain improved rate-distortion performance on public benchmark datasets. Further, we provide a new high-resolution video dataset, YouTube-NT, which is substantially larger than current publicly available datasets. Together, we hope that this new perspective and dataset will drive further progress in the nascent yet highly impactful field of learned video compression.

6 Acknowledgements

We gratefully acknowledge extensive contributions from Yang Yang (Qualcomm), which were indispensable to this work. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0021. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Yibo Yang acknowledges funding from the Hasso Plattner Foundation. Furthermore, this work was supported by the National Science Foundation under Grants 1928718, 2003237 and 2007719, as well as Intel and Qualcomm.


References

  • E. Agustsson, D. Minnen, N. Johnston, J. Ballé, S. J. Hwang, and G. Toderici (2020) Scale-space flow for end-to-end optimized video compression. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8503–8512.
  • M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine (2018) Stochastic variational video prediction. In International Conference on Learning Representations.
  • J. Ballé, V. Laparra, and E. P. Simoncelli (2017) End-to-end optimized image compression. In International Conference on Learning Representations.
  • J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. In International Conference on Learning Representations.
  • F. Bellard (2014) BPG image format.
  • Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  • J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308.
  • T. Chen, H. Liu, Q. Shen, T. Yue, X. Cao, and Z. Ma (2017) DeepCoder: a deep neural network based video compression. In IEEE Visual Communications and Image Processing (VCIP), pp. 1–4.
  • Z. Chen, T. He, X. Jin, and F. Wu (2019) Learning for video compression. IEEE Transactions on Circuits and Systems for Video Technology 30 (2), pp. 566–576.
  • J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio (2015) A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980–2988.
  • C. C. Cutler (1952) Differential quantization of communication signals. US Patent 2,605,361.
  • E. Denton and R. Fergus (2018) Stochastic video generation with a learned prior. In International Conference on Machine Learning, pp. 1174–1183.
  • L. Dinh, D. Krueger, and Y. Bengio (2014) NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
  • A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers (2019) Neural inter-frame compression for video coding. In IEEE International Conference on Computer Vision, pp. 6420–6428.
  • A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) FlowNet: learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision, pp. 2758–2766.
  • G. Flamich, M. Havasi, and J. M. Hernández-Lobato (2019) Compression without quantization. In OpenReview.
  • A. Golinski, R. Pourreza, Y. Yang, G. Sautiere, and T. S. Cohen (2020) Feedback recurrent autoencoder for video compression. arXiv preprint arXiv:2004.04342.
  • A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen (2019) Video compression with rate-distortion autoencoders. In IEEE International Conference on Computer Vision, pp. 7033–7042.
  • N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
  • A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine (2018) Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523.
  • H. Liu, L. Huang, M. Lu, T. Chen, Z. Ma, et al. (2019) Learned video compression via joint spatial-temporal correlation exploration. arXiv preprint arXiv:1912.06348.
  • S. Lombardo, J. Han, C. Schroers, and S. Mandt (2019) Deep generative video compression. In Advances in Neural Information Processing Systems, pp. 9287–9298.
  • G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao (2019) DVC: an end-to-end deep video compression framework. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 11006–11015.
  • J. Marino, L. Chen, J. He, and S. Mandt (2020) Improving sequential latent variable models with autoregressive flows. In Symposium on Advances in Approximate Bayesian Inference, pp. 1–16.
  • A. Mercat, M. Viitanen, and J. Vanne (2020) UVG dataset: 50/120fps 4K sequences for video codec analysis and development. In ACM Multimedia Systems Conference, pp. 297–302.
  • D. Minnen, J. Ballé, and G. D. Toderici (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780.
  • D. Minnen and S. Singh (2020) Channel-wise autoregressive entropy models for learned image compression. In IEEE International Conference on Image Processing, pp. 3339–3343.
  • G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347.
  • D. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538.
  • O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, and L. D. Bourdev (2019) Learned video compression. In IEEE International Conference on Computer Vision, pp. 3453–3462.
  • F. Schmidt and T. Hofmann (2018) Deep state space models for unconditional word generation. In Advances in Neural Information Processing Systems, pp. 6158–6168.
  • F. Schmidt, S. Mandt, and T. Hofmann (2019) Autoregressive text generation beyond feedback loops. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing, pp. 3391–3397.
  • G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand (2012) Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1649–1668.
  • L. Theis, W. Shi, A. Cunningham, and F. Huszár (2017) Lossy image compression with compressive autoencoders. In International Conference on Learning Representations.
  • G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell (2017) Full resolution image compression with recurrent neural networks. In IEEE Conference on Computer Vision and Pattern Recognition.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pp. 613–621.
  • H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C. J. Kuo (2016) MCL-JCV: a JND-based H.264/AVC video quality assessment dataset. In IEEE International Conference on Image Processing, pp. 1509–1513.
  • Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers, Vol. 2, pp. 1398–1402.
  • T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra (2003) Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13 (7), pp. 560–576.
  • C. Wu, N. Singhal, and P. Krahenbuhl (2018) Video compression through image interpolation. In European Conference on Computer Vision, pp. 416–431.
  • T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision 127 (8), pp. 1106–1125.
  • R. Yang, F. Mentzer, L. Van Gool, and R. Timofte (2020a) Learning for video compression with hierarchical quality and recurrent enhancement. In IEEE Conference on Computer Vision and Pattern Recognition.
  • R. Yang, Y. Yang, J. Marino, Y. Yang, and S. Mandt (2020b) Deep generative video compression with temporal autoregressive transforms. ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models.
  • Y. Yang, G. Sautière, J. J. Ryu, and T. S. Cohen (2020c) Feedback recurrent autoencoder. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3347–3351.
  • Y. Yang, R. Bamler, and S. Mandt (2020d) Improving inference for neural image compression. arXiv preprint arXiv:2006.04240.
  • Y. Yang, R. Bamler, and S. Mandt (2020e) Variational Bayesian quantization. In International Conference on Machine Learning.
  • L. Yingzhen and S. Mandt (2018) Disentangled sequential autoencoder. In International Conference on Machine Learning, pp. 5670–5679.

Appendix A

A.1 Command for HEVC codec

To prevent FFmpeg from exploiting the chroma subsampling of the input file (YUV420), we first dump video.yuv to a sequence of lossless PNG files:

    ffmpeg -i video.yuv -vsync 0 video/%d.png

Then we use the default low-latency setting in ffmpeg to compress the dumped png sequences:

    ffmpeg -y -i video/%d.png -c:v libx265 -preset medium \
        -x265-params bframes=0 -crf {crf} video.mkv

where crf is the quality-control parameter.
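The size of the resulting video.mkv file determines the rate of the HEVC baseline. As a minimal sketch (the function name and interface are ours, not part of any codec tool), the bits-per-pixel rate conventionally used in rate-distortion comparisons can be computed from the file size as:

```python
import os

def bits_per_pixel(path, width, height, num_frames):
    """Average rate of a compressed bitstream in bits per pixel (bpp):
    total file size in bits divided by the total number of pixels."""
    num_bits = os.path.getsize(path) * 8
    return num_bits / (width * height * num_frames)
```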

A.2 Training schedule

Training time is about four days on an NVIDIA Titan RTX. Similar to Agustsson et al. (2020), we use the Adam optimizer (Kingma and Ba, 2014), training the models for 1,050,000 steps. The initial learning rate of 1e-4 is decayed to 1e-5 after 900,000 steps, and we increase the crop size to 384x384 for the last 50,000 steps. All models are optimized using MSE loss.
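The schedule above can be written as a step-indexed function (a minimal sketch using only the values stated in the text; the optimizer wiring and data pipeline are omitted):

```python
def learning_rate(step):
    """Piecewise-constant schedule: 1e-4 initially, decayed to 1e-5
    after 900,000 of the 1,050,000 total training steps."""
    return 1e-4 if step < 900_000 else 1e-5
```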

A.3 Efficient scale-space flow

Agustsson et al. (2020) use a simple implementation of scale-space flow that convolves the previously reconstructed frame with a sequence of Gaussian kernels of increasing standard deviation σ. However, this requires increasingly large kernels as σ grows, which significantly reduces efficiency: since the kernel width typically needs to be about 6σ+1 to avoid artifacts, a blur with σ=16 already requires a 97x97 kernel. To alleviate this problem, we implement an efficient approximation of the Gaussian scale-space using a Gaussian pyramid with upsampling. With a Gaussian pyramid, a single small fixed-σ Gaussian kernel suffices to consecutively blur and downsample the image to build the pyramid. At the final step, we only need to upsample all the downsampled images back to the original size to approximate the scale-space 3D tensor.

Result: sst: scale-space 3D tensor
Input: input: input image; σ: base scale; M: scale depth;
sst = [input];
kernel = Create_Gaussian_Kernel(σ);
for i = 0 to M-1 do
       input = GaussianBlur(input, kernel);
       tmp = input;
       if i != 0 then
             for j = 0 to i-1 do
                   tmp = UpSample2x(tmp); {step-wise upsampling for smooth interpolation}
             end for
       end if
       sst.append(tmp);
       input = DownSample2x(input);
end for
return Concat(sst)
Algorithm 1 An efficient algorithm to build a scale-space 3D tensor

A.4 Architecture

Figure 5 illustrates the low-level encoder, decoder, and hyper-encoder/decoder modules used in our proposed STAT-SSF and STAT-SSF-SP models, as well as in the baseline TAT and SSF models, following Agustsson et al. (2020). Figure 6 shows the encoder-decoder flowcharts for the two latent variables separately, as well as their corresponding entropy models (priors), in the STAT-SSF-SP model.


Figure 5: Backbone module architectures, where “5x5/2, 128” denotes a 5x5 convolution kernel with stride 2 and 128 filters.
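For reference, the spatial downsampling implied by the “5x5/2” notation can be checked with the standard convolution output-size formula (the padding value of 2 is our assumption, chosen so that stride-2 layers halve the resolution exactly):

```python
def conv_output_size(size, kernel=5, stride=2, padding=2):
    """Spatial output size of a convolution layer in "5x5/2" notation:
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1
```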


Figure 6: Computational flowchart for the proposed STAT-SSF-SP model. The left two subfigures show the decoder and encoder flowcharts for the two latent variables, respectively, with “AT” denoting autoregressive transform. The right two subfigures show the prior distributions used for entropy-coding the corresponding latents and their hyper-latents (see (Agustsson et al., 2020) for a description of hyper-priors); note that the priors in the SSF and STAT-SSF models (without the proposed structured prior) correspond to the special case where the HyperDecoder for the residual latent does not receive the motion latent and its hyper-latent as inputs, so that the entropy model for the residual latent is independent of the motion latent.