Towards Generative Video Compression

07/26/2021 ∙ by Fabian Mentzer, et al. ∙ Google

We present a neural video compression method based on generative adversarial networks (GANs) that outperforms previous neural video compression methods and is comparable to HEVC in a user study. We propose a technique to mitigate temporal error accumulation caused by recursive frame compression that uses randomized shifting and un-shifting, motivated by a spectral analysis. We present in detail the network design choices, their relative importance, and elaborate on the challenges of evaluating video compression methods in user studies.




1 Introduction

Video compression is a challenging task, in which the goal is to reduce the bitrate required to store a video while preserving visual content by leveraging temporal and spatial redundancies. Today, non-neural standard codecs such as H.264/AVC [AVC] and H.265/HEVC [HEVC], which build on decades of engineering, are in broad use, but recent research has shown great progress in learned video compression using neural networks. In general, the focus has been on outperforming HEVC in terms of PSNR or MS-SSIM with novel architectures or training schemes, leading to the latest approaches being comparable to HEVC in terms of PSNR [agustsson2020scale, yang2020hierarchical] or outperforming it in MS-SSIM [golinski2020feedback]. However, these methods have not been evaluated in terms of subjective visual quality. In fact, few authors release reconstructions.

At the same time, recent work in neural image compression [mentzer2020high, agustsson2019extreme], formalized in “rate-distortion-perception theory” [blau2019rethinking, tschannen2018deep, theis2021coding, theis2021advantages], has shown how generative image compression methods can outperform the HEVC-based image codec BPG [bpgurl] in terms of visual quality as measured by a user study, even when the neural approach uses half the bitrate. Interestingly, this gap in bitrate was not captured by the quantitative image quality metrics, and in fact, PSNR and MS-SSIM predicted the opposite result.

At the time of writing, generative methods that target subjective quality remain unexplored for neural video compression. While the progress in generative image compression is promising, it is non-obvious how to apply the approach to the video domain. One intuitive problem is related to the train/test mismatch in terms of sequence length. Typically, neural approaches are trained on a handful of frames (e.g., three frames for [agustsson2020scale] and five for [liu2019neural]), while evaluation sequences have hundreds of frames. While the unrolling behavior is already a problem for MSE-optimized neural codecs (typically, small GOPs of 8-12 frames are used for evaluation to limit temporal error propagation), it requires even more care when detail is synthesized in a generative setting. Since we aim to synthesize high-frequency detail whenever new content appears, incorrectly propagating that detail will create significant visual artifacts.

With this in mind, we carefully design a neural compression architecture that excels at preserving detail and that works well when unrolled for many more frames than trained for (we train with up to nine frames and show 60 frames in the user study).

In this paper, we advance towards generative neural video coding, with the following contributions:

  1. To the best of our knowledge, we show the first neural compression approach that is competitive to HEVC in terms of visual quality, as measured in a user study. We show that approaches competitive in terms of PSNR do fare much worse in terms of visual quality.

  2. We propose a technique to mitigate temporal error-accumulation when unrolling via randomized shifting of the residual inputs followed by un-shifting of the outputs, motivated with a spectral analysis, and demonstrate its effectiveness in our system and for a toy linear CNN model.

  3. We explore correlations between visual quality as measured by the user study and available video quality metrics. To facilitate future research, we release reconstructions on the MCL-JCV video data set along with all the data obtained from our user studies (links in Appendix B).

Figure 1: Comparing different methods in a user study on the 30 videos of MCL-JCV. We illustrate the number of videos where each method was preferred by a majority of raters. Below each method we list the average bits per pixel (bpp) and megabits/sec (mbps) of the comparison. We see that our GAN-based approach (Ours) is very similar to HEVC across bitrates, and outperforms our own MSE-only baseline (same architecture and hyperparameters but no GAN loss) as well as a recent MSE-optimized neural compression method by Agustsson et al., “Scale-Space Flow” (SSF) [agustsson2020scale].

2 Related Work

Neural Video Compression

Wu et al. [wu2018video] use frame interpolation for video compression, compressing B-frames by interpolating between other frames. Djelouah et al. [djelouah2019neural] also use interpolation, but additionally employ an optical flow predictor for warping frames. This approach of using future frames is commonly referred to as “B-frame coding” for “bidirectional prediction”. Other neural video coding methods rely on only using predictive (P) frames, commonly referred to as the “low-delay” setting, since it is more suitable for streaming applications by not relying on future frames. Lu et al. [lu2019dvc] use previously decoded frames and a pretrained optical flow network. Habibian et al. [habibian2019video] do not explicitly model motion, and instead rely on a 3D autoregressive entropy model to capture spatial and temporal correlations. Liu et al. [liu2019neural] build temporal priors via LSTMs, while Liu et al. [liu2020conditional] condition entropy models on previous frames. Rippel et al. [rippel2019learned] support adapting the rate during encoding, and also do not explicitly model motion. Agustsson et al. [agustsson2020scale] propose “scale-space flow” to avoid complex residuals by allowing the model to blur as needed via a pyramid of blurred versions of the image. Yang et al. [yang2020hierarchical] generalize various approaches by learning to adapt the residual scale, and conditioning residual entropy models on flow latents. Golinski et al. [golinski2020feedback] recurrently connect decoders with subsequent unrolling steps, while Yang et al. [yang2020learning] also add recurrent entropy models.

Non-Neural Video Compression

Hybrid video coding, meaning the combination of transform coding [Go01] using discrete cosine transforms [AhNaRa74] with spatial and/or temporal prediction, emerged in the 1980s as the technology dominating video compression until the present day. Since these early days, the development of video compression methods has largely been centered around joint standardization through the ITU-T and the Moving Picture Experts Group (MPEG) of ISO/IEC. The various standards (mainly H.261 through H.265 [HEVC], as designated by ITU-T), as well as other proprietary codecs, have all remained faithful to the hybrid coding principle, with extensive refinements, e.g., regarding more flexible pixel formats (e.g., bit depth, chroma subsampling), more flexible temporal and spatial prediction (e.g., I-, P-, B-frames, intra block copy), and many more. Much of this development has been driven by patent-pooling agreements between private companies. More recently, there have been significant efforts to develop royalty-free video compression formats, although even this research remains consistent with hybrid coding (e.g., VP8/9 developed by Google, Dirac developed by BBC Research, and AV1 developed by the Alliance for Open Media).

Figure 2: Architecture overview, with some intermediate tensors visualized in the gray box. To the left of the gray line is the I-frame branch (learned CNNs in blue), to the right the P-frame branch (learned CNNs in green). Dashed lines are not active during decoding, and discriminators are only active during training. The size of CNNs roughly indicates their capacity. SG is a stop gradient operation. Blur is scale space blurring, Warp is bicubic warping (see text). UFlow is a frozen optical flow model from [jonschkowski2020matters].

3 Method

3.1 Architecture

An overview of the architecture we use is given in Fig. 2, for a detailed view with all layers see Appendix A.1. Let be a sequence of frames, where is the initial (I) frame, denoted in the figure and below. We operate in the “low-delay” mode, and hence predict subsequent (P) frames from previous frames. Let be the reconstructed video.

We use the following strategy to obtain high-fidelity reconstructions:

  (S1) Synthesize plausible details in the I-frame.

  (S2) Propagate those details wherever possible and as sharply as possible.

  (S3) For new content appearing in P-frames, synthesize plausible details.

We note that this is in contrast to purely distortion-optimized neural video codecs, which, particularly at low bitrates, favor blurring to reduce the distortion loss. To address these three points, we design our architecture as described in this section.

The I-frame branch is based on a lightweight version of the architecture used in HiFiC [mentzer2020high], and is used to address the first point above (S1). In detail, the encoder CNN maps the input image to a quantized latent, which is entropy coded using a hyperprior [minnen2018joint] (not shown in Fig. 2, but detailed in App. A.1). From the decoded latent, we obtain a reconstruction via the I-generator. Following [mentzer2020high], we use an I-frame discriminator that is conditioned on the latent, as further explained in Sec. 3.2.

At a high level, the P-frame branch follows previous work (e.g., [lu2019dvc, agustsson2020scale]) in having two parts, one auto-encoder for the flow, and one for the residual. To partially address the second point above (S2), similar to previous work, we employ a powerful optical flow predictor network on the encoder side, UFlow [jonschkowski2020matters]. We found that without UFlow, propagating details is harder, see Fig. 3. The resulting flow is fed to the flow encoder, which outputs the quantized and entropy-coded flow latent. The flow generator predicts both a reconstructed flow and a confidence mask (similar to [agustsson2020scale]). The mask has the same spatial dimensions as the flow, with each value lying in a fixed range. Intuitively, this mask predicts for each pixel in the flow how “correct” the flow at that pixel is, and we use it to decide how much to blur in the “scale-space blur” component described next. In practice, we observe that the mask predicts where new content appears which is not well captured by the flow and warping. Since the flow is in general relatively easy to compress, we employ shallow networks for the flow encoder and flow generator, based on networks used in image compression [minnen2018joint].

To further address the second point (S2), we first warp the previous reconstruction with the reconstructed flow using bicubic warping, which is better suited to propagating detail than bilinear warping [nehab2014fresh]. After warping, we use scale-space blurring. This is a light variation of the “scale-space flow” approach described by Agustsson et al. [agustsson2020scale], where we first warp, and then (adaptively) blur, instead of the other way around. This has the benefit of allowing a more efficient implementation, see Appendix A.2 for details. Together, bicubic warping and blurring help to propagate sharp detail when needed, while also facilitating smooth blurring when needed (e.g., for focus changes in the video). Below, we refer to the result as the warped (and potentially blurred) previous reconstruction.

Finally, we calculate the residual between the current frame and the warped previous reconstruction, and compress it with the residual auto-encoder. To address the last point above (S3), we again employ the light version of the HiFiC architecture for the residual auto-encoder. However, we introduce one important component. We observe that the residual generator is not able to synthesize high-frequency details from the residual latent. We hypothesize that this is due to the generally sparse nature of residuals, which leads to sparse latents.

To address this, we introduce an additional source of information for the residual generator by relying on the warped previous reconstruction, which, in contrast to the residual, is not sparse at all. We feed it through the I-frame encoder to obtain the “free latent”. Note that this latent does not need to be encoded into the bitstream (hence the name “free”), and thus also does not need to be quantized. Instead, we concatenate it to the residual latent as a source of information, forming the full generator input. We also explored using uniform noise instead, but found that this “free latent” yields more detail. One advantage over using noise is that the residual generator now indirectly has access to the signal that the residual is added to. To train the P-frame branch, we employ a separate P-frame discriminator, with the same architecture as the I-frame discriminator, conditioned on the full generator input.

[Figure 3 images. Top: Input and Output for different training strategies, varying whether a GAN loss and the “free latent” are used (✗ marks a disabled component). Bottom: Uncompressed Flow and Compressed Flow for different training strategies, varying whether UFlow and the flow loss are used.]
Figure 3: Top: removing the “free latent” gives a model that is similar to our MSE baseline (MSE-only, i.e., no GAN, no “free latent”), yielding blurrier reconstructions. Bottom: Using supervised optical flow (UFlow) and the flow loss gives more natural flow fields for warping, which improves temporal consistency. “Using UFlow: ✗” means that we do not feed the UFlow output to the flow encoder (see Fig. 2), i.e., the network has to learn the flow on its own.

3.2 Formulation and Loss

We base our formulation on HiFiC [mentzer2020high]. We use conditional GANs [goodfellow2014generative, mirza2014conditional], where both the generator and the discriminator have access to additional labels: The general formulation assumes data points $x$ and labels $s$ following some joint distribution $p(x, s)$. The generator is supposed to map samples $s \sim p(s)$ to the distribution $p(x \mid s)$, and the discriminator is supposed to predict whether a given pair $(x, s)$ is from $p(x \mid s)$ rather than from the generator.

In our setting, we are working with sequences of frames and reconstructions. Following HiFiC, we condition both the generators and the discriminators on latents , for the I-frame, for the P-frame. To simplify the problem, we aim for per-frame distribution matching, i.e., for -length video sequences, the goal is to obtain a model s.t.:


To readers more familiar with conditional generative video synthesis (e.g., Wang et al. [wang2018vid2vid]), this simplification may seem sub-optimal as it may seem to lead to temporal consistency issues (i.e., you may imagine that reconstructions are inconsistent). We emphasize that since we are doing compression, we will also have a per-frame distortion loss (MSE), and we have information that we transmit to the decoder via a bitstream. So while the residual generator can in theory produce arbitrarily inconsistent reconstructions, in practice, these two points seem to prevent any temporal inconsistency issues in our models. We nevertheless explored variations where the discriminator is based on more frames, but this did not significantly alter reconstructions. Still, we believe there is further potential here.

Continuing from Eq. 1, we define the overall loss for the I-frame branch and its discriminator as follows. We use the “non-saturating” GAN loss [goodfellow2014generative]. To simplify notation, let :

(Footnote 1: Since we have deterministic auto-encoders, the distribution of the reconstruction is fully determined by that of the input, and hence we only have expectations over the input, in contrast to typical GAN formulations.)


where the rate parameter is set by the adaptive rate controller described in Sec. 3.4, and the distortion is computed per frame. We use MSE as the distortion, and emphasize that we use no perceptual distortion, whereas HiFiC employed LPIPS. We found no benefit in training with LPIPS, possibly due to a more balanced hyper-parameter selection.

[Figure 4 images: panels for the toy linear CNN (left) and our GAN model (right), each comparing no shift against random shift.]
Figure 4: Visualizing the grid that appears if we do not use random shifting. Left: Using our toy linear CNN, trained for steps. We recursively compress a fixed input, visualizing how MSE grows both with a plot and visually. indicates the number of recursive applications. Right: Our GAN model, trained for steps. Here we do real video compression for frames, but we see how random shifting helps.

For the P-frame branch, let be the distribution of -length clips, where we use as the I-frame, and let


Note that we scale the losses of the $t$-th frame with a factor that depends on $t$. This is motivated by the observation that each reconstruction influences all reconstructions following it, and hence earlier frames indirectly have more influence on the overall loss. The scaling ensures all frames have similar influence.

Additionally, we employ a simple regularizer for the P-frame branch:


where the first part is MSE on the flows, ensuring that the flow auto-encoder learns to reproduce the flow from UFlow. We mask it with the sigma field, since we only require consistent flow where the network actually uses the flow. TV is a total-variation loss [shulman1989regularization] ensuring a smooth sigma field.

3.3 Preventing Error Accumulation when Unrolling via Random Shifts

As mentioned in the introduction, the recurrent nature of the “low latency” setting makes generalization challenging in the time domain, where error propagation can happen. Ideally, we could train with sequences as long as what we evaluate on (i.e., at least frames), but in practice this is infeasible on current hardware due to memory constraints. While we can fit up to into our accelerators, training models then becomes prohibitively slow.

To accelerate prototyping and training new models, as well as work towards preventing unrolling issues, we instead adopt the following training scheme. 1) Train only, on randomly selected frames, for steps. 2) Freeze and initialize the weights of from . Train for steps using staged unrolling, that is, use until steps, until , until , until , until . We split this into steps 1) and 2) since trained can be re-used for many variants of the P-frame branch, and sharing across runs makes them more comparable.

Figure 5: Visualization of errors introduced by the linear 1-D for different shifts of an input signal. In each plot, the black solid line shows the input , possibly shifted, and the colored line shows . A shift by a multiple of yields the same error (See a, d, indicated by the fact that we use the same color), shifts by yield different errors (b, c). The dotted line represents the unshifted Input from a.

However, this alone does not fully prevent unrolling issues. Indeed, starting at roughly frame (depending on the video), we start to see well-known CNN artifacts, i.e., “checkerboard artifacts” or “gridding” [aitken2017checkerboard, odena2016deconvolution, wang2018understanding] in slowly moving and smooth regions of the videos. We do not see those artifacts in early frames.

We found that randomized shifting of the residual branch inputs (including for the “free latent”), followed by un-shifting of the outputs, is highly effective at mitigating these issues. This can be motivated at a high level by considering a single 1D convolutional filter $f$ which has a stride of 1. Recursively applying $f$ to an input $x$ for $N$ iterations yields $f * \dots * f * x$, which becomes

$$\mathcal{F}(f)^N \cdot \mathcal{F}(x) \qquad (7)$$

in the Fourier domain. (Footnote 2: For this analysis, we assume the standard Discrete Fourier Transform (DFT) and assume a large enough input such that the boundaries for convolution & shifting are irrelevant.)

As can be deduced from the equation, any deviation of $\mathcal{F}(f)$ from the identity will lead to a geometric multiplication of the frequency response, such that frequencies where $|\mathcal{F}(f)| < 1$ will vanish, whereas they will grow out of bounds for $|\mathcal{F}(f)| > 1$.
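As a rough numerical illustration of this geometric effect, the following sketch computes the magnitude of the frequency response of a stride-1 filter after repeated application. The two kernels and the helper name `recursive_response` are hypothetical choices made for illustration and are not taken from the model described here; circular convolution is assumed so that the DFT argument holds exactly.

```python
import numpy as np

def recursive_response(kernel, n_applications, signal_length=64):
    """Magnitude of the DFT response after applying `kernel` n times.

    Assumes circular convolution, so each application multiplies the
    per-frequency response once more.
    """
    response = np.fft.fft(kernel, n=signal_length)  # zero-padded DFT of the filter
    return np.abs(response) ** n_applications

# Hypothetical near-identity filters (illustrative values only).
smoothing = np.array([0.05, 0.9, 0.05])     # |response| <= 1: attenuates high frequencies
sharpening = np.array([-0.05, 1.1, -0.05])  # |response| >= 1: amplifies high frequencies

for name, kernel in [("smoothing", smoothing), ("sharpening", sharpening)]:
    once = recursive_response(kernel, 1)
    many = recursive_response(kernel, 60)
    print(f"{name}: 1 application in [{once.min():.3f}, {once.max():.3f}], "
          f"60 applications in [{many.min():.2e}, {many.max():.2e}]")
```

Both filters deviate from the identity by at most 20% at any frequency, yet after 60 applications the affected frequencies have either almost vanished or grown by several orders of magnitude.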

In our case, the system is not shift-equivariant (unlike above), but shift-equivariant modulo its total stride $S$ (determined for us by our 4 downsampling convolutions). Thus, when we shift the input by less than $S$, we get a different frequency response (and thus a different error) for each shift, but when we shift by an offset that is a multiple of $S$, we get the same frequency response. This is visualized in Fig. 5, where the output is the same in a) and d).
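Shift-equivariance modulo the total stride can be checked on a very small example. The sketch below uses a hypothetical 1-D system (block averaging followed by repetition, with an assumed stride of 4) as a stand-in for a strided encoder/decoder pair; it is not the network used in this paper.

```python
import numpy as np

def strided_system(x, stride):
    """Toy linear 1-D system: average each block of `stride` samples
    (downsampling), then repeat each average `stride` times (upsampling)."""
    block_means = x.reshape(-1, stride).mean(axis=1)
    return np.repeat(block_means, stride)

def error_after_shift(x, shift, stride):
    """Circularly shift the input, apply the system, un-shift the output,
    and return the reconstruction error in the original coordinates."""
    shifted = np.roll(x, shift)
    output = np.roll(strided_system(shifted, stride), -shift)
    return output - x

stride = 4  # hypothetical total stride
rng = np.random.default_rng(0)
x = rng.standard_normal(32)

err_0 = error_after_shift(x, 0, stride)
err_1 = error_after_shift(x, 1, stride)
err_s = error_after_shift(x, stride, stride)

print("shift by the stride gives the same error:", np.allclose(err_0, err_s))
print("shift by one sample gives the same error:", np.allclose(err_0, err_1))
```

The first check holds because shifting by the stride only permutes the blocks, whereas a shift by one sample changes which samples are averaged together and therefore changes the error.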

Applying a random offset to the inputs and undoing it for the output at each iteration effectively permutes the frequency responses within the modulo group, and also permutes the error that each of the responses has with respect to the ideal response. The ideal response is identical for each pixel location (since the training data and task itself is shift equivariant modulo 1). Random offsets may not fully eliminate the error, but will lead to a significant reduction in error accumulation over time as the permuted errors can “cancel out” (see below).

More formally, consider a linear CNN $f$ locally approximating our residual branch. Importantly, we still assume there are strided convolutions and de-convolutions, hence we know $f$ is equivariant w.r.t. shifts which are a multiple of the total stride $S$ [zhang2019making]. Let $x$ be an $L$-length, 1-channel input, represented as an $L \times 1$ matrix. Thus, $f(x)$ is also $L \times 1$. To apply Eq. 7 to $f$, we need to represent it as a single convolution. This is in general impossible, but it holds that there exists a convolutional filter $f'$ mapping $S$ channels to $S$ channels, s.t. $f(x) = \mathrm{D2S}(f' * \mathrm{S2D}(x))$, where $x$ is transformed via space-to-depth (S2D) to an $(L/S) \times S$-dimensional matrix, and the output is depth-to-space (D2S) transformed. (Footnote 3: Note that this applies since the network is linear; for networks with nonlinearities we can apply the same argument when linearizing the network around the given input.)
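The space-to-depth argument can be made concrete for a toy case. In the sketch below, a hypothetical strided system (block averaging followed by repetition, assumed stride 4) is rewritten as a multi-channel filter acting on the space-to-depth transformed input; for this particular system the equivalent filter happens to have kernel size 1, so it reduces to a channel-mixing matrix. The function names are illustrative.

```python
import numpy as np

def space_to_depth(x, stride):
    """Reshape a length-L signal into an (L / stride) x stride matrix of channels."""
    return x.reshape(-1, stride)

def depth_to_space(channels):
    """Inverse of space_to_depth: flatten the channels back into a 1-D signal."""
    return channels.reshape(-1)

def strided_system(x, stride):
    """Toy strided linear system: block average followed by repetition."""
    return np.repeat(x.reshape(-1, stride).mean(axis=1), stride)

stride = 4  # hypothetical total stride
rng = np.random.default_rng(0)
x = rng.standard_normal(32)

# Equivalent filter for this toy system: a stride x stride matrix that
# mixes the channels at each (downsampled) position.
channel_filter = np.full((stride, stride), 1.0 / stride)

via_channels = depth_to_space(space_to_depth(x, stride) @ channel_filter)
direct = strided_system(x, stride)
print("representations agree:", np.allclose(via_channels, direct))
```

A general strided linear CNN would need a filter with spatial extent in the channel representation, but the reshaping steps stay the same.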

Now, the above frequency domain analysis applies for each of the output channels, i.e., (adding a dimension because we have channels), where is the -matrix representing the Fourier Transform of . Since we train for reconstruction starting from randomly initialized filters, we expect , and hence , where is the identity matrix, and is a random matrix with zero-mean elements. Two applications of are thus equivalent to multiplying with


where will form the error when applied to , but will dominate, since . Keeping this in mind, we can now look at randomly shifting the input and unshifting the output. This amounts to obtaining a shifted , where is a permuted version of . We obtain,


Intuitively, the difference between the equations is that in Eq. 9, errors accumulate in the same channel, whereas in Eq. 8, they can cancel each other out. To see this we note that since is a permutation of , which implies . Furthermore, so that it is also dominated by the prior term.

To test this explanation, we overfit a toy linear CNN for MSE-based reconstruction (without any quantization) for 10 000 steps on a single image, recursively applied it, and compared randomly shifting to not shifting. The result is visualized in Fig. 4 (see more details in Appendix A.3). As we can see, without shifting, the error grows exponentially, while randomly shifting inputs dampens it significantly, and reduces the output grid. We also show how this significantly reduces artifacts of our GAN model.
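The cancellation effect can also be reproduced with an even simpler stand-in than the toy linear CNN of Appendix A.3. The sketch below assumes a hypothetical system that scales each sample by a gain depending only on its position modulo the stride (identity plus a small zero-mean error per phase), which corresponds to a diagonal special case of the analysis above. All gain values are illustrative.

```python
import numpy as np

def phase_gain_system(x, gains):
    """Toy linear system that is shift-equivariant only modulo len(gains):
    sample i is scaled by gains[i % len(gains)]."""
    return x * np.tile(gains, len(x) // len(gains))

def unroll(x, gains, n_steps, random_shift, seed=0):
    """Recursively apply the system, optionally shifting the input by a
    random offset and un-shifting the output at every step."""
    rng = np.random.default_rng(seed)
    stride = len(gains)
    y = x.copy()
    for _ in range(n_steps):
        shift = int(rng.integers(0, stride)) if random_shift else 0
        y = np.roll(phase_gain_system(np.roll(y, shift), gains), -shift)
    return y

# Hypothetical per-phase gains: identity plus a small zero-mean error.
gains = 1.0 + np.array([0.03, -0.03, 0.01, -0.01])
x = np.ones(64)  # fixed input that is compressed recursively

mse_fixed = np.mean((unroll(x, gains, 60, random_shift=False) - x) ** 2)
mse_random = np.mean((unroll(x, gains, 60, random_shift=True) - x) ** 2)
print(f"MSE without shifting:     {mse_fixed:.3f}")
print(f"MSE with random shifting: {mse_random:.3f}")
```

Without shifting, each position is multiplied by the same gain at every step, so the error compounds and a periodic grid emerges. With random shifts, each position sees a random sequence of gains whose deviations largely cancel.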

3.4 Controlling Rate During Training with a Proportional Controller

Figure 6: Comparing a broad family of models with different hyper-parameters trained for 400k steps. We see how the rate parameter is adapted during training (left) to match the target bpp of 0.05 for all (right). At 80k Steps we drop the target rate, at 325k steps we drop the LR.

The rate parameter controls the trade-off between bitrate and other loss terms (distortion, GAN loss, etc.). However, since there is no direct relationship between this parameter and the bitrate of the model as we vary hyper-parameters such as loss weights, comparison across models is difficult, since they end up at different rates. Van Rozendaal et al. [van2020lossy] also observe this and propose targeting a fixed distortion via constrained optimization. Another approach was used in [mentzer2020high], where the parameter was dynamically switched between a small and a large value, depending on whether the model bitrate was below or above a given target. However, this still requires tuning these values depending on other hyper-parameters. This approach can be interpreted as an “on-off” controller, and a natural generalization of this technique is to use a proportional controller, which we adopt here.

In particular, given a target bitrate , we measure the error between the current mini-batch bitrate and the target (in log-space), and apply it with a proportional controller to update as follows:


where for stability and the “proportional gain” is a hyperparameter. We note that if we ignore the log-reparameterization, this corresponds to the “Basic Differential Multiplier Method” [platt1988constrained].

We found this to be highly effective to obtain comparable models in terms of bitrate when changing hyper-parameters such as learning rates, amount of unrolling, loss weights, etc., as visualized in Fig. 6. We train with a higher rate for the first 80k steps by adapting the target bitrate (details in App. A.6).
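A proportional controller in log-space can be sketched in a few lines. The snippet below is an illustration under stated assumptions and not the training code: it assumes the rate parameter multiplies the rate term of the loss (so a bitrate above the target should increase it), and it replaces the model with a hypothetical function `toy_bitrate` whose bitrate falls as the parameter grows. The gain value is illustrative.

```python
import math

def update_log_lambda(log_lambda, bitrate, target_bitrate, gain):
    """One proportional-controller step on the log of the rate parameter.

    Assumption: the parameter weights the rate term of the loss, so a
    bitrate above the target must increase it.
    """
    error = math.log(bitrate) - math.log(target_bitrate)
    return log_lambda + gain * error

def toy_bitrate(log_lambda):
    """Hypothetical stand-in for a model: bitrate falls as the weight grows."""
    return 0.5 * math.exp(-0.5 * log_lambda)

target_bpp = 0.05  # target bpp as in Fig. 6
gain = 0.1         # illustrative proportional gain
log_lambda = 0.0

for step in range(500):
    current_bpp = toy_bitrate(log_lambda)  # would be the mini-batch bitrate
    log_lambda = update_log_lambda(log_lambda, current_bpp, target_bpp, gain)

final_bpp = toy_bitrate(log_lambda)
print(f"bitrate after 500 steps: {final_bpp:.4f} bpp (target {target_bpp} bpp)")
```

Because the update is proportional to the log-space error, large deviations are corrected quickly while the parameter settles smoothly near the target, in contrast to switching between two fixed values.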

4 Experiments


Our training data contains approximately spatio-temporal crops of dimension , each of length frames, obtained from public videos from YouTube. These videos are filtered to be at least 1080p in resolution, 16:9 in aspect ratio, and 30 fps. We omit content labeled as “video games” or “computer generated graphics” to further filter non-natural scene content. To do this filtering we rely on YouTube’s category system [ytcat], and also on motion estimates in order to filter out fully static videos.

We evaluate our model on the 30 diverse videos of MCL-JCV [wang2016mcl], which is available under a permissive license from USC, and contains no objectionable content. This dataset contains a broad variety of content types and difficulty. This includes a wide variety of motion from natural videos, computer animation and classical animation. In contrast to most previous work, we do not constrain our model to use a small GoP size, and instead only use an I-frame for the first frame, since our network performs well even when unrolled over the full MCL-JCV videos.

User Study Protocol

We conducted all user studies with human raters who worked from home due to the COVID-19 pandemic. We advised all raters to make sure there were no visible light sources behind or in front of their monitors, to reduce glare. Given the heterogeneous home office environments in which raters operated, it was not possible to ensure consistent monitor sizes.

Each rater used a web interface to view the videos to be rated. At any time, they were comparing a pair of methods A and B vs. the original (i.e., essentially “two alternative forced choice” (2AFC)). Raters could toggle between A and B in-place, while the original was on the side (see inline Figure), and they were asked to select which of the two methods (A/B) is closest to the original (see App. A.7 for a screenshot of the instructions). This protocol is inspired by previous work in image compression [mentzer2020high, clic2020], and ensures that differences between methods are easy to spot.

Depending on the motion in the video and the bitrate, comparing video compression methods can be extremely challenging. While experts can spot differences, we received feedback that the task may be too hard when methods are very close. Thus, we instructed raters to pause the videos when they “can’t see the difference”. We show in App. A.7 that this was used in particular for higher rates, where methods can become very similar.

In order to ensure consistency, and to be sure that the raters can fit the task in their web browser, we center-crop the videos to . In order to play back the videos in a web browser we transcode all methods with VP9, using a very high quality factor for VP9 to avoid any new artifacts. Since this yields large file sizes, we focus on the first 2 seconds (60 frames) of each video to ensure smooth playback. This has the additional benefit of lowering variance across raters, and allows the raters to get very familiar with the motion and details. We observed a ramp-down in the time it took a rater to perform this task once they got familiar with the video clip. The frames play in a loop.

Our raters were contracted through the “Google Cloud AI Labeling Service” [LabelingService], and we estimate that we were charged for six thousand units for a total of USD. Before we started the studies presented in this paper, we ensured that the raters are capable of performing the task correctly by asking them to differentiate between HEVC-encoded videos with and . To non-experts there is a small difference between the two settings, yet all our raters were able to correctly identify that -encoded videos looked closer to the original, for all videos in MCL-JCV.

Figure 7: Metrics on MCL-JCV. We show Ours at three bitrates, and compare it to the MSE-only baseline (same as Ours but without GAN loss), HEVC, and the recent neural “Scale-Space Flow” (SSF [agustsson2020scale]). Comparing to Fig. 1, we can see that none of the metrics predicts the right ranking.

In order to provide consistency in the results we present, we used twelve raters per method pair and denote a win for a particular method when at least half plus one of the raters were in agreement.
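The win rule can be written down directly. The sketch below is a minimal illustration with hypothetical ratings; the function names and the vote encoding ("A"/"B") are assumptions, not the format of the released data.

```python
from collections import Counter

def video_outcome(votes):
    """Return the method preferred by at least half plus one of the raters,
    or "tie" if no method reaches that threshold."""
    threshold = len(votes) // 2 + 1
    for method, count in Counter(votes).items():
        if count >= threshold:
            return method
    return "tie"

def tally(votes_per_video):
    """Count wins per method and ties across all videos."""
    return Counter(video_outcome(votes) for votes in votes_per_video)

# Hypothetical ratings for three videos, twelve raters each.
example = [
    ["A"] * 7 + ["B"] * 5,  # A reaches 7 of 12
    ["A"] * 6 + ["B"] * 6,  # nobody reaches 7 of 12
    ["A"] * 3 + ["B"] * 9,  # B reaches 7 of 12
]
print(tally(example))
```

With twelve raters the threshold is seven, so a 6-6 split counts as a tie rather than a win for either method.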


We compare our approach in terms of PSNR, MS-SSIM [wang2003multiscale], LPIPS [zhang2018unreasonable], the “Video Multi-Method Assessment Fusion” algorithm VMAF [vmafurl], developed by Netflix to evaluate video codecs, FID [heusel2017gans] (which, following HiFiC [mentzer2020high], we evaluate on non-overlapping patches), and the recent unsupervised perceptual quality metric PIM [bhardwaj2020unsupervised]. We note that except for VMAF, none of these metrics were designed for video, and they contain no motion features.

Models and Baselines

We refer to our model as Ours. We call our baseline “MSE-only”, which uses exactly the same architecture and training schedule as Ours, but is trained without a GAN loss (i.e., with the GAN terms disabled in Eqs. (2), (4)). We also compare to “Scale-Space Flow” (SSF), which is the model from Agustsson et al. [agustsson2020scale], a recent neural compression method that is comparable to HEVC in PSNR. Finally, we compare against the non-learned HEVC, using the “medium” preset in ffmpeg and no B-frames, following [agustsson2020scale] (exact ffmpeg command in App. A.4).

5 Results

We summarize the raters’ preference in Fig. 1, and show our metrics in Fig. 7. We compare Ours against HEVC at three bitrates, seeing that our method is comparable to HEVC at 0.064 bpp (14 “wins” vs. 12), preferred at 0.13 bpp (18 vs. 9), and preferred at 0.22 bpp (16 vs. 9). In order to estimate the effect of the GAN loss on visual quality, we compared Ours against MSE-only and SSF [agustsson2020scale] at low rates ( bpp). We see in Fig. 1 that MSE-only is only preferred 4 times out of 30, with 4 ties, showing the importance of a GAN loss, and that SSF is never preferred, with no ties.

We emphasize that MSE-only is comparable to HEVC in terms of PSNR (Fig. 7), yet clearly worse in terms of visual quality (never preferred to Ours, with 4 ties). Overall, we highlight that some metrics would have predicted other outcomes. Apart from MS-SSIM, all metrics favor HEVC. MS-SSIM and PSNR incorrectly favor MSE-only over Ours, while PIM and LPIPS correctly rank Ours vs. MSE-only, but incorrectly favor HEVC. VMAF ranks Ours and MSE-only as being similar, and FID values for Ours and MSE-only are also similar. Overall, none of the metrics would have predicted the results in Fig. 1, but PIM and LPIPS correctly rank some comparisons. This type of result has been observed in the domain of neural image compression [clic2020, mentzer2020high], where the best methods get ranked by humans, since no metric currently exists that can accurately rank these methods according to subjective quality.

Further research in the domain of perceptual metrics for video is needed and we release all our ratings (i.e., rater preference) and videos that were compared to encourage this research, see Appendix B. We hope that this could be the start of a new video benchmark for perceptual quality prediction.

Visual Ablations

We found that the following components are key for the performance of our approach. Not using the free latent (Section 3.1) causes blurry reconstructions, similar to what we see in the MSE-only baseline, see Fig. 3, top. We note that using it but not conditioning the discriminator also causes blurry reconstructions (not visualized but looks similar to not using the free latent). We get inconsistent flow when we do not feed UFlow, and also when feeding it but not using the flow loss regularizer (Eq. 6). Removing either thus hurts temporal consistency, see Fig. 3, bottom.

6 Conclusion

We presented a generative approach to neural video compression that was evaluated through user studies, where we saw that we are competitive with HEVC, while outperforming previous neural video compression codecs. We proposed a technique to mitigate temporal error-accumulation problems, which was crucial for obtaining high visual quality. While existing perceptual quality/similarity metrics might be useful to track progress, we showed that they do not indicate the capability of the algorithm, which can only be assessed through user studies. We released the data we collected in evaluating our method, which we hope will encourage other researchers to develop better perceptual quality metrics specifically for neural video compression assessment.

The main limitation of our method in terms of deployment is the decoder architecture. While smaller than previous generative image compression approaches, this architecture is still relatively slow, and if speed is desired, faster architectures should be explored. To further improve our method, some sort of explicit state propagation (e.g., similar to Golinski et al. [golinski2020feedback]) could be included, as it will likely help reduce bitrates.

Societal impact

We hope researchers will be able to build on top of the proposed method to create a new generation of video codecs, which we believe will have a net-positive impact on society by allowing less data to be transmitted for applications such as video conferences and video streaming.


Appendix A Appendix

A.1 Architecture Details

Figure 8: Detailed view of the architecture, showing the layers in each of the blocks in Fig. 2. “Conv” denotes a 2D convolution with output channels, “-SS” denotes the filter size; if that is omitted we use 3×3. , indicates downsampling and upsampling, respectively, and “Norm” is the ChannelNorm layer employed by HiFiC [mentzer2020high]. The blocks with a color gradient indicate Residual Blocks; we only show the detail in one. “LReLU” is the Leaky ReLU with . We note that we employ SpectralNorm in both discriminators. The distributions predicted by the Hyperprior are used to encode the latents with entropy coding. Like in Fig. 2, learned I-frame CNNs are shown in blue and learned P-frame CNNs in green, dashed lines are not active during decoding, SG is a stop gradient operation, Blur is scale space blurring, and Warp is bicubic warping. UFlow is a frozen optical flow model from [jonschkowski2020matters].

A zoomed-in version of the architecture from Fig. 2 is given in Fig. 8.

A.2 Scale Space Blur

Figure 9 (panels: Input, Sigma Field, Result): “Scale Space Blurring” is a light variant of Scale Space Warping (SSF, [agustsson2020scale]) which decouples blurring from warping, which we found easier to implement efficiently. Here we visualize how a smooth sigma field adaptively blurs an image.
Figure 10: We compare the R-D performance of Bilinear/Bicubic Warping + Scale-Space Blurring against the Scale-Space Warping of Agustsson et al. [agustsson2020scale], and find it gives comparable results. We also show plain bilinear warping, without any scale space blurring. All models trained for MSE.

We used a light variant of scale-space warping [agustsson2020scale]. Our variant decouples the operation into plain warping and “scale-space blur”, as described below.

Both variants, at their core, use the “scale-space flow” field , which generalizes optical flow by also specifying a "scale" for each target pixel .

To compute a scale-space warped result , the source image is first repeatedly convolved with Gaussian blur kernels to obtain a “scale-space volume” with levels in total. The three coordinates of the scale-space flow are then used to jointly warp and blur the source image, retrieving pixels via tri-linear interpolation from the volume.

To simplify the implementation, we decouple the warping and blurring:

where adaptively blurs from a scale-space volume without any warping. See Fig. 9 for a visualization of how a given input and sigma field get blurred via scale space blur.

We found this easier to implement efficiently, as we can use efficient resampling implementations (and more flexible resamplers such as bicubic resampling) for warping.
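To make the decoupled blur concrete, the following is a minimal NumPy sketch of an adaptive blur driven by a per-pixel sigma field. It is an illustration under stated assumptions, not the implementation used for the paper: the function names, the choice of level sigmas, the edge padding, and the linear interpolation between neighboring levels (in sigma, rather than in level index) are all hypothetical choices made here.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1D Gaussian kernel; sigma <= 0 yields the identity kernel."""
    if sigma <= 0:
        return np.array([1.0])
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2D float image with edge padding."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    if r == 0:
        return img.copy()
    h, w = img.shape
    padded = np.pad(img, ((r, r), (0, 0)), mode="edge")
    out = sum(k[i] * padded[i:i + h, :] for i in range(len(k)))
    padded = np.pad(out, ((0, 0), (r, r)), mode="edge")
    return sum(k[i] * padded[:, i:i + w] for i in range(len(k)))

def scale_space_blur(img, sigma_field, level_sigmas):
    """Blur each pixel according to sigma_field, without any warping.

    level_sigmas is an assumed, strictly increasing list of blur levels
    (e.g. starting at 0 for the unblurred image).
    """
    ls = np.asarray(level_sigmas, dtype=float)
    # Scale-space volume: one blurred copy of the image per level.
    volume = np.stack([gaussian_blur(img, s) for s in ls])
    sig = np.clip(sigma_field, ls[0], ls[-1])
    # Neighboring levels that bracket each pixel's requested sigma.
    hi = np.clip(np.searchsorted(ls, sig, side="right"), 1, len(ls) - 1)
    lo = hi - 1
    t = (sig - ls[lo]) / (ls[hi] - ls[lo])
    rows, cols = np.indices(img.shape)
    return (1 - t) * volume[lo, rows, cols] + t * volume[hi, rows, cols]
```

With a sigma field of all zeros the input is returned unchanged, and with a field equal to the largest level the result matches a plain Gaussian blur at that level. A warp with any standard resampler can then be applied as a separate step.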

In Fig. 10 we show that bilinear/bicubic warping combined with scale-space blur (our implementation) gives comparable performance to scale-space warping [agustsson2020scale]. For the figure, we used the architecture of [agustsson2020scale] trained for MSE only, with a slightly accelerated training schedule: we skip the last training stage on larger (384px) crops and instead decay the learning rate by 10x after 800 000 steps.

A.3 Gridding Error

On representing (linearized) Autoencoders with Space-To-Depth

Why can a sequence of strided convolutions be represented with a space-to-depth followed by a plain convolution in Sec. 3.3? This is a generalization of the argument made by Shi et al. (“Is the deconvolution layer the same as a convolutional layer?”, 2016), who show that i) a convolution with stride is equivalent to a space-to-depth followed by a non-strided convolution, and ii) a de-convolution with stride is equivalent to a depth-to-space preceded by a non-strided convolution.

If we abuse to mean “can be represented with” (i.e., “is less powerful than”), this can be written as

where / refers to a convolution/deconvolution with kernel size and stride , and S-to-D/D-to-S refer to space-to-depth/depth-to-space.
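The first equivalence can be checked numerically in its simplest setting. The sketch below assumes a kernel size equal to the stride and "valid" padding, in which case the strided convolution becomes a 1×1 convolution after space-to-depth, with the same weights reshaped. Larger kernels need a correspondingly larger kernel after space-to-depth and are not covered here. The helper names `space_to_depth` and `strided_conv` are hypothetical.

```python
import numpy as np

def space_to_depth(x, s):
    """Rearrange (H, W, C) into (H/s, W/s, s*s*C), one s-by-s block per output pixel."""
    h, w, c = x.shape
    x = x.reshape(h // s, s, w // s, s, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // s, w // s, s * s * c)

def strided_conv(x, kernel, s):
    """Plain 'valid' convolution (cross-correlation) with kernel (k, k, Cin, Cout) and stride s."""
    k = kernel.shape[0]
    h, w, _ = x.shape
    oh, ow = (h - k) // s + 1, (w - k) // s + 1
    out = np.zeros((oh, ow, kernel.shape[3]))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * s:i * s + k, j * s:j * s + k, :]
            out[i, j] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
s = 2
x = rng.standard_normal((8, 8, 3))
kernel = rng.standard_normal((s, s, 3, 5))

# Direct stride-2 convolution.
direct = strided_conv(x, kernel, s)

# Same weights viewed as a 1x1 kernel acting on the space-to-depth input.
kernel_1x1 = kernel.reshape(1, 1, s * s * 3, 5)
via_s2d = strided_conv(space_to_depth(x, s), kernel_1x1, 1)

print(np.allclose(direct, via_s2d))
```

The reshape works because both `space_to_depth` and the kernel flatten the block offsets and input channels in the same order (row offset, column offset, channel).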

This can be extended to a sequence of strided convolutions (without nonlinearities) as we have if we use a sufficiently large kernel size (and similarly for deconvolutions). This allows us to write

Now, if we look at the encoder combined with the decoder (using only two convolutions in each to simplify the exposition), we can fold together the stride-1 convolutions:

where the last step follows from the fact that and hence .

On Norms of Sums of Permutations

Following Eq. (9) we claimed that

due to the second term being a permutation of the first one. Since the Frobenius norm is the Euclidean norm of the flattened matrices, consider the vector case where we have some , and a permutation of it, . Now clearly , so we have


where in (13) we use the Cauchy-Schwarz inequality.
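As a worked illustration of how these ingredients combine, the following sketch derives one bound of this type. The notation is assumed here ($v$ for the vector, $\pi(v)$ for its permutation), and the exact statement may differ from the claim following Eq. (9). It only uses that a permutation preserves the Euclidean norm, $\|\pi(v)\|_2 = \|v\|_2$, together with the Cauchy-Schwarz inequality.

```latex
\begin{align*}
\|v + \pi(v)\|_2^2
  &= \|v\|_2^2 + 2\,\langle v, \pi(v)\rangle + \|\pi(v)\|_2^2 \\
  &\le \|v\|_2^2 + 2\,\|v\|_2\,\|\pi(v)\|_2 + \|\pi(v)\|_2^2
     && \text{(Cauchy-Schwarz)} \\
  &= 4\,\|v\|_2^2
     && \text{(since } \|\pi(v)\|_2 = \|v\|_2\text{)}
\end{align*}
```

Taking square roots gives $\|v + \pi(v)\|_2 \le 2\,\|v\|_2$, with equality when $\pi(v) = v$.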

Toy Model Details

For the toy model, we used as an encoder three Conv2D layers with stride 2, kernel size 4 and 32 output channels (meaning the bottleneck had 32 channels). The decoder was mirrored, using Conv2DTranspose also with stride 2, kernel size 4 and 32 channels (except the last one having 3 output channels). We trained (overfitted) it for 10 000 iterations using random crops from the last frame of “BigBucksBunny” from MCL-JCV. We observed similar results on other images.

A.4 HEVC encode command

We use ffmpeg, version 4.3.2, to encode videos with HEVC, and use the medium preset, as well as no B-frames (since our algorithm is P-frame only), following previous work [agustsson2020scale]. We compressed with different quality factors $Q for each video, using $Q , which yields bpps that match our models.

ffmpeg -i $INPUT_FILE -c:v hevc -crf $Q -preset medium \
    -x265-params bframes=0 $OUTPUT_FILE
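For sweeping over quality factors, the command above can be wrapped in a small script. The following is a sketch under assumptions: the function names and file names are hypothetical, and the bpp computation simply divides the encoded file size in bits by the total number of pixels, which may differ from how the bpps in this paper were measured (e.g., regarding container overhead). The ffmpeg flags are copied from the command above.

```python
import os
import subprocess

def hevc_command(input_file, output_file, q):
    """Build the ffmpeg argument list for one quality factor q."""
    return [
        "ffmpeg", "-i", input_file,
        "-c:v", "hevc", "-crf", str(q),
        "-preset", "medium",
        "-x265-params", "bframes=0",
        output_file,
    ]

def bits_per_pixel(num_bytes, num_frames, height, width):
    """Bits per pixel: file size in bits divided by the total pixel count."""
    return 8.0 * num_bytes / (num_frames * height * width)

def encode_and_measure(input_file, output_file, q, num_frames, height, width):
    """Run ffmpeg (must be installed) and return the bpp of the output file."""
    subprocess.run(hevc_command(input_file, output_file, q), check=True)
    return bits_per_pixel(os.path.getsize(output_file), num_frames, height, width)
```

One could then call `encode_and_measure` for a range of `q` values per video and keep the one whose bpp is closest to the rate of the model being compared.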

A.5 Training Time

In Table 1 we report the training speed for each of the training stages, which results in a total training time of 62.7 hours. We note that the first stage (I-frame) trains more than 20× faster than the last stage in terms of steps/s (19.7 vs. 0.95).

Batch size | # I frames | # P frames | Num steps [k] | steps/s | time [h]
8          | 1          | 0          | 1000          | 19.7    | 14.1
8          | 1          | 1          | 80            | 7.3     | 3.0
8          | 1          | 2          | 220           | 3.9     | 15.7
8          | 1          | 3          | 50            | 2.6     | 5.3
8          | 1          | 5          | 50            | 1.4     | 9.9
4          | 1          | 8          | 50            | 0.95    | 14.6
Totals:    |            |            | 1450          |         | 62.7
Table 1: Training speed/time for each stage of our model on a Google Cloud TPU.
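The time column follows directly from the step counts and speeds, which the short sketch below recomputes. The numbers are copied from Table 1; the variable names are chosen here for illustration.

```python
# (num steps in thousands, steps per second) for each training stage in Table 1.
stages = [(1000, 19.7), (80, 7.3), (220, 3.9), (50, 2.6), (50, 1.4), (50, 0.95)]

# Hours per stage: steps / (steps per second) / (seconds per hour).
hours = [k * 1000 / steps_per_s / 3600 for k, steps_per_s in stages]

total_steps_k = sum(k for k, _ in stages)
total_hours = sum(hours)

# Ratio of the first-stage speed to the last-stage speed.
speedup = stages[0][1] / stages[-1][1]

for (k, steps_per_s), h in zip(stages, hours):
    print(f"{k:5d}k steps at {steps_per_s:5.2f} steps/s -> {h:5.1f} h")
print(f"total: {total_steps_k}k steps, {total_hours:.1f} h, speed ratio {speedup:.1f}x")
```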

A.6 Hyper Parameters

For scale space blurring we set and used levels, which implies that the sequence of blur kernel sizes is .

For rate control we initially swept over a wide range and found that worked well, which we then fixed for all future experiments. We initialized in all cases.

Previous works [minnen2020channel, mentzer2020high] typically initially train for a higher bitrate. This is usually implemented by using a schedule on the R-D weight that is decayed by a factor or early in training. Since the rate controller automatically controls this weight, we emulate the approach by instead using a schedule on the targeted bitrate . We use a simple rule and target a higher bitrate for the first of training steps.
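A schedule of this kind can be written as a step function on the target bitrate. The sketch below is only an illustration: the function name, the additive form of the increase, and every parameter value are hypothetical placeholders, not the settings used for the models in this paper.

```python
def target_bitrate(step, total_steps, final_target_bpp, boost_bpp, warm_fraction):
    """Target bitrate at a training step.

    Assumed rule: add boost_bpp to the final target during the first
    warm_fraction of training, then use the final target.
    """
    if step < warm_fraction * total_steps:
        return final_target_bpp + boost_bpp
    return final_target_bpp

# Placeholder values, for illustration only.
early = target_bitrate(step=0, total_steps=1000, final_target_bpp=0.1,
                       boost_bpp=0.05, warm_fraction=0.2)
late = target_bitrate(step=500, total_steps=1000, final_target_bpp=0.1,
                      boost_bpp=0.05, warm_fraction=0.2)
```

The rate controller would then be given the value returned for the current step as its target.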

For the I-frame loss (Eq. 2), we use , and for rate-control.

For the P-frame loss (Eq. 4), we use , and in . For the three different models we use in the user study, we use . A detail omitted from the equation is that we scale the loss by the constant , as this yields similar magnitudes as no loss scaling.

We use the same learning rate for all networks, and train with the Adam optimizer. We linearly increase the LR from 0 during the first steps, and then drop it to after steps. We train the discriminators for 1 step for each generator training step.
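The learning rate schedule described above (linear warmup from 0, a constant phase, then a single drop) can be sketched as follows. The function name and all numeric values in the usage lines are hypothetical placeholders, not the values used for training.

```python
def learning_rate(step, base_lr, warmup_steps, drop_step, dropped_lr):
    """Piecewise schedule: linear warmup from 0, constant, then one drop."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    if step < drop_step:
        return base_lr
    return dropped_lr

# Placeholder values, for illustration only.
lr_start = learning_rate(0, base_lr=1e-4, warmup_steps=1000, drop_step=5000, dropped_lr=1e-5)
lr_mid = learning_rate(500, base_lr=1e-4, warmup_steps=1000, drop_step=5000, dropped_lr=1e-5)
lr_full = learning_rate(1000, base_lr=1e-4, warmup_steps=1000, drop_step=5000, dropped_lr=1e-5)
lr_end = learning_rate(5000, base_lr=1e-4, warmup_steps=1000, drop_step=5000, dropped_lr=1e-5)
```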

A.7 User Study Details

Figure 11: Instructions shown to the raters.
Figure 12: We compare various metrics for each of the three rate points, for the “Ours vs. HEVC” user studies.

The instructions we gave to raters are presented in Fig. 11 in the form of a screenshot of the instructions they read on screen.

We recorded multiple metrics, summarized in Fig. 12. We see that Rating Time per video increases as we increase the bitrate, likely because methods become closer to the original as we increase the rate. Number of Pause Events by raters is lower for the low rate task. Number of Flips between the methods raters compare also increases towards the high rates.

In Fig. 13, we use rating time to order videos, and show the five videos that took the longest to rate.

Figure 13: Average time spent by raters to determine which method is best, at the three bitrates we compared “Ours vs. HEVC”. We also show the mean time over the three rates, and sort by it. On the right, we show the 5 “hardest” videos according to this ranking.

Appendix B Data Release

For each user study comparison we made between methods, we release the reconstructions as well as a CSV containing all the rater information:

  1. reconstructions_A, reconstructions_B: Each folder contains a subfolder for each of the 30 videos of MCL-JCV, and each such video subfolder contains 60 PNGs, the reconstructions of the respective method.

  2. ratings.csv: Each row is a video, and we have the following columns: method_a_wins, method_b_wins indicate the number of times method A or B won; method_a_bpp, method_b_bpp indicate the per-video bpps; avg_flips, avg_time_ms, avg_num_pauses indicate average flips, average time per video, and average number of pauses, respectively.
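A ratings file with these columns can be summarized with the Python standard library. The sketch below is an assumption-laden example: it treats a video as won by the method with more rater wins (and as a tie otherwise), which may differ from how per-video preference was determined for Fig. 1, and the sample rows are made up purely to exercise the code.

```python
import csv
import io

def tally_preferences(csv_file):
    """Count per-video outcomes from an open ratings.csv file object."""
    counts = {"method_a": 0, "method_b": 0, "tie": 0}
    for row in csv.DictReader(csv_file):
        a_wins = float(row["method_a_wins"])
        b_wins = float(row["method_b_wins"])
        if a_wins > b_wins:
            counts["method_a"] += 1
        elif b_wins > a_wins:
            counts["method_b"] += 1
        else:
            counts["tie"] += 1
    return counts

# Made-up rows using the column names listed above.
sample = io.StringIO(
    "method_a_wins,method_b_wins,method_a_bpp,method_b_bpp,"
    "avg_flips,avg_time_ms,avg_num_pauses\n"
    "5,2,0.064,0.066,3.1,21000,0.4\n"
    "1,6,0.061,0.065,2.0,15000,0.1\n"
    "4,4,0.070,0.069,4.2,30000,0.9\n"
    "7,0,0.059,0.060,1.5,12000,0.0\n"
)
counts = tally_preferences(sample)
print(counts)
```

For a released file, the same function can be called with `open("ratings.csv", newline="")` in place of the in-memory sample.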