Adversarial Distortion for Learned Video Compression

by Vijay Veerabadran, et al.

In this paper, we present a novel adversarial lossy video compression model. At extremely low bit-rates, standard video coding schemes suffer from unpleasant reconstruction artifacts such as blocking and ringing. Existing learned neural approaches to video compression have achieved reasonable success in reducing the bit-rate for efficient transmission, and they reduce the impact of artifacts to an extent. However, they still tend to produce blurred results under extreme compression. We therefore propose a deep adversarial learned video compression model that minimizes an auxiliary adversarial distortion objective. We find that this adversarial objective correlates better with human perceptual quality judgement than traditional quality metrics such as MS-SSIM and PSNR. Our experiments using a state-of-the-art learned video compression system demonstrate a reduction of perceptual artifacts and the reconstruction of detail that is otherwise lost, especially under extremely high compression.



1 Introduction

As the resolution of digitally recorded and streamed videos keeps growing, there is an increasing demand for video compression algorithms that enable fast transmission of videos without loss in Quality-of-Experience. While current video codecs can encode video at low bitrates, this usually results in unpleasant compression artifacts [34, 22]. Learned video compression algorithms based on deep neural networks, as explored in recent work [26, 23, 20, 11, 6, 5, 8], produce promising results in reducing these perceptual artifacts. However, due to the use of distortion metrics such as MS-SSIM [30] and MSE, the reconstructions tend to be blurry [3].

Figure 1: Demonstration of the effectiveness of training with adversarial loss. (a) Uncompressed frame; learned compression [11] with (b) MS-SSIM distortion and (c) adversarial distortion, at similar bitrates. [See Fig. 5 caption for license information]

Generative Adversarial Networks (GANs) have been shown to be capable of producing highly realistic images and videos from random noise inputs [16, 15, 35, 4, 7]. This suggests that the GAN objective reflects image/video quality as perceived by humans more accurately. Indeed, [1] has shown that GANs can be used for low-rate, high-quality image compression by augmenting the rate/distortion loss with an adversarial loss. So far, however, there is little work on applying adversarial losses to video compression, due to scaling issues.

We tackle the scaling issue by factorizing our adversarial discriminator into smaller neural network components, and we show results that demonstrate our compression system’s improved perceptual quality even under extreme compression (see Fig. 1). Our model is based on the one proposed by [11], which is a 3D autoencoder with discrete latents and a PixelCNN++ [28] prior, trained end-to-end to optimize a rate/distortion loss. Our contributions should be applicable to other learned video compression systems in general. We also present an ablation study resulting from our search across various formulations of GANs, in terms of their architecture and loss functions, within the context of lossy video compression. The contributions of this paper are:

i) we propose an adversarial loss to improve the perceptual quality of learned video compression, ii) we study techniques to improve training stability when using an adversarial loss, iii) we study a spatial-temporal factorization of the discriminator to enable end-to-end training of deep video compression networks.

2 Learned Video Compression

Figure 2: Lossy video compression with adversarial distortion: (a) learned video compression component, (b) adversarial distortion components, where Disc_s and Disc_t represent the spatial and the spatio-temporal discriminators.

[Frames by Ambrose Productions CC BY-SA 3.0, via YouTube]

A learned video compression system typically consists of an encoder, a decoder, an entropy estimator, and a distortion estimator. All the components are trained end-to-end on a collection of videos.

The encoder maps an input sequence into a latent representation. It is a stack of convolutional layers with several down-samplings that reduce the input dimension. In its last layer, the encoder applies a quantization function to the activations to reduce the bit-width of the latents. The quantizer maps each element, or group of elements, of the activations to a discrete symbol. Because quantization is non-differentiable, learning a discrete representation requires an approximation such as adding uniform noise or soft assignment. The decoder, which is a stack of convolutional layers with several up-samplings, reconstructs the video given the discrete latents.
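As a minimal sketch of the uniform-noise approximation mentioned above (one of the two options named in the text, not necessarily the one used in this model), the snippet below contrasts hard rounding with its noisy training-time surrogate. The function names and the toy activation values are illustrative assumptions.

```python
import random

def quantize_hard(activations):
    """Hard quantizer: round each activation to the nearest integer symbol."""
    return [round(a) for a in activations]

def quantize_noisy(activations, rng):
    """Differentiable surrogate: add uniform noise of at most half a bin instead of rounding."""
    return [a + rng.uniform(-0.5, 0.5) for a in activations]

# Toy activations (illustrative values only).
activations = [0.2, 1.7, -2.4, 3.1]
rng = random.Random(0)

hard = quantize_hard(activations)        # discrete symbols used for entropy coding
noisy = quantize_noisy(activations, rng)  # continuous stand-in used while training
```

The noisy values stay within half a quantization bin of the input, so the decoder sees perturbations of the same magnitude as rounding error while gradients can still flow through the addition.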

The entropy estimator predicts the average number of bits needed to encode the latents using a lossless entropy coding scheme such as Huffman or arithmetic coding. The bit rate is measured as the cross-entropy between the true distribution of the latents and the density estimated by the entropy estimator, as in Eq. 1. The density estimator is parameterized as a neural network, usually with an auto-regressive architecture, e.g. PixelCNN++ [28].
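To make the cross-entropy view of the bit rate concrete, the following sketch sums the ideal code length, minus log2 of the model probability, over a short sequence of symbols. The four-symbol alphabet, the probability table, and the function name are hypothetical; a real entropy estimator would predict these probabilities with a neural network.

```python
import math

def rate_in_bits(symbols, model_probs):
    """Ideal code length: sum of -log2 q(s) over the symbols under model q."""
    return sum(-math.log2(model_probs[s]) for s in symbols)

# Hypothetical model probabilities over a 4-symbol alphabet (illustrative only).
model_probs = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
latents = [0, 0, 1, 2, 0, 3, 1, 0]

total_bits = rate_in_bits(latents, model_probs)
bits_per_symbol = total_bits / len(latents)
```

In this toy case the symbol frequencies match the model exactly, so the cross-entropy equals the entropy of the sequence (1.75 bits per symbol); any mismatch between the model and the true distribution would increase the rate.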


The distortion loss measures the difference between the input and reconstructed videos, usually by pixel-wise distances such as L1 and L2, or by more sophisticated metrics such as MS-SSIM. All the aforementioned components are trained end-to-end by minimizing a rate-distortion trade-off as the loss function:
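The rate term (Eq. 1) and the rate-distortion objective (Eq. 2) referred to in this section can be sketched as follows. The notation is an assumption introduced here for illustration: $x$ is the input video, $z$ the quantized latents produced by the encoder, $\hat{x}$ the decoder output, $p_\theta$ the learned density over latents, $d$ a distortion measure, and $\beta$ the trade-off weight. Which term carries the weight is a convention and may differ from the authors' formulation.

```latex
\begin{align}
R &= \mathbb{E}_{x}\left[ -\log_2 p_\theta(z) \right], \qquad z = \mathrm{Enc}(x) \\
\mathcal{L} &= D + \beta R, \qquad D = \mathbb{E}_{x}\left[ d(x, \hat{x}) \right], \quad \hat{x} = \mathrm{Dec}(z)
\end{align}
```

Training with several values of $\beta$ then yields models at different operating points on the rate-distortion curve.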


3 Adversarial Distortion Loss

Distortion measured by pixel-wise metrics, e.g. L1, L2, and MS-SSIM, is often not well aligned with perceptual quality. Recent work [3] proves mathematically that distortion and perceptual quality are at odds with each other, and that minimizing the mean distortion leads to a decrease in perceptual quality. Instead of relying solely on pixel-wise distances, we define the distortion as an adversarial loss between the decoder and a discriminator. This setting can be interpreted as training a conditional GAN, where the decoder learns to generate a video given the encoded latents. The discriminator encourages the decoder to generate videos that reside on the data manifold, which improves perceptual quality.

Table 1: Component functions for a few adversarial losses: Minimax [10], Wasserstein [2], and Least Squares [21].

3.1 Stable adversarial training

Adversarial loss The adversarial loss can be defined in various formulations, depending on how its component functions are specified:


Table 1 specifies the component functions for several widely used GAN formulations. We investigate the impact of each formulation on training stability, as they have different loss landscapes and gradient behavior; finding the best formulation is hence non-trivial. Minimax loss [10] and Wasserstein loss [2] resulted in reasonable improvements in reconstruction quality, but we found training to be unstable and time consuming. We also experimented with the Least Squares [21] and Relativistic [14] formulations, both of which resulted in stable adversarial training. Of these two, our best formulation was the Least Squares loss, which generated higher-quality videos (see Section 4.2).
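As an illustration of the Least Squares formulation selected above, the sketch below uses the common target choice from [21] (real inputs scored towards 1, generated inputs towards 0). The function names and the example scores are assumptions; in the compression setting the "generator" is the decoder, and the scores would come from a discriminator network.

```python
def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares loss: push scores on real inputs to 1 and on fakes to 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return 0.5 * (real_term + fake_term)

def lsgan_generator_loss(fake_scores):
    """Least-squares loss for the generator (here, the decoder): push fake scores to 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)

# Illustrative discriminator outputs for three real and three reconstructed inputs.
real_scores = [0.9, 1.1, 1.0]
fake_scores = [0.2, -0.1, 0.4]

d_loss = lsgan_discriminator_loss(real_scores, fake_scores)
g_loss = lsgan_generator_loss(fake_scores)
```

Because the penalty is quadratic in the score rather than a saturating log-likelihood, reconstructions that are far from the decision boundary still receive a useful gradient, which is one commonly cited reason for the more stable training.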

Perceptual loss As a way of further stabilizing our model’s adversarial training, we incorporated a semantic loss [19, 29] that minimizes the distance between framewise VGG-19 representations of the input and reconstructed videos. This semantic loss resulted in faster and more stable training of our adversarial video compression model.

3.2 Factorized spatial-temporal discriminators

Recent work on training GANs to generate videos points out the advantage of scaling up training, i.e., using larger batch sizes, deeper models, etc. [7]. However, due to scalability issues relating to video data and the model’s size, we had difficulty jointly training all components. We had two options for scaling up training: (i) fine-tune the decoder using an adversarial distortion while keeping the prior and encoder fixed, hence loading only the adversarial distortion components in memory, or (ii) factorize our model into smaller components that enable large-scale training. Comparing the compression performance of these two options, we observed that the latter produced higher-quality reconstructions at the same bit-rates. To scale up joint adversarial training for our complete model, we therefore factorized the discriminator into two smaller discriminators, one spatial and one spatio-temporal. Both were formulated as LSGAN discriminators. The average of the losses from these two discriminators was used to train the decoder.
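The averaging of the two discriminator losses described above can be sketched as follows. This is a toy illustration under stated assumptions: the two discriminators are stand-in functions rather than neural networks, the spatial one is assumed to score each frame independently, and the spatio-temporal one is assumed to score the clip as a whole.

```python
def lsgan_generator_term(scores):
    """Least-squares generator term: mean of (score - 1)^2 over a list of scores."""
    return sum((s - 1.0) ** 2 for s in scores) / len(scores)

def factorized_adversarial_loss(clip, disc_spatial, disc_temporal):
    """Average the decoder's loss under a per-frame and a per-clip discriminator."""
    frame_scores = [disc_spatial(frame) for frame in clip]  # one score per frame
    clip_scores = [disc_temporal(clip)]                     # one score per clip
    loss_s = lsgan_generator_term(frame_scores)
    loss_t = lsgan_generator_term(clip_scores)
    return 0.5 * (loss_s + loss_t)

# Stand-in discriminators (hypothetical): real ones would be 2D and 3D networks.
def toy_disc_spatial(frame):
    return sum(frame) / len(frame)  # mean pixel value of one frame

def toy_disc_temporal(clip):
    diffs = [abs(a - b) for f0, f1 in zip(clip, clip[1:]) for a, b in zip(f0, f1)]
    return 1.0 - sum(diffs) / len(diffs)  # higher score for temporally smooth clips

clip = [[1.0, 1.0], [0.5, 0.5], [0.5, 0.5]]  # 3 frames of 2 "pixels" each
loss = factorized_adversarial_loss(clip, toy_disc_spatial, toy_disc_temporal)
```

Splitting the work this way means neither discriminator has to process full-resolution video in both space and time, which is what makes joint end-to-end training fit in memory.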

Putting it all together, we train our model end-to-end by optimizing the rate-distortion trade-off of Eq. 2 using the following distortion loss:


The perceptual term uses VGG-19 features taken from the convolution before the 5th max-pooling layer of an ImageNet-trained VGG-19 network. These design choices are summarized in our architecture in Fig. 2.
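One plausible form of the combined distortion loss (referred to later as Eq. 5) is sketched below. It is assembled from the three ingredients described in this section: the averaged LSGAN losses of the two discriminators, a pixel-wise term, and the VGG-19 feature term. The weight symbols $\lambda_{\mathrm{adv}}$, $\lambda_{\mathrm{pix}}$, $\lambda_{\mathrm{vgg}}$, the feature map symbol $\phi$, and the specific norms are assumptions made for illustration and may differ from the authors' exact equation.

```latex
\begin{equation}
D = \lambda_{\mathrm{adv}} \cdot \tfrac{1}{2}\left( \mathcal{L}_{\mathrm{Disc}_s} + \mathcal{L}_{\mathrm{Disc}_t} \right)
  + \lambda_{\mathrm{pix}} \, \lVert x - \hat{x} \rVert_2^2
  + \lambda_{\mathrm{vgg}} \, \lVert \phi(x) - \phi(\hat{x}) \rVert
\end{equation}
```

Here $x$ and $\hat{x}$ denote the input and reconstructed videos, and $\mathcal{L}_{\mathrm{Disc}_s}$, $\mathcal{L}_{\mathrm{Disc}_t}$ are the decoder's least-squares losses under the spatial and spatio-temporal discriminators.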

4 Experiments

Figure 3: Comparison of RDAE and our method on Kinetics validation set.

4.1 Experimental setup

Network architecture We demonstrate the impact of our adversarial training on the Rate-Distortion Autoencoder (RDAE) model [11], which achieves state-of-the-art video compression performance compared to other learned compression methods [20, 32] in terms of MS-SSIM at various bit-rates. Further details on the architecture of RDAE’s encoder, decoder, entropy estimator and quantizer can be found in [11]. We employed a 2D ResNet-34 [13] (trained on ImageNet) and a 3D ResNet [12] (trained from scratch) as our spatial and spatio-temporal discriminators, respectively. To save memory, we spatially downsized the spatio-temporal discriminator’s input by half; we did not temporally subsample the spatial discriminator’s input.

Dataset We created a dataset sourced from Kinetics400 [17] by selecting the first 16 frames from each high-quality video and downsampling them to alleviate existing compression artifacts, resulting in a total of 93750 videos for training and 5687 videos for validation. We used random crops for training and full-size frames for validation. We used UVG [27] as the test dataset for comparisons with other methods.

Figure 4: LPIPS comparisons of H.264, H.265, RDAE, and our method on UVG.

Implementation details We pretrained our video compression model using the rate loss and an MSE distortion loss to speed up adversarial training; this step provided a good initialization for adversarially training the decoder weights. Our hyperparameter choices for optimizing Eq. 5 are: , , and . We used a batch size of 37, for a total of 12 epochs. We trained our network with 4 values of the rate-distortion trade-off weight to obtain a rate-distortion curve. All models were trained using the Adam optimizer [18] with learning rate of , and .

(a) Uncompressed frame
(b) H.265, 0.0294 bpp (c) RDAE, 0.0339 bpp (d) Ours, 0.0309 bpp
Figure 5: Visual comparison of H.265, RDAE, and our method at comparable bitrates. (a) shows an uncompressed frame (frame 49 of Netflix Tango in Netflix El Fuente [33]); (b), (c), and (d) show close-ups of the reconstructed frame from H.265, RDAE, and our method, respectively. Our method is free of the compression artifacts present in H.265 and the oversmoothing present in RDAE at low bit-rates.

[Frame 49 of Tango video from Netflix Tango in Netflix El Fuente, produced by Netflix, with CC BY-NC-ND 4.0 license:]

4.2 Results

Comparison to state of the art In this section, we compare the performance of our method with the learned video compression method RDAE [11] as well as two non-learned codecs, H.265 [25] and H.264 [31]. Fig. 3 shows the comparison of our method with RDAE in terms of Inception Score [24] (IS, higher is better) on the Kinetics validation set. Incorporating adversarial training improves IS, and hence the perceptual salience of the decoded videos, considerably.

Due to the small number of videos in UVG, we are unable to compute IS accurately on this dataset; for this reason we report a framewise Learned Perceptual Image Patch Similarity [36] (LPIPS, lower is better) score on UVG. Fig. 4 shows the comparison of our method with RDAE, H.264 and H.265 in terms of LPIPS on the UVG dataset. We used the ffmpeg [9] implementations of H.265 and H.264. From this analysis, we observe that adversarial training improves the perceptual quality of RDAE, resulting in a lower LPIPS score. Although our model still underperforms H.265, it substantially narrows the gap between learned methods and H.265 in terms of LPIPS. In Fig. 5 we compare the visual quality of our lowest-rate model with H.265 and RDAE at a similar rate. Our result is relatively free from the compression artifacts usually present in H.265 and RDAE at low bitrates.
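The bitrates compared here and in Fig. 5 are given in bits per pixel (bpp). As a small worked example of that unit, the helper below divides the total number of coded bits by the number of pixels in the clip. The function name and the example file size and resolution are hypothetical and are not taken from the experiments.

```python
def bits_per_pixel(num_bytes, num_frames, height, width):
    """Bitrate normalised by the total number of pixels in the clip."""
    return (num_bytes * 8) / (num_frames * height * width)

# Hypothetical example: a 16-frame 1920x1080 clip stored in 124,416 bytes.
bpp = bits_per_pixel(num_bytes=124_416, num_frames=16, height=1080, width=1920)
```

With these numbers the clip costs 0.03 bpp, the same order of magnitude as the operating points shown in Fig. 5.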

Ablation study We trained video compression models separately on each of the 4 GAN loss formulations mentioned in Section 3, along with a stabilizing pixel-wise L1- or L2-norm of the distance between the input and reconstructed videos (8 GAN models in total; WGAN not reported due to unstable training). We report a summary of these experiments in Table 2, and we decided to use the best-performing configuration, LSGAN with pixel-wise L2 loss, for our video compression model.

GAN Type      Pixel Loss   MS-SSIM   PSNR (dB)
DCGAN [10]    L1           0.957     26.655
DCGAN         L2           0.958     26.862
RaGAN [14]    L1           0.957     26.554
RaGAN         L2           0.957     26.625
LSGAN [21]    L1           0.96      26.905
LSGAN         L2           0.961     27.032
Table 2: Ablation study on different GAN losses: PSNR and MS-SSIM comparisons on the Kinetics validation set.
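For readers less familiar with the PSNR column, the sketch below computes PSNR in dB from the mean squared error between a reference and a reconstruction. It assumes 8-bit pixel values (peak 255) and flat lists of pixels; the example values are illustrative and unrelated to the numbers in Table 2.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

# Illustrative 8-bit pixel values.
reference = [52, 55, 61, 59]
reconstruction = [50, 57, 60, 62]
value_db = psnr(reference, reconstruction)
```

Since PSNR is a monotone function of MSE, the small PSNR differences between rows of Table 2 correspond to small differences in pixel-wise error, which is consistent with the paper's point that such metrics only partly reflect perceptual quality.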

5 Conclusion

In this paper, we have presented a new deep adversarial lossy video compression algorithm that outperforms state-of-the-art learned video compression systems in terms of visual quality. By employing adversarial training for the decoder, we demonstrate a reduction in the perceptual artifacts typically present in the reconstructed output, especially at very low bit-rates. We have also presented an ablation study of the design choices that resulted in our final adversarial compression model.


  • [1] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool (2019) Generative adversarial networks for extreme learned image compression. In ICCV, Cited by: §1.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv. Cited by: §3.1, Table 1.
  • [3] Y. Blau and T. Michaeli (2018) The perception-distortion tradeoff. In CVPR, Cited by: §1, §3.
  • [4] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv. Cited by: §1.
  • [5] T. Chen, H. Liu, Q. Shen, T. Yue, X. Cao, and Z. Ma (2017) Deepcoder: a deep neural network based video compression. In VCIP, Cited by: §1.
  • [6] Z. Chen, T. He, X. Jin, and F. Wu (2019) Learning for video compression. IEEE Trans. on Circuits and Systems for Video Technology. Cited by: §1.
  • [7] A. Clark, J. Donahue, and K. Simonyan (2019) Efficient video generation on complex datasets. arXiv. Cited by: §1, §3.2.
  • [8] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers (2019) Neural inter-frame compression for video coding. In ICCV, Cited by: §1.
  • [9] Ffmpeg. Note: 2020-03-03 Cited by: §4.2.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §3.1, Table 1, Table 2.
  • [11] A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen (2019) Video compression with rate-distortion autoencoders. In ICCV, Cited by: Figure 1, §1, §1, §4.1, §4.2.
  • [12] K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In CVPR, Cited by: §4.1.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: §4.1.
  • [14] A. Jolicoeur-Martineau (2018) The relativistic discriminator: a key element missing from standard gan. arXiv. Cited by: §3.1, Table 2.
  • [15] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv. Cited by: §1.
  • [16] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, Cited by: §1.
  • [17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman (2017) The Kinetics Human Action Video Dataset. arXiv. Cited by: §4.1.
  • [18] D. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In ICLR, Cited by: §4.1.
  • [19] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, Cited by: §3.1.
  • [20] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao (2019) Dvc: an end-to-end deep video compression framework. In CVPR, Cited by: §1, §4.1.
  • [21] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In ICCV, Cited by: §3.1, Table 1, Table 2.
  • [22] R. Pourreza, A. Ghodrati, and A. Habibian (2019) Recognizing compressed videos: challenges and promises. In ICCV Workshops, Cited by: §1.
  • [23] O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, and L. Bourdev (2019) Learned video compression. In ICCV, Cited by: §1.
  • [24] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen (2016) Improved techniques for training gans. In NeurIPS, Cited by: §4.2.
  • [25] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand (2012) Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans. on Circuits and Systems for Video Technology. Cited by: §4.2.
  • [26] M. Tschannen, E. Agustsson, and M. Lucic (2018) Deep generative models for distribution-preserving lossy compression. In NeurIPS, Cited by: §1.
  • [27] Ultra Video Group test sequences. Note: 2020-03-03 Cited by: §4.1.
  • [28] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. In NeurIPS, Cited by: §1, §2.
  • [29] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy (2018) Esrgan: enhanced super-resolution generative adversarial networks. In ECCV, Cited by: §3.1.
  • [30] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. on Image Processing. Cited by: §1.
  • [31] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra (2003) Overview of the h.264/avc video coding standard. IEEE Trans. on Circuits and Systems for Video Technology. Cited by: §4.2.
  • [32] C. Wu, N. Singhal, and P. Krahenbuhl (2018) Video compression through image interpolation. In ECCV, Cited by: §4.1.
  • [33] video test media [derf’s collection]. Note: 2020-03-03 Cited by: Figure 5.
  • [34] K. Zeng, T. Zhao, A. Rehman, and Z. Wang (2014) Characterizing perceptual artifacts in compressed video streams. In Human Vision and Electronic Imaging XIX, Cited by: §1.
  • [35] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv. Cited by: §1.
  • [36] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. arXiv. Cited by: §4.2.