As the resolution of digitally recorded and streamed videos keeps growing, there is an increasing demand for video compression algorithms that enable fast transmission without loss in quality of experience. While current video codecs can encode video at low bitrates, this usually results in unpleasant compression artifacts [34, 22]. Learned video compression algorithms based on deep neural networks, explored in recent work [26, 23, 20, 11, 6, 5, 8], show promise at removing these perceptual artifacts. However, because they are trained with distortion metrics such as MS-SSIM and MSE, their reconstructions tend to be blurry.
Generative Adversarial Networks (GANs) have been shown to be capable of producing highly realistic images and videos from random noise inputs [16, 15, 35, 4, 7]. This suggests that the GAN objective more accurately reflects image and video quality as perceived by humans. Indeed, prior work has shown that GANs can be used for low-rate, high-quality image compression by augmenting the rate-distortion loss with an adversarial loss. So far, however, there is little work applying adversarial losses to video compression, largely due to scaling issues.
We tackle the scaling issue by factorizing the adversarial discriminator into smaller neural network components, and we show that the resulting compression system achieves markedly better perceptual quality, even under extreme compression (see Fig. 1). Our model builds on a previously proposed 3D autoencoder with discrete latents and a PixelCNN++ prior, trained end-to-end to optimize a rate-distortion loss. Our contributions are applicable to learned video compression systems in general. We also present an ablation study covering several GAN formulations, both architectures and loss functions, in the context of lossy video compression. The contributions of this paper are: i) we propose an adversarial loss to improve the perceptual quality of learned video compression, ii) we study techniques to stabilize training under this adversarial loss, and iii) we study a spatio-temporal factorization of the discriminator that enables end-to-end training of deep video compression networks.
2 Learned Video Compression
A learned video compression system typically consists of an encoder, a decoder, an entropy estimator, and a distortion loss. All components are trained end-to-end on a collection of videos.
The encoder maps an input sequence into a latent representation. It is a stack of convolutional layers with several downsamplings that reduce the input dimension. In its last layer, the encoder applies a quantization function to the activations to reduce the bit-width of the latents: the quantizer maps each element, or group of elements, of the activations to a discrete symbol. Because quantization is non-differentiable, learning a discrete representation requires an approximation such as additive uniform noise or soft assignment. The decoder, a stack of convolutional layers with several upsamplings, reconstructs the video from the discrete latents.
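The noise-based training surrogate for quantization can be sketched as follows (a minimal NumPy illustration; the bin width and rounding convention are common choices, not the paper's exact quantizer):

```python
import numpy as np

def quantize_train(z, rng=None):
    """Training-time surrogate: additive uniform noise in [-0.5, 0.5)
    approximates rounding while keeping gradients well defined."""
    rng = rng or np.random.default_rng(0)
    return z + rng.uniform(-0.5, 0.5, size=z.shape)

def quantize_eval(z):
    """Inference: hard rounding of each latent to the nearest symbol."""
    return np.round(z)
```

At test time the hard rounding produces the discrete symbols that are actually entropy-coded; the noisy version is only used to make training differentiable.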
The entropy estimator predicts the average number of bits needed to encode the latents with a lossless entropy coding scheme such as Huffman or arithmetic coding. The bit rate is measured as the cross-entropy between the true distribution of the latents and a learned density model (Eq. 1). The density estimator is parameterized as a neural network, usually with an auto-regressive architecture such as PixelCNN++.
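The cross-entropy rate estimate can be sketched in a few lines (pure Python; the per-symbol probability table stands in for the learned PixelCNN++ density):

```python
import math

def expected_bits(symbols, model_prob):
    """Bits a lossless entropy coder would need under the model:
    the cross-entropy -sum_i log2 p_model(s_i)."""
    return -sum(math.log2(model_prob[s]) for s in symbols)
```

For example, coding the symbols [0, 0, 1, 2] under the model p = {0: 0.5, 1: 0.25, 2: 0.25} costs 1 + 1 + 2 + 2 = 6 bits; a mismatched density inflates the rate, which is exactly what training the density estimator minimizes.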
The distortion loss measures the difference between the input and reconstructed videos, usually via pixel-wise distances such as the ℓ1 and ℓ2 norms, or via more sophisticated metrics such as MS-SSIM. All the aforementioned components are trained end-to-end by minimizing a rate-distortion trade-off as the loss function:
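The loss equation did not survive extraction; it presumably takes the standard rate-distortion Lagrangian form (our reconstruction, with placeholder symbols rather than the paper's exact notation):

```latex
\mathcal{L} \;=\; \underbrace{\mathbb{E}\big[d(x, \hat{x})\big]}_{\text{distortion}}
\;+\; \beta \,\underbrace{\mathbb{E}\big[-\log_2 p_\theta(z)\big]}_{\text{rate}}
```

Here the trade-off weight balances reconstruction fidelity against the estimated bitstream length.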
3 Adversarial Distortion Loss
Distortion, as measured by pixel-wise metrics such as the ℓ1 and ℓ2 distances and MS-SSIM, is often not perfectly aligned with perceptual quality. Recent work mathematically proves that distortion and perceptual quality are at odds with each other, and that minimizing the mean distortion leads to a decrease in perceptual quality. Instead of relying solely on pixel-wise distances, we define the distortion as an adversarial loss between the decoder and a discriminator. This setting can be interpreted as training a conditional GAN in which the decoder learns to generate a video given the encoded latents. The discriminator encourages the decoder to generate videos that reside on the data manifold, which improves perceptual quality.
3.1 Stable adversarial training
Adversarial loss The adversarial loss can be written in various formulations depending on the choice of its component functions:
Table 1 specifies the component functions for several widely used GAN formulations. We investigate the impact of each formulation on training stability, since they have different loss landscapes and gradient behavior; finding the best formulation is hence non-trivial. The minimax loss and the Wasserstein loss yielded fairly decent improvements in reconstruction quality, but we found their training unstable and time-consuming. We also experimented with the Least Squares and Relativistic formulations, both of which trained stably. Of these two, the Least Squares loss generated higher-quality videos and was our best formulation (see Section 4.2).
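As an illustration, the least-squares (LSGAN) formulation can be written as follows (a NumPy sketch; the 1/0 real/fake targets are the common LSGAN convention and an assumption here):

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Discriminator: push scores on real videos toward 1 and on
    reconstructions toward 0 via squared error."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Decoder/generator: push discriminator scores on
    reconstructions toward 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

The squared-error targets give smooth, non-saturating gradients, which is one common explanation for the more stable training we observed with this formulation.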
Perceptual loss To further stabilize adversarial training, we incorporate a semantic loss [19, 29] that minimizes the norm of the difference between framewise VGG-19 representations of the input and the reconstruction. This semantic loss resulted in faster and more stable training of our adversarial video compression model.
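Schematically, with `extract_features` standing in for frozen framewise VGG-19 activations (a hypothetical callable, not the paper's code), the semantic loss is:

```python
import numpy as np

def perceptual_loss(x_frames, y_frames, extract_features):
    """Mean distance between framewise deep-feature maps of the
    input and the reconstruction (an l1 distance is assumed here)."""
    dists = [np.abs(extract_features(fx) - extract_features(fy)).mean()
             for fx, fy in zip(x_frames, y_frames)]
    return float(np.mean(dists))
```

Because deep features vary slowly under small pixel perturbations, this term gives the decoder a smoother training signal than the raw adversarial loss alone.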
3.2 Factorized spatio-temporal discriminators
Recent work on training GANs to generate videos points out the advantage of scaling up training, i.e., using larger batch sizes, deeper models, etc. However, the size of video data and of the model made jointly training all components difficult. Our two options for scaling up training were: (i) fine-tune the decoder with the adversarial distortion loss while keeping the prior and encoder fixed, so that only the adversarial components need to be held in memory, or (ii) factorize the model into smaller components that enable large-scale joint training. Comparing the compression performance of the two, we observed that the latter produced higher-quality reconstructions at the same bit-rates. To scale up joint adversarial training of the complete model, we therefore factorized the discriminator into two smaller discriminators, one spatial and one spatio-temporal, both formulated as LSGAN discriminators. The average of their losses is used to train the decoder.
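The factorization amounts to averaging two smaller generator losses, one per discriminator (a sketch; `g_loss` is whichever GAN generator loss is in use, LSGAN-style in our case):

```python
def factorized_g_loss(spatial_scores, temporal_scores, g_loss):
    """Average the adversarial losses from the 2D spatial and the
    3D spatio-temporal discriminators to train the decoder."""
    return 0.5 * (g_loss(spatial_scores) + g_loss(temporal_scores))
```

Each discriminator is smaller than a single full spatio-temporal one, so both can be kept in memory during joint end-to-end training.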
Putting it all together, we train our model end-to-end by optimizing the rate-distortion trade-off of Eq. 2 with the following distortion loss:
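The distortion-loss equation itself is missing from this copy; based on the components described above, it plausibly combines a pixel-wise term, the VGG perceptual term, and the averaged adversarial terms (our reconstruction, with placeholder weights):

```latex
\mathcal{L}_{\text{dist}} \;=\; \lambda_{\text{pix}}\, d_{\text{pix}}(x, \hat{x})
\;+\; \lambda_{\text{perc}}\, d_{\text{VGG}}(x, \hat{x})
\;+\; \lambda_{\text{adv}} \cdot \tfrac{1}{2}\big(\mathcal{L}_G^{\text{spatial}} + \mathcal{L}_G^{\text{temporal}}\big)
```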
4.1 Experimental setup
Network architecture We demonstrate the impact of our adversarial training on the Rate-Distortion Autoencoder (RDAE) model, which achieves state-of-the-art video compression performance in terms of MS-SSIM at various bit-rates compared to other learned compression methods [20, 32]. Further details on the architecture of RDAE's encoder, decoder, entropy estimator, and quantizer can be found in the original paper. We employ a 2D ResNet-34 (pretrained on ImageNet) and a 3D ResNet (trained from scratch) as our spatial and spatio-temporal discriminators, respectively. We spatially downsize the spatio-temporal discriminator's input by half to save memory; we do not, however, temporally subsample the spatial discriminator's input.
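The asymmetric subsampling of the two discriminator inputs can be sketched as follows (NumPy; strided nearest-neighbor subsampling stands in for whatever resampling the authors actually use):

```python
import numpy as np

def discriminator_inputs(video):
    """video: array of shape (T, H, W, C).
    The 2D spatial discriminator sees every frame at full resolution;
    the 3D spatio-temporal one sees frames spatially halved to save memory."""
    spatial_in = video                   # full-resolution frames
    temporal_in = video[:, ::2, ::2, :]  # 2x spatial subsample, all frames kept
    return spatial_in, temporal_in
```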
Dataset We created a dataset from Kinetics400 by selecting the first 16 frames of each high-quality video and downsampling them to alleviate existing compression artifacts, yielding 93,750 training videos and 5,687 validation videos. We used random crops for training and full-size frames for validation. For comparisons with other methods, we used UVG as the test dataset.
We pretrained our video compression model using the rate loss and an MSE distortion loss to speed up adversarial training; this step provided a good initialization for the adversarially trained decoder weights. We used a batch size of 37 for a total of 12 epochs, and trained the network with four values of the rate-distortion trade-off weight in Eq. 5 to obtain a rate-distortion curve. All models were trained using the Adam optimizer.
[Figure: (a) uncompressed frame; (b) H.265, 0.0294 bpp; (c) RDAE, 0.0339 bpp; (d) ours, 0.0309 bpp. Frame 49 of the Tango video from Netflix El Fuente, produced by Netflix, CC BY-NC-ND 4.0 license: https://media.xiph.org/video/derf/ElFuente/Netflix_Tango_Copyright.txt]
Comparison to state of the art In this section we compare our method with a learned video compression method, RDAE, as well as with two non-learned codecs, H.265 and H.264. Fig. 3 compares our method with RDAE in terms of Inception Score (IS) on the Kinetics validation set. Incorporating adversarial training substantially improves IS, and hence the perceptual quality of the decoded videos.
Due to the small number of videos in UVG, we cannot accurately compute IS on this dataset; we therefore report a framewise Learned Perceptual Image Patch Similarity (LPIPS) score on UVG. Fig. 4 compares our method with RDAE, H.264, and H.265 in terms of LPIPS on the UVG dataset, using the FFmpeg implementations of H.265 and H.264. We observe that adversarial training improves the perceptual quality of RDAE, resulting in a lower LPIPS score. Although our model still underperforms H.265, it substantially closes the LPIPS gap between learned methods and H.265. In Fig. 5 we compare the visual quality of our lowest-rate model with H.265 and RDAE at similar rates: our result is largely free of the compression artifacts that H.265 and RDAE exhibit at low bitrates.
Ablation study We trained video compression models separately on each of the four GAN loss formulations mentioned in Section 3, each paired with a stabilizing pixel-wise ℓ1- or ℓ2-norm distance between the input and the reconstruction (8 GAN models in total; WGAN is not reported due to unstable training). Table 2 summarizes these experiments; based on them, we chose the best-performing combination, LSGAN with a pixel-wise loss, for our video compression model.
In this paper we have presented a new adversarial lossy video compression algorithm that outperforms state-of-the-art learned video compression systems in terms of visual quality. By training the decoder adversarially, we demonstrate a reduction in the perceptual artifacts, especially at very low bit-rates, typically present in reconstructed output. We have also presented an ablation study of the design choices that led to our final adversarial compression model.
- (2019) Generative adversarial networks for extreme learned image compression. In ICCV.
- (2017) Wasserstein GAN. arXiv.
- (2018) The perception-distortion tradeoff. In CVPR.
- (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv.
- (2017) DeepCoder: a deep neural network based video compression. In VCIP.
- (2019) Learning for video compression. IEEE Trans. on Circuits and Systems for Video Technology.
- (2019) Efficient video generation on complex datasets. arXiv.
- (2019) Neural inter-frame compression for video coding. In ICCV.
- FFmpeg. http://ffmpeg.org/ (accessed 2020-03-03).
- (2014) Generative adversarial nets. In NeurIPS.
- (2019) Video compression with rate-distortion autoencoders. In ICCV.
- (2018) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR.
- (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
- (2018) The relativistic discriminator: a key element missing from standard GAN. arXiv.
- (2017) Progressive growing of GANs for improved quality, stability, and variation. arXiv.
- (2019) A style-based generator architecture for generative adversarial networks. In CVPR.
- (2017) The Kinetics human action video dataset. arXiv.
- (2015) Adam: a method for stochastic optimization. In ICLR.
- (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
- (2019) DVC: an end-to-end deep video compression framework. In CVPR.
- (2017) Least squares generative adversarial networks. In ICCV.
- (2019) Recognizing compressed videos: challenges and promises. In ICCV Workshops.
- (2019) Learned video compression. In ICCV.
- (2016) Improved techniques for training GANs. In NeurIPS.
- (2012) Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Trans. on Circuits and Systems for Video Technology.
- (2018) Deep generative models for distribution-preserving lossy compression. In NeurIPS.
- Ultra Video Group test sequences. http://ultravideo.cs.tut.fi/ (accessed 2020-03-03).
- (2016) Conditional image generation with PixelCNN decoders. In NeurIPS.
- (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In ECCV.
- (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. on Image Processing.
- (2003) Overview of the H.264/AVC video coding standard. IEEE Trans. on Circuits and Systems for Video Technology.
- (2018) Video compression through image interpolation. In ECCV.
- Xiph.org video test media [Derf's collection]. https://media.xiph.org/video/derf/ (accessed 2020-03-03).
- (2014) Characterizing perceptual artifacts in compressed video streams. In Human Vision and Electronic Imaging XIX.
- (2018) Self-attention generative adversarial networks. arXiv.
- (2018) The unreasonable effectiveness of deep features as a perceptual metric. arXiv.