DVC-P: Deep Video Compression with Perceptual Optimizations

09/22/2021 ∙ by Saiping Zhang, et al. ∙ 5

Recent years have witnessed significant development of learning-based video compression methods, which aim at optimizing objective or perceptual quality together with bit rate. In this paper, we introduce deep video compression with perceptual optimizations (DVC-P), which aims at increasing the perceptual quality of decoded videos. Our proposed DVC-P is based on the Deep Video Compression (DVC) network, but improves it with perceptual optimizations. Specifically, a discriminator network and a mixed loss are employed to help our network trade off among distortion, perception and rate. Furthermore, nearest-neighbor interpolation is used to eliminate checkerboard artifacts which can appear in sequences encoded with DVC frameworks. Thanks to these two improvements, the perceptual quality of decoded sequences is improved. Experimental results demonstrate that, compared with the baseline DVC, our proposed method generates videos with higher perceptual quality, achieving a 12.27% reduction in a perceptual BD-rate equivalent, on average.




I Introduction

Video services such as ultra-high-definition and panoramic video pose great challenges to video compression methods. Over the past decades, traditional video coding standards, from H.264/AVC [1] to H.266/VVC [2], have made tremendous progress in saving bit rate and enhancing the quality of decoded videos. These achievements rely mainly on carefully designed modules within the block-based hybrid coding framework. Recently, approaches based on Deep Neural Networks (DNNs) have adopted a different strategy: DNN-based video compression methods focus on end-to-end optimization instead of carefully designing individual modules of the compression framework. Although still in a research phase, this strategy has the potential to provide better compression and to revolutionize the video compression field.

In the past few years, a number of deep network designs for video compression have been proposed, achieving promising results in terms of the trade-off between rate and objective distortion (e.g. peak signal-to-noise ratio, PSNR). Lu et al. [3] first designed a deep end-to-end video compression (DVC) model that established a one-to-one correspondence between the modules of the conventional hybrid video coding framework and those of their model. Furthermore, to alleviate error propagation and enable coding adaptation to different types of video content, they proposed an improved DVC model [4]. Lin et al. [5] employed multiple reference frames to help predict the current frame more accurately, yielding smaller residuals.

However, optimizing compression towards higher PSNR does not always improve the perceptual quality of decoded videos. To optimize a video compression network towards higher perceptual quality, recently proposed methods deploy Generative Adversarial Networks (GANs). Zhu et al. [6] employed a GAN to remove the spatial redundancy in video frames and improved the performance of intra prediction in the video coding process. But improving only intra-coded frames is insufficient for enhancing the quality of the whole decoded video. Veerabadran et al. [7] presented an adversarially learned video compression model based on a 3D autoencoder, which tends to eliminate blurred results under extreme video compression. However, 3D convolutions are difficult to train because of their large number of parameters, which limits the achievable improvement in the perceptual quality of decoded videos.

In this paper, a deep video compression with perceptual optimizations (DVC-P) network is proposed, which aims at optimizing for perceptual quality of decoded videos. The main contributions of our work are summarized as follows:

(1) We optimized DVC with a discriminator network and a mixed loss to enhance perceptual quality of decoded videos.

(2) We eliminated checkerboard artifacts in DVC with nearest-neighbor interpolation, further improving the perceptual quality of decoded videos.

(3) We evaluated the performance of the proposed DVC-P in terms of Fréchet video distance (FVD) [8], a metric highly correlated with the human visual experience of videos, and in terms of a BD-rate equivalent. The proposed DVC-P outperformed DVC [3] in terms of FVD scores and achieved a 12.27% reduction in the BD-rate equivalent.

Fig. 1: The proposed DVC-P network

II Proposed Method

The structure of the DVC-P network is shown in Fig. 1, where the three proposed improvements are shown in green. The “-P(1/3)” and “-P(2/3)” modules enhance the synthesis of pixels, and the “-P(3/3)” module guides the generated frames towards real frames.

II-A Baseline Deep Video Compression Network

The structures of the residual encoder network, motion vector (MV) encoder network, optical flow network, motion compensation network and bit rate estimation follow those in the DVC network. Specifically, the residual encoder network, which encodes residuals between the raw video frame and the reconstructed video frame into bit streams, consists of four convolution layers. Each layer downsamples its input with stride=2. There is a rectified linear unit (ReLU) after every convolution except the last one. After quantization, the signal is losslessly processed by entropy coding to form the bit stream. Since both operations are non-differentiable, during training quantization is replaced by additive uniform noise [9], and entropy coding is bypassed, approximating the rate by the entropy of the latent representation. The MV encoder network follows the same design as the residual encoder network. A pretrained optical flow estimation network [10] is used to estimate motion between the generated/reference frame and the current raw frame. It is fine-tuned during the training process. The motion compensation network performs the warp operation and prepares for residual calculation. The bit rate is estimated with the entropy model in [9].
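The training-time substitute for quantization described above can be illustrated with a minimal sketch. The function name `quantize`, the list-based latent and the choice of round-half-up at inference are assumptions made here for illustration; the actual networks operate on tensors inside a deep learning framework.

```python
import math
import random


def quantize(latent, training, rng=random):
    """Quantize a list of latent values (hypothetical helper).

    Training: add noise drawn uniformly from [-0.5, 0.5], a differentiable
    stand-in for rounding. Inference: round to the nearest integer.
    """
    if training:
        return [v + rng.uniform(-0.5, 0.5) for v in latent]
    # floor(v + 0.5) rounds halves up, avoiding Python's banker's rounding.
    return [float(math.floor(v + 0.5)) for v in latent]


if __name__ == "__main__":
    latent = [0.2, 1.7, -3.4]
    print(quantize(latent, training=True, rng=random.Random(0)))
    print(quantize(latent, training=False))
```

The noisy values stay within half a quantization step of the input, so the statistics seen during training resemble those of the rounded values used at inference.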

II-B Perceptual Optimizations

II-B1 Proposed Generator and Discriminator

At the encoder side, residuals and MVs are reconstructed using generator networks in order to generate reference frames for inter coding. Both generator networks have four convolution layers. Instead of the common strided deconvolution, these generator networks use nearest-neighbor interpolation to upsample and restore the original resolution of the signals. The activation function is ReLU. Moreover, in our proposed DVC-P we implement the discriminator from DCGAN [11], which proposed a set of constraints on the architecture of discriminator networks to make them stable to train in most settings.
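As a rough illustration of the upsampling used in the generators, the sketch below repeats every sample along both axes. The helper name `upsample_nearest` and the nested-list frame are assumptions for this example; in a real network this step would be a framework upsampling layer followed by a convolution.

```python
def upsample_nearest(frame, factor=2):
    """Upsample a 2-D list by repeating each sample `factor` times per axis."""
    out = []
    for row in frame:
        # Repeat every value horizontally ...
        wide = [v for v in row for _ in range(factor)]
        # ... then repeat the widened row vertically.
        out.extend(list(wide) for _ in range(factor))
    return out


if __name__ == "__main__":
    for row in upsample_nearest([[1, 2], [3, 4]]):
        print(row)
```

Every output sample is a copy of exactly one input sample, so the upsampling step itself contributes equally to all output positions.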

II-B2 Mixed Loss Function

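The total loss of the proposed DVC-P is formulated as the weighted sum of the MSE loss, adversarial loss, VGG-based loss and bit rate loss:

$$L = \lambda_{MSE} L_{MSE} + \lambda_{adv} L_{adv} + \lambda_{VGG} L_{VGG} + \lambda_{R} L_{R} \qquad (1)$$

where $L_{MSE}$, $L_{adv}$, $L_{VGG}$ and $L_{R}$ represent the MSE loss, adversarial loss, VGG-based loss and bit rate loss, respectively, and $\lambda_{MSE}$, $\lambda_{adv}$, $\lambda_{VGG}$ and $\lambda_{R}$ are the corresponding weights.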

Adversarial loss is computed as:


where is the input of the generator, represents the generator, and represents the discriminator. We use the least squares loss function from LSGAN [12], which addresses the vanishing gradient problem during the training process.

VGG-based loss is computed as:


where represents the raw frame, and represents the features of the 4th convolution before the 5th max-pooling layer of an ImageNet-pretrained VGG-19 network.

MSE loss is essential for video compression networks to maintain the video content unchanged. On the other hand, adversarial loss can help generators produce decoded videos of higher perceptual quality. Moreover, incorporating VGG-based loss is beneficial to stabilizing the whole training process.

Bit rate loss is computed as:


where and represent the probabilities of residuals and MVs after quantization.
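A common way to turn such probabilities into a rate estimate is to sum the ideal code length, -log2(p), over all quantized symbols. The sketch below assumes that form and additionally normalizes by the number of pixels to obtain bits per pixel (bpp); the function names and the normalization are assumptions for illustration, not a statement of the exact loss used here.

```python
import math


def estimated_bits(probabilities):
    """Ideal code length in bits: sum of -log2(p) over quantized symbols."""
    return sum(-math.log2(p) for p in probabilities)


def bits_per_pixel(residual_probs, mv_probs, height, width):
    """Rate of residuals plus MVs, normalized by frame size (assumed)."""
    total = estimated_bits(residual_probs) + estimated_bits(mv_probs)
    return total / (height * width)


if __name__ == "__main__":
    # Toy probabilities for a 2x2 frame, chosen only for the example.
    print(bits_per_pixel([0.5, 0.5], [0.25], height=2, width=2))
```

Symbols the entropy model considers likely cost few bits, so minimizing this quantity pushes the encoders towards latents that the model predicts well.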

The loss of the discriminator is computed as in LSGAN [12]:
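For readers unfamiliar with least squares GAN objectives, the sketch below shows their usual form with target labels 1 for real frames and 0 for generated frames, plus the 1/2 factor used in the LSGAN formulation. The labels, the scaling and the function names are assumptions here and may differ from the exact expressions used in DVC-P.

```python
def lsgan_discriminator_loss(d_real, d_fake):
    """Push discriminator outputs to 1 on real and 0 on generated frames."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(d_fake):
    """Push discriminator outputs on generated frames towards 1."""
    return 0.5 * sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


if __name__ == "__main__":
    # Hypothetical discriminator scores for two real and two generated frames.
    print(lsgan_discriminator_loss([0.9, 0.8], [0.2, 0.1]))
    print(lsgan_generator_loss([0.2, 0.1]))
```

Because the penalty grows quadratically with the distance from the target label, generated frames that the discriminator scores far from the target still receive a useful gradient.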


II-B3 Elimination of Checkerboard Artifacts

Deconvolution has uneven overlap in two dimensions (i.e., the x and y dimensions) when the kernel size is not divisible by the stride, which sometimes leads to checkerboard artifacts in the final outputs. An efficient and effective way to solve this issue is to upsample images by nearest-neighbor (or bilinear) interpolation followed by a convolution layer (stride=1) [14]. Furthermore, adversarial loss can further improve the visual quality of decoded frames (see Fig. 3 for the visualization results).
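The uneven overlap can be made concrete in one dimension by counting how many kernel taps of a deconvolution (transposed convolution) land on each output position. The helper below is a hypothetical illustration without padding; the kernel sizes and strides are chosen only for the example.

```python
def transposed_conv_overlap(num_inputs, kernel_size, stride):
    """Count kernel taps per output position (1-D, no padding)."""
    length = (num_inputs - 1) * stride + kernel_size
    counts = [0] * length
    for i in range(num_inputs):
        # Input sample i writes into kernel_size consecutive outputs.
        for k in range(kernel_size):
            counts[i * stride + k] += 1
    return counts


if __name__ == "__main__":
    # Kernel size 3 is not divisible by stride 2: interior counts alternate.
    print(transposed_conv_overlap(4, kernel_size=3, stride=2))
    # Kernel size 4 is divisible by stride 2: interior counts are uniform.
    print(transposed_conv_overlap(4, kernel_size=4, stride=2))
```

With kernel size 3 and stride 2 the interior positions alternate between one and two contributions, which is the pattern that shows up as a checkerboard in two dimensions. Interpolation followed by a stride-1 convolution avoids this because every output position is covered by the same number of taps.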

III Experimental Results

III-A Dataset

We use the Vimeo-90k [15] dataset to train our proposed DVC-P. Seven consecutive frames of a video sequence are regarded as one sample and cropped to 256x256 before being fed into the network. Frames in the same sample are cropped at the same position, whereas the crop position is chosen randomly for each sample. The batch size is 4. To evaluate the performance of our proposed DVC-P, tests are performed on the JCT-VC test sequences [16].
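The cropping rule can be sketched as follows: one position is drawn per sample and reused for all of its frames. The function name `crop_sample` and the nested-list frames are assumptions made for this example.

```python
import random


def crop_sample(frames, size=256, rng=random):
    """Crop every frame of one sample at the same randomly chosen position."""
    height, width = len(frames[0]), len(frames[0][0])
    # One position per sample, shared by all of its frames.
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [
        [row[left:left + size] for row in frame[top:top + size]]
        for frame in frames
    ]


if __name__ == "__main__":
    # Toy sample: 7 identical 4x6 frames whose pixels store their coordinates.
    frame = [[(y, x) for x in range(6)] for y in range(4)]
    crops = crop_sample([frame] * 7, size=2, rng=random.Random(1))
    print(crops[0])
```

Sharing the crop position keeps the motion between consecutive frames of a sample intact, which matters because the optical flow and motion compensation networks are trained on it.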

III-B Training Strategy

Like the baseline DVC, whose framework consists of several deep models, our proposed DVC-P requires a carefully designed joint training strategy. In particular, the training process consists of 700k iterations in total. When , only the optical flow network, MV encoder network and MV generator network are trained together. When reaches , the motion compensation network joins the training. When reaches , the residual encoder network and residual generator network also begin their joint training. When reaches , the discriminator begins to be optimized. As for the loss function, we only use MSE loss when ; VGG-based loss is added when reaches 40k, and adversarial loss is added when reaches 400k. Learning rate is set during the whole training.
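The staged introduction of the loss terms can be sketched with the two thresholds stated above (40k and 400k iterations). The sketch assumes that only MSE loss is used before 40k iterations, does not model the module-by-module staging, and leaves out the bit rate term because the text does not say when it is enabled; all names are hypothetical.

```python
VGG_START = 40_000           # VGG-based loss is added at 40k iterations
ADVERSARIAL_START = 400_000  # adversarial loss is added at 400k iterations
TOTAL_ITERATIONS = 700_000


def active_loss_terms(iteration):
    """Return the distortion and perception terms active at an iteration."""
    terms = ["mse"]
    if iteration >= VGG_START:
        terms.append("vgg")
    if iteration >= ADVERSARIAL_START:
        terms.append("adversarial")
    return terms


if __name__ == "__main__":
    for it in (0, VGG_START, ADVERSARIAL_START, TOTAL_ITERATIONS - 1):
        print(it, active_loss_terms(it))
```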

III-C Results

For evaluation of the proposed DVC-P, the following training parameters for Eq. (1) are used: , and . Different in Eq.(1) leads to different rate-distortion-perception trade-off. The GOP size is 10, and the first 100 frames are tested for each sequence.

III-C1 Perceptual Video Quality Metric

We measure the perceptual quality of decoded videos with FVD. When setting for proposed DVC-P and for DVC ( trades off between distortion and bit rate in DVC. corresponds to QP=37 in DVC), we compute FVD for all sequences at almost the same bit rate, as shown in Table I. Smaller FVD values correspond to better performance. We also compute a BD-rate equivalent (referred to as “FVD BD-rate”), which indicates how much less bit rate the proposed method needs to achieve the same FVD as DVC, over 4 QP points: 22, 27, 32 and 37 (corresponding to = 2048, 1024, 512 and 256, = , , and ), as shown in Table II. Note that DVC-P performs worse on BQSquare. This is because at smaller QPs (22 and 27), where FVD is already low, the bit rate is higher. If we focus only on larger QPs (32 and 37), DVC-P still performs better. In addition, we draw “FVD-Bit rate” curves in Fig. 2 to compare the performance at the 4 QPs, taking the sequence RaceHorses (class D) as an example. In general, our proposed method generates more realistic decoded videos and outperforms DVC.
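To make the idea of an FVD BD-rate more tangible, the sketch below averages the gap in log bit rate between two rate-FVD curves over their shared FVD range. It is a simplified variant that interpolates piecewise linearly, whereas the standard Bjøntegaard procedure fits a cubic polynomial; the function names and the sample points are hypothetical and are not taken from the experiments.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the curve")


def fvd_bd_rate(anchor, test, steps=1000):
    """Average bit-rate change (%) of `test` versus `anchor` at equal FVD.

    Each curve is a list of (bitrate, fvd) points with distinct FVD values.
    """
    def curve(points):
        pts = sorted((fvd, math.log(rate)) for rate, fvd in points)
        return [p[0] for p in pts], [p[1] for p in pts]

    ax, ay = curve(anchor)
    tx, ty = curve(test)
    lo, hi = max(ax[0], tx[0]), min(ax[-1], tx[-1])
    if hi <= lo:
        raise ValueError("the curves share no FVD range")
    # Midpoint-rule average of the log-rate gap over the shared FVD range.
    gap = 0.0
    for i in range(steps):
        q = lo + (hi - lo) * (i + 0.5) / steps
        gap += _interp(q, tx, ty) - _interp(q, ax, ay)
    return (math.exp(gap / steps) - 1.0) * 100.0


if __name__ == "__main__":
    anchor = [(0.1, 400.0), (0.2, 300.0), (0.4, 200.0), (0.8, 100.0)]
    test = [(0.05, 400.0), (0.1, 300.0), (0.2, 200.0), (0.4, 100.0)]
    print(f"{fvd_bd_rate(anchor, test):.2f}%")
```

A negative value means the tested codec needs less bit rate than the anchor to reach the same FVD, which matches the sign convention of the BD-rate column in Table II.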

Class  Sequence         DVC [3]                 Proposed
                        FVD       Bitrate (bpp) FVD       Bitrate (bpp)
A      Traffic          590.02    0.041         458.53    0.044
       PeopleOnStreet   593.01    0.073         566.25    0.074
B      Kimono           207.02    0.046         156.05    0.054
       ParkScene        411.47    0.044         324.74    0.047
       Cactus           572.01    0.050         453.19    0.054
       BQTerrace        449.83    0.053         369.08    0.055
       BasketballDrive  552.21    0.059         435.10    0.062
C      RaceHorses       437.78    0.094         385.52    0.099
       BQMall           566.18    0.073         425.13    0.076
       PartyScene       571.08    0.103         446.83    0.104
       BasketballDrill  674.06    0.056         528.84    0.059
D      RaceHorses       716.85    0.094         630.10    0.098
       BQSquare         1007.97   0.091         905.61    0.094
       BlowingBubbles   811.58    0.089         615.89    0.091
       BasketballPass   876.21    0.062         623.76    0.065
E      FourPeople       289.34    0.029         248.15    0.031
       Johnny           278.51    0.021         234.55    0.023
       KristenAndSara   213.72    0.024         195.45    0.026
Average                 545.49    0.061         444.60    0.064
TABLE I: FVD and Bitrate Comparison of Standard Test Sequences at QP=37 ( and )
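As a quick reading aid, the Average row of Table I can be converted into relative changes. The percentages computed below are plain arithmetic on the tabulated averages and are not figures reported in the paper.

```python
# Average row of Table I (QP=37).
dvc_fvd, dvcp_fvd = 545.49, 444.60
dvc_bpp, dvcp_bpp = 0.061, 0.064

# Relative FVD reduction and relative bit-rate increase of DVC-P over DVC.
fvd_reduction = (dvc_fvd - dvcp_fvd) / dvc_fvd * 100.0
bpp_increase = (dvcp_bpp - dvc_bpp) / dvc_bpp * 100.0

print(f"FVD reduction: {fvd_reduction:.1f}%")
print(f"Bit-rate increase: {bpp_increase:.1f}%")
```

On these averages the FVD drops by roughly 18.5% while the bit rate rises by roughly 4.9%, which is why a BD-rate style comparison at equal FVD is needed to separate the two effects.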

III-C2 Elimination of Checkerboard Artifacts

Deconvolutions tend to introduce checkerboard artifacts (sometimes very strong ones), which cause colorful blurs to appear in the decoded video BasketballDrive compressed with DVC (). Nearest-neighbor interpolation and bilinear interpolation can eliminate this kind of artifact to some extent. In addition, our proposed GAN () with adversarial loss further improves the perceptual quality of this decoded video. A visual comparison is shown in Fig. 3. Thanks to nearest-neighbor interpolation and adversarial loss, checkerboard artifacts are eliminated and the perceptual quality is satisfactory. Although checkerboard artifacts only appear in BasketballDrive, we measure FVD and PSNR values for all sequences to explore the influence of the nearest-neighbor and bilinear interpolation methods, as shown in Table II.

Fig. 2: Performance comparison of the proposed DVC-P with DVC for sequence RaceHorses(class D)

III-C3 Visual Comparison

A visual comparison between the proposed DVC-P and DVC is shown in Fig. 4 for three randomly selected areas in inter-coded frames. It can be seen that DVC produces more blurred areas, which degrades the perceptual quality.

IV Conclusion

In this paper, we have proposed the DVC-P network, which aims at restoring decoded videos with high perceptual quality at the decoder side. By using a discriminator and a mixed loss to guide the whole video compression network towards generating realistic decoded videos, our proposed DVC-P outperforms DVC in terms of a BD-rate equivalent and visual experience.

                        QP = 37                                                              QP = {22, 27, 32, 37}
Class  Sequence         DVC [3]         Bilinear        NN              NN+Adversarial Loss  FVD BD-rate
                        FVD      PSNR   FVD      PSNR   FVD      PSNR   FVD      PSNR
A      Traffic          590.02   31.96  564.03   31.93  575.08   31.96  458.53   31.82       -16.55%
       PeopleOnStreet   593.01   31.28  640.46   30.47  609.66   30.96  566.25   31.13       -2.95%
B      Kimono           207.02   34.62  211.62   34.47  220.56   34.62  156.05   34.81       -14.12%
       ParkScene        411.47   31.22  407.70   31.06  412.63   31.14  324.74   31.17       -13.74%
       Cactus           572.01   30.17  618.71   30.00  612.05   30.15  453.19   29.98       -26.40%
       BQTerrace        449.83   30.00  439.58   29.90  449.12   29.98  369.08   29.84       -9.50%
       BasketballDrive  552.21   28.74  452.65   30.34  448.62   30.78  435.10   30.70       -7.03%
C      RaceHorses       437.78   27.75  468.58   27.44  444.40   27.65  385.52   27.69       -12.46%
       BQMall           566.18   27.77  614.02   27.72  590.29   27.88  425.13   27.64       -8.57%
       PartyScene       571.08   26.18  600.88   26.07  596.08   26.18  446.83   25.98       -2.43%
       BasketballDrill  674.06   29.86  689.69   29.58  666.61   29.80  528.84   29.66       -10.51%
D      RaceHorses       716.85   27.54  767.19   27.23  742.58   27.50  630.10   27.51       -16.30%
       BQSquare         1007.97  26.97  1013.62  26.95  1028.34  27.08  905.61   26.57       3.62%
       BlowingBubbles   811.58   27.15  857.07   27.02  815.75   27.15  615.89   27.00       -13.49%
       BasketballPass   876.21   28.84  806.65   28.66  785.32   28.86  623.76   28.70       -18.12%
E      FourPeople       289.34   34.48  279.09   34.55  291.06   34.54  248.15   33.92       -20.18%
       Johnny           278.51   35.77  275.73   35.87  263.00   35.86  234.55   35.38       -15.37%
       KristenAndSara   213.72   35.20  208.42   35.36  214.92   35.30  195.45   34.65       -16.71%
Average                 545.49   30.31  550.87   30.26  542.56   30.41  444.60   30.23       -12.27%
TABLE II: FVD and PSNR Comparison and a BD-rate Equivalent of Standard Test Sequences (“NN” refers to “nearest-neighbor”)

Fig. 3: Comparison of elimination of checkerboard artifacts (NN refers to nearest-neighbor)

Fig. 4: Visual comparison of the proposed DVC-P with DVC (The number of bpp represents the corresponding bit rate of the whole frame)