Multitask Learning for VVC Quality Enhancement and Super-Resolution

04/16/2021 ∙ by Charles Bonnineau, et al.

The latest video coding standard, called versatile video coding (VVC), includes several novel and refined coding tools at different levels of the coding chain. These tools bring significant coding gains with respect to the previous standard, high efficiency video coding (HEVC). However, the encoder may still introduce visible coding artifacts, mainly caused by coding decisions applied to adjust the bitrate to the available bandwidth. Hence, pre- and post-processing techniques are generally added to the coding pipeline to improve the quality of the decoded video. Thanks to recent advances in deep learning, learning-based versions of these methods have shown outstanding results compared to traditional approaches. Generally, multiple neural networks are trained independently to perform different tasks, and thus fail to exploit the redundancy that exists between the models. In this paper, we investigate a learning-based solution as a post-processing step to enhance the decoded VVC video quality. Our method relies on multitask learning to perform both quality enhancement and super-resolution using a single shared network optimized for multiple degradation levels. The proposed solution achieves good performance in both mitigating coding artifacts and super-resolution with fewer network parameters than traditional specialized architectures.




I Introduction

In recent years, the amount of video data has considerably increased due to the massive usage of video in our lives. Recent progress in camera lens and hardware technology has permitted the emergence of new video formats, improving the quality of experience (QoE) of end users. These video formats, including high dynamic range (HDR), high frame rate (HFR) and ultra-high definition (UHD) with 4K and 8K resolutions, allow for a more realistic and immersive visual experience by increasing the amount of detail about the scene. Therefore, highly efficient coding solutions are needed to deliver all these new video formats over distribution networks. In response to this challenge, the Joint Video Experts Team (JVET), established by the ITU-T and MPEG, has developed a new video coding standard, called VVC [20]. VVC enables around 40% bit-rate reduction over HEVC [17] for the same visual quality [16].

In parallel, deep learning-based video processing solutions have been developed to meet these requirements and accelerate the deployment of these new services. For instance, the authors in [4] have proposed a backward-compatible solution for broadcasting 8K and 4K signals using VVC and super-resolution. In [18], a feed-forward network is used as an in-loop filter to enhance the quality of the VVC decoded frames. In practice, these models are trained independently, although some of the computed features can be useful for other tasks. Moreover, several models are generally proposed to optimize the network for different types of input data, e.g., levels of degradation or input channels. All these redundant parameters need to be stored or transmitted several times, which is inappropriate for devices with limited energy and computing resources (mobile phones, TV chipsets, etc.). In this paper, we propose a multitask learning-based method that performs two tasks, super-resolution and quality enhancement of VVC intra-coded frames, using a single network. We also use a multi-QPs training strategy, i.e., training across several quantization parameters (QPs), based on fine-tuning and prior information inspired by [18]. Our approach enables a significant reduction of the number of parameters while maintaining good performance compared to dedicated solutions.

The rest of this paper is organized as follows. Section II provides a brief overview of deep learning-based post-processing and multitask learning approaches. The proposed solution is described in Section III. Section IV presents a performance evaluation through different experiments and provides an analysis of the results. Finally, Section V concludes this paper.

Fig. 1: The global pipeline of our multitask approach for super-resolution and quality enhancement.

II Related Work

With the recent progress in machine learning (ML), a new type of learning-based super-resolution algorithm has emerged [8, 15, 13]. These models are based on artificial neural networks (ANNs) and aim at learning the non-linear mapping that exists between low-resolution (LR) and high-resolution (HR) images. Those learning methods have outperformed the state-of-the-art up-scaling techniques, including hand-crafted interpolation filters [11] and other learning approaches ranging from neighbor embedding [7] to sparse coding [21]. These architectures were also used without up-scaling for quality enhancement of reconstructed images to remove coding artifacts generated by lossy compression methods [22]. More advanced models were later proposed. For instance, He et al. [10] considered the partition as prior information to make the network focus on the block boundaries. In [18], an architecture was developed with an attention-based mechanism on pixels and channels to increase the quality enhancement efficiency.

Recently, some deep models have used multitask learning (MTL) [6] to perform multiple tasks using a single network. The learned representations can then be accessed by all the tasks in order to exploit redundant features and improve the performance. This concept has been essentially applied to high-level vision tasks [14], where promising results were obtained. A first approach for super-resolution and quality enhancement has been proposed in [23] using a gated fusion module. However, this method mainly focuses on improving super-resolution applied to degraded images and does not maximize the information sharing between the tasks.

III Proposed Method

Our approach aims to exploit the similarity between two tasks, super-resolution (SR) and quality enhancement (QE), with hard parameter sharing using a shared network $f_{sh}$ and two task-specific modules $f_{SR}$ and $f_{QE}$. This method allows the model to benefit from the feature redundancy that exists between both tasks. Consequently, the number of parameters can be reduced while maintaining a good quality of reconstruction, compared to specialized architectures.

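To make the model capable of generalizing across several input quantization parameters (QPs), we use a QP map, as in [18], as prior information to the network. This prior input corresponds to a uniform normalized map computed as:

$$\mathrm{QP}_{map}(i, j) = \frac{\mathrm{QP}}{\mathrm{QP}_{max}}, \tag{1}$$

where $i$ and $j$ are the vertical and horizontal pixel coordinates. The value of $\mathrm{QP}_{max}$ is equal to 63 in VVC.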
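The construction of this prior input can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `make_qp_map` and `concat_with_qp_map` are hypothetical, and images are represented as nested lists of channels rather than tensors.

```python
QP_MAX = 63  # maximum QP value in VVC, as stated in the text


def make_qp_map(qp, height, width, qp_max=QP_MAX):
    """Return a height x width map where every pixel holds qp / qp_max."""
    if not 0 <= qp <= qp_max:
        raise ValueError("qp must lie in [0, qp_max]")
    value = qp / qp_max
    return [[value for _ in range(width)] for _ in range(height)]


def concat_with_qp_map(image_channels, qp):
    """Append the QP map as one extra channel to a list of 2-D channels.

    image_channels: list of channels, each a list of rows (hypothetical layout).
    """
    height = len(image_channels[0])
    width = len(image_channels[0][0])
    return image_channels + [make_qp_map(qp, height, width)]


# Example: a 2x4 three-channel image coded at QP 32 gains a fourth channel.
rgb = [[[0.5] * 4 for _ in range(2)] for _ in range(3)]
net_input = concat_with_qp_map(rgb, qp=32)
print(len(net_input), net_input[3][0][0])
```

Because the map is uniform, it carries a single scalar per image; expressing it as a full channel simply lets a convolutional network consume it alongside the pixels.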

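In the following, let $x_{LR}$ denote a low-resolution image of size $H \times W$ and $\hat{x}_{LR}$ its reconstructed version that may include coding artifacts. We first extract the shared features $F_{sh}$ from the input image $\hat{x}_{LR}$ concatenated with its corresponding QP map $\mathrm{QP}_{map}$ using the shared network $f_{sh}$ as follows:

$$F_{sh} = f_{sh}\left([\hat{x}_{LR}, \mathrm{QP}_{map}];\, \theta_{sh}\right), \tag{2}$$

with $\theta_{sh}$ the parameters of the shared network and $[\hat{x}_{LR}, \mathrm{QP}_{map}]$ denoting the concatenation of $\hat{x}_{LR}$ and $\mathrm{QP}_{map}$.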

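The output images $\hat{y}_{SR}$ and $\hat{y}_{QE}$ are then estimated from the shared features $F_{sh}$ using the task-specific modules $f_{SR}$ and $f_{QE}$, with respective parameters $\theta_{SR}$ and $\theta_{QE}$, according to the following equations:

$$\hat{y}_{SR} = f_{SR}\left(F_{sh};\, \theta_{SR}\right), \tag{3}$$

$$\hat{y}_{QE} = f_{QE}\left(F_{sh};\, \theta_{QE}\right). \tag{4}$$

We selected the L1 loss [24] to compute the task-specific losses $\mathcal{L}_{SR}$ and $\mathcal{L}_{QE}$ between the estimated images $\hat{y}_{SR}$ and $\hat{y}_{QE}$, and the original images $y_{HR}$ and $x_{LR}$.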

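As our architecture is mainly inspired by [13], we first pre-train the network to perform super-resolution on uncompressed images. The pre-trained parameters $\theta_{sh}^{*}$ and $\theta_{SR}^{*}$ are obtained by solving the following optimization problem:

$$\left(\theta_{sh}^{*}, \theta_{SR}^{*}\right) = \arg\min_{\theta_{sh},\, \theta_{SR}} \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_{SR}\left(f_{SR}\left(f_{sh}\left(x_{LR}^{(n)};\, \theta_{sh}\right);\, \theta_{SR}\right),\, y_{HR}^{(n)}\right), \tag{5}$$

with $y_{HR}^{(n)}$ the high-resolution training images, $x_{LR}^{(n)}$ the corresponding low-resolution versions, $N$ the number of training samples and $n$ the sample index.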

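Finally, we optimize the overall multitask network by combining both task-specific losses in the multitask loss function $\mathcal{L}_{MTL}$ with a weighting parameter $\alpha$ as follows:

$$\mathcal{L}_{MTL} = \alpha\, \mathcal{L}_{SR} + (1 - \alpha)\, \mathcal{L}_{QE}. \tag{6}$$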
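As a minimal sketch of how the two L1 losses can be combined, the snippet below assumes a convex combination controlled by a weighting parameter (called `alpha` here), which is consistent with the single-task cases described later in the paper (one task only at each extreme of the weight). Images are flattened to plain lists of pixel values, and the function names are hypothetical.

```python
def l1_loss(estimate, target):
    """Mean absolute error between two equally sized flat pixel sequences."""
    if len(estimate) != len(target) or not estimate:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(estimate)


def multitask_loss(sr_est, sr_ref, qe_est, qe_ref, alpha=0.9):
    """Weighted sum of the SR and QE L1 losses.

    alpha = 1 keeps only the super-resolution term and alpha = 0 keeps only
    the quality enhancement term; 0.9 is the value reported in Section IV-A.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    loss_sr = l1_loss(sr_est, sr_ref)
    loss_qe = l1_loss(qe_est, qe_ref)
    return alpha * loss_sr + (1.0 - alpha) * loss_qe


# Toy example: SR error of 1.0 and QE error of 0.5, weighted 0.9 / 0.1.
print(multitask_loss([0.0, 0.0], [1.0, 1.0], [0.0], [0.5]))
```

With a weight close to 1, the gradient is dominated by the super-resolution term, which is worth keeping in mind when comparing the convergence of the two tasks.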


The schematic in Fig. 1 illustrates the structure of the different components of our multitask network. The shared network mainly consists of residual blocks (RBs) with short and long skip connections. These connections allow the network to learn the identity function, improving the gradient flow from the deep to the shallow layers during the back-propagation step. They also lead to sparser feature maps and, thus, better performance. Each convolutional layer uses 256 filters. We introduce the non-linearity with the ReLU activation function between layers at different stages of the network. This structure is directly inspired by the enhanced deep super-resolution (EDSR) network [13], which achieves state-of-the-art performance for super-resolution. We split the network at a very deep stage of the architecture to maximize the parameter sharing between tasks. For the super-resolution module $f_{SR}$, we use the Pixel-Shuffle upscaling layer [15] at the end of the network. The same structure is used for the quality enhancement module $f_{QE}$, without the upscaling layer.
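To make the upscaling step concrete, here is a small pure-Python sketch of the sub-pixel (Pixel-Shuffle) rearrangement of [15]: a stack of C·r² feature maps of size H×W is reorganized into C maps of size rH×rW. The channel ordering follows the convention used by PyTorch's `PixelShuffle` layer; the nested-list layout and the function name are illustrative assumptions, and the upscaling factor `r` is a free parameter here.

```python
def pixel_shuffle(channels, r):
    """Rearrange C*r*r maps of size HxW into C maps of size (H*r)x(W*r).

    channels: list of 2-D maps (lists of rows); r: integer upscaling factor.
    """
    if r < 1 or len(channels) % (r * r) != 0:
        raise ValueError("number of channels must be a multiple of r*r")
    out_c = len(channels) // (r * r)
    h = len(channels[0])
    w = len(channels[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(out_c)]
    for c in range(out_c):
        for y in range(h * r):
            for x in range(w * r):
                # The sub-pixel offset (y % r, x % r) selects the source map.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = channels[src][y // r][x // r]
    return out


# Four 1x1 maps become a single 2x2 map.
print(pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], r=2))
```

Since the rearrangement has no parameters, the upscaling cost sits entirely in the convolution that produces the C·r² channels, all computed at low resolution.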

IV Experimental Results

IV-A Training

For all experiments, we train the networks with the DIV2K image dataset [1]. The latter consists of 900 HD PNG pictures with a high diversity of spatial characteristics. To guard against overfitting, we evaluate the performance on the Set5 image dataset [2]. The low-resolution images $x_{LR}$ are generated by bicubic downscaling of the high-resolution images $y_{HR}$. To generate the reconstructed versions $\hat{x}_{LR}$ of the uncompressed images $x_{LR}$, we use the VVC test model (VTM) in all-intra configuration with $\mathrm{QP} \in \{22, 27, 32, 37\}$ in order to simulate different levels of coding artifacts. We first convert the images from PNG to YUV 4:2:0 format. Then, we collect the reconstructed images and convert them back to RGB. For training, we use patches extracted from the training set to reduce GPU memory usage. To test the performance of our network on video sequences, we also generate data from ClassB and ClassA of the JVET common test conditions (CTC) [9] using VVC all-intra, as described above. We also include two 8K videos, selected from the dataset given in [5]. For all experiments, the quality is assessed on the luma component using the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [19] image quality metrics computed between the estimated and original images. We also compute Δ-PSNR and Δ-SSIM, which indicate the gain compared to the decoded images prior to post-processing. For super-resolution, we use bicubic interpolation as the anchor.
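The quality metrics above can be illustrated with a short sketch of PSNR and of the PSNR gain over the unprocessed decoded image. It assumes 8-bit samples (peak value 255) and flat lists of luma values; the function names are hypothetical and this is not the authors' evaluation code.

```python
import math


def psnr(estimate, reference, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat sample sequences."""
    if len(estimate) != len(reference) or not estimate:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(estimate)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)


def delta_psnr(processed, decoded, reference, peak=255.0):
    """PSNR gain of the post-processed image over the decoded anchor."""
    return psnr(processed, reference, peak) - psnr(decoded, reference, peak)


# Toy luma samples: post-processing halves the error of the decoded image.
reference = [100, 100, 100, 100]
decoded = [110, 110, 110, 110]
processed = [105, 105, 105, 105]
print(round(delta_psnr(processed, decoded, reference), 2))
```

Halving the error amplitude divides the mean squared error by four, which corresponds to a gain of about 6 dB regardless of the peak value.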

We train our model over 250 epochs. The learning rate used for from-scratch training is halved for fine-tuning. For the latter, the pre-trained weights are obtained by training the network for super-resolution on uncompressed image pairs for 1000 epochs. We apply a learning rate decay with a gamma of 0.5 every 75 epochs to improve convergence. We use a batch size of 8 and optimize the model with ADAM [12]. The weighting parameter $\alpha$ of the multitask loss was tuned and fixed to 0.9 after a grid search over different values. All the experiments are performed on an NVIDIA Tesla V100 GPU using PyTorch.
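The step decay described above (gamma of 0.5 every 75 epochs over 250 epochs) can be written as a one-line schedule. The sketch below is illustrative: the base learning rate `1e-4` is a hypothetical placeholder rather than the value used in the paper, and the function name is an assumption. The behaviour is analogous to PyTorch's `StepLR` scheduler.

```python
def step_decay_lr(base_lr, epoch, step_size=75, gamma=0.5):
    """Learning rate at a given epoch under step decay."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * gamma ** (epoch // step_size)


BASE_LR = 1e-4  # hypothetical value, for illustration only

# From-scratch schedule, and fine-tuning schedule with the halved base rate.
for epoch in (0, 75, 150, 225):
    scratch = step_decay_lr(BASE_LR, epoch)
    fine_tune = step_decay_lr(0.5 * BASE_LR, epoch)
    print(epoch, scratch, fine_tune)
```

Over 250 epochs the rate is halved three times (at epochs 75, 150 and 225), so training ends at one eighth of the initial rate.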

IV-B Multi-QPs optimization

In the first experiment, we evaluate the ability of our multitask model to generalize across several QPs through an ablation study. The tested configurations include multi-QPs training, fine-tuning and the QP map, as described in Section III. Since less data are available for the training of the QP-specific networks than for a single multi-QPs network, we multiply the number of epochs by the number of tested QPs, i.e., 4, for these QP-specific configurations. We also adjust the learning rate decay interval accordingly in this case. Thus, all the presented models are trained with the same number of parameter updates, allowing a fair evaluation. The QP map is computed for each tested QP by (1). We set the number of residual blocks (RBs) to $B = 8$ for all the tested models, leading to around 13 million parameters per network.

QP      | (a) SR        | (a) QE        | (b) SR        | (b) QE        | (c) SR        | (c) QE        | (d) SR        | (d) QE
QP22    | 35.80 (+2.66) | 43.07 (+0.43) | 35.85 (+2.71) | 43.19 (+0.55) | 35.69 (+2.55) | 42.92 (+0.28) | 35.67 (+2.53) | 42.98 (+0.34)
QP27    | 34.17 (+1.91) | 39.18 (+0.39) | 34.18 (+1.92) | 39.24 (+0.45) | 34.16 (+1.90) | 39.21 (+0.42) | 34.09 (+1.83) | 39.13 (+0.34)
QP32    | 32.16 (+1.22) | 35.62 (+0.40) | 32.16 (+1.22) | 35.67 (+0.45) | 32.14 (+1.20) | 35.63 (+0.41) | 32.10 (+1.16) | 35.58 (+0.36)
QP37    | 29.85 (+0.70) | 32.20 (+0.35) | 29.89 (+0.74) | 32.27 (+0.42) | 29.78 (+0.63) | 32.13 (+0.28) | 29.81 (+0.66) | 32.16 (+0.31)
Average | 33.00 (+1.63) | 37.52 (+0.39) | 33.02 (+1.65) | 37.59 (+0.46) | 32.94 (+1.57) | 37.47 (+0.34) | 32.92 (+1.55) | 37.46 (+0.33)
TABLE I: Ablation study of our model on Set5 for both tasks in terms of PSNR (dB) and Δ-PSNR (dB, in brackets). Configurations: (a) multi-QPs model with fine-tuning and QP map; (b) QP-specific models; (c) multi-QPs model without QP map; (d) multi-QPs model without fine-tuning.
Fig. 2: Convergence analysis of our multitask approach on Set5 regarding both tasks for different multi-QPs configurations: (a) super-resolution, (b) quality enhancement.

Table I shows the performance of our model on the Set5 dataset for different input QPs in terms of PSNR (dB) and Δ-PSNR (dB) for both super-resolution and quality enhancement. We perform an ablation study to evaluate the contribution of each component of our multi-QPs model to the global performance of the network. We observe that fine-tuning the network pre-trained with uncompressed images leads to gains of 0.08 dB and 0.06 dB for SR and QE, respectively. Notably, quality enhancement also benefits even though the parameters were pre-trained for super-resolution. We also see that the QP map contributes to the performance of our multi-QPs model by increasing the quality of reconstruction by 0.06 dB for SR and 0.05 dB for QE. It can be noted that the models based on QP-specific training perform slightly better in terms of quality than our multi-QPs model. However, training one network per QP requires four times more training time and parameters than a multi-QPs network to reach this level of performance.

Fig. 2 visualizes the convergence of each multi-QPs configuration by assessing the PSNR on the validation set at each training epoch for both tasks. We clearly notice that fine-tuning offers a more stable training with a faster convergence than from-scratch training for super-resolution. This is not surprising, as the network starts to learn with weights that are already tuned for a related task. Although this configuration also leads to better results for quality enhancement, the effect is less pronounced in that case. Moreover, the training is globally less stable for this task. This can be explained by the fact that the loss related to super-resolution has a larger weight in the proposed multitask loss $\mathcal{L}_{MTL}$. However, we notice that the use of the QP map leads to a better convergence for both tasks.

IV-C Multitask learning

Method         | Baseline-B | SR PSNR (Δ-PSNR) | SR SSIM (Δ-SSIM) | QE PSNR (Δ-PSNR) | QE SSIM (Δ-SSIM)
Single-task SR | SR-4       | 32.87 (+1.50)    | 0.8909 (+0.0195) | -                | -
Single-task SR | SR-8       | 33.00 (+1.63)    | 0.8925 (+0.0211) | -                | -
Single-task QE | QE-4       | -                | -                | 37.43 (+0.30)    | 0.9552 (+0.0020)
Single-task QE | QE-8       | -                | -                | 37.50 (+0.36)    | 0.9557 (+0.0025)
Sequential     | QE-4;SR-4  | 32.88 (+1.51)    | 0.8912 (+0.0198) | 37.43 (+0.30)    | 0.9552 (+0.0020)
Multitask      | MTL-8      | 33.00 (+1.63)    | 0.8924 (+0.0210) | 37.52 (+0.39)    | 0.9558 (+0.0026)
TABLE II: Average performance (over images and QPs) of the different baselines in PSNR (dB) and SSIM, with Δ-PSNR (dB) and Δ-SSIM in brackets, computed on the Set5 dataset. The value of B corresponds to the number of residual blocks (RBs) used in the shared network for Baseline-B.

In this experiment, we demonstrate the effectiveness of our multitask approach compared to specialized networks. Single-task architectures derive from our multitask solution with $\alpha$ set to 0 and 1 in the multitask loss $\mathcal{L}_{MTL}$, defined in (6), for quality enhancement and super-resolution, respectively. We also include a sequential configuration based on these two single-task networks. For an input image $\hat{x}_{LR}$, the sequential configuration can be expressed as:

$$\hat{y}_{SR} = \mathrm{SR}\left(\mathrm{QE}\left(\hat{x}_{LR}\right)\right), \tag{7}$$

where $\mathrm{QE}(\cdot)$ and $\mathrm{SR}(\cdot)$ denote the single-task quality enhancement and super-resolution networks. In that case, quality enhancement is applied to the input image before passing through the super-resolution specialized network. For this experiment, we set the number of residual blocks (RBs) for each specialized network in both sequential and single-task configurations to $B = 4$, leading to approximately 7.7 million parameters per network. We also include single-task models with $B = 8$ to match the performance of the multitask configuration. All the models are trained using fine-tuning and the QP map. Table II gives the performance in terms of PSNR, SSIM, Δ-PSNR and Δ-SSIM for both tasks.

Firstly, we notice that the sequential configuration does not perform significantly better than the single-task networks for the same total number of parameters. Moreover, single-task super-resolution with $B = 8$ performs better than the sequential model on this task. For our multitask approach, the performance of the single-task networks with $B = 8$ is reached with half the parameters. In addition, the multitask network performs better in terms of quality than the single-task networks with $B = 4$, which have a comparable total number of parameters. This can be explained by the fact that a large number of features are computed twice across the specialized architectures. Thus, a deeper multitask model allows new representations to be learned, increasing the quality of reconstruction for both tasks.
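The parameter budgets behind these comparisons follow directly from the per-network figures quoted in the text (around 13 million parameters for a network with eight residual blocks, approximately 7.7 million with four). As a rough worked comparison, with M denoting millions of parameters:

$$
\underbrace{2 \times 7.7\,\mathrm{M}}_{\text{SR-4 + QE-4}} \approx 15.4\,\mathrm{M}
\quad \text{vs.} \quad
\underbrace{13\,\mathrm{M}}_{\text{MTL-8}},
\qquad
\underbrace{2 \times 13\,\mathrm{M}}_{\text{SR-8 + QE-8}} \approx 26\,\mathrm{M}
\quad \text{vs.} \quad
\underbrace{13\,\mathrm{M}}_{\text{MTL-8}}.
$$

The first pair is the comparison at a roughly equal total budget, and the second is the case where the multitask network matches the single-task quality with about half the parameters. These totals are approximations derived from the rounded figures above, not exact counts.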

Fig. 3: Average feature maps at different stages of the baselines QE-8, SR-8 and MTL-8 for the input image butterfly.png encoded with VTM all-intra (QP 27). Conv$i$ corresponds to the $i$-th convolutional layer of the network.

In Fig. 3, we display the average feature maps for different convolutional layers of the single-task and multitask architectures. As shown in this figure, the features are globally similar and become more complex and specialized in the deeper layers. For Conv17, the average feature map of the multitask network is more similar to that of super-resolution than to that of quality enhancement, mostly because $\alpha = 0.9$ in the multitask loss of the proposed model. This demonstrates that a high correlation exists between the presented single-task models, which can be exploited in the multitask architecture.

IV-D Coding performance

In the last experiment, we investigate the performance of our multitask model applied as post-processing for video delivery against single-task networks. The input signal is first downscaled and encoded. Then, both post-processing tasks are performed on the decoded signal outside the coding loop, as presented in [4] for super-resolution. The bit-rate is assessed on the low-resolution signal. For this experiment, we consider the same total number of parameters for both tested configurations, i.e., $B = 8$ for our multitask network and $B = 4$ for each single-task network. We use the Bjøntegaard delta rate (BD-rate) method described in [3] to evaluate our approach. Table III presents the results for both super-resolution and quality enhancement.
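For readers unfamiliar with the metric, the following is a minimal sketch of a BD-rate computation in the spirit of [3], assuming exactly four rate/quality points per curve (for instance, one per tested QP). With four points, the cubic fit of log-rate as a function of PSNR is an exact interpolation, written here in Lagrange form, and Simpson's rule integrates a cubic exactly. The function names and the sample numbers are hypothetical; this is not the authors' evaluation script.

```python
import math


def lagrange_eval(xs, ys, x):
    """Evaluate the unique polynomial through the points (xs, ys) at x.

    The xs must be distinct, otherwise a division by zero occurs.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def simpson(f, a, b):
    """Simpson's rule on [a, b]; exact for polynomials up to degree 3."""
    mid = 0.5 * (a + b)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(mid) + f(b))


def bd_rate(anchor, test):
    """Average bit-rate difference (%) of test vs. anchor at equal quality.

    anchor, test: lists of four (bitrate, psnr) pairs. Negative = savings.
    """
    if len(anchor) != 4 or len(test) != 4:
        raise ValueError("this sketch expects four points per curve")
    psnr_a = [q for _, q in anchor]
    psnr_t = [q for _, q in test]
    log_a = [math.log10(r) for r, _ in anchor]
    log_t = [math.log10(r) for r, _ in test]
    # Integrate over the quality range covered by both curves.
    lo = max(min(psnr_a), min(psnr_t))
    hi = min(max(psnr_a), max(psnr_t))
    if hi <= lo:
        raise ValueError("the two curves do not overlap in quality")
    int_a = simpson(lambda q: lagrange_eval(psnr_a, log_a, q), lo, hi)
    int_t = simpson(lambda q: lagrange_eval(psnr_t, log_t, q), lo, hi)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0


# Hypothetical curves: the test codec needs 10% less rate at every quality.
anchor = [(1000.0, 30.0), (2000.0, 33.0), (4000.0, 36.0), (8000.0, 39.0)]
test = [(0.9 * rate, quality) for rate, quality in anchor]
print(round(bd_rate(anchor, test), 2))
```

A negative value means the tested configuration needs less bit-rate than the anchor for the same objective quality. The same routine applies to SSIM by passing SSIM values in place of PSNR.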

On average, our multitask model achieves bit-rate savings of 2.8%/2.1% for super-resolution and 2.3%/1.1% for quality enhancement over specialized networks for the same objective quality, measured with PSNR/SSIM. These gains are higher for the sequences where our method performs well against naive anchors. These video sequences, including BQTerrace and SubwayTree, contain more spatial information and need more powerful models to be accurately reconstructed.

Dataset             | Sequence          | SR (PSNR)       | SR (SSIM)       | QE (PSNR)     | QE (SSIM)
8K (7680x4320)      | SubwayTree        | -3.55 (-17.17)  | -1.15 (-9.12)   | -2.19 (-5.85) | -0.90 (-3.39)
8K (7680x4320)      | TiergartenParkway | -0.88 (-7.90)   | -0.93 (-5.46)   | -1.64 (-3.26) | -0.80 (-1.99)
ClassA1 (3840x2160) | Campfire          | -1.19 (-9.51)   | -0.78 (-6.16)   | -1.30 (-2.92) | -0.58 (-1.53)
ClassA1 (3840x2160) | FoodMarket4       | -1.97 (-15.97)  | -1.55 (-9.66)   | -2.19 (-5.13) | -0.97 (-2.67)
ClassA1 (3840x2160) | Tango2            | -1.47 (-8.41)   | -1.40 (-5.61)   | -3.91 (-6.04) | -1.19 (-3.23)
ClassA2 (3840x2160) | CatRobot1         | -2.47 (-16.75)  | -2.53 (-15.87)  | -2.41 (-5.62) | -1.47 (-3.87)
ClassA2 (3840x2160) | DaylightRoad2     | -1.37 (-10.28)  | -1.31 (-7.30)   | -1.89 (-4.77) | -0.94 (-2.44)
ClassA2 (3840x2160) | ParkRunning3      | -1.44 (-12.45)  | -1.34 (-10.15)  | -1.46 (-3.18) | -0.84 (-2.34)
ClassB (1920x1080)  | BasketballDrive   | -5.44 (-53.94)  | -4.22 (-40.42)  | -1.92 (-4.29) | -1.68 (-3.59)
ClassB (1920x1080)  | BQTerrace         | -7.99 (-55.00)  | -5.43 (-34.09)  | -2.64 (-4.97) | -1.58 (-2.82)
ClassB (1920x1080)  | Cactus            | -3.01 (-33.17)  | -2.35 (-22.93)  | -2.16 (-4.55) | -1.37 (-3.24)
ClassB (1920x1080)  | MarketPlace       | -3.52 (-28.84)  | -1.63 (-18.10)  | -3.10 (-5.25) | -0.78 (-2.64)
ClassB (1920x1080)  | RitualDance       | -2.54 (-17.50)  | -2.55 (-12.82)  | -3.61 (-7.10) | -1.66 (-5.08)
Average             |                   | -2.83 (-22.07)  | -2.09 (-15.21)  | -2.34 (-4.84) | -1.14 (-2.99)
TABLE III: BD-rate (%) of our approach computed over single-task networks regarding PSNR and SSIM for different resolution classes. The values in brackets indicate the gain compared to naive anchors, i.e., bicubic upscale for super-resolution and input quality for quality enhancement.

V Conclusion

In this work, we presented a multitask learning-based approach that performs both super-resolution and quality enhancement of VVC intra-coded frames. We used a multi-QPs training strategy based on fine-tuning and prior information. We demonstrated that our method allows a significant reduction of parameters while maintaining a good quality of reconstruction compared to specialized solutions. We also showed that our approach improves quality compared to single-task models when the same total number of parameters is considered. As future work, we plan to include the temporal aspect in our model to ensure temporal consistency and enable inter-coded frame processing. In addition, the integration of our model in the coding loop will be investigated in order to perform both quality enhancement and super-resolution of the base layer directly in a scalable codec using a single shared network.


  • [1] E. Agustsson and R. Timofte (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 126–135. Cited by: §IV-A.
  • [2] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. Cited by: §IV-A.
  • [3] G. Bjontegaard (2001-04) Document VCEG-M33: Calculation of Average PSNR Differences Between RD- Curves. Cited by: §IV-D.
  • [4] C. Bonnineau, W. Hamidouche, J. Travers, and O. Deforges (2020) Versatile video coding and super-resolution for efficient delivery of 8k video with 4k backward-compatibility. arXiv, pp. arXiv–2002. Cited by: §I, §IV-D.
  • [5] B. Bross, H. Kirchhoffer, C. Bartnik, and M. Palkow (2020-13-17 January) Document JVET-Q0791: multiformat berlin test sequences.. Cited by: §IV-A.
  • [6] R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §II.
  • [7] H. Chang, D. Yeung, and Y. Xiong (2004) Super-resolution through neighbor embedding. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., Vol. 1, pp. I–I. Cited by: §II.
  • [8] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pp. 184–199. Cited by: §II.
  • [9] X. L. F. Bossen and K. S. V. Seregin (2019-9-18 January) Document jvet-m1010: common test conditions and software reference configurations for sdr video. Cited by: §IV-A.
  • [10] X. He, Q. Hu, X. Zhang, C. Zhang, W. Lin, and X. Han (2018) Enhancing HEVC compressed videos with a partition-masked convolutional neural network. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 216–220. Cited by: §II.
  • [11] R. Keys (1981) Cubic convolution interpolation for digital image processing. IEEE transactions on acoustics, speech, and signal processing 29 (6), pp. 1153–1160. Cited by: §II.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §IV-A.
  • [13] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §II, §III, §III.
  • [14] S. Liu, E. Johns, and A. J. Davison (2018) End-to-end multi-task learning with attention. arXiv:1803.10704. Cited by: §II.
  • [15] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. arXiv, pp. arXiv–1609. Cited by: §II, §III.
  • [16] N. Sidaty, W. Hamidouche, O. Déforges, P. Philippe, and J. Fournier (2019) Compression performance of the versatile video coding: hd and uhd visual quality monitoring. In 2019 Picture Coding Symposium (PCS), pp. 1–5. Cited by: §I.
  • [17] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand (2012) Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on circuits and systems for video technology 22 (12), pp. 1649–1668. Cited by: §I.
  • [18] M. Wang, S. Wan, H. Gong, and M. Ma (2019) Attention-based dual-scale cnn in-loop filter for versatile video coding. IEEE Access 7, pp. 145214–145226. Cited by: §I, §II, §III.
  • [19] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §IV-A.
  • [20] X. Xu and S. Liu (2019) Recent advances in video coding beyond the hevc standard. APSIPA Transactions on Signal and Information Processing 8. Cited by: §I.
  • [21] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang (2012) Coupled dictionary training for image super-resolution. IEEE transactions on image processing 21 (8), pp. 3467–3478. Cited by: §II.
  • [22] K. Yu, C. Dong, C. C. Loy, and X. Tang (2016) Deep convolution networks for compression artifacts reduction. arXiv:1608.02778. Cited by: §II.
  • [23] X. Zhang, H. Dong, Z. Hu, W. Lai, F. Wang, and M. Yang (2018) Gated fusion network for joint image deblurring and super-resolution. arXiv:1807.10806. Cited by: §II.
  • [24] H. Zhao, O. Gallo, I. Frosio, and J. Kautz (2015) Loss functions for neural networks for image processing. arXiv:1511.08861. Cited by: §III.