Recent years have witnessed the rapid development of video services. Due to bandwidth limitations, video compression algorithms and video coding formats such as H.264/AVC and H.265/HEVC have been indispensable for removing spatial and temporal redundancy in videos and reducing bit-rates. However, lossy video compression also degrades the quality of compressed videos and introduces various compression artifacts, such as blocking and blurring, which inevitably deteriorate the quality of experience (QoE) of the final user. Thus, effective algorithms for video quality enhancement are important for both human viewing and machine use.
With the tremendous development of convolutional neural networks (CNNs) in recent years, a large number of learning-based works have been proposed to enhance the quality of compressed videos. Some of them [4, 5, 6, 7, 8, 9, 10] focus on enhancing the objective quality, which is most commonly measured by peak signal-to-noise ratio (PSNR). Although an improvement in PSNR indicates a decrease in objective distortion (e.g., mean square error (MSE)), it does not necessarily improve the visual experience. Networks trained with an adversarial loss, i.e., generative adversarial networks (GANs), are typically able to generate high-quality details and have developed significantly in the past few years. Accordingly, some algorithms [13, 14, 15, 16] focus on enhancing the perceptual quality of compressed videos with the help of GANs. However, they have not fully employed the global temporal information between consecutive frames to improve the perceptual quality of compressed videos.
Different from images, videos consist of consecutive frames, so in addition to spatial information, temporal information can be used for frame enhancement. How to make full use of global temporal information is a key issue in the perceptual quality enhancement of compressed videos. One of the most common methods to employ temporal information is motion compensation with estimated optical flows. However, estimated optical flows can hardly be accurate due to the various artifacts in compressed frames, which degrades the performance of temporal alignment. Another popular method is to use deformable convolutions to achieve temporal alignment. Compared with optical flows, deformable convolutions are more flexible and have the potential to achieve better performance. However, both optical flows and deformable convolutions are based on motions between consecutive frames; in this case, only local temporal information can be employed due to the limited range of motion. As shown in Fig. 1 (middle), only local temporal information can be employed to enhance a pixel on a given object (a horse, in this example). Although networks using optical flows and deformable convolutions to capture local temporal information are relatively fast, in specific use cases such as video production or archiving, it is more important to obtain the highest possible quality, even at the cost of increased complexity. Since local areas suffer from similar artifacts in video compression, it is of great importance to capture global information. Correlated pixels in the adjacent frame should be considered even when they are out of the reach of optical flow (which originates from tasks that do not look for such long-range correlations), as shown in Fig. 1 (right).
For this reason, the proposed network employs non-local attention modules to fully capture global temporal correlations in consecutive frames and pave the way for higher perceptual quality enhancement.
An additional challenge in processing compressed videos is that videos encoded at different quantization parameters (QPs) have different characteristics. Intuitively, to process compressed videos at multiple QPs, multiple models would need to be correspondingly trained, stored and transmitted to adapt to the characteristics of videos compressed at each QP, which is extremely expensive in practical applications. To train only one model that adapts to multiple QPs with negligible performance loss, many QP-conditional adaptation mechanisms have been proposed for deep networks addressing different problems. The strategy proposed in this paper is the first to successfully apply QP-conditional adaptation to the perceptual quality enhancement of compressed videos, avoiding multiple sets of model parameters while achieving almost the same enhancement performance as models trained for specific QPs.
In this paper, we propose an adaptation- and attention-based GAN, referred to as PeQuENet, to enhance the perceptual quality of compressed videos at various QPs by fully employing the global temporal information in consecutive frames. The proposed PeQuENet consists of four main parts: the pre-trained feature extraction module, the attention module, the progressive decoder module and the QP-conditional adaptation module.
Our main contributions are summarized as follows:
With the goal of increasing the capability of learned solutions for perceptual quality enhancement of compressed videos, we integrate simplified non-local attention modules for conditional attention (i.e., fusing multi-dimensional features from two input sources, target and reference) with the aim of enhancing the target frame with the most relevant (or similar) parts of the reference frame, and propose a setup in which this outperforms previously used methods (i.e., optical flow and deformable convolution).
For the first time, we successfully apply QP-conditional adaptation to a compressed video perceptual quality enhancement network by feeding encoded QP information to the network, saving model parameters without performance loss.
We compare the proposed PeQuENet with state-of-the-art compressed video quality enhancement networks quantitatively and qualitatively. Experimental results demonstrate that the proposed PeQuENet consistently provides better quality frames.
The remainder of the paper is organized as follows. Related work is summarized in Section II. The proposed PeQuENet for the perceptual quality enhancement of compressed videos is detailed in Section III. Experimental results are shown in Section IV to verify expected improvements when the proposed PeQuENet is deployed. Finally, Section V concludes the paper.
II Related Work
II-A Quality Enhancement of Compressed Videos
In the past few years, extensive works have focused on the objective quality enhancement of compressed videos. Wang et al. proposed a deep CNN with ten convolution layers and a residual structure to improve the coding efficiency of HEVC measured by BD-rate. Dai et al. employed a four-layer CNN with variable filter sizes and residual learning to improve the performance of post-processing in HEVC with low memory cost. Yang et al. proposed models, referred to as QE-CNN-I and QE-CNN-P, to enhance the objective quality of I frames and P/B frames in compressed videos, respectively. For time-constrained scenarios, they further proposed a scheme named TQEO to maximize the quality enhancement while meeting complexity requirements at the same time. Yang et al. developed a detector based on a Support Vector Machine (SVM) to recognize frames of peak quality and used them to improve neighboring frames of low quality. By employing local temporal information, they also alleviated quality fluctuation in compressed videos. Lu et al. proposed a deep Kalman model to enhance the quality of compressed videos. Instead of exploring temporal information in consecutive compressed frames, they used temporal information in consecutive restored frames in a recursive way, which effectively reduced compression artifacts. Guan et al. further improved upon this algorithm. By using a Bidirectional Long Short-Term Memory (BiLSTM)-based detector, multi-scale feature extraction and densely connected mapping construction, they outperformed the methods in [4, 5, 6, 7, 8]. Deng et al. proposed a fast quality enhancement network for compressed videos by incorporating deformable convolutions. With more flexibility, their proposed network achieves state-of-the-art performance in terms of PSNR enhancement. Due to their high performance, these methods are used for comparison with the proposed network in our experiments.
However, enhancing the objective quality of compressed videos does not necessarily improve the visual experience of humans. To improve QoE, many works have been proposed to enhance the perceptual quality of compressed videos. Wang et al. proposed a multi-level wavelet-based GAN to enhance the perceptual quality of compressed videos. By recovering high-frequency sub-bands in the wavelet domain, their proposed network achieved enhanced perceptual quality. Wang et al. proposed a simple perceptual quality enhancement network for HEVC compressed videos with the help of a GAN and residual blocks. Jin et al. proposed a multi-level progressive refinement network to enhance the perceptual quality of intra coding at the decoder end. By employing a coarse-to-fine refinement manner, their proposed network achieved a trade-off between quality and computational complexity. Zhang et al. proposed a deformable convolution-based GAN to improve the perceptual quality of compressed videos. After the current frame and its adjacent frames are aligned with deformable convolutions, the perceptual quality of compressed videos is enhanced by a complex quality enhancement module. The methods in [13, 14, 15, 16] are also used for comparison with the proposed network in our experiments.
II-B Non-Local Attention Mechanism
The non-local attention mechanism has been widely used in various applications to capture long-range correlations. Chen et al. proposed an end-to-end deep image compression network based on non-local attention modules. By generating attention masks, they weighted features to achieve bit-rate allocation. Hu et al. proposed a learning-based video compression network performing major operations in the feature space. With non-local attention blocks embedded in their proposed multi-frame feature fusion module, the coding efficiency was greatly improved. Li et al. proposed a pose-guided non-local attention-based GAN to achieve human pose transfer. Tan et al. proposed a real-time Siamese tracking network. By exploring long-range dependencies between the target branch and the search branch, they captured important features in the target branch which were regarded as reliable guidance for the search branch. Wen et al. proposed a medical image classification network using the non-local attention mechanism to capture global information, allowing the network to better understand the visual scene and identify tiny lesions. Li et al. proposed a network for fashion landmark detection. With the help of a spatial-aware non-local attention mechanism, they utilized global spatial and semantic information to improve the detection performance. Blanch et al. proposed an exemplar-based colorization network. By capturing global correlations between the target image and the reference image with attention modules at different resolutions, they achieved advanced style transfer.
II-C QP-Conditional Adaptation Mechanism
Compressed videos encoded at different QPs have different characteristics, such as different degrees of artifacts. It is hard to achieve satisfactory performance with a network trained at one QP but employed at another. A seemingly straightforward way is to train multiple models to adapt to the various QPs. However, this inevitably requires large memory to store all model parameters at the decoder end. To save memory without performance loss, QP-conditional adaptation mechanisms have been applied to deep networks addressing different problems. Liu et al. proposed a QP-adaptive method and applied it to CNN filters in video coding, improving coding efficiency with only 25% of the parameters (the reduction in research papers is typically to 25% because only four QPs are considered; with more QPs, as may be needed in practice, the number of models would be even larger). Huang et al. proposed a QP-variable CNN-based in-loop filter for intra coding of Versatile Video Coding (VVC). A QP attention module was designed and embedded into the residual block. With fewer model parameters, their proposed model achieved even better performance than that of QP-separate models. Song et al. also proposed a CNN-based in-loop filter for intra coding with QP-conditional adaptation. A normalized QP map was fed into the network as one of the inputs to make it adaptive to various QPs. Zhang et al. proposed a residual highway CNN for in-loop filtering in HEVC. By dividing the entire QP range into multiple QP bands, they trained one model for each band to save model parameters and maintain good performance.
III The Proposed PeQuENet
The structure of the proposed PeQuENet is shown in Fig. 2. The proposed PeQuENet includes a generator (consisting of the pre-trained feature extraction module (fixed during training), the attention module, the progressive decoder module and the QP-conditional adaptation module) and a discriminator. The generator and the discriminator are trained in an adversarial manner. The details of the architecture of the proposed PeQuENet are provided in our Supplemental Material. The code is available at https://github.com/SaipingZhang/PeQuENet.
To capture the temporal information in consecutive frames, we take the preceding frame $X_{t-1}$ and the succeeding frame $X_{t+1}$ as temporal reference frames to help enhance the perceptual quality of the target frame $X_t$. Since we improve each compressed frame separately, a sliding-window strategy is employed to process the entire video. The proposed model can be expressed as

$$\hat{X}_t = G(X_{t-1}, X_t, X_{t+1}, QP; \theta),$$

where $\hat{X}_t$ is the enhanced target frame (i.e., the output of the proposed PeQuENet), $G$ represents the proposed perceptual quality enhancement network, and $\theta$ is the set of learnable model parameters. $X_{t-1}$, $X_t$ and $X_{t+1}$ are three consecutive compressed frames (i.e., the inputs of the proposed PeQuENet). It should be noted that $X_{t-1} = X_t$ when $X_t$ is the first frame, and that $X_{t+1} = X_t$ when $X_t$ is the last frame in the video.
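The sliding-window strategy can be sketched as follows. This is an illustrative placeholder, not the paper's code: `enhance` stands in for the generator (here it simply returns the target frame), and we assume, as is common, that the missing neighbour at the sequence boundaries is replaced by the target frame itself.

```python
def enhance(prev_frame, target, next_frame, qp):
    # Placeholder for the generator G(X_{t-1}, X_t, X_{t+1}, QP; theta);
    # a real implementation would run the network here.
    return target

def enhance_video(frames, qp):
    """Enhance every frame using its two neighbours as temporal references.
    At the sequence boundaries the target frame itself substitutes for the
    missing neighbour (an assumed convention)."""
    enhanced = []
    n = len(frames)
    for t in range(n):
        prev_frame = frames[t - 1] if t > 0 else frames[t]
        next_frame = frames[t + 1] if t < n - 1 else frames[t]
        enhanced.append(enhance(prev_frame, frames[t], next_frame, qp))
    return enhanced
```

With the identity placeholder, the loop reproduces the input sequence; the point is only to show how the three-frame window slides over the video.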
III-A1 Pre-Trained Feature Extraction Module
Considering the strong feature extraction ability of the VGG-19 network pre-trained on ImageNet, we employ it as the pre-trained feature extraction module in the proposed PeQuENet. The three consecutive compressed frames $X_{t-1}$, $X_t$ and $X_{t+1}$ are separately fed into the pre-trained VGG-19 model to obtain features from multiple layers. Specifically, features of the target frame are extracted from six layers of the pre-trained VGG-19 model, i.e., the first convolution layer before the $i$-th max-pooling, for $i = 1, \dots, 5$, and the fourth convolution layer before the fifth max-pooling. Features of each adjacent frame are extracted from three of these layers. As shown in Fig. 2, six feature pairs, one for each combination of reference frame and selected layer, are fed into six attention blocks to capture temporal correlations at different resolutions in the feature space. Furthermore, the extracted features of the target frame are transmitted to the proposed progressive decoder module to prevent, to some extent, the loss of information caused by downsampling.
III-A2 Attention Module
As shown in Fig. 2, there are six attention blocks in the proposed attention module. Three of them receive feature pairs from the target frame $X_t$ and the preceding frame $X_{t-1}$ as inputs, and the other three receive feature pairs from $X_t$ and the succeeding frame $X_{t+1}$ as inputs. Taking one attention block as an example, we show its structure in Fig. 3. Specifically, the target-frame features and the reference-frame features at a given level are fed into convolution layers. Then they are reshaped to change their dimensions. After a multiplication and a softmax operation, a correlation map is obtained to indicate the temporal correlation between the target features and the reference features. Through multiplying the correlation map by the reshaped reference features, the temporal information in the reference frame that is highly correlated with the target frame is amplified, while other information is suppressed. Finally, a reshaping operation restores the dimensions of the output, which prepares it for being added to the corresponding features of the target frame in the proposed progressive decoder module.
Note that we integrate the simplified non-local attention mechanism for conditional attention (i.e., fusing multi-dimensional features from two input sources, target and reference). This differs from other works which use the same mechanism as a self-attention block, substituting convolution operations with self-attention to capture long-range dependencies and hence improve the performance of the original CNN for a particular task. In contrast, our attention block in the proposed PeQuENet operates over pairs of consecutive frames and is used to compute analogies between target and reference frames, with the aim of enhancing the target frame with the most relevant (or similar) parts of the reference frame. Besides, our attention block is simpler, since features of just two frames (rather than a volume) are input as 2D spatial matrices (3D if considering channels) and used differently as key, value and query, with the adjacent frame used as the reference; moreover, our attention block directly combines reference samples without projecting them, thus saving computational cost.
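A minimal NumPy sketch of such cross-frame attention follows. It is our illustration, not the paper's implementation: we assume matrix (1x1-convolution-like) projections for query and key, and, as described above, use the raw reference features directly as values without a projection. All names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(feat_target, feat_ref, w_q, w_k):
    """Simplified non-local attention between target and reference features.

    feat_target, feat_ref: (C, H, W) feature maps at one level.
    w_q, w_k: (C', C) projection matrices (stand-ins for 1x1 convolutions).
    The reference features themselves act as values (no value projection)."""
    C, H, W = feat_target.shape
    q = w_q @ feat_target.reshape(C, H * W)   # (C', HW) queries from the target
    k = w_k @ feat_ref.reshape(C, H * W)      # (C', HW) keys from the reference
    corr = softmax(q.T @ k, axis=-1)          # (HW, HW) correlation map
    v = feat_ref.reshape(C, H * W)            # values: raw reference samples
    out = v @ corr.T                          # aggregate correlated reference info
    return out.reshape(C, H, W)               # restore spatial dimensions
```

Because each row of the correlation map is a softmax over all reference positions, every target position can draw on the entire reference frame, which is exactly the global (non-local) behaviour discussed above.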
III-A3 Progressive Decoder Module
The proposed progressive decoder module takes the extracted features of the target frame and the outputs of the attention blocks as inputs. As shown in Fig. 2, this module has two branches. The first branch is designed for the target frame and the preceding frame, and the second branch is designed for the target frame and the succeeding frame. Taking the first branch as an example, the output of the lowest-resolution attention block is added to the corresponding target-frame features. In this way, the temporal information in the preceding frame that is highly correlated with the target frame is provided to the target frame in the feature space. After that, the sum is fed into a convolution layer. By bilinear upsampling, the width and the height of the feature maps are both doubled to match the resolution of the features at the next level. After concatenating the upsampled feature maps with the target-frame features of that level along the channel dimension, they are fed into another convolution layer. As mentioned before, these skip-connected target-frame features can compensate, to some extent, for the information lost due to downsampling in the pre-trained feature extraction module. It should be noted that we employ bilinear upsampling instead of strided deconvolution to increase the resolution of the feature maps. This is because strided deconvolution has uneven overlap in the horizontal and vertical directions when the kernel size is not divisible by the stride, which sometimes results in severe checkerboard artifacts. Similar operations are repeated until we obtain feature maps at the same resolution as the input frame. After the two sets of feature maps from the first branch and the second branch are concatenated, they are fed into the QP-conditional adaptation module.
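The uneven-overlap argument can be illustrated with a small 1-D counting sketch (a toy illustration of ours, not part of the network): it counts how many kernel taps contribute to each output position of a transposed convolution.

```python
def deconv_overlap(kernel, stride, out_len=12):
    """Count kernel-tap contributions per output position of a 1-D
    strided deconvolution (transposed convolution)."""
    counts = [0] * out_len
    pos = 0
    while pos + kernel <= out_len:
        for i in range(kernel):
            counts[pos + i] += 1   # this placement touches positions pos..pos+kernel-1
        pos += stride
    return counts

# kernel 4, stride 2 (divisible): even overlap in the interior -> no checkerboard.
# kernel 3, stride 2 (not divisible): alternating 2,1,2,1 overlap -> checkerboard.
```

With kernel 4 and stride 2 every interior position is covered by exactly two placements, whereas with kernel 3 and stride 2 the coverage alternates between one and two, which is the checkerboard pattern that bilinear upsampling avoids.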
III-A4 QP-Conditional Adaptation Module
The structure of the QP-conditional adaptation module is shown in Fig. 4. The input QP value is encoded into a vector. Softplus is used as the activation function, since it ensures positive outputs, which are regarded as weights of the features along the channel dimension. After performing element-wise multiplication between the features and the encoded vector, the weighted features are obtained and fed into a convolution layer. It should be noted that the encoded QP is repeatedly embedded at three convolution layers in the QP-conditional adaptation module to guarantee that the proposed PeQuENet is successfully modulated by the QP.
Considering that the perceptual quality enhancement of compressed videos mostly relies on the recovery of high frequencies, it is sufficient for the discriminator to focus on local image patches to decide whether the generated high frequencies are correct. Therefore, the PatchGAN is employed to assist in training the network to produce perceptually enhanced frames. Since the PatchGAN consists of convolution layers only, as shown in Fig. 2, its output is an array where each element represents the realness of the corresponding patch. After averaging the realness of all patches, the realness of the image is obtained. The advantage of the PatchGAN is that it processes each patch identically and independently, which means the PatchGAN has fewer parameters and lower computational complexity.
Such an approach is mathematically equivalent to cropping an image into multiple overlapping patches of appropriate sizes (i.e., the receptive fields of the PatchGAN) and employing a regular discriminator to process each of them independently.
III-C Loss Functions
The loss of the generator combines the adversarial loss, the VGG loss and the feature matching loss:

$$L_G = L_{adv} + \lambda L_{VGG} + \mu L_{FM},$$

where $\lambda$ and $\mu$ are the corresponding weights.
The adversarial loss follows that proposed in LSGAN:

$$L_{adv} = \mathbb{E}\left[\left(D(G(x, QP)) - 1\right)^2\right],$$

where $x$ is the sequence of three consecutive input frames, $QP$ is the corresponding QP value, $G(x, QP)$ is the output of the generator, and $D(\cdot)$ is the output of the discriminator.
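For illustration, the LSGAN-style least-squares objectives for the generator and the discriminator can be sketched over arrays of discriminator scores. This is the generic LSGAN formulation, not the exact code of the paper.

```python
import numpy as np

def lsgan_g_loss(d_fake):
    """LSGAN generator loss: push D's scores on generated frames towards 1."""
    return float(np.mean((d_fake - 1.0) ** 2))

def lsgan_d_loss(d_real, d_fake):
    """LSGAN discriminator loss: real scores towards 1, fake scores towards 0."""
    return float(0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)))
```

Replacing the log-likelihood terms of the original GAN with these squared errors is what makes LSGAN training comparatively stable.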
The VGG loss is computed as

$$L_{VGG} = \sum_{l=1}^{L} \frac{1}{N_l} \left\| \phi_l(Y_t) - \phi_l(\hat{X}_t) \right\|_1,$$

where $\phi_l$ represents the output tensor of the $l$-th layer of a pre-trained VGG-19 model, $N_l$ and $L$ are the number of spatial elements of that tensor and the number of layers, respectively, and $Y_t$ is the corresponding raw frame, i.e., the original frame before compression.
Similarly, the feature matching loss is computed as

$$L_{FM} = \sum_{l=1}^{L'} \frac{1}{N'_l} \left\| D_l(Y_t) - D_l(\hat{X}_t) \right\|_1,$$

where $D_l$ represents the output tensor of the $l$-th layer selected from the discriminator, and $N'_l$ and $L'$ are the number of spatial elements and the number of layers, respectively.
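Both the VGG loss and the feature matching loss reduce to averaging distances between corresponding feature tensors taken from some network. A minimal sketch, assuming an L1 distance and per-layer normalization by the number of elements (the exact norm and weighting in the paper may differ):

```python
import numpy as np

def feature_l1_loss(feats_a, feats_b):
    """Mean per-element L1 distance between corresponding feature tensors.

    feats_a, feats_b: lists of equally shaped arrays, one per selected layer
    (e.g., VGG-19 layers for the VGG loss, discriminator layers for the
    feature matching loss)."""
    total = 0.0
    for fa, fb in zip(feats_a, feats_b):
        total += np.mean(np.abs(fa - fb))  # normalized by the layer's size
    return total / len(feats_a)
```

The same helper applies to either loss; only the network from which the feature lists are extracted changes.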
The loss of the discriminator of the proposed PeQuENet also follows that proposed in LSGAN:

$$L_D = \mathbb{E}\left[\left(D(Y_t) - 1\right)^2\right] + \mathbb{E}\left[D(G(x, QP))^2\right].$$
The generator and the discriminator are trained in an adversarial manner, which means that the generator loss and the discriminator loss are minimized alternately.
IV Experimental Results
108 sequences representing various content types and resolutions, introduced in , are employed for training. All sequences are compressed by the H.265/HEVC reference software HM 16.5 under the Low Delay P (LDP) configuration with QPs set to 22, 27, 32 and 37. It should be noted that our training dataset includes compressed sequences at all four QPs to give the proposed PeQuENet the ability of QP-conditional adaptation. Raw sequences and compressed sequences are cropped into non-overlapping clips as training samples. Before being fed into the network, all training samples are shuffled.
For evaluating the performance, tests are performed on the Joint Collaborative Team on Video Coding (JCT-VC) standard test sequences. All test sequences are also compressed by HM 16.5 under the LDP configuration at the four QPs.
IV-B Implementation Details
The Adam optimizer is adopted to train the model, with the learning rate held constant throughout the training process. The batch size is 32. For fair comparison with previous work, we only train our model on the luminance component, but the proposed algorithm can also be extended to the chrominance components. Perceptual quality metrics, i.e., LPIPS and DISTS, are used to evaluate the performance of the proposed PeQuENet; smaller LPIPS/DISTS values correspond to better performance. It should be noted that we train only one model to adapt to videos compressed at multiple QPs.
IV-C Evaluation of Video Quality Enhancement
We compare the proposed PeQuENet with the state-of-the-art compressed video quality enhancement networks, i.e., MFQE 2.0, STDF, MW-GAN, VPE-GAN, MPRNet and DCNGAN. Among them, MFQE 2.0 and STDF were proposed to enhance the objective quality (i.e., PSNR) of compressed videos, while MW-GAN, VPE-GAN, MPRNet and DCNGAN were trained with the help of GANs to enhance the perceptual quality of compressed videos. For fair comparison, all networks are trained on the same dataset, including the MW-GAN. (Since our retrained MW-GAN did not reach the performance presented in the original paper, and pre-trained models for the four QPs have not been published, the performance of the MW-GAN shown in TABLE I is taken directly from the original paper. The performance reported there is measured as the change in LPIPS, i.e., the difference between the LPIPS of the output and the LPIPS of the input. For better illustration, we further calculate the LPIPS of the output of the MW-GAN by adding this difference to the LPIPS of the input, which is the same as that of the proposed PeQuENet, and show the results in TABLE I.) The only difference is that the DCNGAN and the proposed PeQuENet are trained only once with samples compressed at the four QPs, while the other networks use samples compressed at a single QP to train individual models. It should be noted that, when comparing with the STDF, we choose their model taking seven consecutive frames as inputs, since it achieved better performance than their model fed with three consecutive frames by leveraging more temporal information. The overall quantitative performance measured by LPIPS and DISTS is shown in TABLE I, where the performance of videos compressed by HM 16.5 (i.e., the input of the proposed network) is shown in the first two columns for reference.
Specifically, MFQE 2.0, which was designed for objective quality enhancement, fails to improve the perceptual quality of compressed videos measured with both LPIPS and DISTS. The STDF achieves perceptual quality enhancement although it was also originally proposed for objective quality enhancement. Note that this discrepancy between the results for MFQE 2.0 and STDF indicates the inconsistency between subjective quality and objective quality metrics. On the other hand, although the VPE-GAN can enhance the perceptual quality at higher QPs (i.e., QP 32 and QP 37) to some extent, it fails at lower QPs (i.e., QP 22). As for the MW-GAN, MPRNet and DCNGAN, they can enhance the perceptual quality at all tested QPs with the help of GANs, but their performance is unsatisfactory, with only a slight decrease in LPIPS and DISTS. With QP-conditional adaptation, a single trained model of the proposed PeQuENet is tested on the four QPs, achieving the best performance while saving 75% of the model parameters at the same time.
Taking four frames in test sequences compressed at QP 37 as examples, we show a qualitative comparison of video enhancement algorithms in Fig. 5. Overall, MFQE 2.0 and STDF produce overly smooth results. While smoothness can be desirable in the background, as shown in Fig. 5 (b), it penalizes the perceptual quality of areas with complex texture, as shown in Fig. 5 (a), (c) and (d). The VPE-GAN tends to generate severe artifacts in areas with simple texture to combat over-smoothness, as shown in Fig. 5 (b), which deteriorates the visual experience. The MPRNet and the DCNGAN can enhance the perceptual quality to some extent, but the problem of over-smoothness is not completely solved, as shown in Fig. 5 (c) and (d). Compared with the other video enhancement algorithms, the proposed PeQuENet not only generates more realistic details in areas with complex texture to eliminate over-smoothness, but also avoids artifacts in areas with simple texture, which illustrates its superior performance.
To further evaluate the subjective quality of the proposed method, a mean opinion score (MOS) test is also conducted at QP 37. Specifically, 15 subjects rated, with scores from 1 to 5, the 18 JCT-VC standard test sequences of H.265/HEVC enhanced by the VPE-GAN, MPRNet, DCNGAN and the proposed PeQuENet. Higher scores correspond to better subjective quality. The average MOS results of all methods are compared in Fig. 6. As can be seen, the proposed PeQuENet achieves the highest average MOS.
GANs help improve the QoE by training the network to generate high-frequency information. Hence, when GANs are employed in video perceptual quality enhancement networks, the temporal consistency should be examined, in case the details generated in each frame are inconsistent. The temporal consistency on the sequence BQSquare compressed at QP 37 is shown in Fig. 7 as an example, comparing three GAN-based perceptual quality enhancement algorithms, i.e., the VPE-GAN, MPRNet and DCNGAN, with the proposed PeQuENet. There is significant noise for the VPE-GAN and the DCNGAN, which indicates flickering artifacts in the enhanced videos. The MPRNet performs better but still exhibits temporal inconsistency. By fully employing the global temporal information in consecutive frames, the proposed PeQuENet shows the smoothest temporal transitions in the output videos.
Models | Average LPIPS | Number of Parameters
with QP-conditional adaptation | 0.059 | 26329
without QP-conditional adaptation | 0.059 | 105316
IV-D QP-Conditional Adaptation Evaluation
Since the proposed PeQuENet achieves QP-conditional adaptation, we only train one model to enhance the perceptual quality of compressed videos at the four different QPs. The performance of this model, referred to as “Trained_4QPs”, is shown in TABLE I (in the last two columns). To validate the effectiveness of the QP-conditional adaptation strategy, we train four models, named “Trained_QP22”, “Trained_QP27”, “Trained_QP32” and “Trained_QP37”, for the four QPs. It should be noted that the training dataset for each of these models contains only samples compressed at the corresponding single QP. The performance of the four models tested on the four QPs is shown in TABLE II. To highlight the advantage of the QP-conditional adaptation strategy used in the proposed PeQuENet, the average LPIPS performance and the number of parameters of the model with and without QP-conditional adaptation are compared in TABLE III. Besides, the average LPIPS comparison between “Trained_4QPs” and the other four models is shown in Fig. 8.
Specifically, a model trained at a single QP performs best when tested on that QP, since the model can learn the characteristics of videos compressed at the corresponding QP during training. For instance, the model “Trained_QP37” (i.e., the green line in Fig. 8) shows the best performance compared with the other models (especially “Trained_QP22”) when tested on QP 37. This is because “Trained_QP37” has learned the specific characteristics of videos compressed at QP 37 during training, while the other models did not have access to such characteristics. However, “Trained_QP37” has the worst performance when tested on QP 22. This is not only because it has not learned the characteristics of videos compressed at QP 22, but also because the differences in characteristics between videos compressed at QPs 22 and 37 are significant. On the other hand, with a training dataset containing samples compressed at the four QPs and encoded QP information fed into the model, “Trained_4QPs” achieves results similar to those of the models trained for specific QPs, which illustrates the effectiveness of the QP-conditional adaptation strategy.
Sequences | Deformable Convolution | Optical Flow | Attention
IV-E Visualization of Global Temporal Correlations
The non-local attention module aims to leverage global temporal correlations in consecutive frames to enhance the perceptual quality of compressed videos in the proposed PeQuENet. In this section, taking the leftmost of the six attention blocks drawn in Fig. 2 as an example, we visualize its captured global temporal correlations, as shown in Fig. 9. It should be noted that, due to down-sampling operations, the resolution of the input of this attention block is a quarter of that of an input frame. To match this resolution, the frames shown in Fig. 9 are also down-sampled in the same way.
Specifically, the frames shown in the first row are the third frames of the corresponding video sequences compressed at QP 37, and the other frames are the corresponding preceding frames (i.e., the second frames). Four cases are selected in each frame to illustrate the captured global temporal correlations. For example, when the pixel is located on the vertical line on the wall (i.e., the red pixel emphasized by a red circle in Fig. 9 (a)), its correlated pixels in the preceding frame are all located on vertical lines. When the pixel is located on the horizontal line on the wall (i.e., the blue pixel emphasized by a blue circle in Fig. 9 (a)), its correlated pixels in the preceding frame are all located on the horizontal line. Moreover, these correlated pixels are distributed globally and are not limited by the range of motion, which provides more temporal information to achieve advanced and robust performance in the perceptual quality enhancement of compressed videos.
IV-F Global Alignment vs. Local Alignment
In this section, we compare global alignment, i.e., attention maps, with two common local alignment operations, i.e., deformable convolutions and optical flows, to emphasize the importance of leveraging global temporal information in the perceptual quality enhancement of compressed videos. Note that deformable convolution is applied in the STDF and DCNGAN, and optical flow is applied in the MFQE 2.0 and MW-GAN; these four quality enhancement networks have been compared with our proposed PeQuENet in TABLE I and Fig. 5.
Keeping the framework of the network unchanged, we replace the attention blocks with deformable convolution blocks or optical flow blocks. For example, the leftmost attention block shown in Fig. 2 is replaced by the deformable convolution block shown in Fig. 10 or the optical flow block shown in Fig. 11. The aligned features are then concatenated with the output of the corresponding layer in the progressive decoder module to provide temporal information for the target frame. The full structures of the networks incorporating deformable convolutions and optical flows are shown in Fig. 16 and Fig. 17 in our Supplemental Material. Note that the structure of the deformable convolution block follows that of the PCD module in , and that optical flow estimation in the optical flow block is performed by the pre-trained SPyNet model . The average performance on all JCT-VC standard test sequences compressed at QP 37 is compared in TABLE IV.
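The optical flow block performs motion-compensated, i.e., local, alignment. A minimal nearest-neighbour warp sketch (real systems such as SPyNet produce learned sub-pixel flow and use bilinear sampling; the flow convention below is an assumption) illustrates why the reachable temporal information is bounded by the flow magnitude:

```python
import numpy as np

def warp_with_flow(ref, flow):
    """Warp a reference frame towards the target with a per-pixel flow field.

    ref: (H, W) frame; flow: (H, W, 2) with (dx, dy) pointing from each target
    pixel to its source location in the reference. Only pixels within the
    flow's range can contribute, which is what makes this alignment local.
    """
    h, w = ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    return ref[src_y, src_x]

# A ramp image shifted by a constant 1-pixel horizontal flow:
ref = np.tile(np.arange(8.0), (8, 1))              # pixel value == x coordinate
flow = np.zeros((8, 8, 2)); flow[..., 0] = 1.0     # sample from one pixel right
warped = warp_with_flow(ref, flow)
```

Deformable convolution generalizes this idea by learning several sampling offsets per output position, but those offsets are likewise limited in range, so both operations remain local.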
Essentially, deformable convolutions and optical flows are local temporal alignment operations, since the range of the temporal information they capture is determined by the range of motion. This locality penalizes the robustness of perceptual quality enhancement and results in artifacts, e.g., blocking and ringing, in the output videos, as shown in Fig. 12. In contrast, the attention mechanism performs non-local temporal alignment and averaging. It has access to global temporal information, which ensures its success in the perceptual quality enhancement task.
We also compare the computational complexity of the three models utilizing the three alignment modules, as shown in TABLE V. The model with the attention module achieves the fastest speed (almost twice as fast as the model with the optical flow module) on sequences in Class D, which illustrates the high efficiency of attention maps for low-resolution sequences. Deformable convolutions, however, have an advantage on high-resolution sequences, where they are faster than attention maps. Averaged over all standard test sequences, the model with the attention module is the fastest. Note that when high-resolution sequences cannot be fed into the model with the attention module at once due to memory limitations, we split each frame into parts, process each part independently, and merge the enhanced parts back into a whole frame. The computational complexity shown in the last column of TABLE V covers this whole process. Fortunately, the splitting and merging operations cause no artifacts.
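The split-and-merge inference used for high-resolution frames can be sketched as simple non-overlapping tiling. The sketch assumes the frame dimensions are divisible by the part size; a real implementation would also handle remainders (or use overlapping parts):

```python
import numpy as np

def split_into_parts(frame, ph, pw):
    """Split an (H, W, C) frame into non-overlapping (ph, pw, C) parts.

    Assumes H % ph == 0 and W % pw == 0. Each part can then be enhanced
    independently when the whole frame does not fit into memory.
    """
    h, w, c = frame.shape
    assert h % ph == 0 and w % pw == 0
    return [frame[y:y + ph, x:x + pw]
            for y in range(0, h, ph) for x in range(0, w, pw)]

def merge_parts(parts, h, w):
    """Reassemble enhanced parts back into a full (H, W, C) frame."""
    ph, pw, c = parts[0].shape
    frame = np.empty((h, w, c), dtype=parts[0].dtype)
    i = 0
    for y in range(0, h, ph):           # same row-major order as the split
        for x in range(0, w, pw):
            frame[y:y + ph, x:x + pw] = parts[i]
            i += 1
    return frame

frame = np.random.default_rng(2).random((64, 96, 3))
parts = split_into_parts(frame, 32, 32)
restored = merge_parts(parts, 64, 96)
```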
For better illustration, the relation between performance (LPIPS) and runtime (seconds) of the models with the three alignment modules is shown in Fig. 13. The model with the attention module achieves advanced performance while remaining reasonably efficient.
IV-G Performance-Speed Tradeoff
IV-G1 Downsampled Attention Maps
Although the attention mechanism helps the proposed PeQuENet achieve advanced performance, the full-resolution attention map (i.e., the complete attention map), as shown in Fig. 3, requires a great number of multiplication operations and large memory to store, especially for high-resolution sequences. To make the proposed PeQuENet more efficient, the inputs of the attention blocks are downsampled by factors of 2, 4 and 6 to obtain attention maps at reduced resolutions. In this way, the proposed PeQuENet runs faster with limited performance loss. For better comparison, we also test the performance and speed when the attention mechanism is removed from the proposed PeQuENet (i.e., w/o attention maps). The average performance-speed trade-off on all JCT-VC standard test sequences at QP 37 is shown in Fig. 14.
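The saving is easy to quantify: the attention map has one entry per target-reference position pair, so downsampling both inputs by a factor s in each spatial dimension shrinks the map by roughly s^4. A back-of-the-envelope helper (counting map entries only, ignoring projection layers; the example resolution is illustrative):

```python
def attention_map_entries(h, w, factor=1):
    """Number of entries in the (HW/f^2) x (HW/f^2) attention map obtained
    after downsampling both attention inputs by `factor` per dimension."""
    hw = (h // factor) * (w // factor)
    return hw * hw

full = attention_map_entries(120, 160)   # e.g. an already quarter-resolution input
for f in (2, 4, 6):
    reduced = attention_map_entries(120, 160, f)
    print(f"factor {f}: {full / reduced:.0f}x fewer attention-map entries")
```

Downsampling by 2 already cuts the map by a factor of 16, which explains why modest factors give large speedups at limited perceptual cost.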
IV-G2 Ablation Studies of Components
Ablation studies are performed to better understand the contribution and the complexity of each component in the proposed PeQuENet. We start from a baseline and gradually add components to it. The LPIPS performance and speed of each configuration (from A to I) are shown in TABLE VI. Each component brings a performance improvement (LPIPS ranging from 0.176 to 0.106) at additional computational cost (speed ranging from 17.99 fps to 5.73 fps). By choosing different configurations, the proposed PeQuENet can trade off performance against speed, as shown in Fig. 15.
| | MFQE 2.0 | STDF | MW-GAN | VPE-GAN | MPRNet | DCNGAN | Proposed |
| Number of parameters | 255K | 365K | 53839K | 11376K | 333K | 45977K | 26329K |
In theory, the proposed PeQuENet can be extended to take more temporally neighboring frames as input, with additional attention blocks and decoding branches. However, given the considerable performance gain already obtained with two adjacent frames and the high computational cost of adding more neighboring frames, we limit the number of temporal reference frames to two.
IV-H Ablation Studies of Components in Total Loss
The contribution of each component of the total loss to the improvement of perceptual quality is shown in TABLE VII. Optimized only under the guidance of the adversarial loss (i.e., ), the network tends to generate random artifacts, which penalizes the perceptual quality of compressed videos. By incorporating a perceptual loss (i.e., ), network training becomes more stable and the perceptual quality of compressed videos improves significantly. This is because the generator learns to match the intermediate representations extracted by the VGG model from the ground-truth and the generated frames. Similarly, the feature matching loss (i.e., ) helps the generator understand what natural features extracted by the discriminator from ground-truth frames look like, and promotes the generator to produce frames that look real. After adding the , we obtain a further slight performance gain (an LPIPS decrease from 0.108 to 0.106). In the proposed PeQuENet, we use the total loss composed of all three loss components to achieve the best performance. Although the total loss seems complex, it does not increase the complexity of the inference stage.
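Assuming a least-squares adversarial term and L1 feature matching for both the VGG (perceptual) and discriminator feature losses, the three-part generator objective can be sketched as below. The weighting factors are illustrative placeholders, not the paper's trained values:

```python
import numpy as np

def l1_feature_matching(feats_a, feats_b):
    """Mean L1 distance over a list of paired intermediate feature maps."""
    return float(np.mean([np.mean(np.abs(a - b)) for a, b in zip(feats_a, feats_b)]))

def generator_total_loss(d_fake, vgg_real, vgg_fake, disc_real, disc_fake,
                         w_vgg=10.0, w_fm=10.0):
    """Adversarial + perceptual (VGG) + discriminator feature-matching loss.

    d_fake: discriminator scores on generated frames (least-squares GAN term).
    vgg_*: VGG features of ground-truth / generated frames (perceptual loss).
    disc_*: discriminator features of ground-truth / generated frames.
    w_vgg and w_fm are illustrative weights.
    """
    adv = float(np.mean((d_fake - 1.0) ** 2))          # push fakes towards "real"
    perc = l1_feature_matching(vgg_real, vgg_fake)     # perceptual (VGG) loss
    fm = l1_feature_matching(disc_real, disc_fake)     # feature matching loss
    return adv + w_vgg * perc + w_fm * fm

rng = np.random.default_rng(3)
feats = lambda: [rng.standard_normal((4, 8, 8)) for _ in range(3)]
loss = generator_total_loss(rng.standard_normal(16), feats(), feats(), feats(), feats())
```

Only the generator runs at inference, so none of these loss terms add any inference-time cost.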
IV-I Comparison of the Number of Model Parameters
The comparison of the number of model parameters is shown in TABLE VIII. The models of the MFQE 2.0, MPRNet and STDF have fewer parameters due to their simple network architectures, while the models of the proposed PeQuENet, VPE-GAN, DCNGAN and MW-GAN are relatively heavy. However, the proposed PeQuENet significantly enhances the perceptual quality of compressed videos and achieves the best performance among the compared methods. Thus, the proposed PeQuENet is well suited to specific use cases, such as video production or archiving, where obtaining the highest possible quality matters more than the increase in complexity.
IV-J Performance at Other QPs
We trained the proposed PeQuENet at four different QPs, i.e., QPs 22, 27, 32 and 37, and tested its performance at these four QPs, as shown in TABLE I. To further verify the generalization capability of the proposed PeQuENet, we also test it on video sequences compressed at other QPs (i.e., QPs 24 and 35) without fine-tuning. The LPIPS and DISTS performance is shown in TABLE X, where “Compressed” indicates the compressed videos (i.e., the input of the proposed PeQuENet) and “Proposed” indicates the videos enhanced by the proposed method (i.e., its output). As can be seen, the perceptual quality of the compressed videos is improved significantly, which implies the strong generalization ability of the proposed method.
IV-K More Results Under H.264/AVC and H.266/VVC
We also evaluate the performance of the proposed PeQuENet on the standard test sequences of H.264/AVC and H.266/VVC and compare it with the other GAN-based video enhancement methods, i.e., the VPE-GAN, MPRNet and DCNGAN, to further study the effectiveness of our proposed method. Note that we directly use the models trained on HEVC without fine-tuning. Specifically, the standard test sequences are compressed with the relevant test models (i.e., JM and VTM) under the LDP configuration at QPs 27 and 37. The performance is compared in TABLE XI and TABLE XII in our Supplemental Material. As can be seen, the proposed PeQuENet always achieves the best average LPIPS and DISTS on the standard test sequences of both H.264/AVC and H.266/VVC at both QPs.
IV-L Evaluation of Distortion
The proposed PeQuENet aims to enhance the perceptual quality of compressed videos and is trained with only perceptual losses. Thus, its PSNR/SSIM performance degrades because of the perception-distortion tradeoff . We further evaluate the objective quality of videos enhanced by the proposed PeQuENet and by the other compressed video enhancement methods, and compare them in TABLE IX. As expected, the MFQE 2.0 and STDF improve PSNR and SSIM to different degrees, while the other methods penalize both. Compared with the MW-GAN, VPE-GAN, MPRNet and DCNGAN, the proposed PeQuENet achieves the best PSNR/SSIM performance, which indicates that it achieves the best perception while having the least impact on distortion. The other methods, especially the VPE-GAN, enhance perception while significantly penalizing distortion.
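For reference, the distortion metric reported in TABLE IX is standard PSNR, computed from the mean squared error against the uncompressed ground truth (peak value 255 for 8-bit video):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio between two frames (higher = less distortion)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")                      # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4), dtype=np.uint8)
noisy = np.ones((4, 4), dtype=np.uint8)          # uniform error of 1 level
value = psnr(ref, noisy)                         # 10*log10(255^2) ≈ 48.13 dB
```

Because PSNR is a pure MSE measure, a perceptually sharper but pixel-wise different reconstruction can score lower than a blurry one, which is exactly the perception-distortion tradeoff discussed above.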
V Conclusion
In this paper, PeQuENet is proposed to enhance the perceptual quality of compressed videos based on non-local attention and QP-conditional adaptation modules. By fully leveraging global temporal information between consecutive frames, the proposed PeQuENet consistently provides enhanced frames with higher perceptual quality. With QP-conditional adaptation, we are the first to successfully train a single model to enhance the perceptual quality of videos compressed at various QPs, saving multiple sets of model parameters without performance loss. Experimental results have demonstrated that the proposed PeQuENet outperforms state-of-the-art compressed video quality enhancement networks both quantitatively and qualitatively.
-  T. Wiegand, G. J. Sullivan, G. Bjontegaard and A. Luthra, “Overview of the H.264/AVC Video Coding Standard,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, July 2003, doi: 10.1109/TCSVT.2003.815165.
-  G. J. Sullivan, J. Ohm, W. Han and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec. 2012, doi: 10.1109/TCSVT.2012.2221191.
-  C. G. Bampis, Z. Li, A. K. Moorthy, I. Katsavounidis, A. Aaron and A. C. Bovik, “Study of Temporal Effects on Subjective Video Quality of Experience,” in IEEE Transactions on Image Processing, vol. 26, no. 11, pp. 5217-5231, Nov. 2017, doi: 10.1109/TIP.2017.2729891.
-  T. Wang, M. Chen and H. Chao, “A Novel Deep Learning-Based Method of Improving Coding Efficiency from the Decoder-End for HEVC,” 2017 Data Compression Conference (DCC), 2017, pp. 410-419, doi: 10.1109/DCC.2017.42.
-  Y. Dai, D. Liu, and F. Wu, “A Convolutional Neural Network Approach for Post-Processing in HEVC Intra Coding,” in International Conference on Multimedia Modeling, pp. 28-39, Springer, 2017.
-  R. Yang, M. Xu, T. Liu, Z. Wang and Z. Guan, “Enhancing Quality for HEVC Compressed Videos,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 7, pp. 2039-2054, July 2019, doi: 10.1109/TCSVT.2018.2867568.
-  R. Yang, M. Xu, Z. Wang and T. Li, “Multi-Frame Quality Enhancement for Compressed Video,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6664-6673, doi: 10.1109/CVPR.2018.00697.
-  G. Lu, X. Zhang, W. Ouyang, D. Xu, L. Chen and Z. Gao, “Deep Non-Local Kalman Network for Video Compression Artifact Reduction,” in IEEE Transactions on Image Processing, vol. 29, pp. 1725-1737, 2020, doi: 10.1109/TIP.2019.2943214.
-  Z. Guan, Q. Xing, M. Xu, R. Yang, T. Liu and Z. Wang, “MFQE 2.0: A New Approach for Multi-Frame Quality Enhancement on Compressed Video,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 949-963, 1 March 2021, doi: 10.1109/TPAMI.2019.2944806.
-  J. Deng, L. Wang, S. Pu and C. Zhuo, “Spatio-Temporal Deformable Convolution for Compressed Video Quality Enhancement,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 10696-10703, 2020.
-  Y. Blau and T. Michaeli, “The Perception-Distortion Tradeoff,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6228-6237, doi: 10.1109/CVPR.2018.00652.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in Proceedings of Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
-  J. Wang, X. Deng, M. Xu, C. Chen, and Y. Song, “Multi-Level Wavelet-Based Generative Adversarial Network for Perceptual Quality Enhancement of Compressed Video,” in European Conference on Computer Vision. Springer, 2020, pp. 405–421.
-  T. Wang, J. He, S. Xiong, P. Karn and X. He, “Visual Perception Enhancement for HEVC Compressed Video Using a Generative Adversarial Network,” 2020 International Conference on UK-China Emerging Technologies (UCET), 2020, pp. 1-4, doi: 10.1109/UCET51115.2020.9205459.
-  Z. Jin, P. An, C. Yang and L. Shen, “Post-Processing for Intra Coding Through Perceptual Adversarial Learning and Progressive Refinement,” Neurocomputing, vol. 394, pp. 158-167, 2020.
-  S. Zhang, L. Herranz, M. Mrak, M. G. Blanch, S. Wan and F. Yang, “DCNGAN: A Deformable Convolution-Based GAN with QP Adaptation for Perceptual Quality Enhancement of Compressed Video,” ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 2035-2039, doi: 10.1109/ICASSP43922.2022.9746702.
-  J. Dai et al., “Deformable Convolutional Networks,” 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 764-773, doi: 10.1109/ICCV.2017.89.
-  X. Zhu, H. Hu, S. Lin and J. Dai, “Deformable ConvNets V2: More Deformable, Better Results,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9300-9308, doi: 10.1109/CVPR.2019.00953.
-  X. Wang, R. Girshick, A. Gupta and K. He, “Non-local Neural Networks,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794-7803, doi: 10.1109/CVPR.2018.00813.
-  T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao and Y. Wang, “End-to-End Learnt Image Compression via Non-Local Attention Optimization and Improved Context Modeling,” in IEEE Transactions on Image Processing, vol. 30, pp. 3179-3191, 2021, doi: 10.1109/TIP.2021.3058615.
-  Z. Hu, G. Lu and D. Xu, “FVC: A New Framework towards Deep Video Compression in Feature Space,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1502-1511.
-  K. Li, J. Zhang, Y. Liu, Y. -K. Lai and Q. Dai, “PoNA: Pose-Guided Non-Local Attention for Human Pose Transfer,” in IEEE Transactions on Image Processing, vol. 29, pp. 9584-9599, 2020, doi: 10.1109/TIP.2020.3029455.
-  H. Tan, X. Zhang, Z. Zhang, L. Lan, W. Zhang and Z. Luo, “Nocal-Siam: Refining Visual Features and Response With Advanced Non-Local Blocks for Real-Time Siamese Tracking,” in IEEE Transactions on Image Processing, vol. 30, pp. 2656-2668, 2021, doi: 10.1109/TIP.2021.3049970.
-  Y. Wen et al., “Non-Local Attention Learning for Medical Image Classification,” 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1-6, doi: 10.1109/ICME51207.2021.9428267.
-  Y. Li, S. Tang, Y. Ye and J. Ma, “Spatial-Aware Non-Local Attention for Fashion Landmark Detection,” 2019 IEEE International Conference on Multimedia and Expo (ICME), 2019, pp. 820-825, doi: 10.1109/ICME.2019.00146.
-  M. G. Blanch, I. Khalifeh, A. Smeaton, N. O’Connor and M. Mrak, “Attention-based Stylisation for Exemplar Image Colourisation,” arXiv preprint, arXiv:2105.01705, May 2021.
-  C. Liu, H. Sun, J. Katto, X. Zeng and Y. Fan, “A QP-Adaptive Mechanism for CNN-Based Filter in Video Coding,” arXiv preprint, arXiv:2010.13059, Oct. 2020.
-  Z. Huang, J. Sun, X. Guo and M. Shang, “One-for-all: An Efficient Variable Convolution Neural Network for In-loop Filter of VVC,” in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2021.3089498.
-  B. Bross et al., “Overview of the Versatile Video Coding (VVC) Standard and its Applications,” in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2021.3101953.
-  X. Song et al., “A Practical Convolutional Neural Network as Loop Filter for Intra Frame,” 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 1133-1137, doi: 10.1109/ICIP.2018.8451589.
-  Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong and Q. Dai, “Residual Highway Convolutional Neural Networks for in-loop Filtering in HEVC,” in IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3827-3841, Aug. 2018, doi: 10.1109/TIP.2018.2815841.
-  K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint, arXiv:1409.1556, 2014.
-  J. Deng, W. Dong, R. Socher, L. Li, Kai Li and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255, doi: 10.1109/CVPR.2009.5206848.
-  A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill 1.10 (2016): e3.
-  P. Isola, J. Zhu, T. Zhou and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5967-5976, doi: 10.1109/CVPR.2017.632.
-  T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz and B. Catanzaro, “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8798-8807, doi: 10.1109/CVPR.2018.00917.
-  T. Salimans, I. Goodfellow, W. Zaremba and V. Cheung, “Improved Techniques for Training GANs,” in Proceedings of Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
-  X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang and S. P. Smolley, “Least squares generative adversarial networks,” 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2813-2821, doi: 10.1109/ICCV.2017.304.
-  F. Bossen, Common test conditions and software reference configurations, document JCTVC-L1100, ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Joint Collaborative Team on Video Coding (JCT-VC), Jan. 2013.
-  D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proceedings of the 3rd International Conference for Learning Representations, 2015.
-  R. Zhang, P. Isola, A. A. Efros, E. Shechtman and O. Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 586-595, doi: 10.1109/CVPR.2018.00068.
-  K. Ding, K. Ma, S. Wang and E. P. Simoncelli, “Image Quality Assessment: Unifying Structure and Texture Similarity,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2020.3045810.
-  X. Wang, K. C. K. Chan, K. Yu, C. Dong and C. C. Loy, “EDVR: Video Restoration With Enhanced Deformable Convolutional Networks,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1954-1963, doi: 10.1109/CVPRW.2019.00247.
-  A. Ranjan and M. J. Black, “Optical Flow Estimation Using a Spatial Pyramid Network,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2720-2729, doi: 10.1109/CVPR.2017.291.
-  Qunliang Xing, “PyTorch Implementation of STDF,” https://github.com/RyanXingQL/STDF-PyTorch, 2020, [Online; accessed 11-April-2021].
-  K. Seshadrinathan, R. Soundararajan, A. C. Bovik and L. K. Cormack, “Study of Subjective and Objective Quality Assessment of Video,” in IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427-1441, June 2010, doi: 10.1109/TIP.2010.2042111.
-  H.264/AVC Software Coordination, JM Reference Software, [online] Available: http://iphome.hhi.de/suehring/tml.
-  VVC reference software repository, 2021, [online] Available: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM.