MFQE 2.0: A New Approach for Multi-frame Quality Enhancement on Compressed Video

02/26/2019 ∙ by Zhenyu Guan, et al.

The past few years have witnessed great success in applying deep learning to enhance the quality of compressed image/video. The existing approaches mainly focus on enhancing the quality of a single frame, not considering the similarity between consecutive frames. Since heavy fluctuation exists across compressed video frames as investigated in this paper, frame similarity can be utilized for quality enhancement of low-quality frames by using their neighboring high-quality frames. This task can be seen as Multi-Frame Quality Enhancement (MFQE). Accordingly, this paper proposes an MFQE approach for compressed video, as the first attempt in this direction. In our approach, we firstly develop a Bidirectional Long Short-Term Memory (BiLSTM) based detector to locate Peak Quality Frames (PQFs) in compressed video. Then, a novel Multi-Frame Convolutional Neural Network (MF-CNN) is designed to enhance the quality of compressed video, in which the non-PQF and its nearest two PQFs are the input. In MF-CNN, motion between the non-PQF and PQFs is compensated by a motion compensation subnet. Subsequently, a quality enhancement subnet fuses the non-PQF and compensated PQFs, and then reduces the compression artifacts of the non-PQF. Finally, experiments validate the effectiveness and generalization ability of our MFQE approach in advancing the state-of-the-art quality enhancement of compressed video. The code of our MFQE approach is available at




Code Repositories


Official repository of "MFQE 2.0: A New Approach for Multi-frame Quality Enhancement on Compressed Video", TPAMI 2019.



Multi-frame Quality Enhancement on Compressed Video


1 Introduction

During the past decades, there has been a considerable increase in the popularity of video over the Internet. According to the Cisco Data Traffic Forecast [1], video accounted for a major share of Internet traffic in 2016, and this share was predicted to grow further by 2020. When transmitting video over the bandwidth-limited Internet, video compression has to be applied to significantly reduce the coding bit-rate. However, compressed video inevitably suffers from compression artifacts, which severely degrade the Quality of Experience (QoE) [2, 3, 4, 5, 6]. Besides, such artifacts may reduce the accuracy of classification and recognition tasks. It is verified in [7, 8, 9, 10] that compression quality enhancement is able to improve the performance of classification and recognition. Therefore, there is a pressing need to study quality enhancement for compressed video.

Fig. 1: An example for quality fluctuation (top) and quality enhancement performance (bottom).
Fig. 2: Examples of video sequences in our enlarged database.

Recently, extensive works were conducted for enhancing the visual quality of compressed image and video [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]. For example, Dong et al. [17] designed a four-layer Convolutional Neural Network (CNN) [26], named AR-CNN, which considerably improves the quality of JPEG images. Then, Denoising CNN (DnCNN) [20], which applies a residual learning strategy, was proposed for image denoising, image super-resolution and JPEG quality enhancement. Later, Yang et al. [24, 25] designed a Decoder-side Scalable CNN (DS-CNN) for video quality enhancement. The DS-CNN structure is composed of two subnets, aiming at reducing intra- and inter-coding distortion, respectively. However, when processing a single frame, none of the existing quality enhancement approaches takes advantage of the information provided by neighboring frames, and thus their performance is severely limited. As Fig. 1 shows, the quality of compressed video dramatically fluctuates across frames. Therefore, it is possible to use the high-quality frames (i.e., Peak Quality Frames, called PQFs; a PQF is defined as a frame whose quality is higher than that of both its previous and subsequent frames) to enhance the quality of their neighboring low-quality frames (non-PQFs). This can be seen as Multi-Frame Quality Enhancement (MFQE), similar to multi-frame super-resolution [27, 28, 29].

This paper proposes an MFQE approach for compressed video. Specifically, we find that large quality fluctuation exists across consecutive frames, for video sequences compressed by almost all compression standards. Thus, it is possible to improve the quality of a non-PQF with the help of its neighboring PQFs. To this end, we first train a Bidirectional Long Short-Term Memory (BiLSTM) based model as a no-reference method to detect PQFs. Then, a novel Multi-Frame CNN (MF-CNN) architecture is proposed for non-PQF quality enhancement, which takes both the current non-PQF and its adjacent PQFs as input. Our MF-CNN includes two components, i.e., the Motion Compensation subnet (MC-subnet) and the Quality Enhancement subnet (QE-subnet). The MC-subnet is developed to compensate motion between the current non-PQF and its adjacent PQFs. The QE-subnet, with a spatio-temporal architecture, is designed to extract and merge the features of the current non-PQF and the compensated PQFs. Finally, the quality of the current non-PQF can be enhanced by the QE-subnet, which takes advantage of the higher-quality information provided by its adjacent PQFs. For example, as shown in Fig. 1, the current non-PQF (frame 95) and its nearest two PQFs (frames 92 and 96) are all fed into MF-CNN in our MFQE approach. As a result, the low-quality content (basketball) in the non-PQF (frame 95) can be enhanced based on essentially the same but qualitatively better content in the neighboring PQFs (frames 92 and 96). Moreover, Fig. 1 shows that our MFQE approach also mitigates the quality fluctuation, due to the considerable quality improvement of non-PQFs.

This work is an extended version of our conference paper [30] (called MFQE 1.0 in this paper) with additional works and substantial improvements, thus called MFQE 2.0 (called MFQE in this paper for simplicity). The extension is as follows. (1) We enlarge our database in [30] from 70 to 160 uncompressed videos. On this basis, a more thorough analysis of compressed video is conducted. (2) We develop a new PQF detector, which is based on BiLSTM instead of the support vector machine (SVM) in [30]. Our new detector is capable of extracting both spatial and temporal information of PQFs, leading to a boost in the F1-score of PQF detection (see Table IV). (3) We advance our QE-subnet by introducing the multi-scale strategy, Batch Normalization [31] and dense connection [32], rather than the conventional CNN design in [30]. Besides, we develop a lightweight structure for the QE-subnet to accelerate video quality enhancement. Experiments show that the average Peak Signal-to-Noise Ratio (PSNR) improvement on 18 sequences selected by [33] increases from 0.442 dB to 0.559 dB (i.e., a relative improvement of about 26.5%), while the number of parameters drops from 1,787,547 to 255,422 (i.e., a saving of about 85.7%), resulting in at least 2 times acceleration of quality enhancement. (4) More extensive experiments are provided to validate the performance and generalization ability of our MFQE approach.

Fig. 3: PSNR curves of compressed video by various compression standards.

2 Related works

2.1 Related works on quality enhancement

Recently, extensive works [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] have focused on enhancing the visual quality of compressed images. Specifically, Foi et al. [12] applied point-wise Shape-Adaptive DCT (SA-DCT) to reduce the blocking and ringing effects caused by JPEG compression. Later, Jancsary et al. [14] proposed reducing JPEG image blocking effects by adopting Regression Tree Fields (RTF). Moreover, sparse coding was utilized to remove JPEG artifacts, such as in [15] and [16]. Recently, deep learning has also been successfully applied to improve the visual quality of compressed images. Particularly, Dong et al. [17] proposed a four-layer AR-CNN to reduce the JPEG artifacts of images. Afterwards, the approach of [19] and the Deep Dual-domain Convolutional Network (DDCN) [18] were proposed as advanced deep networks for the quality enhancement of JPEG images, utilizing the prior knowledge of JPEG compression. Later, DnCNN was proposed in [20] for several tasks of image restoration, including quality enhancement. Li et al. [21] proposed a 20-layer CNN for enhancing image quality. Most recently, the memory network (MemNet) [23] has been proposed for image restoration tasks, including quality enhancement. In MemNet, the memory block was introduced to generate long-term memory across CNN layers, which successfully compensates the middle- and high-frequency signals distorted during compression. It achieves the state-of-the-art quality enhancement performance for compressed images.

There are also some other works [34, 24, 35] proposed for the quality enhancement of compressed video. For example, the Variable-filter-size Residue-learning CNN (VRCNN) [34] was proposed to replace the in-loop filters for HEVC intra-coding. However, the CNN in [34] was designed as a component of the video encoder, so that it is not practical for already compressed video. Most recently, a Deep CNN-based Auto Decoder (DCAD), which contains 10 CNN layers, was proposed in [35] to reduce the distortion of compressed video. Moreover, Yang et al. [24] proposed the DS-CNN approach for video quality enhancement. In [24], DS-CNN-I and DS-CNN-B, as two subnetworks of DS-CNN, are used to reduce the artifacts of intra- and inter-coding, respectively. All of the above approaches can be seen as single-frame quality enhancement approaches, as they do not take any advantage of neighboring frames with high similarity. Consequently, their performance on video quality enhancement is severely limited.

2.2 Related works on multi-frame super-resolution

To the best of our knowledge, there exists no MFQE work for compressed video. The closest area is multi-frame video super-resolution. In the early years, Brandi et al. [36] and Song et al. [37] proposed to enlarge video resolution by taking advantage of high resolution key-frames. Recently, many multi-frame super-resolution approaches have employed deep neural networks. For example, Huang et al. [38] developed a Bidirectional Recurrent Convolutional Network (BRCN), which improves the super-resolution performance over traditional single-frame approaches. Kappeler et al. proposed a Video Super-Resolution network (VSRnet) [27], in which the neighboring frames are warped according to the estimated motion, and then both the current and warped neighboring frames are fed into a super-resolution CNN to enlarge the resolution of the current frame. Later, Li et al. [28] proposed replacing VSRnet by a deeper network with a residual learning strategy. All these multi-frame methods exceed the limitation of single-frame approaches (e.g., SR-CNN [39]) for super-resolution, which only utilize the spatial information within one single frame.

Recently, the CNN-based FlowNet [40, 41] has been applied in [42] to estimate the motion across frames for super-resolution, which jointly trains the networks of FlowNet and super-resolution. Then, Caballero et al. [29] designed a spatial transformer motion compensation network to detect the optical flow for warping neighboring frames. The current and warped neighboring frames were then fed into the Efficient Sub-Pixel Convolution Network (ESPCN) [43] for super-resolution. Most recently, the Sub-Pixel Motion Compensation (SPMC) layer has been proposed in [44] for video super-resolution. Besides, [44] utilized Convolutional Long Short-Term Memory (ConvLSTM) to achieve the state-of-the-art performance on video super-resolution.

The aforementioned multi-frame super-resolution approaches are motivated by the fact that different observations of the same object or scene are highly likely to exist in consecutive frames of video. As a result, the neighboring frames may contain the content missed when down-sampling the current frame. Similarly, for compressed video, the low-quality frames can be enhanced by taking advantage of their adjacent frames with higher quality, because heavy quality fluctuation exists across compressed frames. Consequently, the quality of compressed video may be effectively improved by leveraging the multi-frame information. To the best of our knowledge, our MFQE approach proposed in this paper is the first attempt in this direction.

Metrics                    MPEG-1   MPEG-2   MPEG-4   H.264    HEVC
PSNR (dB)            SD    2.2175   2.2273   2.1261   1.6899   0.8788
                     PVD   1.1553   1.1665   1.0842   0.4732   1.1734
SSIM                 SD    0.0717   0.0726   0.0735   0.0552   0.0105
                     PVD   0.0387   0.0391   0.0298   0.0102   0.0132
Separation (frames)  PS    5.3646   5.4713   5.4123   2.0529   2.6641
TABLE I: Averaged SD, PVD and PS values of our database.

3 Analysis of compressed video

In this section, we first establish a large-scale database of raw and compressed video sequences (Section 3.1) for training the deep neural networks in our MFQE approach. We further analyze our database to investigate the frame-level quality fluctuation (Section 3.2) and similarity between consecutive compressed frames (Section 3.3). The analysis results can be seen as the motivation of our work.

Fig. 4: An example of frame-level quality fluctuation in video Football compressed by HEVC.

3.1 Database

First, we establish a database including 160 uncompressed video sequences. These sequences are selected from the datasets of [45], VQEG [46] and Joint Collaborative Team on Video Coding (JCT-VC) [47]. The video sequences contained in our database span a wide range of resolutions: SIF (352×240), CIF (352×288), NTSC (720×486), 4CIF (704×576), 240p (416×240), 360p (640×360), 480p (832×480), 720p (1280×720), 1080p (1920×1080), and WQXGA (2560×1600). Moreover, Fig. 2 shows some typical examples of the sequences in our database, demonstrating the diversity of video content. Then, all video sequences are compressed by MPEG-1 [48], MPEG-2 [49], MPEG-4 [50], H.264/AVC [51] and HEVC [52] at different quantization parameters (QPs), to generate the corresponding video streams in our database. FFmpeg is used for MPEG-1, MPEG-2, MPEG-4 and H.264/AVC compression, and HM16.5 is used for HEVC compression.

Fig. 5: The average CC value of each pair of adjacent frames in HEVC.

3.2 Frame-level quality fluctuation

Fig. 3 shows the PSNR curves of 8 video sequences, which are compressed by different compression standards. It can be seen that PSNR clearly fluctuates across the compressed frames. This indicates that there exists considerable quality fluctuation in compressed video sequences for MPEG-1, MPEG-2, MPEG-4, H.264/AVC and HEVC. In addition, Fig. 4 visualizes the subjective results of some frames in one video sequence, which is compressed by the latest HEVC standard. We can see that visual quality varies across compressed frames, which also indicates frame-level quality fluctuation.

Moreover, we measure the Standard Deviation (SD) of frame-level PSNR and Structural Similarity (SSIM) for each compressed video sequence, to quantify the quality fluctuation throughout the frames. Besides, the Peak-Valley Difference (PVD), which calculates the average difference between peak values and their nearest valley values, is also measured for both PSNR and SSIM curves of each compressed sequence. Note that the PVD reflects the quality difference between frames within a short period. The results of SD and PVD are reported in Table I, which are averaged over all 160 video sequences in our database. Table I shows that the average SD values of PSNR are no lower than 0.8788 dB (the HEVC value) for all five compression standards. This implies that heavy quality fluctuation exists across the frames of compressed video sequences. In addition, we can see from Table I that the average PVD results of PSNR are above 1 dB for MPEG-1, MPEG-2, MPEG-4 and HEVC, with H.264 (0.4732 dB) as the only exception. Therefore, the visual quality is dramatically different between PQFs and Valley Quality Frames (VQFs), such that it is possible to significantly improve the visual quality of VQFs given their neighboring PQFs. Note that similar results can be found for SSIM as shown in Table I. In summary, we can conclude that significant frame-level quality fluctuation exists for various video compression standards in terms of both PSNR and SSIM.
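The two fluctuation metrics can be illustrated with a short Python sketch. It is only one plausible reading of the definitions above: the strict local-extremum test, the use of the population standard deviation, the tie-breaking toward the earlier valley, and the function names `local_extrema` and `peak_valley_difference` are all assumptions made for illustration, and the PSNR values are made up.

```python
from statistics import pstdev


def local_extrema(values):
    """Return (peaks, valleys) as lists of interior frame indices.

    Assumption: a peak/valley is strictly higher/lower than both neighbors.
    """
    peaks, valleys = [], []
    for n in range(1, len(values) - 1):
        if values[n] > values[n - 1] and values[n] > values[n + 1]:
            peaks.append(n)
        elif values[n] < values[n - 1] and values[n] < values[n + 1]:
            valleys.append(n)
    return peaks, valleys


def peak_valley_difference(values):
    """Average difference between each peak and its nearest valley (by index)."""
    peaks, valleys = local_extrema(values)
    if not peaks or not valleys:
        return 0.0
    diffs = []
    for p in peaks:
        # On a distance tie, min() keeps the earlier valley (an arbitrary choice).
        nearest = min(valleys, key=lambda v: abs(v - p))
        diffs.append(values[p] - values[nearest])
    return sum(diffs) / len(diffs)


# Hypothetical frame-level PSNR curve (dB), for illustration only.
psnr = [30.0, 32.0, 31.0, 30.5, 33.0, 31.5, 32.5, 30.0]
sd = pstdev(psnr)                   # population SD of frame-level PSNR
pvd = peak_valley_difference(psnr)  # average peak-to-nearest-valley gap
print(f"SD = {sd:.4f} dB, PVD = {pvd:.4f} dB")
```

The same functions apply unchanged to a frame-level SSIM curve.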

Fig. 6: The framework of our proposed MFQE approach.

3.3 Similarity between neighboring frames

It is intuitive that frames within a short time period are highly similar. We thus evaluate the Correlation Coefficient (CC) values between each compressed frame and its previous/subsequent 10 frames, for all 160 sequences in our database. The mean and SD of the CC values are shown in Fig. 5, which are obtained from all sequences compressed by HEVC. We can see that the average CC values are larger than 0.75 and the SD values of CC are less than 0.20, when the two frames are within 10 frames of each other. Similar results can be found for the other four video compression standards. This validates the high correlation of neighboring video frames.
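As an illustration of the similarity measure, the sketch below computes a correlation coefficient between two frames. The text does not spell out the exact CC formula, so the Pearson correlation over all pixel values is assumed here; the function name `correlation_coefficient` and the tiny frames are hypothetical.

```python
from math import sqrt


def correlation_coefficient(frame_a, frame_b):
    """Pearson correlation between two equally sized frames (lists of rows)."""
    a = [px for row in frame_a for px in row]
    b = [px for row in frame_b for px in row]
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and of equal size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat frame has no defined correlation; 0 is a convention
    return cov / sqrt(var_a * var_b)


# Hypothetical 2x2 luminance frames, for illustration only.
current = [[10, 20], [30, 40]]
neighbor = [[12, 19], [33, 41]]
print(correlation_coefficient(current, neighbor))
```

Averaging this value over every frame pair at a given frame distance would give one point of a curve like the one in Fig. 5.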

In addition, it is necessary to investigate the number of non-PQFs between two neighboring PQFs, denoted by the Peak Separation (PS), since the quality enhancement of each non-PQF is based on its two neighboring PQFs. Table I also reports the results of PS, which are averaged over all 160 video sequences in our database. We can see from this table that the PS values are considerably smaller than 10 frames, especially for the latest H.264 (PS = 2.0529) and HEVC (PS = 2.6641) standards. (Note that this paper only defines PS according to PSNR rather than SSIM, but similar results can be found for SSIM.) Such a short distance, together with the similarity results in Fig. 5, indicates the high similarity between two neighboring PQFs. Therefore, the PQFs probably contain some useful content which is distorted in their neighboring non-PQFs. Motivated by this, our MFQE approach is proposed to enhance the quality of non-PQFs through the advantageous information of the nearest PQFs.
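The PQF definition and the Peak Separation can be made concrete with a small sketch. It assumes that ground-truth PQFs are labeled from a frame-level PSNR curve (a PQF is strictly better than both neighbors, so the first and last frames are never labeled) and that PS counts the non-PQFs lying between two consecutive PQFs; the function names and the PSNR values are hypothetical.

```python
def pqf_labels(psnr):
    """Label frame n as a PQF (1) if its PSNR exceeds both neighbors, else 0."""
    labels = [0] * len(psnr)
    for n in range(1, len(psnr) - 1):
        if psnr[n] > psnr[n - 1] and psnr[n] > psnr[n + 1]:
            labels[n] = 1
    return labels


def peak_separations(labels):
    """Number of non-PQFs between each pair of consecutive PQFs."""
    pqf_idx = [n for n, label in enumerate(labels) if label == 1]
    return [b - a - 1 for a, b in zip(pqf_idx, pqf_idx[1:])]


# Hypothetical frame-level PSNR curve (dB), for illustration only.
psnr = [30.0, 32.0, 31.0, 30.5, 33.0, 31.5, 32.5, 30.0]
labels = pqf_labels(psnr)
seps = peak_separations(labels)
mean_ps = sum(seps) / len(seps) if seps else 0.0
print(labels, seps, mean_ps)
```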

4 The proposed MFQE approach

4.1 Framework

The framework of our MFQE approach is shown in Fig. 6. As seen in this figure, our MFQE approach first detects PQFs that are used for quality enhancement of non-PQFs. In practical applications, raw sequences are not available in video quality enhancement, and thus PQFs and non-PQFs cannot be distinguished through comparison with raw sequences. Therefore, we develop a no-reference PQF detector in our MFQE approach, which is detailed in Section 4.2. Then, we enhance the quality of PQFs and non-PQFs by two different techniques as follows. We adopt DS-CNN [25], which is a single-frame approach for video quality enhancement, to enhance the quality of detected PQFs. This is because the adjacent frames of a PQF have lower quality and thus cannot benefit the quality enhancement of this PQF.

Fig. 7: The architecture of our BiLSTM based PQF detector.

For non-PQFs, a novel MF-CNN architecture is proposed to enhance their quality, which takes advantage of the nearest PQFs, i.e., both previous and subsequent PQFs. As shown in Fig. 6, the MF-CNN architecture is composed of the MC-subnet and the QE-subnet. The MC-subnet (introduced in Section 4.3) is developed to compensate the temporal motion between neighboring frames. To be specific, the MC-subnet firstly predicts the temporal motion between the current non-PQF and its nearest PQFs. Then, the two nearest PQFs are warped with the spatial transformer according to the estimated motion. As such, the temporal motion between non-PQF and PQFs can be compensated. Finally, the QE-subnet (introduced in Section 4.4), which has a spatio-temporal architecture, is proposed for quality enhancement. In the QE-subnet, both the current non-PQF and compensated PQFs are the inputs, and then the quality of the non-PQF can be enhanced with the help of the adjacent compensated PQFs. Note that, in the proposed MF-CNN, the MC-subnet and QE-subnet are trained jointly in an end-to-end manner.

4.2 BiLSTM-based PQF detector

In our MFQE approach, the no-reference PQF detector is based on a BiLSTM network. Recall that a PQF is a frame with higher quality than its adjacent frames. Thus, the features of the current and neighboring frames in both forward and backward directions are used together to detect PQFs. As revealed in Section 3.2, PQFs frequently appear in compressed video, leading to the quality fluctuation. Due to this, we apply the BiLSTM network [53] as the PQF detector, in which the long- and short-term correlation between PQFs and non-PQFs can be extracted and modeled.

Notations. We first introduce the notations for our PQF detector. The consecutive frames in a compressed video are denoted by $\{f_n\}_{n=1}^{N}$, where $n$ indicates the frame order and $N$ is the total number of frames. Then, the corresponding output from BiLSTM is denoted by $\{p_n\}_{n=1}^{N}$, in which $p_n$ is the probability of $f_n$ being a PQF. Given $\{p_n\}_{n=1}^{N}$, the labels of PQFs for each frame can be determined and denoted by $\{l_n\}_{n=1}^{N}$. If $f_n$ is a PQF, then we have $l_n = 1$; otherwise, we have $l_n = 0$.

Feature Extraction. Before training, we extract 38 features for each frame $f_n$. Specifically, 2 compressed-domain features, i.e., the number of assigned bits and the quantization parameter, are extracted at each frame for detecting the PQF, since they are strongly related to visual quality and can be directly obtained from the bitstream. In addition, we follow the no-reference quality assessment method [2] to extract 36 features in the pixel domain. Finally, the extracted features form a 38-dimensional vector as the input to BiLSTM.

Architecture. The architecture of the BiLSTM is shown in Fig. 7. As seen in this figure, the LSTM is bidirectional, in order to extract and model the dependencies from both forward and backward directions. First, the input 38-dimensional feature vector is fed into 2 LSTM cells, one for the forward and one for the backward direction. Each of the LSTM cells is composed of 256 units at one time step (corresponding to one video frame). Then, the outputs of the bidirectional LSTM cells are fused and sent to a fully connected layer with a sigmoid activation. Consequently, the fully connected layer outputs $p_n$, the probability of frame $f_n$ being a PQF. Finally, the PQF label $l_n$ can be derived from $p_n$.

Fig. 8: The architecture of our MC-subnet.

Postprocessing. In our PQF detector, we further refine the results from BiLSTM according to the prior knowledge of PQF. Specifically, the following two strategies are developed to refine the labels $\{l_n\}_{n=1}^{N}$ of the PQF detector, where $N$ is the total number of frames and $p_n$ is the predicted probability of frame $n$ being a PQF.

Strategy I: Remove the consecutive PQFs. According to the definition of PQF, it is impossible that PQFs appear consecutively. Hence, if consecutive PQFs exist:

$$\{l_{n+i}\}_{i=0}^{j} = 1 \ \text{ and } \ l_{n-1} = l_{n+j+1} = 0, \quad j \geq 1, \tag{1}$$

we refine the PQF labels according to their probabilities:

$$l_{n+i} = 0, \ \text{ where } \ i \neq \arg\max_{0 \leq k \leq j} \, p_{n+k}, \tag{2}$$

so that only one PQF is left.

Strategy II: Break the continuity of non-PQFs. According to the analysis in Section 3, PQFs frequently appear within a limited separation. For example, the average value of PS is 2.66 frames for HEVC compressed sequences. Here, we assume that $D$ is the maximal separation between two PQFs. Given this assumption, if the results of $\{l_n\}_{n=1}^{N}$ yield more than $D$ consecutive zeros (non-PQFs):

$$\{l_{n+i}\}_{i=1}^{d} = 0 \ \text{ and } \ l_{n} = l_{n+d+1} = 1, \quad d > D, \tag{3}$$

then one of their corresponding frames needs to act as a PQF. Accordingly, we set:

$$l_{n+i} = 1, \ \text{ where } \ i = \arg\max_{1 \leq k \leq d} \, p_{n+k}. \tag{4}$$

After refining $\{l_n\}_{n=1}^{N}$ as discussed above, our PQF detector can locate PQFs and non-PQFs in the compressed video.
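The two refinement strategies can be sketched in a few lines of Python. This is an illustrative reading rather than the released implementation: the function names, the handling of runs that touch the sequence boundary, and the single pass of Strategy II (one frame promoted per overlong gap) are assumptions, and the labels and probabilities are made up.

```python
def remove_consecutive_pqfs(labels, probs):
    """Strategy I: in each run of consecutive PQF labels, keep only the
    frame with the highest predicted probability."""
    out = list(labels)
    n = 0
    while n < len(out):
        if out[n] != 1:
            n += 1
            continue
        end = n
        while end + 1 < len(out) and out[end + 1] == 1:
            end += 1
        if end > n:  # a run of at least two PQFs
            best = max(range(n, end + 1), key=lambda k: probs[k])
            for k in range(n, end + 1):
                out[k] = 1 if k == best else 0
        n = end + 1
    return out


def break_long_non_pqf_runs(labels, probs, max_sep):
    """Strategy II: if more than max_sep non-PQFs lie between two PQFs,
    promote the most probable frame among them to a PQF (single pass)."""
    out = list(labels)
    pqf_idx = [n for n, label in enumerate(labels) if label == 1]
    for a, b in zip(pqf_idx, pqf_idx[1:]):
        if b - a - 1 > max_sep:
            best = max(range(a + 1, b), key=lambda k: probs[k])
            out[best] = 1
    return out


# Hypothetical detector output, for illustration only.
labels = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
probs = [0.1, 0.6, 0.9, 0.2, 0.1, 0.3, 0.8, 0.2, 0.1, 0.1, 0.2, 0.7, 0.1]
step1 = remove_consecutive_pqfs(labels, probs)
step2 = break_long_non_pqf_runs(step1, probs, max_sep=6)
print(step1)
print(step2)
```

Applying Strategy I before Strategy II keeps the gap computation simple, because every remaining PQF is then isolated.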

4.3 MC-subnet

After detecting PQFs, our MFQE approach can enhance the quality of non-PQFs by taking advantage of their neighboring PQFs. Unfortunately, there exists considerable temporal motion between PQFs and non-PQFs. Hence, we develop the MC-subnet to compensate the temporal motion across frames, which is based on the CNN method of Spatial Transformer Motion Compensation [29].

Layers          Conv 1   Conv 2   Conv 3   Conv 4   Conv 5
Filter size
Filter number   24       24       24       24       2
Stride          1        1        1        1        1
Function        PReLU    PReLU    PReLU    PReLU    Tanh
TABLE II: Convolutional layers for pixel-wise motion estimation.
Fig. 9: The architecture of our QE-subnet. In the multi-scale feature extraction component (denoted by C1-C9), the layer groups C1/4/7, C2/5/8 and C3/6/9 each use a different filter size, and the filter number is set to 32 for each layer. In the densely connected mapping construction (denoted by C10-C14), the filter number is also set to 32 for each layer. The last layer C15 has only one filter. In addition, the PReLU activation is applied to C1-C14, while BN is applied to C10-C15.

Architecture. The architecture of STMC is shown in Fig. 8. Additionally, the convolutional layers of pixel-wise motion estimation are described in Table II. The same as [29], our MC-subnet adopts convolutional layers to estimate two down-scaling Motion Vector (MV) maps at different scales. Down-scaling motion estimation is effective for handling large-scale motion. However, because of down-scaling, the accuracy of MV estimation is reduced. Therefore, in addition to STMC, we further develop some additional convolutional layers for pixel-wise motion estimation in our MC-subnet, which does not contain any down-scaling process. Then, the output of STMC includes a down-scaling MV map and the corresponding compensated PQF. They are concatenated with the original PQF and non-PQF, as the input to the convolutional layers of the pixel-wise motion estimation. Consequently, the pixel-wise MV map can be generated, which is denoted by $\mathbf{M}$. Note that the MV map $\mathbf{M}$ contains two channels, i.e., the horizontal MV map $\mathbf{M}_x$ and the vertical MV map $\mathbf{M}_y$. Here, $x$ and $y$ are the horizontal and vertical index of each pixel. Given $\mathbf{M}_x$ and $\mathbf{M}_y$, the PQF is warped to compensate the temporal motion. Let the compressed PQF and non-PQF be $F_p$ and $F_{np}$, respectively. The compensated PQF $F'_p$ can be expressed as

$$F'_p(x, y) = \mathcal{I}\{F_p(x + \mathbf{M}_x(x, y),\ y + \mathbf{M}_y(x, y))\}, \tag{5}$$

where $\mathcal{I}\{\cdot\}$ means bilinear interpolation. The reason for interpolation is that $\mathbf{M}_x(x, y)$ and $\mathbf{M}_y(x, y)$ may be non-integer values.
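The warping step can be illustrated with a pure-Python sketch of bilinear sampling. It assumes that each output pixel (x, y) is read from the PQF at position (x + horizontal MV, y + vertical MV) and that positions outside the frame are clamped to the border; the sign convention, the border handling and the function names `bilinear_sample` and `warp` are assumptions for illustration, not details confirmed by the text.

```python
from math import floor


def bilinear_sample(frame, x, y):
    """Sample a frame (list of rows) at a real-valued position (x, y).

    Positions outside the frame are clamped to the border (an assumption).
    """
    h, w = len(frame), len(frame[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(floor(x)), int(floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional offsets inside the pixel cell
    top = (1 - ax) * frame[y0][x0] + ax * frame[y0][x1]
    bottom = (1 - ax) * frame[y1][x0] + ax * frame[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp(frame, mv_x, mv_y):
    """Warp a frame with per-pixel horizontal (mv_x) and vertical (mv_y) MVs."""
    h, w = len(frame), len(frame[0])
    return [
        [bilinear_sample(frame, x + mv_x[y][x], y + mv_y[y][x]) for x in range(w)]
        for y in range(h)
    ]


# Hypothetical 2x3 frame and a uniform half-pixel horizontal motion.
pqf = [[0, 10, 20], [30, 40, 50]]
half = [[0.5] * 3 for _ in range(2)]
zero = [[0.0] * 3 for _ in range(2)]
compensated = warp(pqf, half, zero)
print(compensated)
```

In a trainable network the same sampling has to be differentiable with respect to the MV map, which is why bilinear rather than nearest-neighbor interpolation is the natural choice.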

Component   Feature extraction   Mapping construction   Overall
MFQE 1.0    3,142,936,808        20,536                 3,142,957,344
MFQE 2.0    2,720                100,264,718            100,267,438
TABLE III: Theoretical time complexity (in terms of clock periods) of two QE-subnets for MFQE 1.0 and MFQE 2.0.

Training strategy. Since it is hard to obtain the ground truth of MV, the parameters of the convolutional layers for motion estimation cannot be trained directly. Instead, we can train the parameters by minimizing the MSE between the compensated adjacent frame and the current frame. However, in our MC-subnet, both inputs, i.e., the PQF $F_p$ and the non-PQF $F_{np}$, are compressed frames with quality distortion. Hence, when minimizing the MSE between the compensated PQF $F'_p$ and $F_{np}$, the MC-subnet learns to estimate the distorted MV, resulting in inaccurate motion estimation. Therefore, the MC-subnet is trained under the supervision of the raw frames. That is, we warp the raw frame of the PQF (denoted by $F^{R}_{p}$) using the MV map output from the convolutional layers of motion estimation, and minimize the MSE between the compensated raw PQF (denoted by $F'^{R}_{p}$) and the raw non-PQF (denoted by $F^{R}_{np}$). Mathematically, the loss function of the MC-subnet can be written as

$$L_{MC}(\theta_{mc}) = \left\| F'^{R}_{p}(\theta_{mc}) - F^{R}_{np} \right\|_2^2, \tag{6}$$

where $\theta_{mc}$ represents the trainable parameters of our MC-subnet. Note that the raw frames $F^{R}_{p}$ and $F^{R}_{np}$ are not required when compensating motion in test and practical use.

4.4 QE-subnet

Given the compensated PQFs, the quality of non-PQFs can be enhanced through the QE-subnet. To be specific, the non-PQF $F_{np}$, together with the compensated previous and subsequent PQFs ($F'_{p1}$ and $F'_{p2}$), is fed into the QE-subnet. In this way, both the spatial and temporal features of these three frames are extracted and fused, such that the advantageous information in the adjacent PQFs can be used to enhance the quality of the non-PQF. This differs from the conventional CNN-based single-frame quality enhancement approaches, which can only handle the spatial information within one single frame.

Architecture. The architecture of QE-subnet is shown in Fig. 9. The QE-subnet consists of two key lightweight components: multi-scale feature extraction (denoted by C1-9) and densely connected mapping construction (denoted by C10-14).

  • Multi-scale feature extraction. The input to the QE-subnet is the non-PQF $F_{np}$ and its neighboring compensated PQFs $F'_{p1}$ and $F'_{p2}$. Then, the spatial features of $F_{np}$, $F'_{p1}$ and $F'_{p2}$ are extracted by multi-scale convolutional filters, denoted by C1-9. Specifically, C1,4,7, C2,5,8 and C3,6,9 use three different filter sizes, one per scale. The filter numbers of C1-9 are all 32. After feature extraction, 288 feature maps (9 layers with 32 filters each) filtered at different scales are obtained. Subsequently, all feature maps from $F_{np}$, $F'_{p1}$ and $F'_{p2}$ are concatenated, and then flow into the dense connection component.

  • Densely connected mapping construction. After obtaining the feature maps from $F_{np}$, $F'_{p1}$ and $F'_{p2}$, a densely connected architecture is applied to construct the non-linear mapping from feature maps to enhancement residual. Note that enhancement residual refers to the difference between original and enhanced frames. To be specific, there are 5 convolutional layers in the non-linear mapping of the densely connected architecture. Each of them has 32 convolutional filters. In addition, dense connection [32] is adopted to encourage feature reuse, strengthen feature propagation and mitigate the vanishing-gradient problem. Moreover, Batch Normalization (BN) [31] is applied to all 5 layers after PReLU activation to reduce internal covariate shift, thus accelerating the training process. We denote the composite non-linear mapping as $H_l(\cdot)$, including Convolution (Conv), PReLU and BN. We further denote the output of the $l$-th layer as $F_l$, such that each layer following C10 can be formulated as follows,

    $$F_l = H_l([F_{10}, F_{11}, \ldots, F_{l-1}]), \tag{7}$$

    where $[F_{10}, F_{11}, \ldots, F_{l-1}]$ refers to the concatenation of the feature maps produced in the preceding layers among C10-C14, and C10 itself takes the concatenated multi-scale feature maps as input. Finally, the enhanced non-PQF $F_{en}$ is generated by the pixel-wise summation of the learned enhancement residual $R_{np}(\theta_{qe})$ and the input non-PQF $F_{np}$:

    $$F_{en} = F_{np} + R_{np}(\theta_{qe}), \tag{8}$$

    where $\theta_{qe}$ is defined as the trainable parameters of the QE-subnet.

  • Lightweight yet effective network. The QE-subnet of MFQE 1.0 is heavyweight and inefficient. To be specific, the QE-subnet can be divided into feature extraction and mapping construction. In MFQE 1.0, the former consists of 3 wide layers with over 64 filters each, while the latter possesses only 2 shallow layers with no more than 32 filters each. Here, we calculate the theoretical time complexity of the QE-subnet in MFQE 1.0, as detailed in the Supporting Document. The results are presented in Table III. We can see that the mapping process takes up less than 0.001% of the overall complexity (20,536 out of 3,142,957,344 clock periods), such that the mapping construction is severely overlooked. Note that the quality is directly enhanced by the constructed mapping, which has a shallow structure with only 2 layers. This leads to inferior performance of quality enhancement. To address this problem, we first deepen the mapping component, such that the nonlinear ability of the mapping structure can be greatly improved. Besides, to suppress the excessive growth of the parameter number, we apply the dense strategy and then decrease the number of output feature maps. This is because when the time cost is limited, the layer width is of less importance than the network depth, as proven in [54]. This way, the effectiveness of our QE-subnet can be greatly enhanced, while its complexity remains at a low level. Moreover, we adopt more lightweight layers to further ease the computational burden in the feature extraction component. The overall complexity of the simplified QE-subnet is also presented in Table III, which dramatically drops from 3,142,957,344 to 100,267,438 clock periods, i.e., a decrease of about 96.8% in time. Therefore, MFQE 2.0 succeeds in advancing the QE-subnet with a lightweight yet effective network. The success in both simplicity and effectiveness is further validated in Section 5.3.
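The complexity comparison can be checked with a few lines of arithmetic. The sketch below only re-derives ratios from the figures reported in Table III and from the parameter counts quoted in the Introduction (1,787,547 and 255,422); the helper names are hypothetical, and no new measurements are involved.

```python
# Theoretical complexities from Table III, in clock periods.
mfqe1 = {"feature_extraction": 3_142_936_808, "mapping_construction": 20_536}
mfqe2 = {"feature_extraction": 2_720, "mapping_construction": 100_264_718}


def overall(components):
    """Total complexity of a QE-subnet as the sum of its components."""
    return sum(components.values())


def share(components, key):
    """Fraction of the total complexity spent in one component."""
    return components[key] / overall(components)


def relative_reduction(old, new):
    """Relative decrease when going from old to new."""
    return 1 - new / old


mapping_share_v1 = share(mfqe1, "mapping_construction")
time_reduction = relative_reduction(overall(mfqe1), overall(mfqe2))
param_reduction = relative_reduction(1_787_547, 255_422)

print(f"mapping share in MFQE 1.0: {mapping_share_v1:.6%}")
print(f"complexity reduction:      {time_reduction:.2%}")
print(f"parameter reduction:       {param_reduction:.2%}")
```

The shift of almost all computation from feature extraction to mapping construction is the quantitative counterpart of the design argument above.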

Training strategy. The MC-subnet and QE-subnet in our MF-CNN are trained jointly in an end-to-end manner. Recall that and are defined as the raw frames of the previous and incoming PQFs, respectively. The loss function of our MF-CNN can be formulated as


As (4.4) indicates, the loss function of the MF-CNN is the weighted sum of and , which are the -norm training losses of MC-subnet and QE-subnet, respectively. We divide the training into 2 steps. In the first step, we set , considering that and generated by MC-subnet are the basis of the following QE-subnet, and thus the convergence of MC-subnet is the primary target. After the convergence of is observed, we set to minimize the MSE between and . Finally, the MF-CNN model can be trained for video quality enhancement.
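The two-step schedule described above can be summarized compactly. The notation below is assumed for illustration: the symbols $L_{MF}$, $L_{MC}$, $L_{QE}$, $a$ and $b$ are placeholders and may differ from those used in (4.4). Only the weighted-sum structure and the order of the two steps are taken from the text.

```latex
\begin{align*}
  L_{MF} &= a \cdot L_{MC} + b \cdot L_{QE}, \\
  \text{step 1:} &\quad a \gg b \quad \text{(the MC-subnet loss dominates until it converges)}, \\
  \text{step 2:} &\quad a \ll b \quad \text{(the QE-subnet loss dominates afterwards)}.
\end{align*}
```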

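| Approach | QP | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|---|
| MFQE 2.0 | 22 | 98.01 | 98.09 | 98.05 |
| MFQE 2.0 | 27 | 97.65 | 98.27 | 97.96 |
| MFQE 2.0 | 32 | 97.92 | 97.88 | 97.90 |
| MFQE 2.0 | 37 | 97.25 | 97.47 | 97.36 |
| MFQE 2.0 | 42 | 98.14 | 97.10 | 97.62 |
| MFQE 1.0 | 37 | 90.68 | 92.11 | 91.09 |
| MFQE 1.0 | 42 | 93.98 | 90.86 | 92.23 |

TABLE IV: Performance of our PQF detector on test sequences.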
Fig. 10: PSNR (dB) and SSIM of PQFs and non-PQFs in test sequences.

5 Experiments

5.1 Settings

In this section, experimental results are presented to validate the effectiveness of our MFQE 2.0 approach. Note that our MFQE 2.0 approach is called MFQE in this paper, while the MFQE approach of our conference paper [30] is referred to as MFQE 1.0 for comparison. In our database, apart from the 18 common test sequences of the Joint Collaborative Team on Video Coding (JCT-VC) [33], the other 142 sequences are randomly divided into a non-overlapping training set (106 sequences) and validation set (36 sequences). We compress all 160 sequences with HM16.5 under the Low-Delay configuration, setting the Quantization Parameter (QP) to 22, 27, 32, 37 and 42, respectively.
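The random, non-overlapping split described above can be sketched as follows. The sequence names and the seed are placeholders; the actual sequence assignment used in the paper is not given here.

```python
import random


def split_sequences(names, n_train, seed=0):
    """Randomly split sequence names into non-overlapping training and validation sets."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder names standing in for the 142 non-test sequences.
pool = [f"seq_{i:03d}" for i in range(142)]
train, val = split_sequences(pool, n_train=106)
print(len(train), len(val))  # 106 36
```

Fixing the seed keeps the split reproducible across runs, which matters when several models are retrained on the same training set for comparison.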

For the BiLSTM-based PQF detector, the LSTM length is 38, and the hyper-parameter of (3) is set to 6 in post-processing (this value should be adjusted according to the compression standard and configuration). This is because the maximal separation between two nearest PQFs in our database is 6 frames. Before training the MF-CNN, the raw and compressed sequences are segmented into patches as the training samples. The batch size is set to 64. We apply the Adam algorithm [55] with initial learning rate as to minimize the loss function (4.4). It is worth mentioning that the MC-subnet may be unable to converge if the initial learning rate is too large, e.g., . For the QE-subnet, we set and in (4.4) at first to make the MC-subnet converge. After the convergence, we set and , so that the QE-subnet can converge faster.

| Class | Sequence | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|---|
| A | Traffic | 100.00 | 100.00 | 100.00 |
| A | PeopleOnStreet | 75.00 | 100.00 | 85.71 |
| B | Kimono | 98.33 | 100.00 | 99.16 |
| B | ParkScene | 100.00 | 100.00 | 100.00 |
| B | Cactus | 100.00 | 100.00 | 100.00 |
| B | BQTerrace | 100.00 | 96.75 | 98.35 |
| B | BasketballDrive | 97.64 | 97.64 | 97.64 |
| C | RaceHorses | 100.00 | 94.87 | 97.37 |
| C | BQMall | 98.00 | 98.00 | 98.00 |
| C | PartyScene | 100.00 | 99.20 | 99.60 |
| C | BasketballDrill | 100.00 | 92.64 | 96.18 |
| D | RaceHorses | 92.50 | 96.10 | 94.27 |
| D | BQSquare | 100.00 | 86.63 | 92.84 |
| D | BlowingBubbles | 100.00 | 99.20 | 99.60 |
| D | BasketballPass | 96.90 | 95.42 | 96.15 |
| E | FourPeople | 96.10 | 99.33 | 97.69 |
| E | Johnny | 98.68 | 98.68 | 98.68 |
| E | KristenAndSara | 97.39 | 100.00 | 98.68 |
| | Average | 97.25 | 97.47 | 97.36 |

TABLE V: Performance of our PQF detector on test sequences at QP = 37.
Fig. 11: Rate-distortion curves of four test sequences.

5.2 Performance of the PQF detector

The performance of PQF detection is critical, since it is the first step of our MFQE approach. Thus, we evaluate the performance of our BiLSTM-based approach in PQF detection. For evaluation, we measure the precision, recall and F1-score of PQF detection over all 18 test sequences compressed at five QPs (22, 27, 32, 37 and 42). The average results are shown in Table IV. In this table, we also list the results of PQF detection by the SVM-based approach of MFQE 1.0, as reported in [30]. Note that only the results at two QPs (37 and 42) are reported in [30].

We can see from Table IV that the proposed BiLSTM-based PQF detector in MFQE 2.0 performs well in terms of precision, recall and F1-score. For example, at QP = 37, the average precision, recall and F1-score of our BiLSTM-based PQF detector are 97.25%, 97.47% and 97.36%, considerably better than those of the SVM-based approach in MFQE 1.0 (90.68%, 92.11% and 91.09%). More importantly, the PQF detection of our approach is robust across all 5 QPs, since the average values of precision, recall and F1-score are all above 97%. In addition, Table V shows the performance of our BiLSTM-based PQF detector on each of the 18 test sequences compressed at QP = 37. As seen in this table, our PQF detector achieves high performance for almost all sequences; the clearest exception is the precision on PeopleOnStreet (75.00%). In conclusion, the effectiveness of our BiLSTM-based PQF detector is validated, laying a firm foundation for our MFQE approach.
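For reference, the three detection metrics can be computed from per-frame labels as in the minimal sketch below. The label convention (1 = PQF, 0 = non-PQF) and the toy labels are assumptions for illustration and are not data from the paper.

```python
def detection_scores(true_labels, predicted_labels):
    """Precision, recall and F1-score (in %) for binary PQF labels (1 = PQF)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # PQFs correctly detected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-PQFs flagged as PQFs
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # PQFs that were missed
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 3 true PQFs, all found, plus 1 false alarm.
truth = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
p, r, f1 = detection_scores(truth, pred)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 75.00 100.00 85.71
```

The toy case happens to give the same precision and recall as the PeopleOnStreet row of Table V, and the harmonic mean reproduces that row's F1-score.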

5.3 Performance of our MFQE approach

In this section, we evaluate the quality enhancement performance of our MFQE approach in terms of ΔPSNR, which measures the PSNR gap between the enhanced and original compressed sequences. In addition, the structural similarity (SSIM) index is also evaluated. Then, the performance of our MFQE approach is compared with that of AR-CNN [17], DnCNN [20], Li et al. [21], DCAD [35] and DS-CNN [25]. Among them, AR-CNN, DnCNN and Li et al. are the latest quality enhancement approaches for compressed images, while DCAD and DS-CNN are the state-of-the-art video quality enhancement approaches. For fair comparison, all compared approaches are retrained over our training set, the same as our MFQE approach.
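The PSNR gap used as the main metric can be sketched as follows. The helper names and the 8-bit peak value of 255 are assumptions for illustration, and the frames are toy values rather than video data.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(distorted):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def delta_psnr(raw, compressed, enhanced):
    """PSNR gain of the enhanced frame over the compressed frame, both measured against raw."""
    return psnr(raw, enhanced) - psnr(raw, compressed)


# Toy 4-pixel frames (illustrative values only).
raw = [100, 120, 140, 160]
compressed = [96, 124, 136, 164]  # error of 4 on every pixel
enhanced = [98, 122, 138, 162]    # error of 2 on every pixel
print(round(delta_psnr(raw, compressed, enhanced), 2))  # 6.02
```

Halving the pixel error divides the MSE by four, which corresponds to a gain of about 6 dB; the gains reported in this section are well under 1 dB, i.e., far smaller reductions in MSE.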

Quality enhancement on non-PQFs. Our MFQE approach mainly focuses on enhancing the quality of non-PQFs using the neighboring multi-frame information. Therefore, we first assess the quality enhancement of non-PQFs. Fig. 10 shows the PSNR and SSIM results averaged over PQFs and non-PQFs of all 18 test sequences compressed at 4 different QPs. As shown, our MFQE approach significantly outperforms the other approaches on non-PQF enhancement. The average improvement on non-PQF quality is 0.629 dB in PSNR and 0.0070 in SSIM, while that of the second-best approach is 0.297 dB in PSNR and 0.0039 in SSIM. We can further see from Fig. 10 that our MFQE approach achieves a considerably larger PSNR improvement for non-PQFs than for PQFs. By contrast, for the compared approaches, the PSNR improvement of non-PQFs is similar to or even less than that of PQFs. In short, the above results validate the effectiveness of our MFQE approach in enhancing the quality of non-PQFs.

Fig. 12: Averaged SD and PVD of test sequences.

Overall quality enhancement. Table VII presents the PSNR and SSIM improvement, averaged over all frames of each test sequence. As shown in this table, our MFQE approach consistently outperforms all compared approaches. Specifically, at QP = 37, the highest PSNR improvement of our MFQE approach reaches 0.867 dB, i.e., for sequence PeopleOnStreet. The average PSNR improvement of our MFQE approach is 0.559 dB, which is higher than that of MFQE 1.0 (0.356 dB), Li et al. (0.299 dB), DCAD (0.322 dB) and DS-CNN (0.327 dB). The margin is even larger when compared with AR-CNN [17] and DnCNN [20]. At other QPs (22, 27, 32 and 42), our MFQE approach still consistently outperforms the state-of-the-art video quality enhancement approaches DCAD and DS-CNN. Similar improvement can be found for SSIM in Table VII. This demonstrates the robustness of our MFQE approach in enhancing video quality. This is mainly attributed to the significant improvement in the quality of non-PQFs, which constitute the majority of compressed video frames.

| Approach | WQXGA (fps) | 1080p (fps) | 480p (fps) | 240p (fps) | 720p (fps) | Parameters |
|---|---|---|---|---|---|---|
| DS-CNN | 0.57 | 1.12 | 5.92 | 19.38 | 2.54 | 1,344,449 |
| MFQE 1.0 | 0.36 | 0.73 | 3.83 | 12.55 | 1.63 | 1,787,547 |
| MFQE 2.0 | 0.79 | 1.61 | 8.35 | 25.29 | 3.66 | 255,422 |

TABLE VI: Test speed and parameters of different approaches.
Fig. 13: PSNR curves of HEVC baseline and our MFQE approach.
Each cell gives ΔPSNR (dB) / ΔSSIM.

| QP | Class | Sequence | AR-CNN [17]* | DnCNN [20] | Li et al. [21] | DCAD [35] | DS-CNN [25] | MFQE 1.0 | MFQE 2.0 |
|---|---|---|---|---|---|---|---|---|---|
| 37 | A | Traffic | 0.241 / 0.0025 | 0.237 / 0.0027 | 0.291 / 0.0030 | 0.286 / 0.0032 | 0.299 / 0.0031 | 0.489 / 0.0047 | 0.580 / 0.0052 |
| 37 | A | PeopleOnStreet | 0.346 / 0.0038 | 0.414 / 0.0045 | 0.482 / 0.0050 | 0.481 / 0.0052 | 0.465 / 0.0045 | 0.772 / 0.0078 | 0.867 / 0.0087 |
| 37 | B | Kimono | 0.219 / 0.0028 | 0.244 / 0.0032 | 0.187 / 0.0034 | 0.279 / 0.0034 | 0.280 / 0.0031 | 0.472 / 0.0050 | 0.557 / 0.0053 |
| 37 | B | ParkScene | 0.135 / 0.0022 | 0.141 / 0.0026 | 0.150 / 0.0027 | 0.139 / 0.0028 | 0.188 / 0.0028 | 0.401 / 0.0058 | 0.477 / 0.0064 |
| 37 | B | Cactus | 0.190 / 0.0022 | 0.195 / 0.0025 | 0.232 / 0.0030 | 0.243 / 0.0032 | 0.259 / 0.0029 | 0.422 / 0.0050 | 0.497 / 0.0055 |
| 37 | B | BQTerrace | 0.195 / 0.0018 | 0.201 / 0.0022 | 0.249 / 0.0029 | 0.259 / 0.0031 | 0.298 / 0.0028 | 0.294 / 0.0030 | 0.407 / 0.0040 |
| 37 | B | BasketballDrive | 0.228 / 0.0026 | 0.251 / 0.0027 | 0.296 / 0.0032 | 0.285 / 0.0032 | 0.317 / 0.0031 | 0.397 / 0.0041 | 0.446 / 0.0043 |
| 37 | C | RaceHorses | 0.218 / 0.0023 | 0.252 / 0.0035 | 0.275 / 0.0036 | 0.261 / 0.0035 | 0.305 / 0.0034 | 0.340 / 0.0036 | 0.400 / 0.0049 |
| 37 | C | BQMall | 0.275 / 0.0036 | 0.280 / 0.0037 | 0.325 / 0.0044 | 0.319 / 0.0045 | 0.361 / 0.0044 | 0.512 / 0.0060 | 0.618 / 0.0069 |
| 37 | C | PartyScene | 0.107 / 0.0027 | 0.131 / 0.0033 | 0.130 / 0.0033 | 0.145 / 0.0037 | 0.191 / 0.0041 | 0.229 / 0.0056 | 0.374 / 0.0077 |
| 37 | C | BasketballDrill | 0.247 / 0.0031 | 0.331 / 0.0042 | 0.376 / 0.0049 | 0.366 / 0.0047 | 0.387 / 0.0042 | 0.491 / 0.0056 | 0.593 / 0.0067 |
| 37 | D | RaceHorses | 0.267 / 0.0038 | 0.311 / 0.0048 | 0.328 / 0.0050 | 0.317 / 0.0051 | 0.358 / 0.0048 | 0.394 / 0.0076 | 0.594 / 0.0091 |
| 37 | D | BQSquare | 0.080 / 0.0009 | 0.129 / 0.0017 | 0.085 / 0.0016 | 0.177 / 0.0028 | 0.190 / 0.0025 | 0.005 / 0.0010 | 0.313 / 0.0039 |
| 37 | D | BlowingBubbles | 0.164 / 0.0030 | 0.184 / 0.0043 | 0.207 / 0.0048 | 0.194 / 0.0048 | 0.243 / 0.0051 | 0.396 / 0.0089 | 0.509 / 0.0113 |
| 37 | D | BasketballPass | 0.259 / 0.0037 | 0.307 / 0.0045 | 0.343 / 0.0050 | 0.332 / 0.0050 | 0.375 / 0.0048 | 0.484 / 0.0085 | 0.724 / 0.0096 |
| 37 | E | FourPeople | 0.373 / 0.0026 | 0.388 / 0.0029 | 0.448 / 0.0033 | 0.486 / 0.0035 | 0.481 / 0.0032 | 0.664 / 0.0044 | 0.812 / 0.0047 |
| 37 | E | Johnny | 0.247 / 0.0009 | 0.315 / 0.0020 | 0.397 / 0.0024 | 0.390 / 0.0024 | 0.389 / 0.0020 | 0.545 / 0.0028 | 0.556 / 0.0030 |
| 37 | E | KristenAndSara | 0.409 / 0.0021 | 0.421 / 0.0025 | 0.485 / 0.0028 | 0.505 / 0.0029 | 0.500 / 0.0026 | 0.655 / 0.0034 | 0.738 / 0.0037 |
| 37 | | Average | 0.233 / 0.0026 | 0.263 / 0.0032 | 0.299 / 0.0036 | 0.322 / 0.0037 | 0.327 / 0.0035 | 0.442 / 0.0052 | 0.559 / 0.0062 |
| 42 | | Average | 0.285 / 0.0055 | 0.220 / 0.0044 | 0.310 / 0.0060 | 0.324 / 0.0061 | 0.330 / 0.0058 | 0.454 / 0.0079 | 0.573 / 0.0095 |
| 32 | | Average | 0.176 / 0.0011 | 0.255 / 0.0019 | 0.275 / 0.0020 | 0.316 / 0.0023 | 0.291 / 0.0020 | 0.428 / 0.0031 | 0.488 / 0.0035 |
| 27 | | Average | 0.177 / 0.0007 | 0.272 / 0.0012 | 0.295 / 0.0013 | 0.316 / 0.0014 | 0.284 / 0.0012 | 0.392 / 0.0018 | 0.464 / 0.0020 |
| 22 | | Average | 0.142 / 0.0004 | 0.287 / 0.0008 | 0.300 / 0.0008 | 0.313 / 0.0008 | 0.276 / 0.0007 | 0.326 / 0.0009 | 0.414 / 0.0011 |

* All compared approaches in this paper are retrained over our training set, the same as MFQE 2.0.

TABLE VII: Overall comparison for ΔPSNR (dB) and ΔSSIM over test sequences at five QPs.
| Class | Sequence | AR-CNN | DnCNN | Li et al. | DCAD | DS-CNN | MFQE 1.0 | MFQE 2.0 |
|---|---|---|---|---|---|---|---|---|
| A | Traffic | 7.45 | 8.52 | 9.70 | 9.81 | 9.64 | 14.38 | 16.24 |
| A | PeopleOnStreet | 7.00 | 8.28 | 9.66 | 9.60 | 9.23 | 13.33 | 14.19 |
| B | Kimono | 6.05 | 7.31 | 8.50 | 8.28 | 8.38 | 12.50 | 13.40 |
| B | ParkScene | 4.47 | 5.05 | 5.36 | 5.55 | 6.23 | 12.21 | 13.56 |
| B | Cactus | 6.13 | 6.75 | 8.18 | 8.50 | 8.43 | 12.32 | 14.27 |
| B | BQTerrace | 6.82 | 7.56 | 8.80 | 9.79 | 9.89 | 11.08 | 14.52 |
| B | BasketballDrive | 5.85 | 7.32 | 8.62 | 8.79 | 8.72 | 10.60 | 12.04 |
| C | RaceHorses | 5.05 | 6.74 | 7.07 | 7.62 | 7.48 | 8.83 | 9.75 |
| C | BQMall | 5.59 | 7.00 | 7.79 | 8.53 | 8.13 | 11.31 | 13.30 |
| C | PartyScene | 1.88 | 4.02 | 3.77 | 4.72 | 4.46 | 6.59 | 10.45 |
| C | BasketballDrill | 4.67 | 8.02 | 8.66 | 9.71 | 8.82 | 10.64 | 12.23 |
| D | RaceHorses | 5.60 | 7.21 | 7.66 | 8.06 | 7.84 | 10.59 | 11.72 |
| D | BQSquare | 0.67 | 4.58 | 3.58 | 5.98 | 4.18 | 2.44 | 9.39 |
| D | BlowingBubbles | 3.17 | 5.10 | 5.41 | 6.00 | 5.87 | 10.92 | 14.34 |
| D | BasketballPass | 5.11 | 7.04 | 7.78 | 8.27 | 7.94 | 11.82 | 13.35 |
| E | FourPeople | 8.42 | 10.12 | 11.46 | 12.09 | 11.58 | 15.06 | 17.44 |
| E | Johnny | 7.66 | 10.90 | 13.04 | 13.58 | 12.65 | 16.18 | 17.88 |
| E | KristenAndSara | 8.94 | 10.64 | 12.04 | 12.81 | 11.92 | 15.14 | 17.73 |
| | Average | 5.58 | 7.34 | 8.17 | 8.76 | 8.41 | 11.44 | 13.66 |

TABLE VIII: Overall BD-BR reduction (%) of test sequences with the HEVC baseline as an anchor. Calculated at QP = 22, 27, 32, 37 and 42.

Rate-distortion performance. We further evaluate the rate-distortion performance of our MFQE approach by comparing it with other approaches. First, Fig. 11 shows the rate-distortion curves of our approach and other state-of-the-art approaches over four selected sequences. Note that only the results of the DCAD and DS-CNN approaches are plotted in this figure, since they perform better than the other compared approaches. We can see from Fig. 11 that our MFQE approach performs better than the other approaches in rate-distortion performance. Then, we quantify the rate-distortion performance by evaluating the BD-bitrate (BD-BR) reduction, which is calculated over the PSNR results of five QPs (22, 27, 32, 37 and 42). The results are presented in Table VIII. As can be seen, the BD-BR reduction of our MFQE approach is 13.66% on average, while that of the second-best approach DCAD is only 8.76% on average. In general, the quality enhancement of our MFQE approach is equivalent to improving rate-distortion performance.
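To illustrate what a BD-BR figure measures, the sketch below computes the average bitrate difference between two rate-distortion curves over their common PSNR range. It is a simplified variant: the standard Bjøntegaard calculation fits a polynomial to each curve, whereas this sketch interpolates log-rate piecewise linearly between the measured points. The function names and the toy rate-distortion points are assumptions for illustration, not results from the paper.

```python
import math


def _integrate(points, lo, hi):
    """Integrate the piecewise-linear curve through sorted (x, y) points over [lo, hi]."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        a, b = max(x0, lo), min(x1, hi)
        if a >= b:
            continue  # segment lies outside the integration range
        slope = (y1 - y0) / (x1 - x0)
        ya = y0 + slope * (a - x0)
        yb = y0 + slope * (b - x0)
        total += 0.5 * (ya + yb) * (b - a)  # trapezoid area on the clipped segment
    return total


def bd_rate(anchor_rates, anchor_psnrs, test_rates, test_psnrs):
    """Average bitrate change (%) of the test curve relative to the anchor at equal PSNR.

    Negative values mean the test curve needs less bitrate for the same quality.
    PSNR values within each curve must be distinct.
    """
    anchor = sorted(zip(anchor_psnrs, (math.log10(r) for r in anchor_rates)))
    test = sorted(zip(test_psnrs, (math.log10(r) for r in test_rates)))
    lo = max(anchor[0][0], test[0][0])
    hi = min(anchor[-1][0], test[-1][0])
    if hi <= lo:
        raise ValueError("the two curves do not overlap in PSNR")
    avg_log_diff = (_integrate(test, lo, hi) - _integrate(anchor, lo, hi)) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0


# Toy curves: the test codec reaches the same PSNR with 10% less bitrate everywhere.
rates = [1000.0, 2000.0, 4000.0, 8000.0]
psnrs = [30.0, 33.0, 36.0, 39.0]
print(round(bd_rate(rates, psnrs, [0.9 * r for r in rates], psnrs), 2))  # -10.0
```

A BD-BR reduction as reported in Table VIII corresponds to the magnitude of such a negative value, with the HEVC baseline as the anchor curve.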

Quality fluctuation. Apart from compression artifacts, quality fluctuation in compressed video may also degrade the QoE [56, 57, 58]. Fortunately, our MFQE approach helps mitigate the quality fluctuation, because of its significant quality improvement on non-PQFs, as shown in Fig. 10. We evaluate the fluctuation of video quality in terms of the SD and PVD of the PSNR curves, which are introduced in Section 3. Fig. 12 shows the SD and PVD values averaged over all 18 test sequences, which are obtained from the quality enhancement approaches and the HEVC baseline. As shown in this figure, our MFQE approach succeeds in reducing the SD and PVD, whereas the five compared approaches enlarge the SD and PVD values over the HEVC baseline. The reason is that our MFQE approach achieves a considerably larger PSNR improvement for non-PQFs than for PQFs, thus reducing the quality gap between PQFs and non-PQFs. In addition, Fig. 13 shows the PSNR curves of two selected test sequences, for our MFQE approach and the HEVC baseline. It can be seen that the PSNR fluctuation of our MFQE approach is clearly smaller than that of the HEVC baseline. In summary, our approach is also capable of reducing the quality fluctuation of compressed video.
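The two fluctuation measures can be sketched on a per-frame PSNR curve as follows. The exact definitions are given in Section 3; here SD is taken as the population standard deviation, and PVD is assumed to be the average difference between each local peak and its nearest local valley, which may differ in detail from the definition used in the paper. The PSNR values are toy data.

```python
import statistics


def local_extrema(values):
    """Indices of strict local maxima (peaks) and minima (valleys) of a curve."""
    peaks, valleys = [], []
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            peaks.append(i)
        elif values[i] < values[i - 1] and values[i] < values[i + 1]:
            valleys.append(i)
    return peaks, valleys


def peak_valley_difference(psnr_curve):
    """Average gap between each peak and its nearest valley (assumed PVD definition)."""
    peaks, valleys = local_extrema(psnr_curve)
    if not peaks or not valleys:
        return 0.0
    gaps = []
    for p in peaks:
        nearest = min(valleys, key=lambda v: abs(v - p))
        gaps.append(psnr_curve[p] - psnr_curve[nearest])
    return sum(gaps) / len(gaps)


# Toy PSNR curve (dB): peaks at 32, valleys at 30.
curve = [30, 32, 30, 32, 30, 32, 30]
print(statistics.pstdev(curve))       # standard deviation of the curve
print(peak_valley_difference(curve))  # 2.0
```

Raising the valleys (non-PQFs) more than the peaks (PQFs) lowers both measures, which is the effect described above.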

Subjective quality performance. Fig. 15 shows the subjective quality performance on the sequences FourPeople at QP = 37, BasketballPass at QP = 37 and RaceHorses at QP = 42. It can be observed that our MFQE approach reduces the compression artifacts much more effectively than the five compared approaches. Specifically, severely distorted content, e.g., the cheek in FourPeople, the ball in BasketballPass and the horse’s feet in RaceHorses, can be finely restored by our MFQE approach with its multi-frame strategy. By contrast, such compression distortion can hardly be restored by the compared approaches, as they only use the single low-quality frame. Therefore, our MFQE approach also performs well in subjective quality enhancement.

Fig. 14: Overall PSNR (dB) of test sequences in ablation study.

Test speed. We evaluate the test speed of quality enhancement using a computer equipped with an Intel i7-8700 3.20 GHz CPU and a GeForce GTX 1080 Ti GPU. Specifically, we measure the average frames per second (fps) when testing video sequences at different resolutions. Note that the test set has been divided into 5 classes with different resolutions in [33]. The results averaged over sequences with different resolutions are reported in Table VI. As shown in this table, MFQE 2.0 has around 7 times fewer parameters than MFQE 1.0, achieving at least 2 times acceleration. Similarly, the parameters of MFQE 2.0 are around 5 times fewer than those of DS-CNN, such that MFQE 2.0 is at least 1.3 times faster. In short, MFQE 2.0 is efficient in video quality enhancement, and its efficiency is mainly due to its lightweight structure.
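The ratios quoted above can be checked directly against Table VI. The short script below copies the table values and computes, for each baseline, the parameter ratio and the smallest speed-up of MFQE 2.0 across resolutions; the variable and function names are illustrative.

```python
# Values copied from Table VI (test speed in fps, number of parameters).
fps = {
    "DS-CNN": {"WQXGA": 0.57, "1080p": 1.12, "480p": 5.92, "240p": 19.38, "720p": 2.54},
    "MFQE 1.0": {"WQXGA": 0.36, "1080p": 0.73, "480p": 3.83, "240p": 12.55, "720p": 1.63},
    "MFQE 2.0": {"WQXGA": 0.79, "1080p": 1.61, "480p": 8.35, "240p": 25.29, "720p": 3.66},
}
params = {"DS-CNN": 1_344_449, "MFQE 1.0": 1_787_547, "MFQE 2.0": 255_422}


def min_speedup(baseline, target="MFQE 2.0"):
    """Smallest fps ratio of target over baseline across all resolutions."""
    return min(fps[target][res] / fps[baseline][res] for res in fps[target])


for name in ("MFQE 1.0", "DS-CNN"):
    ratio = params[name] / params["MFQE 2.0"]
    print(f"{name}: {ratio:.1f}x parameters, at least {min_speedup(name):.2f}x slower")
```

In both comparisons the smallest speed-up occurs at 240p, the lowest resolution in the table.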

5.4 Ablation study

Fig. 15: Subjective quality performance on Fourpeople at QP = 37, BasketballPass at QP = 37 and RaceHorses at QP = 42.

PQF detector. In this section, we validate the necessity and effectiveness of utilizing PQFs to enhance the quality of non-PQFs. To this end, we retrain the MF-CNN model of our MFQE approach to enhance non-PQFs with the help of adjacent frames, instead of PQFs. The MF-CNN network and experiment settings are all consistent with those in Sections 4.3 and 5.1. The retrained model is denoted by MFQE_NF (i.e., MFQE with Neighbouring Frames), and the experimental results are shown in Fig. 14, which are obtained by averaging over all 18 test sequences compressed at QP = 37. We can see that our approach without PQFs achieves a PSNR gain of only 0.254 dB. By contrast, as mentioned above, our approach with PQFs achieves a PSNR gain of 0.559 dB. Moreover, as validated in Section 5.3, our MFQE approach obtains considerably higher enhancement on non-PQFs than the single-frame approaches. In short, the above ablation study demonstrates the necessity and effectiveness of utilizing PQFs in the video quality enhancement task.

| Seq. | AR-CNN | DnCNN | Li et al. | DCAD | DS-CNN | MFQE 1.0 | MFQE 2.0 |
|---|---|---|---|---|---|---|---|
| 1 | 0.280 | 0.359 | 0.459 | 0.510 | 0.415 | 0.655 | 0.750 |
| 2 | 0.266 | 0.303 | 0.387 | 0.399 | 0.339 | 0.492 | 0.553 |
| 3 | 0.315 | 0.365 | 0.422 | 0.439 | 0.394 | 0.629 | 0.710 |
| 4 | 0.321 | 0.312 | 0.401 | 0.421 | 0.388 | 0.599 | 0.694 |
| 5 | 0.237 | 0.229 | 0.287 | 0.311 | 0.290 | 0.414 | 0.444 |
| 6 | 0.261 | 0.312 | 0.392 | 0.373 | 0.343 | 0.659 | 0.697 |
| 7 | 0.346 | 0.414 | 0.482 | 0.481 | 0.465 | 0.772 | 0.867 |
| 8 | 0.219 | 0.244 | 0.187 | 0.279 | 0.280 | 0.472 | 0.557 |
| 9 | 0.267 | 0.311 | 0.328 | 0.317 | 0.358 | 0.394 | 0.594 |
| 10 | 0.259 | 0.307 | 0.343 | 0.332 | 0.375 | 0.484 | 0.724 |
| Ave. | 0.277 | 0.316 | 0.369 | 0.386 | 0.365 | 0.557 | 0.659 |

1: TunnelFlag, 2: BarScene, 3: Vidyo1, 4: Vidyo3, 5: Vidyo4, 6: MaD, 7: PeopleOnStreet, 8: Kimono, 9: RaceHorses, 10: BasketballPass

TABLE IX: Overall PSNR (dB) of 10 test sequences at QP = 37.

Multi-scale and dense connection strategy. We further validate the effectiveness of the multi-scale feature extraction strategy and the densely connected structure in enhancing video quality. First, we ablate all dense connections in the QE-subnet of our MFQE approach. In addition, we increase the filter number of C11 from 32 to 50, so that the number of trainable parameters is maintained for fair comparison. The corresponding retrained model is denoted by MFQE_ND (i.e., MFQE with No Dense connection). Second, we ablate the multi-scale structure in the QE-subnet. Based on the dense-ablated network above, we fix all kernel sizes of the feature extraction component to . Other parts of the MFQE approach and the experiment settings are all the same as those in Sections 4 and 5.1. Accordingly, the retrained model is denoted by MFQE_GC (i.e., MFQE with General CNN). Fig. 14 shows the ablation results, which are also averaged over all 18 test sequences at QP = 37. As seen in this figure, the PSNR improvement decreases from 0.559 dB to 0.297 dB (i.e., degradation) when disabling the dense connections, and then it reduces to 0.276 dB (i.e., degradation) when further ablating the multi-scale structure. This indicates the effectiveness of our multi-scale strategy and the densely connected structure.
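The relative loss implied by these ablation numbers can be worked out directly. The short calculation below expresses each ablated result as a percentage drop relative to the full model; this is one possible reading of "degradation", computed here from the reported PSNR gains, and the variable names are illustrative.

```python
# PSNR gains (dB) at QP = 37 as reported in the ablation study above.
full_model = 0.559
no_dense = 0.297                # MFQE_ND
no_dense_no_multiscale = 0.276  # MFQE_GC


def relative_drop(baseline, ablated):
    """Relative loss of PSNR gain, in percent of the baseline gain."""
    return 100.0 * (baseline - ablated) / baseline


print(round(relative_drop(full_model, no_dense), 1))                # 46.9
print(round(relative_drop(full_model, no_dense_no_multiscale), 1))  # 50.6
```

Under this reading, removing the dense connections accounts for most of the loss, and removing the multi-scale structure adds a few further percentage points.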

Enlarged database. One of the contributions in this paper is that we enlarge our database from 70 to 160 uncompressed video sequences. Here, we verify the effectiveness of the enlarged database over our previous database [30]. Specifically, we test the performance of our MFQE approach trained over the database in [30]. Then, the performance is evaluated on all 18 test sequences at QP = 37. The retrained model with its corresponding test result is represented by MFQE_PD (i.e., the MFQE with Previous Database) in Fig. 14. We can see that MFQE 2.0 achieves substantial improvement on quality enhancement compared with MFQE_PD. In particular, the performance of MFQE 2.0 improves PSNR from 0.531 dB to 0.641 dB on average. Hence, our enlarged database is effective in improving video quality enhancement performance.

5.5 Generalization ability of our MFQE approach

Transfer to H.264. We verify the generalization ability of our MFQE approach for video sequences compressed by other standards. To this end, we transfer our MFQE approach to H.264 compressed sequences. The transfer is achieved by fine-tuning the MF-CNN model over the same training sequences, but compressed by H.264 (the JM encoder with the low-delay P configuration) at QP = 37. Then, we test the performance of our approach and the other approaches over all 18 test sequences, which are also compressed by H.264 at QP = 37. The average PSNR increase over the test sequences is 0.463 dB, indicating the generalization ability of our MFQE approach across different compression standards.

Performance on other sequences. It is worth mentioning that the test set in [30] is different from that in this paper. In our previous work [30], 10 test sequences are randomly selected from the previous database of 70 videos. In this paper, our 18 test sequences are those selected by the Joint Collaborative Team on Video Coding (JCT-VC) [33], which form a commonly used test set for video compression. For fair comparison, we test the performance of our MFQE 2.0 and all compared approaches over the previous test set. The experimental results are presented in Table IX. Note that 4 of the 10 test sequences overlap with the 18 test sequences of the above experiments. We can see from Table IX that our approach achieves a 0.659 dB improvement in PSNR and again outperforms the other approaches. In this table, the results of the compared approaches are also better than those reported in [30] and in their own papers. This is due to retraining over the enlarged database. In conclusion, our MFQE approach has high generalization ability over different test sequences.

6 Conclusion

In this paper, we have proposed a CNN-based MFQE approach to enhance the quality of compressed video by reducing compression artifacts. Differing from conventional single-frame quality enhancement, our MFQE approach improves the quality of one frame by utilizing its nearest PQFs, which have higher quality. To this end, we developed a BiLSTM-based PQF detector to classify PQFs and non-PQFs in compressed video. Then, we proposed a novel CNN framework, called MF-CNN, to enhance the quality of each non-PQF. Specifically, our MF-CNN framework consists of two subnets, i.e., the MC-subnet and the QE-subnet. First, the MC-subnet compensates motion between PQFs and non-PQFs. Subsequently, the QE-subnet enhances the quality of each non-PQF, taking the current non-PQF and the nearest compensated PQFs as input. Finally, extensive experimental results showed that our MFQE approach significantly improves the quality of non-PQFs and is superior to other state-of-the-art approaches. Consequently, the overall quality can be significantly enhanced, with considerably higher quality and less quality fluctuation than other approaches.

There are two possible research directions for future work. (1) Our work in this paper only takes PSNR and SSIM as the objective metrics to be enhanced. Future work may further incorporate perceptual quality metrics into our approach to improve the Quality of Experience (QoE) in video quality enhancement. (2) Our work mainly focuses on quality enhancement at the decoder side. To further improve the performance of quality enhancement, information from the encoder, such as the partition of coding units, can be utilized. This is a promising direction for future work.


  • [1] I. Cisco Systems, “Cisco visual networking index: Global mobile data traffic forecast update,”
  • [2] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, “Study of subjective and objective quality assessment of video,” IEEE transactions on image processing, vol. 19, no. 6, pp. 1427–1441, 2010.
  • [3] S. Li, M. Xu, X. Deng, and Z. Wang, “Weight-based R-λ rate control for perceptual HEVC coding on conversational videos,” Signal Processing: Image Communication, vol. 38, pp. 127–140, 2015.
  • [4] T. K. Tan, R. Weerakkody, M. Mrak, N. Ramzan, V. Baroncini, J.-R. Ohm, and G. J. Sullivan, “Video quality evaluation methodology and verification testing of hevc compression performance,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 1, pp. 76–90, 2016.
  • [5] C. G. Bampis, Z. Li, A. K. Moorthy, I. Katsavounidis, A. Aaron, and A. C. Bovik, “Study of temporal effects on subjective video quality of experience,” IEEE Transactions on Image Processing, vol. 26, no. 11, pp. 5217–5231, 2017.
  • [6] R. Yang, M. Xu, Z. Wang, Y. Duan, and X. Tao, “Saliency-guided complexity control for hevc decoding,” IEEE Transactions on Broadcasting, 2018.
  • [7] M. D. Gupta, S. Rajaram, N. Petrovic, and T. S. Huang, “Restoration and recognition in a loop,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.   IEEE, 2005, pp. 638–644.
  • [8] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar, “Simultaneous super-resolution and feature extraction for recognition of low-resolution faces,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.   IEEE, 2008, pp. 1–8.
  • [9] M. Nishiyama, H. Takeshima, J. Shotton, T. Kozakaya, and O. Yamaguchi, “Facial deblur inference to improve recognition of blurred faces,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 1115–1122.
  • [10] H. Zhang, J. Yang, Y. Zhang, N. M. Nasrabadi, and T. S. Huang, “Close the loop: Joint blind image restoration and recognition with sparse representation prior,” in Computer Vision (ICCV), 2011 IEEE International Conference on.   IEEE, 2011, pp. 770–777.
  • [11] A.-C. Liew and H. Yan, “Blocking artifacts suppression in block-coded images using overcomplete wavelet representation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 4, pp. 450–461, 2004.
  • [12] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images,” IEEE Transactions on Image Processing, vol. 16, no. 5, pp. 1395–1411, 2007.
  • [13] C. Wang, J. Zhou, and S. Liu, “Adaptive non-local means filter for image deblocking,” Signal Processing: Image Communication, vol. 28, no. 5, pp. 522–530, 2013.
  • [14] J. Jancsary, S. Nowozin, and C. Rother, “Loss-specific training of non-parametric image restoration models: A new state of the art,” in Proceedings of the European Conference on Computer Vision (ECCV).   Springer, 2012, pp. 112–125.
  • [15] C. Jung, L. Jiao, H. Qi, and T. Sun, “Image deblocking via sparse representation,” Image Communication, vol. 27, no. 6, pp. 663–677, 2012.
  • [16] H. Chang, M. K. Ng, and T. Zeng, “Reducing artifacts in JPEG decompression via a learned dictionary,” IEEE Transactions on Signal Processing, vol. 62, no. 3, pp. 718–728, 2014.
  • [17] C. Dong, Y. Deng, C. Change Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 576–584.
  • [18] J. Guo and H. Chao, “Building dual-domain representations for compression artifacts reduction,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 628–644.
  • [19] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang, “D3: Deep dual-domain based fast restoration of JPEG-compressed images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2764–2772.
  • [20] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [21] K. Li, B. Bare, and B. Yan, “An efficient deep convolutional neural networks model for compressed image deblocking,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2017, pp. 1320–1325.
  • [22] L. Cavigelli, P. Hager, and L. Benini, “CAS-CNN: A deep convolutional neural network for image compression artifact suppression,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2017, pp. 752–759.
  • [23] Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4539–4547.
  • [24] R. Yang, M. Xu, and Z. Wang, “Decoder-side HEVC quality enhancement with scalable convolutional neural network,” in Multimedia and Expo (ICME), 2017 IEEE International Conference on.   IEEE, 2017, pp. 817–822.
  • [25] R. Yang, M. Xu, T. Liu, Z. Wang, and Z. Guan, “Enhancing quality for hevc compressed videos,” IEEE Transactions on Circuits and Systems for Video Technology, 2018.
  • [26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [27] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos, “Video super-resolution with convolutional neural networks,” IEEE Transactions on Computational Imaging, vol. 2, no. 2, pp. 109–122, 2016.
  • [28] D. Li and Z. Wang, “Video super-resolution via motion compensation and deep residual learning,” IEEE Transactions on Computational Imaging, vol. PP, no. 99, pp. 1–1, 2017.
  • [29] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, “Real-time video super-resolution with spatio-temporal networks and motion compensation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [30] R. Yang, M. Xu, Z. Wang, and T. Li, “Multi-frame quality enhancement for compressed video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6664–6673.
  • [31] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [32] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks.” in CVPR, vol. 1, no. 2, 2017, p. 3.
  • [33] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, “Comparison of the coding efficiency of video coding standards—including high efficiency video coding (hevc),” IEEE Transactions on circuits and systems for video technology, vol. 22, no. 12, pp. 1669–1684, 2012.
  • [34] Y. Dai, D. Liu, and F. Wu, “A convolutional neural network approach for post-processing in hevc intra coding,” in Proceedings of the International Conference on Multimedia Modeling (MMM).   Springer, 2017, pp. 28–39.
  • [35] T. Wang, M. Chen, and H. Chao, “A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC,” in Proceedings of the Data Compression Conference (DCC), 2017.
  • [36] F. Brandi, R. de Queiroz, and D. Mukherjee, “Super resolution of video using key frames,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS).   IEEE, 2008, pp. 1608–1611.
  • [37] B. C. Song, S.-C. Jeong, and Y. Choi, “Video super-resolution algorithm using bi-directional overlapped block motion compensation and on-the-fly dictionary training,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 3, pp. 274–285, 2011.
  • [38] Y. Huang, W. Wang, and L. Wang, “Video super-resolution via bidirectional recurrent convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 1015–1028, 2018.
  • [39] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295–307, 2016.
  • [40] A. Dosovitskiy, P. Fischery, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2758–2766.
  • [41] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of optical flow estimation with deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [42] O. Makansi, E. Ilg, and T. Brox, “End-to-end learning of video super-resolution with motion compensation,” in Proceedings of the German Conference on Pattern Recognition (GCPR), 2017, pp. 203–214.
  • [43] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.
  • [44] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia, “Detail-revealing deep video super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4472–4480.
  • [45] “video test media,”
  • [46] VQEG, “VQEG video datasets and organizations,”
  • [47] F. Bossen, “Common test conditions and software reference configurations,” in Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, 5th meeting, Jan. 2011.
  • [48] D. J. Le Gall, “The MPEG video compression algorithm,” Signal Processing: Image Communication, vol. 4, no. 2, pp. 129–140, 1992.
  • [49] R. Schafer and T. Sikora, “Digital video coding standards and their role in video communications,” Proceedings of the IEEE, vol. 83, no. 6, pp. 907–924, 1995.
  • [50] T. Sikora, “The MPEG-4 video standard verification model,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 1, pp. 19–31, 1997.
  • [51] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
  • [52] G. J. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
  • [53] S. Hochreiter and M. C. Mozer, A Discrete Probabilistic Memory Model for Discovering Dependencies in Time.   Springer Berlin Heidelberg, 2001.
  • [54] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5353–5360.
  • [55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.
  • [56] Z. He, Y. K. Kim, and S. K. Mitra, “Low-delay rate control for DCT video coding via ρ-domain source modeling,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 8, pp. 928–940, 2001.
  • [57] F. D. Vito and J. C. D. Martin, “PSNR control for GOP-level constant quality in H.264 video coding,” in Proceedings of the IEEE International Symposium on Signal Processing and Information Technology, 2005, pp. 612–617.
  • [58] S. Hu, H. Wang, and S. Kwong, “Adaptive quantization-parameter clip scheme for smooth quality in H.264/AVC,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1911–1919, 2012.