Lossy video compression is widely applied because it saves bit-rate effectively while preserving critical visual information. However, these two goals are contradictory, and it is non-trivial to optimize them jointly. Modern video compression standards such as High Efficiency Video Coding (HEVC) still suffer from various kinds of degradation caused by block-wise processing and quantization.
To remove these artifacts, an in-loop filter module consisting of a Deblocking Filter (DF) and Sample Adaptive Offset (SAO) is applied to suppress blocking and ringing artifacts. The in-loop filter not only enhances the quality of the reconstructed frames but also benefits the subsequent inter-coding procedure by providing high-quality reference frames. Much effort has been put into improving the quality of the reconstructed frames in the coding loop, and a series of works have been proposed based on handcrafted filters, Markov random fields, nonlocal filters, low-rank minimization, etc. However, these methods are built on shallow models and offer limited performance.
In recent years, deep learning has brought new progress to related fields, first of all image and video restoration in low-level vision, and has led to impressive performance gains. A series of milestone network architectures and basic blocks have been proposed, e.g., Super-Resolution Convolutional Neural Network (SRCNN), Very Deep Super-Resolution network (VDSR), Denoising Convolutional Neural Network (DnCNN), and Dual-domain Multi-scale Convolutional Neural Network (DMCNN) for compression artifact removal. The latest methods are more advanced and usually make use of residual learning, dense connections, or their combination, in a cascaded or recurrent manner. For example, Lim et al. proposed to cascade multiple residual blocks as the Enhanced Deep Super-Resolution Network (EDSR). Later on, Zhang et al. embedded dense connections into a residual network. Inspired by these developments, many deep-learning based in-loop filtering and post-processing methods have been proposed [10, 38, 20, 3], from the simplest cascaded CNN to the combination of residual learning and dense connections.
Besides the evolution of network architectures, the video coding scenario also provides rich contextual side information (side-info) that can improve the quality of the reconstructed frames. For example, the partition structure of the coding process implicitly reveals the structural complexity of local regions and indicates the relative quality loss after compression. For convenience, we classify side-info into intra-frame and inter-frame side-info according to whether it is inferred from adjacent frames. Inspired by the Kalman filter, Lu et al. proposed a Deep Kalman Filtering Network (DKFN) that takes the quantized prediction residual image extracted from the codec as another input. For the in-loop filter, useful side-info is also available in HEVC codecs. For example, He et al. proposed a post-processing network that takes the partition mask, inferred from the partition tree of HEVC, as side-info. In another work, an EDSR-like network takes the unfiltered and prediction frames as side information and is trained with weight normalization. For inter-frame side-info, it is intuitive to exploit temporal redundancy and obtain useful information from adjacent frames to benefit the processing of the current frame. In existing methods, the reference frames (the nearest adjacent frames or the peak quality frame) are warped by optical flow or by designed motion compensation modules and taken as another input to improve the quality of the current frame.
Although they achieve significant performance improvements over previous works, these methods still overlook issues in model design, coding context perception, and side-info utilization.
At the model design level, the most popular network architectures [48, 38] for in-loop filtering and low-level tasks combine the power of residual learning and dense connections by stacking several basic blocks. The channel dimension of the output features across blocks is usually compressed to keep the output feature compact and avoid introducing too many parameters. However, this compression also leads to information loss and limits the modeling capacity.
In video coding scenarios, temporal modeling differs from that in video restoration/enhancement tasks in two aspects. First, the video frames might be reordered depending on the coding configuration. Second, the quality of the reconstructed frames varies a lot. Previous works make use of temporal redundancy by taking the warped frame as input. This approach does not fully exploit the modeling capacity that lies in the complex dependencies of video frames in the coding scenario.
For side information utilization, some side-info is not considered closely with the coding context, and its potential is not fully explored. For example, the partition masks used in previous work are inferred only from the leaf nodes of the partition tree. In fact, the nodes on different levels of the partition tree can provide regional context information at different granularities.
In this paper, we aim to address the three issues mentioned above. Specifically, we develop a deep network with both progressive rethinking and collaborative learning mechanisms to improve the quality of the reconstructed intra-frames and inter-frames, respectively. The progressive rethinking mechanism improves the modeling capacity of the baseline network for in-loop filtering of intra-frames and inter-frames. Inspired by the human decision mechanism, a Progressive Rethinking Block (PRB) and its stacked Progressive Rethinking Network (PRN) are designed. They differ from typical cascaded deep networks, in which the feature dimension is reduced at the end of each basic block to generate a summary of past experiences. Our PRN instead rethinks progressively: the PRB introduces an additional inter-block connection that passes a high-dimensional informative feature across blocks so that the complete past memorized experiences can be reviewed. The collaborative learning mechanism tries to fully explore the potential of temporal modeling in the video coding scenario. It acts like collaboration among human beings, where information is exchanged and refined progressively, and there is usually a leader providing more guidance to the other partners. The current reconstructed frame interacts with the reference frames (the peak quality frame and the nearest adjacent frame) progressively at the feature level, so that they deeply complement each other's information. Furthermore, novel intra-frame and inter-frame side-info is designed for better context modeling. A coarse-to-fine partition map based on HEVC partition trees is built as the intra-frame side-info, and the warped features of the reference frames are offered as the inter-frame side-info.
In summary, our contributions are three-fold:
We design a Progressive Rethinking Block based on residual learning and dense connections. An additional inter-block connection is proposed to compensate for the information lost through dimension compression, which improves the modeling capacity for in-loop filters of both intra-frames and inter-frames.
We propose a Progressive Rethinking Recurrent Neural Network (PR-RNN) for collaborative learning to effectively utilize temporal redundancy in the video coding scenario. Motivated by the collaboration among human beings, we update the states of the current frame and the reference frames synchronously through progressive information sharing.
We exploit contextual side-info from HEVC codecs to better adapt to the coding scenario. We extract Multi-scale Mean value of CU (MM-CU) maps from the partition tree to guide the network restoration. By fusing MM-CU into the baseline network, we establish our Progressive Rethinking Convolutional Neural Network (PR-CNN) as an effective single-frame filter under the All-Intra (AI) configuration.
The remainder of the paper is organized as follows. Section II provides a brief review of related works. Section III introduces the methodology of our Progressive Rethinking Networks (PRN). Section IV provides the implementation details. Experimental results are shown in Section V. Finally, Section VI concludes the paper.
II Related Works
II-A Deep Learning Based Video Coding
Modern video coding standards like HEVC consist of multiple modules working together to compress the given videos. With the development of deep learning, researchers have begun to use the strong non-linear mapping capability of neural networks to substitute or enhance the original modules of the codecs.
Li et al. developed a fully-connected neural network for intra prediction (IPFCN). The IPFCN takes the neighbouring pixels as input to predict the pixel values of the current block. Hu et al. proposed a Progressive Spatial Recurrent Neural Network (PS-RNN) to progressively pass information from preceding contents to the blocks to be encoded.
Methods benefiting inter prediction have also been proposed from many aspects. Yan et al. proposed a Fractional Pixel Reference generation CNN (FRCNN) that predicts the fractional pixels inside the frame with a three-layer CNN. Further, Liu et al. proposed a Group Variation CNN (GVCNN) that can handle multiple quantization parameters and sub-pixel positions in one model. Zhao et al. proposed a method to enhance the inter-prediction quality by using a CNN, rather than a linear combination, to combine two prediction blocks. Beyond PU-level combination, other works directly exploited the learning capability of neural networks to generate a new reference frame so that the residue of motion compensation can be greatly decreased.
Many efforts have also been devoted to in-loop filtering and post-processing. Park et al. were the first to train a shallow CNN for in-loop filtering. The network is inserted into the HEVC codec after DF with SAO off. Since then, many attempts have been made to enhance the representation capability of in-loop filtering networks. Dai et al. proposed a Variable-Filter-Size Residual-Learning CNN (VRCNN) as a post-processing component with variable convolutional kernels to perceive multi-scale feature information. He et al. proposed a CNN adopting residual blocks for post-processing. Li et al. proposed a multi-frame in-loop filter network (MIF-Net) based on the Dense Block. Wang et al. proposed a Dense Residual CNN (DRN) taking advantage of both dense shortcuts and residual learning. Many methods also take intra-frame or inter-frame side-info into consideration. In one such method, not only the decoded frames but also the respective block partition side-info are sent into the network. Jia et al. proposed a Spatial-Temporal Residue Network (STResNet), which aggregates temporal information by concatenating the feature maps of the co-located block and the current block. In another, a delicate reference frame selector was designed, and the reference frames are warped by motion vectors predicted by a neural network.
II-B Deep Learning Based Video Restoration
With the surge of deep learning, video restoration has also seen rapid growth. Many methods were first proposed to tackle image restoration tasks such as denoising [43, 44, 7], super-resolution [5, 17, 18, 21, 48], and deblocking [4, 2, 47]. These can be treated as single-frame restoration algorithms that do not utilize the temporal redundancy of videos. To better fit the video scenario, many methods have been proposed to utilize temporal information to help video restoration.
The most common way to utilize temporal redundancy is to warp the reference frames to the current one by optical flow [40, 25, 32, 20]. The aligned frames are then sent to neural networks to reconstruct the current frame. In contrast, Haris et al. proposed a framework based on the back-projection algorithm. Rather than aligning frames by flow, the flow is sent directly into the network along with the reference frame, without explicit alignment.
Another popular way to process temporal information is to pass a hidden state from frame to frame through an RNN module such as LSTM or ConvLSTM. Tao et al. proposed a sub-pixel motion compensation layer to provide finer motion compensation, and further used a ConvLSTM layer inside their network to pass temporal information. Beyond that, many variants of classic structures have been proposed. In one variant, the RNN cell is an auto-encoder structure consisting of multiple residual blocks, and the hidden state is represented by the transformed feature maps extracted from the bottleneck of each RNN cell.
Besides the methods mentioned above, there exist other ways to handle temporal information. For example, Jo et al. utilized multiple frames to generate dynamic upsampling filters for upsampling low-resolution frames. Lu et al. used deep modules to substitute the original ones in the Kalman filter to process video sequences.
III Progressive Rethinking Networks for In-Loop Filter
In this section, we first present the motivation and design methodology of our proposed Progressive Rethinking Networks, i.e., PR-CNN and PR-RNN. Then, we discuss their detailed architectures step by step.
In this paper, we aim to address the three issues of deep learning-based in-loop filters: 1) Effective network design for feature learning; 2) Side information extraction and injection; 3) Joint spatial and temporal modeling in the coding context. Our motivations to address these issues are three-fold:
Representative Feature Refinement via Progressive Review. The basic blocks in previous advanced networks, e.g., the residual dense network (RDN), perform progressive feature refinement. However, at the end of each basic block, the feature dimension is compressed to avoid excessive growth of the model parameters, which inevitably causes information loss across blocks. This lost information is also important. Intuitively, the high-dimensional feature is more informative (a record of all past experiences). After the compression, only the most critical information (knowledge and principles) is preserved. When learning from new information, it is helpful if all past experiences are available. To this end, we introduce an inter-block connection that passes more information across blocks, which enables the model to learn by reviewing a compact representation of the complete past memorized experiences, namely “rethinking”.
Hierarchical Side-Info in the Coding Context. The coding process is performed block by block as the coding tree unfolds. Thus, the guidance side-info should contain the partition-related side-info to better represent the coding context, and guide the network to perform restoration from coarse to fine. In this work, we extract MM-CU as side-info to boost the proposed network for better in-loop filtering.
Collaborative Learning Mechanism. Previous works make use of temporal redundancy at the frame level (via the aligned reference frame) unidirectionally. That is, they only let information flow from the reference frames to the current one to improve its quality, without updating the states and features of the reference frames. In this paper, we propose a collaborative learning mechanism and maintain three learning paths to absorb useful information from the reference frames (the nearest adjacent frame and the peak quality frame) progressively and collaboratively. This design helps acquire useful information from the three sources and leads to a better restoration of the current reconstructed frame.
III-B Methodology Overview
Based on the common configurations of modern codecs, we adopt two types of processing for reconstructed frames, depending on whether information from adjacent frames is used. Accordingly, our PRNs include two versions: PR-CNN and PR-RNN. We first introduce the pipeline of our method and then build these two versions step by step to present the model more clearly.
We first classify all frames into two categories, high-quality frame (H-frame) and low-quality frame (L-frame):
H-frame. These frames include all I-frames and every P-frame or B-frame whose POC is a multiple of 4. Based on the configuration of the codecs, these frames are usually compressed with lower quantization parameters (QP) and have higher quality.
L-frame. All other frames, i.e., P-frames and B-frames whose POCs are not multiples of 4, are L-frames, as they are usually coded with fewer bits than H-frames.
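The two categories above can be expressed as a small rule. The following is a minimal sketch; the function name `classify_frame` and the string labels are illustrative choices and not part of the codec or of the proposed method.

```python
def classify_frame(slice_type: str, poc: int) -> str:
    """Return "H" for a high-quality frame and "L" for a low-quality frame.

    slice_type is "I", "P" or "B"; poc is the picture order count.
    The names and labels are hypothetical, chosen only for illustration.
    """
    if slice_type not in ("I", "P", "B"):
        raise ValueError(f"unknown slice type: {slice_type!r}")
    # All I-frames are H-frames; P/B-frames qualify when POC is a multiple of 4.
    if slice_type == "I" or poc % 4 == 0:
        return "H"
    return "L"


# Example: a short low-delay sequence starting with an I-frame.
labels = [classify_frame("I" if poc == 0 else "P", poc) for poc in range(9)]
print(labels)
```

In the pipeline described in the text, frames labelled "H" would be sent to PR-CNN and frames labelled "L" to PR-RNN.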
Our network uses PR-CNN and PR-RNN to filter H-frames and L-frames, respectively. Our pipeline under the LD configuration is shown in Fig. 1. The pipeline is easily extended to handle the AI and RA configurations. Under the AI configuration, all frames are fed into PR-CNN with their MM-CU maps, while under the RA configuration the scenario is similar but with a different coding order. The two kinds of frames are processed differently because, for H-frames, the reference frames often have lower quality and may consequently mislead the restoration. PR-CNN takes the nearest H-frame and its MM-CU map as input and outputs the filtered result. After that, the filtered H-frame is taken as the reference frame of the successive frames. Besides this filtered H-frame, for each L-frame, PR-RNN also takes the filtered neighboring frame as another input. Under the RA configuration, the model also chooses the nearest filtered frame in the temporal domain as the reference frame (not necessarily the immediately preceding frame).
2) Modeling PRN Step by Step
To provide a better understanding of our model design, we construct our deep network step by step.
Residual Dense Network. We take the residual dense network (RDN), an excellent previous work, as the starting point of our model. As shown in Fig. 2(a), a series of residual dense blocks (RDBs) are stacked. An additional bypass connection links the first and later layers to achieve a better trade-off between local and global signal modeling.
PR-CNN. As shown in Fig. 2(b), different from RDN, inter-block connections (red lines) are added to pass richer information across blocks. These connections are non-trivial, as they make the successive blocks “rethink”, namely, learn to extract more representative features guided by the previous information without dimension compression. Furthermore, we inject side-info into the network to facilitate in-loop filtering. MM-CU maps are extracted and used as another input to guide the restoration process.
Frame-Level Temporal Fusion. To exploit the temporal redundancy of video frames, the approach commonly used in previous methods is shown in Fig. 2(c). The network takes the warped reference frame as input. However, this approach might not make full use of the temporal dependencies.
PR-RNN (Feature-Level Aggregation and Collaborative Learning). Beyond taking the aligned reference frame as input, we further develop a collaborative learning mechanism to exploit temporal dependencies bidirectionally at the feature level. Specifically, the feature map is also fed forward from the reference frame to the current one, as shown in Fig. 2(d). Note that, in our implementation, we use recurrent neural modules to update the feature maps, and Fig. 2(d) is a simplified unfolded version of our proposed PR-RNN.
In the following, we present our PRN in detail, including its basic module PRB as well as PR-CNN and PR-RNN.
III-C Progressive Rethinking Block
To fully utilize the past memory for current restoration, we design a Progressive Rethinking Block (PRB), which has an additional inter-block connection to forward more informative feature representations across blocks. The structure of our proposed PRB is shown in Fig. 3(b).
In an RDB, the input feature map is first fed forward through a series of convolutional layers to extract rich hierarchical features, and ReLU activation layers are inserted between the convolutional layers to model nonlinearity. The hierarchical features, accumulated by dense connections from the input, are concatenated. As the channel dimension of this concatenated feature is much larger than that of the input, we compress the channel dimension via a 1×1 convolutional layer, and a residual connection can be used to accelerate convergence. However, the dimension compression inevitably causes information loss.
To compensate for this loss, we introduce another path that sends the feature map of the previous block to the current block simply by concatenating it with the concatenated hierarchical feature of the current block, as denoted by the red lines in Fig. 3(b). This connection is non-trivial, as the module with it is more consistent with the learning experience of human beings. People often learn from past experience and memory (high-dimensional informative features) rather than only using the information in the current stage guided by some compact rules and principles (compressed features). Therefore, the inter-block connection can be understood as a long-term memory that reminds the current block of past information. We generate the feature passed on to the next block by a convolutional layer as follows,
where is the corresponding process, and denotes the concatenation operation. Similarly, we can generate and add a local residual learning for better gradient back-propagation as follows,
where is also a convolutional function.
As a summary, we can conclude the process of PRB as and for the k-th PRB, there exists
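As a sketch under assumed notation, the PRB described above can be summarized in equations. The symbols are illustrative choices and not necessarily the paper's own: F_{k-1} is the input of the k-th block, D_k the densely concatenated hierarchical feature, M_{k-1} the inter-block feature received from the previous PRB, H_k the dense convolutional stack, G_k the convolution that produces the new inter-block feature, C_k the 1×1 compression convolution, and P_k the whole block. That the compression acts on the same concatenation as G_k is also an assumption.

```latex
% Assumed notation (hypothetical symbols, see the lead-in above)
\begin{align}
D_k &= H_k(F_{k-1}) \\                  % dense hierarchical features
M_k &= G_k\big([D_k, M_{k-1}]\big) \\   % inter-block ("rethinking") feature
F_k &= F_{k-1} + C_k\big([D_k, M_{k-1}]\big) \\ % compression + local residual
(F_k, M_k) &= P_k(F_{k-1}, M_{k-1})     % one PRB as a two-input, two-output map
\end{align}
```

Written this way, an RDB is the special case in which M_{k-1} is absent, which makes explicit that the only structural change is the extra high-dimensional path between blocks.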
III-D Progressive Rethinking Convolutional Neural Network
To process H-frames, which usually have high quality, we only make use of spatial redundancy and the related side-info for in-loop filtering of the corresponding reconstructed frames. The overall architecture of our PR-CNN is shown in Fig. 3(a). It has two branches: the main branch, i.e., the PR-CNN baseline network without MM-CU maps, and the side-info feature extractor (SIFE). We illustrate their architectures in detail.
1) Architecture of Main Branch of PR-CNN
PR-CNN takes the unfiltered frame and MM-CU maps as its input. The unfiltered frame is fed to the main branch, and the MM-CU maps are first fed into SIFE and then fused into the main branch. PR-CNN can be roughly divided into three parts: the Low-Level Feature Extractor (LFE), the High-Level Feature Extractor (HFE) with MM-CU Fusion, and the Reconstruction Sub-Network.
Low-Level Feature Extractor.
The input frame is first fed into a Low-level Feature Extractor for low-level feature extraction. The LFE consists of two convolutional layers. The corresponding process is formulated as:
where is the generated feature maps.
High-Level Feature Extractor with MM-CU Fusion. The extracted low-level feature maps are further fed forward into sequential PRBs, namely the High-level Feature Extractor. Note that each PRB needs two inputs, as shown in Eqn. (4). We initially set . After a certain number of PRBs, we fuse the feature maps of a Mean value of CU (M-CU) map into the main branch by element-wise addition. The feature map of the k-th M-CU is inserted into the main branch after a designated PRB. The process is denoted as follows,
Reconstruction Sub-network. After the PRBs, we concatenate all feature maps together and use a 1×1 convolutional layer to compress them as follows:
We then append a global residual connection from the first convolutional layer to the last one as follows,
Finally, we construct the output by a 3×3 convolutional layer:
2) MM-CU Generation and Fusion
Beyond utilizing the frame information alone, we further fuse intra-frame side-info extracted from the HEVC codec into our network. As HEVC encodes a frame at the CU level independently with different coding parameters, the partition information contains important side-info that is beneficial for in-loop filtering.
MM-CU Generation. Rather than generating M-CU only at the bottom layer of the quadtree (the leaf nodes of the partition tree), we also extract M-CU at the intermediate layers (every node of the partition tree). Namely, we calculate the mean value of a CU every time a partition happens. Consequently, the side-info includes information related to the entire coding partition architecture and therefore guides the network to remove the coding artifacts from coarse to fine.
We calculate the mean value of each CU at different levels from coarse to fine to derive the corresponding side-info maps. As shown in Fig. 4(b), the blocks in the yellow dotted box are four CTUs and their corresponding M-CU side-info maps, and the coarsest ones are surrounded by a yellow border. Then, every time a CU is divided into four smaller CUs, we recursively calculate the mean value of each partitioned CU. If a CU is not divided, its side-info value stays the same as at the upper level. The recursive process stops when the CU cannot be partitioned anymore. Finally, the multi-scale M-CU side-info maps, MM-CUs, are obtained.
MM-CU Fusion. The information of MM-CU is first transformed into feature maps and then injected into different layers of the main branch. The feature of each M-CU is extracted by a simple shallow CNN named Side-Info Feature Extractor (SIFE), whose structure is shown in Fig. 3(c). The M-CU first goes through a convolutional layer and two stacked PRBs. After that, a residual connection is added. At last, a convolutional layer generates the final output feature map of the M-CU. Intuitively, finer M-CU maps reflect more local details of the coding architecture, while coarser ones contain more global coding structure information. Thus, we inject coarser M-CU maps into the main branch at deeper layers, so that the global information plays a more important role in guiding the network when larger areas are perceived in deeper layers. We choose element-wise addition as the fusion operation.
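The recursive procedure above can be sketched in plain Python for a single CTU. This is a minimal illustration under stated assumptions: the `CU` class, the function names, and the list-of-lists frame representation are hypothetical and do not correspond to HM data structures; level 0 is taken to be the coarsest map.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CU:
    """A node of the partition quadtree (hypothetical structure for illustration)."""
    x: int                      # left column of the block
    y: int                      # top row of the block
    size: int                   # block width and height in pixels
    children: List["CU"] = field(default_factory=list)  # empty list means leaf


def cu_mean(frame, cu):
    """Mean pixel value of the frame region covered by one CU."""
    total = sum(frame[r][c]
                for r in range(cu.y, cu.y + cu.size)
                for c in range(cu.x, cu.x + cu.size))
    return total / (cu.size * cu.size)


def mm_cu_maps(frame, ctu, num_levels):
    """Return num_levels maps, from coarsest (index 0) to finest."""
    h, w = len(frame), len(frame[0])
    maps = [[[0.0] * w for _ in range(h)] for _ in range(num_levels)]

    def paint(level, cu, value):
        for r in range(cu.y, cu.y + cu.size):
            for c in range(cu.x, cu.x + cu.size):
                maps[level][r][c] = value

    def visit(cu, depth):
        value = cu_mean(frame, cu)
        if cu.children and depth + 1 < num_levels:
            paint(depth, cu, value)           # mean of this CU at its own level
            for child in cu.children:
                visit(child, depth + 1)       # recurse into the four sub-CUs
        else:
            # Undivided CU: its value is kept unchanged at all finer levels.
            for level in range(depth, num_levels):
                paint(level, cu, value)

    visit(ctu, 0)
    return maps


# Toy 4x4 "CTU": split once, and only the bottom-right quadrant is split again.
frame = [[0, 0, 4, 4],
         [0, 0, 4, 4],
         [8, 8, 1, 3],
         [8, 8, 5, 7]]
bottom_right = CU(2, 2, 2, [CU(2, 2, 1), CU(3, 2, 1), CU(2, 3, 1), CU(3, 3, 1)])
ctu = CU(0, 0, 4, [CU(0, 0, 2), CU(2, 0, 2), CU(0, 2, 2), bottom_right])
maps = mm_cu_maps(frame, ctu, num_levels=3)
```

In the toy example, the three undivided quadrants carry the same value at levels 1 and 2, while the bottom-right quadrant is refined at level 2, which mirrors the coarse-to-fine behaviour described in the text.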
III-E Progressive Rethinking Recurrent Neural Network
Besides exploiting the current frame information, we further develop a Progressive Rethinking Recurrent Neural Network (PR-RNN) to effectively utilize inter-frame side-info through collaborative learning. We first provide the network architecture of PR-RNN and then introduce the collaborative learning in detail.
1) Architecture of PR-RNN
The architecture of our PR-RNN is shown in Fig. 5(a). To clearly show the relationship between PR-RNN and PR-CNN, we show an unfolded version of PR-RNN. Different from PR-CNN, PR-RNN generates a filtered frame with the information of both the current frame and the reference frames (the nearest adjacent frame and the peak quality frame) to further improve frame quality when inter-prediction is available. Specifically, PR-RNN takes three kinds of frames as its input:
Current Frame .
Neighboring Frame. We select the nearest filtered frame in the temporal domain as another input of the network, as it is often the frame most similar to the current frame among all reconstructed frames. Under the LD configuration, it is the last filtered frame, as shown in Fig. 5. Under the RA configuration, it is a little more complex because the frames are not coded in sequential order. We still choose the nearest filtered frame as one of the reference frames.
Peak Quality Frame. We also choose the nearest filtered H-frame as another reference frame. More high-frequency information is preserved in this frame, which benefits the restoration of the current frame. It is denoted as in Fig. 5 and is denoted as follows,
To apply the in-loop filter frame-by-frame along the temporal dimension, the three input frames at different temporal steps make up three queues, which we abstract into three states: State C, State N and State Q to denote the Current frame, Neighbouring frame, peak Quality frame respectively. Therefore, we can also use , and to represent the three frames in these state queues, respectively.
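The choice of the two reference frames can be sketched as a nearest-neighbour lookup over the frames that have already been filtered. This is a minimal sketch, assuming frames are identified by POC and that ties are broken toward the smaller POC; the function name `select_references` and the tie-breaking rule are assumptions, not details given in the text.

```python
def select_references(current_poc, filtered):
    """Pick the neighbouring frame and the peak quality frame for one L-frame.

    filtered maps the POC of every already-filtered frame to "H" or "L".
    Returns (neighbour_poc, peak_quality_poc). Ties in temporal distance are
    broken toward the smaller POC, which is an assumption of this sketch.
    """
    h_pocs = [poc for poc, kind in filtered.items() if kind == "H"]
    if not filtered or not h_pocs:
        raise ValueError("need at least one filtered H-frame")

    def distance(poc):
        return (abs(poc - current_poc), poc)

    neighbour = min(filtered, key=distance)   # nearest filtered frame of any kind
    peak_quality = min(h_pocs, key=distance)  # nearest filtered H-frame
    return neighbour, peak_quality


# Low-delay order: frames 0..5 are already filtered when frame 6 is processed.
ld_refs = select_references(6, {0: "H", 1: "L", 2: "L", 3: "L", 4: "H", 5: "L"})

# Random-access order: frames 0, 8 and 4 were coded before frame 2.
ra_refs = select_references(2, {0: "H", 8: "H", 4: "H"})
```

Under the low-delay example the neighbour is simply the previous frame, while under the random-access example the nearest filtered frame is not the immediately preceding one in display order.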
PR-RNN can be divided into four parts: Flow Estimation, Low-Level Feature Extractor, Recurrent Module with Collaborative Mechanism, and Reconstruction Sub-Network.
Flow Estimation. Because the three frames are not aligned, we estimate their optical flow results and apply warping operations. We adopt SpyNet  to generate the optical flow maps. We use and to represent any two states and we can get:
where the first parameter of the function is the target frame and the second one is the source frame.
Low-level Feature Extractor. Instead of just warping all frames to the current frame, we extract the low-level feature maps of the three inputs and further warp them in our recurrent module through a collaborative learning mechanism. The extraction of low-level features is the same as in PR-CNN. However, we give the corresponding process its own notation to highlight that it belongs to PR-RNN. Therefore, we can get:
where denotes the feature map of after times unfolding. is the frame that corresponds to .
Recurrent Module with Collaborative Learning. After the flow estimation and low-level feature extraction, the feature maps of these three states and their flow maps are fed into the recurrent module for collaborative learning. Namely, the input feature maps are processed by the Collaborative Learning Module (CLM) and then pass sequential PRBs to further update the state. The detailed process of this collaborative learning will be introduced in the next subsection. One thing needed to mention is that the feature map of is temporarily kept in each time-step and they are concatenated together and fed into the successive layers as follows,
where is the total unfolding times of our PR-RNN.
Reconstruction Sub-network. At last, we reconstruct the frame through two convolutional layers. The first convolutional layer is to compress the channel number. Then, a global residual connection is used to connect the first convolutional layer and the last layer as follows,
The final output result is reconstructed by a convolutional layer. The process of the overall reconstruction sub-network can be denoted as follows,
where stands for the convolutional function.
2) Collaborative Learning
We will illustrate the collaborative learning in detail. We apply the collaborative learning mechanism through a Collaborative Learning Module as Fig. 5(a) shows. The detailed structure of CLM is shown in Fig. 5(b). In the time-step , the feature maps that correspond to the three states are updated as follows,
where denotes the feature maps of in time-step . Similarly, and stand for the feature maps of and in the time-step , respectively. is the mapping of our recurrent module.
To be specific, the three states are first warped with the estimated flow maps as follows,
where and represent two arbitrary states, respectively.
Then, the warped feature maps are concatenated to interact and share information with each other as follows,
Intuitively, collaborative learning can be understood as group mountaineering. The three states, which have different restoration quality, can be regarded as a mountaineering team whose members are at different altitudes. By sharing information, the three states benefit each other, just as teammates collaborate with each other to climb higher. Through this procedure, the three states can reach higher altitudes, namely, attain higher quality.
After collaborative information sharing, the features of the three states are first compressed by a convolutional layer and further refined by several PRBs. We denote the corresponding process as . Therefore, the procedure can be formulated as follows,
Till now, three states are all updated. is further fed into the next recurrence to improve the restoration quality progressively like the mountaineering team aim for higher altitudes.
IV Implementation Details
IV-A Network Implementation
The PR-CNN is made up of 10 PRBs. As our anchor HEVC codec is HM 16.15, which provides only a 4-layer partition tree, our MM-CU consists of M-CUs at 4 different scales. We insert the feature maps of the M-CUs after the 2nd, 4th, 6th and 8th PRBs, respectively, from fine to coarse.
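To make the M-CU idea concrete, the sketch below builds one mean-value map per partition level by replacing each pixel with the mean of the coding unit that covers it at that level. It is a hedged illustration rather than the extraction code used with HM: the per-pixel `depth` map (final quadtree depth of the CU covering each pixel), the level numbering 0 to 3, and the `INSERT_AFTER_PRB` table are assumptions introduced here.

```python
def m_cu_map(frame, depth, level, ctu=64):
    """Mean-value map of the CU partition truncated at the given level.

    frame: 2-D list of luma samples
    depth: 2-D list, final quadtree depth of the CU covering each pixel
           (assumed to describe a consistent quadtree)
    level: partition depth to stop at (0 = whole CTU)
    """
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    done = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if done[y][x]:
                continue
            # A CU split less often than `level` keeps its own (larger) size.
            size = ctu >> min(depth[y][x], level)
            top, left = y - y % size, x - x % size
            ys = range(top, min(top + size, h))
            xs = range(left, min(left + size, w))
            mean = sum(frame[j][i] for j in ys for i in xs) / (len(ys) * len(xs))
            for j in ys:
                for i in xs:
                    out[j][i] = mean
                    done[j][i] = True
    return out


# Assumed insertion plan: finest map (level 3) after the 2nd PRB,
# coarsest map (level 0) after the 8th PRB.
INSERT_AFTER_PRB = {2: 3, 4: 2, 6: 1, 8: 0}
```

With a 64×64 CTU, levels 0 to 3 correspond to CU sizes 64, 32, 16 and 8, which matches a 4-layer partition tree.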
As mentioned above, the PR-RNN has three states. For each state, the respective recurrent module is made up of 3 PRBs, so there are 9 PRBs in PR-RNN in total. The folding time is set to 2.
All activation functions in our PRNs are ReLU. The kernel of each convolution layer is 3×3, except that the kernel of the channel compression module after each concatenation layer is 1×1.
We train PR-CNN and PR-RNN on DIV2K and Vimeo-90K, respectively. The DIV2K dataset contains 800 diverse high-resolution images, while Vimeo-90K contains 89,800 clips of 7 frames each. We randomly extract 18,345 clips from Vimeo-90K, with 4 frames in each clip.
We crop the images into 64×64 and 128×128 patches for the training of PR-CNN and PR-RNN, respectively. We apply random flipping both vertically and horizontally for augmentation.
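The cropping and flipping can be sketched as follows. The key point the sketch shows is that the compressed input and its ground truth must receive the same crop window and the same flips. The function name, the 0.5 flip probability and the use of nested lists instead of tensors are assumptions made for illustration only.

```python
import random


def paired_crop_flip(images, patch, rng):
    """Apply one random crop and one random flip decision to aligned images.

    images: list of equally sized 2-D lists (e.g. [compressed, original])
    patch:  side length of the square patch (e.g. 64 or 128)
    rng:    a random.Random instance, so results are reproducible
    """
    h, w = len(images[0]), len(images[0][0])
    top = rng.randint(0, h - patch)      # inclusive bounds
    left = rng.randint(0, w - patch)
    flip_v = rng.random() < 0.5          # vertical flip: reverse row order
    flip_h = rng.random() < 0.5          # horizontal flip: reverse each row
    out = []
    for img in images:
        crop = [row[left:left + patch] for row in img[top:top + patch]]
        if flip_v:
            crop = crop[::-1]
        if flip_h:
            crop = [row[::-1] for row in crop]
        out.append(crop)
    return out
```

For example, `paired_crop_flip([compressed, original], 64, random.Random(0))` returns two 64×64 patches that cover the same region with the same orientation.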
The network is implemented in PyTorch, and Adam is used as the optimizer. The learning rate is first set to 0.0001 and adaptively decreased until convergence. We train one model for each QP. We first train the PR-CNN and PR-RNN models for the QP with the worst degradation (QP 37) for 75 and 40 epochs, respectively, and then fine-tune the models for the other QPs from them for 20 epochs.
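The per-QP schedule can be written down as a small plan. This is only a bookkeeping sketch of the procedure described above: the dictionary layout and the names `BASE_EPOCHS`, `FINETUNE_EPOCHS` and `training_plan` are hypothetical, and the learning-rate decay rule is left out because it is not specified beyond being adaptive.

```python
BASE_QP = 37                                  # QP with the worst degradation
BASE_EPOCHS = {"PR-CNN": 75, "PR-RNN": 40}    # epochs when training from scratch
FINETUNE_EPOCHS = 20                          # epochs for every other QP


def training_plan(qps=(22, 27, 32, 37)):
    """List the training runs: one model per network and per QP."""
    plan = []
    for model, epochs in BASE_EPOCHS.items():
        # The QP 37 model is trained first, from scratch.
        plan.append({"model": model, "qp": BASE_QP, "init": None, "epochs": epochs})
        # All other QPs start from the QP 37 weights of the same network.
        for qp in qps:
            if qp != BASE_QP:
                plan.append({"model": model, "qp": qp,
                             "init": (model, BASE_QP), "epochs": FINETUNE_EPOCHS})
    return plan
```

With the four test QPs this yields eight runs, two trained from scratch and six fine-tuned.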
We insert our PR-CNN and PR-RNN between the DF and SAO modules. Only the luma component is filtered by our method.
We also adopt CTU-level RDO under LDP, LDB and RA configurations to choose between the filtered and unfiltered results. Under AI configuration, we simply substitute our filtered frame for the unfiltered frame without RDO.
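A minimal sketch of such a CTU-level decision is given below. It assumes that signalling the on/off flag costs the same number of bits for either choice, so the rate term cancels and the decision reduces to comparing distortion (sum of squared errors) against the original frame. That simplification, and the function names, are assumptions of this sketch and not a description of the HM integration.

```python
def sse(a, b, top, left, size):
    """Sum of squared errors over one CTU, clipped at the frame border."""
    h, w = len(a), len(a[0])
    return sum((a[y][x] - b[y][x]) ** 2
               for y in range(top, min(top + size, h))
               for x in range(left, min(left + size, w)))


def ctu_level_select(original, unfiltered, filtered, ctu=64):
    """Keep the filtered samples only in CTUs where they lower distortion.

    Returns the merged frame and a 2-D list of per-CTU on/off flags.
    """
    h, w = len(original), len(original[0])
    out = [row[:] for row in unfiltered]
    flags = []
    for top in range(0, h, ctu):
        flag_row = []
        for left in range(0, w, ctu):
            use = (sse(original, filtered, top, left, ctu)
                   < sse(original, unfiltered, top, left, ctu))
            flag_row.append(use)
            if use:
                right = min(left + ctu, w)
                for y in range(top, min(top + ctu, h)):
                    out[y][left:right] = filtered[y][left:right]
        flags.append(flag_row)
    return out, flags
```

The decoder only needs the flags to reproduce the same choice, since it can run the filter itself but has no access to the original frame.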
V Experimental Results
In this section, we show the experimental results of our model. As mentioned in the previous section, we utilize PR-CNN to filter H-frames and PR-RNN to filter L-frames. The testing QPs include 22, 27, 32 and 37.
V-A Overall Performance
Table I shows the overall performance of our proposed method for classes A, B, C, D and E. On average, our method obtains 9.0%, 9.0%, 10.6% and 8.0% BD-rate savings under AI, LDB, LDP and RA configurations, respectively. The largest BD-rate saving for the luma component is obtained for the test sequence Johnny under LDP configuration. For further verification, we provide rate-distortion (R-D) curves under the four configurations in Fig. 6. It can be seen that our method is superior to HEVC at all QPs, and the gains are more significant at higher QP points.
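For readers unfamiliar with the metric, the sketch below shows the classic cubic-fit form of the Bjøntegaard delta rate for four rate points per codec (one per test QP). It is an illustrative reimplementation and not the script used to produce the tables. Log-rate is interpolated as a cubic in PSNR, both curves are integrated over their common PSNR range, and the mean log-rate gap is converted to a percentage. Function names are hypothetical, and the PSNR values of each curve are assumed to be distinct.

```python
import math


def _lagrange(xs, ys, x):
    """Evaluate the polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _integrate(xs, ys, lo, hi):
    """Simpson's rule on [lo, hi]; exact for the cubic through four points."""
    mid = 0.5 * (lo + hi)
    return (hi - lo) / 6.0 * (_lagrange(xs, ys, lo)
                              + 4.0 * _lagrange(xs, ys, mid)
                              + _lagrange(xs, ys, hi))


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit-rate change (%) of the test codec at equal PSNR.

    Negative values mean the test codec needs fewer bits than the anchor.
    Each argument is a list of four values.
    """
    log_a = [math.log10(r) for r in rate_anchor]
    log_t = [math.log10(r) for r in rate_test]
    lo = max(min(psnr_anchor), min(psnr_test))   # common PSNR interval
    hi = min(max(psnr_anchor), max(psnr_test))
    avg_diff = (_integrate(psnr_test, log_t, lo, hi)
                - _integrate(psnr_anchor, log_a, lo, hi)) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0
```

As a sanity check, a test codec that reaches the same PSNR values with exactly 10% fewer bits at every point yields a BD-rate of about -10%.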
V-B Comparison with Existing Methods
Furthermore, we compare our method with several state-of-the-art methods under AI, LDP and RA configurations. The results are shown in Tables II, III and IV, respectively, and validate the superiority of our PRNs.
Under AI configuration, we choose VRCNN and DCAD, which target post-processing instead of in-loop filtering, for comparison. However, as all frames are encoded with no reference frames available during the coding process, the comparison is still fair. It can be observed that our PR-CNN outperforms all compared methods with gains of at least 2.7%.
When inter prediction is available, PR-RNN is used together with PR-CNN to exploit temporal redundancy. Here, we choose Non-local, RHCNN and MIF as the compared methods. Li et al. implemented all three methods on HM 16.5. For a fair comparison, we also implement our networks on HM 16.5 and test on class C and class D. Note that the compared anchor is HM 16.5 without DF and SAO. It can be observed that our PR-CNN significantly outperforms all three methods under LDP and RA configurations.
V-C Ablation Study
We also conduct ablation studies to verify the necessity and rationality of our design.
Verification of Inter-Block Connection. To verify the superiority of our PRB, we train an RDN and a PR-CNN baseline model without MM-CU maps as input (denoted as PR-CNN-B) with the same number of blocks, following the same training procedure. We test the BD-rate under AI configuration. As shown in Table V, the original RDN provides an 8.7% performance gain. An additional inter-block connection further enhances the coding performance by a margin of 0.4%.
Verification of MM-CU Maps. We compare the performance of PR-CNN-B and PR-CNN to verify the necessity of MM-CU maps. As shown in Table V, with the guidance of MM-CU maps, the performance further increases by 0.5% under AI configuration. In class E, the performance gain is up to 0.8%.
Verification of Collaborative Learning Mechanism. Our PR-RNN applies a collaborative learning mechanism to transfer information between states. To verify the effectiveness of this mechanism, we train a model that simply aligns the neighbouring frames to the current frame by optical flow and concatenates them as the input of a CNN with the same PRBs as PR-RNN. The comparison between PRN-Warp and PR-RNN-N validates the effectiveness of the collaborative learning mechanism. The result is shown in Table VI. From the last two columns, we can see that PR-RNN-N performs better than PRN+WarpN, which demonstrates that our collaborative learning mechanism indeed benefits the restoration.
V-D Subjective Results
We compare the subjective quality of the HEVC anchor and our proposed method. Fig. 7 illustrates some examples that are compressed under AI, LDP and RA configurations, respectively, when QP is 37. For RaceHorses, the rein is blurry in the result of the HM anchor but becomes clearer after being filtered by our proposed method. In BasketballPass, the bottom of the gate is missing in the result of the HM anchor but appears in that of our PR-RNN. For BlowingBubbles, the girl’s face is degraded by multiple artifacts, and our filtered result shows better visual quality. All these examples show that our approach is superior to HEVC in subjective visual quality.
VI Conclusion
In this paper, we propose Progressive Rethinking Networks with a Collaborative Learning Mechanism. We design a Progressive Rethinking Block to introduce inter-block connections that compensate for possible information loss across blocks. Furthermore, we extract context side-info from HEVC codecs to facilitate restoration. We generate Multi-scale Mean value of Coding Unit (MM-CU) maps by calculating the mean value of the CU each time a partition happens and replacing the original pixel values with the mean value. The MM-CU map is fused into a convolutional neural network consisting of PRBs, called the Progressive Rethinking Convolutional Neural Network (PR-CNN). Beyond that, we develop a collaborative learning mechanism to effectively utilize temporal side-info. In our collaborative learning mechanism, not only the state of the current frame but also the states of the reference frames are updated. We implement our collaborative learning mechanism through a Recurrent Neural Network called PR-RNN. Experimental results show that our PR-CNN outperforms the HEVC baseline by 9.0% and PR-RNN outperforms the HEVC baseline by 9.0%, 10.6% and 8.0% under LDB, LDP and RA configurations, respectively.
References
- (2017) NTIRE 2017 challenge on single image super-resolution: dataset and study.
- (2017-05) CAS-CNN: a deep convolutional neural network for image compression artifact suppression. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 752–759.
- (2017) A convolutional neural network approach for post-processing in HEVC intra coding. In Proc. International MultiMedia Modeling Conf.
- (2015) Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 576–584.
- (2015) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307.
- (2012) Sample adaptive offset in the HEVC standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1755–1764.
- (2019) Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1712–1722.
- (2019) Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3897–3906.
- (2016) Deep residual learning for image recognition. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition.
- (2018) Enhancing HEVC compressed videos with a partition-masked convolutional neural network. In Proc. IEEE Int’l Conf. Image Processing.
- (1997-12) Long short-term memory. Neural Computation 9, pp. 1735–1780.
- (2019) Progressive spatial recurrent neural network for intra prediction. IEEE Transactions on Multimedia, pp. 1–1.
- (2017) Densely connected convolutional networks. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition.
- (2017-12) Spatial-temporal residue network based in-loop filter for video coding. In 2017 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4.
- (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3224–3232.
- (2016) Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2 (2), pp. 109–122.
- (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654.
- (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690.
- (2018-07) Fully connected network-based intra prediction for image coding. IEEE Transactions on Image Processing 27 (7), pp. 3236–3247.
- (2019-11) A deep learning approach for multi-frame in-loop filter of HEVC. IEEE Transactions on Image Processing 28 (11), pp. 5663–5678.
- (2017-07) Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
- (2019) One-for-all: grouped variation network-based fractional interpolation in video coding. IEEE Transactions on Image Processing 28 (5), pp. 2140–2151.
- (2019) Coding prior based high efficiency restoration for compressed video. In Proc. IEEE Int’l Conf. Image Processing.
- (2018-09) Deep Kalman filtering network for video compression artifact reduction. In The European Conference on Computer Vision (ECCV).
- (2017) End-to-end learning of video super-resolution with motion compensation. In German Conference on Pattern Recognition, pp. 203–214.
- (1995-04) An optimization approach for removing blocking effects in transform coding. IEEE Transactions on Circuits and Systems for Video Technology.
- (2019-06) Recurrent neural networks with intra-frame iterations for video deblurring. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2012) HEVC deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1746–1754.
- (2016-07) CNN-based in-loop filtering for coding efficiency improvement. In 2016 IEEE 12th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), pp. 1–5.
- (2017) Optical flow estimation using a spatial pyramid network. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition.
- (2013) Image blocking artifacts reduction via patch clustering and low-rank minimization. In Proc. Data Compression Conference.
- (2018) Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6626–6634.
- (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In International Conference on Neural Information Processing Systems.
- (2012) Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1649–1668.
- (2007) Postprocessing of low bit-rate block DCT coded images based on a fields of experts prior. IEEE Transactions on Image Processing.
- (2017) Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4472–4480.
- (2017-04) A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC. In 2017 Data Compression Conference (DCC), pp. 410–419.
- (2018) Dense residual convolutional neural network based in-loop filter for HEVC. In Proc. IEEE Int’l Conf. Image Processing.
- (2019-09) Deep inter prediction via pixel-wise motion oriented reference generation. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1710–1774.
- (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision 127 (8), pp. 1106–1125.
- (2019-03) Convolutional neural network-based fractional-pixel motion compensation. IEEE Transactions on Circuits and Systems for Video Technology 29 (3), pp. 840–853.
- (2017-07) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155.
- (2018-09) FFDNet: toward a fast and flexible solution for CNN-based image denoising. IEEE Transactions on Image Processing 27 (9), pp. 4608–4622.
- (2017-10) Low-rank-based nonlocal adaptive loop filter for high-efficiency video compression. IEEE Transactions on Circuits and Systems for Video Technology 27 (10), pp. 2177–2188.
- (2012) Reducing blocking artifacts in compressed images via transform-domain non-local coefficients estimation. In Proc. IEEE Int’l Conf. Multimedia and Expo.
- (2018) DMCNN: dual-domain multi-scale convolutional neural network for compression artifacts removal. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 390–394.
- (2018) Residual dense network for image super-resolution. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition.
- (2018-08) Residual highway convolutional neural networks for in-loop filtering in HEVC. IEEE Transactions on Image Processing 27 (8), pp. 3827–3841.
- (2018-10) Enhanced CTU-level inter prediction with deep frame rate up-conversion for high efficiency video coding. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 206–210.
- (2018) Enhanced bi-prediction with convolutional neural network for high efficiency video coding. IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1.