The past few years have witnessed increasing interest in applying deep learning to video compression. However, the existing approaches compress a video frame with only a limited number of reference frames, which restricts their ability to fully exploit the temporal correlation among video frames. To overcome this shortcoming, this paper proposes a Recurrent Learned Video Compression (RLVC) approach with the Recurrent Auto-Encoder (RAE) and Recurrent Probability Model (RPM). Specifically, the RAE employs recurrent cells in both the encoder and decoder. As such, the temporal information in a large range of frames can be used for generating latent representations and reconstructing compressed outputs. Furthermore, the proposed RPM network recurrently estimates the Probability Mass Function (PMF) of the latent representation, conditioned on the distribution of previous latent representations. Due to the correlation among consecutive frames, the conditional cross entropy can be lower than the independent cross entropy, thus reducing the bit-rate. The experiments show that our approach achieves the state-of-the-art learned video compression performance in terms of both PSNR and MS-SSIM. Moreover, our approach outperforms the default Low-Delay P (LDP) setting of x265 on PSNR, and also has better performance on MS-SSIM than the SSIM-tuned x265 and the slowest setting of x265.
Nowadays, video contributes the majority of mobile data traffic, and the demand for high-resolution, high-quality video keeps increasing. Therefore, video compression is essential to enable the efficient transmission of video data over the band-limited Internet. In particular, during the COVID-19 pandemic, the growing data traffic from video conferencing, gaming and online learning forced Netflix and YouTube to limit video quality in Europe. This further underlines the importance of improving video compression.
Over the past decades, video coding standards such as H.264 and H.265 were standardized. These standards are handcrafted, and the modules in their compression frameworks cannot be jointly optimized. Recently, inspired by the success of Deep Neural Networks (DNN) in advancing the rate-distortion performance of image compression [40, 29, 22], many deep learning-based video compression approaches [55, 36, 10, 16, 59] were proposed. In these learned video compression approaches, the whole framework is optimized in an end-to-end manner.
However, both the existing handcrafted [28, 54, 45] and learned video compression [55, 36, 10, 16, 59] approaches utilize non-recurrent structures to compress the sequential video data. As such, only a limited number of reference frames can be used to compress new frames, restricting their ability to explore temporal correlation and reduce redundancy. Adopting a recurrent compression framework makes it possible to fully take advantage of the correlated information in consecutive frames, and thus facilitates video compression. Moreover, in the entropy coding of previous learned approaches [55, 36, 10, 16, 59], the Probability Mass Functions (PMF) of latent representations are estimated independently on each frame, ignoring the correlation between the latent representations of neighboring frames. Similar to the reference frames in the pixel domain, fully making use of the correlation in the latent domain benefits the compression of latent representations. Intuitively, the temporal correlation in the latent domain can also be explored in a recurrent manner.
Therefore, this paper proposes a Recurrent Learned Video Compression (RLVC) approach, with the Recurrent Auto-Encoder (RAE) and Recurrent Probability Model (RPM). As shown in Fig. 1, the proposed RLVC approach uses recurrent networks for representing inputs, reconstructing compressed outputs and modeling PMFs for entropy coding. Specifically, the proposed RAE network contains recurrent cells in both the encoder and decoder. Given a sequence of input frames, the encoder of the RAE recurrently generates the latent representations, and the decoder reconstructs the compressed outputs from them in a recurrent manner. As such, all previous frames can serve as references for compressing the current one, and therefore our RLVC approach is able to make use of the information in a large number of frames, instead of the very limited reference frames in the non-recurrent approaches [55, 36, 10, 59].
Furthermore, the proposed RPM network recurrently models the PMF of the current latent representation conditioned on all previous latent representations. Because of the recurrent cell, our RPM network estimates the temporally conditional PMF, instead of the independent PMF as in previous works [55, 36, 10, 16, 59]. Due to the temporal correlation among latent representations, the (cross) entropy conditioned on the previous information is expected to be lower than the independent (cross) entropy. Therefore, our RPM network is able to achieve a lower bit-rate for compressing the latent representations. As Fig. 1 illustrates, the proposed RAE and RPM networks build up a recurrent video compression framework. The hidden states for representation learning and probability modeling are recurrently transmitted from frame to frame, and therefore the information in consecutive frames can be fully exploited in both the pixel and latent domains for compressing the upcoming frames. This results in efficient video compression.
The contribution of this paper can be summarized as:
We propose employing the recurrent structure in learned video compression to fully exploit the temporal correlation among a large range of video frames.
We propose the recurrent auto-encoder to expand the range of reference frames, and the recurrent probability model to recurrently estimate the temporally conditional PMF of the latent representations. This way, the expected bit-rate is the conditional cross entropy, which can be lower than the independent cross entropy achieved by previous non-recurrent approaches.
The experiments validate the superior performance of the proposed approach to the existing learned video compression approaches, and the ablation studies verify the effectiveness of each recurrent component in our framework.
In the following, Section II reviews the related work. The proposed RAE and RPM are introduced in Section III. Then, the experiments in Section IV validate the superior performance of the proposed RLVC approach over the existing learned video compression approaches. Finally, the ablation studies further demonstrate the effectiveness of the proposed RAE and RPM networks, respectively.
Auto-encoders and RNNs. Auto-encoders have been widely used for representation learning over the past decades. In the field of image processing, plenty of auto-encoders have been proposed for image denoising [12, 18], enhancement [35, 41] and super-resolution [60, 53]. Besides, inspired by the development of Recurrent Neural Networks (RNNs) and their applications to sequential data, e.g., language modeling [39, 24] and video analysis, several recurrent auto-encoders were proposed for representation learning on time-series tasks, such as machine translation [11, 46] and captioning. Moreover, Srivastava et al. proposed learning video representations with an auto-encoder based on Long Short-Term Memory (LSTM), and verified its effectiveness on video classification and action recognition tasks. However, to the best of our knowledge, no recurrent auto-encoder has been utilized in learned video compression.
Learned image compression. In recent years, there has been increasing interest in applying deep auto-encoders to end-to-end DNN models for learned image compression [48, 49, 1, 47, 3, 4, 40, 37, 30, 23, 29, 22]. For instance, Theis et al. proposed a compressive auto-encoder for lossy image compression, and reached performance competitive with JPEG 2000. Later, various probability models were proposed. For instance, Ballé et al. [3, 4] proposed the factorized-prior and hyperprior probability models to estimate entropy in end-to-end DNN image compression frameworks. Building on these, Minnen et al. proposed the hierarchical prior entropy model to improve compression efficiency. Besides, Mentzer et al. utilized a 3D-CNN as the context model for entropy coding, and proposed learning an importance mask to reduce redundancy in the latent representation. Recently, the context-adaptive and the coarse-to-fine hyperprior entropy models were designed to further advance the rate-distortion performance, and successfully outperform the traditional image codec BPG.
Learned video compression. Deep learning is also attracting more and more attention in video compression. To improve the coding efficiency of the handcrafted standards, many approaches [57, 34, 13, 15, 31, 32] were proposed to replace components of H.265 with DNNs. Among them, Liu et al. utilized a DNN for the fractional interpolation in motion compensation, and Choi et al. proposed a DNN model for frame prediction. Besides, [15, 31, 32] employed DNNs to improve the in-loop filter of H.265. However, these approaches only improve one particular module, so the video compression frameworks still cannot be jointly optimized.
Inspired by the success of learned image compression, some learning-based video compression approaches were proposed [8, 9]. However, [8, 9] still adopt handcrafted strategies, such as block matching for motion estimation and compensation, and therefore fail to optimize the whole compression framework in an end-to-end manner. Recently, several end-to-end DNN frameworks have been proposed for video compression [55, 36, 10, 16, 19, 33, 59]. Specifically, Wu et al. proposed predicting frames by interpolation from reference frames, and compressing the residual with an image compression model. Later, Lu et al. proposed the Deep Video Compression (DVC) approach, which uses optical flow for motion estimation, and utilizes two auto-encoders to compress the motion and residual, respectively. Then, Djelouah et al. introduced bi-directional prediction into learned video compression. Liu et al. proposed a deep video compression framework with one-stage flow for motion compensation. Most recently, Yang et al. proposed learning for video compression with hierarchical quality layers and adopted a recurrent enhancement network in the deep decoder. Nevertheless, none of these works learns to compress video with a recurrent model. Instead, at most two reference frames are used in these approaches [55, 36, 10, 16, 33, 59], and therefore they fail to exploit the temporal correlation across a large number of frames.
Although Habibian et al. proposed feeding a group of frames to a 3D auto-encoder, the temporal length is limited because all frames in one group have to fit into GPU memory at the same time. Instead, the proposed RLVC network takes as inputs only one frame and the hidden states from the previous frame, and recurrently moves forward. Therefore, we are able to explore a larger range of temporal correlation with finite memory. Also, Habibian et al. use a PixelCNN-like network as an auto-regressive probability model, which makes decoding slow. On the contrary, the proposed RPM network enables our approach to achieve not only more efficient compression but also faster decoding.
The framework of the proposed RLVC approach is shown in Fig. 2. Inspired by traditional video codecs, we utilize motion compensation to reduce the redundancy among video frames, whose effectiveness in learned compression has been proved in previous works [36, 59]. To be specific, we apply the pyramid optical flow network to estimate the temporal motion between the current frame x_t and the previously compressed frame x̂_{t−1}, where we denote the raw and compressed frames as x_t and x̂_t, respectively. The large receptive field of the pyramid network helps handle large and fast motions. Then, the estimated motion is compressed by the proposed RAE, and the compressed motion is applied for motion compensation. In our framework, we use the same motion compensation method as [36, 59]. Subsequently, the residual between x_t and the motion-compensated frame is obtained and compressed by another RAE. Given the compressed residual, the compressed frame x̂_t can be reconstructed. The details of the proposed RAE are described in Section III-B.
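The motion compensation step above can be illustrated with a deliberately simplified sketch: backward warping with an integer-valued flow field followed by residual computation. All names and the nearest-pixel warp below are illustrative only; the actual framework uses a learned pyramid optical flow network with sub-pixel (bilinear) warping and learned compensation.

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp `frame` by an integer optical flow (toy nearest-pixel
    version). flow[..., 0] / flow[..., 1] hold per-pixel horizontal /
    vertical displacements; real codecs use sub-pixel bilinear warping."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(xs - flow[..., 0].astype(int), 0, w - 1)
    src_y = np.clip(ys - flow[..., 1].astype(int), 0, h - 1)
    return frame[src_y, src_x]

# A previous (compressed) frame and the current frame: the scene shifts
# one pixel to the right, so motion compensation should predict it well.
prev = np.zeros((8, 8)); prev[2:6, 2:6] = 1.0
curr = np.zeros((8, 8)); curr[2:6, 3:7] = 1.0

flow = np.zeros((8, 8, 2)); flow[..., 0] = 1.0   # uniform rightward motion
pred = warp(prev, flow)                           # motion-compensated frame
residual = curr - pred                            # what still needs coding

# The residual carries far less energy than the raw frame itself,
# which is exactly why it is cheaper to compress.
assert np.abs(residual).sum() < np.abs(curr).sum()
```

In this toy case the motion model is exact, so the residual vanishes entirely; in practice the residual is small but nonzero and is compressed by the second RAE.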
In our framework, the two RAEs applied to each frame generate the latent representations for motion and residual compression, respectively. To compress these latent representations into a bit stream, we propose the RPM network to recurrently predict their temporally conditional PMFs. Due to the temporal relationship among video frames, the conditional cross entropy is expected to be lower than the independent cross entropy used in non-recurrent approaches [55, 36, 10, 59]. Hence, utilizing the conditional PMF estimated by our RPM network effectively reduces the bit-rate in arithmetic coding. The proposed RPM is detailed in Section III-C.
As mentioned above, we apply two RAEs to compress the motion and the residual, respectively. Since the two RAEs share the same architecture, we denote both of their latent representations by y_t in this section for simplicity. Recall that in the non-recurrent learned video compression works [36, 10, 59], when compressing the t-th frame, the auto-encoders map the input to a latent representation y_t through an encoder. Then, the continuous-valued y_t is quantized to the discrete-valued ŷ_t, and the compressed output is reconstructed by the decoder from the quantized latent representation. Taking only the current input and the current latent representation as inputs to the encoder and decoder, these approaches fail to take advantage of the temporal correlation in consecutive frames.
The encoder of our RAE applies four down-sampling convolutional layers with the GDN activation function. In the middle of the four convolutional layers, we insert a ConvLSTM cell to achieve the recurrent structure. As such, the information from previous frames flows into the encoder network of the current frame through the hidden states of the ConvLSTM. Therefore, the proposed RAE generates the latent representation based on the current as well as previous inputs. Similarly, the recurrent decoder in the RAE also has a ConvLSTM cell in the middle of four up-sampling convolutional layers with IGDN, and thus reconstructs the output from both the current and previous latent representations. In summary, our RAE network can be formulated as (3).
In (3), all previous frames can be seen as reference frames for compressing the current frame, and therefore our RLVC approach is able to make use of the information in a large range of frames, instead of the very limited number of reference frames in the non-recurrent approaches [55, 36, 10, 59].
To compress the sequence of latent representations, the RPM network is proposed for entropy coding. First, we use p(ŷ_t) and q(ŷ_t) to denote the true and estimated independent PMFs of ŷ_t. The expected bit-rate of ŷ_t is then given by the cross entropy H(p, q) = E_{ŷ_t ∼ p}[ −log₂ q(ŷ_t) ]. Note that arithmetic coding is able to encode ŷ_t at a bit-rate equal to the cross entropy with negligible overhead. It can be seen from (4) that if ŷ_t has higher certainty, the bit-rate can be smaller. Due to the temporal relationship among video frames, the distributions of the latent representations in consecutive frames are correlated. Therefore, conditioned on the information of the previous latent representations ŷ_1, …, ŷ_{t−1}, the current ŷ_t is expected to be more certain. That is, defining p(ŷ_t | ŷ_{t−1}, …, ŷ_1) and q(ŷ_t | ŷ_{t−1}, …, ŷ_1) as the true and estimated temporally conditional PMFs of ŷ_t, the conditional cross entropy is expected to be lower than the independent cross entropy in (4).
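The claim that conditioning lowers the expected bit-rate can be checked on a toy source. The sketch below (hypothetical numbers, not from the paper) models quantized symbols as a two-state Markov chain: modeled independently, each symbol costs one full bit, while conditioning on the previous symbol brings the cost down to the much smaller conditional entropy.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a Bernoulli(p) symbol."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Toy model of temporal correlation: each quantized symbol equals the
# previous one with probability 0.9 (a two-state Markov chain whose
# stationary distribution is uniform, so symbols look like fair coins
# when modeled independently).
p_stay = 0.9

independent_bits = entropy(0.5)        # PMF estimated per frame: 1.0 bit
conditional_bits = entropy(p_stay)     # PMF conditioned on the previous symbol

assert conditional_bits < independent_bits
print(round(independent_bits, 3), round(conditional_bits, 3))  # 1.0 0.469
```

Here conditioning cuts the expected rate by more than half, mirroring the gap between the independent and conditional cross entropies discussed above.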
Specifically, adaptive arithmetic coding allows changing the PMF for each element in ŷ_t, and thus we estimate different conditional PMFs for different elements. Here, the element at the i-th 3D location in ŷ_t is denoted by ŷ_t(i), and the conditional PMF of ŷ_t can be expressed as the product of the per-element conditional PMFs,
in which N denotes the number of 3D positions in ŷ_t. As shown in Fig. 4, we model the conditional PMF of each element as a discretized logistic distribution. Since the quantization operation in the RAE rounds every continuous y to a discrete value ŷ, the conditional PMF of the quantized ŷ can be obtained by integrating the continuous logistic distribution from ŷ − 1/2 to ŷ + 1/2, in which the logistic distribution with mean μ and scale s is defined as Logistic(y; μ, s) = exp(−(y − μ)/s) / ( s · (1 + exp(−(y − μ)/s))² ), and its integral is the sigmoid function, i.e., ∫ Logistic(y; μ, s) dy = sigmoid((y − μ)/s) + C.
It can be seen from (10) that the conditional PMF at each location is modelled by the parameters μ and s, which vary across the locations in ŷ_t. The RPM network is proposed to recurrently estimate μ and s in (10). Fig. 5 shows the detailed architecture of our RPM network, which contains a recurrent network with convolutional layers and a ConvLSTM cell in the middle. Due to the recurrent structure, μ and s are generated based on all previous latent representations, i.e.,
in which the trainable parameters of the RPM are shared across time steps. Because the RPM takes the previous latent representations as inputs, μ and s learn to model the probability of each ŷ_t conditioned on ŷ_1, …, ŷ_{t−1} according to (10). Finally, the conditional PMFs are applied in adaptive arithmetic coding to encode ŷ_t into a bit stream.
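A minimal sketch of the discretized logistic PMF described above, assuming integer quantization bins of width 1; the μ and s values are illustrative stand-ins for the per-element outputs of the RPM.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def discretized_logistic_pmf(y_hat, mu, s):
    """P(y_hat) for an integer-quantized symbol under a logistic(mu, s)
    density: the logistic CDF is the sigmoid, so the mass on the
    quantization bin [y_hat - 1/2, y_hat + 1/2] is a sigmoid difference."""
    return (sigmoid((y_hat + 0.5 - mu) / s)
            - sigmoid((y_hat - 0.5 - mu) / s))

mu, s = 0.3, 1.2          # illustrative per-element parameters from the RPM
pmf = {k: discretized_logistic_pmf(k, mu, s) for k in range(-30, 31)}

assert abs(sum(pmf.values()) - 1.0) < 1e-6     # a valid PMF over the support
assert max(pmf, key=pmf.get) == 0              # mass concentrates near mu
```

Each probability in `pmf` is exactly what the adaptive arithmetic coder would consume for the corresponding symbol; sharper distributions (smaller s) put more mass on fewer bins and therefore cost fewer bits.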
In this paper, we utilize the Multi-Scale Structural SIMilarity (MS-SSIM) index and the Peak Signal-to-Noise Ratio (PSNR) to evaluate compression quality, and train two models optimized for MS-SSIM and PSNR, respectively. The distortion is defined as 1 − MS-SSIM when optimizing for MS-SSIM, and as the Mean Squared Error (MSE) when training the PSNR model. As Fig. 2 shows, our approach uses the uni-directional Low-Delay P (LDP) structure. We follow previous work to compress the I-frame with a learned image compression method for the MS-SSIM model, and with BPG for the PSNR model. Because no previous latent representations are available for the first P-frame, its motion and residual latent representations are compressed by the spatial entropy model, with their bit-rates counted accordingly. The following P-frames are compressed with the proposed RPM network, and the actual bit-rate can be calculated as
the sum over all elements of −log₂ of the estimated conditional probability, in which the conditional PMF is modelled by the proposed RPM according to (6) to (11). Note that, assuming the distribution of the training set is identical to the true distribution, the actual bit-rate is expected to equal the conditional cross entropy in (5). In our approach, two RPM networks are applied to the latent representations of motion and residual, respectively.
Our RLVC approach is trained on the Vimeo-90k dataset, in which each training sample has 7 frames. The first frame is compressed as the I-frame and the other 6 frames are P-frames. First, we warm up the network on the first P-frame in a progressive manner. At the beginning, the motion estimation network is trained with the warping loss (13), which measures the distortion between the current frame and the previously compressed frame warped by the estimated motion (the output of the motion estimation network, as shown in Fig. 2). When the motion estimation network has converged, we further include the RAE for compressing motion and the motion compensation network into the training, using the rate-distortion loss (14).
After the convergence of (14), the whole network is jointly trained on the first P-frame with the rate-distortion loss (15). Finally, we train our recurrent model in an end-to-end manner on the full sequence of training frames, using the loss function (16), which sums the rate-distortion cost over all P-frames.
During training, quantization is relaxed by the additive-noise method to avoid zero gradients. We follow previous work to set the rate-distortion trade-off λ to 8, 16, 32 and 64 for MS-SSIM, and to 256, 512, 1024 and 2048 for PSNR. The Adam optimizer is utilized for training. The same initial learning rate is used for all loss functions (13), (14), (15) and (16). When training the whole model with the final loss (16), we decay the learning rate by a factor of 10 after convergence until reaching the minimal learning rate.
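The quantization relaxation referenced above can be sketched as follows; the function name and the training/test switch are illustrative, but the mechanism (additive uniform noise standing in for rounding during training, following Ballé et al.) matches the cited method.

```python
import random

def quantize(y, training):
    """Quantization at test time vs. its differentiable relaxation during
    training: uniform noise on [-0.5, 0.5] replaces hard rounding, so the
    operation has non-zero gradients almost everywhere."""
    if training:
        return y + random.uniform(-0.5, 0.5)   # gradient flows through
    return float(round(y))                      # hard rounding at test time

random.seed(0)
y = 2.37
noisy = [quantize(y, training=True) for _ in range(10000)]

# The noisy surrogate is unbiased around y and stays within the same
# bin width as the hard rounding it replaces.
assert abs(sum(noisy) / len(noisy) - y) < 0.02
assert all(abs(v - y) <= 0.5 for v in noisy)
assert quantize(y, training=False) == 2.0
```

Because the noise matches the quantization bin width, the rate estimated under the noisy surrogate closely tracks the rate of the hard-rounded symbols at test time.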
[Table I: BDBR (%) calculated by MS-SSIM, with the anchor of x265 (LDP very fast). Columns compare the learned approaches (including ours) and the x265 settings (LDP default, default, SSIM default, slowest, SSIM slowest) on each test video, with averages over all videos.]
The experiments are conducted to validate the effectiveness of our RLVC approach. We evaluate the performance on the same test sets as previous works, i.e., the JCT-VC (Classes B, C and D) and the UVG datasets. The JCT-VC Class B and UVG are high-resolution (1920×1080) datasets, and the JCT-VC Classes C and D have resolutions of 832×480 and 416×240, respectively. We compare our RLVC approach with the latest learned video compression methods: HLVC (CVPR’20), Liu et al. (AAAI’20), Habibian et al. (ICCV’19), DVC (CVPR’19), Cheng et al. (CVPR’19) and Wu et al. (ECCV’18). To compare with the handcrafted video coding standard H.265, we first include the LDP very fast setting of x265 in the comparison, which is used as the anchor in previous learned compression works [36, 59, 33]. We also compare our approach with the LDP default, the default and the slowest settings of x265. Moreover, the SSIM-tuned x265 is also compared with our MS-SSIM model. The detailed configurations of x265 are listed as follows:
x265 (LDP very fast):
ffmpeg (input) -c:v libx265
-preset veryfast -tune zerolatency
-x265-params "crf=CRF:keyint=10" output.mkv
x265 (LDP default):
ffmpeg (input) -c:v libx265
-tune zerolatency
-x265-params "crf=CRF" output.mkv
x265 (default):
ffmpeg (input) -c:v libx265
-x265-params "crf=CRF" output.mkv
x265 (SSIM default):
ffmpeg (input) -c:v libx265 -tune ssim
-x265-params "crf=CRF" output.mkv
x265 (slowest):
ffmpeg (input) -c:v libx265
-preset placebo
-x265-params "crf=CRF" output.mkv
(Placebo is the slowest among the 10 speed levels in x265.)
x265 (SSIM slowest):
ffmpeg (input) -c:v libx265
-preset placebo -tune ssim
-x265-params "crf=CRF" output.mkv
In the above settings, “(input)” is short for “-pix_fmt yuv420p -s WidthxHeight -r Framerate -i input.yuv”. CRF controls the compression quality, and a lower CRF corresponds to higher quality. We set CRF = 15, 19, 23, 27 for the JCT-VC datasets, and CRF = 7, 11, 15, 19, 23 for the UVG dataset.
Please refer to the Supporting Document for the experimental results on more datasets, such as the conversational video dataset and the MCL-JCV  dataset.
Comparison with learned approaches. Fig. 6 illustrates the rate-distortion curves of our RLVC approach in comparison with previous learned video compression approaches on the UVG and JCT-VC datasets. Among the compared approaches, Liu et al. and Habibian et al. are optimized for MS-SSIM, DVC and Wu et al. are optimized for PSNR, and HLVC trains two models for MS-SSIM and PSNR, respectively. As Fig. 6 (a) and (b) show, our MS-SSIM model outperforms all previous learned approaches, including the state-of-the-art MS-SSIM-optimized approaches Liu et al. (AAAI’20), HLVC (CVPR’20) and Habibian et al. (ICCV’19). In terms of PSNR, Fig. 6 (c) and (d) indicate the superior performance of our PSNR model over the PSNR-optimized models HLVC (CVPR’20), DVC (CVPR’19) and Wu et al. (ECCV’18).
We further tabulate the Bjøntegaard Delta Bit-Rate (BDBR) results calculated by MS-SSIM and PSNR, with the anchor of x265 (LDP very fast), in Tables I and II, respectively. (Since [55, 33] do not release per-video results, their BDBR values cannot be obtained.) Note that BDBR measures the average bit-rate difference in comparison with the anchor: a lower BDBR value indicates better performance, and a negative BDBR indicates bit-rate savings over the anchor, i.e., outperforming it. In Tables I and II, the bold numbers are the best results among learned approaches. As Table I shows, in terms of MS-SSIM, the proposed RLVC approach outperforms previous learned approaches on all videos in the high-resolution datasets UVG and JCT-VC Class B. Among all 20 test videos, we achieve the best results among learned approaches on 18 videos, and obtain the best average BDBR among all learned approaches. Moreover, Table II shows that, in terms of PSNR, our PSNR model outperforms all existing learned approaches on all test videos.
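For readers unfamiliar with BDBR, the sketch below computes it in the standard Bjøntegaard way: fit log-rate as a cubic polynomial of quality, integrate both fits over the overlapping quality range, and convert the average log-rate gap back to a percentage. The rate-distortion points are hypothetical, and numpy is assumed to be available.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta bit-rate (%): fit log10(rate) as a cubic in
    quality, integrate both fits over the shared quality interval, and
    convert the mean log-rate difference to a percentage."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)

    lo = max(min(psnr_anchor), min(psnr_test))   # overlapping quality range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

# Hypothetical RD points (kbps, dB). A codec needing 20% less rate at
# every quality level yields a BDBR of about -20%.
rate = [1000, 2000, 4000, 8000]
psnr = [32.0, 34.5, 36.8, 38.5]
better = [0.8 * r for r in rate]

assert abs(bd_rate(rate, psnr, rate, psnr)) < 1e-9   # identical curves: 0%
assert bd_rate(rate, psnr, better, psnr) < 0          # negative = savings
```

This matches the sign convention used in the tables: negative BDBR means the tested codec saves bit-rate relative to the anchor at equal quality.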
Note that the latest HLVC (CVPR’20) approach introduces bi-directional prediction, a hierarchical structure and post-processing into learned video compression, while the proposed RLVC approach only works in the uni-directional IPPP mode without post-processing (as shown in Fig. 2). Nevertheless, our approach still achieves better performance than HLVC, validating the effectiveness of our recurrent compression framework with the proposed RAE and RPM networks.
Comparison with x265. The rate-distortion curves compared with different settings of x265 are demonstrated in Fig. 7. As Fig. 7 (a) and (b) show, the proposed MS-SSIM model outperforms x265 (LDP very fast), x265 (LDP default), x265 (default) and x265 (SSIM default) on both the UVG and JCT-VC datasets from low to high bit-rates. Besides, in comparison with the slowest setting of x265, we also achieve better performance on UVG and at high bit-rates on JCT-VC. Moreover, at high bit-rates, we even have higher MS-SSIM performance than the SSIM-tuned slowest setting of x265, which can be seen as the best (MS-)SSIM performance that x265 is able to reach.
Similar conclusions can be drawn from the BDBR results calculated by MS-SSIM in Table I. That is, our RLVC approach reduces the bit-rate of the anchor x265 (LDP very fast) on average, and outperforms x265 (LDP default), x265 (default), x265 (SSIM default) and x265 (slowest). In comparison with x265 (SSIM slowest), we achieve better performance on 8 of the 20 test videos. We also achieve a better average BDBR than x265 (SSIM slowest) on JCT-VC Class B, and reach almost the same average performance on JCT-VC Class D.
[Table: average BDBR results on JCT-VC Classes B, C and D and on all test videos.]
In terms of PSNR, Fig. 7 (c) and (d) show that our PSNR model outperforms x265 (LDP very fast) from low to high bit-rates on both the UVG and JCT-VC test sets. Besides, we are superior to x265 (LDP default) at high bit-rates on UVG and over a large range of bit-rates on JCT-VC. The BDBR results calculated by PSNR in Table II also indicate that our approach achieves a lower bit-rate than x265 (LDP very fast), and saves more bit-rate than x265 (LDP default). We do not outperform the default and slowest settings of x265 on PSNR. However, x265 (default) and x265 (slowest) apply advanced strategies, such as bi-directional prediction and hierarchical frame structures, while our approach only utilizes the uni-directional IPPP mode. Note that, as far as we know, no learned video compression approach beats the default setting of x265 in terms of PSNR. The proposed RLVC approach advances the state-of-the-art in learned video compression and helps close the gap to the handcrafted standards step by step.
Visual results. The visual results of our MS-SSIM and PSNR models are illustrated in Fig. 8, in comparison with the default setting of x265. It can be seen from Fig. 8 that our MS-SSIM model reaches higher MS-SSIM at a lower bit-rate than x265, and produces compressed frames with fewer blocky artifacts. For our PSNR model, as discussed above, we do not beat the default setting of x265 in terms of PSNR. However, as Fig. 8 shows, our PSNR model also produces fewer blocky artifacts and less noise than x265, and is able to reach similar or even higher MS-SSIM than the default setting of x265 in some cases.
Computational complexity. We measure the complexity of the learned approaches on one NVIDIA 1080Ti GPU. The results in frames per second (fps) are shown in Table III. As Table III shows, due to the recurrent cells in our auto-encoders and probability model, the superior performance of our approach comes at the cost of higher encoding complexity than previous approaches. Nevertheless, we decode faster than [59, 19], and achieve real-time decoding on 240p videos. Note that HLVC adopts an enhancement network in the decoder to improve compression quality, which increases decoding complexity. Our RLVC approach (without enhancement) still reaches higher compression performance than HLVC, and also decodes faster. Besides, the auto-regressive (PixelCNN-like) probability model used by Habibian et al. leads to slow decoding, while the proposed RPM network is more efficient.
The ablation studies are conducted to verify the effectiveness of each recurrent component in our approach. We define the baseline (BL) as our framework without recurrent structures, i.e., the recurrent cells are removed from the auto-encoders and our RPM network is replaced with the factorized spatial entropy model. Then, we enable the recurrent cell in the encoder (BL+RE) and in the decoder (BL+RD), respectively. Next, both are enabled, yielding the proposed RAE network (BL+RAE). Finally, our RPM network is further applied to replace the spatial model (BL+RAE+RPM, i.e., our full model). Besides, we also compare our RPM network with the hyperprior spatial entropy model.
The proposed RAE. As Fig. 9 shows, the rate-distortion curves of BL+RE and BL+RD are both above the baseline. This indicates that the recurrent encoder and the recurrent decoder each improve the compression performance. Moreover, combining them in the proposed RAE further improves the rate-distortion performance (shown as BL+RAE). A probable reason is that, with recurrent cells on both sides, the encoder learns to encode the residual information between the current and previous inputs, which reduces the information content of each latent representation, while the decoder reconstructs the output from the encoded residual and the previous outputs. This results in efficient compression.
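The residual-information intuition above can be made concrete with a toy linear analogue (not the actual ConvLSTM-based RAE): when encoder and decoder keep synchronized hidden states, the encoder only needs to emit the innovation of each frame, and the decoder adds it back.

```python
# A toy linear "recurrent auto-encoder": with a hidden state available on
# both sides, the encoder emits only the innovation x_t - h_t, and the
# decoder mirrors the state update -- small latents, exact reconstruction.
def rae_encode(xs):
    h, latents = 0.0, []
    for x in xs:
        latents.append(x - h)   # encode only what the state cannot predict
        h = x                   # state update mirrored by the decoder
    return latents

def rae_decode(latents):
    h, outs = 0.0, []
    for z in latents:
        x = h + z
        outs.append(x)
        h = x
    return outs

frames = [10.0, 10.5, 11.0, 11.25, 11.5]   # slowly varying signal
z = rae_encode(frames)

assert rae_decode(z) == frames             # lossless round trip
# Latents after the first carry far less magnitude than the raw frames,
# so they are cheaper to quantize and entropy-code.
assert max(map(abs, z[1:])) < max(map(abs, frames))
```

In the real RAE the ConvLSTM hidden states play the role of h, and the network learns what to carry forward rather than using a fixed predictor.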
The proposed RPM. It can be seen from Fig. 9 that the proposed RPM (BL+RAE+RPM) significantly reduces the bit-rate in comparison with BL+RAE, which uses the spatial entropy model. This confirms that, at the same compression quality, the temporally conditional cross entropy is smaller than the independent cross entropy.
Besides, Fig. 9 shows that our RPM network further outperforms the hyperprior spatial entropy model, which generates side information to facilitate the compression of the latent representation. This indicates that, when compressing video at the same quality, the temporally conditional cross entropy is smaller than the spatially conditional cross entropy plus the overhead cross entropy of the side information.
The proposed RPM has two benefits over the hyperprior model. First, our RPM does not consume any overhead bit-rate for prior information, while the hyperprior model has to compress its side information into the bit stream. Second, our RPM uses the temporal prior of all previous latent representations, while the hyperprior model has only a single spatial prior of much smaller size. In conclusion, these studies verify the benefits of applying the temporal prior to estimate the conditional probability in a recurrent manner.
It is worth pointing out that the proposed RPM network can be flexibly combined with various spatial probability models, e.g., [4, 29, 22]. As an example, we train a model combining the proposed approach with the hyperprior spatial probability model, as illustrated in Fig. 10. This combined model only slightly improves our approach, i.e., a marginal bit-rate reduction on the JCT-VC dataset. On the one hand, such a slight improvement indicates that, due to the high correlation among video frames, the previous latent representations already provide most of the useful information, and the spatial prior, which incurs a bit-rate overhead, is not very helpful for further improving the performance. This validates the effectiveness of our RPM network. On the other hand, it also shows the flexibility of our RPM network to be combined with spatial probability models, e.g., replacing the spatial model in Fig. 10 with [4, 29], and the possibility to further advance the performance. (Since [29, 22] do not release their training code, we are unable to train models combining the RPM with [29, 22].)
This paper has proposed a recurrent learned video compression approach. Specifically, we proposed recurrent auto-encoders to compress motion and residual, fully exploiting the temporal correlation among video frames. Then, we showed how modeling the conditional probability in a recurrent manner improves the coding efficiency. The proposed recurrent auto-encoder and recurrent probability model significantly expand the range of reference frames, which has not been achieved in previous learned approaches or handcrafted standards. The experiments validate that the proposed approach outperforms all previous learned approaches and the LDP default setting of x265 in terms of both PSNR and MS-SSIM, and also outperforms x265 (slowest) on MS-SSIM. The ablation studies verify the effectiveness of each recurrent component in our RLVC approach, and show the flexibility of the proposed RPM network to be combined with spatial probability models.
In this paper, our approach works in the IPPP mode. Combining it with bi-directional prediction and a hierarchical frame structure is a promising direction for future work. Besides, the recurrent framework of the proposed approach still relies on the warping operation and motion compensation to reduce temporal redundancy. Therefore, another possible future work is to design a fully recurrent deep video compression network that automatically learns to exploit temporal redundancy without relying on optical-flow-based motion.
Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1141–1151.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734.
Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In Proceedings of the International Conference on Machine Learning (ICML), pp. 432–440.
A convolutional neural network approach for post-processing in HEVC intra coding. In Proceedings of the International Conference on Multimedia Modeling (MMM), pp. 28–39.
In Proceedings of the AAAI Conference on Artificial Intelligence.