The past few years have witnessed increasing interest in applying deep learning to video compression. However, the existing approaches compress a video frame with only a limited number of reference frames, which restricts their ability to fully exploit the temporal correlation among video frames. To overcome this shortcoming, this paper proposes a Recurrent Learned Video Compression (RLVC) approach with a Recurrent AutoEncoder (RAE) and a Recurrent Probability Model (RPM). Specifically, the RAE employs recurrent cells in both the encoder and the decoder. As such, the temporal information in a large range of frames can be used for generating latent representations and reconstructing compressed outputs. Furthermore, the proposed RPM network recurrently estimates the Probability Mass Function (PMF) of the latent representation, conditioned on the distribution of previous latent representations. Due to the correlation among consecutive frames, the conditional cross entropy can be lower than the independent cross entropy, thus reducing the bitrate. The experiments show that our approach achieves state-of-the-art learned video compression performance in terms of both PSNR and MS-SSIM. Moreover, our approach outperforms the default Low-Delay P (LDP) setting of x265 on PSNR, and also performs better on MS-SSIM than the SSIM-tuned x265 and the slowest setting of x265.
Nowadays, video contributes the majority of mobile data traffic [14]. The demand for high-resolution and high-quality video is also increasing. Therefore, video compression is essential to enable the efficient transmission of video data over the band-limited Internet. In particular, during the COVID-19 pandemic, the increased data traffic used for video conferencing, gaming and online learning forced Netflix and YouTube to limit video quality in Europe. This further demonstrates the practical importance of improving video compression.
During the past decades, several video compression algorithms, such as MPEG [28], H.264 [54] and H.265 [45], were standardized. These standards are handcrafted, and the modules in their compression frameworks cannot be jointly optimized. Recently, inspired by the success of Deep Neural Networks (DNNs) in advancing the rate-distortion performance of image compression [40, 29, 22], many deep-learning-based video compression approaches [55, 36, 10, 16, 59] were proposed. In these learned video compression approaches, the whole framework is optimized in an end-to-end manner.

However, both the existing handcrafted [28, 54, 45] and learned [55, 36, 10, 16, 59] video compression approaches utilize non-recurrent structures to compress the sequential video data. As such, only a limited number of references can be used to compress new frames, which limits their ability to exploit temporal correlation and reduce redundancy. Adopting a recurrent compression framework makes it possible to fully take advantage of the correlated information in consecutive frames, and thus facilitates video compression. Moreover, in the entropy coding of previous learned approaches [55, 36, 10, 16, 59], the Probability Mass Functions (PMFs) of latent representations are estimated independently on each frame, ignoring the correlation between the latent representations of neighboring frames. Similar to the use of reference frames in the pixel domain, fully exploiting the correlation in the latent domain benefits the compression of latent representations. Intuitively, the temporal correlation in the latent domain can also be exploited in a recurrent manner.
Therefore, this paper proposes a Recurrent Learned Video Compression (RLVC) approach, with a Recurrent AutoEncoder (RAE) and a Recurrent Probability Model (RPM). As shown in Fig. 1, the proposed RLVC approach uses recurrent networks for representing inputs, reconstructing compressed outputs and modeling PMFs for entropy coding. Specifically, the proposed RAE network contains recurrent cells in both the encoder and the decoder. Given a sequence of inputs $\{x_1, \ldots, x_T\}$, the encoder of the RAE recurrently generates the latent representations $\{y_1, \ldots, y_T\}$, and the decoder reconstructs the compressed outputs $\{\hat{x}_1, \ldots, \hat{x}_T\}$ from the quantized latent representations $\{\hat{y}_1, \ldots, \hat{y}_T\}$ in a recurrent manner. As such, all previous frames can serve as references for compressing the current one, and therefore our RLVC approach is able to make use of the information in a large number of frames, instead of the very limited reference frames in the non-recurrent approaches [55, 36, 10, 59].
Furthermore, the proposed RPM network recurrently models the PMF of $\hat{y}_t$ conditioned on all previous latent representations $\hat{y}_{1:t-1} = \{\hat{y}_1, \ldots, \hat{y}_{t-1}\}$. Because of the recurrent cell, our RPM network estimates the temporally conditional PMF $q(\hat{y}_t \mid \hat{y}_{1:t-1})$, instead of the independent PMF $q(\hat{y}_t)$ as in previous works [55, 36, 10, 16, 59]. Due to the temporal correlation among the latent representations, the (cross) entropy of $\hat{y}_t$ conditioned on the previous information is expected to be lower than the independent (cross) entropy. Therefore, our RPM network is able to achieve a lower bitrate for compressing $\hat{y}_t$. As Fig. 1 illustrates, the proposed RAE and RPM networks build up a recurrent video compression framework. The hidden states for representation learning and probability modeling are recurrently transmitted from frame to frame, and therefore the information in consecutive frames can be fully exploited in both the pixel and latent domains for compressing the upcoming frames. This results in efficient video compression.
The contributions of this paper can be summarized as follows:
We propose employing the recurrent structure in learned video compression to fully exploit the temporal correlation among a large range of video frames.
We propose the recurrent autoencoder to expand the range of reference frames, and propose the recurrent probability model to recurrently estimate the temporally conditional PMF of the latent representations. In this way, the expected bitrate reaches the conditional cross entropy, which can be lower than the independent cross entropy in previous non-recurrent approaches.
The experiments validate the superior performance of the proposed approach over the existing learned video compression approaches, and the ablation studies verify the effectiveness of each recurrent component in our framework.
In the following, Section II presents the related works. The proposed RAE and RPM are introduced in Section III. Then, the experiments in Section IV validate the superior performance of the proposed RLVC approach over the existing learned video compression approaches. Finally, the ablation studies further demonstrate the effectiveness of the proposed RAE and RPM networks, respectively.
Autoencoders and RNNs. Autoencoders [20] have been widely used for representation learning in the past decades. In the field of image processing, plenty of autoencoders have been proposed for image denoising [12, 18], enhancement [35, 41] and super-resolution [60, 53]. Besides, inspired by the development of Recurrent Neural Networks (RNNs) and their applications to sequential data [25], e.g., language modeling [39, 24] and video analysis [17], several recurrent autoencoders were proposed for representation learning on time-series tasks, such as machine translation [11, 46] and captioning [51]. Moreover, Srivastava et al. [44] proposed learning video representations with an autoencoder based on Long Short-Term Memory (LSTM) [21], and verified its effectiveness on video classification and action recognition tasks. However, as far as we know, no recurrent autoencoder has been utilized in learned video compression.

Learned image compression. In recent years, there has been increasing interest in applying deep autoencoders in end-to-end DNN models for learned image compression [48, 49, 1, 47, 3, 4, 40, 37, 30, 23, 29, 22]. For instance, Theis et al. [47] proposed a compressive autoencoder for lossy image compression, and reached performance competitive with JPEG 2000 [43]. Later, various probability models were proposed. For instance, Ballé et al. [3, 4] proposed the factorized prior [3] and hyperprior [4] probability models to estimate entropy in end-to-end DNN image compression frameworks. Based on them, Minnen et al. [40] proposed the hierarchical prior entropy model to improve compression efficiency. Besides, Mentzer et al. [37] utilized a 3D CNN as the context model for entropy coding, and proposed learning an importance mask to reduce the redundancy in the latent representation. Recently, the context-adaptive [29] and the coarse-to-fine hyperprior [22] entropy models were designed to further advance the rate-distortion performance, and successfully outperform the traditional image codec BPG [5].

Learned video compression. Deep learning is also attracting more and more attention in video compression. To improve the coding efficiency of handcrafted standards, many approaches [57, 34, 13, 15, 31, 32] were proposed to replace components of H.265 by DNNs. Among them, Liu et al. [34]
utilized DNNs in the fractional interpolation of motion compensation, and Choi et al. [13] proposed a DNN model for frame prediction. Besides, [15, 31, 32] employed DNNs to improve the in-loop filter of H.265. However, these approaches only advance the performance of one particular module, and the video compression frameworks still cannot be jointly optimized.

Inspired by the success of learned image compression, some learning-based video compression approaches were proposed [8, 9]. However, [8, 9] still adopt handcrafted strategies, such as block matching for motion estimation and compensation, and therefore fail to optimize the whole compression framework in an end-to-end manner. Recently, several end-to-end DNN frameworks have been proposed for video compression [55, 36, 10, 16, 19, 33, 59]. Specifically, Wu et al. [55] proposed predicting frames by interpolation from reference frames, and compressing the residual with the image compression model [49]. Later, Lu et al. [36] proposed the Deep Video Compression (DVC) approach, which uses optical flow for motion estimation, and utilizes two autoencoders to compress the motion and residual, respectively. Then, Djelouah et al. [16] introduced bidirectional prediction into learned video compression. Liu et al. [33] proposed a deep video compression framework with one-stage flow for motion compensation. Most recently, Yang et al. [59] proposed learning for video compression with hierarchical quality layers and adopted a recurrent enhancement network in the deep decoder. Nevertheless, none of these works learns to compress video with a recurrent model. Instead, at most two reference frames are used in these approaches [55, 36, 10, 16, 33, 59], and therefore they fail to exploit the temporal correlation in a large number of frames.

Although Habibian et al. [19] proposed taking a group of frames as inputs to a 3D autoencoder, the temporal length is limited because all frames in one group have to fit into GPU memory at the same time. Instead, the proposed RLVC network takes as inputs only one frame and the hidden states from the previous frame, and recurrently moves forward. Therefore, we are able to explore a larger range of temporal correlation with finite memory. Also, [19] uses a PixelCNN-like network [50] as an autoregressive probability model, which makes decoding slow. On the contrary, the proposed RPM network enables our approach to achieve not only more efficient compression but also faster decoding.
The framework of the proposed RLVC approach is shown in Fig. 2. Inspired by traditional video codecs, we utilize motion compensation to reduce the redundancy among video frames, whose effectiveness in learned compression has been proved in previous works [36, 59]. Here, we define the raw and compressed frames as $x_t$ and $\hat{x}_t$, respectively. To be specific, we apply the pyramid optical flow network [42] to estimate the temporal motion between the current frame and the previously compressed frame, i.e., $x_t$ and $\hat{x}_{t-1}$. The large receptive field of the pyramid network [42] helps to handle large and fast motions. Then, the estimated motion is compressed by the proposed RAE, and the compressed motion is applied for motion compensation. In our framework, we use the same motion compensation method as [36, 59]. In the following, the residual $r_t$ between $x_t$ and the motion-compensated frame $\bar{x}_t$ can be obtained and compressed by another RAE. Given the compressed residual $\hat{r}_t$, the compressed frame $\hat{x}_t$ can be reconstructed. The details of the proposed RAE are described in Section III-B.
In our framework, the two RAEs for each frame generate the latent representations of the motion and the residual, respectively. To compress them into a bit stream, we propose the RPM network to recurrently predict their temporally conditional PMFs. Due to the temporal relationship among video frames, the conditional cross entropy is expected to be lower than the independent cross entropy used in non-recurrent approaches [55, 36, 10, 59]. Hence, utilizing the conditional PMF estimated by our RPM network effectively reduces the bitrate in arithmetic coding [27]. The proposed RPM is detailed in Section III-C.
As mentioned above, we apply two RAEs to compress the motion and the residual. Since the two RAEs share the same architecture, we denote both latent representations by $y_t$ in this section for simplicity. Recall that in the non-recurrent learned video compression works [36, 10, 59], when compressing the $t$-th frame, the autoencoders map the input $x_t$ to a latent representation

$y_t = E(x_t;\, \theta_E)$   (1)

through an encoder $E$ parametrized by $\theta_E$. Then, the continuous-valued $y_t$ is quantized to the discrete-valued $\hat{y}_t$. The compressed output $\hat{x}_t$ is reconstructed by the decoder $D$ from the quantized latent representation, i.e.,

$\hat{x}_t = D(\hat{y}_t;\, \theta_D)$.   (2)

Taking only the current $x_t$ and $\hat{y}_t$ as inputs to the encoder and the decoder, respectively, they fail to take advantage of the temporal correlation among consecutive frames.
On the contrary, the proposed RAE includes recurrent cells in both the encoder and the decoder. The architecture of the RAE network is illustrated in Fig. 3. We follow [4] to use four downsampling convolutional layers with the GDN activation function [3] in the encoder of the RAE. In the middle of the four convolutional layers, we insert a ConvLSTM [56] cell to achieve the recurrent structure. As such, the information from previous frames flows into the encoder network of the current frame through the hidden states of the ConvLSTM. Therefore, the proposed RAE generates the latent representation based on the current as well as previous inputs. Similarly, the recurrent decoder of the RAE also has a ConvLSTM cell in the middle of the four upsampling convolutional layers with IGDN [3], and thus also reconstructs $\hat{x}_t$ from both the current and previous latent representations. In summary, our RAE network can be formulated as

$y_t = E(x_t, h_{t-1}^E;\, \theta_E), \quad \hat{x}_t = D(\hat{y}_t, h_{t-1}^D;\, \theta_D)$,   (3)

where $h_{t-1}^E$ and $h_{t-1}^D$ denote the hidden states of the recurrent cells in the encoder and the decoder, respectively.
In (3), all previous frames can be seen as reference frames for compressing the current frame, and therefore our RLVC approach is able to make use of the information in a large range of frames, instead of the very limited number of reference frames in the non-recurrent approaches [55, 36, 10, 59].
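The recurrence in (3) can be sketched as follows. This is our own toy illustration, not the authors' implementation: dense layers stand in for the convolutional and ConvLSTM layers, a single tanh state update stands in for the LSTM gating, and rounding stands in for the learned quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16  # toy frame and hidden-state dimensions

W_enc = rng.normal(0, 0.2, (H, D + H))  # encoder recurrent weights
W_lat = rng.normal(0, 0.2, (D, H))      # hidden state -> latent
W_dec = rng.normal(0, 0.2, (H, D + H))  # decoder recurrent weights
W_out = rng.normal(0, 0.2, (D, H))      # hidden state -> reconstruction

def encode_step(x_t, h_prev):
    # y_t depends on the current frame AND the carried hidden state, so
    # information from all earlier frames flows into the latent (cf. Eq. (3)).
    h_t = np.tanh(W_enc @ np.concatenate([x_t, h_prev]))
    y_t = W_lat @ h_t
    return np.round(y_t), h_t  # rounding stands in for quantization

def decode_step(y_hat_t, h_prev):
    # The decoder mirrors the encoder, carrying its own hidden state.
    h_t = np.tanh(W_dec @ np.concatenate([y_hat_t, h_prev]))
    return W_out @ h_t, h_t

frames = [rng.normal(size=D) for _ in range(6)]  # a toy 6-frame sequence
h_enc = h_dec = np.zeros(H)
reconstructions = []
for x_t in frames:
    y_hat, h_enc = encode_step(x_t, h_enc)
    x_hat, h_dec = decode_step(y_hat, h_dec)
    reconstructions.append(x_hat)
```

Because the hidden states are updated at every step, encoding the same frame under different carried states yields different internal representations; this is the mechanism that lets every previous frame act as a reference.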
To compress the sequence of latent representations $\{\hat{y}_1, \ldots, \hat{y}_T\}$, the RPM network is proposed for entropy coding. First, we use $p_t$ and $q_t$ to denote the true and estimated independent PMFs of $\hat{y}_t$, respectively. The expected bitrate of $\hat{y}_t$ is then given by the cross entropy

$H(p_t, q_t) = \mathbb{E}_{\hat{y}_t \sim p_t}\left[-\log_2 q_t(\hat{y}_t)\right]$.   (4)

Note that arithmetic coding [27] is able to encode $\hat{y}_t$ at the bitrate of the cross entropy with negligible overhead. It can be seen from (4) that if $\hat{y}_t$ has higher certainty, the bitrate can be smaller. Due to the temporal relationship among video frames, the distributions of $\hat{y}_t$ in consecutive frames are correlated. Therefore, conditioned on the information of the previous latent representations $\hat{y}_{1:t-1}$, the current $\hat{y}_t$ is expected to be more certain. That is, defining $\tilde{p}_t$ and $\tilde{q}_t$ as the true and estimated temporally conditional PMFs of $\hat{y}_t$, the conditional cross entropy

$H(\tilde{p}_t, \tilde{q}_t) = \mathbb{E}\left[-\log_2 \tilde{q}_t(\hat{y}_t \mid \hat{y}_{1:t-1})\right]$   (5)

can be smaller than the independent cross entropy in (4). To achieve the expected bitrate of (5), we propose the RPM network to recurrently model the conditional PMF $\tilde{q}_t(\hat{y}_t \mid \hat{y}_{1:t-1})$.
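A toy numerical example illustrates why (5) can be smaller than (4). Here we simulate binary "latents" that repeat the previous value with probability 0.9 (the binary alphabet and the probabilities are our own choices for illustration, standing in for temporal correlation): an independent model must pay 1 bit per symbol for the uniform marginal, while a conditional model pays only the conditional cross entropy of about 0.47 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
p_stay = 0.9       # probability that the latent repeats the previous one
T = 100_000

# Simulate a correlated binary sequence: y_t = y_{t-1} with probability 0.9.
y = np.empty(T, dtype=int)
y[0] = rng.integers(2)
flip = rng.random(T) >= p_stay
for t in range(1, T):
    y[t] = (1 - y[t - 1]) if flip[t] else y[t - 1]

# Independent model (Eq. (4)): the marginal is uniform, so q(y_t) = 0.5.
rate_independent = -np.log2(0.5)  # 1 bit per symbol

# Conditional model (Eq. (5)): q(y_t | y_{t-1}) = 0.9 if repeated, else 0.1.
q_cond = np.where(y[1:] == y[:-1], p_stay, 1.0 - p_stay)
rate_conditional = float(np.mean(-np.log2(q_cond)))

print(rate_independent)   # 1.0
print(rate_conditional)   # ~0.47 bits per symbol
```

Arithmetic coding can approach both rates, so exploiting the temporal correlation translates directly into bitrate savings.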
Specifically, adaptive arithmetic coding [27] allows changing the PMF for each element in $\hat{y}_t$, and thus we estimate different conditional PMFs for the different elements $\hat{y}_t^i$. Here, $\hat{y}_t^i$ is defined as the element at the $i$-th 3D location in $\hat{y}_t$, and the conditional PMF of $\hat{y}_t$ can be expressed as

$\tilde{q}_t(\hat{y}_t \mid \hat{y}_{1:t-1}) = \prod_{i=1}^{N} \tilde{q}_t^{\,i}(\hat{y}_t^i \mid \hat{y}_{1:t-1})$,   (6)
in which $N$ denotes the number of 3D positions in $\hat{y}_t$. As shown in Fig. 4, we model the conditional PMF of each element as a discretized logistic distribution in our approach. Since the quantization operation in the RAE rounds each $y_t^i$ to a discrete value $\hat{y}_t^i$, the conditional PMF of the quantized $\hat{y}_t^i$ can be obtained by integrating the continuous logistic distribution [2] from $\hat{y}_t^i - \frac{1}{2}$ to $\hat{y}_t^i + \frac{1}{2}$:

$\tilde{q}_t^{\,i}(\hat{y}_t^i \mid \hat{y}_{1:t-1}) = \int_{\hat{y}_t^i - \frac{1}{2}}^{\hat{y}_t^i + \frac{1}{2}} \mathrm{Logistic}(x;\, \mu_t^i, s_t^i)\, \mathrm{d}x$,   (7)
in which the logistic distribution is defined as

$\mathrm{Logistic}(x;\, \mu, s) = \dfrac{e^{-(x-\mu)/s}}{s\left(1 + e^{-(x-\mu)/s}\right)^2}$,   (8)

and its integral is the sigmoid function, i.e.,

$\int \mathrm{Logistic}(x;\, \mu, s)\, \mathrm{d}x = \mathrm{Sigmoid}\!\left(\dfrac{x-\mu}{s}\right) + C$.   (9)
Given (7), (8) and (9), the estimated conditional PMF can be simplified as

$\tilde{q}_t^{\,i}(\hat{y}_t^i \mid \hat{y}_{1:t-1}) = \mathrm{Sigmoid}\!\left(\dfrac{\hat{y}_t^i + \frac{1}{2} - \mu_t^i}{s_t^i}\right) - \mathrm{Sigmoid}\!\left(\dfrac{\hat{y}_t^i - \frac{1}{2} - \mu_t^i}{s_t^i}\right)$.   (10)
It can be seen from (10) that the conditional PMF at each location is modeled with the parameters $\mu_t^i$ and $s_t^i$, which vary across the locations in $\hat{y}_t$. The RPM network is proposed to recurrently estimate $\mu_t$ and $s_t$ in (10). Fig. 5 demonstrates the detailed architecture of our RPM network, which contains a recurrent network with convolutional layers and a ConvLSTM cell in the middle. Due to the recurrent structure, $\mu_t$ and $s_t$ are generated based on all previous latent representations, i.e.,

$(\mu_t, s_t) = \mathrm{RPM}(\hat{y}_{1:t-1};\, \theta_R)$,   (11)

where $\theta_R$ represents the trainable parameters of the RPM. Because the RPM takes the previous latent representations as inputs, $\mu_t$ and $s_t$ learn to model the probability of each $\hat{y}_t^i$ conditioned on $\hat{y}_{1:t-1}$ according to (10). Finally, the conditional PMFs are applied in adaptive arithmetic coding [27] to encode $\hat{y}_t$ into a bit stream.
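Equation (10) is straightforward to evaluate. The sketch below is our illustration: in RLVC the parameters μ and s for each latent element come from the RPM network, whereas here they are arbitrary example values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discretized_logistic_pmf(y_hat, mu, s):
    """PMF of an integer-quantized value under a logistic(mu, s) density,
    obtained by integrating over [y - 1/2, y + 1/2], cf. Eq. (10)."""
    return (sigmoid((y_hat + 0.5 - mu) / s)
            - sigmoid((y_hat - 0.5 - mu) / s))

support = np.arange(-50, 51)            # a wide integer support
pmf = discretized_logistic_pmf(support, mu=1.3, s=0.8)

print(pmf.sum())                        # ~1.0: a valid PMF
print(support[np.argmax(pmf)])          # the integer closest to mu
print(-np.log2(discretized_logistic_pmf(np.array(1), 1.3, 0.8)))  # bits to code y = 1
```

A smaller s (a more confident prediction) concentrates the PMF and shortens the code; adaptive arithmetic coding then spends approximately −log2 q bits per element.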
In this paper, we utilize the Multi-Scale Structural SIMilarity (MS-SSIM) index and the Peak Signal-to-Noise Ratio (PSNR) to evaluate compression quality, and train two models optimized for MS-SSIM and PSNR, respectively. The distortion $d$ is defined as $1 - \text{MS-SSIM}$ when optimizing for MS-SSIM, and as the Mean Squared Error (MSE) when training the PSNR model. As Fig. 2 shows, our approach uses the unidirectional Low-Delay P (LDP) structure. We follow [59] to compress the I-frame with the learned image compression method [29] for the MS-SSIM model, and with BPG [5] for the PSNR model. Because no previous latent representations are available for the first P-frame, its latent representations of motion and residual are compressed by the spatial entropy model of [3]. The following P-frames are compressed with the proposed RPM network. For them, the actual bitrate of $\hat{y}_t$ can be calculated as

$R_t = \sum_{i=1}^{N} -\log_2 \tilde{q}_t^{\,i}(\hat{y}_t^i \mid \hat{y}_{1:t-1})$,   (12)

in which $\tilde{q}_t^{\,i}$ is modeled by the proposed RPM according to (6) to (11). Note that, assuming the distribution of the training set is identical to the true distribution, the actual bitrate is expected to be the conditional cross entropy in (5). In our approach, two RPM networks are applied to the latent representations of motion and residual, whose bitrates are denoted as $R_t^m$ and $R_t^r$, respectively.
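With the PMFs of (10), the actual bitrate of a frame in (12) is simply the sum of the per-element code lengths. A toy sketch follows (our own illustration: the latent shape, the downsampling factor of 16 and the parameter values are hypothetical, and in practice μ and s would be predicted by the RPM):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
Hh, Ww, C = 64, 64, 16                  # hypothetical latent dimensions
mu = rng.normal(0.0, 1.0, (Hh, Ww, C))  # would come from the RPM in practice
s = np.full((Hh, Ww, C), 1.0)
y_hat = np.round(rng.logistic(mu, s))   # quantized latents near the model

# Eq. (10): per-element conditional PMF values.
q = sigmoid((y_hat + 0.5 - mu) / s) - sigmoid((y_hat - 0.5 - mu) / s)

# Eq. (12): actual bitrate of this frame = sum of -log2 q over all elements.
bits = float(-np.log2(q).sum())

# With 16x spatial downsampling, a 64x64 latent map covers a 1024x1024 frame.
print(bits / (1024 * 1024))             # estimated bits per pixel
```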
Our RLVC approach is trained on the Vimeo-90k [58] dataset, in which each training sample has 7 frames. The first frame is compressed as the I-frame and the other 6 frames are P-frames. First, we warm up the network on the first P-frame $x_1$ (with the compressed I-frame denoted as $\hat{x}_0$) in a progressive manner. At the beginning, the motion estimation network is trained with the loss function of

$L = d\big(x_1,\, W(\hat{x}_0, f_1)\big)$,   (13)

in which $f_1$ is the output of the motion estimation network (as shown in Fig. 2) and $W$ is the warping operation. When (13) has converged, we further include the RAE network for compressing the motion and the motion compensation network into training, using the following loss function:

$L = \lambda \cdot d(x_1, \bar{x}_1) + R_1^m$,   (14)

where $\bar{x}_1$ is the motion-compensated frame and $R_1^m$ is the bitrate of the motion representation. After the convergence of (14), the whole network is jointly trained on $x_1$ by the loss of

$L = \lambda \cdot d(x_1, \hat{x}_1) + R_1^m + R_1^r$,   (15)

where $R_1^r$ is the bitrate of the residual representation. In the following, we train our recurrent model in an end-to-end manner on the sequential training frames using the loss function of

$L = \sum_{t=1}^{6} \big(\lambda \cdot d(x_t, \hat{x}_t) + R_t^m + R_t^r\big)$.   (16)

During training, quantization is relaxed by the method in [3] to avoid zero gradients. We follow [59] to set $\lambda$ as 8, 16, 32 and 64 for MS-SSIM, and as 256, 512, 1024 and 2048 for PSNR. The Adam optimizer [26] is utilized for training, with the same initial learning rate for all loss functions (13), (14), (15) and (16). When training the whole model by the final loss (16), we decay the learning rate by a factor of 10 after each convergence.
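The staged objectives (13)–(16) all share the rate-distortion form λ·d + R. A minimal sketch of the final recurrent objective (16) follows (our illustration with toy stand-ins: MSE for the distortion d, and fixed numbers for the bitrate terms, which in the real framework come from Eq. (12)):

```python
import numpy as np

def d(x, x_hat):
    # Toy distortion: MSE, as used for the PSNR-optimized model.
    return float(np.mean((x - x_hat) ** 2))

def rd_loss(x, x_hat, rate_motion, rate_residual, lam):
    # Per-frame rate-distortion objective, cf. Eq. (15): lambda * d + R.
    return lam * d(x, x_hat) + rate_motion + rate_residual

lam = 256                                             # one PSNR-model setting
frames = [np.full(4, float(t)) for t in range(1, 7)]  # 6 toy P-frames

# Eq. (16): sum the per-frame losses over the whole sequence, so gradients
# would flow through the recurrent states across all 6 P-frames.
total = sum(rd_loss(x, 0.9 * x, 0.1, 0.2, lam) for x in frames)
print(total)
```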
[Table I: BD-BR (%) calculated by MS-SSIM, with the anchor of x265 (LDP very fast). Columns: the learned approaches DVC [36], Cheng et al. [10], Habibian et al. [19], HLVC [59] and RLVC (Ours), and the non-learned x265 settings (LDP default, default, SSIM default, slowest and SSIM slowest). Rows: the UVG videos Beauty, Bosphorus, HoneyBee, Jockey, ReadySetGo, ShakeNDry and YachtRide; the JCTVC Class B videos BasketballDrive, BQTerrace, Cactus, Kimono and ParkScene; the Class C videos BasketballDrill, BQMall, PartyScene and RaceHorses (480p); and the Class D videos BasketballPass, BlowingBubbles, BQSquare and RaceHorses (240p), with per-dataset averages and the average over all videos. The numeric entries were not recovered from the source.]
The experiments are conducted to validate the effectiveness of our RLVC approach. We evaluate the performance on the same test set as [59], i.e., the JCTVC [7] (Classes B, C and D) and the UVG [38] datasets. The JCTVC Class B and UVG videos are high-resolution, while the JCTVC Classes C and D are at 480p and 240p, respectively. We compare our RLVC approach with the latest learned video compression methods: HLVC [59] (CVPR'20), Liu et al. [33] (AAAI'20), Habibian et al. [19] (ICCV'19), DVC [36] (CVPR'19), Cheng et al. [10] (CVPR'19) and Wu et al. [55] (ECCV'18). To compare with the handcrafted video coding standard H.265 [45], we first include the LDP very fast setting of x265 in the comparison, which is used as the anchor in previous learned compression works [36, 59, 33]. We also compare our approach with the LDP default, the default and the slowest settings of x265. Moreover, the SSIM-tuned x265 is also compared with our MS-SSIM model. The detailed configurations of x265 are listed as follows:
x265 (LDP very fast):
ffmpeg (input) -c:v libx265 -preset veryfast -tune zerolatency -x265-params "crf=CRF:keyint=10" output.mkv
x265 (LDP default):
ffmpeg (input) -c:v libx265 -tune zerolatency -x265-params "crf=CRF" output.mkv
x265 (default):
ffmpeg (input) -c:v libx265 -x265-params "crf=CRF" output.mkv
x265 (SSIM default):
ffmpeg (input) -c:v libx265 -tune ssim -x265-params "crf=CRF" output.mkv
x265 (slowest):
ffmpeg (input) -c:v libx265 -preset placebo -x265-params "crf=CRF" output.mkv  (Placebo is the slowest of the 10 speed levels in x265.)
x265 (SSIM slowest):
ffmpeg (input) -c:v libx265 -preset placebo -tune ssim -x265-params "crf=CRF" output.mkv
In the above settings, "(input)" is short for "-pix_fmt yuv420p -s WidthxHeight -r Framerate -i input.yuv". CRF indicates the compression quality, and a lower CRF corresponds to higher quality. We set CRF = 15, 19, 23, 27 for the JCTVC dataset, and CRF = 7, 11, 15, 19, 23 for the UVG dataset.
Please refer to the Supporting Document for experimental results on more datasets, such as the conversational video dataset and the MCL-JCV [52] dataset.
Comparison with learned approaches. Fig. 6 illustrates the rate-distortion curves of our RLVC approach in comparison with previous learned video compression approaches on the UVG and JCTVC datasets. Among the compared approaches, Liu et al. [33] and Habibian et al. [19] are optimized for MS-SSIM, while DVC [36] and Wu et al. [55] are optimized for PSNR. HLVC [59] trains two models for MS-SSIM and PSNR, respectively. As can be seen from Fig. 6 (a) and (b), our MS-SSIM model outperforms all previous learned approaches, including the state-of-the-art MS-SSIM-optimized approaches Liu et al. [33] (AAAI'20), HLVC [59] (CVPR'20) and Habibian et al. [19] (ICCV'19). In terms of PSNR, Fig. 6 (c) and (d) indicate the superior performance of our PSNR model over the PSNR-optimized models HLVC [59] (CVPR'20), DVC [36] (CVPR'19) and Wu et al. [55] (ECCV'18).
We further tabulate the Bjøntegaard Delta Bit-Rate (BD-BR) [6] results calculated by MS-SSIM and PSNR, with the anchor of x265 (LDP very fast), in Tables I and II, respectively. (Since [55, 33] do not release the results on each video, their BD-BR values cannot be obtained.) Note that BD-BR calculates the average bitrate difference in comparison with the anchor: a lower BD-BR value indicates better performance, and a negative BD-BR indicates a bitrate saving over the anchor, i.e., outperforming the anchor. In Tables I and II, the bold numbers are the best results among the learned approaches. As Table I shows, in terms of MS-SSIM, the proposed RLVC approach outperforms the previous learned approaches on all videos in the high-resolution datasets UVG and JCTVC Class B. Among all 20 test videos, we achieve the best results of the learned approaches on 18 videos, and have the best average BD-BR performance among all learned approaches. Moreover, Table II shows that, in terms of PSNR, our PSNR model performs better than all existing learned approaches on all test videos.
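For reference, BD-BR can be computed from four rate-quality points per codec by Bjøntegaard's method: fit log-rate as a cubic polynomial of quality, integrate both fits over the overlapping quality interval, and convert the average log-rate difference back to a percentage. A sketch follows (our implementation of the standard procedure; the RD points are hypothetical):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjoentegaard delta bit-rate (%): average bitrate difference of the
    test codec vs. the anchor at equal quality; negative means a saving."""
    lr_a = np.log10(rate_anchor)
    lr_t = np.log10(rate_test)
    # Fit log10(rate) as a cubic polynomial of quality for each codec.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    # Integrate both fits over the overlapping quality interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0

# Hypothetical RD points (kbps, dB): a codec needing ~10% less rate everywhere.
r_anchor = [1000.0, 2000.0, 4000.0, 8000.0]
p_anchor = [33.0, 35.5, 38.0, 40.5]
r_test = [900.0, 1800.0, 3600.0, 7200.0]
print(bd_rate(r_anchor, p_anchor, r_test, p_anchor))  # about -10
```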
Note that the latest HLVC [59] (CVPR'20) approach introduces bidirectional prediction, a hierarchical structure and post-processing into learned video compression, while the proposed RLVC approach only works in the unidirectional IPPP mode without post-processing (as shown in Fig. 2). Nevertheless, our approach still achieves better performance than HLVC [59], validating the effectiveness of our recurrent compression framework with the proposed RAE and RPM networks.
Comparison with x265. The rate-distortion curves compared with different settings of x265 are demonstrated in Fig. 7. As Fig. 7 (a) and (b) show, the proposed MS-SSIM model outperforms x265 (LDP very fast), x265 (LDP default), x265 (default) and x265 (SSIM default) on both the UVG and JCTVC datasets from low to high bitrates. Besides, in comparison with the slowest setting of x265, we also achieve better performance on UVG and at high bitrates on JCTVC. Moreover, at high bitrates, we even reach higher MS-SSIM than the SSIM-tuned slowest setting of x265, which can be seen as the best (MS-)SSIM performance that x265 is able to reach.

A similar conclusion can be drawn from the BD-BR results calculated by MS-SSIM in Table I. That is, our RLVC approach reduces the bitrate of the anchor x265 (LDP very fast) on average, and outperforms x265 (LDP default), x265 (default), x265 (SSIM default) and x265 (slowest). In comparison with x265 (SSIM slowest), we achieve better performance on 8 out of the 20 test videos. We also have a better average BD-BR result than x265 (SSIM slowest) on JCTVC Class B, and reach almost the same average performance as x265 (SSIM slowest) on JCTVC Class D.
[Table II: BD-BR (%) calculated by PSNR, with the anchor of x265 (LDP very fast). Columns: DVC [36], HLVC [59] and RLVC (Ours), and the x265 settings LDP default, default and slowest. Rows: the same UVG and JCTVC Class B, C and D videos as in Table I, with per-dataset averages and the average over all videos. The numeric entries were not recovered from the source.]
In terms of PSNR, Fig. 7 (c) and (d) show that our PSNR model outperforms x265 (LDP very fast) from low to high bitrates on both the UVG and JCTVC test sets. Besides, we are superior to x265 (LDP default) at high bitrates on UVG and over a large range of bitrates on JCTVC. The BD-BR results calculated by PSNR in Table II also indicate that our approach saves bitrate over x265 (LDP very fast), and also outperforms x265 (LDP default). We do not outperform the default and the slowest settings of x265 on PSNR. However, x265 (default) and x265 (slowest) apply advanced strategies in video compression, such as bidirectional prediction and a hierarchical frame structure, while our approach only utilizes the unidirectional IPPP mode. Note that, as far as we know, no learned video compression approach beats the default setting of x265 in terms of PSNR. The proposed RLVC approach advances the state-of-the-art learned video compression performance and contributes to catching up with the handcrafted standards step by step.
Visual results. The visual results of our MS-SSIM and PSNR models are illustrated in Fig. 8, in comparison with the default setting of x265. It can be seen from Fig. 8 that our MS-SSIM model reaches higher MS-SSIM at a lower bitrate than x265, and produces compressed frames with fewer blocky artifacts. For our PSNR model, as discussed above, we do not beat the default setting of x265 in terms of PSNR. However, as Fig. 8 shows, our PSNR model also produces fewer blocky artifacts and less noise than x265, and is able to reach similar or even higher MS-SSIM than the default setting of x265 in some cases.
          DVC [36]  HLVC [59]  Habibian [19]  RLVC (Ours)
Encoding    23.3      28.8       31.3           15.9
Decoding    39.5      18.3        0.004         32.1
Computational complexity. We measure the complexity of the learned approaches on one NVIDIA 1080Ti GPU. The results in terms of frames per second (fps) are shown in Table III. As Table III shows, due to the recurrent cells in our autoencoders and probability model, the superior performance of our approach comes at the cost of higher encoding complexity than previous approaches. Nevertheless, we have faster decoding than [59, 19], and achieve real-time decoding on 240p videos. Note that HLVC [59] adopts an enhancement network in the decoder to improve compression quality, which increases decoding complexity. Our RLVC approach (without enhancement) still reaches higher compression performance than HLVC [59], and also has a faster decoding speed. Besides, the autoregressive (PixelCNN-like) probability model used in [19] leads to slow decoding, while the proposed RPM network is much more efficient.
The ablation studies are conducted to verify the effectiveness of each recurrent component in our approach. We define the baseline (BL) as our framework without recurrent cells, i.e., removing the recurrent cells from the autoencoders and replacing our RPM network with the factorized spatial entropy model [3]. In the following, we enable the recurrent cell in the encoder (BL+RE) and in the decoder (BL+RD), respectively. Then, both of them are enabled, i.e., the proposed RAE network (BL+RAE). Finally, our RPM network is further applied to replace the spatial model [3] (BL+RAE+RPM, i.e., our full model). Besides, we also compare our RPM network with the hyperprior spatial entropy model [4].
The proposed RAE. As Fig. 9 shows, the rate-distortion curves of BL+RE and BL+RD are both above the baseline. This indicates that the recurrent encoder and the recurrent decoder are both able to improve the compression performance. Moreover, combining them in the proposed RAE further improves the rate-distortion performance (shown as BL+RAE). A probable reason is that, because of the dual recurrent cells in both the encoder and the decoder, the encoder learns to encode the residual information between the current and previous inputs, which reduces the information content carried by each latent representation, and the decoder then reconstructs the output based on the encoded residual and the previous outputs. This results in efficient compression.
The proposed RPM. It can be seen from Fig. 9 that the proposed RPM (BL+RAE+RPM) significantly reduces the bitrate in comparison with BL+RAE, which uses the spatial entropy model [3]. This verifies that, at the same compression quality, the temporally conditional cross entropy is smaller than the independent cross entropy, i.e., $H(\tilde{p}_t, \tilde{q}_t) < H(p_t, q_t)$.
Besides, Fig. 9 shows that our RPM network further outperforms the hyperprior spatial entropy model [4], which generates the side information $\hat{z}_t$ to facilitate the compression of $\hat{y}_t$. This indicates that, when compressing video at the same quality, the temporally conditional cross entropy is smaller than the spatially conditional cross entropy plus the overhead cross entropy of the side information, i.e., $H(\tilde{p}_t, \tilde{q}_t) < H\big(p_t, q_t(\hat{y}_t \mid \hat{z}_t)\big) + H(\hat{z}_t)$.
The proposed RPM has two benefits over [4]. First, our RPM does not consume overhead bitrate for the prior information, while [4] has to compress $\hat{z}_t$ into the bit stream. Second, our RPM uses the temporal prior of all previous latent representations, while there is only one spatial prior in [4], whose size is much smaller than that of $\hat{y}_t$. In conclusion, these studies verify the benefits of applying the temporal prior to estimate the conditional probability in a recurrent manner.
It is worth pointing out that the proposed RPM network can flexibly be combined with various spatial probability models, e.g., [4, 29, 22]. As an example, we train a model combining the proposed approach with the hyperprior spatial probability model [4], which is illustrated in Fig. 10. This combined model only slightly improves our approach, i.e., a slight bitrate reduction on the JCTVC dataset. On the one hand, such a slight improvement indicates that, due to the high correlation among video frames, the previous latent representations are able to provide most of the useful information, and the spatial prior, which incurs bitrate overhead, is not very helpful for further improving the performance. This validates the effectiveness of our RPM network. On the other hand, it also shows the flexibility of our RPM network to combine with spatial probability models, e.g., replacing the spatial model in Fig. 10 with [4, 29] or [22] (since [29, 22] do not release their training code, we are not able to train the models combining the RPM with [29, 22]), and the possibility to further advance the performance.
This paper has proposed a recurrent learned video compression approach. Specifically, we proposed recurrent autoencoders to compress motion and residual, fully exploring the temporal correlation among video frames. Then, we showed how modeling the conditional probability in a recurrent manner improves the coding efficiency. The proposed recurrent autoencoder and recurrent probability model significantly expand the range of reference frames, which has not been achieved in previous learned approaches or handcrafted standards. The experiments validate that the proposed approach outperforms all previous learned approaches and the LDP default setting of x265 in terms of both PSNR and MS-SSIM, and also outperforms x265 (slowest) on MS-SSIM. The ablation studies verify the effectiveness of each recurrent component in our RLVC approach, and show the flexibility of the proposed RPM network to combine with spatial probability models.
In this paper, our approach works in the IPPP mode. Combining our approach with bidirectional prediction and a hierarchical frame structure is a promising direction for future work. Besides, the recurrent framework of the proposed approach still relies on the warping operation and motion compensation to reduce temporal redundancy. Therefore, another possible future work is designing a fully recurrent deep video compression network that automatically learns to exploit the temporal redundancy without adopting optical-flow-based motion estimation.