Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model

06/24/2020 ∙ by Ren Yang, et al. ∙ ETH Zurich

The past few years have witnessed increasing interest in applying deep learning to video compression. However, the existing approaches compress a video frame with only a small number of reference frames, which limits their ability to fully exploit the temporal correlation among video frames. To overcome this shortcoming, this paper proposes a Recurrent Learned Video Compression (RLVC) approach with a Recurrent Auto-Encoder (RAE) and a Recurrent Probability Model (RPM). Specifically, the RAE employs recurrent cells in both the encoder and decoder. As such, the temporal information in a large range of frames can be used for generating latent representations and reconstructing compressed outputs. Furthermore, the proposed RPM network recurrently estimates the Probability Mass Function (PMF) of the latent representation, conditioned on the distribution of previous latent representations. Due to the correlation among consecutive frames, the conditional cross entropy can be lower than the independent cross entropy, thus reducing the bit-rate. The experiments show that our approach achieves state-of-the-art learned video compression performance in terms of both PSNR and MS-SSIM. Moreover, our approach outperforms the default Low-Delay P (LDP) setting of x265 on PSNR, and also performs better on MS-SSIM than the SSIM-tuned x265 and the slowest setting of x265.

I Introduction

Nowadays, video contributes the majority of mobile data traffic [14]. The demand for high-resolution and high-quality video is also increasing. Therefore, video compression is essential to enable the efficient transmission of video data over the band-limited Internet. In particular, during the COVID-19 pandemic, the surge in data traffic from video conferencing, gaming and online learning forced Netflix and YouTube to limit video quality in Europe. This further shows the essential role of improving video compression in today's society.

During the past decades, several video compression algorithms, such as MPEG [28], H.264 [54] and H.265 [45], were standardized. These standards are handcrafted, and the modules in their compression frameworks cannot be jointly optimized. Recently, inspired by the success of Deep Neural Networks (DNNs) in advancing the rate-distortion performance of image compression [40, 29, 22], many deep learning-based video compression approaches [55, 36, 10, 16, 59] were proposed. In these learned video compression approaches, the whole framework is optimized in an end-to-end manner.

However, both the existing handcrafted [28, 54, 45] and learned [55, 36, 10, 16, 59] video compression approaches utilize non-recurrent structures to compress the sequential video data. As such, only a limited number of references can be used to compress new frames, which limits their ability to exploit temporal correlation and reduce redundancy. Adopting a recurrent compression framework makes it possible to fully take advantage of the correlated information in consecutive frames, and thus facilitates video compression. Moreover, in the entropy coding of previous learned approaches [55, 36, 10, 16, 59], the Probability Mass Functions (PMFs) of the latent representations are estimated independently for each frame, ignoring the correlation between the latent representations of neighboring frames. Similar to the reference frames in the pixel domain, fully making use of the correlation in the latent domain benefits the compression of latent representations. Intuitively, the temporal correlation in the latent domain can also be exploited in a recurrent manner.

Fig. 1: The recurrent structure in our RLVC approach. In this figure, two time steps are shown as an example.

Therefore, this paper proposes a Recurrent Learned Video Compression (RLVC) approach, with a Recurrent Auto-Encoder (RAE) and a Recurrent Probability Model (RPM). As shown in Fig. 1, the proposed RLVC approach uses recurrent networks for representing inputs, reconstructing compressed outputs and modeling PMFs for entropy coding. Specifically, the proposed RAE network contains recurrent cells in both the encoder and decoder. Given a sequence of inputs $\{x_t\}_{t=1}^{T}$, the encoder of RAE recurrently generates the latent representations $\{y_t\}_{t=1}^{T}$, and the decoder reconstructs the compressed outputs $\{\hat{x}_t\}_{t=1}^{T}$ from the quantized latents $\{\hat{y}_t\}_{t=1}^{T}$ in a recurrent manner. As such, all previous frames can be seen as references for compressing the current one, and therefore our RLVC approach is able to make use of the information in a large number of frames, instead of the very limited reference frames in the non-recurrent approaches [55, 36, 10, 59].

Furthermore, the proposed RPM network recurrently models the PMF of $\hat{y}_t$ conditioned on all previous latent representations $\hat{y}_1, \ldots, \hat{y}_{t-1}$. Because of the recurrent cell, our RPM network estimates the temporally conditional PMF $q(\hat{y}_t \mid \hat{y}_{1:t-1})$, instead of the independent PMF $q_t(\hat{y}_t)$ as in previous works [55, 36, 10, 16, 59]. Due to the temporal correlation among $\{\hat{y}_t\}_{t=1}^{T}$, the (cross) entropy of $\hat{y}_t$ conditioned on the previous information is expected to be lower than the independent (cross) entropy. Therefore, our RPM network is able to achieve a lower bit-rate for compressing $\hat{y}_t$. As Fig. 1 illustrates, the proposed RAE and RPM networks build up a recurrent video compression framework. The hidden states for representation learning and probability modeling are recurrently transmitted from frame to frame, and therefore the information in consecutive frames can be fully exploited in both the pixel and latent domains for compressing the upcoming frames. This results in efficient video compression.

The contributions of this paper are summarized as follows:

  • We propose employing the recurrent structure in learned video compression to fully exploit the temporal correlation among a large range of video frames.

  • We propose the recurrent auto-encoder to expand the range of reference frames, and propose the recurrent probability model to recurrently estimate the temporally conditional PMF of the latent representations. This way, the expected bit-rate is reduced to the conditional cross entropy, which can be lower than the independent cross entropy of previous non-recurrent approaches.

  • The experiments validate the superior performance of the proposed approach over the existing learned video compression approaches, and the ablation studies verify the effectiveness of each recurrent component in our framework.

In the following, Section II presents the related works. The proposed RAE and RPM are introduced in Section III. Then, the experiments in Section IV validate the superior performance of the proposed RLVC approach over the existing learned video compression approaches. Finally, the ablation studies further demonstrate the effectiveness of the proposed RAE and RPM networks, respectively.

II Related works

Auto-encoders and RNNs. Auto-encoders [20] have been widely used for representation learning during the past decades. In the field of image processing, plenty of auto-encoders have been proposed for image denoising [12, 18], enhancement [35, 41] and super-resolution [60, 53]. Besides, inspired by the development of Recurrent Neural Networks (RNNs) and their applications to sequential data [25], e.g., language modeling [39, 24] and video analysis [17], several recurrent auto-encoders were proposed for representation learning on time-series tasks, such as machine translation [11, 46] and captioning [51]. Moreover, Srivastava et al. [44] proposed learning video representations using an auto-encoder based on Long Short-Term Memory (LSTM) [21], and verified its effectiveness on video classification and action recognition tasks. However, as far as we know, no recurrent auto-encoder has been utilized in learned video compression.

Learned image compression. In recent years, there has been increasing interest in applying deep auto-encoders in end-to-end DNN models for learned image compression [48, 49, 1, 47, 3, 4, 40, 37, 30, 23, 29, 22]. For instance, Theis et al. [47] proposed a compressive auto-encoder for lossy image compression, and reached performance competitive with JPEG 2000 [43]. Later, various probability models were proposed. For instance, Ballé et al. proposed the factorized prior [3] and hyperprior [4] probability models to estimate entropy in end-to-end DNN image compression frameworks. Based on them, Minnen et al. [40] proposed the hierarchical prior entropy model to improve the compression efficiency. Besides, Mentzer et al. [37] utilized a 3D-CNN as the context model for entropy coding, and proposed learning an importance mask to reduce the redundancy in the latent representation. Recently, the context-adaptive [29] and the coarse-to-fine hyper-prior [22] entropy models were designed to further advance the rate-distortion performance, and successfully outperform the traditional image codec BPG [5].

Learned video compression. Deep learning is also attracting more and more attention in video compression. To improve the coding efficiency of the handcrafted standards, many approaches [57, 34, 13, 15, 31, 32] were proposed to replace components of H.265 with DNNs. Among them, Liu et al. [34] utilized a DNN for the fractional interpolation in motion compensation, and Choi et al. [13] proposed a DNN model for frame prediction. Besides, [15, 31, 32] employed DNNs to improve the in-loop filter of H.265. However, these approaches only advance the performance of one particular module, and the video compression frameworks still cannot be jointly optimized.

Inspired by the success of learned image compression, some learning-based video compression approaches were proposed [8, 9]. However, [8, 9] still adopt handcrafted strategies, such as block matching for motion estimation and compensation, and therefore they fail to optimize the whole compression framework in an end-to-end manner. Recently, several end-to-end DNN frameworks have been proposed for video compression [55, 36, 10, 16, 19, 33, 59]. Specifically, Wu et al. [55] proposed predicting frames by interpolation from reference frames, and compressing the residual with the image compression model of [49]. Later, Lu et al. [36] proposed the Deep Video Compression (DVC) approach, which uses optical flow for motion estimation, and utilizes two auto-encoders to compress the motion and residual, respectively. Then, Djelouah et al. [16] employed bi-directional prediction in learned video compression. Liu et al. [33] proposed a deep video compression framework with a one-stage flow for motion compensation. Most recently, Yang et al. [59] proposed learning for video compression with hierarchical quality layers and adopted a recurrent enhancement network in the deep decoder. Nevertheless, none of them learns to compress video with a recurrent model. Instead, at most two reference frames are used in these approaches [55, 36, 10, 16, 33, 59], and therefore they fail to exploit the temporal correlation in a large number of frames.

Although Habibian et al. [19] proposed taking a group of frames as input to a 3D auto-encoder, the temporal length is limited because all frames in one group have to fit into GPU memory at the same time. Instead, the proposed RLVC network takes as inputs only one frame and the hidden states from the previous frame, and recurrently moves forward. Therefore, we are able to explore a larger range of temporal correlation with finite memory. Also, [19] uses a PixelCNN-like network [50] as an auto-regressive probability model, which makes decoding slow. On the contrary, the proposed RPM network enables our approach to achieve not only more efficient compression but also faster decoding.

III The proposed RLVC approach

Fig. 2: The framework of our RLVC approach. The details of the proposed RAE are shown in Fig. 3. The proposed RPM, which is illustrated in Fig. 5, is applied to the latent representations of motion and residual to estimate their conditional PMFs for arithmetic coding.

III-A Framework

The framework of the proposed RLVC approach is shown in Fig. 2. Inspired by traditional video codecs, we utilize motion compensation to reduce the redundancy among video frames, whose effectiveness in learned compression has been proven in previous works [36, 59]. To be specific, we apply the pyramid optical flow network [42] to estimate the temporal motion between the current frame and the previously compressed frame, e.g., $x_t$ and $\hat{x}_{t-1}$. The large receptive field of the pyramid network [42] helps to handle large and fast motion. Here, we define the raw and compressed frames as $x_t$ and $\hat{x}_t$, respectively. Then, the estimated motion $v_t$ is compressed by the proposed RAE, and the compressed motion $\hat{v}_t$ is applied for motion compensation. In our framework, we use the same motion compensation method as [36, 59]. Subsequently, the residual $r_t$ between $x_t$ and the motion-compensated frame $\tilde{x}_t$ is obtained and compressed by another RAE. Given the compressed residual $\hat{r}_t$, the compressed frame $\hat{x}_t$ can be reconstructed. The details of the proposed RAE are described in Section III-B.

In our framework, the two RAEs at each frame generate the latent representations $m_t$ and $y_t$ for motion and residual compression, respectively. To compress $m_t$ and $y_t$ into a bit stream, we propose the RPM network to recurrently predict the temporally conditional PMFs of $m_t$ and $y_t$. Due to the temporal relationship among video frames, the conditional cross entropy is expected to be lower than the independent cross entropy used in non-recurrent approaches [55, 36, 10, 59]. Hence, utilizing the conditional PMF estimated by our RPM network effectively reduces the bit-rate of arithmetic coding [27]. The proposed RPM is detailed in Section III-C.
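To make the data flow above concrete, the sketch below is a PyTorch-style rendering of one P-frame step of Fig. 2 under our own assumptions; the module names (`motion_est`, `rae_motion`, `motion_comp`, `rae_residual`) and their interfaces are hypothetical placeholders rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (N, C, H, W) with a dense optical flow (N, 2, H, W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame)   # absolute pixel coordinates (2, H, W)
    coords = grid.unsqueeze(0) + flow                        # per-pixel sampling positions
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0                  # normalize to [-1, 1] for grid_sample
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)

def compress_p_frame(x_t, x_prev_hat, nets, states):
    """One P-frame step: motion estimation -> motion compression (RAE) ->
    motion compensation -> residual compression (RAE), carrying the recurrent states."""
    flow = nets["motion_est"](x_t, x_prev_hat)                                  # pyramid optical flow [42]
    flow_hat, m_t, states["rae_m"] = nets["rae_motion"](flow, states["rae_m"])  # compressed motion + latent m_t
    x_tilde = nets["motion_comp"](warp(x_prev_hat, flow_hat), x_prev_hat, flow_hat)
    r_hat, y_t, states["rae_r"] = nets["rae_residual"](x_t - x_tilde, states["rae_r"])
    x_hat = x_tilde + r_hat              # reconstructed frame (assuming an additive residual)
    return x_hat, m_t, y_t, states       # m_t and y_t go to the RPM / arithmetic coder
```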

Fig. 3: The architecture of the proposed RAE network. The up- and down-sampling convolutional layers use a stride of 2. In RAE, different convolutional filter sizes are used when compressing motion and when compressing the residual, and the filter number of each layer is set to 128.

III-B Recurrent Auto-Encoder (RAE)

As mentioned above, we apply two RAEs to compress the motion $v_t$ and the residual $r_t$. Since the two RAEs share the same architecture, we denote both of their latent representations ($m_t$ and $y_t$) by $y_t$ in this section for simplicity, and denote the corresponding input (motion or residual) by $x_t$. Recall that in the non-recurrent learned video compression works [36, 10, 59], when compressing the $t$-th frame, the auto-encoder maps the input $x_t$ to a latent representation

$y_t = E(x_t;\, \theta_E),$    (1)

through an encoder $E$ parametrized by $\theta_E$. Then, the continuous-valued $y_t$ is quantized to the discrete-valued $\hat{y}_t$. The compressed output $\hat{x}_t$ is reconstructed by the decoder from the quantized latent representation, i.e.,

$\hat{x}_t = D(\hat{y}_t;\, \theta_D).$    (2)

Taking only the current $x_t$ and $\hat{y}_t$ as inputs to the encoder and decoder, respectively, these approaches fail to take advantage of the temporal correlation among consecutive frames.

On the contrary, the proposed RAE includes recurrent cells in both the encoder and decoder. The architecture of the RAE network is illustrated in Fig. 3. We follow [4] to use four down-sampling convolutional layers with the GDN activation function [3] in the encoder of RAE. In the middle of the four convolutional layers, we insert a ConvLSTM [56] cell to achieve the recurrent structure. As such, the information from previous frames flows into the encoder network of the current frame through the hidden states of the ConvLSTM. Therefore, the proposed RAE generates the latent representation based on the current as well as previous inputs. Similarly, the recurrent decoder of RAE also has a ConvLSTM cell in the middle of its four up-sampling convolutional layers with IGDN [3], and thus reconstructs $\hat{x}_t$ from both the current and previous latent representations. In summary, our RAE network can be formulated as

$y_t = E(x_1, \ldots, x_t;\, \theta_E), \qquad \hat{x}_t = D(\hat{y}_1, \ldots, \hat{y}_t;\, \theta_D).$    (3)

In (3), all previous frames can be seen as reference frames for compressing the current frame, and therefore our RLVC approach is able to make use of the information in a large range of frames, instead of the very limited number of reference frames in the non-recurrent approaches [55, 36, 10, 59].
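As an illustration, the following is a minimal PyTorch sketch of such a recurrent encoder under our own assumptions: four stride-2 convolutions with a ConvLSTM cell in the middle. For brevity, the GDN activations [3] are replaced here by ReLU, and the 5x5 kernel size is an assumption of ours, so this is an approximation rather than the exact RAE.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell [56]: one convolution produces the four gates."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, 4 * channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x, state=None):
        h, c = (torch.zeros_like(x), torch.zeros_like(x)) if state is None else state
        i, f, o, g = torch.chunk(self.conv(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

class RecurrentEncoder(nn.Module):
    """Sketch of the RAE encoder: two stride-2 convs, ConvLSTM, two stride-2 convs (128 filters each)."""
    def __init__(self, in_ch=3, ch=128):
        super().__init__()
        self.down12 = nn.Sequential(
            nn.Conv2d(in_ch, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2), nn.ReLU())
        self.lstm = ConvLSTMCell(ch)
        self.down34 = nn.Sequential(
            nn.Conv2d(ch, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2))

    def forward(self, x, state=None):
        z, state = self.lstm(self.down12(x), state)   # temporal information flows through (h, c)
        return self.down34(z), state                  # latent y_t and the updated recurrent state
```

Across a sequence, the returned state is fed back at the next time step, so each latent depends on all earlier inputs as in (3); the recurrent decoder mirrors this structure with up-sampling convolutions.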

III-C Recurrent Probability Model (RPM)

To compress the sequence of latent representations $\{\hat{y}_t\}_{t=1}^{T}$, the RPM network is proposed for entropy coding. First, we use $p_t(\hat{y}_t)$ and $q_t(\hat{y}_t)$ to denote the true and estimated independent PMFs of $\hat{y}_t$. The expected bit-rate of $\hat{y}_t$ is then given by the cross entropy

$H(p_t, q_t) = \mathbb{E}_{\hat{y}_t \sim p_t}\big[-\log_2 q_t(\hat{y}_t)\big].$    (4)

Note that arithmetic coding [27] is able to encode $\hat{y}_t$ at the bit-rate of the cross entropy with negligible overhead. It can be seen from (4) that if $\hat{y}_t$ has higher certainty, the bit-rate can be smaller. Due to the temporal relationship among video frames, the distributions of $\hat{y}_t$ in consecutive frames are correlated. Therefore, conditioned on the information of the previous latent representations $\hat{y}_1, \ldots, \hat{y}_{t-1}$, the current $\hat{y}_t$ is expected to be more certain. That is, defining $p(\hat{y}_t \mid \hat{y}_{1:t-1})$ and $q(\hat{y}_t \mid \hat{y}_{1:t-1})$ as the true and estimated temporally conditional PMFs of $\hat{y}_t$, the conditional cross entropy

$H(p, q) = \mathbb{E}_{\hat{y}_{1:t} \sim p}\big[-\log_2 q(\hat{y}_t \mid \hat{y}_{1:t-1})\big]$    (5)

can be smaller than the independent cross entropy in (4). To achieve the expected bit-rate of (5), we propose the RPM network to recurrently model the conditional PMF $q(\hat{y}_t \mid \hat{y}_{1:t-1})$.

Specifically, adaptive arithmetic coding [27] allows changing the PMF for each element in $\hat{y}_t$, and thus we estimate a different conditional PMF for each element $\hat{y}_{t,i}$. Here, $\hat{y}_{t,i}$ is defined as the element at the $i$-th 3D location in $\hat{y}_t$, and the conditional PMF of $\hat{y}_t$ can be expressed as

$q(\hat{y}_t \mid \hat{y}_{1:t-1}) = \prod_{i=1}^{N} q_i(\hat{y}_{t,i} \mid \hat{y}_{1:t-1}),$    (6)

in which $N$ denotes the number of 3D positions in $\hat{y}_t$. As shown in Fig. 4, we model the conditional PMF of each element as a discretized logistic distribution in our approach. Since the quantization operation in RAE rounds each $y_{t,i}$ to a discrete value $\hat{y}_{t,i}$, the conditional PMF of the quantized $\hat{y}_{t,i}$ can be obtained by integrating the continuous logistic distribution [2] from $\hat{y}_{t,i} - \tfrac{1}{2}$ to $\hat{y}_{t,i} + \tfrac{1}{2}$:

$q_i(\hat{y}_{t,i} \mid \hat{y}_{1:t-1}) = \int_{\hat{y}_{t,i} - \frac{1}{2}}^{\hat{y}_{t,i} + \frac{1}{2}} \mathrm{Logistic}(y;\, \mu_{t,i}, s_{t,i}) \,\mathrm{d}y,$    (7)

in which the logistic distribution is defined as

$\mathrm{Logistic}(y;\, \mu, s) = \dfrac{e^{-(y-\mu)/s}}{s\big(1 + e^{-(y-\mu)/s}\big)^2},$    (8)

and its integral is the sigmoid function, i.e.,

$\int_{-\infty}^{y} \mathrm{Logistic}(y';\, \mu, s) \,\mathrm{d}y' = \mathrm{sigmoid}\!\left(\frac{y - \mu}{s}\right) = \frac{1}{1 + e^{-(y-\mu)/s}}.$    (9)

Given (7), (8) and (9), the estimated conditional PMF $q_i(\hat{y}_{t,i} \mid \hat{y}_{1:t-1})$ can be simplified as

$q_i(\hat{y}_{t,i} \mid \hat{y}_{1:t-1}) = \mathrm{sigmoid}\!\left(\frac{\hat{y}_{t,i} + \frac{1}{2} - \mu_{t,i}}{s_{t,i}}\right) - \mathrm{sigmoid}\!\left(\frac{\hat{y}_{t,i} - \frac{1}{2} - \mu_{t,i}}{s_{t,i}}\right).$    (10)

It can be seen from (10) that the conditional PMF at each location is modelled with the parameters $\mu_{t,i}$ and $s_{t,i}$, which vary across locations in $\hat{y}_t$. The RPM network is proposed to recurrently estimate $\mu_t$ and $s_t$ in (10). Fig. 5 shows the detailed architecture of our RPM network, which contains a recurrent network with convolutional layers and a ConvLSTM cell in the middle. Due to the recurrent structure, $\mu_t$ and $s_t$ are generated based on all previous latent representations, i.e.,

$\{\mu_t, s_t\} = \mathrm{RPM}(\hat{y}_1, \ldots, \hat{y}_{t-1};\, \theta_{\mathrm{RPM}}),$    (11)

where $\theta_{\mathrm{RPM}}$ represents the trainable parameters of RPM. Because RPM takes the previous latent representations as inputs, $\mu_t$ and $s_t$ learn to model the probability of each $\hat{y}_{t,i}$ conditioned on $\hat{y}_{1:t-1}$ according to (10). Finally, the conditional PMFs are applied in adaptive arithmetic coding [27] to encode $\hat{y}_t$ into a bit stream.
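As a small illustration, the snippet below is our own PyTorch rendering of (10) and of the resulting bit-rate estimate (the function names are ours, not from the released code):

```python
import torch

def conditional_pmf(y_hat, mu, s, eps=1e-9):
    """Discretized logistic PMF of Eq. (10): the probability mass of each quantized element
    is the difference of two sigmoids evaluated half a quantization bin above and below it.
    `mu` and `s` are the location/scale maps predicted by the RPM network."""
    upper = torch.sigmoid((y_hat + 0.5 - mu) / s)
    lower = torch.sigmoid((y_hat - 0.5 - mu) / s)
    return torch.clamp(upper - lower, min=eps)   # clamp for numerical stability of the log

def estimated_bits(y_hat, mu, s):
    """Estimated bit-rate of adaptive arithmetic coding: sum of -log2 q over all elements."""
    return torch.sum(-torch.log2(conditional_pmf(y_hat, mu, s)))
```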

Fig. 4: Modeling the conditional PMF with a discretized logistic distribution.

III-D Training

In this paper, we utilize the Multi-Scale Structural SIMilarity (MS-SSIM) index and the Peak Signal-to-Noise Ratio (PSNR) to evaluate compression quality, and train two models optimized for MS-SSIM and PSNR, respectively. The distortion $D$ is defined as $1 - \text{MS-SSIM}$ when optimizing for MS-SSIM, and as the Mean Square Error (MSE) when training the PSNR model. As Fig. 2 shows, our approach uses the uni-directional Low-Delay P (LDP) structure. We follow [59] to compress the I-frame with the learned image compression method [29] for the MS-SSIM model, and with BPG [5] for the PSNR model. Because there is no previous latent representation for the first P-frame, its motion and residual latents $\hat{m}_1$ and $\hat{y}_1$ are compressed by the spatial entropy model of [3], with bit-rates denoted as $R_1^m$ and $R_1^y$, respectively. The following P-frames are compressed with the proposed RPM network. For $t \geq 2$, the actual bit-rate can be calculated as

$R_t = \sum_{i=1}^{N} -\log_2 q_i(\hat{y}_{t,i} \mid \hat{y}_{1:t-1}),$    (12)

in which $q_i(\hat{y}_{t,i} \mid \hat{y}_{1:t-1})$ is modelled by the proposed RPM according to (6) to (11). Note that, assuming the distribution of the training set is identical to the true distribution, the actual bit-rate is expected to equal the conditional cross entropy in (5). In our approach, two RPM networks are applied to the latent representations of motion and residual, and their bit-rates are denoted as $R_t^m$ and $R_t^y$, respectively.

Fig. 5: The architecture of the RPM network, in which all layers have 128 convolutional filters.

Our RLVC approach is trained on the Vimeo-90k [58] dataset, in which each training sample has 7 frames. The first frame is compressed as the I-frame and the other 6 frames are P-frames. First, we warm up the network on the first P-frame $x_1$ in a progressive manner. At the beginning, the motion estimation network is trained with the loss function

$L_{\mathrm{ME}} = D\big(x_1,\, W(\hat{x}_0, v_1)\big),$    (13)

in which $v_1$ is the output of the motion estimation network (as shown in Fig. 2) and $W(\cdot)$ is the warping operation. When $L_{\mathrm{ME}}$ has converged, we further include the RAE network for compressing motion and the motion compensation network into training, using the following loss function

$L_{\mathrm{MC}} = \lambda \cdot D(x_1, \tilde{x}_1) + R_1^m.$    (14)

After the convergence of $L_{\mathrm{MC}}$, the whole network is jointly trained on the first P-frame by the loss

$L_1 = \lambda \cdot D(x_1, \hat{x}_1) + R_1^m + R_1^y.$    (15)

Finally, we train our recurrent model in an end-to-end manner on the sequential training frames using the loss function

$L = \sum_{t=1}^{T} \big(\lambda \cdot D(x_t, \hat{x}_t) + R_t^m + R_t^y\big).$    (16)
Fig. 6: The rate-distortion performance of our RLVC approach compared with the learned video compression approaches on the UVG and JCT-VC datasets.
Fig. 7: The rate-distortion performance of our RLVC approach compared with different settings of x265 on the UVG and JCT-VC datasets.

During training, quantization is relaxed by the method in [3] to avoid zero gradients. We follow [59] to set $\lambda$ as 8, 16, 32 and 64 for MS-SSIM, and as 256, 512, 1024 and 2048 for PSNR. The Adam optimizer [26] is utilized for training. The same initial learning rate is used for all loss functions (13), (14), (15) and (16). When training the whole model by the final loss of (16), we decay the learning rate by a factor of 10 after convergence.
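For clarity, here is a minimal sketch (our own PyTorch approximation, not the released training code) of the per-frame rate-distortion term in (16) together with the uniform-noise quantization relaxation of [3]; the per-pixel rate normalization is a common convention and an assumption on our part.

```python
import torch

def quantize(y, training):
    """Quantization relaxation of [3]: additive uniform noise during training, rounding at test time."""
    if training:
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y)

def rd_loss(x, x_hat, total_bits, lam, num_pixels):
    """One frame's term of Eq. (16): lambda * D + R, with D as MSE (the PSNR model);
    for the MS-SSIM model, D would be 1 - MS-SSIM instead."""
    distortion = torch.mean((x - x_hat) ** 2)
    rate = total_bits / num_pixels               # bits per pixel (assumed normalization)
    return lam * distortion + rate
```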

IV Experiments

IV-A Settings

TABLE I: BDBR calculated by MS-SSIM with the anchor of x265 (LDP very fast); bold marks the best result among the learned approaches. The table reports per-video results and per-dataset averages on UVG (Beauty, Bosphorus, HoneyBee, Jockey, ReadySetGo, ShakeNDry, YachtRide), JCT-VC Class B (BasketballDrive, BQTerrace, Cactus, Kimono, ParkScene), Class C (BasketballDrill, BQMall, PartyScene, RaceHorses 480p) and Class D (BasketballPass, BlowingBubbles, BQSquare, RaceHorses 240p), as well as the average over all videos, for DVC [36], Cheng et al. [10], Habibian et al. [19], HLVC [59], RLVC (ours), and x265 with the LDP default, default, SSIM default, slowest and SSIM slowest settings.

The experiments are conducted to validate the effectiveness of our RLVC approach. We evaluate the performance on the same test set as [59], i.e., the JCT-VC [7] (Classes B, C and D) and the UVG [38] datasets. The JCT-VC Class B and UVG datasets are high-resolution (1080p), while the JCT-VC Classes C and D are 480p and 240p, respectively. We compare our RLVC approach with the latest learned video compression methods: HLVC [59] (CVPR'20), Liu et al. [33] (AAAI'20), Habibian et al. [19] (ICCV'19), DVC [36] (CVPR'19), Cheng et al. [10] (CVPR'19) and Wu et al. [55] (ECCV'18). To compare with the handcrafted video coding standard H.265 [45], we first include the LDP very fast setting of x265 in the comparison, which is used as the anchor in previous learned compression works [36, 59, 33]. We also compare our approach with the LDP default, the default and the slowest settings of x265. Moreover, the SSIM-tuned x265 is also compared with our MS-SSIM model. The detailed configurations of x265 are listed as follows:

  • x265 (LDP very fast):

    ffmpeg (input) -c:v libx265

    -preset veryfast -tune zerolatency

    -x265-params "crf=CRF:keyint=10" output.mkv

  • x265 (LDP default):

    ffmpeg (input) -c:v libx265

    -tune zerolatency

    -x265-params "crf=CRF" output.mkv

  • x265 (default):

    ffmpeg (input) -c:v libx265

    -x265-params "crf=CRF" output.mkv

  • x265 (SSIM default):

    ffmpeg (input) -c:v libx265 -tune ssim

    -x265-params "crf=CRF" output.mkv

  • x265 (slowest):

    ffmpeg (input) -c:v libx265

    -preset placebo (note: placebo is the slowest setting among the 10 speed levels in x265)

    -x265-params "crf=CRF" output.mkv

  • x265 (SSIM slowest):

    ffmpeg (input) -c:v libx265

    -preset placebo -tune ssim

    -x265-params "crf=CRF" output.mkv

In the above settings, "(input)" is short for "-pix_fmt yuv420p -s WidthxHeight -r Framerate -i input.yuv". CRF controls the compression quality, and a lower CRF corresponds to higher quality. We set CRF = 15, 19, 23, 27 for the JCT-VC dataset, and CRF = 7, 11, 15, 19, 23 for the UVG dataset.

Please refer to the Supporting Document for the experimental results on more datasets, such as the conversational video dataset and the MCL-JCV [52] dataset.

IV-B Performance

Comparison with learned approaches. Fig. 6 illustrates the rate-distortion curves of our RLVC approach in comparison with previous learned video compression approaches on the UVG and JCT-VC datasets. Among the compared approaches, Liu et al. [33] and Habibian et al. [19] are optimized for MS-SSIM. DVC [36] and Wu et al. [55] are optimized for PSNR. HLVC [59] trains two models for MS-SSIM and PSNR, respectively. As we can see from Fig. 6 (a) and (b), our MS-SSIM model outperforms all previous learned approaches, including the state-of-the-art MS-SSIM optimized approaches Liu et al. [33] (AAAI'20), HLVC [59] (CVPR'20) and Habibian et al. [19] (ICCV'19). In terms of PSNR, Fig. 6 (c) and (d) indicate the superior performance of our PSNR model over the PSNR-optimized models HLVC [59] (CVPR'20), DVC [36] (CVPR'19) and Wu et al. [55] (ECCV'18).

We further tabulate the Bjøntegaard Delta Bit-Rate (BDBR) [6] results calculated by MS-SSIM and PSNR with the anchor of x265 (LDP very fast) in Tables I and II, respectively. (Since [55, 33] do not release per-video results, their BDBR values cannot be obtained.) Note that BDBR measures the average bit-rate difference in comparison with the anchor. A lower BDBR value indicates better performance, and a negative BDBR indicates saving bit-rate in comparison with the anchor, i.e., outperforming the anchor. In Tables I and II, the bold numbers are the best results among the learned approaches. As Table I shows, in terms of MS-SSIM, the proposed RLVC approach outperforms previous learned approaches on all videos in the high-resolution datasets UVG and JCT-VC Class B. Among all 20 test videos, we achieve the best results among learned approaches on 18 videos, and have the best average BDBR performance among all learned approaches. Moreover, Table II shows that, in terms of PSNR, our PSNR model performs better than all existing learned approaches on all test videos.
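For reference, below is a sketch of the standard Bjøntegaard delta bit-rate computation [6] that we assume underlies these tables: a cubic fit of log bit-rate against quality, integrated over the overlapping quality range. This is our own NumPy rendering, not the evaluation script used in the paper.

```python
import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test):
    """Bjontegaard delta bit-rate (%): average bit-rate difference of the test codec
    against the anchor at equal quality (PSNR or MS-SSIM). Negative = bit-rate saving."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    p_a = np.polyfit(q_anchor, lr_a, 3)           # log-rate as a cubic polynomial of quality
    p_t = np.polyfit(q_test, lr_t, 3)
    lo = max(min(q_anchor), min(q_test))          # overlapping quality interval
    hi = min(max(q_anchor), max(q_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100
```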

Note that the latest HLVC [59] (CVPR'20) approach introduces bi-directional prediction, a hierarchical structure and post-processing into learned video compression, while the proposed RLVC approach only works in the uni-directional IPPP mode without post-processing (as shown in Fig. 2). Nevertheless, our approach still achieves better performance than HLVC [59], validating the effectiveness of our recurrent compression framework with the proposed RAE and RPM networks.

Comparison with x265. The rate-distortion curves compared with different settings of x265 are demonstrated in Fig. 7. As Fig. 7 (a) and (b) show, the proposed MS-SSIM model outperforms x265 (LDP very fast), x265 (LDP default), x265 (default) and x265 (SSIM default) on both the UVG and JCT-VC datasets from low to high bit-rates. Besides, in comparison with the slowest setting of x265, we also achieve better performance on UVG and at high bit-rates on JCT-VC. Moreover, at high bit-rates, we even have higher MS-SSIM performance than the SSIM-tuned slowest setting of x265, which can be seen as the best (MS-)SSIM performance that x265 is able to reach.

Similar conclusions can be drawn from the BDBR results calculated by MS-SSIM in Table I. That is, our RLVC approach reduces the bit-rate of the anchor x265 (LDP very fast) on average, and outperforms x265 (LDP default), x265 (default), x265 (SSIM default) and x265 (slowest). In comparison with x265 (SSIM slowest), we achieve better performance on 8 out of the 20 test videos. We also have a better average BDBR result than x265 (SSIM slowest) on JCT-VC Class B, and reach almost the same average performance as x265 (SSIM slowest) on JCT-VC Class D.

TABLE II: BDBR calculated by PSNR with the anchor of x265 (LDP very fast); bold marks the best result among the learned approaches. The table reports per-video results and per-dataset averages on UVG and the JCT-VC Classes B, C and D (the same test videos as in Table I), as well as the average over all videos, for DVC [36], HLVC [59], RLVC (ours), and x265 with the LDP default, default and slowest settings.

In terms of PSNR, Fig. 7 (c) and (d) show that our PSNR model outperforms x265 (LDP very fast) from low to high bit-rates on both the UVG and JCT-VC test sets. Besides, we are superior to x265 (LDP default) at high bit-rates on UVG and over a large range of bit-rates on JCT-VC. The BDBR results calculated by PSNR in Table II also indicate that our approach achieves a lower bit-rate than x265 (LDP very fast), and also outperforms x265 (LDP default). We do not outperform the default and the slowest settings of x265 on PSNR. However, x265 (default) and x265 (slowest) apply advanced strategies in video compression, such as bi-directional prediction and a hierarchical frame structure, while our approach only utilizes the uni-directional IPPP mode. Note that, as far as we know, no learned video compression approach beats the default setting of x265 in terms of PSNR. The proposed RLVC approach advances the state-of-the-art learned video compression performance and contributes to catching up with the handcrafted standards step by step.

Fig. 8: The visual results of the MS-SSIM and PSNR models of the proposed RLVC approach in comparison with the default setting of x265.

Visual results. The visual results of our MS-SSIM and PSNR models are illustrated in Fig. 8, in comparison with the default setting of x265. It can be seen from Fig. 8 that our MS-SSIM model reaches higher MS-SSIM at a lower bit-rate than x265, and produces compressed frames with fewer blocky artifacts. For our PSNR model, as discussed above, we do not beat the default setting of x265 in terms of PSNR. However, as Fig. 8 shows, our PSNR model also yields fewer blocky artifacts and less noise than x265, and is able to reach similar or even higher MS-SSIM than the default setting of x265 in some cases.

            DVC [36]   HLVC [59]   Habibian [19]   RLVC (Ours)
Encoding    23.3       28.8        31.3            15.9
Decoding    39.5       18.3        0.004           32.1
TABLE III: Complexity (fps) on 240p videos.

Computational complexity. We measure the complexity of the learned approaches on one NVIDIA 1080Ti GPU. The results in terms of frames per second (fps) are shown in Table III. As Table III shows, due to the recurrent cells in our auto-encoders and probability model, the superior performance of our approach comes at the cost of higher encoding complexity than previous approaches. Nevertheless, we have faster decoding than [59, 19], and achieve real-time decoding on 240p videos (32.1 fps, as shown in Table III). Note that HLVC [59] adopts an enhancement network in the decoder to improve compression quality, which increases decoding complexity. Our RLVC approach (without enhancement) still reaches higher compression performance than HLVC [59], and also has a faster decoding speed. Besides, the auto-regressive (PixelCNN-like) probability model used in [19] leads to slow decoding, while the proposed RPM network is more efficient.

IV-C Ablation studies

The ablation studies are conducted to verify the effectiveness of each recurrent component in our approach. We define the baseline (BL) as our framework without recurrent cells, i.e., removing the recurrent cells from the auto-encoders and replacing our RPM network with the factorized spatial entropy model [3]. In the following, we enable the recurrent cell in the encoder (BL+RE) and in the decoder (BL+RD), respectively. Then, both of them are enabled, i.e., the proposed RAE network (BL+RAE). Finally, our RPM network is further applied to replace the spatial model [3] (BL+RAE+RPM, i.e., our full model). Besides, we also compare our RPM network with the hyperprior spatial entropy model [4].

The proposed RAE. As Fig. 9 shows, the rate-distortion curves of BL+RE and BL+RD are both above the baseline. This indicates that the recurrent encoder and the recurrent decoder are both able to improve the compression performance. Moreover, combining them in the proposed RAE further improves the rate-distortion performance (shown as BL+RAE). The probable reason is that, because of the recurrent cells in both the encoder and decoder, the encoder learns to encode the residual information between the current and previous inputs, which reduces the information content carried by each latent representation, and the decoder then reconstructs the output based on the encoded residual and previous outputs. This results in efficient compression.

Fig. 9: Ablation results of PSNR (dB) on the JCT-VC dataset.

The proposed RPM. It can be seen from Fig. 9 that the proposed RPM (BL+RAE+RPM) significantly reduces the bit-rate in comparison with BL+RAE, which uses the spatial entropy model [3]. This confirms that at the same compression quality, the temporally conditional cross entropy is smaller than the independent cross entropy, i.e.,
$\mathbb{E}\big[-\log_2 q(\hat{y}_t \mid \hat{y}_{1:t-1})\big] < \mathbb{E}\big[-\log_2 q_t(\hat{y}_t)\big].$

Besides, Fig. 9 shows that our RPM network further outperforms the hyperprior spatial entropy model [4], which generates side information $\hat{z}_t$ to facilitate the compression of $\hat{y}_t$. This indicates that when compressing video at the same quality, the temporally conditional cross entropy is smaller than the spatially conditional cross entropy plus the overhead cross entropy of $\hat{z}_t$, i.e.,
$\mathbb{E}\big[-\log_2 q(\hat{y}_t \mid \hat{y}_{1:t-1})\big] < \mathbb{E}\big[-\log_2 q(\hat{y}_t \mid \hat{z}_t)\big] + \mathbb{E}\big[-\log_2 q(\hat{z}_t)\big].$

The proposed RPM has two benefits over [4]. First, our RPM does not consume overhead bit-rate to compress the prior information, while [4] has to compress $\hat{z}_t$ into the bit stream. Second, our RPM uses the temporal prior of all previous latent representations, while [4] uses only a single spatial prior $\hat{z}_t$, whose size is much smaller than that of $\hat{y}_t$. In conclusion, these studies verify the benefits of applying the temporal prior to estimate the conditional probability in a recurrent manner.

Fig. 10: The probability model combining the proposed RPM with the spatial hyperprior model [4].

IV-D Combining RPM with spatial probability models

It is worth pointing out that the proposed RPM network can flexibly be combined with various spatial probability models, e.g., [4, 29, 22]. As an example, we train a model combining the proposed approach with the hyperprior spatial probability model [4], which is illustrated in Fig. 10. This combined model only slightly reduces the bit-rate of our approach on the JCT-VC dataset. On the one hand, such a slight improvement indicates that, due to the high correlation among video frames, the previous latent representations already provide most of the useful information, and the spatial prior, which leads to bit-rate overhead, is not very helpful for further improving the performance. This validates the effectiveness of our RPM network. On the other hand, it also shows the flexibility of our RPM network to combine with spatial probability models, e.g., replacing the spatial model in Fig. 10 with [4, 29] or [22] (since [29, 22] do not release their training code, we are not able to train the models combining RPM with [29, 22]), and the possibility to further advance the performance.

V Conclusion and future work

This paper has proposed a recurrent learned video compression approach. Specifically, we proposed recurrent auto-encoders to compress motion and residual, fully exploring the temporal correlation among video frames. Then, we showed how modeling the conditional probability in a recurrent manner improves the coding efficiency. The proposed recurrent auto-encoders and recurrent probability model significantly expand the range of reference frames, which has not been achieved in previous learned approaches or handcrafted standards. The experiments validate that the proposed approach outperforms all previous learned approaches and the LDP default setting of x265 in terms of both PSNR and MS-SSIM, and also outperforms x265 (slowest) on MS-SSIM. The ablation studies verify the effectiveness of each recurrent component in our RLVC approach, and show the flexibility of the proposed RPM network to combine with spatial probability models.

In this paper, our approach works in the IPPP mode. Combining our approach with bi-directional prediction and a hierarchical frame structure is a promising direction for future work. Besides, the recurrent framework of the proposed approach still relies on the warping operation and motion compensation to reduce the temporal redundancy. Therefore, another possible future work is designing a fully recurrent deep video compression network that automatically learns to exploit the temporal redundancy without relying on optical-flow-based motion.

References

  • [1] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1141–1151. Cited by: §II.
  • [2] N. Balakrishnan (1991) Handbook of the logistic distribution. CRC Press. Cited by: §III-C.
  • [3] J. Ballé, V. Laparra, and E. P. Simoncelli (2017) End-to-end optimized image compression. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §II, §III-B, §III-D, §III-D, §IV-C, §IV-C.
  • [4] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §II, §III-B, Fig. 10, §IV-C, §IV-C, §IV-D.
  • [5] F. Bellard BPG image format. Note: https://bellard.org/bpg/ Cited by: §II, §III-D.
  • [6] G. Bjontegaard (2001) Calculation of average PSNR differences between RD-curves. VCEG-M33. Cited by: §IV-B.
  • [7] F. Bossen (2013) Common test conditions and software reference configurations. JCTVC-L1100 12. Cited by: §IV-A.
  • [8] T. Chen, H. Liu, Q. Shen, T. Yue, X. Cao, and Z. Ma (2017) DeepCoder: a deep neural network based video compression. In Proceedings of the IEEE Visual Communications and Image Processing (VCIP), pp. 1–4. Cited by: §II.
  • [9] Z. Chen, T. He, X. Jin, and F. Wu (2019) Learning for video compression. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §II.
  • [10] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto (2019) Learning image and video compression through spatial-temporal energy compaction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10071–10080. Cited by: §I, §I, §I, §I, §II, §III-A, §III-B, §III-B, §IV-A, TABLE I.
  • [11] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Cited by: §II.
  • [12] K. Cho (2013) Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In Proceedings of the International Conference on Machine Learning (ICML), pp. 432–440. Cited by: §II.
  • [13] H. Choi and I. V. Bajić (2019) Deep frame prediction for video coding. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §II.
  • [14] Cisco Cisco visual networking index: global mobile data traffic forecast update, 2017-2022 white paper. Note: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-738429.html Cited by: §I.
  • [15] Y. Dai, D. Liu, and F. Wu (2017) A convolutional neural network approach for post-processing in HEVC intra coding. In Proceedings of the International Conference on Multimedia Modeling (MMM), pp. 28–39. Cited by: §II.
  • [16] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers (2019) Neural inter-frame compression for video coding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6421–6429. Cited by: §I, §I, §I, §II, §V-B.
  • [17] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2625–2634. Cited by: §II.
  • [18] L. Gondara (2016) Medical image denoising using convolutional denoising autoencoders. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pp. 241–246. Cited by: §II.
  • [19] A. Habibian, T. van Rozendaal, J. M. Tomczak, and T. S. Cohen (2019) Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference of Computer Vision (ICCV), Cited by: §II, §II, §IV-A, §IV-B, §IV-B, TABLE I, TABLE III.
  • [20] G. E. Hinton and R. S. Zemel (1994) Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pp. 3–10. Cited by: §II.
  • [21] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II.
  • [22] Y. Hu, W. Yang, and J. Liu (2020) Coarse-to-fine hyper-prior modeling for learned image compression. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §I, §II, §IV-D, footnote 3.
  • [23] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4385–4393. Cited by: §II.
  • [24] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu (2016) Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410. Cited by: §II.
  • [25] Cited by: §II.
  • [26] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §III-D.
  • [27] G. G. Langdon (1984) An introduction to arithmetic coding. IBM Journal of Research and Development 28 (2), pp. 135–149. Cited by: §III-A, §III-C, §III-C.
  • [28] D. J. Le Gall (1992) The MPEG video compression algorithm. Signal Processing: Image Communication 4 (2), pp. 129–140. Cited by: §I, §I.
  • [29] J. Lee, S. Cho, and S. Beack (2019) Context-adaptive entropy model for end-to-end optimized image compression. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §I, §II, §III-D, §IV-D, footnote 3.
  • [30] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang (2018) Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3214–3223. Cited by: §II.
  • [31] T. Li, M. Xu, R. Yang, and X. Tao (2019) A DenseNet based approach for multi-frame in-loop filter in HEVC. In Proceedings of the Data Compression Conference (DCC), pp. 270–279. Cited by: §II.
  • [32] T. Li, M. Xu, C. Zhu, R. Yang, Z. Wang, and Z. Guan (2019) A deep learning approach for multi-frame in-loop filter of HEVC. IEEE Transactions on Image Processing. Cited by: §II.
  • [33] H. Liu, L. Huang, M. Lu, T. Chen, and Z. Ma (2020) Learned video compression via joint spatial-temporal correlation exploration. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §II, §IV-A, §IV-B, §V-A, footnote 2, footnote 4.
  • [34] J. Liu, S. Xia, W. Yang, M. Li, and D. Liu (2018) One-for-all: grouped variation network-based fractional interpolation in video coding. IEEE Transactions on Image Processing 28 (5), pp. 2140–2151. Cited by: §II.
  • [35] K. G. Lore, A. Akintayo, and S. Sarkar (2017) LLNet: a deep autoencoder approach to natural low-light image enhancement. Pattern Recognition 61, pp. 650–662. Cited by: §II.
  • [36] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao (2019) DVC: an end-to-end deep video compression framework. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11006–11015. Cited by: §I, §I, §I, §I, §II, §III-A, §III-A, §III-B, §III-B, §IV-A, §IV-B, TABLE I, TABLE II, TABLE III, §V-A.
  • [37] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4394–4402. Cited by: §II.
  • [38] A. Mercat, M. Viitanen, and J. Vanne (2020) UVG dataset: 50/120fps 4K sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference, pp. 297–302. Cited by: §IV-A.
  • [39] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, Cited by: §II.
  • [40] D. Minnen, J. Ballé, and G. D. Toderici (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems (NeurIPS), pp. 10771–10780. Cited by: §I, §II.
  • [41] S. Park, S. Yu, M. Kim, K. Park, and J. Paik (2018) Dual autoencoder network for retinex-based low-light image enhancement. IEEE Access 6, pp. 22084–22093. Cited by: §II.
  • [42] A. Ranjan and M. J. Black (2017) Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4161–4170. Cited by: §III-A.
  • [43] A. Skodras, C. Christopoulos, and T. Ebrahimi (2001) The JPEG 2000 still image compression standard. IEEE Signal processing magazine 18 (5), pp. 36–58. Cited by: §II.
  • [44] N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using LSTMs. In Proceedings of the International Conference on Machine Learning (ICML), pp. 843–852. Cited by: §II.
  • [45] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand (2012) Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on circuits and systems for video technology 22 (12), pp. 1649–1668. Cited by: §I, §I, §IV-A.
  • [46] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §II.
  • [47] L. Theis, W. Shi, A. Cunningham, and F. Huszár (2017) Lossy image compression with compressive autoencoders. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §II.
  • [48] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar (2016) Variable rate image compression with recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §II.
  • [49] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell (2017) Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5306–5314. Cited by: §II, §II.
  • [50] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790–4798. Cited by: §II.
  • [51] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 3156–3164. Cited by: §II.
  • [52] H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C. J. Kuo (2016) MCL-JCV: a JND-based H.264/AVC video quality assessment dataset. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 1509–1513. Cited by: §IV-A.
  • [53] R. Wang and D. Tao (2016) Non-local auto-encoder with collaborative stabilization for image restoration. IEEE Transactions on Image Processing 25 (5), pp. 2117–2129. Cited by: §II.
  • [54] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra (2003) Overview of the H.264/AVC video coding standard. IEEE Transactions on circuits and systems for video technology 13 (7), pp. 560–576. Cited by: §I, §I.
  • [55] C. Wu, N. Singhal, and P. Krahenbuhl (2018) Video compression through image interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 416–431. Cited by: §I, §I, §I, §I, §II, §III-A, §III-B, §IV-A, §IV-B, footnote 2.
  • [56] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810. Cited by: §III-B.
  • [57] M. Xu, T. Li, Z. Wang, X. Deng, R. Yang, and Z. Guan (2018) Reducing complexity of HEVC: a deep learning approach. IEEE Transactions on Image Processing 27 (10), pp. 5044–5059. Cited by: §II.
  • [58] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision 127 (8), pp. 1106–1125. Cited by: §III-D.
  • [59] R. Yang, F. Mentzer, L. Van Gool, and R. Timofte (2020) Learning for video compression with hierarchical quality and recurrent enhancement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §I, §I, §I, §II, §III-A, §III-A, §III-B, §III-B, §III-D, §III-D, §IV-A, §IV-B, §IV-B, §IV-B, TABLE I, TABLE II, TABLE III.
  • [60] K. Zeng, J. Yu, R. Wang, C. Li, and D. Tao (2015) Coupled deep autoencoder for single image super-resolution. IEEE transactions on cybernetics 47 (1), pp. 27–37. Cited by: §II.