
Utilising Low Complexity CNNs to Lift Non-Local Redundancies in Video Coding

by   Jan P. Klopp, et al.

Digital media is ubiquitous and produced in ever-growing quantities. This necessitates a constant evolution of compression techniques, especially for video, in order to maintain efficient storage and transmission. In this work, we aim at exploiting non-local redundancies in video data that remain difficult to erase for conventional video codecs. We design convolutional neural networks with a particular emphasis on low memory and computational footprint. The parameters of those networks are trained on the fly, at encoding time, to predict the residual signal from the decoded video signal. After the training process has converged, the parameters are compressed and signalled as part of the code of the underlying video codec. The method can be applied to any existing video codec to increase coding gains while its low computational footprint allows for an application under resource-constrained conditions. Building on top of High Efficiency Video Coding, we achieve coding gains similar to those of pretrained denoising CNNs while only requiring about 1% of their computational complexity. Through extensive experiments, we provide insights into the effectiveness of our network design decisions. In addition, we demonstrate that our algorithm delivers stable performance under conditions met in practical video compression: our algorithm performs without significant performance loss on very long random access segments (up to 256 frames) and with moderate performance drops can even be applied to single frames in high resolution low delay settings.





1 Introduction

Video streams make up the largest part of worldwide Internet traffic and still experience strong growth due to an increase in on demand video services as well as high resolution and high-frame-rate content. At the same time, video decoding needs to operate under real time requirements on mobile devices under computationally constrained conditions, demanding algorithms that can be efficiently implemented in hardware.

Though image and video compression have been long standing problems, they have only recently attracted broad attention from the computer vision and machine learning communities Wu2018 ; Minnen2018 ; Liu2019 ; Mentzer2018 ; Theis2017 ; Toderici2016 ; Johnston2017 ; Rippel2017 ; Balle2017 ; Li2017 ; Baig2017 ; Toderici2015 ; Agustsson2017 ; Balle2018 ; Klopp , bringing significant progress to the field of image compression. The increasing availability of highly efficient neural network accelerators renders machine learned solutions a viable alternative due to their easy adaptability to different data distributions and their relative independence from special purpose hardware.

Most existing and widely applied video compression techniques rely on two distinct mechanisms to exploit redundancies: first, motion compensation is used for content that can be reached through spatio-temporal references and, second, (residual) image coding is used for content that is not yet available or cannot be predicted by the decoder. Traditional encoding techniques perform both in a block-wise scheme where each block carries information about its temporal or spatial reference to an adjacent, already decoded block as well as about the residual. More advanced coding techniques use variable block sizes and predict adjacent blocks; however, they still exploit only redundancies that are temporally and spatially local. The same holds for recent machine-learning based approaches, which have replaced block-wise processing by deep convolutional neural networks. The generated code, however, is still local, bound to a certain position in the image, making it difficult to exploit global statistics.

Our approach is to utilize the ability of convolutional neural networks to compactly represent complex mappings inferred from large amounts of data in order to exploit non-local redundancies. In our case, the neural network’s parameters are part of the code, hence the code is not tied to a certain part of the data. More concretely, we employ a CNN to predict the residual error of an existing encoder. The CNN thereby needs to be fit only to a particular segment of a video sequence instead of generalising to all possible video sequences, which significantly reduces its computational footprint. Where the encoder only exploits local spatial or temporal redundancies, the neural network is optimised over an entire group of pictures at a time, thereby lifting non-local redundancies.

In summary, our contributions are as follows:

  • We introduce a network structure that is lightweight enough to be trained on the fly and signalled to the decoder and can still achieve coding gains of up to 6.8% and 9.1% in random access and low delay mode, respectively, for luma. Chroma gains rise up to 15.2% and 19.5% respectively.

  • An algorithm for stably training and compressing the neural network on the fly without further hyper parameter search is presented.

  • We evaluate our approach and elements of our algorithm design on the HEVC test sets and compare to a pretrained CNN method.

The remainder of our paper is structured as follows. Section 2 introduces works related to our method. Section 3 motivates and describes our approach and we report experimental results in Section 4. Section 5 analyses and compares the complexity of our proposed algorithm. Finally, Section 6 concludes the paper and gives an outlook on future research in this direction.

2 Related Work

Our approach bears resemblance to conventional signal denoising filters. Such denoising filters can be found in recent video codecs, such as H.264 (AVC) Wiegand2003 or H.265 (HEVC) Sullivan2012a . In their simplest form, deblocking filters Kim1999 ; List2003 ; Norkin2012 ; Jo2016 are employed to remove artifacts at the block boundaries. More recently, sample adaptive offset filtering Fu2012 has been developed as part of HEVC, which targets not only block boundaries but all pixels within a block. Even more flexible is adaptive loop filtering (ALF) Tsai2013a , which exploits Wiener filter theory to derive an optimal linear denoising operator. This happens at encoding time with respect to an entire slice or a single block. The resulting filter parameters are then explicitly signalled to the decoder. In a more advanced version, Zhang2017 proposed a non-local ALF that represents the noise-free signal as a low rank approximation of patches of the decoded signal. While increasing coding gains, their method is more costly, especially for the decoder, as it relies on singular value decomposition. Krutz et al. took a different direction and derived optimal filtering for multiple frames under motion estimation errors. These approaches can be seen as simpler predecessors to the proposed algorithm. While they face fewer challenges from signalling overhead or computational complexity due to their simpler nature, this also limits their gains. Furthermore, they often model linear dependencies while a neural network extends to more complex non-linear functions, reaching improvements that linear filters cannot realize.

Recently, several methods based on convolutional neural networks have been proposed. Dong et al. Dong2015 introduced a CNN based method to suppress JPEG compression artefacts after decoding. Zhang2017b ; Zhang2017a ; Zhang2017c propose a CNN-based image prior for denoising, enabling "blind" denoising without assumptions over the noise distribution. Yan et al. Yan2017 introduced a frame interpolation neural network that interpolates motion estimates to sub-pixel accuracy, improving coding gain through better motion vectors. Several works Yang2018 ; Li2017a ; Yang2017 ; Cavigelli2017 ; Dai2017 employ CNNs to denoise HEVC compressed frames, where they distinguish between different slice types and quantisation levels. Zhang2018 explored the residual network architecture for this task, and Jia et al. Jia2019 showed that ensembles yield further improvements.

In a different direction, the machine learning community has recently taken on the problem of image compression. Early approaches Toderici2015 ; Toderici2016 ; Johnston2017 adopted a residual encoding approach with recurrent neural networks. Later approaches Rippel2017 ; Balle2017 took the variational autoencoder as a basis and augmented it with code length regularisation, thereby reaching shorter codes at lower complexity. Their algorithms were extended by context models Mentzer2018 ; Balle2018 ; Minnen2018 ; Klopp ; Liu2019 to generate hierarchical codes, vector quantisation Agustsson2017 , content weighting Li2017 to control which parts of the image receive more bits and inpainting Baig2017 which predicts adjacent patches in the image space.

Finally, based on the aforementioned results from CNN-based image coding, several works Wu2018 ; Lu2019 ; Rippel2018 have proposed to perform motion and residual coding with neural networks. While their approaches are promising, they do not achieve results comparable to HEVC and are computationally expensive. All CNN-based methods in the literature that we are aware of share this high complexity characteristic. In contrast, a key feature of our algorithm is the adoption of an advanced machine learning model at low complexity and this sets our approach apart from other machine learning based algorithms.

Figure 1: Encoding Process. The video signal is split into groups of pictures (GoP), the residuals of which are jointly predicted by a CNN that is trained on the fly (1). The CNN parameters are quantised (2) and the resulting CNN is tested (3) for coding gains on the GoP. If the test is positive, its parameters are compressed (4) before they are added to the bit stream of the underlying video codec. The dashed arrows/boxes indicate data transfer/operations that are only carried out in streaming scenarios where access to data signalled for previous frames is granted at the decoder. In such a streaming scenario, previously signalled parameters are first tested on the following GoP (5), before fine-tuning on that GoP (6) commences. Quantisation of those fine-tuned parameters (7) is followed by another test (8) to compare if higher gains can be achieved. If this is the case, the difference between new and old parameters is compressed (9) and added to the bit stream.

3 Exploiting Non-Local Redundancies

Conventional video compression exploits redundancies between temporally and spatially adjacent parts in video sequences. With higher resolutions and more details in video footage, exploiting non-local redundancies becomes harder: conventional codecs would need to either increase their block sizes to capture patterns in a single set of transform coefficients or search across a larger set of blocks. Larger blocks, however, have less homogeneous content, making it difficult to capture structure in the image with a few quantised transform coefficients, and a larger block search range in both the spatial and the temporal direction leads to cubic complexity growth. At lower bit rates, these insufficiencies increase the chance of artefacts being present in the image if bandwidth does not allow fine grained quantisation. Deblocking filters Jo2016 ; Kim1999 ; List2003 ; Norkin2012 , sample adaptive offset Fu2012 or adaptive loop filtering Zhang2017 ; Tsai2013a have been developed to counter those artefacts. These techniques, however, are applied locally, to a single block or a single slice, making it difficult to lift redundancies that are distributed across the temporal axis. Furthermore, only linear functions are used, limiting the expressiveness of the artefact suppression.

Our approach is to encode those parts of the residual that are redundant, i.e. details that are non-local. These details need not necessarily originate from the same texture or pattern but need to have a similarity that can be encoded into a neural network conditioned on the decoded signal. We train a CNN on the fly at encoding time to predict the residual signal from the decoded one. As will be shown, the major advantage is that one can achieve coding gains with fewer operations compared to pretrained deep learning based denoising or loop filtering approaches. To achieve this, the network needs to be small and converge fast enough to make online training computationally feasible. At the same time, the network’s parameters need to have a short code length so that their overhead on the existing bit stream does not cannibalise coding gains. In the following, we describe how to meet these challenges to yield an algorithm that can operate in an AutoML fashion to train the network. An overview of the algorithm is shown in Figure 1.

Our algorithm can work in random access as well as low delay settings. The major difference is that in the low delay setting, the CNN’s parameters can reference parameters previously signalled, thereby reducing their code length. Besides this, our algorithm proceeds in the same way for both streaming and non-streaming video data. Its input is the decoded data from an existing video codec as well as the residual to be predicted. After training, the parameters are quantised and the network is run with the quantised parameters to test for an improvement, i.e. a reduction of the residual error. If the test is successful, the parameters are compressed and signalled. In a streaming scenario, this test is repeated on the next group of pictures so that after fine-tuning on that group, a new set of parameters is only signalled if the PSNR improvement of the new parameters exceeds 110% of that of the previous parameters, thereby saving additional overhead. In the following, more details on the choice of network architecture, the parameter compression and the optimisation are given.
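The streaming decision described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `decide_signalling`, its return labels and the fallback of skipping the CNN when neither parameter set reduces the residual error are assumptions made for this example. Only the 110% threshold is taken from the text.

```python
def decide_signalling(gain_previous, gain_finetuned, threshold=1.10):
    """Pick the action for the current group of pictures in streaming mode.

    gain_previous  : PSNR improvement (dB) achieved on this group by the
                     previously signalled, quantised parameters (0.0 if none)
    gain_finetuned : PSNR improvement (dB) of the fine-tuned, quantised parameters
    """
    # Signal an update only if it beats the old parameters by the threshold.
    if gain_finetuned > 0 and gain_finetuned > threshold * gain_previous:
        return "signal_update"
    # Otherwise keep using the old parameters as long as they still help.
    if gain_previous > 0:
        return "reuse_previous"
    # Assumed fallback: neither parameter set reduces the residual error.
    return "skip_cnn"


print(decide_signalling(1.0, 1.05))
print(decide_signalling(1.0, 1.2))
```

With these hypothetical gains, the first call keeps the previous parameters because a 5% improvement does not justify new signalling overhead, while the second call signals an update.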

3.1 Network Architecture

Layer   Channels   Filters   Kernel          Complexity (MAC/Pixel)
                                             1/1    1/2    2/1    2/2
1       $hw$       12        $1 \times 1$    12     12     12     12
2       1          12        $3 \times 3$    108    54     54     27
3       12         12        $1 \times 1$    144    72     72     36
4       1          12        $3 \times 3$    108    54     54     27
5       12         $hw$      $1 \times 1$    12     12     12     12
Total                                        384    204    204    114

Table 1: Network Architecture and Complexity for the Y Channel. Details of the five layer and 12 filter network architecture used for most experiments. Complexities are given in multiply-accumulate (MAC) operations per pixel for different pixel packing configurations.

Layer   Channels   Filters   Kernel          Complexity (MAC/Pixel)
                                             1/1    1/2    2/1    2/2
1       $2hw$      12        $1 \times 1$    6      6      6      6
2       1          12        $3 \times 3$    27     13.5   13.5   6.75
3       12         12        $1 \times 1$    36     18     18     9
4       1          12        $3 \times 3$    27     13.5   13.5   6.75
5       12         $2hw$     $1 \times 1$    6      6      6      6
Total                                        102    57     57     34.5

Table 2: Network Architecture and Complexity for the concatenated UV channels. Details of the five layer and 12 filter network architecture used for most experiments. Complexities are given in multiply-accumulate (MAC) operations per pixel for different pixel packing configurations under the assumption that U and V channels are subsampled as in "YUV420".

The network architecture needs to be expressive enough to correct noise in the input video stream and at the same time lightweight enough to maintain a low computational footprint and a low signalling overhead when it is compressed and sent to the decoder. For this reason we chose an architecture inspired by MobileNets Howard2017 . MobileNets have been shown to work well in image recognition tasks and they combine the expressive power of deep neural networks with low computational and parameter size complexity. The basic idea of MobileNets is to factorise the convolutional layers, which are determined by their filters of dimensions $F \times C \times K_h \times K_w$ with $F$ feature maps, $C$ input channels and kernel height and width given by $K_h$ and $K_w$, respectively. Two separate convolutional layers are used to represent the same function, one operating only in the spatial domain with an independent filter for each input channel ($C \times 1 \times K_h \times K_w$), and the other only connecting different channels with a $1 \times 1$ kernel ($F \times C \times 1 \times 1$).
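To make the saving of this factorisation concrete, the following sketch counts multiply-accumulate operations per output position for a full convolution and for its depthwise plus pointwise replacement. The function names are illustrative, and the 3x3 kernel with 12 channels and 12 filters is an assumption chosen to resemble the network described in this section.

```python
def standard_conv_macs(filters, channels, kernel_h, kernel_w):
    """MACs per output position of a full convolution."""
    return filters * channels * kernel_h * kernel_w


def factorised_conv_macs(filters, channels, kernel_h, kernel_w):
    """MACs per output position of a depthwise layer followed by a 1x1 layer."""
    depthwise = channels * kernel_h * kernel_w  # one spatial filter per input channel
    pointwise = filters * channels              # 1x1 kernel mixing the channels
    return depthwise + pointwise


full = standard_conv_macs(12, 12, 3, 3)
split = factorised_conv_macs(12, 12, 3, 3)
print(full, split, round(split / full, 3))  # prints: 1296 252 0.194
```

Under these assumptions the factorised pair needs roughly a fifth of the operations of the full convolution, and the same ratio holds for the number of weights that would have to be signalled.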

Figure 2: Pixel packing. Every square represents a single pixel. Patches of $h \times w$ pixels are reorganised into vectors, which get treated like different channels by the CNN. $h$ and $w$ denote height and width of a patch: $h = w = 1$ equals no pixel packing (a), $h = 1, w = 2$ is shown in (b), $h = 2, w = 1$ in (c) and $h = w = 2$ in (d).

However, even with these techniques, the network’s complexity may still be too high for high resolution content. For further reduction, we take inspiration from video codec design. Newer codecs like H.265 or H.264 profit from larger coding unit sizes as shown by Ohm et al. in Ohm2012 , in particular for higher resolutions. We exploit the fact that higher resolution videos have more homogeneous areas to reduce the complexity of our approach even further. We use pixel packing (Figure 2) where a patch of size $h \times w$ of the input image is rearranged to a vector with $hw$ elements. This way, several pixels are processed within the same convolution, hence reducing the pixel-wise complexity of the inner layers by a factor $hw$. At the same time, the receptive field is enlarged without additional layers or layers with spatially larger filters. The network predicts a vector of $hw$ elements that are rearranged to form the residual prediction in the same shape as the input.
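The rearrangement behind pixel packing can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names `pack` and `unpack` and the row-major ordering of the pixels inside each vector are assumptions, since the text does not specify the ordering.

```python
def pack(image, h, w):
    """Rearrange each h x w patch of a 2-D image into a vector of h*w channels."""
    rows, cols = len(image), len(image[0])
    assert rows % h == 0 and cols % w == 0, "image must be divisible into patches"
    return [[[image[r + dr][c + dc] for dr in range(h) for dc in range(w)]
             for c in range(0, cols, w)]
            for r in range(0, rows, h)]


def unpack(packed, h, w):
    """Inverse of pack: scatter every channel vector back into its patch."""
    rows, cols = len(packed) * h, len(packed[0]) * w
    image = [[None] * cols for _ in range(rows)]
    for pr, packed_row in enumerate(packed):
        for pc, vector in enumerate(packed_row):
            for i, value in enumerate(vector):
                image[pr * h + i // w][pc * w + i % w] = value
    return image


image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
packed = pack(image, 2, 2)
print(packed)  # prints: [[[0, 1, 4, 5], [2, 3, 6, 7]]]
```

A 2 x 4 image packed with 2 x 2 patches becomes a 1 x 2 grid of four-channel vectors, so each convolution position now covers four pixels. Applying `unpack` to the network output restores the original layout.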

Our approach relies on optimising a non-convex function at encoding time using stochastic gradient descent. Unlike for convex optimisation, convergence guarantees for this non-convex problem are harder to obtain, if they can be obtained at all. Batch Normalization (BN) Ioffe2015 has been shown to greatly improve convergence behaviour of deep neural networks. We apply a Batch Normalization layer before each convolutional layer, except the first. Lastly, we remove the bias from the last convolutional layer as the overall residual we are predicting is bias-free. A single batch, however, may have a bias, yet this is what should be predicted from the input data instead of falsely adapting a fixed bias added to the prediction of the network.

With the network architecture considerations presented above, we can use three hyper parameters to adjust the network complexity: the number of layers, the number of channels and pixel packing. We found that a simple network of five layers and 12 channels works well while still guaranteeing a low computational footprint. In addition, such a small network allows for efficient hardware implementation where all layers are processed jointly without intermediate DRAM memory access as shown in Chih2018 , making hardware realisations simpler. Table 1 lists each layer along with the pixel-wise complexity with and without different pixel packing choices. The chosen network requires from 114 to 384 operations per pixel for the Y channel. The U and V channels are processed jointly by a single pass through one network. The two chroma channels are concatenated, increasing the number of input channels/output filters of the network to $2hw$. However, as the U and V channels in the popular "YUV420" format have only a quarter of the original pixels, the operations per pixel are lower as shown in Table 2. Note that because luma and chroma are processed by different neural networks, their respective pixel packing configurations may differ.
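The per-pixel complexities in Tables 1 and 2 can be reproduced with a short calculation. The sketch below assumes a layer layout inferred from the tables (1x1, depthwise 3x3, 1x1, depthwise 3x3, 1x1); the function name and its parameters are illustrative and not part of the original method description.

```python
from fractions import Fraction


def macs_per_pixel(h, w, planes=1, filters=12, kernel=3, subsampling=1):
    """MACs per full-resolution pixel for each of the five layers.

    h, w        : pixel packing patch height and width
    planes      : image planes fed to the network (1 for Y, 2 for U and V)
    subsampling : full-resolution pixels per sample of a plane (4 for YUV420 chroma)
    """
    packed = planes * h * w  # channels after pixel packing
    per_position = [
        packed * filters,           # layer 1: 1x1, packed channels -> filters
        filters * kernel * kernel,  # layer 2: depthwise spatial filters (assumed 3x3)
        filters * filters,          # layer 3: 1x1 channel mixing
        filters * kernel * kernel,  # layer 4: depthwise spatial filters (assumed 3x3)
        filters * packed,           # layer 5: 1x1, filters -> packed channels
    ]
    pixels_per_position = h * w * subsampling
    return [Fraction(m, pixels_per_position) for m in per_position]


for h, w in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    luma = sum(macs_per_pixel(h, w))
    chroma = sum(macs_per_pixel(h, w, planes=2, subsampling=4))
    print(f"{h}/{w}: Y {float(luma)}  UV {float(chroma)}")
```

Under these assumptions the totals agree with the tables (384, 204, 204 and 114 for Y; 102, 57, 57 and 34.5 for UV). Calling the function with `filters=6` gives 156 and 42 for the unpacked case, the complexities listed for the six-filter network in Table 5.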

In total, this network design has a computational complexity low enough to allow real time dedicated hardware implementations for video compression at the decoder side even in mobile scenarios.

3.2 Parameter Representation and Compression

Each layer (with the exception of the first) consists of a Batch Normalization (BN), a Convolution and a ReLU nonlinearity. At encoding time, those are separated. After the optimisation routine has finished, the convolution and its batch normalisation layer can be merged into a single affine operation. With $x_c$ the input of channel $c$ and $b$ the bias, the output $y$ at any position $p$ is given by

$$ y(p) = b + \sum_{c} \sum_{k} \hat{w}_{c,k} \, \bigl( x_c(p + k) - \mu_c \bigr) $$

where $\mu_c$ and $\sigma_c$ are the BN parameters for channel $c$ and $\hat{w}_{c,k} = w_{c,k} / \sigma_c$ is the weight for channel $c$ at kernel position $k$ that has been scaled by $1 / \sigma_c$. As long as the size of the network parameters is small relative to the code length of the group of frames they accompany, there is no need for compression. However, in low latency live streaming scenarios, the neural network parameters may be updated and signalled every few frames, so that an efficient representation is necessary. This can be achieved by quantisation. While some approaches Courbariaux2016 ; Rastegari2016 quantise neural networks at training time to reduce the impact of quantisation, we did not find this beneficial in our experiments and, as it adds overhead, we quantise only once: after optimisation, before testing and signalling. Weights and bias are each quantised with their own number of bits, which determines their respective quantisation range. The weights are quantised by normalising them to the channel-wise quantisation range, where the scaling factor is signalled separately. Quantisation of the bias happens over all filters, with a scaling factor that is again signalled separately.
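As an illustration of channel-wise quantisation with a separately signalled scaling factor, consider the sketch below. The symmetric integer range, the choice of the largest magnitude as the normaliser and the names `quantise_channel` and `dequantise` are assumptions made for this example; the exact ranges used by the method are not recoverable from the text.

```python
def quantise_channel(weights, bits):
    """Uniformly quantise one channel's weights to signed integer levels.

    Returns the integer levels and the floating point scaling factor that
    would be signalled alongside them.
    """
    q_max = 2 ** (bits - 1) - 1            # assumed symmetric range [-q_max, q_max]
    peak = max(abs(w) for w in weights)
    scale = peak / q_max if peak > 0 else 1.0
    levels = [round(w / scale) for w in weights]
    return levels, scale


def dequantise(levels, scale):
    """Decoder side: recover approximate weights from levels and scale."""
    return [q * scale for q in levels]


weights = [0.5, -0.2, 0.1, 0.03]
levels, scale = quantise_channel(weights, bits=4)
# Doubling every weight (e.g. after a change of the BN scaling) only changes the scale.
rescaled_levels, rescaled_scale = quantise_channel([2 * w for w in weights], bits=4)
print(levels, rescaled_levels)
```

In this sketch a uniform rescaling of a channel leaves the integer levels untouched and is absorbed entirely by the scaling factor, which is the property the difference coding in a streaming scenario can benefit from.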

In a streaming scenario, the decoder can still access previously signalled information, i.e. from the last frames. Hence, we transfer only the change of the parameters to the decoder, thereby reducing the code length required to signal the neural network parameters. The idea is to take the difference between quantised values of two time steps and use an arithmetic coder to compress the difference signal. For the bias, this works well because it changes slowly. For the weights, most change originates from a change in the batch normalisation parameter that scales them. As these changes are captured by the scaling factor, the quantised weights are unaffected by this change, which significantly lowers the coding rate. The differences of the significand and the exponent of the weight and bias scaling factors are encoded separately using the same method.
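The benefit of coding differences instead of absolute values can be illustrated with a small sketch. The empirical entropy is used here as a rough stand-in for the output length of an arithmetic coder, which is an assumption of this example, as are the function names and the hypothetical quantised values.

```python
import math
from collections import Counter


def parameter_delta(previous_levels, current_levels):
    """Difference signal between the quantised parameters of two time steps."""
    return [c - p for p, c in zip(previous_levels, current_levels)]


def apply_delta(previous_levels, delta):
    """Decoder side: rebuild the current levels from the previous ones."""
    return [p + d for p, d in zip(previous_levels, delta)]


def ideal_code_length_bits(symbols):
    """Empirical entropy times length: a proxy for an ideal entropy coder."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum(-n * math.log2(n / total) for n in counts.values())


previous = [7, -3, 1, 0, 4, -6, 2, 5]   # hypothetical quantised weights, step t-1
current = [7, -3, 2, 0, 4, -6, 2, 5]    # hypothetical quantised weights, step t
delta = parameter_delta(previous, current)

print(delta)
print(ideal_code_length_bits(current), ideal_code_length_bits(delta))
```

Because almost all differences are zero, the difference signal has a far more peaked distribution than the levels themselves, so its estimated code length is only a fraction of that of the absolute values.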

The largest share in code length originates from the weights. To reduce the code length of the neural network parameters in low bitrate settings, the parameters for a group of pictures should not differ too much from those of the previous group. This way, an arithmetic coder would find a very simple distribution to encode. We add a regularisation term to the loss function in order to control the differences between the normalised weights, identified by layer, channel and kernel position, at the current time step and those at the previous one. The regularisation is computed for each layer, and the entire regularisation combines these per-layer terms, each normalised by the number of channels and kernel elements in the respective layer. In practice, we give this term a weighting of 0.1 and add it to the reconstruction loss. Note that this is not applied in random access mode, as accessing previous weights is not possible at decoding time. Instead, weights are subjected to norm regularisation at optimisation time.
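A minimal sketch of such a difference regulariser is given below. The use of absolute differences and the summation over layers are assumptions of this example, because the exact norm is not recoverable from the text. Only the per-layer normalisation by the number of channels and kernel elements and the weighting of 0.1 follow the description above. All names are illustrative.

```python
def layer_term(current, previous):
    """Mean absolute change of one layer's normalised weights.

    Both arguments are lists over channels of lists over kernel positions,
    so the mean normalises by channels times kernel elements.
    """
    diffs = [abs(c - p)
             for cur_channel, prev_channel in zip(current, previous)
             for c, p in zip(cur_channel, prev_channel)]
    return sum(diffs) / len(diffs)


def difference_regulariser(current_layers, previous_layers):
    """Assumed combination: sum of the normalised per-layer terms."""
    return sum(layer_term(c, p) for c, p in zip(current_layers, previous_layers))


def total_loss(reconstruction_loss, current_layers, previous_layers, weighting=0.1):
    """Reconstruction loss plus the weighted regulariser."""
    return reconstruction_loss + weighting * difference_regulariser(
        current_layers, previous_layers)


previous_layers = [[[1.0, 2.0], [3.0, 4.0]]]   # one layer, two channels, two taps
current_layers = [[[1.0, 2.5], [3.0, 3.5]]]
print(difference_regulariser(current_layers, previous_layers))  # prints: 0.25
```

Weights that stay identical between two groups of pictures contribute nothing to this term, so the optimiser is only penalised for changes that would later have to be signalled.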

3.3 Optimisation

At encoding time, the neural network learns to minimise the squared residual error for a particular group of pictures. This process should converge quickly, especially in online low delay settings, and not demand extensive hyper parameter tuning. We implement several techniques to accomplish this goal.

To counter instabilities during training, we normalise the loss by the average error of the group of pictures, that is, by the average norm of the residual over all pixels. This also normalises the gradient given by the loss. This way, the optimisation process will not become unstable or require different learning rates for sequences with a different MSE. An additional benefit is that the weightings of the weight regularisation and the code length regularisation (see Section 3.2) do not need to be adjusted for different magnitudes of the reconstruction loss.
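The effect of this normalisation can be checked with a small sketch. It assumes that the normaliser is the mean squared residual of the group of pictures, computed once before training; the function names and the sample values are hypothetical.

```python
import math


def mean_square(values):
    """Average squared value, used both for the error and for the normaliser."""
    return sum(v * v for v in values) / len(values)


def normalised_loss(prediction, residual, gop_mean_square):
    """Squared prediction error divided by the average residual energy."""
    errors = [p - r for p, r in zip(prediction, residual)]
    return mean_square(errors) / gop_mean_square


residual = [1.0, -2.0, 3.0, 0.5]      # hypothetical residual samples
prediction = [0.5, -1.0, 2.0, 0.0]    # hypothetical network output
loss = normalised_loss(prediction, residual, mean_square(residual))

# The same content with ten times larger amplitudes gives the same loss.
loud_residual = [10 * r for r in residual]
loud_prediction = [10 * p for p in prediction]
loud_loss = normalised_loss(loud_prediction, loud_residual, mean_square(loud_residual))
print(math.isclose(loss, loud_loss))  # prints: True
```

Predicting nothing yields a loss of exactly 1 in this formulation, so any value below 1 indicates a reduction of the residual regardless of the sequence's MSE. This is why a single learning rate and a single regularisation weighting can serve all sequences.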

To aid convergence, we apply Batch Normalisation (BN), which works by eliminating the mean of each channel and normalising its variance to 1 during training time while accumulating a "global" mean and variance to be used during inference. The accumulation process is usually realised by giving a momentum $\alpha$ to the accumulated estimate and changing it by the current estimate weighted by $1 - \alpha$. For the estimated dataset mean $\hat{\mu}$ and the mean of the current batch, $\mu_B$:

$$ \hat{\mu} \leftarrow \alpha \hat{\mu} + (1 - \alpha) \mu_B $$

The momentum is often set to values close to 1, which is fine for scenarios where many training iterations are performed. In our case, we choose a lower value so that adaptation can proceed much quicker. Adding to this that we prefer a small batch size for performance reasons, Batch Normalisation can become unstable because its normalisation alters the activation that is used, together with the backpropagated gradient, to compute the update of the convolution kernel weights. If, for example, the mean of one channel for a particular batch happens to be far from the dataset’s mean, this may lead to a weight update that severely deteriorates overall performance. To counter this, we tie the estimated mean value during training to the accumulated mean value. One could derive a different BN equation where the global mean estimate forms a prior for the batch-wise mean estimation. However, this would render the standard BN implementation that hardware vendors provide in their libraries useless and lead to a slower custom implementation. Instead, at training time we simply pad the input of the network with two more lines of zeros on two sides (e.g. right and bottom) that are cropped after exiting the network. Thereby a part of the activations throughout the network are constants (there is a negligible diffusion from the spatial kernels). Those constants vary little as long as the bias values are small, which is the case for our network. The constant values will cause the mean estimates to be pulled towards them, acting like a regularisation. Note that for the optimisation problem itself, there is no difference as the loss is computed over the cropped image.
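The influence of the momentum on how quickly the accumulated mean adapts can be sketched as follows. The two momentum values are hypothetical and chosen only for illustration; they are not the values used in the experiments.

```python
def update_running_mean(running_mean, batch_mean, momentum):
    """One accumulation step: momentum weights the old estimate."""
    return momentum * running_mean + (1.0 - momentum) * batch_mean


def track(batch_means, momentum, start=0.0):
    """Run the accumulation over a sequence of batch means."""
    estimate = start
    for batch_mean in batch_means:
        estimate = update_running_mean(estimate, batch_mean, momentum)
    return estimate


# Ten batches whose mean is 1.0, starting from an estimate of 0.0.
slow = track([1.0] * 10, momentum=0.99)
fast = track([1.0] * 10, momentum=0.5)
print(slow, fast)
```

After ten iterations the high-momentum estimate has covered less than a tenth of the distance to the true mean, while the low-momentum estimate has almost converged. Note that `torch.nn.BatchNorm2d` uses the opposite convention: its `momentum` argument weights the new batch statistic, so it corresponds to one minus the momentum used here.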

4 Experiments

          Y Channel                     U Channel                       V Channel
          1/1    1/2    2/1    2/2     1/1     1/2     2/1     2/2     1/1     1/2     2/1     2/2
HEVC A   -5.6%  -5.7%  -5.9%  -6.8%   -13.0%  -10.8%  -10.4%  -9.0%   -14.4%  -12.8%  -12.8%  -11.9%
HEVC B   -5.4%  -5.9%  -4.5%  -4.4%   -9.8%   -8.2%   -7.4%   -6.9%   -10.8%  -8.8%   -8.3%   -7.8%
HEVC C   -3.7%  -2.8%  -2.6%  -1.6%   -9.4%   -6.5%   -6.1%   -4.8%   -15.3%  -12.0%  -11.4%  -9.0%
HEVC D   -3.4%  -2.2%  -2.2%  -0.8%   -8.7%   -6.1%   -5.8%   -3.8%   -14.5%  -10.5%  -10.2%  -7.3%
HEVC E   -2.8%  -1.6%  -2.2%  0.2%    -5.9%   -3.2%   -2.9%   -2.5%   -10.5%  -7.1%   -7.0%   -7.1%

Table 3: Average BD Rate savings in Random Access mode for different pixel packings.

4.1 Setup

We implement our approach in PyTorch 1.0 Paszke2017 and run our experiments on an Nvidia 1080 GPU. CuDNN’s benchmark mode is disabled and its deterministic mode enabled. As outlined in the previous section, our method needs to automatically apply to any sequence in any dataset and therefore we do not apply any sequence-wise or dataset-wise tuning. Although such hyper parameter optimisation is an active topic in research NIPS2012_4522 , it is to date only feasible in large scale operations.

All our experiments use Adam Kingma2015 as optimiser and a learning rate of 0.02. During training we randomly sample non-overlapping patches of a fixed size and form batches of 64 patches. The weights are left to the standard PyTorch initialisation procedure, the bias is explicitly set to 0. These hyperparameters are the same for each sequence in each test set.

Our experiments are based on the HEVC Test Model HM-16.17. We evaluate our approach on the HEVC test sequences A to E in random access and low delay P and B modes. Our experimental results are reported separately for each of the two settings. We use the BD Rate Bjontegaard2001 savings to HM-16.17 to measure performance. For each channel, the rate savings are computed based on the channel’s PSNR value and the total rate, i.e. the rate after the CNN filter has been applied to all channels. This is in accordance with video coding standards.

4.2 Random Access

In the random access scenario, the network parameters for a particular random access segment (RAS) are independent of those belonging to previous or later segments. In practice, RAS are often encoded in parallel as this provides a linear speedup. To be compatible with this approach, we learn network parameters for each RAS from scratch.

We present several series of experiments to analyse different aspects of the proposed algorithm. At first, we analyse the influence of pixel packing on the optimisation performance. As described in Section 3, pixel packing increases the receptive field size of the model, but requires joint prediction of several pixels at the same time. Table 3 shows BD Rate savings for different pixel packings for each HEVC test set and channel. For the Y channel, high resolution tends to benefit from a larger receptive field even if the complexity per pixel is reduced: for HEVC A, a 2/2 packing yields significantly higher savings than all other variants, while HEVC B peaks for 1/2, where two horizontally adjacent pixels are packed. Performance on smaller resolutions like HEVC C and D, on the other hand, almost halves when pixel packing is applied. The reason for this may lie in the higher information density of a low resolution sequence, which does not benefit from a larger receptive field. For the chroma channels, the picture is clearer: all test sets peak when no pixel packing is applied. However, smaller resolutions suffer higher losses when packing is applied. Overall, this shows that pixel packing helps where it is needed most: in high resolutions where the significantly lower number of operations per pixel contributes most to reducing the overall cost for implementing this method.

Experiments conforming with the HEVC Test Model are constrained to use only up to 32 frames in random access mode to guarantee comparability to other methods. In practice, however, encoders utilise much longer intra frame periods, for example in online video streams where it’s unlikely that the user will perform a lot of fine grained seek operations throughout the sequence. Applying the CNN to a larger set of frames at once has two advantages:

  • The same amount of data is signalled for more frames, leading to less signalling overhead and thereby potentially to higher coding gains, especially for small resolutions.

  • The same amount of computation is used. We observed that the convergence of the CNN’s online training process is hardly influenced by the number of frames taken into the training data set. Hence, we leave the number of patches per batch, the size of each patch and the number of training iterations constant.

          Y Channel                     U Channel                       V Channel
#Frames   32     64     128    256     32      64      128     256     32      64      128     256
HEVC A   -5.6%  -6.0%  -5.6%  -6.0%   -13.0%  -12.4%  -13.2%  -10.8%  -14.4%  -13.8%  -15.2%  -12.4%
HEVC B   -5.4%  -5.7%  -5.3%  -5.1%   -9.8%   -9.4%   -10.0%  -9.3%   -10.8%  -10.6%  -11.2%  -10.2%
HEVC C   -3.7%  -4.1%  -4.2%  -4.1%   -9.4%   -9.4%   -9.4%   -9.1%   -15.3%  -15.3%  -14.5%  -14.1%
HEVC D   -3.4%  -4.8%  -5.4%  -5.5%   -8.7%   -8.8%   -9.7%   -9.4%   -14.5%  -15.0%  -15.0%  -14.9%
HEVC E   -2.8%  -3.9%  -4.2%  -4.6%   -5.9%   -6.4%   -7.4%   -7.9%   -10.5%  -11.0%  -11.5%  -11.9%

Table 4: Average BD Rate savings in Random Access mode for different RAS lengths.
Y Channel U Channel V Channel
2/2 1/1 2/2 1/1 2/2 1/1
#Filters 12 6 12 6 12 6
Complexity (MAC/Pixel) 114 156 34.5 42 34.5 42
HEVC A -6.8% -4.7% -9.0% -9.3% -11.9% -11.0%
HEVC B -4.4% -4.3% -6.9% -6.4% -7.8% -7.1%
HEVC C -1.6% -2.9% -4.8% -5.7% -9.0% -11.1%
HEVC D -0.8% -3.4% -3.8% -5.9% -7.3% -11.2%
HEVC E 0.2% -2.0% -2.5% -4.0% -7.1% -7.9%
Table 5: Average BD Rate savings in Random Access mode for different complexities.

Table 4 shows experimental results for RAS lengths of 32 (same as Table 3), 64, 128 and 256 frames for all channels and datasets. Note that the pixel packing is fixed for these experiments; however, the results should be equally applicable to other packing configurations. For the Y channel, we observe that despite an 8-fold increase in the number of pixels (i.e. from 32 to 256 frames), the performance drop is acceptable for HEVC B, while average BD rate savings increase for all other test sets. The increase is most dramatic for HEVC D and E. HEVC D is small in size (416x240), while HEVC E features typical streaming content, similar to video conferencing. Hence, both test sets have low bit rates compared to the other sets. Because signalling the CNN’s parameters adds an almost constant overhead to the bit stream, its negative effect on rate savings is most pronounced for low bit rate sequences. Signalling CNN parameters for more frames can then mitigate these effects if the PSNR improvement is preserved, as is the case for all test sets. For the chroma channels, Table 4 shows a similar pattern, although most test sets peak at 128 frames. Overall, this demonstrates that in practical applications, where larger I frame periods are typically chosen, our approach can be used even in low bit rate conditions to achieve gains similar to those for high bit rate sequences.
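The effect of a constant overhead on the BD rate can be reproduced with a toy calculation. The sketch below follows the usual Bjøntegaard procedure [5] (a cubic fit of log bit rate over PSNR, averaged over the common PSNR range) for exactly four rate-distortion points per curve, so that the fit is an exact interpolation. The rate-distortion points, the 10% gain and the overhead value are hypothetical, and modelling the CNN as a pure rate reduction at equal PSNR is a simplifying assumption.

```python
import math


def _cubic_through(xs, ys):
    """Return the Lagrange interpolant through the given points (cubic for four)."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly


def _integrate_cubic(poly, lo, hi):
    """Simpson's rule, which is exact for polynomials up to degree three."""
    return (hi - lo) / 6.0 * (poly(lo) + 4.0 * poly((lo + hi) / 2.0) + poly(hi))


def bd_rate(anchor, test):
    """Average bit rate difference in percent; negative values are savings.

    anchor, test: four (bitrate, psnr) pairs each.
    """
    curves = []
    for points in (anchor, test):
        psnr = [p for _, p in points]
        log_rate = [math.log10(r) for r, _ in points]
        curves.append((psnr, _cubic_through(psnr, log_rate)))
    lo = max(min(psnr) for psnr, _ in curves)
    hi = min(max(psnr) for psnr, _ in curves)
    avg_anchor = _integrate_cubic(curves[0][1], lo, hi) / (hi - lo)
    avg_test = _integrate_cubic(curves[1][1], lo, hi) / (hi - lo)
    return (10.0 ** (avg_test - avg_anchor) - 1.0) * 100.0


GAIN = 0.90      # hypothetical: CNN saves 10% of the rate before overhead
OVERHEAD = 20.0  # hypothetical constant cost of the parameters, in kbit/s


def with_cnn(curve):
    """Toy model: same PSNR, reduced rate plus constant signalling overhead."""
    return [(GAIN * rate + OVERHEAD, psnr) for rate, psnr in curve]


psnrs = (32.0, 35.0, 38.0, 41.0)
high = list(zip((2000.0, 4000.0, 8000.0, 16000.0), psnrs))  # high bit rate sequence
low = list(zip((100.0, 200.0, 400.0, 800.0), psnrs))        # low bit rate sequence
bd_high = bd_rate(high, with_cnn(high))
bd_low = bd_rate(low, with_cnn(low))
print(f"high bit rate: {bd_high:.1f}%, low bit rate: {bd_low:.1f}%")
```

With these numbers the high bit rate sequence keeps almost the full 10% saving, whereas the same overhead consumes most of the saving for the low bit rate sequence. Reducing `OVERHEAD`, which is what a longer segment does to the per-frame share, brings the two results closer together.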

In the preceding section, we introduced pixel packing as a method to reduce the per pixel complexity of the neural network while maintaining its architecture. Table 5 compares BD rate savings for the 2/2 pixel packed case with a downsized network architecture using only 6 filters in each layer. For higher resolutions (HEVC A & B), pixel packing performs better or at least equally well, despite having a lower complexity. At low resolutions, fewer filters per layer are the better option. This once more underlines that pixel packing’s impact is largest where its complexity reduction is needed most.
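The complexity reduction from packing can be checked against the reported luma figures with a simple cost model. The sketch assumes that the cost per network position is affine in the number of packed pixels (a fixed part for the inner layers plus a part proportional to the packed input and output pixels); this model and all names below are assumptions, not a description of the actual layers. It is fitted to the 12-filter luma values of 384 and 204 from Table 12, which Section 4.4 associates with no packing and 1/2 packing, and then compared with the 114 MAC/pixel listed for 2/2 packing in Table 5.

```python
def macs_per_pixel(packed_pixels, fixed_macs, macs_per_packed_pixel):
    """MACs per original pixel if one network position covers `packed_pixels` pixels."""
    per_position = fixed_macs + macs_per_packed_pixel * packed_pixels
    return per_position / packed_pixels


# Reported luma complexities for 12 filters, converted to MACs per position.
cost_1x1 = 384 * 1  # no packing: one pixel per position
cost_1x2 = 204 * 2  # 1/2 packing: two pixels per position

# Two points determine the affine model (assumption: cost = fixed + slope * pixels).
slope = (cost_1x2 - cost_1x1) / (2 - 1)
fixed = cost_1x1 - slope * 1

prediction_2x2 = macs_per_pixel(4, fixed, slope)
print(f"fixed={fixed}, slope={slope}, predicted 2/2 complexity={prediction_2x2} MAC/pixel")
```

The prediction for 2/2 packing matches the reported 114 MAC/pixel, so under this assumed model most of the cost sits in a part that does not grow with the number of packed pixels, and packing divides that part by the packing factor.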

Y Channel U Channel V Channel
Pixel packing 1/1 1/2 2/1 2/2 1/1 1/2 2/1 2/2 1/1 1/2 2/1 2/2
HEVC A -5.2% -5.0% -5.1% -5.9% -15.5% -13.3% -12.7% -11.8% -18.4% -16.0% -15.5% -14.7%
HEVC B -4.6% -4.9% -3.5% -3.4% -12.5% -9.9% -9.3% -8.4% -16.8% -13.1% -13.7% -12.0%
HEVC C -3.3% -1.9% -1.7% 0.3% -12.1% -8.6% -8.5% -5.7% -18.0% -14.8% -14.5% -10.8%
HEVC D 2.8% 5.7% 5.7% 9.9% -7.8% -2.5% -2.3% 2.8% -16.7% -9.4% -10.0% -3.2%
HEVC E -3.4% -1.3% -1.3% 3.1% -10.2% -7.1% -5.7% -2.0% -17.0% -15.3% -13.6% -10.7%

Table 6: Average BD Rate savings in Low Delay B mode for different pixel packings at a GoP size of five frames.

Y Channel U Channel V Channel
Pixel packing 1/1 1/2 2/1 2/2 1/1 1/2 2/1 2/2 1/1 1/2 2/1 2/2
HEVC A -9.0% -8.7% -8.8% -9.1% -15.7% -13.9% -13.5% -11.6% -19.5% -17.0% -17.3% -15.2%
HEVC B -6.3% -6.8% -5.2% -5.3% -13.9% -11.4% -10.8% -9.4% -17.8% -15.1% -15.0% -12.7%
HEVC C -3.2% -1.8% -1.7% 0.4% -12.4% -8.4% -8.6% -5.8% -18.8% -15.4% -14.9% -11.6%
HEVC D 3.2% 5.8% 6.0% 10.3% -7.9% -2.6% -3.2% 2.1% -15.5% -8.7% -11.1% -3.5%
HEVC E -5.3% -3.0% -3.6% 1.8% -12.6% -9.0% -7.5% -3.9% -19.7% -15.9% -15.7% -11.3%

Table 7: Average BD Rate savings in Low Delay P mode for different pixel packings at a GoP size of five frames.

Y Channel U Channel V Channel
#Frames 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1
HEVC A -5.2% -5.0% -4.8% -4.6% -4.5% -15.5% -15.2% -14.3% -14.2% -13.7% -18.4% -18.0% -16.9% -17.1% -16.8%
HEVC B -4.6% -4.3% -4.2% -4.0% -3.5% -12.5% -12.0% -11.7% -11.0% -10.4% -16.8% -16.2% -15.8% -15.1% -14.5%
HEVC C -3.3% -2.7% -2.3% -1.4% 1.6% -12.1% -11.6% -11.0% -10.1% -7.7% -18.0% -18.0% -17.6% -16.7% -14.2%
HEVC D 2.8% 4.7% 6.9% 12.3% 28.3% -7.8% -5.4% -3.2% 0.9% 14.5% -16.7% -13.4% -11.6% -8.9% 3.6%
HEVC E -3.4% -2.7% -2.0% 0.2% 6.2% -10.2% -9.2% -9.3% -6.5% -0.6% -17.0% -17.5% -16.4% -14.7% -8.4%

Table 8: Average BD Rate savings in Low Delay B mode for different GoP lengths.

Y Channel U Channel V Channel
#Frames 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1
HEVC A -9.0% -8.6% -8.5% -8.4% -8.5% -15.7% -15.0% -13.6% -14.3% -14.0% -19.5% -18.4% -18.0% -18.4% -17.7%
HEVC B -6.3% -6.2% -6.1% -5.9% -5.3% -13.9% -13.5% -13.2% -12.6% -11.9% -17.8% -17.4% -17.2% -16.5% -15.8%
HEVC C -3.2% -2.8% -2.3% -1.1% 2.2% -12.4% -11.4% -11.4% -9.7% -7.4% -18.8% -19.3% -18.0% -17.1% -14.3%
HEVC D 3.2% 4.9% 7.7% 13.7% 28.9% -7.9% -6.5% -3.6% 0.7% 13.9% -15.5% -15.1% -13.2% -9.3% 2.4%
HEVC E -5.3% -4.7% -3.8% -2.0% 4.2% -12.6% -12.0% -11.2% -8.4% -2.3% -19.7% -19.2% -18.9% -16.3% -9.5%

Table 9: Average BD Rate savings in Low Delay P mode for different GoP lengths.
Y Channel U Channel V Channel
Pixel packing 2/2 1/1 2/2 1/1 2/2 1/1
#Filters 12 6 12 6 12 6
Complexity (MAC/Pixel) 114 156 34.5 42 34.5 42
HEVC A -4.6% -3.9% -14.2% -11.0% -17.1% -13.1%
HEVC B -4.0% -3.1% -11.0% -7.8% -15.1% -11.2%
HEVC C -1.4% -2.4% -10.1% -7.5% -16.7% -13.2%
HEVC D 12.3% 2.4% 0.9% -2.9% -8.9% -9.9%
HEVC E 0.2% -1.5% -6.5% -5.0% -14.7% -13.0%
Table 10: Average BD Rate savings in Low Delay B mode for different complexities at a GoP size of two frames.
Y Channel U Channel V Channel
Pixel packing 2/2 1/1 2/2 1/1 2/2 1/1
#Filters 12 6 12 6 12 6
Complexity (MAC/Pixel) 114 156 34.5 42 34.5 42
HEVC A -8.4% -7.6% -14.3% -10.8% -18.4% -13.4%
HEVC B -5.9% -5.3% -12.6% -9.8% -16.5% -12.5%
HEVC C -1.1% -2.2% -9.7% -7.6% -17.1% -13.7%
HEVC D 13.7% 3.1% 0.7% -3.0% -9.3% -10.8%
HEVC E -2.0% -2.9% -8.4% -6.4% -16.3% -14.4%
Table 11: Average BD Rate savings in Low Delay P mode for different complexities at a GoP size of two frames.

4.3 Low Delay B/P

The low delay setting is more challenging for our approach compared to the random access setting as the GoP size is reduced to only a few frames. The signalled parameters are likely to cause a higher bit rate overhead in this scenario. On the other hand, data that is already available at the decoder can be reused. Therefore, for our experiments, we signal new parameters only if their PSNR improvement is more than 10% higher than what the previously signalled CNN would achieve when applied to unseen frames of the following GoP. In practice, this enables our algorithm to be applied to lower resolution settings even for small GoP sizes.
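The reuse rule described above can be written as a one-line decision. The sketch interprets "more than 10% higher" as a relative margin on the PSNR improvement in dB; this interpretation, the function name and the example values are assumptions for illustration.

```python
def should_signal_new_parameters(gain_new_db, gain_reused_db, margin=0.10):
    """Return True if freshly trained parameters are worth signalling.

    gain_new_db: PSNR improvement of the CNN trained on the current GoP.
    gain_reused_db: PSNR improvement the previously signalled CNN achieves
        on the same, previously unseen GoP.
    margin: required relative advantage of the new parameters (10% here).
    """
    return gain_new_db > (1.0 + margin) * gain_reused_db


# Hypothetical gains in dB for two consecutive GoPs.
print(should_signal_new_parameters(0.30, 0.28))  # new network is only slightly better
print(should_signal_new_parameters(0.30, 0.20))  # new network is clearly better
```

When the function returns False, the encoder would only signal that the previous network is reused, so the parameter overhead for that GoP is close to zero.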

Following our analysis in the random access setting, we first evaluate the influence of pixel packing on the different test sets for both LD B and P settings. The results are listed in Tables 6 and 7, and reflect the preference of higher resolutions for larger receptive fields through pixel packing as observed in the random access setting for the Y channel. For chroma channels, no pixel packing yields the best results; for large resolutions, however, the performance drop is much smaller than for lower bit rates. As mentioned in the previous paragraph, the small number of frames per GoP makes it challenging to apply the algorithm to low bit rate sequences. While there are still significant BD rate savings for HEVC C and E, the algorithm fails when applied to HEVC D, as seen from the positive BD rate values (i.e. an increase in bit rate) in both low delay variants. In addition, application to the LDP mode yields greater improvement. LDP allows prediction in one direction only and hence gives the encoder fewer options to optimise the code and reduce the residual. It is hence plausible that the CNN based residual prediction has a higher chance of lifting unexploited patterns in the residual signal.

With a GoP size of five frames, we follow a common setting. However, in some applications an even lower latency is favourable. Tables 8 and 9 show BD rate savings for different GoP sizes, down to a single frame, for LDB and LDP, respectively. Unsurprisingly, a larger GoP size performs best in all settings. It is evident, though, that the proposed algorithm can achieve significant rate savings for the high resolution sequences in HEVC A & B even if only a single frame is processed at a time. For smaller resolutions, HEVC C & E, chroma channels still hold gains if the GoP size is reduced; in the luma domain, however, the BD rate turns positive because initial coding gains did not reach levels similar to those in the chroma domain. In the smallest resolution, HEVC D, gains vanish even for U and V channels if the algorithm runs in single frame mode, as the signalling overhead cannibalises any gains achieved by the neural network.

To measure the efficacy of pixel packing as a complexity reduction method, Tables 10 and 11 compare 2/2 pixel packing to downsizing by reducing the number of filters per layer. Similar to the random access results before, pixel packing is an efficient option for high resolution in the Y channel. For chroma channels, there is a significant drop in performance despite higher complexity when trading pixel packing for fewer filters. Overall, this underlines the importance of choosing the right complexity reduction method.

Figure 3: Overhead per frame caused by signalling the CNN parameters over time for different QP values when applied to the "Controlled Burn" test sequence. Vertical lines indicate scene changes.

Finally, Fig. 3 shows the progression of the frame-wise signalling overhead for the test sequence "Controlled Burn". The sequence features several scene changes and fast forwards that are indicated by vertical lines. We plot graphs for three different QPs. Note that "skipping" a network, i.e. indicating that the previously signalled network ought to be reused, is deactivated to emphasise the data rate behaviour in this case. The graphs reside at different levels as the weights are quantised more coarsely for higher QPs. Most of these abrupt changes in scene statistics cause a small upturn in the bit rate that is quickly reduced after a few frames. The first value (for frame 0) indicates the code size without reference to prior network parameters. It can easily be observed that, despite scene changes, the code size regularly settles far below that initial value. This shows that our modified loss function together with difference coding works effectively to yield a significant code size reduction in low delay stream settings.
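The behaviour in Fig. 3 follows from coding each network relative to the previous one. The sketch below illustrates only that difference-coding idea with a uniform quantiser; the quantisation step, the weight values and all function names are hypothetical, and neither the modified loss function nor the actual entropy coder is reproduced here.

```python
def quantise(weights, step):
    """Uniformly quantise weights to integer levels with the given step size."""
    return [round(w / step) for w in weights]


def difference_code(levels, previous_levels=None):
    """Symbols to transmit: raw levels for the first network, differences afterwards."""
    if previous_levels is None:
        return list(levels)
    return [cur - prev for cur, prev in zip(levels, previous_levels)]


def reconstruct(symbols, previous_levels=None):
    """Decoder side: undo the difference coding."""
    if previous_levels is None:
        return list(symbols)
    return [s + prev for s, prev in zip(symbols, previous_levels)]


STEP = 0.05  # hypothetical step size; a higher QP would use a coarser step

first = quantise([0.42, -0.13, 0.27, 0.08], STEP)   # network for frame 0
second = quantise([0.44, -0.12, 0.26, 0.08], STEP)  # slightly adapted network

symbols_first = difference_code(first)           # no prior network available
symbols_second = difference_code(second, first)  # coded relative to frame 0
print(symbols_first, symbols_second)
```

The difference symbols of the second network are small and partly zero, which is what makes them cheap for an entropy coder, and a coarser step pushes even more of them to zero.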

Ours Jia et al. Jia2019
Complexity (Y+U/V) Y U V Complexity Y U V
HEVC A 216 (114+102) -6.8% -13.0% -14.4% 326336 -6.6% -3.4% -3.0%
HEVC B 306 (204+102) -5.9% -9.8% -10.8% 326336 -6.5% -2.5% -2.7%
HEVC C 486 (384+102) -3.7% -9.4% -15.3% 326336 -4.5% -3.3% -4.5%
HEVC D 486 (384+102) -3.4% -8.7% -14.5% 326336 -3.3% -2.6% -3.6%
HEVC E 486 (384+102) -2.8% -5.9% -10.5% 326336 -9.0% -4.2% -5.3%
Table 12: Comparison of average BD Rate savings and complexity with a pretrained CNN approach in Random Access mode.
Ours Jia et al. Jia2019
Complexity (Y+U/V) Y U V Complexity Y U V
HEVC A 216 (114+102) -5.9% -15.5% -18.4% 326336 -6.7% -2.6% -1.9%
HEVC B 306 (204+102) -4.9% -12.5% -16.8% 326336 -5.7% -1.6% -2.2%
HEVC C 486 (384+102) -3.3% -12.1% -18.0% 326336 -5.0% -3.4% -5.0%
HEVC D 486 (384+102) 2.8% -7.8% -16.7% 326336 -3.8% -1.7% -2.6%
HEVC E 486 (384+102) -3.4% -10.2% -17.0% 326336 -8.6% -5.2% -5.6%
Table 13: Comparison of average BD Rate savings and complexity with a pretrained CNN approach in Low Delay B mode.
Ours Jia et al. Jia2019
Complexity (Y+U/V) Y U V Complexity Y U V
HEVC A 216 (114+102) -9.1% -15.7% -19.5% 326336 -3.5% 0.2% 0.3%
HEVC B 306 (204+102) -6.8% -13.9% -17.8% 326336 -4.5% -0.5% -1.1%
HEVC C 486 (384+102) -3.2% -12.4% -18.8% 326336 -4.4% -1.0% -3.0%
HEVC D 486 (384+102) 3.2% -7.9% -15.5% 326336 -3.5% -0.8% -0.9%
HEVC E 486 (384+102) -5.3% -12.6% -19.7% 326336 -7.7% -1.7% -0.9%
Table 14: Comparison of average BD Rate savings and complexity with a pretrained CNN approach in Low Delay P mode.

4.4 Comparison to pretrained CNNs

The previous sections presented and analysed our results under different configurations with varying parameter settings. In this section, we compare our results to the CNN-based denoising approach of Jia et al. Jia2019 , who propose an ensemble of networks. An additional discrimination network is responsible for choosing the network that denoises a particular patch. Jia et al. Jia2019 showed that they outperform similar approaches like VRCNN Dai2017b and VDSR Kim2016a and at the same time put an emphasis on parameter and complexity reduction in their network design. This offers a good comparison to our low complexity online learning approach.

Table 12 compares the two approaches in the random access setting. In accordance with the results presented above, we chose 2/2 and 1/2 pixel packing for the luma channel of HEVC A and B, respectively. All remaining results are obtained without pixel packing. Despite having about three orders of magnitude less complexity, our algorithm performs on par for HEVC A and D luma, and underperforms by 0.6% and 0.8% on the luma of HEVC B and C, respectively. For HEVC E, though, our algorithm is clearly outperformed by the static CNN. This may be caused by the low bit rate of HEVC E sequences and the fact that their content is largely static, in which case error recovery by a statically trained neural network may be easier than in scenes with a lot of motion. For chroma channels, on the other hand, our algorithm performs favourably on all test sets.

In the low delay setting, results differ between bi-directional (Table 13) and uni-directional (Table 14) prediction. As discussed before, our algorithm performs more efficiently in high bit rate (high resolution) settings; hence, we perform favourably in the LDP setting of HEVC A and B and contain the shortfall to 0.8% in LDB. In low bit rate settings (HEVC C-E), our algorithm is outperformed on the Y channel, while our chroma gains remain significantly above those of Jia et al. for all test sets.

Overall, this shows that our algorithm can perform at least on par with pretrained CNNs in high resolution settings and reduce the computational cost at the decoder to 0.1% of that of a pretrained CNN.
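The complexity ratio quoted above can be checked directly from the complexity columns of Table 12. The sketch assumes that both columns are expressed in the same unit, which the shared table header suggests; the dictionary keys are only labels.

```python
JIA_COMPLEXITY = 326336  # complexity listed for Jia et al. in Table 12

# Total (Y + U/V) complexity of our networks per test set, from Table 12.
OURS = {"HEVC A": 216, "HEVC B": 306, "HEVC C-E": 486}

ratios = {name: ours / JIA_COMPLEXITY for name, ours in OURS.items()}
for name, ratio in ratios.items():
    print(f"{name}: {100 * ratio:.3f}% of the pretrained CNN")
```

The ratios lie between roughly 0.07% and 0.15%, which is consistent with the figure of about 0.1% and with the three orders of magnitude mentioned in the random access comparison.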

5 Complexity

Complexity is a major issue in video coding, especially for the decoder. By design, our approach is asymmetric, requiring a higher complexity at the encoder but enabling a lower complexity at the decoder.

Device CPU GPU
Setting RA (32 Frames) RA (256 Frames) LDB LDP RA (32 Frames) RA (256 Frames) LDB LDP
HEVC A 124% 104% 113% 117% 101% 101% 101% 101%
HEVC B 168% 109% 136% 143% 103% 101% 102% 102%
HEVC C 434% 142% 295% 346% 113% 102% 108% 111%
HEVC D 1659% 295% 937% 1144% 163% 108% 136% 145%
HEVC E 714% 178% 416% 506% 125% 104% 114% 117%
Average 620% 166% 379% 451% 121% 104% 112% 115%
Jia et al. Jia2019 (Avg. over RA/LDB/LDP) 213%

Table 15: Encoding complexity of our algorithm relative to the HM-16.17 baseline. Low Delay with a GoP of 5 is used. The relative timing is computed by dividing the runtime of HM-16.17 including our algorithm by the original runtime of HM-16.17.

Table 15 shows the encoding complexity relative to the HM-16.17 baseline, computed as $T_{\mathrm{total}} / T_{\mathrm{HM}}$, where $T_{\mathrm{total}}$ is the total runtime of HM-16.17 and our algorithm and $T_{\mathrm{HM}}$ is the original HM runtime. The CPU (Intel i7-3770 @ 3.40GHz) performs significantly slower than the GPU. Besides the GPU’s parallel processing capabilities, the CPU used is about 6 years old, and more recent CPUs have even better support for SIMD operations. Jia et al. Jia2019 note that their encoding (using a GPU) takes on average 213% (they give an overhead of 113% without the HM-16.17 running time). Comparing that to our GPU averages for the 32-frame random access and low delay settings, ranging from 112% to 121%, our approach encodes significantly faster even though the network is trained during encoding while Jia et al. have a pretrained ensemble available. It should further be noted that we did not tune hyperparameters like the number of iterations or the batch size to optimise the training times. In addition, when we apply our algorithm to longer RAS (up to 256 frames) as reported in Table 4, the encoding time of our proposed algorithm remains constant as neither the batch size nor the number of iterations needs to be adjusted. This reduces the overall overhead even further. While such a configuration is not part of the HEVC Common Test Conditions evaluation settings, it is often used in practice.
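For clarity, the two ways of reporting runtime used above (relative runtime including the baseline, and overhead on top of the baseline) can be converted as follows. The function names and the timings in the first call are hypothetical; only the 113% and 213% figures are taken from the text.

```python
def relative_runtime(total_seconds, baseline_seconds):
    """Runtime of HM plus the CNN as a percentage of the plain HM runtime."""
    return 100.0 * total_seconds / baseline_seconds


def overhead_to_relative(overhead_percent):
    """Convert an overhead quoted without the baseline into a relative runtime."""
    return 100.0 + overhead_percent


# Hypothetical timings: HM alone takes 200 s, HM with online training takes 242 s.
print(relative_runtime(242.0, 200.0))

# Overhead of 113% as quoted for Jia et al., expressed on the scale of Table 15.
print(overhead_to_relative(113.0))
```

On this scale, 100% means no slowdown at all, so an entry of 121% corresponds to an overhead of 21% on top of the HM encoder.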

Device CPU GPU
HEVC A 417% 381% 356% 128% 125% 123%
HEVC B 416% 393% 423% 129% 128% 130%
HEVC C 488% 449% 467% 154% 149% 151%
HEVC D 456% 436% 455% 209% 203% 203%
HEVC E 783% 906% 932% 192% 204% 206%
Average 512% 514% 527% 162% 162% 163%
Jia et al. Jia2019  (Avg.) 11756%
Table 16: Decoding complexity of our algorithm relative to the HM-16.17 baseline. Low Delay with a GoP of 5 is used. The relative timing is computed by dividing the runtime of HM-16.17 including our algorithm by the original runtime of HM-16.17.

Table 16 shows the decoding complexity of our algorithm for different devices and settings. While the overhead is generally higher than for the encoding case, the CPU, for the reasons noted above, once again runs significantly slower. As the decoding process is similar for random access and low delay modes, there is little difference between their respective time complexities. When compared to the application of a pretrained CNN, the advantage of our algorithm becomes even more obvious. At the expense of training parameters at encoding time, our approach yields a network significantly smaller than a pretrained alternative, reducing decoding complexity by several orders of magnitude.

6 Conclusion

In this paper, we propose an online learning algorithm to exploit non-local redundancies in High Efficiency Video Coding. The novelty of our approach resides in the ability to learn parameters at encoding time and transmit those to the decoder in order to enable low complexity non-linear denoising. We propose a network design that is efficient enough for both PSNR improvement and signalling as part of the video bit stream. Extensive experimental results shed light on certain aspects of our algorithm design and demonstrate favourable performance over the HEVC CTC baseline and, for high resolutions, over a statically trained CNN ensemble in terms of coding gain. The low complexity design makes practical applications possible and thereby increases the potential impact of this work on future video coding technologies.


  • [1] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc Van Gool. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. NIPS, 2017.
  • [2] Mohammad Haris Baig, Vladlen Koltun, and Lorenzo Torresani. Learning to Inpaint for Image Compression. NIPS, 2017.
  • [3] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end Optimized Image Compression. ICLR, 2017.
  • [4] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. International Conference On Learning Representations, 2018.
  • [5] G Bjøntegaard. Calculation of Average PSNR Differences between RD curves. Technical report, ITU-T SG16/Q6, Austin, Texas, USA, 2001.
  • [6] Lukas Cavigelli, Pascal Hager, and Luca Benini. CAS-CNN: A deep convolutional neural network for image compression artifact suppression. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 752–759. IEEE, may 2017.
  • [7] Chung Yan Chih, Sih Sian Wu, Jan P. Klopp, and Liang Gee Chen. Accurate and Bandwidth Efficient Architecture for CNN-based Full-HD Super-Resolution. In Proceedings - IEEE International Symposium on Circuits and Systems, 2018.
  • [8] Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv, 2016.
  • [9] Yuanying Dai, Dong Liu, and Feng Wu. A convolutional neural network approach for post-processing in HEVC intra coding. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017.
  • [10] Yuanying Dai, Dong Liu, and Feng Wu. A Convolutional Neural Network Approach for Post-Processing in HEVC Intra Coding. In International Conference on Multimedia Modeling, pages 28–39. Springer, Cham, 2017.
  • [11] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
  • [12] Chih Ming Fu, Elena Alshina, Alexander Alshin, Yu Wen Huang, Ching Yeh Chen, Chia Yang Tsai, Chih Wei Hsu, Shaw Min Lei, Jeong Hoon Park, and Woo Jin Han. Sample adaptive offset in the HEVC standard. IEEE Transactions on Circuits and Systems for Video Technology, 2012.
  • [13] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets. arXiv preprint arXiv:1704.04861, 2017.
  • [14] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning, feb 2015.
  • [15] Chuanmin Jia, Shiqi Wang, Xinfeng Zhang, Shanshe Wang, Jiaying Liu, Shiliang Pu, and Siwei Ma. Content-Aware Convolutional Neural Network for In-loop Filtering in High Efficiency Video Coding. IEEE Transactions on Image Processing, pages 1–1, jan 2019.
  • [16] Hyunho Jo, Seanae Park, and Donggyu Sim. Parallelized deblocking filtering of HEVC decoders based on complexity estimation. Journal of Real-Time Image Processing, 2016.
  • [17] Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Troy Chinen, Sung Jin Hwang, Joel Shor, and George Toderici. Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks. Computer Vision and Pattern Recognition, 2017.
  • [18] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016.
  • [19] Sung Deuk Kim, Jaeyoun Yi, Hyun Mun Kim, and Jong Beom Ra. A deblocking filter with two separate modes in block-based video coding. IEEE Transactions on Circuits and Systems for Video Technology, 1999.
  • [20] Diederik P. Kingma and Jimmy Lei Ba. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations 2015, pages 1–15, 2015.
  • [21] Jan P Klopp, Yu-Chiang Frank Wang, and Liang-Gee Chen. Learning a Code-Space Predictor by Exploiting Intra-Image-Dependencies. In British Machine Vision Conference, pages 1–12, 2018.
  • [22] Andreas Krutz, Alexander Glantz, Michael Tok, Marko Esche, and Thomas Sikora. Adaptive global motion temporal filtering for high efficiency video coding. IEEE Transactions on Circuits and Systems for Video Technology, 2012.
  • [23] Chen Li, Li Song, Rong Xie, and Wenjun Zhang. CNN based post-processing to improve HEVC. In 2017 IEEE International Conference on Image Processing (ICIP), pages 4577–4580. IEEE, sep 2017.
  • [24] Mu Li, Wangmeng Zuo, Shuhang Gu, Debin Zhao, and David Zhang. Learning Convolutional Networks for Content-weighted Image Compression. 2017.
  • [25] Peter List, Anthony Joch, Jani Lainema, Gisle Bjøntegaard, and Marta Karczewicz. Adaptive deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology, 2003.
  • [26] Haojie Liu, Tong Chen, Peiyao Guo, Qiu Shen, and Zhan Ma. Gated Context Model with Embedded Priors for Deep Image Compression. feb 2019.
  • [27] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. DVC: An End-to-end Deep Video Compression Framework. In Computer Vision and Pattern Recognition, nov 2019.
  • [28] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Conditional Probability Models for Deep Image Compression.
  • [29] David Minnen, Johannes Ballé, and George Toderici. Joint Autoregressive and Hierarchical Priors for Learned Image Compression. In Neural Information Processing Systems, pages 10771–10780, 2018.
  • [30] Andrey Norkin, Gisle Bjøntegaard, Arild Fuldseth, Matthias Narroschke, Masaru Ikeda, Kenneth Andersson, Minhua Zhou, and Geert Van Der Auwera. HEVC deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology, 2012.
  • [31] Jens Rainer Ohm, Gary J. Sullivan, Heiko Schwarz, Thiow Keng Tan, and Thomas Wiegand. Comparison of the coding efficiency of video coding standards-including high efficiency video coding (HEVC). IEEE Transactions on Circuits and Systems for Video Technology, 2012.
  • [32] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic Differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • [33] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. arXiv, 2016.
  • [34] Oren Rippel and Lubomir Bourdev. Real-Time Adaptive Image Compression. ICML, 2017.
  • [35] Oren Rippel, Sanjay Nair, Carissa Lew, Steve Branson, Alexander G. Anderson, and Lubomir Bourdev. Learned Video Compression. nov 2018.
  • [36] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In F Pereira, C J C Burges, L Bottou, and K Q Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012.
  • [37] Gary J. Sullivan, Jens Rainer Ohm, Woo Jin Han, and Thomas Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 2012.
  • [38] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy Image Compression with Compressive Autoencoders. ICLR, pages 1–19, 2017.
  • [39] George Toderici, Sean M. O’Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. Variable Rate Image Compression with Recurrent Neural Networks. International Conference On Learning Representations, pages 1–9, 2015.
  • [40] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full Resolution Image Compression with Recurrent Neural Networks. Computer Vision and Pattern Recognition, 2016.
  • [41] Chia Yang Tsai, Ching Yeh Chen, Tomoo Yamakage, In Suk Chong, Yu Wen Huang, Chih Ming Fu, Takayuki Itoh, Takashi Watanabe, Takeshi Chujoh, Marta Karczewicz, and Shaw Min Lei. Adaptive loop filtering for video coding. IEEE Journal on Selected Topics in Signal Processing, 2013.
  • [42] Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 2003.
  • [43] Chao Yuan Wu, Nayan Singhal, and Philipp Krähenbühl. Video compression through image interpolation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 11212 LNCS, pages 425–440, apr 2018.
  • [44] Ning Yan, Dong Liu, Houqiang Li, and Feng Wu. A convolutional neural network approach for half-pel interpolation in video coding. In Proceedings - IEEE International Symposium on Circuits and Systems, 2017.
  • [45] Ren Yang, Mai Xu, Tie Liu, Zulin Wang, and Zhenyu Guan. Enhancing Quality for HEVC Compressed Videos. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2018.
  • [46] Ren Yang, Mai Xu, and Zulin Wang. Decoder-side HEVC quality enhancement with scalable convolutional neural network. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 817–822. IEEE, jul 2017.
  • [47] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 2017.
  • [48] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017.
  • [49] Lei Zhang and Wangmeng Zuo. Image Restoration: From Sparse and Low-Rank Priors to Deep Priors [Lecture Notes]. IEEE Signal Processing Magazine, 2017.
  • [50] Xinfeng Zhang, Ruiqin Xiong, Weisi Lin, Jian Zhang, Shiqi Wang, Siwei Ma, and Wen Gao. Low-Rank-Based Nonlocal Adaptive Loop Filter for High-Efficiency Video Compression. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
  • [51] Yongbing Zhang, Tao Shen, Xiangyang Ji, Yun Zhang, Ruiqin Xiong, and Qionghai Dai. Residual Highway Convolutional Neural Networks for in-loop Filtering in HEVC. IEEE Transactions on Image Processing, 27(8), 2018.