PyTorch code for the CVPR'20 paper "Learning Better Lossless Compression Using Lossy Compression"
We leverage the powerful lossy image compression algorithm BPG to build a lossless image compression system. Specifically, the original image is first decomposed into the lossy reconstruction obtained after compressing it with BPG and the corresponding residual. We then model the distribution of the residual with a convolutional neural network-based probabilistic model that is conditioned on the BPG reconstruction, and combine it with entropy coding to losslessly encode the residual. Finally, the image is stored using the concatenation of the bitstreams produced by BPG and the learned residual coder. The resulting compression system achieves state-of-the-art performance in learned lossless full-resolution image compression, outperforming previous learned approaches as well as PNG, WebP, and JPEG2000.
The need to efficiently store the ever-growing amounts of data generated continuously on mobile devices has spurred a lot of research on compression algorithms. Algorithms like JPEG for images and H.264 for videos are used by billions of people daily.
After the breakthrough results achieved with deep neural networks in image classification, and the subsequent rise of deep-learning-based methods, learned lossy image compression has emerged as an active area of research (e.g. [5, 44, 45, 36, 1, 3, 29, 27, 47]). In lossy compression, the goal is to achieve small bitrates given a certain allowed distortion in the reconstruction, i.e., the rate-distortion trade-off is optimized. In contrast, in lossless compression, no distortion is allowed, and we aim to reconstruct the input perfectly by transmitting as few bits as possible. To this end, a probabilistic model of the data can be used together with entropy coding techniques to encode and transmit data via a bitstream. The theoretical foundation for this idea is given in Shannon's landmark paper, which proves a lower bound for the bitrate achievable by such a probabilistic model, and quantifies the overhead incurred by using an imprecise model of the data distribution. One beautiful result is that maximizing the likelihood of a parametric probabilistic model is equivalent to minimizing the bitrate obtained when using that model for lossless compression with an entropy coder. Learning parametric probabilistic models by likelihood maximization has been studied to a great extent in the generative modeling literature (e.g. [49, 48, 38, 33, 24]). Recent works have linked these results to learned lossless compression [28, 17, 46, 23].
Even though recent learned lossy image compression methods achieve state-of-the-art results on various data sets, the results obtained by the non-learned H.265-based BPG [42, 6] are still highly competitive, without requiring sophisticated hardware accelerators such as GPUs to run. While BPG was outperformed by learning-based approaches across the bitrate spectrum in terms of PSNR and visual quality, it still excels particularly at high-PSNR lossy reconstructions.
In this paper, we propose a learned lossless compression system by leveraging the power of the lossy BPG, as illustrated in Fig. 1. Specifically, we decompose the input image $x$ into the lossy reconstruction $x_l$ produced by BPG and the corresponding residual $r$. We then learn a probabilistic model $p(r \mid x_l)$ of the residual, conditionally on the lossy reconstruction $x_l$. This probabilistic model is fully convolutional and can be evaluated using a single forward pass, both for encoding and decoding. We combine it with an arithmetic coder to losslessly compress the residual and store or transmit the image as the concatenation of the bitstrings produced by BPG and the residual compressor. Further, we use a computationally inexpensive technique from the generative modeling literature, tuning the "certainty" (temperature) of $p(r \mid x_l)$, as well as an auxiliary shallow classifier to predict the quantization parameter of BPG in order to optimize our compressor on a per-image basis. These components together lead to a state-of-the-art full-resolution learned lossless compression system. All of our code and data sets are available on GitHub (https://github.com/fab-jul/RC-PyTorch).
In contrast to recent work in lossless compression, we do not need to compute and store any side information (as opposed to L3C), and our CNN is lightweight enough to train and evaluate on high-resolution natural images (as opposed to [17, 23], which have not been scaled to full-resolution images to our knowledge).
In summary, our main contributions are:
We leverage the power of the classical state-of-the-art lossy compression algorithm BPG in a novel way to build a conceptually simple learned lossless image compression system.
Our system is optimized on a per-image basis with a light-weight post-training step, where we obtain a lower-bitrate probability distribution by adjusting the confidence of the predictions of our probabilistic model.
Our system outperforms the state-of-the-art in learned lossless full-resolution image compression, L3C, as well as the classical engineered algorithms WebP, JPEG2000, and PNG. Further, in contrast to L3C, we also outperform FLIF on Open Images, the domain where our approach (as well as L3C) is trained.
Arguably most closely related to this paper, Mentzer et al. build a computationally cheap hierarchical generative model (termed L3C) to enable practical compression on full-resolution images.
Townsend et al. and Kingma et al. leverage the "bits-back scheme" for lossless compression of an image stream, where the overall bitrate of the stream is reduced by leveraging previously transmitted information. Motivated by recent progress in generative modeling using (continuous) flow-based models (e.g. [34, 22]), Hoogeboom et al. propose Integer Discrete Flows (IDFs), defining an invertible transformation for discrete data. In contrast to L3C, the latter works focus on smaller data sets such as MNIST, CIFAR-10, ImageNet32, and ImageNet64, where they achieve state-of-the-art results.
As mentioned in Section 1, virtually every generative model can be used for lossless compression when combined with an entropy coding algorithm. Therefore, while the following generative approaches do not take a compression perspective, they are still related. The state-of-the-art PixelCNN-based models rely on auto-regression in RGB space to efficiently model a conditional distribution. The original PixelCNN and PixelRNN model the probability distribution of a pixel given all previous pixels (in raster-scan order). To use these models for lossless compression, $\mathcal{O}(H \cdot W)$ forward passes are required, where $H$ and $W$ are the image height and width, respectively. Various speed optimizations and a probability model amenable to faster training were proposed in follow-up work. Other parallelization techniques were also developed, including modeling the image distribution conditionally on subsampled versions of the image, as well as conditioning on an RGB pyramid and grayscale images. Similar techniques were also used by [8, 30].
The widespread PNG applies simple autoregressive filters to remove redundancies from the RGB representation (e.g. replacing pixels with the difference to their left neighbor), and then uses the DEFLATE algorithm for compression. In contrast, WebP uses larger windows to transform the image (enabling patch-wise conditional compression), and relies on a custom entropy coder for compression. Mainly in use for lossy compression, JPEG2000 also has a lossless mode, where an invertible mapping from RGB to compression space is used. At the heart of FLIF is an entropy coding method called "meta-adaptive near-zero integer arithmetic coding" (MANIAC), which is based on the CABAC method used in, e.g., H.264. In CABAC, the context model used to compress a symbol is selected from a finite set based on local context. The "meta-adaptive" part in MANIAC refers to the context model, which is a decision tree learned per image.
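The left-neighbor filter mentioned above for PNG can be illustrated with a minimal sketch. It assumes a single-channel row of 8-bit values and treats the neighbor of the first value as 0; the function names are illustrative and not taken from any PNG library.

```python
def sub_filter(row):
    """Replace each 8-bit value by its difference to the left neighbor, modulo 256."""
    out = []
    left = 0  # assumption: the first value has no neighbor, so we use 0
    for value in row:
        out.append((value - left) % 256)
        left = value
    return out


def undo_sub_filter(filtered):
    """Invert sub_filter by accumulating the differences, modulo 256."""
    out = []
    left = 0
    for diff in filtered:
        value = (diff + left) % 256
        out.append(value)
        left = value
    return out


row = [100, 101, 103, 103, 102, 250, 2]
filtered = sub_filter(row)          # smooth regions turn into values near 0 (or 255)
restored = undo_sub_filter(filtered)
```

Because the filter is exactly invertible, no information is lost; the filtered row is simply easier for a generic compressor such as DEFLATE to model.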
Artifact removal methods in the context of lossy compression are related to our approach in that they aim to make predictions about the information lost during the lossy compression process. In this context, the goal is to produce sharper and/or more visually pleasing images given a lossy reconstruction from, e.g., JPEG. Dong et al. proposed the first CNN-based approach using a network inspired by super-resolution networks. Later work extends this using a residual structure, or relies on hierarchical skip connections and a multi-scale loss. Generative models in the context of artifact removal have also been explored, using GANs to obtain more visually pleasing results.
We give a very brief overview of lossless compression basics here and refer to the information theory literature for details [39, 9]. In lossless compression, we consider a stream of symbols $x_1, \dots, x_N$, where each $x_i$ is an element from the same finite set $\mathcal{X}$. The stream is obtained by drawing each symbol $x_i$ independently from the same distribution $\tilde{p}$, i.e., the $x_i$ are i.i.d. according to $\tilde{p}$. We are interested in encoding the symbol stream into a bitstream, such that we can recover the exact symbols by decoding. In this setup, the entropy of $\tilde{p}$ is equal to the expected number of bits needed to encode each $x_i$:
$$H(\tilde{p}) = \mathbb{E}_{x_i \sim \tilde{p}}\left[-\log_2 \tilde{p}(x_i)\right]. \tag{1}$$
In general, however, the exact $\tilde{p}$ is unknown, and we instead consider the setup where we have an approximate model $p$. Then, the expected bitrate will be equal to the cross-entropy between $\tilde{p}$ and $p$, given by:
$$H(\tilde{p}, p) = \mathbb{E}_{x_i \sim \tilde{p}}\left[-\log_2 p(x_i)\right]. \tag{2}$$
Intuitively, the larger the discrepancy between the model $p$ used for coding and the real $\tilde{p}$, the more bits we need to encode data that is actually distributed according to $\tilde{p}$.
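The relationship between entropy and cross-entropy can be checked numerically. The following is a minimal sketch on a made-up three-symbol alphabet; the distributions and function names are illustrative assumptions, not values from the paper.

```python
import math


def entropy_bits(p_true):
    """Expected bits per symbol when coding with the true distribution."""
    return -sum(p * math.log2(p) for p in p_true.values() if p > 0)


def cross_entropy_bits(p_true, p_model):
    """Expected bits per symbol when the data follows p_true but we code with p_model."""
    return -sum(p * math.log2(p_model[s]) for s, p in p_true.items() if p > 0)


# Toy distributions (assumed for illustration only).
p_true = {"a": 0.5, "b": 0.25, "c": 0.25}
p_model = {"a": 0.25, "b": 0.25, "c": 0.5}

h = entropy_bits(p_true)                   # lower bound on the bitrate
ce = cross_entropy_bits(p_true, p_model)   # bitrate with the imperfect model
overhead = ce - h                          # extra bits per symbol caused by the mismatch
```

The overhead is never negative, which is why a model that matches the data distribution more closely directly yields a smaller bitstream.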
Given a symbol stream as above and a probability distribution $p$ (not necessarily $\tilde{p}$), we can encode the stream using entropy coding. Intuitively, we would like to build a table that maps every element in $\mathcal{X}$ to a bit sequence, such that $x_i$ gets a short sequence if $p(x_i)$ is high. The optimum is to output $-\log_2 p(x_i)$ bits for symbol $x_i$, which entropy coding algorithms achieve up to a small overhead. Examples include Huffman coding and arithmetic coding.
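To make the link between probabilities and code lengths concrete, here is a minimal sketch of Huffman code-length construction using only the standard library. The alphabet and probabilities are assumptions chosen to be powers of two, the case in which Huffman coding matches the ideal lengths exactly; this is not the arithmetic coder used in the paper.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code built from probs."""
    # Heap entries: (probability, unique tie breaker, {symbol: depth so far}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


# Toy distribution (assumed for illustration only).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)
ideal = {s: -math.log2(p) for s, p in probs.items()}
```

For probabilities that are not powers of two, Huffman coding has to round to whole bits per symbol, which is one reason arithmetic coding is preferred when the model outputs fine-grained probabilities.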
In general, we can use a different distribution $p_i$ for every symbol $x_i$ in the stream, as long as the $p_i$ are also available for decoding. Adaptive entropy coding algorithms work by allowing such varying distributions as a function of previously encoded symbols. In this paper, we use adaptive arithmetic coding.
As explained in the previous section, all we need for lossless compression is a model $p$, since we can use entropy coding to encode and decode any input losslessly given $p$. In particular, we can use a CNN to parametrize $p$. To this end, one general approach is to introduce (structured) side information $z$ available both at encoding and decoding time, and model the probability distribution of natural images $x$ conditionally on $z$, using the CNN to parametrize $p(x \mid z)$. (Throughout, $p(x \mid z)$ denotes either the entire probability mass function or its value at a particular $x$, depending on context.) Assuming that both the encoder and decoder have access to $z$ and the CNN, we can losslessly encode $x$ as follows: We first use the CNN to produce $p(x \mid z)$. Then, we employ an entropy encoder (described in the previous section) with $p(x \mid z)$ to encode $x$ to a bitstream. To decode, we once again feed $z$ to the CNN, obtaining $p(x \mid z)$, and decode $x$ from the bitstream using the entropy decoder.
One key difference among the approaches in the literature is the factorization of the joint distribution of $x$ and $z$. In the original PixelCNN paper the image is modeled as a sequence of pixels, and $z$ corresponds to all previous pixels. Encoding as well as decoding are done autoregressively. In IDF, $x$ is mapped to a $z$ using an invertible function, and $z$ is then encoded using a fixed prior $p(z)$, i.e., here $z$ is a deterministic function of $x$. In approaches based on the bits-back paradigm [46, 23], while encoding, $z$ is obtained by decoding from additional available information (e.g. previously encoded images). In L3C, $z$ corresponds to features extracted with a hierarchical model that are also saved to the bitstream using hierarchically predicted distributions.
BPG is a lossy image compression method based on the HEVC video coding standard, essentially applying HEVC on a single image. To motivate our usage of BPG, we show the histogram of the marginal pixel distribution of the residuals obtained by BPG on Open Images (one of our testing sets, see Section 5.1) in Fig. 2. Note that while the possible range of a residual is $\{-255, \dots, 255\}$, we observe that for most images, nearly every point in the residual lies in a much smaller set of values around zero, which is indicative of the high-PSNR nature of BPG. Additionally, Fig. A1 (in the suppl.) presents a comparison of BPG to the state-of-the-art learned image compression methods, showing that BPG is still very competitive in terms of PSNR.
BPG follows JPEG in having a chroma format parameter to enable color space subsampling, which we disable by setting it to 4:4:4. The only remaining parameter to set is the quantization parameter $Q$, where $Q \in \{0, \dots, 51\}$. Smaller $Q$ results in less quantization and thus better quality (i.e., different to the quality factor of JPEG, where larger means better reconstruction quality). We learn a classifier to predict $Q$, described in Section 4.4.
We give an overview of our method in Fig. 1. To encode an image $x$, we first obtain the quantization parameter $Q$ from the Q-Classifier (QC) network (Section 4.4). Then, we compress $x$ with BPG, to obtain the lossy reconstruction $x_l$, which we save to a bitstream. Given $x_l$, the Residual Compressor (RC) network (Section 4.1) predicts the probability mass function of the residual $r = x - x_l$, i.e.,
$$p(r \mid x_l) = \mathrm{RC}(x_l). \tag{3}$$
We model $p(r \mid x_l)$ as a discrete mixture of logistic distributions (Section 4.2). Given $p(r \mid x_l)$ and $r$, we compress $r$ to the bitstream using an adaptive arithmetic coding algorithm (see Section 3.1). Thus, the bitstream consists of the concatenation of the codes corresponding to $x_l$ and $r$. To decode $x$ from the bitstream, we first obtain $x_l$ using the BPG decoder, then we obtain once again $p(r \mid x_l) = \mathrm{RC}(x_l)$, and subsequently decode $r$ from the bitstream using $p(r \mid x_l)$. Finally, we can reconstruct $x = x_l + r$. In the formalism of Section 3.2, the residual $r$ plays the role of $x$ and the lossy reconstruction $x_l$ plays the role of the side information $z$.
Note that no matter how bad RC is at predicting the real distribution of $r$, we can always do lossless compression. Even if RC were to predict, e.g., a uniform distribution, decoding would still be exact; we would just need many bits to store $r$.
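The decomposition into a lossy reconstruction plus a residual can be sketched in a few lines. The code below is only an illustration of why the scheme is lossless: `toy_lossy_codec` is a hypothetical stand-in (coarse rounding) for BPG, and the entropy coding of the residual is omitted entirely.

```python
def toy_lossy_codec(pixels, step=8):
    """Hypothetical stand-in for BPG: round each 8-bit value to a multiple of `step`."""
    return [min(255, round(p / step) * step) for p in pixels]


def encode(pixels):
    """Split the input into a lossy reconstruction and the integer residual."""
    lossy = toy_lossy_codec(pixels)
    residual = [p - q for p, q in zip(pixels, lossy)]
    return lossy, residual


def decode(lossy, residual):
    """Adding the residual back recovers the input exactly."""
    return [q + d for q, d in zip(lossy, residual)]


x = [0, 3, 17, 128, 200, 255]
x_l, r = encode(x)
x_rec = decode(x_l, r)
```

Whatever the lossy codec does, `x_rec` equals `x`; the quality of the lossy codec and of the residual model only affects how many bits the two parts cost.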
The RC network first convolves its input $x_l$ to a feature map with $C_f$ channels, which we then downscale using a stride-2 convolution, and feed through 16 residual blocks. Instead of BatchNorm layers as in ResNet, our residual blocks contain GDN layers. Subsequently, we upscale back to the resolution of the input image using a transposed convolution. The resulting features are concatenated with the earlier features via a skip connection, and convolved to contract the channels back to $C_f$, like in U-Net. Finally, the network splits into four tails, predicting the different parameters of the mixture model, $\pi, \mu, \sigma, \lambda$, described next.
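We use a discrete mixture of logistics to model the probability mass function of the residual, $p(r \mid x_l)$, similar to [28, 38]. We closely follow the formulation of prior work here: Let $c$ denote the RGB channel and $u, v$ the spatial location. We define
$$p(r \mid x_l) = \prod_{u,v} p(r_{1uv}, r_{2uv}, r_{3uv} \mid x_l). \tag{4}$$
We use a (weak) autoregression over the three RGB channels to define the joint distribution over channels via logistic mixtures $p_m$:
$$p(r_1, r_2, r_3 \mid x_l) = p_m(r_1 \mid x_l) \cdot p_m(r_2 \mid x_l, r_1) \cdot p_m(r_3 \mid x_l, r_2, r_1), \tag{5}$$
where we removed the indices $uv$ to simplify the notation. For the mixture $p_m$ we use a mixture of $K = 5$ logistic distributions $p_L$. Our distributions are defined by the outputs of the RC network, which yields mixture weights $\pi_c^k$, means $\mu_c^k$, scales $\sigma_c^k$, as well as mixture coefficients $\lambda_c^k$. The autoregression over RGB channels is only used to update the means using a linear combination of $\mu$ and the target $r$ of previous channels, scaled by the coefficients $\lambda$. We thereby obtain $\tilde{\mu}$:
$$\tilde{\mu}_1^k = \mu_1^k, \qquad \tilde{\mu}_2^k = \mu_2^k + \lambda_\alpha^k r_1, \qquad \tilde{\mu}_3^k = \mu_3^k + \lambda_\beta^k r_1 + \lambda_\gamma^k r_2. \tag{6}$$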
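A discretized logistic mixture and the channel-wise mean update can be sketched as follows. This is a simplified illustration under stated assumptions: the probability of an integer is taken as the logistic CDF difference over a unit-width bin, with the two boundary bins absorbing the tails (a common convention that the text here does not spell out), and all parameter values are made up.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def mixture_pmf(r, weights, means, scales, lo=-255, hi=255):
    """P(residual == r) under a mixture of logistics discretized to integer bins."""
    total = 0.0
    for pi, mu, sigma in zip(weights, means, scales):
        # Assumption: the edge bins extend to -inf / +inf so the PMF sums to 1.
        upper = 1.0 if r == hi else sigmoid((r + 0.5 - mu) / sigma)
        lower = 0.0 if r == lo else sigmoid((r - 0.5 - mu) / sigma)
        total += pi * (upper - lower)
    return total


def update_means(mu, lam, r1, r2):
    """Weak RGB autoregression: shift the 2nd and 3rd channel means using earlier targets."""
    mu1, mu2, mu3 = mu
    lam_alpha, lam_beta, lam_gamma = lam
    return (mu1, mu2 + lam_alpha * r1, mu3 + lam_beta * r1 + lam_gamma * r2)


# Toy two-component mixture (values assumed for illustration only).
weights, means, scales = [0.7, 0.3], [0.0, 2.0], [1.0, 3.0]
total_mass = sum(mixture_pmf(r, weights, means, scales) for r in range(-255, 256))
shifted = update_means((0.0, 1.0, 2.0), (0.5, 0.25, -1.0), r1=2, r2=4)
```

Because the mean update only needs the already decoded channels, the decoder can reproduce exactly the same distributions as the encoder.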
As motivated in Section 3.1, we are interested in minimizing the cross-entropy between the real distribution of the residual $\tilde{p}(r \mid x_l)$ and our model $p(r \mid x_l)$: the smaller the cross-entropy, the closer $p$ is to $\tilde{p}$, and the fewer bits an entropy coder will use to encode $r$. We consider the setting where we have $N$ training images $x^{(1)}, \dots, x^{(N)}$. For every image, we compute the lossy reconstruction $x_l^{(i)}$ as well as the corresponding residual $r^{(i)} = x^{(i)} - x_l^{(i)}$. While the true distribution $\tilde{p}$ is unknown, we can consider the empirical distribution obtained from the samples and minimize:
$$\mathcal{L}(\mathrm{RC}) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p\left(r^{(i)} \mid x_l^{(i)}\right). \tag{7}$$
This loss decomposes over samples, allowing us to minimize it over mini-batches. Note that minimizing Eq. 7 is the same as maximizing the likelihood of $p(r \mid x_l)$, which is the perspective taken in the likelihood-based generative modeling literature.
A random set of natural images is expected to contain images of varying "complexity", where complex can mean a lot of high frequency structure and/or noise. While virtually all lossy compression methods have a parameter like BPG's $Q$ to navigate the trade-off between bitrate and quality, it is important to note that compressing a random set of natural images with the same fixed $Q$ will usually lead to the bitrates of these images being spread around some $Q$-dependent mean. Thus, in our approach, it is suboptimal to fix $Q$ for all images.
Indeed, in our pipeline we have a trade-off between the bits allocated to BPG and the bits allocated to encoding the residual. This trade-off can be controlled with $Q$: For example, if an image contains components that are easier for the RC network to model, it is beneficial to use a higher $Q$, such that BPG does not waste bits encoding these components. We observe that for a fixed image, and a trained RC, there is a single optimal $Q$.
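The trade-off can be made explicit with a brute-force search over candidate values of the quantization parameter. The sketch below is an illustration under assumptions: both cost functions are invented toy curves standing in for running BPG and the residual coder, and the candidate range is arbitrary.

```python
def pick_best_q(candidates, bpg_bits, residual_bits):
    """Return the candidate with the smallest total bit cost, plus all totals."""
    totals = {q: bpg_bits(q) + residual_bits(q) for q in candidates}
    best_q = min(totals, key=totals.get)
    return best_q, totals


def toy_bpg_bits(q):
    """Invented curve: a coarser quantization (larger q) makes the lossy part cheaper."""
    return 1000.0 - 40.0 * q


def toy_residual_bits(q):
    """Invented curve: a coarser quantization leaves a larger, costlier residual."""
    return 2.0 * q * q


best_q, totals = pick_best_q(range(5, 16), toy_bpg_bits, toy_residual_bits)
```

In a real pipeline each candidate requires one lossy compression and one pass through the residual model, which is the cost a learned classifier for the quantization parameter is meant to avoid.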
To efficiently obtain a good $Q$, we train a simple classifier network, the Q-Classifier (QC), and then use the predicted $Q$ to compress $x$ with BPG. For the architecture, we use a light-weight ResNet-inspired network with 8 residual blocks for QC, and train it to predict a class in a small set $\mathcal{Q}$ of candidate $Q$ values, given an image $x$ ($\mathcal{Q}$ was selected using the Open Images validation set). In contrast to ResNet, we employ no normalization layers (to ensure that the prediction is independent of the input size). Further, the final features are obtained by average pooling each channel of the final feature map. The resulting feature vector is mapped to one logit per class, and the logits are then normalized with a softmax. Details are shown in Section A.1 in the supplementary material.
While the input to QC is the full-resolution image, the network is shallow and downsamples multiple times, making this a computationally lightweight component.
Inspired by the temperature scaling employed in the generative modeling literature, we further optimize the predicted distribution with a simple trick: Intuitively, if RC predicts a $\tilde{\mu}_c^k$ that is close to the target $r_c$, we can make the cross-entropy in Eq. 7 (and thus the bitrate) smaller by making the predicted logistic "more certain" by choosing a smaller $\sigma_c^k$. This shifts probability mass towards $r_c$. However, there is a breaking point, where we make it "too certain" (i.e., the probability mass concentrates too tightly around $\tilde{\mu}_c^k$) and the cross-entropy increases again.
While RC is already trained to learn a good $\sigma$, the prediction is only based on $x_l$. We can improve the final bitrate during encoding, when we additionally have access to the target $r$, by rescaling the predicted $\sigma_c^k$ with a factor $\tau_c^k$, chosen for every mixture $k$ and every channel $c$. This yields a better-suited scale $\tilde{\sigma}_c^k = \tau_c^k \cdot \sigma_c^k$. Obviously, $\tau$ also needs to be known for decoding, and we thus have to transmit it via the bitstream. However, since we only learn a $\tau$ for every channel and every mixture (and not for every spatial location), this causes a negligible overhead of $C \cdot K = 15$ floats, i.e., 60 bytes.
We find $\tau$ for a given image by minimizing the negative log-likelihood (cf. Eq. 7) on that image, i.e., we optimize
$$\min_{\tau} \sum_{c,u,v} -\log_2 p_\tau(r_{cuv} \mid x_l), \tag{8}$$
where $p_\tau$ is equal to $p$ predicted from RC but using $\tilde{\sigma}_c^k$. To optimize Eq. 8, we use stochastic gradient descent with momentum and a very high learning rate, which converges in 10-20 iterations, depending on the image.
We note that this is also computationally cheap. Firstly, we only need to do the forward pass through RC once, to get $\pi, \mu, \sigma, \lambda$, and then in every step of the $\tau$-optimization, we only need to evaluate $\tilde{\sigma}$ and subsequently Eq. 8. Secondly, the optimization is only over 15 parameters. Finally, since for practical $H \times W$-dimensional images, $H \cdot W \gg 15$, we can do the sum in Eq. 8 over a spatially subsampled set of locations.
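The effect of rescaling the predicted scale can be reproduced on toy data. The sketch below makes several simplifying assumptions that differ from the method described above: it uses a single logistic per sample instead of a mixture, one shared scaling factor instead of one per channel and mixture, and a plain grid search in place of stochastic gradient descent. All numbers are invented.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def total_bits(residuals, means, scales, tau):
    """Ideal code length of the residuals when every scale is multiplied by tau."""
    bits = 0.0
    for r, mu, sigma in zip(residuals, means, scales):
        s = tau * sigma
        prob = sigmoid((r + 0.5 - mu) / s) - sigmoid((r - 0.5 - mu) / s)
        bits += -math.log2(max(prob, 1e-12))  # guard against log(0)
    return bits


# Toy data: the targets sit close to the means, but the predicted scales are too wide.
residuals = [0, 0, 1, -1, 0, 2, 0, -1]
means = [0.0] * len(residuals)
scales = [3.0] * len(residuals)

taus = [0.1 * i for i in range(1, 21)]  # grid from 0.1 to 2.0
best_tau = min(taus, key=lambda t: total_bits(residuals, means, scales, t))
bits_before = total_bits(residuals, means, scales, 1.0)
bits_after = total_bits(residuals, means, scales, best_tau)
```

Only the cheap likelihood evaluation is repeated per candidate; the network outputs (`means` and `scales` here) are computed once, which mirrors why the per-image tuning step is inexpensive.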
Like L3C, we train on images from the Open Images data set. These images are made available as JPEGs, which is not ideal for the lossless compression task we are considering, but we are not aware of a similarly large scale lossless training data set. To prevent overfitting on JPEG artifacts, we downscale each training image by a randomly selected factor, using the Lanczos filter provided by the Pillow library. For a fair comparison, the L3C baseline results were also obtained by training on the exact same data set.
We evaluate our model on four data sets: Open Images is a subset of 500 images from the Open Images validation set, preprocessed like the training data. CLIC.mobile and CLIC.pro are two new data sets commonly used in recent image compression papers, released as part of the "Workshop and Challenge on Learned Image Compression" (CLIC). CLIC.mobile contains 61 images taken using cell phones, while CLIC.pro contains 41 images from DSLRs, retouched by professionals. Finally, we evaluate on the 100 images from DIV2K, a super-resolution data set with high-quality images. We show examples from these data sets in Section A.3.
For a small fraction of exceptionally high-resolution images (note that the considered testing sets contain images of widely varying resolution), we follow L3C in extracting 4 non-overlapping crops $x_1, \dots, x_4$ from the image $x$ such that combining the crops yields $x$. We then compress the crops individually. However, we evaluate the non-learned baselines on the full images to avoid a bias in favor of our method.
We train the RC network on batches of 16 random crops extracted from the training set, using the RMSProp optimizer. We decay the initial learning rate (LR) by a fixed factor at regular intervals. Since our Q-Classifier is trained on the output of a trained RC network, it is not available while training the RC network. Thus, we compress the training images with a $Q$ selected at random from a small set of values, obtaining a pair $(x, x_l)$ for every image.
Given a trained RC network, we randomly select 10% of the training set, and compress each selected image $x$ once for each $Q \in \mathcal{Q}$, obtaining an $x_l^{(Q)}$ for each $Q$. We then evaluate RC for each pair $(x, x_l^{(Q)})$ to find the optimal $Q'$ that gives the minimum bitrate for that image. The resulting list of pairs $(x, Q')$ forms the training set for the QC. For training, we use a standard cross-entropy loss between the softmax-normalized logits and the one-hot encoded ground truth. We train for 11 epochs on batches of 32 random crops, using the Adam optimizer. We set the initial LR to the Adam-default $10^{-3}$, and decay it after 5 and 10 epochs.
As noted in Section 5.2, we select a random $Q$ during training, since QC is only available after training. We explored fixing $Q$ to a single value and found that this hurts generalization performance. This may be explained by the fact that RC sees more varied residual statistics during training if $Q$ is varied.
Using small crops to train a model evaluated on full-resolution images may seem too constraining. To explore the effect of crop size, we trained different models, each seeing the same number of pixels in every iteration, but distributed differently in terms of batch size vs. crop size. We trained each model for the same number of iterations, and then evaluated on the Open Images validation set (using a fixed $Q$ for training and testing). The results are shown in the following table and indicate that smaller crops and bigger batch sizes are beneficial.
| Batch Size | Crop Size | bpsp on Open Images |
We found that the GDN layers are crucial for good performance. We also explored instance normalization, and conditional instance normalization layers, in the latter case conditioning on the bitrate of BPG, in the hope that this would allow the network to distinguish different operation modes. However, we found that instance normalization is more sensitive to the resolution used for training, which led to worse overall bitrates.
We follow previous work in evaluating bits per subpixel (each RGB pixel has 3 subpixels), bpsp for short, sometimes called bits per dimension. In Table 1, we show the performance of our approach on the described test sets. On Open Images, the domain where we train, we outperform all methods, including FLIF. Note that while L3C was trained on the same data set, it does not outperform FLIF. On the other data sets, we consistently outperform both L3C and the non-learned approaches PNG, WebP, and JPEG2000.
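As a quick illustration of the metric, bits per subpixel can be computed from a compressed size and the image dimensions. This is a minimal sketch; the function name and the example numbers are assumptions chosen for easy arithmetic, not results from the paper.

```python
def bits_per_subpixel(num_bytes, height, width, channels=3):
    """Compressed size in bits divided by the number of subpixels (H * W * channels)."""
    return 8 * num_bytes / (height * width * channels)


# A hypothetical 512x512 RGB image stored in 98,304 bytes.
bpsp = bits_per_subpixel(num_bytes=98_304, height=512, width=512)
```

An uncompressed 8-bit RGB image corresponds to 8 bpsp, so the value directly expresses the compression ratio relative to raw storage.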
These results indicate that our simple approach of using a powerful lossy compressor to compress the high-level image content and leveraging a complementary learned probabilistic model to model the low-level variations for lossless residual compression is highly effective. Even though we only train on Open Images, our method can generalize to various domains of natural images: mobile phone pictures (CLIC.mobile), images retouched by professional photographers (CLIC.pro), as well as high-quality images with diverse complex structures (DIV2K).
In Fig. 4 we show the bpsp of each of the 500 images of Open Images, when compressed using our method, FLIF, and PNG. For our approach, we also show the bits used to store $x_l$ for each image, measured in bpsp on top ("$x_l$ only"), and as a percentage on the bottom. The percentage averages at 42%, going up towards the high-bpsp end of the figure. This plot shows the wide range of bpsp covered by a random set of natural images, and motivates our Q-Classifier. We can also see that while our method tends to outperform FLIF on average, FLIF is better for some high-bpsp images, where the bpsp of both FLIF and our method approach that of PNG.
Figure 5: Input/output $x$, lossy reconstruction $x_l$, residual $r$, and two samples from our predicted $p(r \mid x_l)$.
We compare the decoding speed of RC to that of L3C, using an NVidia Titan XP. For our components: BPG: 163ms; RC: 166ms; arithmetic coding: 89.1ms; i.e., 418ms in total, compared to L3C's 374ms.
QC and $\tau$-optimization are only needed for encoding. We discussed above that both components are computationally cheap. In terms of actual runtime: QC: 6.48ms; $\tau$-optimization: 35.2ms.
In Table 2 we show the benefits of using the Q-Classifier as well as the $\tau$-optimization. We show the resulting bpsp for the Open Images validation set (top) and for DIV2K (bottom), as well as the percentage of predicted $Q$ that are close to the optimal $Q'$ (the "to $Q'$" column), against a baseline of using a fixed $Q$ (the mean over QC's training set, see Section 5.2). The last column shows the required number of forward passes through RC.
We first note that even though the QC was only trained on Open Images (see Section 5.2), we get similar behavior on Open Images and DIV2K. Moreover, we see that using QC is clearly beneficial over using a fixed $Q$ for all images, and only incurs a small increase in bpsp compared to using the optimal $Q'$ on both data sets. This can be explained by the fact that QC manages to predict a $Q$ close to $Q'$ for most of the images in Open Images and DIV2K.
Furthermore, the small increase in bpsp is traded for a reduction from requiring one forward pass per candidate $Q$ to compute $Q'$ to a single one. In that sense, using the QC is similar to the "fast" modes common in image compression algorithms, where speed is traded against bitrate.
Table 2 shows that using $\tau$-optimization on top of QC reduces the bitrate on both testing sets.
While the gains of both components are small, their computational complexity is also very low (see Section 6.2). As such, we found it quite impressive to get the reported gains. We believe the direction of tuning a handful of parameters post training on an instance basis is a very promising direction for image compression. One fruitful direction could be using dedicated architectures and including a tuning step end-to-end as in meta learning.
| Data set | Setup | bpsp | to $Q'$ | # forward passes |
| Open Images | Our QC + $\tau$ | 2.790 | | 1 |
| DIV2K | Our QC + $\tau$ | 3.079 | | 1 |
While the bpsp results from the previous section validate the compression performance of our model, it is interesting to investigate the distribution predicted by RC. Note that we predict a mixture distribution per pixel, which is hard to visualize directly. Instead, we sample from the predicted distribution. We expect the samples to be visually similar to the ground-truth residual $r$.
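Sampling from a per-pixel mixture of logistics can be sketched with inverse-CDF sampling. The code below is an illustration under assumptions: the mixture parameters are invented, samples are rounded to integers and clamped to the residual range, and the channel-wise mean update is left out for brevity.

```python
import math
import random


def sample_residual(weights, means, scales, rng, lo=-255, hi=255):
    """Draw one integer sample from a mixture of logistic distributions."""
    # First pick a mixture component according to the mixture weights.
    k = rng.choices(range(len(weights)), weights=weights)[0]
    # Then sample that logistic via its inverse CDF: mu + sigma * log(u / (1 - u)).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    value = means[k] + scales[k] * math.log(u / (1.0 - u))
    return int(min(hi, max(lo, round(value))))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [
    sample_residual([0.7, 0.3], [0.0, 2.0], [1.0, 3.0], rng) for _ in range(1000)
]
```

Repeating this independently for every pixel and channel gives one sampled residual image that can be compared visually with the ground truth.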
The sampling results are shown in Fig. 5, where we visualize two images from CLIC.pro with their lossy reconstructions, as obtained by BPG. We also show the ground-truth residuals $r$. Then, we show two samples obtained from the probability distribution $p(r \mid x_l)$ predicted by our RC network. The residuals of the two images span different ranges (cf. Fig. 2), and we re-normalized $r$ to the RGB range for visualization, but to reduce eye strain we replaced the most frequent value (gray) with white.
We can clearly see that our approach i) learned to model the noise patterns inherent to these images that BPG discards, ii) learned to correctly predict a zero residual where BPG manages to perfectly reconstruct, and iii) learned to predict structures similar to the ones in the ground-truth.
In this paper, we showed how to leverage BPG to achieve state-of-the-art results in full-resolution learned lossless image compression. Our approach outperforms L3C, PNG, WebP, and JPEG2000 consistently, and also outperforms the hand-crafted state-of-the-art FLIF on images from the Open Images data set. Future work should investigate input-dependent optimizations, which are also used by FLIF and which we started to explore here by optimizing the scale of the probabilistic model for the residual ($\tau$-optimization). Similar approaches could also be applied to latent probability models of lossy image and video compression methods.
Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent.
Parallel Multiscale Autoregressive Density Estimation. In ICML.
Lossy Image Compression with Compressive Autoencoders. In ICLR.
Full Resolution Image Compression with Recurrent Neural Networks. In CVPR.
We show the architecture for the Q-Classifier in Table A1. Residual denotes a sequence of convolution, ReLU, convolution, with a skip connection adding the input to the output (as in ResNet, but without BatchNorm).
| Conv + ReLU | 3 | 64 | 2 |
| Conv + ReLU | 64 | 128 | 2 |
We provide additional visual examples here:
Specifically, we show one image from each of our testing sets, along with the residual $r$ and a sample from $p(r \mid x_l)$, which is expected to be visually similar to $r$. Please refer to Section 6.4 for details on sampling and the visualization.