
Learning Better Lossless Compression Using Lossy Compression

by   Fabian Mentzer, et al.
ETH Zurich

We leverage the powerful lossy image compression algorithm BPG to build a lossless image compression system. Specifically, the original image is first decomposed into the lossy reconstruction obtained after compressing it with BPG and the corresponding residual. We then model the distribution of the residual with a convolutional neural network-based probabilistic model that is conditioned on the BPG reconstruction, and combine it with entropy coding to losslessly encode the residual. Finally, the image is stored using the concatenation of the bitstreams produced by BPG and the learned residual coder. The resulting compression system achieves state-of-the-art performance in learned lossless full-resolution image compression, outperforming previous learned approaches as well as PNG, WebP, and JPEG2000.


Code Repositories


PyTorch code for the CVPR'20 paper "Learning Better Lossless Compression Using Lossy Compression"


1 Introduction

The need to efficiently store the ever growing amounts of data generated continuously on mobile devices has spurred a lot of research on compression algorithms. Algorithms like JPEG [50] for images and H.264 [52] for videos are used by billions of people daily.

After the breakthrough results achieved with deep neural networks in image classification [26], and the subsequent rise of deep-learning based methods, learned lossy image compression has emerged as an active area of research (e.g. [5, 44, 45, 36, 1, 3, 29, 27, 47]). In lossy compression, the goal is to achieve small bitrates given a certain allowed distortion in the reconstruction, i.e., the rate-distortion trade-off is optimized. In contrast, in lossless compression, no distortion is allowed, and we aim to reconstruct the input perfectly by transmitting as few bits as possible. To this end, a probabilistic model of the data can be used together with entropy coding techniques to encode and transmit data via a bitstream. The theoretical foundation for this idea is given in Shannon’s landmark paper [39], which proves a lower bound for the bitrate achievable by such a probabilistic model, and the overhead incurred by using an imprecise model of the data distribution. One beautiful result is that maximizing the likelihood of a parametric probabilistic model is equivalent to minimizing the bitrate obtained when using that model for lossless compression with an entropy coder (see, e.g., [28]). Learning parametric probabilistic models by likelihood maximization has been studied to a great extent in the generative modeling literature (e.g. [49, 48, 38, 33, 24]). Recent works have linked these results to learned lossless compression [28, 17, 46, 23].

Figure 1: Overview of the proposed learned lossless compression approach. To encode an input image $x$, we feed it into the Q-Classifier (QC) CNN to obtain an appropriate quantization parameter $Q$, which is used to compress $x$ with BPG. The resulting lossy reconstruction $x_l$ is fed into the Residual Compressor (RC) CNN, which predicts the probability distribution of the residual, $p(r \mid x_l)$, conditionally on $x_l$. An arithmetic coder (AC) encodes the residual $r$ to a bitstream, given $p(r \mid x_l)$. In gray we visualize how to reconstruct $x$ from the bitstream. Learned components are shown in violet.

Even though recent learned lossy image compression methods achieve state-of-the-art results on various data sets, the results obtained by the non-learned H.265-based BPG [42, 6] are still highly competitive, without requiring sophisticated hardware accelerators such as GPUs to run. While BPG was outperformed by learning-based approaches across the bitrate spectrum in terms of PSNR [29] and visual quality [3], it still excels particularly at high-PSNR lossy reconstructions.

In this paper, we propose a learned lossless compression system by leveraging the power of the lossy BPG, as illustrated in Fig. 1. Specifically, we decompose the input image $x$ into the lossy reconstruction $x_l$ produced by BPG and the corresponding residual $r$. We then learn a probabilistic model $p(r \mid x_l)$ of the residual, conditionally on the lossy reconstruction $x_l$. This probabilistic model is fully convolutional and can be evaluated using a single forward pass, both for encoding and decoding. We combine it with an arithmetic coder to losslessly compress the residual and store or transmit the image as the concatenation of the bitstrings produced by BPG and the residual compressor. Further, we use a computationally inexpensive technique from the generative modeling literature, tuning the “certainty” (temperature) of $p(r \mid x_l)$, as well as an auxiliary shallow classifier to predict the quantization parameter of BPG in order to optimize our compressor on a per-image basis. These components together lead to a state-of-the-art full-resolution learned lossless compression system. All of our code and data sets are available on GitHub.

In contrast to recent work in lossless compression, we do not need to compute and store any side information (as opposed to L3C [28]), and our CNN is lightweight enough to train and evaluate on high-resolution natural images (as opposed to [17, 23], which have not been scaled to full-resolution images to our knowledge).

In summary, our main contributions are:


  • We leverage the power of the classical state-of-the-art lossy compression algorithm BPG in a novel way to build a conceptually simple learned lossless image compression system.

  • Our system is optimized on a per-image basis with a light-weight post-training step, where we obtain a lower-bitrate probability distribution by adjusting the confidence of the predictions of our probabilistic model.

  • Our system outperforms the state-of-the-art in learned lossless full-resolution image compression, L3C [28], as well as the classical engineered algorithms WebP, JPEG2000, and PNG. Further, in contrast to L3C, we also outperform FLIF on Open Images, the domain where our approach (as well as L3C) is trained.

2 Related Work

Learned Lossless Compression

Arguably most closely related to this paper, Mentzer et al. [28] build a computationally cheap hierarchical generative model (termed L3C) to enable practical compression on full-resolution images.

Townsend et al. [46] and Kingma et al. [23] leverage the “bits-back scheme” [16] for lossless compression of an image stream, where the overall bitrate of the stream is reduced by leveraging previously transmitted information. Motivated by recent progress in generative modeling using (continuous) flow-based models (e.g. [34, 22]), Hoogeboom et al. [17] propose Integer Discrete Flows (IDFs), defining an invertible transformation for discrete data. In contrast to L3C, the latter works focus on smaller data sets such as MNIST, CIFAR-10, ImageNet32, and ImageNet64, where they achieve state-of-the-art results.

Likelihood-Based Generative Modeling

As mentioned in Section 1, virtually every generative model can be used for lossless compression, when used with an entropy coding algorithm. Therefore, while the following generative approaches do not take a compression perspective, they are still related. The state-of-the-art PixelCNN [49]-based models rely on auto-regression in RGB space to efficiently model a conditional distribution. The original PixelCNN [49] and PixelRNN [48] model the probability distribution of a pixel given all previous pixels (in raster-scan order). To use these models for lossless compression, $O(H \cdot W)$ forward passes are required, where $H$ and $W$ are the image height and width, respectively. Various speed optimizations and a probability model amenable to faster training were proposed in [38]. Other parallelization techniques were also developed, including those from [33], modeling the image distribution conditionally on subsampled versions of the image, as well as those from [24], conditioning on an RGB pyramid and grayscale images. Similar techniques were also used by [8, 30].

Engineered Lossless Compression Algorithms

The wide-spread PNG [32] applies simple autoregressive filters to remove redundancies from the RGB representation (e.g. replacing pixels with the difference to their left neighbor), and then uses the DEFLATE [10] algorithm for compression. In contrast, WebP [51] uses larger windows to transform the image (enabling patch-wise conditional compression), and relies on a custom entropy coder for compression. Mainly in use for lossy compression, JPEG2000 [40] also has a lossless mode, where an invertible mapping from RGB to compression space is used. At the heart of FLIF [41] is an entropy coding method called “meta-adaptive near-zero integer arithmetic coding” (MANIAC), which is based on the CABAC method used in, e.g., H.264 [52]. In CABAC, the context model used to compress a symbol is selected from a finite set based on local context [35]. The “meta-adaptive” part in MANIAC refers to the context model which is a decision tree learned per image.

Artifact Removal

Artifact removal methods in the context of lossy compression are related to our approach in that they aim to make predictions about the information lost during the lossy compression process. In this context, the goal is to produce sharper and/or more visually pleasing images given a lossy reconstruction from, e.g., JPEG. Dong et al. [11] proposed the first CNN-based approach using a network inspired by super-resolution networks. [43] extends this using a residual structure, and [7] relies on hierarchical skip connections and a multi-scale loss. Generative models in the context of artifact removal are explored by [12], which proposes to use GANs [13] to obtain more visually pleasing results.

3 Background

3.1 Lossless Compression

We give a very brief overview of lossless compression basics here and refer to the information theory literature for details [39, 9]. In lossless compression, we consider a stream of symbols $x_1, \dots, x_N$, where each $x_i$ is an element from the same finite set $\mathcal{X}$. The stream is obtained by drawing each symbol $x_i$ independently from the same distribution $\tilde p$, i.e., the $x_i$ are i.i.d. according to $\tilde p$. We are interested in encoding the symbol stream into a bitstream, such that we can recover the exact symbols by decoding. In this setup, the entropy of $\tilde p$ is equal to the expected number of bits needed to encode each $x_i$:

$$H(\tilde p) = \mathbb{E}_{x_i \sim \tilde p}\left[-\log_2 \tilde p(x_i)\right]. \tag{1}$$

In general, however, the exact $\tilde p$ is unknown, and we instead consider the setup where we have an approximate model $p$. Then, the expected bitrate will be equal to the cross-entropy between $\tilde p$ and $p$, given by:

$$H(\tilde p, p) = \mathbb{E}_{x_i \sim \tilde p}\left[-\log_2 p(x_i)\right]. \tag{2}$$

Intuitively, the larger the discrepancy between the model $p$ used for coding and the real $\tilde p$, the more bits we need to encode data that is actually distributed according to $\tilde p$.
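To make the gap between entropy and cross-entropy concrete, the following minimal sketch computes both for a small, made-up four-symbol alphabet. The distributions `p_true` and `p_model` are illustrative assumptions and are not taken from the paper.

```python
import math

def entropy_bits(p_true):
    """Expected bits per symbol when coding with the true distribution."""
    return -sum(p * math.log2(p) for p in p_true.values() if p > 0)

def cross_entropy_bits(p_true, p_model):
    """Expected bits per symbol when data follows p_true but is coded with p_model."""
    return -sum(p * math.log2(p_model[s]) for s, p in p_true.items() if p > 0)

# Hypothetical distributions over a 4-symbol alphabet (illustrative only).
p_true = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
p_model = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

h = entropy_bits(p_true)
ce = cross_entropy_bits(p_true, p_model)
print(f"entropy: {h:.3f} bits, cross-entropy: {ce:.3f} bits, overhead: {ce - h:.3f} bits")
```

The overhead printed at the end is the price paid per symbol for coding with the imprecise model instead of the true distribution.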

Entropy Coding

Given a symbol stream $x_i$ as above and a probability distribution $p$ (not necessarily $\tilde p$), we can encode the stream using entropy coding. Intuitively, we would like to build a table that maps every element in $\mathcal{X}$ to a bit sequence, such that $x_i$ gets a short sequence if $p(x_i)$ is high. The optimum is to output $-\log_2 p(x_i)$ bits for symbol $x_i$, which is what entropy coding algorithms achieve. Examples include Huffman coding [18] and arithmetic coding [53].

In general, we can use a different distribution $p_i$ for every symbol $x_i$ in the stream, as long as the $p_i$ are also available for decoding. Adaptive entropy coding algorithms work by allowing such varying distributions as a function of previously encoded symbols. In this paper, we use adaptive arithmetic coding [53].
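As a small illustration of the idea that likely symbols should receive short codes, the sketch below builds Huffman code lengths and compares them with the ideal length of minus the base-2 logarithm of the symbol probability. Huffman coding is used here only because it is compact to write down; the paper itself uses adaptive arithmetic coding. The toy distribution is an assumption chosen so that all probabilities are powers of 1/2, in which case Huffman coding meets the ideal lengths exactly.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} of a Huffman code for `probs`."""
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

# Hypothetical distribution (illustrative only).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)
ideal = {s: -math.log2(p) for s, p in probs.items()}
avg_len = sum(probs[s] * lengths[s] for s in probs)

for s in probs:
    print(f"{s}: p={probs[s]:.3f}, ideal={ideal[s]:.1f} bits, huffman={lengths[s]} bits")
print(f"average code length: {avg_len:.2f} bits/symbol")
```

For probabilities that are not powers of 1/2, Huffman codes round up to whole bits per symbol, which is one reason arithmetic coding is preferred when the model changes per symbol.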

3.2 Lossless Image Compression with CNNs

As explained in the previous section, all we need for lossless compression is a model $p$, since we can use entropy coding to encode and decode any input losslessly given $p$. In particular, we can use a CNN to parametrize $p$. To this end, one general approach is to introduce (structured) side information $z$ available both at encoding and decoding time, and model the probability distribution of natural images $x^{(i)}$ conditionally on $z$, using the CNN to parametrize $p(x \mid z)$. (We write $p(x)$ to denote the entire probability mass function and $p(x^{(i)})$ to denote $p(x)$ evaluated at $x^{(i)}$.) Assuming that both the encoder and decoder have access to $z$ and $p$, we can losslessly encode $x^{(i)}$ as follows: We first use the CNN to produce $p(x \mid z)$. Then, we employ an entropy encoder (described in the previous section) with $p(x \mid z)$ to encode $x^{(i)}$ to a bitstream. To decode, we once again feed $z$ to the CNN, obtaining $p(x \mid z)$, and decode $x^{(i)}$ from the bitstream using the entropy decoder.

One key difference among the approaches in the literature is the factorization of $p(x \mid z)$. In the original PixelCNN paper [48] the image $x$ is modeled as a sequence of pixels, and $z$ corresponds to all previous pixels. Encoding as well as decoding are done autoregressively. In IDF [17], $x$ is mapped to a $z$ using an invertible function, and $z$ is then encoded using a fixed prior $p(z)$, i.e., here $z$ is a deterministic function of $x$. In approaches based on the bits-back paradigm [46, 23], while encoding, $z$ is obtained by decoding from additional available information (e.g. previously encoded images). In L3C [28], $z$ corresponds to features extracted with a hierarchical model that are also saved to the bitstream using hierarchically predicted distributions.

3.3 BPG

Figure 2: Histogram of the marginal pixel distribution of residual values obtained using BPG with $Q$ predicted from QC, on Open Images.

BPG is a lossy image compression method based on the HEVC video coding standard [42], essentially applying HEVC on a single image. To motivate our usage of BPG, we show the histogram of the marginal pixel distribution of the residuals obtained by BPG on Open Images (one of our testing sets, see Section 5.1) in Fig. 2. Note that while the possible range of a residual is $\{-255, \dots, 255\}$, we observe that for most images, nearly every point in the residual falls in a much smaller set of values around zero, which is indicative of the high-PSNR nature of BPG. Additionally, Fig. A1 (in the suppl.) presents a comparison of BPG to the state-of-the-art learned image compression methods, showing that BPG is still very competitive in terms of PSNR.

BPG follows JPEG in having a chroma format parameter to enable color space subsampling, which we disable by setting it to 4:4:4. The only remaining parameter to set is the quantization parameter $Q$, where $Q \in \{0, \dots, 51\}$. Smaller $Q$ results in less quantization and thus better quality (i.e., different to the quality factor of JPEG, where larger means better reconstruction quality). We learn a classifier to predict $Q$, described in Section 4.4.
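The two BPG settings discussed above map onto the command-line flags of the reference encoder `bpgenc` from libbpg: `-q` for the quantizer parameter and `-f` for the chroma format. The sketch below only assembles the command lines, under the assumption that the libbpg tools are installed; the file names and the value `q=14` are hypothetical placeholders, not settings taken from the paper.

```python
def bpg_encode_command(src_png, dst_bpg, q):
    """Build a `bpgenc` call with chroma subsampling disabled (4:4:4)."""
    if not 0 <= q <= 51:
        raise ValueError("BPG quantizer parameter must be in 0..51")
    return ["bpgenc", "-q", str(q), "-f", "444", "-o", dst_bpg, src_png]

def bpg_decode_command(src_bpg, dst_png):
    """Build the matching `bpgdec` call that yields the lossy reconstruction."""
    return ["bpgdec", "-o", dst_png, src_bpg]

# Hypothetical file names and Q value, for illustration only.
enc_cmd = bpg_encode_command("input.png", "lossy.bpg", q=14)
dec_cmd = bpg_decode_command("lossy.bpg", "lossy.png")
print(" ".join(enc_cmd))
print(" ".join(dec_cmd))
# To actually run them (requires bpgenc/bpgdec on PATH):
#   import subprocess; subprocess.run(enc_cmd, check=True)
```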

Figure 3: The architecture of the residual compressor (RC). On the left, we show a zoom-in of the Residual Block and the Tail networks. Given $x_l$, the lossy reconstruction of the image $x$, the network predicts the probability distribution of the residual, $p(r \mid x_l)$. This distribution is a mixture of logistics parametrized via $\pi, \mu, \sigma, \lambda$.

4 Proposed Method

We give an overview of our method in Fig. 1. To encode an image $x$, we first obtain the quantization parameter $Q$ from the Q-Classifier (QC) network (Section 4.4). Then, we compress $x$ with BPG, to obtain the lossy reconstruction $x_l$, which we save to a bitstream. Given $x_l$, the Residual Compressor (RC) network (Section 4.1) predicts the probability mass function of the residual $r = x - x_l$, i.e.,

$$p(r \mid x_l) = \mathrm{RC}(x_l).$$

We model $p(r \mid x_l)$ as a discrete mixture of logistic distributions (Section 4.2). Given $p(r \mid x_l)$ and $r$, we compress $r$ to the bitstream using an adaptive arithmetic coding algorithm (see Section 3.1). Thus, the bitstream consists of the concatenation of the codes corresponding to $x_l$ and $r$. To decode $x$ from the bitstream, we first obtain $x_l$ using the BPG decoder, then we obtain once again $p(r \mid x_l) = \mathrm{RC}(x_l)$, and subsequently decode $r$ from the bitstream using $p(r \mid x_l)$. Finally, we can reconstruct $x = x_l + r$. In the formalism of Section 3.2, the residual $r$ plays the role of the input to be encoded and $x_l$ plays the role of the side information $z$.

Note that no matter how bad RC is at predicting the real distribution of $r$, we can always do lossless compression. Even if RC were to predict, e.g., a uniform distribution, we would just need many bits to store $r$.
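The decomposition into a lossy reconstruction plus an integer residual can be sketched in a few lines. In the sketch below a coarse quantizer stands in for BPG (this is an assumption made purely for illustration and is not the real codec), and the cost of the residual is computed under a uniform model to show that a poor model costs bits but never breaks losslessness.

```python
import math

def lossy_stand_in(x, step=8):
    """Hypothetical stand-in for BPG: coarse quantization of 8-bit values."""
    return [min(255, (v // step) * step + step // 2) for v in x]

def encode(x, step=8):
    """Split x into a lossy reconstruction x_l and an integer residual r."""
    x_l = lossy_stand_in(x, step)
    r = [a - b for a, b in zip(x, x_l)]
    return x_l, r

def decode(x_l, r):
    """Lossless reconstruction: x = x_l + r."""
    return [a + b for a, b in zip(x_l, r)]

def uniform_model_bits(r):
    """Ideal cost of r under a uniform model over {-255, ..., 255}."""
    return len(r) * math.log2(511)

x = [0, 3, 17, 128, 200, 255]  # toy "image" with 8-bit values
x_l, r = encode(x)
print("residual:", r)
print("reconstruction exact:", decode(x_l, r) == x)
print(f"bits with a uniform residual model: {uniform_model_bits(r):.1f}")
```

A learned model that concentrates probability on the small residual values that actually occur would need far fewer bits than the uniform baseline, which is the role RC plays in the pipeline.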


4.1 Residual Compressor

We use a CNN inspired by ResNet [14] and U-Net [37], shown in detail in Fig. 3. We first extract an initial feature map with $C_f$ channels, which we then downscale using a stride-2 convolution, and feed through 16 residual blocks. Instead of BatchNorm [19] layers as in ResNet, our residual blocks contain GDN layers proposed by [4]. Subsequently, we upscale back to the resolution of the input image using a transposed convolution. The resulting features are concatenated with the initial feature map, and convolved to contract the channels back to $C_f$, like in U-Net. Finally, the network splits into four tails, predicting the different parameters of the mixture model, $\pi, \mu, \sigma, \lambda$, described next.

4.2 Logistic Mixture Model

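We use a discrete mixture of logistics to model the probability mass function of the residual, $p(r \mid x_l)$, similar to [28, 38]. We closely follow the formulation of [28] here: Let $c$ denote the RGB channel and $u, v$ the spatial location. We define

$$p(r \mid x_l) = \prod_{u,v} p(r_{1uv}, r_{2uv}, r_{3uv} \mid x_l).$$

We use a (weak) autoregression over the three RGB channels to define the joint distribution over channels via logistic mixtures $p_m$:

$$p(r_1, r_2, r_3 \mid x_l) = p_m(r_1 \mid x_l) \cdot p_m(r_2 \mid x_l, r_1) \cdot p_m(r_3 \mid x_l, r_2, r_1), \tag{3}$$

where we removed the indices $uv$ to simplify the notation. For the mixture $p_m$ we use a mixture of $K$ logistic distributions $p_L$. Our distributions are defined by the outputs of the RC network, which yields mixture weights $\pi^k_{cuv}$, means $\mu^k_{cuv}$, variances $\sigma^k_{cuv}$, as well as mixture coefficients $\lambda^k_{cuv}$. The autoregression over RGB channels is only used to update the means using a linear combination of $\mu$ and the target $r$ of previous channels, scaled by the coefficients $\lambda$. We thereby obtain $\tilde\mu$:

$$\tilde\mu^k_{1uv} = \mu^k_{1uv}, \qquad \tilde\mu^k_{2uv} = \mu^k_{2uv} + \lambda^k_{\alpha uv}\, r_{1uv}, \qquad \tilde\mu^k_{3uv} = \mu^k_{3uv} + \lambda^k_{\beta uv}\, r_{1uv} + \lambda^k_{\gamma uv}\, r_{2uv}.$$

With these parameters, we can define

$$p_m(r_{cuv} \mid x_l, r_{\text{prev}}) = \sum_{k=1}^{K} \pi^k_{cuv}\, p_L(r_{cuv} \mid \tilde\mu^k_{cuv}, \sigma^k_{cuv}),$$

where $r_{\text{prev}}$ denotes the channels with index smaller than $c$ (see Eq. 3), used to obtain $\tilde\mu^k_{cuv}$ as shown above, and $p_L$ is the logistic distribution:

$$p_L(z \mid \mu, \sigma) = \frac{e^{-(z-\mu)/\sigma}}{\sigma \left(1 + e^{-(z-\mu)/\sigma}\right)^2}.$$

We evaluate $p_L$ at discrete $r$, via its CDF, as in [38, 28], evaluating

$$p_L(r) = \mathrm{CDF}(r + 1/2) - \mathrm{CDF}(r - 1/2).$$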
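The discretized logistic mixture described above can be evaluated for a single channel and pixel with a few lines of code. This is a minimal sketch: the two-component parameters are made up for illustration, the autoregressive mean update across RGB channels is omitted, and no special handling of the boundary values is included.

```python
import math

def logistic_cdf(z, mu, sigma):
    """Numerically stable CDF of the logistic distribution."""
    t = (z - mu) / sigma
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def mixture_pmf(r, weights, means, scales):
    """P(r) for an integer r under a mixture of discretized logistics."""
    return sum(
        w * (logistic_cdf(r + 0.5, m, s) - logistic_cdf(r - 0.5, m, s))
        for w, m, s in zip(weights, means, scales)
    )

# Hypothetical mixture parameters for one channel at one pixel.
weights, means, scales = [0.6, 0.4], [0.0, 2.5], [0.8, 2.0]

total = sum(mixture_pmf(r, weights, means, scales) for r in range(-255, 256))
p_zero = mixture_pmf(0, weights, means, scales)
print(f"P(r=0) = {p_zero:.4f}, ideal code length = {-math.log2(p_zero):.2f} bits")
print(f"total mass on -255..255 = {total:.6f}")
```

Because each bin is a difference of CDF values, the bins telescope and the probabilities over the full residual range sum to one up to the negligible tail mass outside it.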


4.3 Loss

As motivated in Section 3.1, we are interested in minimizing the cross-entropy between the real distribution of the residual $\tilde p(r)$ and our model $p(r)$: the smaller the cross-entropy, the closer $p$ is to $\tilde p$, and the fewer bits an entropy coder will use to encode $r$. We consider the setting where we have $N$ training images $x^{(1)}, \dots, x^{(N)}$. For every image, we compute the lossy reconstruction $x_l^{(i)}$ as well as the corresponding residual $r^{(i)} = x^{(i)} - x_l^{(i)}$. While the true distribution $\tilde p(r)$ is unknown, we can consider the empirical distribution obtained from the samples and minimize:

$$\mathcal{L}(\mathrm{RC}) = -\sum_{i=1}^{N} \log p\big(r^{(i)} \mid x_l^{(i)}\big). \tag{7}$$

This loss decomposes over samples, allowing us to minimize it over mini-batches. Note that minimizing Eq. 7 is the same as maximizing the likelihood of $p(r \mid x_l)$, which is the perspective taken in the likelihood-based generative modeling literature.
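As a hedged aside, the link between this loss and the bitrate reported later can be written out explicitly. Assuming an ideal entropy coder, an image of height $H$ and width $W$ with three channels, and writing $B_{\text{BPG}}$ for the number of bits of the BPG bitstream (a symbol introduced here only for illustration), the per-image cost in bits per subpixel is approximately:

```latex
\text{bits}\big(r^{(i)}\big) \;\approx\; -\log_2 p\big(r^{(i)} \mid x_l^{(i)}\big),
\qquad
\text{bpsp}\big(x^{(i)}\big) \;\approx\;
\frac{B_{\text{BPG}}\big(x_l^{(i)}\big) \;-\; \log_2 p\big(r^{(i)} \mid x_l^{(i)}\big)}{3 \cdot H \cdot W}.
```

Lowering the negative log-likelihood of the residual therefore lowers the second term of the numerator directly, while the first term is controlled by the quantization parameter of BPG.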

4.4 Q-Classifier

A random set of natural images is expected to contain images of varying “complexity”, where complex can mean a lot of high frequency structure and/or noise. While virtually all lossy compression methods have a parameter like BPG’s $Q$, to navigate the trade-off between bitrate and quality, it is important to note that compressing a random set of natural images with the same fixed $Q$ will usually lead to the bitrates of these images being spread around some $Q$-dependent mean. Thus, in our approach, it is suboptimal to fix $Q$ for all images.

Indeed, in our pipeline we have a trade-off between the bits allocated to BPG and the bits allocated to encoding the residual. This trade-off can be controlled with $Q$: For example, if an image contains components that are easier for the RC network to model, it is beneficial to use a higher $Q$, such that BPG does not waste bits encoding these components. We observe that for a fixed image, and a trained RC, there is a single optimal $Q$.

To efficiently obtain a good $Q$, we train a simple classifier network, the Q-Classifier (QC), and then use $Q = \mathrm{QC}(x)$ to compress $x$ with BPG. For the architecture, we use a light-weight ResNet-inspired network with 8 residual blocks for QC, and train it to predict a class in a set $\mathcal{Q}$ of candidate values of $Q$, given an image $x$ ($\mathcal{Q}$ was selected using the Open Images validation set). In contrast to ResNet, we employ no normalization layers (to ensure that the prediction is independent of the input size). Further, the final features are obtained by average pooling each of the final channels of the feature map over its spatial dimensions. The resulting vector is fed to a fully connected layer, to obtain the logits for the $|\mathcal{Q}|$ classes, which are then normalized with a softmax. Details are shown in Section A.1 in the supplementary material.

While the input to QC is the full-resolution image, the network is shallow and downsamples multiple times, making this a computationally lightweight component.

4.5 $\tau$-Optimization

Inspired by the temperature scaling employed in the generative modeling literature (e.g. [21]), we further optimize the predicted distribution with a simple trick: Intuitively, if RC predicts a $\tilde\mu$ that is close to the target $r$, we can make the cross-entropy in Eq. 7 (and thus the bitrate) smaller by making the predicted logistic “more certain” by choosing a smaller $\sigma$. This shifts probability mass towards $r$. However, there is a breaking point, where we make it “too certain” (i.e., the probability mass concentrates too tightly around $\tilde\mu$) and the cross-entropy increases again.

While RC is already trained to learn a good $\sigma$, the prediction is only based on $x_l$. We can improve the final bitrate during encoding, when we additionally have access to the target $r_{cuv}$, by rescaling the predicted $\sigma^k_{cuv}$ with a factor $\tau^k_c$, chosen for every mixture $k$ and every channel $c$. This yields a more optimal $\tilde\sigma^k_{cuv} = \tau^k_c \cdot \sigma^k_{cuv}$. Obviously, $\tau$ also needs to be known for decoding, and we thus have to transmit it via the bitstream. However, since we only learn a $\tau$ for every channel and every mixture (and not for every spatial location), this causes a completely negligible overhead of $C \cdot K = 3 \cdot 5 = 15$ floats, i.e., 60 bytes with 32-bit floats.

We find $\tau$ for a given image by minimizing the negative log-likelihood of Eq. 7 on that image, i.e., we optimize

$$\min_{\tau} \; -\sum_{u,v} \log p_\tau(r_{1uv}, r_{2uv}, r_{3uv} \mid x_l), \tag{8}$$

where $p_\tau$ is equal to $p$ predicted from RC but using $\tilde\sigma$. To optimize Eq. 8, we use stochastic gradient descent with a very high learning rate and momentum, which converges in 10-20 iterations, depending on the image.

We note that this is also computationally cheap. Firstly, we only need to do the forward pass through RC once, to get $\mu, \sigma, \lambda, \pi$, and then in every step of the $\tau$-optimization, we only need to evaluate $\tau \cdot \sigma$ and subsequently Eq. 8. Secondly, the optimization is only over 15 parameters. Finally, since for practical $H \times W$-dimensional images, $H \cdot W \gg 15$, we can do the sum in Eq. 8 over a spatially subsampled version of $p_\tau$.
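The effect of rescaling the predicted scale on a per-image basis can be reproduced on a toy example. The sketch below makes several simplifying assumptions: a single logistic component instead of a mixture, one shared scale factor (written `tau` here) instead of one per channel and mixture, made-up residual values, and a plain grid search in place of the stochastic gradient descent used in the text.

```python
import math

def logistic_cdf(z, mu, sigma):
    """Numerically stable CDF of the logistic distribution."""
    t = (z - mu) / sigma
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def total_bits(residuals, mu, sigma, tau):
    """Ideal code length of the residuals when the scale is tau * sigma."""
    bits = 0.0
    for r in residuals:
        p = logistic_cdf(r + 0.5, mu, tau * sigma) - logistic_cdf(r - 0.5, mu, tau * sigma)
        bits += -math.log2(max(p, 1e-12))  # clamp to avoid log(0)
    return bits

# Hypothetical residuals and an intentionally too-wide predicted scale.
residuals = [0, 0, 1, -1, 0, 2, 0, -1, 0, 1]
mu, sigma = 0.0, 2.0

taus = [i / 10 for i in range(1, 31)]  # candidate factors 0.1 ... 3.0
best_tau = min(taus, key=lambda t: total_bits(residuals, mu, sigma, t))

print(f"bits with tau=1.0:   {total_bits(residuals, mu, sigma, 1.0):.2f}")
print(f"bits with best tau:  {total_bits(residuals, mu, sigma, best_tau):.2f} (tau={best_tau})")
print(f"bits with tau=0.1:   {total_bits(residuals, mu, sigma, 0.1):.2f}")
```

The smallest candidate factor is not the best one, which mirrors the breaking point described above: making the distribution too certain raises the cost of the residual values that fall away from the mean.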

[bpsp]       Open Images   CLIC.mobile   CLIC.pro   DIV2K
RC (Ours)    2.790         2.538         2.933      3.079
L3C          2.991         2.639         2.944      3.094
PNG          4.005         3.896         3.997      4.235
JPEG2000     3.055         2.721         3.000      3.127
WebP         3.047         2.774         3.006      3.176
FLIF         2.867         2.492         2.784      2.911
Table 1: Compression performance of the proposed method (RC) compared to the learned L3C [28], as well as the classical engineered approaches PNG, JPEG2000, WebP, and FLIF. We show the difference in percentage to our approach, using green to indicate that we achieve a better bpsp and red otherwise.

5 Experiments

5.1 Data sets


Like L3C [28], we train on images from the Open Images data set [25]. These images are made available as JPEGs, which is not ideal for the lossless compression task we are considering, but we are not aware of a similarly large scale lossless training data set. To prevent overfitting on JPEG artifacts, we downscale each training image by a randomly selected factor, using the Lanczos filter provided by the Pillow library [31]. For a fair comparison, the L3C baseline results were also obtained by training on the exact same data set.


We evaluate our model on four data sets: Open Images is a subset of 500 images from the Open Images validation set, preprocessed like the training data. CLIC.mobile and CLIC.pro are two new data sets commonly used in recent image compression papers, released as part of the “Workshop and Challenge on Learned Image Compression” (CLIC) [54]. CLIC.mobile contains 61 images taken using cell phones, while CLIC.pro contains 41 images from DSLRs, retouched by professionals. Finally, we evaluate on the 100 images from DIV2K [2], a super-resolution data set with high-quality images. We show examples from these data sets in Section A.3.

For a small fraction of exceptionally high-resolution images (note that the considered testing sets contain images of widely varying resolution), we follow L3C in extracting 4 non-overlapping crops $x_c$ from the image $x$ such that combining the $x_c$ yields $x$. We then compress the crops individually. However, we evaluate the non-learned baselines on the full images to avoid a bias in favor of our method.
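Splitting an image into four non-overlapping crops that recombine to the original can be sketched as follows. The text does not say where the cuts are placed, so cutting at the midpoints is an assumption; images are represented as plain nested lists to keep the example self-contained.

```python
def split_into_four(image):
    """Split a 2-D list (rows of pixels) into 4 non-overlapping crops."""
    h, w = len(image), len(image[0])
    h2, w2 = h // 2, w // 2  # assumed cut positions: the midpoints
    top, bottom = image[:h2], image[h2:]
    return [
        [row[:w2] for row in top], [row[w2:] for row in top],
        [row[:w2] for row in bottom], [row[w2:] for row in bottom],
    ]

def combine_four(crops):
    """Inverse of split_into_four: stitch the crops back together."""
    top_left, top_right, bottom_left, bottom_right = crops
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom

# Toy 3x5 "image" with odd dimensions to exercise uneven crops.
image = [[10 * row + col for col in range(5)] for row in range(3)]
crops = split_into_four(image)
print("crop shapes:", [(len(c), len(c[0])) for c in crops])
print("recombines exactly:", combine_four(crops) == image)
```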

5.2 Training Procedures

Residual Compressor

We train on batches of 16 random crops extracted from the training set, using the RMSProp optimizer [15]. We decay the learning rate (LR) by a constant factor at a fixed interval of iterations. Since our Q-Classifier is trained on the output of a trained RC network, it is not available while training the RC network. Thus, we compress the training images with a $Q$ selected at random from a fixed set of values, obtaining a pair $(x, x_l)$ for every image.


Q-Classifier

Given a trained RC network, we randomly select 10% of the training set, and compress each selected image $x$ once for each candidate $Q$, obtaining an $x_l^{(Q)}$ for each $Q$. We then evaluate RC for each pair $(x, x_l^{(Q)})$ to find the optimal $Q'$ that gives the minimum bitrate for that image. The resulting list of pairs $(x, Q')$ forms the training set for the QC. For training, we use a standard cross-entropy loss between the softmax-normalized logits and the one-hot encoded ground truth $Q'$. We train for 11 epochs on batches of 32 random crops, using the Adam optimizer [20]. We set the initial LR to the Adam default, and decay it after 5 and 10 epochs.
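The exhaustive search that produces the classifier's training labels amounts to picking, per image, the quantization parameter with the smallest total bit count. The sketch below illustrates this with two made-up cost curves: `bpg_bits` and `residual_bits` are hypothetical stand-ins for the size of the BPG bitstream and the size of the arithmetic-coded residual, and the candidate range is likewise an assumption.

```python
def best_q(candidate_qs, bpg_bits, residual_bits):
    """Return the Q that minimizes BPG bits plus residual bits."""
    return min(candidate_qs, key=lambda q: bpg_bits(q) + residual_bits(q))

# Hypothetical cost curves for one image (illustrative numbers only):
# a larger Q means coarser quantization, so BPG needs fewer bits ...
def bpg_bits(q):
    return 4000 - 160 * q

# ... while the residual grows and becomes more expensive to code.
def residual_bits(q):
    return 1000 + 10 * (q - 5) ** 2

candidate_qs = range(10, 17)  # hypothetical candidate set
q_opt = best_q(candidate_qs, bpg_bits, residual_bits)
for q in candidate_qs:
    print(q, bpg_bits(q) + residual_bits(q))
print("label for this image:", q_opt)
```

In the actual pipeline each candidate requires one BPG encoding and one forward pass through RC, which is why a cheap classifier that predicts the label directly is attractive at encoding time.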

5.3 Architecture and Training Ablations

Training on Fixed $Q$

As noted in Section 5.2, we select a random $Q$ during training, since QC is only available after training. We explored fixing $Q$ to one value and found that this hurts generalization performance. This may be explained by the fact that RC sees more varied residual statistics during training if we have random $Q$’s.

Effect of the Crop Size

Using small crops to train a model evaluated on full-resolution images may seem too constraining. To explore the effect of crop size, we trained different models, each seeing the same number of pixels in every iteration, but distributed differently in terms of batch size vs. crop size. We trained each model for the same number of iterations, and then evaluated on the Open Images validation set (using a fixed $Q$ for training and testing). The results are shown in the following table and indicate that smaller crops and bigger batch-sizes are beneficial.

Batch Size Crop Size BPSP on Open Images
16 2.854
4 2.864
1 2.877


We found that the GDN layers are crucial for good performance. We also explored instance normalization, and conditional instance normalization layers, in the latter case conditioning on the bitrate of BPG, in the hope that this would allow the network to distinguish different operation modes. However, we found that instance normalization is more sensitive to the resolution used for training, which led to worse overall bitrates.

6 Results and Discussion

6.1 Compression performance in bpsp

We follow previous work in evaluating bits per subpixel (each RGB pixel has 3 subpixels), bpsp for short, sometimes called bits per dimension. In Table 1, we show the performance of our approach on the described test sets. On Open Images, the domain where we train, we outperform all methods, including FLIF. Note that while L3C was trained on the same data set, it does not outperform FLIF. On the other data sets, we consistently outperform both L3C and the non-learned approaches PNG, WebP, and JPEG2000.
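For reference, bits per subpixel follows directly from the compressed size and the image dimensions. The helper below is a minimal sketch; the function name and the example sizes are illustrative assumptions.

```python
def bits_per_subpixel(num_bytes, height, width, channels=3):
    """Compressed size in bits divided by the number of subpixels."""
    return 8.0 * num_bytes / (height * width * channels)

# An uncompressed 8-bit RGB image costs exactly 8 bpsp.
raw_bpsp = bits_per_subpixel(num_bytes=64 * 64 * 3, height=64, width=64)

# Hypothetical compressed file of 96 bytes for an 8x8 RGB image.
small_bpsp = bits_per_subpixel(num_bytes=96, height=8, width=8)

print(f"raw: {raw_bpsp:.1f} bpsp, compressed example: {small_bpsp:.1f} bpsp")
```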

These results indicate that our simple approach of using a powerful lossy compressor to compress the high-level image content and leveraging a complementary learned probabilistic model to model the low-level variations for lossless residual compression is highly effective. Even though we only train on Open Images, our method can generalize to various domains of natural images: mobile phone pictures (CLIC.mobile), images retouched by professional photographers (CLIC.pro), as well as high-quality images with diverse complex structures (DIV2K).

Figure 4: Top: Distribution of bpsp, on the 500 images from the Open Images validation set. The images are sorted by the bpsp achieved using our approach. We show PNG and FLIF, as well as the bpsp needed to store the lossy reconstruction only (“$x_l$ only”). Bottom: Fraction of total bits used by our approach that are used to store $x_l$. Images follow the same order as on the top panel.

In Fig. 4 we show the bpsp of each of the 500 images of Open Images, when compressed using our method, FLIF, and PNG. For our approach, we also show the bits used to store $x_l$ for each image, measured in bpsp on top (“$x_l$ only”), and as a percentage on the bottom. The percentage averages at 42%, going up towards the high-bpsp end of the figure. This plot shows the wide range of bpsp covered by a random set of natural images, and motivates our Q-Classifier. We can also see that while our method tends to outperform FLIF on average, FLIF is better for some high-bpsp images, where the bpsp of both FLIF and our method approach that of PNG.

Panels: input/output $x$; lossy reconstruction $x_l$; residual $r = x - x_l$; two samples from our predicted $p(r \mid x_l)$.
Figure 5: Visualizing the learned distribution $p(r \mid x_l)$ by sampling from it. We compare the samples to the ground-truth target residual $r = x - x_l$. We also show the image $x$ that we losslessly compress as well as the lossy reconstruction $x_l$ obtained from BPG. For easier visualizations, pixels in the residual images equal to 0 are set to white, instead of gray. Best viewed on screen due to the high-frequency noise.

6.2 Runtime

We compare the decoding speed of RC to that of L3C for images, using an NVidia Titan XP. For our components: BPG: 163ms; RC: 166ms; arithmetic coding: 89.1ms; i.e., 418ms in total, compared to L3C’s 374ms.

QC and $\tau$-optimization are only needed for encoding. We discussed above that both components are computationally cheap. In terms of actual runtime: QC: 6.48ms; $\tau$-optimization: 35.2ms.

6.3 Q-Classifier and $\tau$-Optimization

In Table 2 we show the benefits of using the Q-Classifier as well as the $\tau$-optimization. We show the resulting bpsp for the Open Images validation set (top) and for DIV2K (bottom), as well as the percentage of predicted $Q$ that are close to the optimal $Q'$ (fourth column), against a baseline of using a fixed $Q$ (the mean over QC’s training set, see Section 5.2). The last column shows the required number of forward passes through RC.


We first note that even though the QC was only trained on Open Images (see Section 5.2), we get similar behavior on Open Images and DIV2K. Moreover, we see that using QC is clearly beneficial over using a fixed $Q$ for all images, and only incurs a small increase in bpsp compared to using the optimal $Q'$ (0.18% for Open Images, 0.26% for DIV2K). This can be explained by the fact that QC manages to predict $Q$ close to $Q'$ for 94.8% of the images in Open Images and 90.2% of the DIV2K images.

Furthermore, the small increase in bpsp is traded for a reduction from requiring one forward pass per candidate $Q$ to compute $Q'$ to a single one. In that sense, using the QC is similar to the “fast” modes common in image compression algorithms, where speed is traded against bitrate.


Table 2 shows that using $\tau$-optimization on top of QC reduces the bitrate on both testing sets.


While the gains of both components are small, their computational complexity is also very low (see Section 6.2). As such, we found it quite impressive to get the reported gains. We believe that tuning a handful of parameters post training on an instance basis is a very promising direction for image compression. One fruitful direction could be using dedicated architectures and including a tuning step end-to-end as in meta learning.
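To make the idea of per-instance tuning concrete, the following sketch tunes a single scalar for one image after training. It is an illustration under stated assumptions, not the procedure used in the paper: the residual model is assumed to be a discretized logistic per pixel, the scalar `tau` rescales all predicted scales, and a plain grid search is swapped in as the optimizer. All names are hypothetical.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def code_length_bits(residuals, mus, scales, tau):
    """Total -log2 probability of the residuals when every scale is multiplied by tau."""
    total = 0.0
    for r, mu, s in zip(residuals, mus, scales):
        s = s * tau
        # Probability mass of the integer bin [r - 0.5, r + 0.5].
        p = sigmoid((r + 0.5 - mu) / s) - sigmoid((r - 0.5 - mu) / s)
        total += -math.log2(max(p, 1e-12))
    return total

def tune_tau(residuals, mus, scales, grid):
    """Pick the tau from the grid that yields the shortest code for this image."""
    return min(grid, key=lambda tau: code_length_bits(residuals, mus, scales, tau))

# Toy image: the (fixed) model is overconfident, i.e., its scales are too small.
residuals = [0, 1, -1, 0, 2, 0, -3, 1]
mus = [0.0] * len(residuals)
scales = [0.3] * len(residuals)
grid = [0.5, 0.75, 1.0, 1.5, 2.0, 3.0]

best_tau = tune_tau(residuals, mus, scales, grid)
print(best_tau,
      code_length_bits(residuals, mus, scales, 1.0),
      code_length_bits(residuals, mus, scales, best_tau))
```

Because the network weights stay fixed, only the tuned scalar changes per image; it would have to be stored with the bitstream so that the decoder can reconstruct the same distribution.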

| Data set | Setup | bpsp | Close to optimal Q | # forward passes |
|---|---|---|---|---|
| Open Images | Optimal Q | 2.789 | 100% | |
| | Fixed Q | 2.801 | 82.6% | 1 |
| | Our QC | 2.794 | 94.8% | 1 |
| | Our QC + τ | 2.790 | | 1 |
| DIV2K | Optimal Q | 3.080 | 100% | |
| | Fixed Q | 3.096 | 73.0% | 1 |
| | Our QC | 3.088 | 90.2% | 1 |
| | Our QC + τ | 3.079 | | 1 |

Table 2: On Open Images and DIV2K, we compare using the optimal Q for encoding images vs. a fixed Q and vs. using the Q predicted by the Q-Classifier. For each data set, the last row shows the additional gains obtained from applying the τ-optimization. The fourth column shows the percentage of predicted Q that fall within a fixed distance of the optimal Q, and the last column shows the required number of forward passes through RC.

6.4 Visualizing the learned distribution

While the bpsp results from the previous section validate the compression performance of our model, it is interesting to investigate the distribution predicted by RC. Note that we predict a mixture distribution per pixel, which is hard to visualize directly. Instead, we sample from the predicted distribution. We expect the samples to be visually similar to the ground-truth residual.
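Sampling from a per-pixel mixture can be sketched as follows. The text only states that a mixture distribution is predicted per pixel; the choice of logistic components (in the spirit of the discretized logistic mixture of [38]), the parameter layout, and the function names are assumptions made for illustration.

```python
import math
import random

def sample_pixel(weights, means, scales, rng):
    """Draw one integer residual from a mixture of logistic components."""
    k = rng.choices(range(len(weights)), weights=weights)[0]  # pick a component
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)              # keep the log finite
    x = means[k] + scales[k] * math.log(u / (1.0 - u))        # inverse logistic CDF
    return round(x)                                           # discretize to an integer

def sample_image(params, seed=0):
    """params[i][j] = (weights, means, scales) predicted for pixel (i, j)."""
    rng = random.Random(seed)
    return [[sample_pixel(*p, rng) for p in row] for row in params]

# Toy prediction: a 2x3 image in which every pixel shares the same mixture.
pixel = ([0.7, 0.3], [0.0, 4.0], [0.5, 1.0])
params = [[pixel] * 3 for _ in range(2)]

sample = sample_image(params, seed=1)
print(sample)
```

Each call with a different seed gives a different plausible residual image, which is what makes sampling a convenient way to inspect the predicted distribution.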

The sampling results are shown in Fig. 5, where we visualize two images together with their lossy reconstructions, as obtained by BPG. We also show the ground-truth residuals. Then, we show two samples obtained from the probability distribution predicted by our RC network. The residuals of the two images span different ranges (cf. Fig. 2), so we re-normalized them to the RGB range for visualization. To reduce eye strain, we replaced the most frequent value (a residual of 0, which maps to gray) with white.
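The visualization step described above can be sketched as a small mapping from integer residuals to 8-bit gray values. The symmetric scaling around mid-gray is an assumption (the exact normalization is not spelled out in the text), and `residual_to_gray` is a hypothetical helper name.

```python
from collections import Counter

def residual_to_gray(residual, white=255):
    """Map an integer residual image (list of rows) to 8-bit gray values.

    Assumed mapping: 0 maps to mid-gray (128) and the largest magnitude maps
    to the ends of the range. The most frequent residual value is then
    replaced by white.
    """
    flat = [v for row in residual for v in row]
    max_abs = max(abs(v) for v in flat) or 1          # avoid division by zero
    most_common = Counter(flat).most_common(1)[0][0]  # typically 0
    return [
        [white if v == most_common else 128 + round(127 * v / max_abs) for v in row]
        for row in residual
    ]

residual = [[0, 0, 1],
            [-2, 0, 2]]
gray = residual_to_gray(residual)
print(gray)
```

With this mapping, regions that BPG reconstructs perfectly appear white, so that the remaining structure in the residual stands out.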

We can clearly see that our approach i) learned to model the noise patterns inherent to these images that BPG discards, ii) learned to correctly predict a zero residual where BPG manages to reconstruct perfectly, and iii) learned to predict structures similar to the ones in the ground-truth.

7 Conclusion

In this paper, we showed how to leverage BPG to achieve state-of-the-art results in full-resolution learned lossless image compression. Our approach outperforms L3C, PNG, WebP, and JPEG2000 consistently, and also outperforms the hand-crafted state-of-the-art FLIF on images from the Open Images data set. Future work should investigate input-dependent optimizations, which are also used by FLIF and which we started to explore here by optimizing the scale of the probabilistic model for the residual (τ-optimization). Similar approaches could also be applied to latent probability models of lossy image and video compression methods.


References

  • [1] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool (2017) Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. In NIPS, Cited by: §1.
  • [2] E. Agustsson and R. Timofte (2017) NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In CVPR Workshops, Cited by: §5.1.
  • [3] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool (2019) Generative Adversarial Networks for Extreme Learned Image Compression. In ICCV, Cited by: §1, §1.
  • [4] J. Ballé, V. Laparra, and E. P. Simoncelli (2015) Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281. Cited by: §4.1.
  • [5] J. Ballé, V. Laparra, and E. P. Simoncelli (2016) End-to-end Optimized Image Compression. ICLR. Cited by: §1.
  • [6] F. Bellard. BPG Image format. Cited by: §1.
  • [7] L. Cavigelli, P. Hager, and L. Benini (2017) CAS-cnn: a deep convolutional neural network for image compression artifact suppression. In IJCNN, pp. 752–759. Cited by: §2.
  • [8] X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel (2018) PixelSNAIL: An Improved Autoregressive Generative Model. In ICML, Cited by: §2.
  • [9] T. M. Cover and J. A. Thomas (2012) Elements of Information Theory. John Wiley & Sons. Cited by: §3.1.
  • [10] P. Deutsch (1996) DEFLATE compressed data format specification version 1.3. Technical report Cited by: §2.
  • [11] C. Dong, Y. Deng, C. Change Loy, and X. Tang (2015) Compression artifacts reduction by a deep convolutional network. In ICCV, pp. 576–584. Cited by: §2.
  • [12] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo (2017) Deep generative adversarial compression artifact removal. In ICCV, pp. 4826–4835. Cited by: §2.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §A.1, §4.1.
  • [15] G. Hinton, N. Srivastava, and K. Swersky. Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent. Cited by: §5.2.
  • [16] G. Hinton and D. Van Camp (1993) Keeping neural networks simple by minimizing the description length of the weights. In COLT, Cited by: §2.
  • [17] E. Hoogeboom, J. W. Peters, R. v. d. Berg, and M. Welling (2019) Integer discrete flows and lossless compression. In NIPS, Cited by: §1, §1, §2, §3.2.
  • [18] D. A. Huffman (1952) A method for the construction of minimum-redundancy codes. Proc. IRE 40 (9), pp. 1098–1101. Cited by: §3.1.
  • [19] S. Ioffe and C. Szegedy (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, Cited by: §4.1.
  • [20] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §5.2.
  • [21] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS, Cited by: §4.5.
  • [22] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In NIPS, Cited by: §2.
  • [23] F. H. Kingma, P. Abbeel, and J. Ho (2019) Bit-swap: recursive bits-back coding for lossless compression with hierarchical latent variables. In ICML, Cited by: §1, §1, §2, §3.2.
  • [24] A. Kolesnikov and C. H. Lampert (2017) PixelCNN Models with Auxiliary Variables for Natural Image Modeling. In ICML, Cited by: §1, §2.
  • [25] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy (2017) OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Cited by: §5.1.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
  • [27] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool (2018) Conditional Probability Models for Deep Image Compression. In CVPR, Cited by: §1.
  • [28] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool (2019) Practical full resolution learned lossless image compression. In CVPR, Cited by: 3rd item, §1, §1, §2, §3.2, §4.2, §4.2, Table 1, §5.1.
  • [29] D. Minnen, J. Ballé, and G. D. Toderici (2018) Joint Autoregressive and Hierarchical Priors for Learned Image Compression. In NeurIPS, Cited by: Figure A1, §A.2, §1, §1.
  • [30] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, and A. Ku (2018) Image Transformer. ICML. Cited by: §2.
  • [31] Pillow Library for Python. Cited by: §5.1.
  • [32] Portable Network Graphics (PNG). Cited by: §2.
  • [33] S. Reed, A. Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. Freitas (2017) Parallel Multiscale Autoregressive Density Estimation. In ICML, Cited by: §1, §2.
  • [34] D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770. Cited by: §2.
  • [35] I. E. Richardson (2004) H.264 and MPEG-4 video compression: video coding for next-generation multimedia. John Wiley & Sons. Cited by: §2.
  • [36] O. Rippel and L. Bourdev (2017) Real-Time Adaptive Image Compression. In ICML, Cited by: §1.
  • [37] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §4.1.
  • [38] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications. In ICLR, Cited by: §1, §2, §4.2, §4.2.
  • [39] C. E. Shannon (1948) A Mathematical Theory of Communication. Bell System Technical Journal 27 (3), pp. 379–423. Cited by: §1, §3.1.
  • [40] A. Skodras, C. Christopoulos, and T. Ebrahimi (2001) The JPEG 2000 still image compression standard. IEEE Signal Processing Magazine 18 (5), pp. 36–58. Cited by: §2.
  • [41] J. Sneyers and P. Wuille (2016) FLIF: Free lossless image format based on MANIAC compression. In ICIP, ISSN 2381-8549. Cited by: §2.
  • [42] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand (2012) Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1649–1668. Cited by: §1, §3.3.
  • [43] P. Svoboda, M. Hradis, D. Barina, and P. Zemcik (2016) Compression artifacts removal using convolutional neural networks. arXiv preprint arXiv:1605.00366. Cited by: §2.
  • [44] L. Theis, W. Shi, A. Cunningham, and F. Huszar (2017) Lossy Image Compression with Compressive Autoencoders. In ICLR, Cited by: §1.
  • [45] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell (2017) Full Resolution Image Compression with Recurrent Neural Networks. In CVPR, Cited by: §1.
  • [46] J. Townsend, T. Bird, and D. Barber (2019) Practical lossless compression with latent variables using bits back coding. In ICLR, Cited by: §1, §2, §3.2.
  • [47] M. Tschannen, E. Agustsson, and M. Lucic (2018) Deep Generative Models for Distribution-Preserving Lossy Compression. In NeurIPS, Cited by: §1.
  • [48] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves (2016) Conditional Image Generation with PixelCNN Decoders. In NIPS, Cited by: §1, §2, §3.2.
  • [49] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel Recurrent Neural Networks. In ICML, Cited by: §1, §2.
  • [50] G. K. Wallace (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38 (1), pp. xviii–xxxiv. Cited by: §1.
  • [51] WebP Image format. Cited by: §2.
  • [52] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra (2003) Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13 (7), pp. 560–576. Cited by: §1, §2.
  • [53] I. H. Witten, R. M. Neal, and J. G. Cleary (1987) Arithmetic coding for data compression. Communications of the ACM 30 (6), pp. 520–540. Cited by: §3.1, §3.1.
  • [54] Workshop and Challenge on Learned Image Compression. Cited by: §5.1.

Appendix A Learning Better Lossless Compression Using Lossy Compression – Supplementary

A.1 Q-Classifier Architecture

We show the architecture for the Q-Classifier in Table A1. Residual denotes a sequence of convolution, ReLU, convolution, with a skip connection adding the input to the output (as in [14], but without BatchNorm).

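| Layer | Input channels | Output channels | Filter | Stride |
|---|---|---|---|---|
| Conv + ReLU | 3 | 64 | | 2 |
| Conv + ReLU | 64 | 128 | | 2 |
| Residual | 128 | 128 | | |
| Conv | 128 | 256 | | 2 |
| Residual | 256 | 256 | | |
| Channel-Avg. | 256 | 256 | | |
| Linear | 256 | | | |

Table A1: Q-Classifier architecture.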
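The effect of the strides in Table A1 on the feature-map size can be traced with a short sketch. It reads the two numeric columns of the table as input and output channels, and it assumes that each stride-2 convolution is padded so that it halves the resolution (rounding up); neither the padding nor the filter sizes are given in the table, so both are assumptions.

```python
import math

# (layer, input channels, output channels, stride), following Table A1.
# A stride of None means the table lists no stride for that layer.
LAYERS = [
    ("Conv + ReLU", 3, 64, 2),
    ("Conv + ReLU", 64, 128, 2),
    ("Residual", 128, 128, None),
    ("Conv", 128, 256, 2),
    ("Residual", 256, 256, None),
]

def trace(height, width):
    """Return (layer, channels, height, width) after every layer."""
    shapes = []
    channels = 3
    for name, c_in, c_out, stride in LAYERS:
        assert c_in == channels, f"channel mismatch before {name}"
        s = stride or 1
        # Assumed padding: a stride-s layer maps H to ceil(H / s).
        height, width = math.ceil(height / s), math.ceil(width / s)
        channels = c_out
        shapes.append((name, channels, height, width))
    # Channel-Avg. averages over all spatial positions: one value per channel.
    shapes.append(("Channel-Avg.", channels, 1, 1))
    return shapes

for row in trace(512, 512):
    print(row)
```

Under these assumptions, the three stride-2 layers reduce the resolution by a factor of 8 per dimension before the channel-wise average produces the 256-dimensional input of the final linear layer.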

A.2 BPG Performance

Fig. A1 compares the performance of BPG on Kodak, in terms of PSNR, to the recent learned image compression approach from Minnen et al. [29]. The plot is digitized from Figure 2 in [29].

Figure A1: Comparing BPG to Minnen et al. [29].

A.3 Examples from the testing sets

We provide additional visual examples here:

Specifically, we show one image from each of our testing sets, alongside the residual and a sample from the learned distribution, which is expected to be visually similar to the residual. Please refer to Section 6.4 for details on sampling and the visualization.