Modeling Lost Information in Lossy Image Compression

by Yaolong Wang, et al.
Peking University

Lossy image compression is one of the most commonly used operations on digital images. Most recently proposed deep-learning-based image compression methods leverage the auto-encoder structure and have achieved a series of promising results in this field. The images are first encoded into low-dimensional latent features, which are then entropy coded by exploiting their statistical redundancy. However, information loss during encoding is inevitable, which poses a significant challenge for the decoder in reconstructing the original images. In this work, we propose a novel invertible framework called Invertible Lossy Compression (ILC) to largely mitigate the information loss problem. Specifically, ILC introduces an invertible encoding module to replace the encoder-decoder structure. The module produces a low-dimensional, informative latent representation and, at the same time, transforms the lost information into an auxiliary latent variable that is not further coded or stored. The latent representation is quantized and encoded into a bit-stream, and the latent variable is forced to follow a specified distribution, i.e. an isotropic Gaussian distribution. In this way, recovering the original image is made tractable by drawing a surrogate latent variable and applying the inverse pass of the module to the sampled variable and the decoded latent features. Experimental results demonstrate that, with this new component replacing the auto-encoder in existing image compression methods, ILC can significantly outperform the baseline method on extensive benchmark datasets.


1 Introduction

Lossy image compression has long played a central role in image storage and transmission, especially given the exploding number of large images and the comparatively limited storage and bandwidth available today. In particular, transform-coding-based methods such as JPEG [32], JPEG2000 [27] and BPG [9] perform well and are widely adopted in practice. Recently, deep-learning-based lossy image compression methods [6, 7, 26, 24, 19, 31, 1, 8, 29, 25] have attracted great interest due to their impressive performance and low bitrate.

Unlike previous transform coding methods, deep-learning-based methods first transform an image $x$ into a lower-dimensional latent representation vector $y$ by an encoder network, then quantize $y$ into a discrete-valued vector $\hat{y}$. Lossless entropy coding methods, such as arithmetic coding [28], are then applied to $\hat{y}$ to compress it into a bit-stream. Some prior works [6, 7, 26, 23, 24] adopt an auxiliary network as an entropy model that estimates the density of $\hat{y}$ and provides its statistics to the entropy coder. A decoder network is used to approximate the inverse function that maps the latent variables back to pixels. Information is inevitably lost in the encoder pass, both by reducing the dimensionality of the representation and by rounding floating-point values during quantization.
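The quantize-then-entropy-code step described above can be illustrated with a small sketch. It is not taken from any of the cited methods: the latent values are random stand-ins for an encoder output, the function names are hypothetical, and the empirical entropy of the rounded symbols is used as a simple proxy for the bitrate that an ideal entropy coder with a matching i.i.d. model would need.

```python
import numpy as np

def quantize(y):
    """Round each latent value to the nearest integer (bin width 1)."""
    return np.round(y).astype(np.int64)

def empirical_entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the empirical symbol distribution."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
y = rng.normal(scale=2.0, size=10_000)      # assumed stand-in for an encoder output
y_hat = quantize(y)                          # discrete-valued vector to be entropy coded
bits = empirical_entropy_bits(y_hat)         # proxy for bits per latent element
max_err = float(np.abs(y - y_hat).max())     # rounding error never exceeds half a bin
```

The rounding error is bounded by half a bin width, but it cannot be undone by any decoder, which is the quantization part of the information loss discussed in this section.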

The reduction in the dimensionality of the representation $y$ induces a significant drop in the amount of information retained from the original image $x$, which is inconsistent with the goal of compression, namely to reduce the entropy of the representation under a prior probability entropy model. Previous works [7, 26] mainly focus on finding a reasonable estimate of the prior density to minimize the expected length of the bit-stream, but few put effort into reducing the expected distortion of the reconstructed image $\hat{x}$ with respect to the original $x$. The information loss makes recovering the original image with a decoder network alone an ill-posed task, which in turn affects the rate-distortion optimization. One way to keep all the information during encoding is to preserve the dimensionality of the representation $y$, i.e. the same as that of the original image $x$. The reconstruction stage would benefit from such an informative representation, but it is extremely difficult to encode such high-dimensional data at the desired low bitrate with lossless entropy coding methods.

In this paper, we propose a novel framework for lossy image compression called Invertible Lossy Compression (ILC) to tackle this problem by capturing as much knowledge of the lost information as possible. The recovery of the original image can benefit from modelling, in expectation, the statistical behaviour of the information lost during encoding. To this end, the lost information should be expressed in an explicit form and encoded into hidden states without unintended information loss. The encoder-decoder framework employed in previous image compression methods does not meet this requirement well: considerable effort is needed to make the encoder invertible and the decoder its inverse, and even then the invertibility holds only approximately. Therefore, we adopt an invertible encoding module (IEM), which is strictly reversible and naturally better suited to this pair of inverse tasks. IEM consists of two essential components: invertible downsampling layers, which enlarge the receptive field and decompose the spatial correlation among channels, and coupling layers, which enhance the expressive power of the module. Since an invertible model gives a representation of the same dimensionality as the input image $x$, we split the encoded representation into $y$ and $z$, where $y$ preserves the information necessary for image reconstruction and $z$ stores all the remaining information (such as high-frequency noise that does not alter the original image much). For lossy compression, we discard $z$ and encode the quantized $\hat{y}$ into a bit-stream, while also modelling the distribution of the lost information $z$. For reconstruction, we run the IEM in the inverse direction on $\hat{y}$ and a sample of $z$ randomly drawn from the learned distribution.

To keep the representation $y$ both informative and favourable for coding, it is crucial that IEM captures the knowledge in the distribution of $z$ while removing the dependency between $y$ and $z$. To this end, we employ a distribution matching loss that encourages $z$ to follow a Gaussian distribution. More importantly, the quantization process changes the representation from $y$ to $\hat{y}$, which disturbs the inverse pass of IEM when recovering $x$ from $\hat{y}$ combined with $z$. It is challenging to train an invertible model under such a distribution mismatch. Observing that conventional model training is more robust to the quantization process, we propose a knowledge distillation module (KDM) to transfer knowledge from encoder-decoder models to IEM through knowledge distillation [11, 5, 17]. Knowledge distillation usually starts by training a teacher model, an encoder-decoder compression model in our task, and then optimizes the target invertible student model so that it mimics the teacher model's behaviour. Empirically, we choose the output of the encoder network as the distillation target and combine the distillation loss with the reconstruction loss, entropy loss, and distribution loss for efficient optimization. The KDM provides soft labels and accelerates the convergence of the invertible model, so that it quickly matches the performance of the teacher encoder-decoder model. We conduct experiments on extensive benchmark datasets, and the comparison with the baseline method shows a significant improvement, which demonstrates the effectiveness of the proposed ILC.

The main contributions of this paper are highlighted as follows:

  1. We propose a novel ILC framework that introduces an invertible encoding module into lossy image compression in place of the encoder-decoder framework, in order to simultaneously produce low-dimensional informative representations and capture the knowledge in the distribution of the information lost during encoding.

  2. We propose an efficient training objective for IEM that encourages the latent variable to obey a pre-specified distribution, which is easy to sample from and provides rich information during decoding. We then employ a knowledge distillation module, guided by a teacher model, to overcome the distribution mismatch caused by quantization and to accelerate the optimization of IEM.

  3. Extensive experimental results demonstrate that simply replacing the encoder and decoder models with our ILC framework improves the quality of the reconstructed images on various benchmark datasets.

2 Related Work

2.1 Lossy Image Compression

The traditional lossy compression algorithms for images [32, 27, 9] use prior knowledge (e.g. the discrete cosine transform (DCT) or the discrete wavelet transform (DWT)) to decouple the low- and high-frequency signals, and then apply reversible coding algorithms (e.g. Huffman coding [30]) to the quantized signals to achieve compression.

In recent years, a growing number of deep-learning-based works have appeared and attained impressive results [6, 7, 26, 24, 19, 31, 8, 22]. Generally, they use auto-encoders, which are widely employed in representation learning and generative problems. The informative bottleneck representation in the auto-encoder framework is well suited to lossy compression. The idea is to use a neural network, instead of prior knowledge, as an encoder that extracts a lower-dimensional representation directly from the image, and to quantize the low-frequency signal so that it can be coded into a bit-stream. When decompressing the image, a decoder network reconstructs the original image from the decoded quantized latent representation.

However, quantization leads to differentiability issues. Over the past few years, to enable end-to-end optimization, a consensus has emerged that uniform noise should be added to the low-frequency signals as a soft quantization during training. Previous works also largely agree on the choice of coding algorithm, namely arithmetic coding, an efficient coding algorithm based on entropy modelling. This raises the question of how to better estimate the entropy of the signal variable, and much work has made progress on this issue. For example, hyperprior and autoregressive components are used [7, 26] to jointly model the latent feature map, so that a better-performing entropy model can be constructed. There is also work on the auto-encoder itself that improves the network's capacity to extract features: a non-local attention block was introduced to help the auto-encoder capture the local and global correlations of the latent feature map [24].

Although previous methods show promising performance, they ignore the difficulty that non-negligible information loss during encoding creates for the reconstruction stage. In this work, we identify this problem and propose to mitigate it by explicitly modelling the lost information.

2.2 Invertible Neural Network

It has been shown in many scenarios that neural networks with invertibility as their core design principle can match or even exceed the performance of non-invertible neural networks [21, 33, 16].

As for generative models, denoting the input data as $x$, an invertible neural network as $f$, and its output as $z = f(x)$, the inverse function $f^{-1}$ can be trivially obtained, such that $x$ can be easily sampled with $x = f^{-1}(z)$, where $z \sim p(z)$. Furthermore, the density function of $x$ can be explicitly defined, which allows us to use the maximum likelihood method for training. GANs and VAEs are also well known in generative problems, but both have drawbacks. In VAEs, the posterior distribution of the latent variables, $p(z \mid x)$, can only be approximately inferred from data, which means the training objective is not exact but a variational lower bound on the log-likelihood of $x$. GANs lack an encoder and therefore cannot perform inference, which significantly impedes their usability in our scenario. With invertibility, in contrast, a neural network can both evaluate the log-likelihood exactly and perform inference, and it is naturally well suited to synthesis.

In order to ensure the invertibility of the system, each sub-block of the neural network is designed to be invertible. This restricts the design of the structure, so maintaining model capacity becomes a top priority. To address this concern, we take the coupling layer with affine transformation, introduced in RealNVP [13], as a general solution. Given a $D$-dimensional input and a slicing position $d < D$, the $D$-dimensional output of an affine coupling layer is obtained by scaling and shifting one slice of the input element-wise, with the scale and shift computed from the other slice (Eq. 1).

The transformation functions that compute the scale and shift can be arbitrary, as long as their input and output dimensionalities match, and a constant factor serves as a clamp on the scale.

For the inverse, given a $D$-dimensional output and the same slicing position $d$, the $D$-dimensional input is recovered in closed form by undoing the shift and then the scale.
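The forward and inverse passes of an affine coupling layer can be sketched as follows. This is a minimal NumPy sketch of the standard RealNVP-style coupling [13], not necessarily the exact form used in IEM. The names `coupling_forward`, `coupling_inverse`, `scale_fn`, `shift_fn` and the tanh-based `clamp` are assumptions introduced for illustration, and the toy transformation functions are fixed random linear maps rather than trained networks.

```python
import numpy as np

def clamp(s, c=1.0):
    """Assumed soft clamp: keeps the log-scale inside (-c, c)."""
    return c * np.tanh(s / c)

def coupling_forward(m, d, scale_fn, shift_fn, c=1.0):
    """n[:d] = m[:d];  n[d:] = m[d:] * exp(clamp(s(m[:d]))) + t(m[:d])."""
    m1, m2 = m[:d], m[d:]
    n2 = m2 * np.exp(clamp(scale_fn(m1), c)) + shift_fn(m1)
    return np.concatenate([m1, n2])

def coupling_inverse(n, d, scale_fn, shift_fn, c=1.0):
    """Undo the shift, then the scale; the first slice passes through unchanged."""
    n1, n2 = n[:d], n[d:]
    m2 = (n2 - shift_fn(n1)) * np.exp(-clamp(scale_fn(n1), c))
    return np.concatenate([n1, m2])

rng = np.random.default_rng(0)
D, d = 8, 3
W_s = rng.normal(size=(D - d, d))            # toy parameters, not trained
W_t = rng.normal(size=(D - d, d))
scale_fn = lambda u: np.tanh(W_s @ u)        # maps R^d -> R^(D-d)
shift_fn = lambda u: W_t @ u                 # maps R^d -> R^(D-d)

m = rng.normal(size=D)
n = coupling_forward(m, d, scale_fn, shift_fn)
m_rec = coupling_inverse(n, d, scale_fn, shift_fn)
```

Note that `scale_fn` and `shift_fn` are never inverted themselves, which is why they can be arbitrary functions: invertibility of the layer only relies on the element-wise scale being non-zero.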


INNs have also been applied to paired data, and this idea has been demonstrated on different problems. For example, Ardizzone et al. [3] analyzed real-world problems from medicine and astrophysics. In image compression tasks, however, the classical Maximum Mean Discrepancy (MMD) [14] method fails to measure the difference between such high-dimensional probability distributions. A conditional INN [4] has been designed and applied to guided image generation and colorization. In that task, the guidance is given as a condition, which is not suitable for our aim. Recently, Xiao et al. [33] proposed to use an INN as a transformation between high- and low-resolution images. In their task, no explicit entropy constraint on the low-dimensional representation is considered, whereas such a constraint is one of the biggest challenges in image compression tasks.

3 Invertible Lossy Compression

Figure 1: (a): the general framework using a variational auto-encoder, where the entropy model generally refers to all the modules that can estimate the density function of $y$. It can be expanded to jointly estimate with hyperprior or autoregressive components [7, 26]. (b): the general framework of ILC. An invertible encoding module takes the place of both the encoder and decoder networks in (a), and a knowledge distillation module is adopted during the training stage.

Figure 1(a) shows the general framework of previous variational auto-encoder based lossy image compression methods [6, 7, 26]. Typically it consists of three modules: an encoder network that transforms the original image $x$ into a low-dimensional representation $y$, an entropy model that estimates the density of the quantized representation $\hat{y}$, and a decoder network that reconstructs the image from $\hat{y}$. Figure 1(b) sketches the general framework of ILC. The main difference between ILC and previous methods is that, instead of the commonly used encoder-decoder module, an invertible encoding module (IEM) is adopted to explicitly model the information lost during encoding, which mitigates the difficulty of the reconstruction stage. In the forward pass, the image $x$ is transformed into a coding target $y$ and an auxiliary latent variable $z$. We train the network so that the transformed $z$ follows a specified distribution independent of $y$, which captures the statistical knowledge of the lost information for reconstruction. This is achievable because any absolutely continuous distribution can be transformed bijectively into a standard Gaussian [18]. Then, for the inverse pass, we can use a sample from the specified distribution in place of the original $z$ and reconstruct the image from the quantized $\hat{y}$ through the inverse transformation. Furthermore, a knowledge distillation module (KDM) is coupled with IEM during training to stabilize the optimization and accelerate convergence. We describe these components in detail as follows.
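The forward and inverse passes of this framework can be illustrated with a toy example. The sketch below is only an illustration under strong assumptions: a random orthogonal matrix `Q` stands in for the learned IEM (it is exactly invertible but untrained), and `forward`, `inverse`, `D` and `k` are hypothetical names. The discarded part $z$ is replaced at decoding time by the zero vector, which is the most likely sample of a standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 16, 4                                  # input size and size of the coded part y

# Hypothetical stand-in for the IEM: an orthogonal matrix is exactly invertible.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))

def forward(x):
    """Map x to (y, z): y is the coding target, z the auxiliary latent variable."""
    h = Q @ x
    return h[:k], h[k:]

def inverse(y, z):
    """Inverse pass: recombine y and z and map back to the input space."""
    return Q.T @ np.concatenate([y, z])

x = rng.normal(size=D)
y, z = forward(x)
y_hat = np.round(y)                           # quantized coding target

x_exact = inverse(y, z)                       # no quantization, true z: lossless
x_lossy = inverse(y_hat, np.zeros(D - k))     # quantized y, surrogate z = 0
```

Because `Q` is orthogonal, the squared reconstruction error splits exactly into a quantization part and a part caused by replacing $z$. In the actual method the second part is what training tries to make harmless, by pushing image-specific content into $y$ and leaving $z$ close to $y$-independent Gaussian noise.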

3.1 Invertible Encoding Module

Figure 2: The architecture of IEM, where (a): the invertible downsampling layer, (b): the computation graph for forward propagation in the coupling layer and (c): details of the transformation functions.

Our IEM consists of two basic invertible components, i.e., the invertible downsampling layer and the coupling layer. As shown in Figure 1, IEM takes the place of both the encoder and decoder modules in our framework. When an image is compressed, IEM acts as the encoder: the image $x$ is fed through the model in the forward direction. Conversely, IEM is applied in the inverse direction to serve as the decoder when the compressed image is reconstructed.

Invertible Downsampling Layer Downsampling is necessary for the entropy model to enlarge the receptive field and decompose the spatial correlation in compression. However, the traditional downsampling layers in previous learning-based compression methods are irreversible. Therefore, we design an invertible downsampling layer for this purpose. Figure 2 shows the detailed architecture. First, we introduce a wavelet transformation as one of our basic modules. Every wavelet transformation has a natural inverse. More importantly, wavelet transformations provide strong prior knowledge by explicitly separating the low- and high-frequency signals, from which image compression algorithms benefit [2]. In contrast, previous deep-learning-based methods generally do not integrate this prior into their models. Specifically, the wavelet transformation transforms an input tensor with height $H$, width $W$, and $C$ channels into a tensor of shape $\frac{H}{2} \times \frac{W}{2} \times 4C$, where the first $C$ slices are low-frequency contents equivalent to average pooling, and the rest are high-frequency components corresponding to residuals in the vertical, horizontal and diagonal directions. We employ the Haar wavelet transformation in our framework, which is easy to implement, enlarges the receptive field without information loss, and carries certain prior knowledge.
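A Haar-style invertible downsampling of this kind can be sketched in a few lines. The sketch assumes a channels-last layout `(H, W, C)` with even `H` and `W`, and a normalization of 1/4 so that the first `C` output channels equal 2×2 average pooling. The function names and the ordering of the three residual groups are choices made for this example, not details confirmed by the text.

```python
import numpy as np

def haar_forward(x):
    """(H, W, C) -> (H/2, W/2, 4C); H and W must be even."""
    a = x[0::2, 0::2]   # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]   # top-right
    c = x[1::2, 0::2]   # bottom-left
    d = x[1::2, 1::2]   # bottom-right
    low = (a + b + c + d) / 4.0   # low-frequency part: 2x2 average pooling
    h1 = (a - b + c - d) / 4.0    # left column minus right column
    h2 = (a + b - c - d) / 4.0    # top row minus bottom row
    h3 = (a - b - c + d) / 4.0    # main diagonal minus anti-diagonal
    return np.concatenate([low, h1, h2, h3], axis=-1)

def haar_inverse(t):
    """(H/2, W/2, 4C) -> (H, W, C); exact inverse of haar_forward."""
    C = t.shape[-1] // 4
    low, h1, h2, h3 = (t[..., i * C:(i + 1) * C] for i in range(4))
    Hh, Wh = low.shape[:2]
    x = np.empty((2 * Hh, 2 * Wh, C), dtype=t.dtype)
    x[0::2, 0::2] = low + h1 + h2 + h3
    x[0::2, 1::2] = low - h1 + h2 - h3
    x[1::2, 0::2] = low + h1 - h2 - h3
    x[1::2, 1::2] = low - h1 - h2 + h3
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 3))
t = haar_forward(x)
x_rec = haar_inverse(t)
```

The transformation only rearranges and linearly combines pixels, so the number of values is unchanged and the inverse is exact up to floating-point rounding.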

Although the non-learnable wavelet transformation can downsample the input image to enlarge the receptive field, provides a strong prior for the model and is strictly invertible, it suffers from a fixed split of information, i.e. only the first quarter of the channels carries the main low-frequency contents. A more flexible separation can better fit our channel split between $y$ and $z$ and makes training easier. Therefore, we apply a 1×1 invertible convolution after the wavelet transformation to refine the split of information. The 1×1 invertible convolution was initially proposed in GLOW [21], where it is used for channel-wise permutation. Different from that purpose, we leverage it for channel-wise refinement after the wavelet transformation and initialize its weight as an identity matrix.
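A 1×1 convolution applies one $C \times C$ matrix to the channel vector at every pixel, so it is invertible whenever that matrix is. The sketch below illustrates this and the identity initialization. The helper names and the small random perturbation that stands in for a trained weight are assumptions made for the example.

```python
import numpy as np

def conv1x1(x, W):
    """Apply the same C x C matrix W to the channel vector at every pixel of (H, W, C)."""
    return x @ W.T

def conv1x1_inverse(y, W):
    """Invert the 1x1 convolution by applying the inverse matrix per pixel."""
    return y @ np.linalg.inv(W).T

rng = np.random.default_rng(0)
C = 4
x = rng.normal(size=(2, 3, C))

W_init = np.eye(C)                                   # identity initialization
out_init = conv1x1(x, W_init)                        # leaves the wavelet split unchanged

W_trained = np.eye(C) + 0.1 * rng.normal(size=(C, C))  # stand-in for a learned weight
out = conv1x1(x, W_trained)
x_rec = conv1x1_inverse(out, W_trained)
```

With the identity initialization the layer starts as a no-op, so training begins from the fixed wavelet split and only gradually mixes channels.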

Coupling Layer After the invertible downsampling layer, the feature map is roughly broken down into two segments, a low-frequency component carrying the majority of the information and a high-frequency component mainly composed of the former's residuals. It is then fed into a stack of convolutional layers that further abstract the corresponding signals. At the input of the coupling layer, the feature map is partitioned into a low-frequency part and a high-frequency part at a fixed ratio. Notice that this ratio is consistent with the proportion of dimensions in the final outputs, $y$ and $z$. To further decouple the signals, we employ an affine transformation, as proposed in RealNVP [13], on the high-frequency feature map and fuse the captured signal into the low-frequency part. On the other hand, if the compression rate is low, we expect simple information to be transferred into the high-frequency part so that we can model it in the latent space, rather than have it eliminated by quantization. We therefore apply another affine transformation for the flow of information from the low- to the high-frequency component. The detailed structure is presented in Figure 2.

The complete expression of the affine transformation is given in Eq. 1, where the constant clamping factor is set to 1, and all the transformation functions can be arbitrary as long as the dimensionalities of their inputs and outputs match. To enhance the expressive power of our model while keeping the computation light-weight, we employ a simple but effective bottleneck-like structure for the transformation functions, as shown in Figure 2.

3.2 Knowledge Distillation Module

Yet, several challenges may still significantly influence the optimization of IEM. On the one hand, due to its invertibility, IEM can transform the distributions of $x$ and $(y, z)$ into each other. However, quantizing $y$ changes the distribution of the representation, and we expect IEM not to be robust to such jitter induced by quantization, leading to a non-negligible drop in performance. On the other hand, while guaranteeing invertibility, IEM has to make $z$ as independent of $y$ as possible and force both of them to follow the required distributions. Under these conditions, the large imbalance in the amount of information between $y$ and $z$ further increases the difficulty of optimization.

Therefore, we introduce the KDM to mitigate these challenges. We observe that encoder-decoder frameworks are much more robust to the distribution mismatch caused by quantization, most likely due to the separate training procedure. We therefore propose to use the encoder network as a teacher model and encourage IEM to mimic its output representation at the training stage. In this way, IEM learns to output a representation whose distribution changes less under quantization, which makes the inverse function of IEM more robust to the distribution mismatch. More importantly, KDM eases training in the early stage: IEM can start from the distribution of $y$ given by the teacher encoder and search around it for a better low-dimensional manifold of the data distribution that can be efficiently inverted.

In practice, KDM requires a fully specified prior, i.e. a teacher encoder and its affiliated entropy module, which furnishes at least a sub-optimal solution for the distribution of $y$, with learnable components left to be refined. Specifically, the fully trained prior is used to initialize the entropy model for IEM. Throughout training, the teacher's knowledge is transferred through a distillation loss on $y$, which is evaluated by feeding the same $x$ into both the teacher encoder and IEM and then computing the difference between their outputs $y$.

3.3 Optimization Objective

Our fundamental training objective follows previous learning-based lossy compression methods, i.e. to minimize a weighted sum of the rate $R$ and the distortion $D$, where the rate is lower-bounded by the entropy of the discrete probability distribution of the quantized vector $\hat{y}$, and the distortion is the difference between the reconstruction $\hat{x}$ and the original input $x$. In addition, two novel objectives are included for our invertible framework and efficient training: (1) a distribution matching loss for capturing the distribution of the lost information, and (2) a distillation loss that stabilizes our training procedure.

Rate Our basic goal is to minimize the rate of the coding target $\hat{y}$. Therefore, we use its entropy as our objective, which is consistent with previous works [6, 7, 26]. As mentioned above, an entropy module is used to estimate the entropy of $\hat{y}$. We follow their loss definition, in which the rate is the expected negative log-likelihood of a noisy approximation of the quantized representation under the entropy model. The approximation is obtained by adding i.i.d. uniform noise to $y$, with the same width as the quantization bins, which in our case is one.
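The rate term with additive uniform noise can be sketched numerically. The example below is an illustration only: it assumes a zero-mean Gaussian entropy model with a fixed, known standard deviation, whereas [6] learns a non-parametric density, and all function names are hypothetical. The likelihood of a value is the Gaussian probability mass of the unit-width bin centred on it, which is also the exact density of a Gaussian variable plus uniform noise on $(-0.5, 0.5)$.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gaussian_cdf(v, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + _erf((v - mu) / (sigma * math.sqrt(2.0))))

def bin_likelihood(v, mu=0.0, sigma=1.0):
    """Probability mass of the unit-width bin centred at v under N(mu, sigma^2)."""
    return gaussian_cdf(v + 0.5, mu, sigma) - gaussian_cdf(v - 0.5, mu, sigma)

def rate_bits(v, mu=0.0, sigma=1.0):
    """Average negative log2-likelihood per element, floored to avoid log(0)."""
    p = np.maximum(bin_likelihood(v, mu, sigma), 1e-12)
    return float(np.mean(-np.log2(p)))

rng = np.random.default_rng(0)
y = rng.normal(scale=3.0, size=5000)                    # stand-in for the latent y
y_noisy = y + rng.uniform(-0.5, 0.5, size=y.shape)      # training-time surrogate
y_hat = np.round(y)                                     # inference-time quantization

rate_train = rate_bits(y_noisy, sigma=3.0)
rate_test = rate_bits(y_hat, sigma=3.0)
```

The two estimates are close to each other, which is the reason the differentiable noisy surrogate is a usable stand-in for the rate of the rounded values during training.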

Distortion Because of the uncertainty of $z$ and the quantization of $y$, a distortion loss is still required to ensure that our model reconstructs well in the inverse direction. The distortion loss encourages the inverse pass of our model to adapt to a new sample drawn from the specified distribution of $z$: it measures the difference between the original image $x$ and the image reconstructed from the quantized $\hat{y}$ and such a sample. For training stability in practice, we empirically take the most likely sample from the distribution for reconstruction. We employ a norm-based loss as the difference measure. Different from previous works that use the noisy representation as an approximation to $\hat{y}$ in the distortion, we directly employ $\hat{y}$ with the Straight-Through Estimator [10] during optimization. The reason is the inconsistency between the noisy representation in training and $\hat{y}$ in inference, which may negatively influence the generalization of IEM, since it shares the same parameters for encoding and decoding.

Distribution Matching This part of the training objective enforces the transformation from the y-dependent lost information to a standard Gaussian representation. We aim to minimize the difference between the distribution of $z$ induced by the data distribution through IEM and the specified y-independent distribution $p(z)$. For practical optimization, we employ the cross-entropy (CE) to measure this difference, which leads to an objective equal to the expected negative log-density of $z$ under $p(z)$.

The distribution loss encourages $z$ to follow the same target distribution for every $y$, thereby encouraging independence between them.
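For a standard Gaussian target, the cross-entropy objective reduces to an average of squared norms plus a constant. The sketch below shows a sample estimate of it. The function name and the synthetic batches are assumptions for illustration, and the loss is reported in nats.

```python
import numpy as np

def gaussian_nll(z):
    """Mean negative log-density of N(0, I) over a batch of shape (N, d), in nats."""
    d = z.shape[1]
    sq_norm = np.sum(z ** 2, axis=1)
    return float(np.mean(0.5 * sq_norm + 0.5 * d * np.log(2.0 * np.pi)))

rng = np.random.default_rng(0)
z_matched = rng.normal(size=(4000, 8))   # batch that already follows N(0, I)
z_shifted = z_matched + 1.0              # batch whose mean is off target

loss_matched = gaussian_nll(z_matched)
loss_shifted = gaussian_nll(z_shifted)
```

The shifted batch receives a higher loss than the matched one, so gradient descent on this term pulls the mean of $z$ toward the target.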

Distillation As mentioned in the previous section, we leverage a distillation module to stabilize our training. The distillation loss is defined as a difference metric between the output of the teacher encoder and the representation $y$ produced by IEM for the same input, and we use a norm-based loss as this metric in practice.

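Total loss Combining these objectives, our total loss for training is a weighted sum of the rate, distortion, distribution matching and distillation losses (Eq. 7), where the weights are coefficients for balancing the different loss terms.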
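For reference, the four objectives described in this section can be written compactly as below. This is a hedged reconstruction from the surrounding prose, not the original typeset equations. The symbols $f_\theta$ (forward pass of IEM), $g_T$ (teacher encoder), $\ell_D$ and $\ell_T$ (difference metrics), the loss names, and the weights $\lambda_1, \lambda_2, \lambda_3$ are all assumed notation, and placing the unit weight on the rate term is likewise an assumption.

```latex
\begin{align}
(y, z) &= f_\theta(x), \qquad
\tilde{y} = y + \Delta, \quad \Delta \sim \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \\
L_{\mathrm{rate}} &= \mathbb{E}_{x}\!\left[-\log_2 p_{\tilde{y}}(\tilde{y})\right] \\
L_{\mathrm{distortion}} &= \mathbb{E}_{x}\!\left[\ell_D\!\left(x,\; f_\theta^{-1}(\hat{y}, \hat{z})\right)\right],
  \qquad \hat{z} \sim p(z) \\
L_{\mathrm{match}} &= \mathbb{E}_{x}\!\left[-\log p(z)\right],
  \qquad p(z) = \mathcal{N}(0, I) \\
L_{\mathrm{distill}} &= \mathbb{E}_{x}\!\left[\ell_T\!\left(g_T(x),\; y\right)\right] \\
L_{\mathrm{total}} &= L_{\mathrm{rate}} + \lambda_1 L_{\mathrm{distortion}}
  + \lambda_2 L_{\mathrm{match}} + \lambda_3 L_{\mathrm{distill}}
\end{align}
```

Here $\hat{y}$ is the rounded $y$, used in the distortion term with a straight-through gradient, and $\hat{z}$ is taken as the most likely sample of $p(z)$ during training, as stated in the text.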

4 Experiments

The dataset we use for training is a subset of ImageNet [12]. We select 9250 images whose size exceeds a fixed threshold, which simplifies data preprocessing. All training images are preprocessed by random rescaling and cropping. Evaluation is performed on the Kodak image dataset [15], commonly used as test data for compression problems, and on a randomly sampled subset of ImageNet with 100 uncompressed images, none of which is used in training.

Our idea is verified on [6], since it has an open-source and reproducible implementation. With the number of filters set to 256, the teacher encoder is trained following the instructions in [6]. We use the Adam optimization algorithm [20] for all parameters, with different learning rates for the convolutional auto-encoder and the entropy model. Training and evaluation are performed at each compression rate separately. The compression rate is controlled by adjusting the ratio between distortion and rate in the objective function. To make sure training has converged, we train for at least one million iterations and check that the performance no longer improves over a certain period of time.

Afterwards, following the architecture in Figure 1, the pre-trained encoder is loaded as our teacher model. We jointly optimize all the parameters in IEM and the entropy model. The training dataset and preprocessing methods are the same as before, except that the learning rate gradually decays after 0.1 million iterations. In practice, for different bitrates, we use the same distillation coefficient during training and adjust the other coefficients in our objective function (Eq. 7).

Figure 3: The rate-distortion comparison of the baseline method and ILC distilled from it, where each point is averaged over all the test images in the specified dataset. (a) compares PSNR (RGB) of images from Kodak, and (b) compares PSNR on luma component of 100 images from a randomly sampled test dataset of ImageNet.

In line with previous work, we evaluate our model with the peak signal-to-noise ratio (PSNR) on both RGB and the luma component of images. We compare the rate-distortion performance of our method to the baseline approach, i.e. the teacher model in our framework. When evaluating PSNR, we observe that $z$ has to be sampled from a distribution with lower variance to obtain better performance. This is consistent with GLOW [21], which samples from a distribution with reduced variance to prevent mode collapse in generated images. In our case, the priority is not variance but reconstruction performance, so instead of sampling, we take the most likely $z$ from the specified distribution. Compared with the baseline method at the same compression ratio, ILC improves PSNR by around 0.4 dB. Empirically, the gain grows as the compression ratio rises (i.e. the bitrate decreases). Intuitively, under more severe bitrate constraints, the convolutional auto-encoder has to take on more of the work of extracting a refined representation of the data, which leads to more irreversible loss of information. In this case, explicitly modelling the high-frequency signals becomes the key to success.
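For readers who want to reproduce the metric, PSNR on RGB and on the luma component can be computed as sketched below. The text does not state which luma conversion is used, so the BT.601 weights here are an assumption, as are the function names and the 8-bit peak value of 255.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two arrays of the same shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def luma(rgb):
    """Luma of an (H, W, 3) RGB array using BT.601 weights (assumed here)."""
    return np.asarray(rgb, dtype=np.float64) @ np.array([0.299, 0.587, 0.114])

rng = np.random.default_rng(0)
original = rng.integers(0, 200, size=(4, 4, 3)).astype(np.float64)
reconstructed = original + 1.0                  # every value is off by exactly one level

psnr_rgb = psnr(original, reconstructed)
psnr_luma = psnr(luma(original), luma(reconstructed))
```

Since PSNR is logarithmic in the mean squared error, a gain of 0.4 dB corresponds to a reduction of the mean squared error by roughly 9 percent.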

To further verify that the improvement comes from ILC's advantage over the auto-encoder, we compare this result with an ILC guided by an early-stopped teacher model. The comparison shows a strong positive correlation between the performance of ILC and that of its teacher model, which supports our view that ILC is strongly guided by the teacher and searches for a manifold around it. ILC's advantage lies mainly in reconstruction, so it could be combined with other, more advanced entropy models to improve performance further.

5 Conclusion

In this paper, we propose ILC, a novel framework for lossy image compression that explicitly models the information lost during encoding. By doing so, the ill-posedness of the reconstruction stage is largely mitigated. To this end, we design an invertible encoding module to replace the encoder-decoder network in previous methods, which is naturally more suitable for this pair of inverse tasks. With the statistical knowledge of the latent variable, IEM can reconstruct the original input image from the compressed bit-stream with high quality by drawing a sample from a pre-specified distribution. To overcome the challenges in optimization, we propose a knowledge distillation module that provides soft labels from a teacher model, which significantly accelerates the training procedure. Extensive experiments demonstrate that our framework ILC significantly improves performance over existing methods.


  • [1] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool (2019) Generative adversarial networks for extreme learned image compression. In

    Proceedings of the IEEE International Conference on Computer Vision

    pp. 221–231. Cited by: §1.
  • [2] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies (1992) Image coding using wavelet transform. IEEE Transactions on image processing 1 (2), pp. 205–220. Cited by: §3.1.
  • [3] L. Ardizzone, J. Kruse, S. Wirkert, D. Rahner, E. W. Pellegrini, R. S. Klessen, L. Maier-Hein, C. Rother, and U. Köthe (2019) Analyzing inverse problems with invertible neural networks. In Proceedings of the International Conference on Learning and Representations, Cited by: §2.2.
  • [4] L. Ardizzone, C. Lüth, J. Kruse, C. Rother, and U. Köthe (2019) Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392. Cited by: §2.2.
  • [5] J. Ba and R. Caruana (2014) Do deep nets really need to be deep?. In Advances in neural information processing systems, pp. 2654–2662. Cited by: §1.
  • [6] J. Ballé, V. Laparra, and E. P. Simoncelli (2017) End-to-end optimized image compression. ICLR. Cited by: §1, §1, §2.1, §3.3, §3, §4.
  • [7] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. ICLR. Cited by: §1, §1, §1, §2.1, §2.1, Figure 1, §3.3, §3.
  • [8] J. Ballé (2018) Efficient nonlinear transforms for lossy image compression. In 2018 Picture Coding Symposium (PCS), pp. 248–252. Cited by: §1, §2.1.
  • [9] F. Bellard (2015) BPG image format. URL https://bellard.org/bpg. Cited by: §1, §2.1.
  • [10] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.3.
  • [11] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. Cited by: §1.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Appendix 0.A, §4.
  • [13] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using real NVP. In Proceedings of the International Conference on Learning Representations, Cited by: §2.2, §3.1.
  • [14] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani (2015) Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906. Cited by: §2.2.
  • [15] R. Franzen (1999) Kodak lossless true color image suite. source: http://r0k.us/graphics/kodak 4. Cited by: §4.
  • [16] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse (2017) The reversible residual network: backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224. Cited by: §2.2.
  • [17] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1.
  • [18] A. Hyvärinen and P. Pajunen (1999) Nonlinear independent component analysis: existence and uniqueness results. Neural Networks 12 (3), pp. 429–439. Cited by: §3.
  • [19] N. Johnston, E. Eban, A. Gordon, and J. Ballé (2019) Computationally efficient neural image compression. arXiv preprint arXiv:1912.08771. Cited by: §1, §2.1.
  • [20] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix 0.A, §4.
  • [21] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §2.2, §3.1, §4.
  • [22] J. Lee, S. Cho, and S. Beack (2019) Context-adaptive entropy model for end-to-end optimized image compression. ICLR. Cited by: §2.1.
  • [23] J. Lee, S. Cho, S. Jeong, H. Kwon, H. Ko, H. Y. Kim, and J. S. Choi (2019) Extended end-to-end optimized image compression method based on a context-adaptive entropy model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §1.
  • [24] H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma (2019) Non-local attention optimized deep image compression. arXiv preprint arXiv:1904.09757. Cited by: §1, §1, §2.1, §2.1.
  • [25] S. Ma, X. Zhang, C. Jia, Z. Zhao, S. Wang, and S. Wanga (2019) Image and video compression with neural networks: a review. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §1.
  • [26] D. Minnen, J. Ballé, and G. D. Toderici (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780. Cited by: §1, §1, §1, §2.1, §2.1, Figure 1, §3.3, §3.
  • [27] M. Rabbani (2002) JPEG2000: image compression fundamentals, standards and practice. Journal of Electronic Imaging 11 (2), pp. 286. Cited by: §1, §2.1.
  • [28] J. Rissanen and G. Langdon (1981) Universal modeling and coding. IEEE Transactions on Information Theory 27 (1), pp. 12–23. Cited by: §1.
  • [29] M. Tschannen, E. Agustsson, and M. Lucic (2018) Deep generative models for distribution-preserving lossy compression. In Advances in Neural Information Processing Systems, pp. 5929–5940. Cited by: §1.
  • [30] J. Van Leeuwen (1976) On the construction of huffman trees.. In ICALP, pp. 382–410. Cited by: §2.1.
  • [31] T. van Rozendaal, G. Sautière, and T. S. Cohen (2020) Lossy compression with distortion constrained optimization. arXiv preprint arXiv:2005.04064. Cited by: §1, §2.1.
  • [32] G. K. Wallace (1992) The jpeg still picture compression standard. IEEE transactions on consumer electronics 38 (1), pp. xviii–xxxiv. Cited by: §1, §2.1.
  • [33] M. Xiao, S. Zheng, C. Liu, Y. Wang, D. He, G. Ke, J. Bian, Z. Lin, and T. Liu (2020) Invertible image rescaling. arXiv preprint arXiv:2005.05650. Cited by: §2.2, §2.2.

Appendix 0.A Experimental Details

Network Structure  The coupling layers are parametrized by neural networks with a bottleneck-like structure consisting of three convolutional layers, where the first and last are , and the middle one is . This design improves parameter efficiency by allowing a larger number of channels at an acceptable computational cost. The architecture details of our implementation are specified in Table 1.

Input Size   Input Channel   Width   Kernel Size (K)
64 × 64      48              128     5 × 5
32 × 32      192             256     3 × 3
16 × 16      768             1024    3 × 3
Table 1: Architectures for IEM
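
The appendix describes coupling layers parametrized by a small network. As a minimal sketch of why such a layer can be inverted exactly regardless of what the inner network computes, the following pure-Python example implements a generic affine coupling layer in the style of Real NVP [13]. The function `toy_subnet` and its `weights` are hypothetical stand-ins for the three-layer bottleneck convolution stack, and the flat list stands in for an image tensor; this is an illustration under those assumptions, not the authors' implementation.

```python
import math


def toy_subnet(h, weights):
    """Hypothetical stand-in for the bottleneck conv stack.

    Maps the untouched half to a (log_scale, shift) pair per element.
    Any function works here; it never needs to be inverted.
    """
    log_scale = [math.tanh(w * v) for w, v in zip(weights, h)]
    shift = [w + v for w, v in zip(weights, h)]
    return log_scale, shift


def coupling_forward(x, weights):
    """Affine coupling: keep the first half, transform the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_s, t = toy_subnet(x1, weights)
    y2 = [b * math.exp(s) + sh for b, s, sh in zip(x2, log_s, t)]
    return x1 + y2


def coupling_inverse(y, weights):
    """Exact inverse: recompute (log_scale, shift) from the untouched half."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    log_s, t = toy_subnet(y1, weights)
    x2 = [(b - sh) * math.exp(-s) for b, s, sh in zip(y2, log_s, t)]
    return y1 + x2


x = [0.5, -1.2, 2.0, 0.3]   # toy input of even length
weights = [0.7, -0.4]       # one weight per element of the first half
y = coupling_forward(x, weights)
x_rec = coupling_inverse(y, weights)
max_error = max(abs(a - b) for a, b in zip(x, x_rec))
print(max_error)
```

The inverse only re-evaluates the inner network on the half that passed through unchanged, which is why the inner network itself can be an arbitrary (non-invertible) convolution stack.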

Training Details  Throughout training, we use the Adam optimizer [20] with and . Different initial learning rates are used for IEM and the entropy model, denoted as and respectively. Both start to decay after the first 0.1 million iterations, and the decayed rate is computed as


, where , is a step counter, and is a decay factor that is practically around 0.999995. and are set to and respectively.
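
The decay formula itself is not recoverable from the text above. One common schedule consistent with the description (a step counter, a per-step decay factor of about 0.999995, and decay starting at 0.1 million iterations) is exponential decay, and the sketch below assumes that form. The function name `decayed_lr`, the exponential form, and the initial rate `1e-4` are assumptions made for illustration only; they are not values taken from the paper.

```python
def decayed_lr(initial_lr, step, decay_start=100_000, gamma=0.999995):
    """Assumed exponential schedule: constant until decay_start, then
    multiplied by gamma once per additional step."""
    steps_decayed = max(0, step - decay_start)
    return initial_lr * gamma ** steps_decayed


# Hypothetical initial learning rate, used only to show the shape of the curve.
initial_lr = 1e-4
for step in (0, 100_000, 300_000, 1_000_000):
    print(step, decayed_lr(initial_lr, step))
```

With a factor this close to 1, the rate falls by roughly a factor of e every 200,000 steps after decay begins, so the schedule is gentle over a long training run.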

As for the training objective, the coefficients need to be carefully tuned for each bitrate. Empirically, we set and , and use grid search to look for optimal weights of and ranging from to . Note that this does not mean is optimal in all scenarios; experimentally, however, there is no significant difference in performance between simultaneously changing and , and adjusting only.

Randomly Sampled Dataset for Evaluation  The subset of ImageNet [12] used for evaluation is randomly sampled; the file names are listed in Table 2.

ILSVRC2012_test_00000505 ILSVRC2012_test_00001058 ILSVRC2012_test_00001246
ILSVRC2012_test_00001444 ILSVRC2012_test_00005577 ILSVRC2012_test_00006189
ILSVRC2012_test_00007775 ILSVRC2012_test_00008278 ILSVRC2012_test_00008440
ILSVRC2012_test_00008460 ILSVRC2012_test_00009563 ILSVRC2012_test_00010174
ILSVRC2012_test_00011618 ILSVRC2012_test_00012189 ILSVRC2012_test_00012215
ILSVRC2012_test_00012448 ILSVRC2012_test_00012492 ILSVRC2012_test_00012706
ILSVRC2012_test_00012814 ILSVRC2012_test_00013452 ILSVRC2012_test_00016082
ILSVRC2012_test_00020129 ILSVRC2012_test_00023175 ILSVRC2012_test_00023432
ILSVRC2012_test_00023806 ILSVRC2012_test_00024649 ILSVRC2012_test_00027574
ILSVRC2012_test_00028721 ILSVRC2012_test_00029016 ILSVRC2012_test_00029428
ILSVRC2012_test_00030152 ILSVRC2012_test_00030240 ILSVRC2012_test_00030535
ILSVRC2012_test_00030594 ILSVRC2012_test_00033552 ILSVRC2012_test_00035652
ILSVRC2012_test_00036628 ILSVRC2012_test_00037322 ILSVRC2012_test_00037832
ILSVRC2012_test_00038678 ILSVRC2012_test_00038827 ILSVRC2012_test_00039565
ILSVRC2012_test_00040346 ILSVRC2012_test_00040449 ILSVRC2012_test_00042420
ILSVRC2012_test_00042491 ILSVRC2012_test_00042549 ILSVRC2012_test_00042583
ILSVRC2012_test_00042631 ILSVRC2012_test_00043029 ILSVRC2012_test_00043873
ILSVRC2012_test_00046159 ILSVRC2012_test_00047576 ILSVRC2012_test_00049924
ILSVRC2012_test_00050181 ILSVRC2012_test_00051384 ILSVRC2012_test_00053071
ILSVRC2012_test_00053603 ILSVRC2012_test_00054755 ILSVRC2012_test_00055533
ILSVRC2012_test_00063855 ILSVRC2012_test_00063995 ILSVRC2012_test_00066697
ILSVRC2012_test_00066914 ILSVRC2012_test_00068457 ILSVRC2012_test_00068596
ILSVRC2012_test_00070325 ILSVRC2012_test_00071340 ILSVRC2012_test_00071954
ILSVRC2012_test_00073228 ILSVRC2012_test_00073412 ILSVRC2012_test_00074198
ILSVRC2012_test_00075878 ILSVRC2012_test_00076257 ILSVRC2012_test_00078714
ILSVRC2012_test_00078897 ILSVRC2012_test_00080599 ILSVRC2012_test_00081961
ILSVRC2012_test_00082349 ILSVRC2012_test_00085480 ILSVRC2012_test_00085775
ILSVRC2012_test_00086081 ILSVRC2012_test_00086280 ILSVRC2012_test_00086391
ILSVRC2012_test_00086845 ILSVRC2012_test_00087796 ILSVRC2012_test_00087939
ILSVRC2012_test_00089036 ILSVRC2012_test_00089377 ILSVRC2012_test_00089680
ILSVRC2012_test_00090568 ILSVRC2012_test_00090956 ILSVRC2012_test_00093267
ILSVRC2012_test_00093343 ILSVRC2012_test_00096391 ILSVRC2012_test_00096900
ILSVRC2012_test_00099475 ILSVRC2012_test_00099912 ILSVRC2015_test_00002901
Table 2: All the file names in the subset of ImageNet used for evaluation.

Appendix 0.B Performance without Quantization

We test our model without quantization and find that its performance improves remarkably. In contrast, the irreversible auto-encoder gains only slightly, which shows the advantage of invertibility in reconstruction. Such a large gap caused by quantization suggests considerable potential in information modelling and points to a direction for future work.

Baseline                      ILC
NQ (dB) / Q (dB)   bpp        NQ (dB) / Q (dB)   bpp
31.61 / 29.84      0.2984     40.62 / 30.59      0.3047
33.45 / 31.63      0.4752     42.43 / 32.36      0.4917
34.05 / 32.24      0.5308     43.07 / 32.59      0.5295
35.22 / 33.42      0.674      44.11 / 33.4       0.6197
36.36 / 34.61      0.8232     45.49 / 34.62      0.7798
Table 3: Evaluation performance with quantization (Q) and without quantization (NQ). ILC improves far more than the baseline when quantization is removed.
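
To make the size of this gap concrete, the following sketch computes the gain in dB from removing quantization (NQ minus Q) for each row of Table 3 and averages it per method. The numbers are copied from the table; the variable names are illustrative.

```python
# (NQ dB, Q dB) pairs copied from Table 3, one pair per rate point.
baseline = [(31.61, 29.84), (33.45, 31.63), (34.05, 32.24),
            (35.22, 33.42), (36.36, 34.61)]
ilc = [(40.62, 30.59), (42.43, 32.36), (43.07, 32.59),
       (44.11, 33.4), (45.49, 34.62)]


def mean_gain(rows):
    """Average improvement in dB when quantization is removed."""
    return sum(nq - q for nq, q in rows) / len(rows)


baseline_gain = mean_gain(baseline)
ilc_gain = mean_gain(ilc)
print(f"baseline: {baseline_gain:.2f} dB")  # prints "baseline: 1.79 dB"
print(f"ILC:      {ilc_gain:.2f} dB")       # prints "ILC:      10.43 dB"
```

On these five rate points the baseline gains under 2 dB on average while ILC gains over 10 dB, which is the gap the appendix attributes to quantization.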

Appendix 0.C More Results

Figure 4: R-D curve on Kodak-02 compared between our framework and baseline method. (a) illustrates the performance on (luma) PSNR, and (b) illustrates the performance on (luma) MS-SSIM
(a) Baseline (luma) PSNR: 31.68 dB, (rgb) PSNR: 30.62 dB, (luma) MS-SSIM: 0.9382, bpp: 0.2127
(b) ILC (luma) PSNR: 32.26 dB, (rgb) PSNR: 31.17 dB, (luma) MS-SSIM: 0.9507, bpp: 0.2174
(c) Original Image
Figure 5: The performance comparison on Kodak-02. (a) shows three images, which are the reconstruction from baseline, the reconstruction from ILC and the original image, respectively from left to right. (b), (c) and (d) are image patches from (a)-left, (a)-middle, and (a)-right respectively.
Figure 6: R-D curve on Kodak-03 compared between our framework and baseline method. (a) illustrates the performance on (luma) PSNR, and (b) illustrates the performance on (luma) MS-SSIM
(a) Baseline (luma) PSNR: 32.72 dB, (rgb) PSNR: 32.20 dB, (luma) MS-SSIM: 0.9640, bpp: 0.1924
(b) ILC (luma) PSNR: 33.39 dB, (rgb) PSNR: 32.72 dB, (luma) MS-SSIM: 0.9715, bpp: 0.1967
(c) Original Image
Figure 7: The performance comparison on Kodak-03. (a) shows three images, which are the reconstruction from baseline, the reconstruction from ILC and the original image, respectively from left to right. (b), (c) and (d) are image patches from (a)-left, (a)-middle, and (a)-right respectively.
Figure 8: R-D curve on Kodak-10 compared between our framework and baseline method. (a) illustrates the performance on (luma) PSNR, and (b) illustrates the performance on (luma) MS-SSIM
(a) Baseline (luma) PSNR: 31.72 dB, (rgb) PSNR: 30.91 dB, (luma) MS-SSIM: 0.9605, bpp: 0.2156
(b) ILC (luma) PSNR: 32.42 dB, (rgb) PSNR: 31.50 dB, (luma) MS-SSIM: 0.9698, bpp: 0.2235
(c) Original Image
Figure 9: The performance comparison on Kodak-10. (a) shows three images, which are the reconstruction from baseline, the reconstruction from ILC and the original image, respectively from left to right. (b), (c) and (d) are image patches from (a)-left, (a)-middle, and (a)-right respectively.
Figure 10: R-D curve on Kodak-15 compared between our framework and baseline method. (a) illustrates the performance on (luma) PSNR, and (b) illustrates the performance on (luma) MS-SSIM
(a) Baseline (luma) PSNR: 31.56 dB, (rgb) PSNR: 30.75 dB, (luma) MS-SSIM: 0.9565, bpp: 0.2195
(b) ILC (luma) PSNR: 32.09 dB, (rgb) PSNR: 31.19 dB, (luma) MS-SSIM: 0.9666, bpp: 0.2231
(c) Original Image
Figure 11: The performance comparison on Kodak-15. (a) shows three images, which are the reconstruction from baseline, the reconstruction from ILC and the original image, respectively from left to right. (b), (c) and (d) are image patches from (a)-left, (a)-middle, and (a)-right respectively.

Appendix 0.D Ablation Study

Figure 12: To further verify the effectiveness of our framework, we conduct an ablation study on several modules. The suffix -no11 denotes removing all the invertible 1×1 convolutions. The suffix -noKDM denotes training without knowledge distillation, i.e., the objective has only three components, corresponding to distortion, rate, and distribution matching.