1 Introduction
Lossy image compression has long played a central role in image storage and transfer, especially given today's exploding amount of large-sized images and the comparatively limited storage and bandwidth. Transform coding based methods in particular perform well and are widely adopted in practice, such as JPEG [32], JPEG2000 [27] and BPG [9]. Recently, deep-learning-based lossy image compression methods [6, 7, 26, 24, 19, 31, 1, 8, 29, 25] have attracted great interest due to their impressive performance at low bitrates.
Unlike previous transform coding methods, deep-learning-based methods first transform an image $x$ into a lower-dimensional latent representation $y$ by an encoder network, then quantize $y$ into a discrete-valued vector $\hat{y}$. Lossless entropy coding methods, such as arithmetic coding [28], are then applied to $\hat{y}$ to compress it into a bitstream. Some prior works [6, 7, 26, 23, 24] adopt an auxiliary network as an entropy model to estimate the density of $\hat{y}$ and provide its statistics to the entropy coder. A decoder network is used to approximate the inverse function that maps the latent variables back to pixels. Information is inevitably lost when the encoder network reduces the representation dimensions and when quantization rounds floating-point numbers.
The reduction in the dimensionality of the representation $y$ induces a significant drop in the amount of information retained from the original image $x$, in line with the goal of compression, which is to reduce the entropy of the representation under a prior probability entropy model. Previous works [7, 26] mainly focus on finding a reasonable estimation of the prior density to minimize the expected length of the bitstream, but few put effort into improving the expected distortion of the reconstructed image with respect to the original. This information loss poses a notable challenge to recovering the original image using a decoder network alone, since the task becomes ill-posed, which clearly affects the rate-distortion optimization. One way to keep all the information during encoding is to preserve the dimensionality of the representation, i.e. keep it the same as that of the original image. The reconstruction stage would benefit from such an informative representation, yet it is extremely hard to encode such high-dimensional data at a desirably low bitrate with lossless entropy coding methods.
In this paper, we propose a novel framework for lossy image compression called Invertible Lossy Compression (ILC) to tackle this intractable problem by capturing as much knowledge of the lost information as possible. The recovery of the original image can benefit from modelling the statistical behaviour of the lost information in expectation during encoding. To this end, the lost information should be expressed in an explicit form and encoded into hidden states without unconscious information loss. The encoder-decoder framework employed in previous image compression methods does not meet this requirement well, since massive effort is needed to make the encoder invertible and the decoder its inverse, and even then there remains approximation error in the invertibility. Therefore, we adopt an invertible encoding module (IEM), which is strictly reversible and clearly more suitable for this pair of inverse tasks. IEM consists of two essential components: invertible downsampling layers, to enlarge the receptive field and decompose the spatial correlation among channels; and coupling layers, to enhance the expressive power of the module. Since an invertible model produces a representation of the same dimensionality as the input image $x$, we split the encoded representation into $y$ and $z$, where $y$ preserves the necessary information for image reconstruction and $z$ stores all the remaining information (like high-frequency noise that does not alter the original image much). For lossy compression, we discard $z$ and encode the quantized $\hat{y}$ into a bitstream, while also modelling the distribution of the lost information $z$. For reconstruction, we feed the IEM inversely with $\hat{y}$ and a randomly drawn sample of $z$ from the learned distribution.
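To make the split-and-discard idea concrete, the following toy sketch (our own illustration, not the paper's actual network: a fixed orthogonal matrix stands in for the IEM) encodes a vector, keeps only the quantized $y$ part, and inverts with the most-likely surrogate for $z$:

```python
import numpy as np

rng = np.random.default_rng(0)
# A fixed orthogonal matrix stands in for the invertible encoding module (IEM):
# its transpose is its exact inverse, so the transform is strictly reversible.
W, _ = np.linalg.qr(rng.normal(size=(16, 16)))

x = rng.normal(size=16)              # toy "image"
u = W @ x                            # forward pass of the invertible transform
y, z = u[:4], u[4:]                  # split: keep y, treat z as the lost information

y_hat = np.round(y)                  # quantize y before entropy coding
z_surrogate = np.zeros_like(z)       # decode-time stand-in: most-likely value of a zero-mean prior

x_rec = W.T @ np.concatenate([y_hat, z_surrogate])  # inverse transform reconstructs the image
```

With the true $z$ and the unquantized $y$, the inversion is exact; reconstruction error comes only from quantizing $y$ and from replacing $z$ by a sample of its modelled distribution.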
To achieve our purpose of making the representation $y$ both informative and favourable for coding, it is crucial to make IEM capture the knowledge in the distribution of $z$ while removing the dependency between $y$ and $z$. To this end, we employ a distribution matching loss to encourage $z$ to follow a Gaussian distribution. More importantly, the quantization process changes the distribution of the representation from $y$ to $\hat{y}$, which disturbs the inverse procedure of IEM when recovering the image by combining $\hat{y}$ with $z$. It is challenging to train an invertible model under such a distribution mismatch. Observing that conventional model training is more robust to the quantization process, we propose a knowledge distillation module (KDM) to transfer knowledge from encoder-decoder models to IEM with knowledge distillation [11, 5, 17]. This usually starts by training a teacher model, an encoder-decoder compression model in our task, and then optimizing the target invertible student model so that it mimics the teacher model's behaviour. Empirically, we choose the output of the encoder network as the objective for distillation, combined with the reconstruction loss, entropy loss, and distribution loss for efficient optimization. The KDM provides soft labels and accelerates the convergence of the invertible model, allowing it to quickly match the teacher encoder-decoder model's performance. We conduct experiments on extensive benchmark datasets, and the comparison with the baseline method shows significant improvement, which proves the effectiveness of our proposed ILC.
The main contributions of this paper are highlighted as follows:

We propose a novel ILC framework that introduces an invertible encoding module into lossy image compression in place of the encoder-decoder framework, to simultaneously produce low-dimensional informative representations and capture the knowledge in the distribution of the information lost during encoding.

We propose an efficient training objective for IEM that encourages the latent variable $z$ to obey a pre-specified distribution, which can be easily sampled and provides rich information during decoding. We further employ a knowledge distillation module based on a teacher model to overcome the distribution mismatch problem brought by quantization in compression and to accelerate the optimization of IEM.

Extensive experimental results demonstrate that simply replacing the encoder and decoder networks with our ILC framework boosts the quality of reconstructed images on various benchmark datasets.
2 Related Work
2.1 Lossy Image Compression
The traditional lossy compression algorithms for images [32, 27, 9] use prior knowledge to decouple the low- and high-frequency signals (e.g. the discrete cosine transform (DCT), discrete wavelet transform (DWT), etc.), and perform reversible coding algorithms (e.g. Huffman coding [30]) on the quantized signals to achieve compression.
In recent years, more and more deep-learning-based works have emerged and attained astonishing results [6, 7, 26, 24, 19, 31, 8, 22]. Generally, they use autoencoders, which are widely employed in representation learning and generative modelling. The informative bottleneck representation in the autoencoder framework is well-suited for lossy compression. These methods use a neural network, instead of prior knowledge, as an encoder to extract a lower-dimensional representation directly from the image, and quantize it so that it can be coded into a bitstream. When decompressing the image, a decoder network is employed to reconstruct the original image from the decoded quantized latent representation.
However, quantization leads to differentiability issues. In the past few years, to enable the entire optimization process to be carried out end-to-end, a consensus has been reached that uniform noise should be superimposed on the latent signals as a soft quantization during training. The previous works are also roughly the same in their choice of coding algorithm, namely arithmetic coding, an efficient coding algorithm based on entropy modelling. This raises the question of how to better estimate the entropy of the latent variable, and much work on this issue has made positive progress. For example, hyperprior and autoregressive components are used [7, 26] to jointly model the latent feature map so that a better-performing entropy model can be constructed. Besides, there is also work on the autoencoder itself that helps improve the neural network's capacity to extract features; a non-local attention block was introduced to help the autoencoder capture the local and global correlation of the latent feature map [24]. Although previous methods show promising performance, they ignore the difficulty that non-negligible information loss during encoding brings to the reconstruction stage. In this work, we identify this problem and propose to mitigate it by explicitly modelling the lost information.
2.2 Invertible Neural Network
It has been shown in many scenarios that neural networks with invertibility as a core design principle can achieve the same or even better performance than non-invertible neural networks [21, 33, 16].
As for generative models, denoting the input data as $x$ and an invertible neural network as $f_\theta$, with $z = f_\theta(x)$, the inverse function $f_\theta^{-1}$ can be trivially obtained, so that $x$ can be easily sampled as $x = f_\theta^{-1}(z)$ with $z \sim p_Z(z)$. Furthermore, the density function of $x$ can be explicitly defined, which allows training by maximum likelihood. GANs and VAEs are also well-known generative models, but both have defects. In VAEs, the posterior over the latent variables can only be approximately inferred from data, which means the training objective is not exact but a variational lower bound on the log-likelihood of the data. For GANs, due to the lack of an encoder, inference cannot be performed, which significantly impedes their usability in our scenario. In contrast, with invertibility, a neural network can not only accurately evaluate the log-likelihood and perform inference, but is also naturally well-suited to synthesis.
In order to ensure the invertibility of the whole system, each sub-block of the neural network is designed to be invertible, which makes maintaining model capacity a top priority while limiting the structural design. To address this concern, we take the coupling layer with affine transformation, introduced in RealNVP [13], as a general solution. Given a $D$-dimensional input $u$ and a slicing position $d < D$, the $D$-dimensional output $v$ of an affine coupling layer follows the equations:

$$v_{1:d} = u_{1:d} + \phi(u_{d+1:D}), \qquad v_{d+1:D} = u_{d+1:D} \odot \exp\!\big(\alpha \cdot \rho(v_{1:d})\big) + \eta(v_{1:d}) \quad (1)$$

where $\phi$, $\rho$, and $\eta$ are arbitrary dimension-matching functions, and $\alpha$ is a constant factor serving as a clamp.
For the inverse, given a $D$-dimensional $v$ and the slicing position $d$, the $D$-dimensional $u$ follows:

$$u_{d+1:D} = \big(v_{d+1:D} - \eta(v_{1:d})\big) \odot \exp\!\big(-\alpha \cdot \rho(v_{1:d})\big), \qquad u_{1:d} = v_{1:d} - \phi(u_{d+1:D}) \quad (2)$$
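A minimal numpy sketch of this coupling (with element-wise stand-ins for the learned functions $\phi$, $\rho$, $\eta$, a half-and-half split so shapes match, and $\alpha = 1$) shows that the inverse recovers the input exactly:

```python
import numpy as np

ALPHA = 1.0  # clamp factor

# Stand-ins for the learned transformation functions; any shape-preserving maps work here.
phi = np.tanh
rho = np.tanh
eta = lambda h: 0.5 * h

def coupling_forward(u, d):
    u1, u2 = u[:d], u[d:]
    v1 = u1 + phi(u2)                               # additive update of the first branch
    v2 = u2 * np.exp(ALPHA * rho(v1)) + eta(v1)     # affine update of the second branch
    return np.concatenate([v1, v2])

def coupling_inverse(v, d):
    v1, v2 = v[:d], v[d:]
    u2 = (v2 - eta(v1)) * np.exp(-ALPHA * rho(v1))  # undo the affine update
    u1 = v1 - phi(u2)                               # undo the additive update
    return np.concatenate([u1, u2])
```

Because each inversion step only reads values that are already available at that point, invertibility holds for arbitrary $\phi$, $\rho$, $\eta$.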
INNs have also been applied to paired data, and this idea has been demonstrated in different problems. For example, Ardizzone et al. [3] analyzed real-world problems from medicine and astrophysics. In image compression tasks, the classical Maximum Mean Discrepancy (MMD) [14] method fails to measure the difference between such high-dimensional probability distributions. A conditional INN [4] has been designed for and applied to guided image generation and colorization; in their task the guidance is given as a condition, which is clearly not suitable for our aim. Recently, Xiao et al. [33] proposed to use an INN as a transformation between high- and low-resolution images. In their task, no explicit entropy constraint on the low-dimensional representation is considered, which is one of the biggest challenges in image compression tasks.
3 Invertible Lossy Compression
Figure 1 shows the general framework of previous variational-autoencoder-based lossy image compression [6, 7, 26]. Typically it consists of three modules: an encoder network to transform the original image $x$ into a low-dimensional representation $y$, an entropy model to estimate the density of the quantized representation $\hat{y}$, and a decoder network to reconstruct the image from $\hat{y}$. Figure 1 also illustrates a sketch of the general framework of ILC. The main difference between ILC and previous methods is that, instead of the commonly used encoder-decoder module, an invertible encoding module (IEM) is adopted to explicitly model the information lost during encoding, which mitigates the difficulty in the reconstruction stage. In the forward procedure, the image $x$ is transformed into a coding target $y$ and an auxiliary latent variable $z$. We enforce the transformed $z$ to follow a specified distribution independent of $y$, which captures the statistical knowledge of the lost information for reconstruction. This can be achieved since any absolutely continuous distribution is guaranteed to be transformable bijectively into a standard Gaussian [18] through the knowledge in the network. For the inverse, we can then utilize a sample $z'$ to replace the original $z$ and reconstruct the image from the quantized $\hat{y}$ through the inverse transformation. Furthermore, a knowledge distillation module (KDM) is coupled during the training process to stabilize the optimization and accelerate training convergence. We describe these components in detail below.
3.1 Invertible Encoding Module
Our IEM consists of two basic invertible components, i.e., the invertible downsampling layer and the coupling layer. As shown in Figure 1, IEM takes the place of both the encoder and decoder modules in our framework. When the image is being compressed, IEM acts as the encoder by taking $x$ as input and feeding it through the forward direction of the model. In contrast, IEM is applied inversely to serve as the decoder when the compressed image is being reconstructed.
Invertible Downsampling Layer Downsampling is necessary for the entropy model to enlarge the receptive field and decompose the spatial correlation in compression. However, the traditional downsampling layers in previous learning-based compression methods are clearly irreversible. Therefore, we carefully design an invertible downsampling layer; Figure 2 shows the detailed architecture. Firstly, we introduce a wavelet transformation as one of our basic modules. The inverse of each wavelet transformation naturally exists. More importantly, wavelet transformations provide strong prior knowledge by explicitly separating the low- and high-frequency signals, from which an image compression algorithm definitely benefits [2]. In contrast, previous deep-learning-based methods fail to integrate this into their models. Specifically, the wavelet transformation turns an input tensor of height $H$, width $W$, and channel number $C$ into a tensor of shape $(\frac{H}{2}, \frac{W}{2}, 4C)$, where the first $C$ slices are low-frequency contents equivalent to average pooling, and the rest are high-frequency components corresponding to residuals in the vertical, horizontal and diagonal directions. We employ the Haar wavelet transformation in our framework, which is easy to implement while enlarging the receptive field without information loss and incorporating certain prior knowledge.
Although the non-learnable wavelet transformation can downsample the input image to enlarge the receptive field, provide a strong prior for the model and be strictly invertible, it suffers from a fixed split of information, i.e. only the first quarter of channels are the main low-frequency contents. A more flexible separation can better adapt our channel split between $y$ and $z$ and makes training more tractable. Therefore, we use a 1×1 invertible convolution after the wavelet transformation to refine the split of information. The 1×1 invertible convolution was originally proposed in GLOW [21] for channel-wise permutation. Different from that purpose, we leverage it to make a channel-wise refinement after the wavelet transformation and initialize its weight as an identity matrix.
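A possible numpy sketch of this Haar downsampling step (one plausible normalization; the paper does not spell out its scaling) maps each 2×2 block to one average channel and three residual channels, and the inverse recovers the input exactly:

```python
import numpy as np

def haar_downsample(x):
    """x: (H, W, C) -> (H/2, W/2, 4C); the first C channels are the low-frequency average."""
    a = x[0::2, 0::2, :]
    b = x[0::2, 1::2, :]
    c = x[1::2, 0::2, :]
    d = x[1::2, 1::2, :]
    ll = (a + b + c + d) / 2.0   # average (low frequency)
    lh = (a - b + c - d) / 2.0   # horizontal residual
    hl = (a + b - c - d) / 2.0   # vertical residual
    hh = (a - b - c + d) / 2.0   # diagonal residual
    return np.concatenate([ll, lh, hl, hh], axis=2)

def haar_upsample(y):
    """Exact inverse of haar_downsample: (H/2, W/2, 4C) -> (H, W, C)."""
    C = y.shape[2] // 4
    ll, lh, hl, hh = np.split(y, 4, axis=2)
    a = (ll + lh + hl + hh) / 2.0
    b = (ll - lh + hl - hh) / 2.0
    c = (ll + lh - hl - hh) / 2.0
    d = (ll - lh - hl + hh) / 2.0
    x = np.empty((ll.shape[0] * 2, ll.shape[1] * 2, C))
    x[0::2, 0::2, :] = a
    x[0::2, 1::2, :] = b
    x[1::2, 0::2, :] = c
    x[1::2, 1::2, :] = d
    return x
```

The transform enlarges the channel dimension fourfold while halving the spatial dimensions, so no information is lost.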
Coupling Layer After the invertible downsampling layer, the feature map is roughly broken into two segments: a low-frequency component carrying the majority of the information, and a high-frequency component mainly composed of the former's residuals; these are then fed into stacks of convolutional layers that further abstract the corresponding signals. Taking the input of a coupling layer, the feature map is partitioned along the channel dimension into two components, denoted as $h_1$ and $h_2$ respectively. Notice that the split ratio is kept consistent with the proportion of dimensions in the final outputs $y$ and $z$. To further decouple the signals, we employ an affine transformation, proposed in RealNVP [13], on the high-frequency feature map and fuse the captured signal into the low-frequency branch. On the other hand, if the compression rate is low, we expect simple information to be transferred into $z$ so that we can model it in the latent space rather than have it eliminated by quantization. As a result, we apply another affine transformation for the flow of information from the low- to the high-frequency component. The detailed structure is presented in Figure 2.
The complete expression of the affine transformation is shown in Eq. 1, where the constant factor $\alpha$ is set to 1, and the transformation functions ($\phi$, $\rho$ and $\eta$) can be arbitrary as long as the input and output dimensionalities match. To enhance the expressive power of our model while retaining lightweight computation, we employ a simple but effective bottleneck-like structure for the transformation functions, as shown in Figure 2.
3.2 Knowledge Distillation Module
Yet there remain several challenges that may significantly influence the optimization of IEM. On the one hand, due to its invertibility, IEM transforms the distributions of $x$ and $(y, z)$ into each other. However, quantization on $y$ noticeably changes its distribution, and IEM can fail to be robust to such data jitter prompted by quantization, inducing a non-negligible drop in performance. On the other hand, to guarantee invertibility, IEM has to make $z$ as independent from $y$ as possible and force both of them to follow the required distributions. Under these conditions, the massive inequality in the amount of information between $y$ and $z$ further increases the difficulty of optimization.
Therefore, we introduce the KDM to mitigate these stubborn challenges. Observing that encoder-decoder frameworks are much more robust to the distribution mismatch caused by quantization, most likely due to their separate training procedure, we propose to use the encoder network as a teacher model and encourage IEM to mimic its output representation at the training stage. In this way, IEM learns to output a representation whose distribution changes less during quantization, which makes the inverse function of IEM more robust to the distribution mismatch than before. More importantly, KDM clearly eases training in the early stage, so that IEM can refine the distribution of $y$ from the pre-trained encoder and, around it, look for a better low-dimensional manifold of the data distribution that can be efficiently inverted.
Practically, this requires a fully specified prior, a teacher encoder and its affiliated entropy module, which together furnish at least a suboptimal solution for the sophisticated knowledge in the distribution of $y$, with learnable uncertain components. Specifically, the fully trained prior is utilized to initialize the entropy model for IEM. Throughout training, the teacher's knowledge is transferred through a distillation loss guiding $y$, which is evaluated by feeding the same $x$ into both the teacher encoder and IEM and then computing their difference on $y$.
3.3 Optimization Objective
Our fundamental training objective follows previous learning-based lossy compression methods, i.e. to minimize a weighted sum of the rate $R$ and distortion $D$, where the rate is lower-bounded by the entropy of the discrete probability distribution of the quantized vector $\hat{y}$, and the distortion is the difference between the reconstructed image and the original input $x$. In addition, two novel objectives are included for our invertible framework and efficient training: (1) a distribution matching loss for capturing the distribution of the lost information, and (2) a distillation loss that stabilizes our training procedure.
Rate Our basic goal is to minimize the rate of the coding target $\hat{y}$. Therefore, we leverage entropy as our objective, consistent with previous works [6, 7, 26]. As mentioned above, an entropy module is used to estimate the entropy of $\hat{y}$. We follow their loss definition and denote it as:

$$\mathcal{L}_{\text{rate}} = \mathbb{E}\big[-\log_2 p_{\tilde{y}}(\tilde{y})\big], \qquad \tilde{y} = y + u \quad (3)$$

where $\tilde{y}$ is the approximation of the quantization and $u$ is additive i.i.d. uniform noise with the same width as the quantization bins, which in our case is one.
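As a concrete, hedged illustration of the rate objective, the sketch below scores integer symbols under a discretized Gaussian prior with unit-width bins; the paper's actual entropy model is learned [6, 7], so this fixed prior is only a stand-in:

```python
import math

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def rate_bits(y_hat, mu=0.0, sigma=1.0):
    """Bits needed to code integer symbols under a discretized Gaussian prior:
    P(k) = CDF(k + 0.5) - CDF(k - 0.5), i.e. unit-width quantization bins."""
    bits = 0.0
    for k in y_hat:
        p = std_normal_cdf((k + 0.5 - mu) / sigma) - std_normal_cdf((k - 0.5 - mu) / sigma)
        bits += -math.log2(p)
    return bits
```

Symbols far from the prior mean cost more bits, which is exactly what the entropy loss penalizes in expectation.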
Distortion Because of the uncertainty of $z$ and the quantization of $y$, a distortion loss is still required to ensure the inverse reconstruction of our model. We denote the reverse process as $g$. The distortion loss encourages our model to adapt to a new sample drawn from the specified distribution. We formulate the loss as:

$$\mathcal{L}_{\text{dist}} = \mathcal{D}\big(x,\; g(\hat{y}, z')\big) \quad (4)$$

where $z'$ is a sample from the specified distribution. For training stability in practice, we empirically take the most-likely sample from the distribution for reconstruction, and we employ the $\ell_2$ loss as $\mathcal{D}$. Different from previous works that use the noisy representation $\tilde{y}$ as an approximation to $\hat{y}$ in the distortion term, we directly employ $\hat{y}$ with the Straight-Through Estimator [10] during optimization. This is mainly because of the inconsistency between $\tilde{y}$ in training and $\hat{y}$ in inference, which may negatively influence the generalization of IEM, as it shares the same parameters during encoding and decoding.
Distribution Matching This part of the training objective enforces the transformation from the $y$-dependent lost information to a standard Gaussian representation. Denoting the distribution of $z$ transformed from the data distribution as $p_z$, we aim to minimize its difference from the specified $y$-independent distribution $q = \mathcal{N}(0, I)$. For practical optimization, we employ the cross-entropy (CE) to measure this difference, which leads to the objective:

$$\mathcal{L}_{\text{distr}} = \mathrm{CE}(p_z, q) = -\,\mathbb{E}_{z \sim p_z}\big[\log q(z)\big] \quad (5)$$

The distribution loss encourages $z$ to follow the same target distribution for every $y$, thereby encouraging independence between them.
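Under the standard-Gaussian target $q = \mathcal{N}(0, I)$, the per-sample term $-\log q(z)$ has a closed form; a small stdlib sketch (our own illustration of the loss, not the paper's code):

```python
import math

def gaussian_nll(z):
    """Negative log-density of vector z under a standard Gaussian N(0, I);
    averaging this over samples of z estimates the cross-entropy CE(p_z, N(0, I))."""
    d = len(z)
    return 0.5 * sum(v * v for v in z) + 0.5 * d * math.log(2.0 * math.pi)
```

Minimizing the average of this quantity pushes the transformed $z$ toward a zero-mean, unit-variance Gaussian.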
Distillation As mentioned in the previous section, we leverage a distillation module to stabilize our training. The distillation loss based on a teacher model is defined as:

$$\mathcal{L}_{\text{distill}} = \mathcal{D}'\big(y_{\text{IEM}},\; y_{\text{teacher}}\big) \quad (6)$$

where $\mathcal{D}'$ is a difference metric; we use the $\ell_2$ loss in practice.
Total loss Combining these objectives, our total loss for training is:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{rate}} + \lambda_2 \mathcal{L}_{\text{dist}} + \lambda_3 \mathcal{L}_{\text{distr}} + \lambda_4 \mathcal{L}_{\text{distill}} \quad (7)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are coefficients for balancing the different loss terms.
4 Experiments
The dataset we used for training is a subset of ImageNet [12]. We filter out 9250 images larger than a threshold size, so that the data is easy to preprocess. All training images are preprocessed by random rescaling and cropping. Evaluations were performed on the Kodak image dataset [15], commonly used as test data for compression problems, and on a randomly sampled subset of ImageNet with 100 uncompressed images, none of which is used in training.
Our idea is verified on [6], since it has an open-source and reproducible implementation. With the number of filters set to 256, the teacher encoder is trained following the instructions in [6]. We use the Adam optimization algorithm [20] for all parameters, with different learning rates for the convolutional autoencoder and the entropy model. Training and evaluation are performed at each compression rate separately; the compression rate is controlled by adjusting the ratio of distortion and rate in the objective function. To guarantee that training is thoroughly carried out, we train for at least one million iterations and ensure that the performance no longer increases over a certain period of time.
Afterwards, following the architecture in Figure 1, the pre-trained encoder is loaded as our teacher model. We jointly optimize all the parameters in IEM and the entropy model. The training dataset and preprocessing methods are the same as before, except for the learning rate, which gradually decays from 0.1 million iterations onward. Practically, for different bitrates, we use the same distillation coefficient $\lambda_4$ during training, and adjust all the others, i.e. $\lambda_1$, $\lambda_2$ and $\lambda_3$, in our objective function (7).
In line with previous work, we quantify our model with the peak signal-to-noise ratio (PSNR) on both RGB images and the luma component. We compare the rate-distortion performance of our method to the baseline approach, i.e. the teacher model in our framework. For evaluating PSNR, we observe that $z$ has to be sampled from a distribution with lower variance to obtain better performance. This is consistent with GLOW [21], which samples from a distribution with shrunk variance to prevent mode collapse in generated images. In our case, the priority is not variance but reconstruction performance, so instead of sampling, we take the most-likely $z$ from the specified distribution. Compared with the baseline method, at the same compression ratio, ILC improves the PSNR by around 0.4 dB. Empirically, as the compression ratio rises (i.e. the bitrate decreases), ILC boosts performance more. Intuitively, under more severe bitrate demands, the convolutional autoencoder tends to undertake more responsibility in further extracting a refined representation of the data, which leads to more irreversible information loss; in this case, explicitly modelling the high-frequency signals becomes the key to success.
To further verify that ILC's superiority over the autoencoder brings the improvement, we compare this result with ILC guided by an early-stopped teacher model. It shows a strong positive correlation between the performance of ILC and its teacher model, which supports our view that ILC is strongly guided by the teacher and always looks for a manifold surrounding it. ILC's advantage lies mainly in reconstruction, so it could be leveraged jointly with other, more advanced entropy models to improve performance.
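For reference, PSNR is computed from the mean squared error; a minimal sketch for 8-bit images:

```python
import numpy as np

def psnr(x, x_rec, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between a reference image and its reconstruction."""
    mse = np.mean((np.asarray(x, dtype=np.float64) - np.asarray(x_rec, dtype=np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A 0.4 dB gain at equal bitrate therefore corresponds to a roughly 9% reduction in mean squared error.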
5 Conclusion
In this paper, we propose ILC, a novel framework for lossy image compression that explicitly models the information lost during encoding. By doing so, the ill-posed problem of the reconstruction stage is largely mitigated. To achieve our purpose, we design an invertible encoding module to replace the encoder-decoder network in previous methods, which is clearly more suitable for this pair of inverse tasks. With the latent variable's statistical knowledge, IEM can reconstruct the original input image from the compressed bitstream with high quality by drawing a sample from a pre-specified distribution. To overcome the intractable challenges in optimization, we propose a knowledge distillation module that provides soft labels from a teacher model, which significantly accelerates the training procedure. Extensive experiments demonstrate that our framework ILC significantly improves performance over existing methods.
References

[1] (2019) Generative adversarial networks for extreme learned image compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 221–231.
[2] (1992) Image coding using wavelet transform. IEEE Transactions on Image Processing 1(2), pp. 205–220.
[3] (2019) Analyzing inverse problems with invertible neural networks. In Proceedings of the International Conference on Learning Representations.
[4] (2019) Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392.
[5] (2014) Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662.
[6] (2017) End-to-end optimized image compression. In ICLR.
[7] (2018) Variational image compression with a scale hyperprior. In ICLR.
[8] (2018) Efficient nonlinear transforms for lossy image compression. In 2018 Picture Coding Symposium (PCS), pp. 248–252.
[9] (2015) BPG image format. URL https://bellard.org/bpg.
[10] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
[11] (2006) Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541.
[12] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[13] (2017) Density estimation using Real NVP. In Proceedings of the International Conference on Learning Representations.
[14] (2015) Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906.
[15] (1999) Kodak lossless true color image suite. Source: http://r0k.us/graphics/kodak.
[16] (2017) The reversible residual network: backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224.
[17] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[18] (1999) Nonlinear independent component analysis: existence and uniqueness results. Neural Networks 12(3), pp. 429–439.
[19] (2019) Computationally efficient neural image compression. arXiv preprint arXiv:1912.08771.
[20] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[21] (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
[22] (2019) Context-adaptive entropy model for end-to-end optimized image compression. In ICLR.
[23] (2019) Extended end-to-end optimized image compression method based on a context-adaptive entropy model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
[24] (2019) Non-local attention optimized deep image compression. arXiv preprint arXiv:1904.09757.
[25] (2019) Image and video compression with neural networks: a review. IEEE Transactions on Circuits and Systems for Video Technology.
[26] (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780.
[27] (2002) JPEG2000: image compression fundamentals, standards and practice. Journal of Electronic Imaging 11(2), p. 286.
[28] (1981) Universal modeling and coding. IEEE Transactions on Information Theory 27(1), pp. 12–23.
[29] (2018) Deep generative models for distribution-preserving lossy compression. In Advances in Neural Information Processing Systems, pp. 5929–5940.
[30] (1976) On the construction of Huffman trees. In ICALP, pp. 382–410.
[31] (2020) Lossy compression with distortion constrained optimization. arXiv preprint arXiv:2005.04064.
[32] (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38(1), pp. xviii–xxxiv.
[33] (2020) Invertible image rescaling. arXiv preprint arXiv:2005.05650.
Appendix 0.A Experimental Details
Network Structure The coupling layers are parametrized using neural networks with a bottleneck-like structure consisting of three convolutional layers. This design improves parameter efficiency by allowing a larger number of channels at an acceptable computational complexity. In our implementation, the detailed architecture is specified in Table 1.
Input Size | Input Channel | Width | Kernel Size (K)
-----------|---------------|-------|----------------
64×64      | 48            | 128   | 5×5
32×32      | 192           | 256   | 3×3
16×16      | 768           | 1024  | 3×3
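The parameter budget implied by Table 1 can be checked with a short sketch. Note two assumptions not stated explicitly in the text: that the first and last convolutions are 1×1 (mapping between the input channels and the width), and that only the middle convolution uses the listed kernel size K.

```python
# Parameter count of a three-layer bottleneck subnet, per the widths in Table 1.
# ASSUMPTION: first/last convolutions are 1x1 and only the middle one is KxK;
# the source does not state the end-layer kernel sizes explicitly.

def conv_params(c_in, c_out, k, bias=True):
    """Parameters of a 2D convolution: k*k*c_in*c_out weights (+ c_out biases)."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def bottleneck_params(c_in, width, k):
    """1x1 (c_in -> width), then KxK (width -> width), then 1x1 (width -> c_in)."""
    return (conv_params(c_in, width, 1)
            + conv_params(width, width, k)
            + conv_params(width, c_in, 1))

# Scales from Table 1: (input channels, width, kernel size K)
table_1 = [(48, 128, 5), (192, 256, 3), (768, 1024, 3)]
params_per_scale = {c_in: bottleneck_params(c_in, w, k) for c_in, w, k in table_1}
```

Under these assumptions, the wide middle convolution dominates the count at every scale, which is the point of the bottleneck design.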
Training Details Throughout training, we use the Adam optimizer [20]. Different initial learning rates are used for the IEM and the entropy model, denoted lr_IEM and lr_EM respectively. Both rates are held constant for the first 0.1 million iterations and then decay exponentially:

lr_t = lr_0 · γ^{max(t − 10⁵, 0)}    (8)

where lr_0 is the corresponding initial rate, t is the step counter, and γ is a decay factor that is practically around 0.999995.
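The schedule in Eq. (8) can be sketched in a few lines. The initial rate passed in below is a placeholder, not one of the paper's values:

```python
# Minimal sketch of the learning-rate schedule in Eq. (8): the rate is held
# constant for the first 0.1M steps, then decays by gamma per step.
def lr_at_step(lr0, t, gamma=0.999995, warm_steps=100_000):
    """Learning rate after t steps: lr0 * gamma ** max(t - warm_steps, 0)."""
    return lr0 * gamma ** max(t - warm_steps, 0)
```

With γ ≈ 0.999995, the rate falls by roughly a factor of e every 200k post-warmup steps, a gentle decay suited to long training runs.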
As for the training objective, the loss coefficients need to be carefully tuned for each bitrate. We fix some of the coefficients empirically and use grid search to look for the optimal weights of the remaining ones. Note that this does not mean the resulting setting is optimal in all scenarios; experimentally, however, there is no significant difference in performance between changing both searched weights simultaneously and adjusting only one of them.
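The tuning procedure is an ordinary exhaustive grid search. The sketch below uses a toy objective with a known minimum; in the actual setting, `evaluate` would train a model with the candidate weights and return its validation rate-distortion loss (the candidate values here are placeholders, not the paper's search range):

```python
import itertools

def grid_search(candidates_w1, candidates_w2, evaluate):
    """Return the (w1, w2) pair minimizing evaluate(w1, w2)."""
    return min(itertools.product(candidates_w1, candidates_w2),
               key=lambda w: evaluate(*w))

# Toy stand-in objective with its minimum at (0.1, 0.01).
toy = lambda w1, w2: (w1 - 0.1) ** 2 + (w2 - 0.01) ** 2
best = grid_search([0.01, 0.1, 1.0], [0.001, 0.01, 0.1], toy)
```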
Randomly Sampled Dataset for Evaluation The subset of ImageNet [12] we used for evaluation is randomly sampled; the selected images are listed below.
ILSVRC2012_test_00000505  ILSVRC2012_test_00001058  ILSVRC2012_test_00001246 
ILSVRC2012_test_00001444  ILSVRC2012_test_00005577  ILSVRC2012_test_00006189 
ILSVRC2012_test_00007775  ILSVRC2012_test_00008278  ILSVRC2012_test_00008440 
ILSVRC2012_test_00008460  ILSVRC2012_test_00009563  ILSVRC2012_test_00010174 
ILSVRC2012_test_00011618  ILSVRC2012_test_00012189  ILSVRC2012_test_00012215 
ILSVRC2012_test_00012448  ILSVRC2012_test_00012492  ILSVRC2012_test_00012706 
ILSVRC2012_test_00012814  ILSVRC2012_test_00013452  ILSVRC2012_test_00016082 
ILSVRC2012_test_00020129  ILSVRC2012_test_00023175  ILSVRC2012_test_00023432 
ILSVRC2012_test_00023806  ILSVRC2012_test_00024649  ILSVRC2012_test_00027574 
ILSVRC2012_test_00028721  ILSVRC2012_test_00029016  ILSVRC2012_test_00029428 
ILSVRC2012_test_00030152  ILSVRC2012_test_00030240  ILSVRC2012_test_00030535 
ILSVRC2012_test_00030594  ILSVRC2012_test_00033552  ILSVRC2012_test_00035652 
ILSVRC2012_test_00036628  ILSVRC2012_test_00037322  ILSVRC2012_test_00037832 
ILSVRC2012_test_00038678  ILSVRC2012_test_00038827  ILSVRC2012_test_00039565 
ILSVRC2012_test_00040346  ILSVRC2012_test_00040449  ILSVRC2012_test_00042420 
ILSVRC2012_test_00042491  ILSVRC2012_test_00042549  ILSVRC2012_test_00042583 
ILSVRC2012_test_00042631  ILSVRC2012_test_00043029  ILSVRC2012_test_00043873 
ILSVRC2012_test_00046159  ILSVRC2012_test_00047576  ILSVRC2012_test_00049924 
ILSVRC2012_test_00050181  ILSVRC2012_test_00051384  ILSVRC2012_test_00053071 
ILSVRC2012_test_00053603  ILSVRC2012_test_00054755  ILSVRC2012_test_00055533 
ILSVRC2012_test_00063855  ILSVRC2012_test_00063995  ILSVRC2012_test_00066697 
ILSVRC2012_test_00066914  ILSVRC2012_test_00068457  ILSVRC2012_test_00068596 
ILSVRC2012_test_00070325  ILSVRC2012_test_00071340  ILSVRC2012_test_00071954 
ILSVRC2012_test_00073228  ILSVRC2012_test_00073412  ILSVRC2012_test_00074198 
ILSVRC2012_test_00075878  ILSVRC2012_test_00076257  ILSVRC2012_test_00078714 
ILSVRC2012_test_00078897  ILSVRC2012_test_00080599  ILSVRC2012_test_00081961 
ILSVRC2012_test_00082349  ILSVRC2012_test_00085480  ILSVRC2012_test_00085775 
ILSVRC2012_test_00086081  ILSVRC2012_test_00086280  ILSVRC2012_test_00086391 
ILSVRC2012_test_00086845  ILSVRC2012_test_00087796  ILSVRC2012_test_00087939 
ILSVRC2012_test_00089036  ILSVRC2012_test_00089377  ILSVRC2012_test_00089680 
ILSVRC2012_test_00090568  ILSVRC2012_test_00090956  ILSVRC2012_test_00093267 
ILSVRC2012_test_00093343  ILSVRC2012_test_00096391  ILSVRC2012_test_00096900 
ILSVRC2012_test_00099475  ILSVRC2012_test_00099912  ILSVRC2015_test_00002901 
ILSVRC2015_test_00010740 
Appendix 0.B Performance without Quantization
We test our model without quantization and find that its performance improves remarkably. In contrast, an irreversible autoencoder shows only a slight increment, which demonstrates the advantage of invertibility in the reconstruction process. Such a huge gap caused by quantization implies a tremendous potential in information modelling, which points out a direction for future work.
PSNR without quantization (NQ) and with quantization (Q), at matched bitrates:

Baseline NQ (dB) / Q (dB) | Baseline bpp | ILC NQ (dB) / Q (dB) | ILC bpp
--------------------------|--------------|----------------------|--------
31.61 / 29.84             | 0.2984       | 40.62 / 30.59        | 0.3047
33.45 / 31.63             | 0.4752       | 42.43 / 32.36        | 0.4917
34.05 / 32.24             | 0.5308       | 43.07 / 32.59        | 0.5295
35.22 / 33.42             | 0.6740       | 44.11 / 33.40        | 0.6197
36.36 / 34.61             | 0.8232       | 45.49 / 34.62        | 0.7798
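The NQ/Q gap above stems from rounding the latents. As a back-of-the-envelope illustration (a toy experiment, not the paper's model): rounding a well-spread signal to the nearest integer adds an error that is approximately uniform on [−0.5, 0.5], whose mean squared error converges to 1/12, an irreducible noise floor that the decoder must reconstruct through.

```python
import random

# Toy illustration of the information loss from rounding quantization:
# for inputs spread uniformly across quantization bins of width 1, the
# rounding error is ~uniform on [-0.5, 0.5], so its MSE approaches 1/12.
random.seed(0)
samples = [random.uniform(-1000.0, 1000.0) for _ in range(200_000)]
mse = sum((x - round(x)) ** 2 for x in samples) / len(samples)
```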