Lossy image compression has long played a central role in image storage and transmission, especially given today's exploding volume of large images and the comparatively limited storage and bandwidth. In particular, transform coding based methods such as JPEG, JPEG2000 and BPG perform well and are widely adopted in practice. Recently, deep-learning-based lossy image compression methods [6, 7, 26, 24, 19, 31, 1, 8, 29, 25] have generated great interest due to their impressive performance at low bitrates.
Unlike previous transform coding methods, deep-learning-based methods first transform an image x into a lower-dimensional latent representation y with an encoder network, and then quantize y into a discrete-valued vector ŷ. Lossless entropy coding methods, such as arithmetic coding, are then applied to ŷ to compress it into a bit-stream. Some prior works [6, 7, 26, 23, 24]
adopt an auxiliary network as an entropy model to estimate the density of ŷ and provide statistics for entropy coding. A decoder network then approximates the inverse function that maps the latent variables back to pixels. Information loss is inevitable after the encoder network, both from reducing the representation dimensions and from quantization by rounding floating-point numbers.
Reducing the dimensionality of the representation y causes a significant drop in the amount of information retained from the original image x; this serves the goal of compression, which aims to reduce the entropy of the representation under a prior probability entropy model. Previous works [7, 26] mainly focus on finding a reasonable estimate of the prior density to minimize the expected length of the bit-stream, but few put effort into improving the expected distortion of the reconstructed image with respect to the original x. The information loss poses a notable challenge to recovering the original image with a decoder network alone, since the task becomes ill-posed, which clearly hampers rate-distortion optimization. One way to keep all the information during encoding is to preserve the dimensionality of the representation, i.e., keep it the same as the original image x. The reconstruction stage would benefit from such an informative representation, but it is extremely difficult to encode such high-dimensional data at a desirably low bitrate with lossless entropy coding methods.
In this paper, we propose a novel framework for lossy image compression called Invertible Lossy Compression (ILC) that tackles this intractable problem by capturing as much knowledge as possible about the lost information. Recovering the original image benefits from modelling the statistical behaviour of the lost information during encoding. To this end, the lost information should be expressed in an explicit form that can be encoded into hidden states without unintended information loss. The encoder-decoder framework employed in previous image compression methods does not meet this requirement well: considerable effort is needed to make the encoder invertible and the decoder its inverse, and even then approximation error remains in the invertibility. Therefore, we adopt an invertible encoding module (IEM), which is strictly reversible and clearly better suited to this pair of inverse tasks. IEM consists of two essential components: invertible downsampling layers, which enlarge the receptive field and decompose the spatial correlation among channels, and coupling layers, which enhance the expressive power of the module. Since an invertible model produces a representation of the same dimensionality as the input image x, we split the encoded representation into y and z, where y preserves the information necessary for image reconstruction and z stores all remaining information (such as high-frequency noise that barely alters the original image). For lossy compression, we discard z and encode the quantized ŷ into a bit-stream, while also modelling the distribution of the lost information z. For reconstruction, we feed the IEM inversely with ŷ and a randomly drawn sample of z from the learned distribution.
To achieve this while ensuring that the representation y is both informative and coding-friendly, it is crucial to make IEM capture the knowledge in the distribution of z while removing the dependency between y and z. To this end, we employ a distribution matching loss to encourage z to follow a Gaussian distribution. More importantly, the quantization process changes the distribution of the representation from y to ŷ, which disturbs the inverse procedure of IEM when reconstructing the image by combining ŷ with z. It is challenging to train an invertible model under such a distribution mismatch. Observing that conventional model training is more robust to the quantization process, we propose a knowledge distillation module (KDM) to transfer knowledge from encoder-decoder models to IEM via knowledge distillation [11, 5, 17]. This usually starts by training a teacher model, an encoder-decoder compression model in our task, and then optimizing the target invertible student model so that it mimics the teacher's behaviour. Empirically, we choose the output of the encoder network as the distillation objective, combined with reconstruction, entropy, and distribution losses for efficient optimization. KDM provides soft labels and accelerates the convergence of the invertible model toward the teacher encoder-decoder model's performance. We conduct experiments on extensive benchmark datasets, and the comparison with the baseline method shows significant improvements, demonstrating the effectiveness of our proposed ILC.
The main contributions of this paper are highlighted as follows:
We propose a novel ILC framework that introduces an invertible encoding module into lossy image compression in place of the encoder-decoder framework, simultaneously producing low-dimensional informative representations and capturing the knowledge in the distribution of the information lost during encoding.
We propose an efficient training objective for IEM that encourages the latent variable z to obey a pre-specified distribution, which can be easily sampled and provides rich information during decoding. We further employ a knowledge distillation module based on a teacher model to overcome the distribution mismatch caused by quantization in compression and to accelerate the optimization of IEM.
Extensive experimental results demonstrate that simply replacing the encoder and decoder models with our ILC framework boosts the quality of reconstructed images on various benchmark datasets.
2 Related Work
2.1 Lossy Image Compression
Traditional lossy image compression algorithms [32, 27, 9] use prior knowledge to decouple low- and high-frequency signals (e.g., the discrete cosine transform (DCT) and the discrete wavelet transform (DWT)), and apply reversible coding algorithms (e.g., Huffman coding) to the quantized signals to achieve compression.
Deep-learning-based methods generally use auto-encoders, which are widely employed in representation learning and generative modelling. The informative bottleneck representation of the auto-encoder framework is well-suited for lossy compression: instead of relying on prior knowledge, a neural network encoder directly extracts a lower-dimensional representation from the image and quantizes the low-frequency signal so that it can be coded into a bit-stream. When decompressing, a decoder network reconstructs the original image from the decoded quantized latent representation.
However, quantization leads to differentiability issues. In recent years, a consensus has emerged that, to enable end-to-end optimization, uniform noise should be superimposed on the latent signals as a soft quantization during training. Previous works are also roughly aligned in their choice of coding algorithm, namely arithmetic coding, an efficient coding algorithm based on entropy modelling. This raises the question of how to better estimate the entropy of the signal variable, on which much work has made positive progress. For example, hyperprior and autoregressive components are used [7, 26] to jointly model the latent feature map so that a better-performing entropy model can be constructed. Other work improves the auto-encoder's capacity to extract features, e.g., introducing a non-local attention block to help the auto-encoder capture the local and global correlations of the latent feature map.
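The soft/hard quantization switch described above can be sketched in a few lines of numpy; the latent array below is an illustrative stand-in, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)  # hypothetical latent representation

# Training: additive i.i.d. uniform noise over unit-wide bins acts as a
# differentiable stand-in for rounding.
y_soft = y + rng.uniform(-0.5, 0.5, size=y.shape)

# Inference: hard quantization by rounding to the nearest integer bin.
y_hard = np.round(y)
```

Both relaxations stay within half a bin of the underlying value, which is why the noisy proxy is a reasonable surrogate during training.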
Although previous methods show promising performance, they ignore the difficulty that non-negligible information loss during encoding creates for the reconstruction stage. In this work, we identify this problem and propose to mitigate it by explicitly modelling the lost information.
2.2 Invertible Neural Network
It has been shown in many scenarios that neural networks designed around invertibility can achieve the same or even better performance than non-invertible neural networks [21, 33, 16].
For generative models, denoting the input data as x and an invertible neural network as f with z = f(x), the inverse function f^{-1} can be obtained trivially, so that x can be easily sampled as x = f^{-1}(z) with z ~ p(z). Furthermore, the density function of x can be defined explicitly, which allows training by maximum likelihood. GANs and VAEs are also well-known generative models, but both have defects. In VAEs, the distribution of the latent variables can only be approximately inferred from data, so the training objective is not exact but a variational lower bound on the log-likelihood of x. GANs, lacking an encoder, cannot perform inference, which significantly impedes their usability in our scenario. With invertibility, by contrast, a neural network can not only accurately evaluate the log-likelihood and perform inference, but is also naturally suited to synthesis.
In order to ensure the invertibility of the system, each sub-block of the neural network is designed to be invertible, which makes maintaining model capacity a top priority while constraining the architecture design. To address this concern, we take the coupling layer with affine transformation, introduced in RealNVP, as a general solution. Given a D-dimensional input x and a slicing position l < D, split x into x1 = x_{1:l} and x2 = x_{l+1:D}; the output of an affine coupling layer follows:

y1 = x1 + φ(x2),
y2 = x2 ⊙ exp(α · tanh(ρ(y1))) + η(y1),    (1)

where φ, ρ, and η are arbitrary functions that preserve dimensionality, and α is a constant factor serving as a clamp.
For the inverse, given the D-dimensional output y and the same slicing position l, the input x follows:

x2 = (y2 − η(y1)) ⊙ exp(−α · tanh(ρ(y1))),
x1 = y1 − φ(x2).    (2)
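Exact invertibility of such a coupling layer is easy to verify numerically. In the sketch below, phi, rho, and eta are toy stand-ins (our assumption) for the small convolutional networks used in practice:

```python
import numpy as np

# Placeholder transformation functions; in practice these are small
# bottleneck-style convolutional networks.
def phi(u):  return np.tanh(u)
def rho(u):  return np.sin(u)
def eta(u):  return 0.5 * u

ALPHA = 1.0  # clamp factor

def coupling_forward(x1, x2):
    y1 = x1 + phi(x2)
    y2 = x2 * np.exp(ALPHA * np.tanh(rho(y1))) + eta(y1)
    return y1, y2

def coupling_inverse(y1, y2):
    # Undo the affine step first (it only depends on y1), then the additive step.
    x2 = (y2 - eta(y1)) * np.exp(-ALPHA * np.tanh(rho(y1)))
    x1 = y1 - phi(x2)
    return x1, x2
```

Because each step is undone in reverse order using quantities that are available at that point, the inverse is exact regardless of how expressive the inner functions are.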
INN has also been applied to paired data, and this idea has been demonstrated on different problems. For example, Ardizzone et al. analyzed real-world problems from medicine and astrophysics. In image compression tasks, the classical Maximum Mean Discrepancy (MMD) method fails to measure the difference between such high-dimensional probability distributions. A conditional INN has been designed and applied to guided image generation and colorization; in that task, the guidance is given as a condition, which is clearly unsuitable for our aim. Recently, Xiao et al. proposed to use an INN as a transformation between high- and low-resolution images. In their task, no explicit entropy constraint on the low-dimensional representation is considered, which is one of the biggest challenges in image compression.
3 Invertible Lossy Compression
Figure 1 shows the general framework of previous variational auto-encoder based lossy image compression [6, 7, 26]. Typically it consists of three modules: an encoder network to transform the original image x into a low-dimensional representation y, an entropy model to estimate the density of the quantized representation ŷ, and a decoder network to reconstruct the image from ŷ. Figure 1 also illustrates the sketch of the general framework of ILC. The main difference between ILC and previous methods is that, instead of the commonly used encoder-decoder module, an invertible encoding module (IEM) is adopted to explicitly model the information lost during encoding, which mitigates the difficulty of the reconstruction stage. In the forward procedure, the image x is transformed into a coding target y and an auxiliary latent variable z. We enforce the transformed z to follow a specified distribution independent of y, so that the network captures the statistical knowledge of the lost information for reconstruction. This is achievable because any absolutely continuous distribution can be transformed bijectively to a standard Gaussian. For the inverse, we can then use a sample ẑ to replace the original z and reconstruct the image from the quantized ŷ through the inverse transformation. Furthermore, a knowledge distillation module (KDM) is coupled into the training process to stabilize the optimization and accelerate convergence. We describe these components in detail below.
3.1 Invertible Encoding Module
Our IEM consists of two basic invertible components, i.e., the invertible downsampling layer and the coupling layer. As shown in Figure 1, IEM takes the place of both the encoder and decoder modules in our framework. When an image is compressed, IEM acts as the encoder by taking x as input and running the model in the forward direction. Conversely, IEM is applied inversely to serve as the decoder when the compressed image is reconstructed.
Invertible Downsampling Layer Downsampling is necessary for the entropy model to enlarge the receptive field and decompose spatial correlation in compression. However, the traditional downsampling layers in previous learning-based compression methods are clearly irreversible. We therefore carefully design an invertible downsampling layer; Figure 2 shows the detailed architecture. Firstly, we introduce a wavelet transformation as one of our basic modules. The inverse of each wavelet transformation naturally exists. More importantly, wavelet transformations provide strong prior knowledge by explicitly separating the low- and high-frequency signals, from which an image compression algorithm clearly benefits. By contrast, previous deep-learning-based methods fail to integrate this into their models. Specifically, the wavelet transformation transforms an input tensor with height H, width W, and C channels into a tensor of shape (H/2, W/2, 4C), where the first C slices are low-frequency content equivalent to average pooling, and the rest are high-frequency components corresponding to residuals in the vertical, horizontal, and diagonal directions. We employ the Haar wavelet transformation in our framework, which is easy to implement, enlarges the receptive field without information loss, and carries certain prior knowledge.
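A minimal numpy sketch of this invertible downsampling follows. We use an orthogonal Haar filter bank, so the low-frequency band equals average pooling up to a constant factor; the exact normalization in the paper may differ:

```python
import numpy as np

def haar_forward(x):
    # x: (H, W, C) -> (H/2, W/2, 4C); first C channels are the low-frequency band.
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-frequency (average pooling up to scale)
    lh = (a - b + c - d) / 2   # horizontal residual
    hl = (a + b - c - d) / 2   # vertical residual
    hh = (a - b - c + d) / 2   # diagonal residual
    return np.concatenate([ll, lh, hl, hh], axis=-1)

def haar_inverse(y):
    # Exact inverse: reassemble the four 2x2 block entries from the subbands.
    C = y.shape[-1] // 4
    ll, lh, hl, hh = np.split(y, 4, axis=-1)
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    H, W = y.shape[0] * 2, y.shape[1] * 2
    x = np.empty((H, W, C))
    x[0::2, 0::2] = a; x[0::2, 1::2] = b
    x[1::2, 0::2] = c; x[1::2, 1::2] = d
    return x
```

The inverse recovers the input exactly, so the downsampling itself loses no information.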
Although the non-learnable wavelet transformation downsamples the input to enlarge the receptive field, provides a strong prior for the model, and is strictly invertible, it suffers from a fixed split of information, i.e., only the first quarter of channels are the main low-frequency content. A more flexible separation can better adapt our channel split between y and z and make training easier. Therefore, we apply a 1×1 invertible convolution after the wavelet transformation to refine the split of information. The 1×1 invertible convolution was originally proposed in GLOW for channel-wise permutation. Differently from that purpose, we leverage it for channel-wise refinement after the wavelet transformation and initialize its weight as an identity matrix.
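As a sketch, a 1×1 convolution is a per-pixel matrix multiply over channels; initialized with the identity it is a no-op, and its inverse pass simply uses the matrix inverse (the helper name below is ours):

```python
import numpy as np

def conv1x1(x, W):
    # x: (H, W, C). A 1x1 convolution mixes channels independently per pixel,
    # i.e. a matrix multiply along the channel axis.
    return x @ W.T

C = 4
W_identity = np.eye(C)  # identity initialization: starts as a no-op refinement
```

Invertibility only requires the weight matrix to be non-singular; the identity initialization guarantees this at the start of training.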
Coupling Layer After the invertible downsampling layer, the feature map is roughly broken down into two segments, a low-frequency component carrying the majority of the information and a high-frequency component mainly composed of the former's residuals, which are then fed into a stack of convolutional layers to further abstract the corresponding signals. Taking the input of a coupling layer, the feature map is partitioned into two components at the slicing position, denoted x1 and x2 respectively. Note that the split ratio is consistent with the proportion of dimensions of y and z in the final outputs. To further decouple the signals, we employ an affine transformation, proposed in RealNVP, on the high-frequency feature map and fuse the captured signal into the low-frequency component. On the other hand, if the compression rate is low, we expect simple information to be transferred into z so that it can be modelled in the latent space rather than eliminated by quantization. We therefore apply another affine transformation for the flow of information from the low- to the high-frequency component. The detailed structure is presented in Figure 2.
The complete expression of the affine transformation is shown in Eq. 1, where the constant clamp factor α is set to 1, and all the transformation functions (i.e., φ, ρ, and η) can be arbitrary as long as the dimensionalities of input and output match. To enhance the expressive power of our model while retaining light-weight computation, we employ a simple but effective bottleneck-like structure as the transformation functions, shown in Figure 2.
3.2 Knowledge Distillation Module
Yet several challenges may significantly influence the optimization of IEM. On the one hand, due to its invertibility, IEM can transform the distributions of x and (y, z) into each other. However, the quantization of y apparently changes its distribution, and IEM may fail to be robust to such data jitter caused by quantization, inducing a non-negligible drop in performance. On the other hand, to guarantee invertibility, z must be made as independent of y as possible while both follow the required distributions. Under these conditions, the massive disparity in the amount of information carried by y and z further increases the difficulty of optimization.
Therefore, we introduce KDM to mitigate these stubborn challenges. Observing that encoder-decoder frameworks are much more robust to the distribution mismatch caused by quantization, most likely due to their separate training procedure, we propose to use the encoder network as a teacher model and encourage IEM to mimic its output representation during training. In this way, IEM learns to output a representation whose distribution changes less under quantization, making the inverse function of IEM more robust to the distribution mismatch. More importantly, KDM clearly improves the utility and ease of training in the early stage, so that IEM can refine the distribution of y from the teacher encoder and, around it, search for a better low-dimensional manifold of the data distribution that can be efficiently inverted.
Practically, this requires a fully specified prior, a teacher encoder, and its affiliated entropy module, which together furnish at least a sub-optimal approximation of the sophisticated knowledge in the distribution of y. Specifically, the fully trained prior is used to initialize the entropy model for IEM. Throughout training, the teacher's knowledge is transferred through a distillation loss guiding y, evaluated by feeding the same image x into both the teacher encoder and IEM and computing their difference on y.
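Conceptually, the distillation signal is just a pointwise difference between the two representations of the same image; a minimal sketch assuming an L2 metric (the function name is illustrative):

```python
import numpy as np

def distillation_loss(y_student, y_teacher):
    # Mean squared difference between the coding target produced by IEM
    # (student) and the frozen teacher encoder's representation of the same
    # image; gradients flow only into the student in actual training.
    return float(np.mean((y_student - y_teacher) ** 2))
```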
3.3 Optimization Objective
Our fundamental training objective follows previous learning-based lossy compression methods, i.e., minimizing a weighted sum of the rate and distortion, where the rate is lower-bounded by the entropy of the discrete probability distribution of the quantized vector ŷ, and the distortion measures the difference between the reconstructed image and the original input x. In addition, two novel objectives are included for our invertible framework and efficient training: (1) a distribution matching loss that captures the distribution of the lost information, and (2) a distillation loss that stabilizes the training procedure.
Rate Our basic goal is to minimize the rate of the coding target ŷ. We therefore use entropy as our objective, consistent with previous works [6, 7, 26]. As mentioned above, an entropy module is used to estimate the entropy of ŷ. We follow their loss definition and denote it as:

L_rate = E[ −log2 p(ỹ) ],  ỹ = y + u,    (3)

where ỹ is a differentiable approximation of the quantization and u is additive i.i.d. uniform noise with the same width as the quantization bins, which in our case is one.
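For intuition, the rate can be estimated by averaging −log2 of the probability mass the entropy model assigns to each quantization bin. The sketch below assumes a toy factorized unit-Gaussian entropy model, not the paper's learned one:

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(v, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + erf((v - mu) / (sigma * sqrt(2.0))))

def bin_prob(q, mu=0.0, sigma=1.0):
    # Probability mass the model assigns to the unit-wide bin centered at q.
    return gaussian_cdf(q + 0.5, mu, sigma) - gaussian_cdf(q - 0.5, mu, sigma)

rng = np.random.default_rng(0)
y_hat = np.round(rng.normal(size=10_000))
rate = float(np.mean([-np.log2(bin_prob(q)) for q in y_hat]))  # bits per symbol
```

For a standard normal latent this Monte-Carlo estimate lands near the entropy of the discretized Gaussian, about 2.1 bits per symbol.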
Distortion Because of the uncertainty of z and the quantization of y, a distortion loss is still required to ensure the inverse reconstruction of our model. We denote the reverse process as f^{-1}. The distortion loss encourages our model to adapt to the new sample drawn from p(z). We formulate the loss as:

L_distortion = E[ d(x, f^{-1}(ŷ, ẑ)) ],    (4)

where ẑ is a sample from p(z). For training stability in practice, we empirically take the most-likely sample from the distribution for reconstruction. We employ the L2 loss as d. Different from previous works that use the noisy representation ỹ as an approximation of ŷ in the distortion term, we directly employ ŷ with the Straight-Through Estimator during optimization. This choice mainly stems from the inconsistency between ỹ in training and ŷ in inference, which may hurt the generalization of IEM since it shares the same parameters during encoding and decoding.
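The Straight-Through Estimator can be written as a value-preserving identity around the rounding residual; in an autodiff framework the residual would be wrapped in a stop-gradient, so the backward pass treats quantization as the identity (the numpy version below only illustrates the forward value):

```python
import numpy as np

def quantize_ste(y, stop_gradient=lambda t: t):
    # Forward value equals round(y). In an autodiff framework, stop_gradient
    # blocks the derivative of the rounding residual, so gradients pass
    # through the quantizer unchanged (Straight-Through Estimator).
    return y + stop_gradient(np.round(y) - y)
```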
Distribution Matching This part of the training objective enforces the transformation from the y-dependent lost information to a standard Gaussian representation. Denoting the distribution of z transformed from the data distribution as q(z), we aim to minimize its difference from the specified y-independent distribution p(z). For practical optimization, we employ the cross-entropy (CE) to measure the difference, which leads to the objective:

L_dist = CE(q, p) = −E_{z~q}[ log p(z) ].    (5)

The distribution loss encourages z to follow the same target distribution for every y, thereby encouraging independence between them.
Distillation As mentioned in the previous section, we leverage a distillation module to stabilize training. The distillation loss based on a teacher model is defined as:

L_distill = E[ d'( f_y(x), g_t(x) ) ],    (6)

where f_y(x) is the coding target produced by IEM, g_t(x) is the teacher encoder's output for the same image, and d' is a difference metric; we use the L2 loss in practice.
Total loss Combining these objectives, our total loss for training is:

L_total = λ1 · L_rate + λ2 · L_distortion + λ3 · L_dist + λ4 · L_distill,    (7)

where λ1, λ2, λ3, and λ4 are coefficients balancing the different loss terms.
4 Experiments

The dataset we use for training is a subset of ImageNet. We filter out 9250 images above a minimum size in order to easily preprocess the data. All training images are preprocessed by random rescaling and cropping. Evaluations are performed on the Kodak image dataset, commonly used as test data for compression problems, and on a randomly sampled subset of ImageNet with 100 uncompressed images, none of which is used in training.
Our idea is verified on an existing baseline model, since it has an open-source and reproducible implementation. With the number of filters set to 256, the teacher encoder is trained following the original instructions. We use the Adam optimization algorithm for all parameters, with different learning rates for the convolutional auto-encoder and the entropy model. Training and evaluation are performed at each compression rate separately; the compression rate is controlled by adjusting the trade-off between distortion and rate in the objective function. To ensure training is carried out thoroughly, we train for at least one million iterations and verify that performance no longer increases over a certain period.
Afterwards, following the architecture in Figure 1, the pre-trained encoder is loaded as our teacher model. We jointly optimize all parameters in IEM and the entropy model. The training dataset and preprocessing are the same as before, except that the learning rate gradually decays after 0.1 million iterations. In practice, for different bitrates we use the same distillation coefficient λ4 during training and adjust all the others, i.e., λ1, λ2, and λ3, in our objective function (Eq. 7).
In line with previous work, we evaluate our model with the peak signal-to-noise ratio (PSNR) on both RGB and the luma component of images. We compare the rate-distortion performance of our method to the baseline approach, the teacher model in our framework. When evaluating PSNR, we observe that z has to be sampled from a distribution with lower variance to obtain better performance. This is consistent with GLOW, which samples from a distribution with shrunken variance to prevent mode collapse in generated images. In our case, the priority is not variance but reconstruction performance, so instead of sampling we take the most-likely z from the specified distribution. Compared with the baseline method at the same compression ratio, ILC improves PSNR by around 0.4dB. Moreover, as the compression ratio rises (i.e., the bitrate decreases), ILC boosts performance further. Intuitively, under more severe bitrate constraints, the convolutional auto-encoder has to extract an even more condensed representation of the data, which leads to more irreversible information loss; in this case, explicitly modelling the high-frequency signals becomes the key to success.
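For reference, PSNR on 8-bit images is computed from the mean squared error as follows (a standard definition, not code from the paper):

```python
import numpy as np

def psnr(x, x_hat, peak=255.0):
    # Peak signal-to-noise ratio in dB for 8-bit pixel values.
    mse = np.mean((np.asarray(x, float) - np.asarray(x_hat, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```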
To further verify that the improvement comes from ILC's superiority over the auto-encoder, we compare this result with ILC guided by an early-stopped teacher model. The comparison shows a strong positive correlation between the performance of ILC and that of its teacher, supporting our view that ILC is strongly guided by the teacher and searches for a manifold around it. ILC's advantage lies mainly in reconstruction, so it could be leveraged jointly with more advanced entropy models to improve performance.
5 Conclusion

In this paper, we propose ILC, a novel framework for lossy image compression that explicitly models the information lost during encoding. By doing so, the ill-posedness of the reconstruction stage is largely mitigated. To achieve this, we design an invertible encoding module to replace the encoder-decoder network of previous methods, which is clearly more suitable for this pair of inverse tasks. With the latent variable's statistical knowledge, IEM can reconstruct the original input image from the compressed bit-stream with high quality by drawing a sample from a pre-specified distribution. To overcome the intractable optimization challenges, we propose a knowledge distillation module that provides soft labels from a teacher model, which significantly accelerates training. Extensive experiments demonstrate that our framework ILC significantly improves performance over existing methods.
References

- Generative adversarial networks for extreme learned image compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 221–231.
- (1992) Image coding using wavelet transform. IEEE Transactions on Image Processing 1(2), pp. 205–220.
- (2019) Analyzing inverse problems with invertible neural networks. In Proceedings of the International Conference on Learning Representations.
- (2019) Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392.
- (2014) Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662.
- (2017) End-to-end optimized image compression. ICLR.
- (2018) Variational image compression with a scale hyperprior. ICLR.
- (2018) Efficient nonlinear transforms for lossy image compression. In 2018 Picture Coding Symposium (PCS), pp. 248–252.
- (2015) BPG image format. URL https://bellard.org/bpg.
- Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- (2006) Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541.
- (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2017) Density estimation using Real NVP. In Proceedings of the International Conference on Learning Representations.
- (2015) Training generative neural networks via Maximum Mean Discrepancy optimization. arXiv preprint arXiv:1505.03906.
- (1999) Kodak lossless true color image suite. Source: http://r0k.us/graphics/kodak.
- The reversible residual network: backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224.
- (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Nonlinear independent component analysis: existence and uniqueness results. Neural Networks 12(3), pp. 429–439.
- (2019) Computationally efficient neural image compression. arXiv preprint arXiv:1912.08771.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
- (2019) Context-adaptive entropy model for end-to-end optimized image compression. ICLR.
- (2019) Extended end-to-end optimized image compression method based on a context-adaptive entropy model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
- (2019) Non-local attention optimized deep image compression. arXiv preprint arXiv:1904.09757.
- (2019) Image and video compression with neural networks: a review. IEEE Transactions on Circuits and Systems for Video Technology.
- (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780.
- (2002) JPEG2000: image compression fundamentals, standards and practice. Journal of Electronic Imaging 11(2), pp. 286.
- (1981) Universal modeling and coding. IEEE Transactions on Information Theory 27(1), pp. 12–23.
- (2018) Deep generative models for distribution-preserving lossy compression. In Advances in Neural Information Processing Systems, pp. 5929–5940.
- (1976) On the construction of Huffman trees. In ICALP, pp. 382–410.
- (2020) Lossy compression with distortion constrained optimization. arXiv preprint arXiv:2005.04064.
- (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38(1), pp. xviii–xxxiv.
- (2020) Invertible image rescaling. arXiv preprint arXiv:2005.05650.
Appendix 0.A Experimental Details
Network Structure The coupling layers are parametrized by neural networks with a bottleneck-like structure consisting of three convolutional layers, where the first and the last are 1×1 convolutions and the middle one uses kernel size K. This design improves parameter efficiency by allowing a larger number of channels at acceptable computational complexity. The detailed architecture of our implementation is specified in Table 1.
| Input Size | Input Channel | Width | Kernel Size (K) |
|---|---|---|---|
| 64×64 | 48 | 128 | 5×5 |
| 32×32 | 192 | 256 | 3×3 |
| 16×16 | 768 | 1024 | 3×3 |
Training Details Throughout training, we use the Adam optimizer with its default momentum parameters. Different initial learning rates are used for IEM and the entropy model, denoted lr_IEM and lr_em respectively. Both decay from 0.1 million iterations onward, computed as lr_t = lr_0 · γ^(t − 10^5), where lr_0 is the corresponding initial value, t is a step counter, and γ is a decay factor of around 0.999995.
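The decay schedule above can be sketched as a small helper; the constant-then-geometric form is our reading of the description:

```python
def decayed_lr(lr0, step, gamma=0.999995, warm_steps=100_000):
    # Learning rate held constant for the first 0.1M iterations, then
    # geometric decay: lr_t = lr0 * gamma ** (t - warm_steps).
    if step <= warm_steps:
        return lr0
    return lr0 * gamma ** (step - warm_steps)
```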
As for the training objective, the coefficients need to be carefully tuned for each bitrate. Empirically, we fix two of the coefficients and use grid search for the optimal weights of the remaining ones over a given range. Note that this does not mean the resulting setting is optimal in all scenarios; experimentally, however, there is no significant difference in performance between changing two coefficients simultaneously and adjusting only one.
Randomly Sampled Dataset for Evaluation The subset of ImageNet used for evaluation is randomly sampled.
Appendix 0.B Performance without Quantization
We test our model without quantization and find that its performance is remarkably improved. In contrast, with an irreversible auto-encoder there is only a slight performance increase, which shows the advantage of invertibility in the reconstruction process. Such a huge gap caused by quantization implies a tremendous potential in information modelling, pointing out a direction for future work.
| NQ (dB) / Q (dB) | bpp | NQ (dB) / Q (dB) | bpp |
|---|---|---|---|
| 31.61 / 29.84 | 0.2984 | 40.62 / 30.59 | 0.3047 |
| 33.45 / 31.63 | 0.4752 | 42.43 / 32.36 | 0.4917 |
| 34.05 / 32.24 | 0.5308 | 43.07 / 32.59 | 0.5295 |
| 35.22 / 33.42 | 0.674 | 44.11 / 33.4 | 0.6197 |
| 36.36 / 34.61 | 0.8232 | 45.49 / 34.62 | 0.7798 |