1 Introduction
Image compression typically consists of a transformation step (including quantization) and an entropy coding step that attempts to capture the probability distribution of a transformed context to generate a smaller compressed bitstream. Entropy coding ranges in complexity from simple non-adaptive encoders [jpeg, jpegxsentropy] to complex arithmetic coders with adaptive context models [cabac, jpegxl]. The entropy coding strategy has been revised to address the specificities of learned compression. More specifically, for recent works that use a convolutional autoencoder [autoencoder] (AE) as the all-inclusive transformation and quantization step, the entropy coder relies on a cumulative probability model (CPM) trained alongside the AE [balle2017]. This model estimates the cumulative distribution function (CDF) of each channel coming out of the AE and passes these learned CDFs to an entropy coder such as range encoding [rangeencoding]. Such a simple method outperforms traditional codecs like JPEG2000, but work is still needed to surpass complex codecs like BPG. Johannes Ballé et al. (2018) [balle2018] proposed analyzing the output of the convolutional encoder with another AE to generate a floating-point scale parameter that differs for every variable that needs to be encoded by the entropy coder, that is, for every location in every channel. This method has been widely used in subsequent works but introduces substantial complexity in the entropy coding step because a different CDF is needed to encode every variable in the latent representation of the image, whereas the single-AE method by Ballé et al. (2017) [balle2017] reuses the same CDF table for every latent spatial location.
Our work uses the principle of competition of experts [competitionOfExperts0, competitionOfExperts1] to get the best of both worlds. Multiple prior distributions compete for the lowest bit cost at every spatial location in the quantized latent representation. During training, only the best prior distribution is updated at each spatial location, which further improves the specialization of the prior distributions. CDF tables are fixed at the end of training. Hence, at test time, the CDF table resulting in the lowest bit cost is assigned to each spatial location of the latent representation. The rate-distortion (RD) performance obtained is comparable to that obtained with a parametrized distribution [balle2018], yet the entropy coding process is greatly simplified since it does not require a per-variable CDF and can build on look-up tables (LUT) rather than the computation of analytical distributions.
2 Background
Entropy coders such as range encoding [rangeencoding] require CDFs where, for each variable to be encoded, the probability that a smaller or equal value appears is defined for every allowable value in the latent representation space. Johannes Ballé et al.’s seminal work (2017) [balle2017] consists of an AE, computing a latent image representation consisting in channels of size , and a CPM, consisting of one CDF per latent output channel, which are trained jointly. The latent representation coming out of the encoder is quantized, then passed through the CPM. The CPM defines, in a parametrized and differentiable manner, a CDF per channel. At the end of training, the CPM is evaluated at every possible value to generate the static CDF table. The CDF table is not differentiable, but going from a differentiable CPM to a static CDF table speeds up the encoding and decoding process. The CDF table is used to compress latent representations with an entropy coder; the approximate bit cost of a symbol is the negative binary logarithm of its probability.
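The conversion from a CPM to a static CDF table, and the resulting bit cost, can be illustrated with a minimal sketch. It assumes a logistic CDF as a stand-in for the learned CPM and an integer symbol range of [-8, 8]; both, along with the function names, are illustrative assumptions and not the model of [balle2017].

```python
import math

def logistic_cdf(x, loc=0.0, scale=1.5):
    """Stand-in for a learned, differentiable CPM (assumption)."""
    return 1.0 / (1.0 + math.exp(-(x - loc) / scale))

def build_cdf_table(cpm, lo=-8, hi=8):
    """Evaluate the CPM once at the bin edges of every integer symbol in [lo, hi]."""
    # edges[i] is the CDF at (lo + i - 0.5), so symbol s spans edges[s - lo] to edges[s - lo + 1].
    edges = [cpm(lo + i - 0.5) for i in range(hi - lo + 2)]
    # Clamp the extremes so the table accounts for the whole probability mass.
    edges[0], edges[-1] = 0.0, 1.0
    return edges

def bit_cost(table, symbol, lo=-8):
    """Approximate cost in bits: negative binary logarithm of the symbol probability."""
    i = symbol - lo
    probability = table[i + 1] - table[i]
    return -math.log2(probability)

table = build_cdf_table(logistic_cdf)
print(bit_cost(table, 0), bit_cost(table, 5))
```

Once the table is built, encoding only requires two table reads per symbol, which is what makes the static table cheaper than re-evaluating the CPM.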
Ballé et al. (2018) improved the RD efficiency by replacing the unique CDF table with a Gaussian distribution parametrized with a hyperprior (HP) subnetwork
[balle2018]. The HP generates a scale parameter, and in turn a different CDF, for every variable to be encoded. Thus, complexity is added by exploiting the parametrized Gaussian prior during the entropy coding process, since a different CDF is required for each variable in the channel and spatial dimensions.
Minnen et al. proposed a scheme where one of multiple probability distributions is chosen to adapt the entropy model locally [minnenmp]. However, these distributions are defined a posteriori, given the encoder trained with a global entropy model. Thus [minnenmp] does not perform as well as the HP scheme [balle2018] per [minnen, Fig. 2a]. In contrast, the present method jointly optimizes the local entropy models and the AE in an end-to-end fashion that results in greater performance. Minnen et al. [minnen] later proposed to improve RD with the use of an autoregressive sequential context model. However, as highlighted in [liujiaheng], this is obtained at the cost of a runtime increased by several orders of magnitude. Subsequent works have attempted to reduce the complexity of the neural network architecture [johnston] and to bridge the RD gap with Minnen’s work [liujiaheng], but entropy coding complexity has remained largely unaddressed and has instead evolved towards increased complexity [minnen, gmm, minnen2020] compared to [balle2018]. The present work builds on Ballé et al. (2017) [balle2017] and achieves the performance of Ballé et al. (2018) [balle2018] without the complexity introduced by a per-variable parametrized probability distribution. We chose Ballé et al. (2017) as a baseline because it corresponds to the basic unit adopted as a common reference and starting point for most models proposed in the recent literature to improve compression quality [balle2018, minnen, liujiaheng, minnen2020]. Due to its generic nature, our contribution remains relevant for the newer, often computationally more complex, incremental improvements on Ballé et al. (2017).
3 Competition of prior distributions
Our proposed method introduces a competition of expert [competitionOfExperts0, competitionOfExperts1] prior distributions: a single AE transforms the image and a set of prior distributions is trained to model the CDF of the latent representation in each spatial location. For each latent spatial location, the CDF table that minimizes the bit cost is selected; in training mode, that prior is further optimized on the features it won, whereas in inference mode its index is stored for decoding. This scheme is illustrated in Figure 1, a set of 16 optimized CDF tables is shown in Figure 2, and three sample images are segmented by “winning” CDF table in Figure 3.
All prior distributions are estimated in parallel by considering
CDF tables, and selecting, as a function of the encoded latent spatial location, the one that minimizes the entropy coder bit count. At inference, the CDF table index is determined for each spatial location by evaluating every CDF table. This can be done in a vectorized operation given sufficient memory. During training, the CPM is evaluated instead of the CDF tables so that the probabilities are up to date and the model is differentiable, and the bit cost is returned because it contributes to the loss function. The cost of the CDF table indices has been shown to be negligible due to the reasonably small number of priors, which in turn results from the fact that little gain in latent code entropy has been obtained by increasing the number of priors.
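The selection step described above can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the vectorized PyTorch implementation: the two toy priors (a narrow and a wide logistic distribution), the symbol range, and all function names are hypothetical.

```python
import math

def make_table(scale, lo=-8, hi=8):
    """Toy CDF table: a logistic CDF sampled at the symbol bin edges (assumption)."""
    edges = [1.0 / (1.0 + math.exp(-(lo + i - 0.5) / scale)) for i in range(hi - lo + 2)]
    edges[0], edges[-1] = 0.0, 1.0
    return edges

def cost_bits(table, symbol, lo=-8):
    """Bit cost of one quantized value under one CDF table."""
    i = symbol - lo
    return -math.log2(table[i + 1] - table[i])

def select_priors(latents, priors):
    """For each spatial location, pick the prior with the lowest total bit cost.

    latents: list of locations, each a list of quantized channel values
    priors:  priors[k][c] is the CDF table of prior k for channel c
    """
    indices, total = [], 0.0
    for channels in latents:
        # Cost of this location under every prior, summed over its channels.
        costs = [sum(cost_bits(prior[c], v) for c, v in enumerate(channels))
                 for prior in priors]
        best = min(range(len(priors)), key=costs.__getitem__)
        indices.append(best)
        total += costs[best]
    return indices, total

# Two priors with two channels each: prior 0 is narrow, prior 1 is wide.
priors = [[make_table(0.5)] * 2, [make_table(3.0)] * 2]
latents = [[0, 0], [5, -4], [1, 0]]
indices, total_bits = select_priors(latents, priors)
print(indices)  # [0, 1, 0]
```

Locations with values near zero are won by the narrow prior, and the location with large magnitudes is won by the wide one, so the total cost is never higher than with any single prior.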
In all our experiments, the AE architecture follows the one in Ballé et al. (2018) [balle2018], without the HP, since we found that the AE from [balle2018] offers better RD than the one described in Ballé et al. (2017) [balle2017], even with a single CDF table. A functional training loop is described in Algorithm 1.
4 Experiments
4.1 Method
These experiments are based on the PyTorch implementation of Ballé et al. (2018)
[balle2018] published by Liu Jiaheng [ptcompression, liujiaheng]. To implement our proposed method, the HP is omitted in favor of a competition of expert prior distributions. The CPM is the one defined in [ptcompression], with an additional dimension to compute all CDF tables in parallel. Theoretical results are verified using the torchac range coder [torchac, torchaccode, rangeencoding]. A functional training loop is described in Algorithm 1, and source code is provided at https://github.com/trougnouf/Manypriors.
To ensure that all priors get an opportunity to train, the prior distributions that have not been used for at least fifty steps are randomly assigned to the spatial locations with the largest bit counts, which forces them to train. The Adam optimizer [adam] is used with a starting learning rate (LR) of 0.0001 for the AE and 0.001 for the CPM. Performance is tested every 2500 steps in inference mode on the validation set, and the LR is decayed by a factor of 0.99 if the performance has not improved for two tests. The reported performance is that of the model that minimizes on the validation set at the end of training. Base models are trained for six million steps at with the mean squared error (MSE) loss. Smaller values and MS-SSIM models are trained for four million steps starting from the base model with their LR and optimizer reset. All models use (hidden layers channels) and (output channels) such that a single base model is needed for each prior configuration.
The training and validation dataset is made of free-license images from Wikimedia Commons [commons], mainly “Category:Featured pictures on Wikimedia Commons”, which consists of 13928 images of the highest quality. The images are cropped into pixels patches on disk to speed up further resizing, then they are resized on-the-fly by a random factor down to pixels during training. A batch size of 4 patches is used.
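The rule that forces idle priors to train, described in this section, can be sketched as follows. The function name, its arguments, and the choice of handing exactly one high-cost location to each idle prior are assumptions made for illustration; the source only states that idle priors are randomly assigned to the locations with the largest bit counts.

```python
import random

def force_idle_priors(winners, best_costs, idle_steps, patience=50, rng=None):
    """Reassign priors that have been idle for at least `patience` steps.

    winners:    winning prior index for each spatial location
    best_costs: bit cost of each location under its winning prior
    idle_steps: number of steps since each prior last won a location
    """
    rng = rng or random.Random()
    idle = [k for k, steps in enumerate(idle_steps) if steps >= patience]
    if not idle:
        return list(winners)
    # Locations sorted by decreasing bit cost; the costliest ones are handed over.
    costly = sorted(range(len(winners)), key=lambda i: best_costs[i], reverse=True)
    costly = costly[:len(idle)]
    rng.shuffle(idle)  # random pairing between idle priors and costly locations
    forced = list(winners)
    for location, prior in zip(costly, idle):
        forced[location] = prior
    return forced

winners = [0, 0, 1, 0, 1]
best_costs = [3.0, 9.5, 2.0, 7.1, 4.4]
idle_steps = [0, 0, 62, 50]  # priors 2 and 3 have not won for 50 steps or more
forced = force_idle_priors(winners, best_costs, idle_steps, rng=random.Random(0))
print(forced)
```

In this toy case, locations 1 and 3 have the largest bit counts, so they are the ones given to priors 2 and 3, while every other location keeps its winner.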
The Kodak set [kodak] is used as a validation set and the CLIC professional test dataset [clictest] is used for testing.
The RD curve of our “multiprior” model is compared with that of the HP model [balle2018], which is trained from scratch using Liu Jiaheng’s PyTorch implementation [ptcompression, liujiaheng]. Liu Jiaheng’s code differs slightly from the paper’s definition [balle2018] in that a Laplace distribution is used in place of the normal distribution to stabilize training. Complexity is measured as the number of GMac (billion multiply-accumulate operations) using the ptflops counter [ptflops], and the number of memory lookup operations is calculated manually.
4.2 Results
The PSNR RD curve measured on the CLIC professional test set [clictest] is shown at the top of Figure 4. The performance of a 64-priors model is in line with that of the HP model: they both perform slightly better than BPG at high bpp, and achieve significantly better RD than the single-prior model. In the middle, the RD value at , the highest bitrate, is shown for 1, 2, 4, 8, 16, 32, 64, and 128 prior distributions. 128 priors offer marginal gains and cost an increased training time (1.5×) and encoding time. MS-SSIM performance of fine-tuned models is shown at the bottom of Figure 4; the 64-priors model still performs similarly to [balle2018], and both learned compression models benefit from this more perceptual metric compared with traditional codecs. A visual comparison of images compressed with the MSE loss () and the equivalent bitrate settings in conventional codecs is shown in Figure 5.
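| Stage | Measure | Component | Hyperprior | Manypriors | Ratio MP/HP |
|---|---|---|---|---|---|
| Encoding | GMac | main encoder | 769.82 | 769.82 | |
| | | hyper encoder | 23.75 | | |
| | | hyper decoder | 23.86 | | |
| | | total | 817.43 | 769.82 | 0.942 |
| | Lookups | indices | | 530.84 M | |
| | | CDF | 829.44 K * | 32.400 K | |
| | | total | 829.44 K * | 530.87 M | |
| Decoding | GMac | hyper decoder | 23.154 | | |
| | | main decoder | 769.60 | 769.60 | |
| | | total | 792.75 | 769.60 | 0.971 |
| | Lookups | CDF (total) | 829.44 K * | 32.400 K | |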
The computational complexity of our Manypriors scheme is compared with that of the HP model [balle2018]. This complexity is expressed in GMac for the neural network parts and in number of memory lookup operations. It is summarized in Table 1. The lack of an HP AE saves 3 % to 6 % GMac, depending on whether only the HP decoder (image decoding) or the whole HP codec (image encoding) is used. Decoding with the Manypriors scheme is greatly simplified compared to [balle2018] because the CDF table generation process takes the optimal indices stored as side information and looks up one static CDF table per latent spatial location, that is, fewer lookups than with an HP by a factor equal to the number of latent channels (typically 256). During encoding, the Manypriors scheme must look up every latent variable with every CDF table in order to determine the most cost-effective CDF tables. This results in more lookup operations than the HP scheme overall, by a factor equal to the number of priors (typically 64), although these lookup operations are relatively cheap because only two values are needed (variable ± 0.5), whereas each CDF table lookup in [balle2018] returns L probabilities. Moreover, it is challenging to make an accurate CDF LUT for the HP scheme, because quantizing the distribution scale parameter reduces the accuracy of the resulting CDFs, which negatively impacts the bitrate. This challenge is exacerbated when the distribution has multiple parameters [minnen] or when a mixture of distributions [gmm] is used. In Figure 4, LUTs are replaced by an accurate but complex Laplace distribution computation for the HP scheme in order to maximize the reported RD performance.
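The figures reported in Table 1 can be cross-checked with a short calculation. The GMac values and the 32.400 K lookup count are copied from the table; the factorization of the index lookups into locations × priors × channels, with 64 priors and 256 channels, is an assumption that happens to reproduce the reported 530.84 M.

```python
# Values copied from Table 1.
locations = 32_400                       # Manypriors CDF lookups: one per latent spatial location
main_encoder, hyper_encoder, hyper_decoder_enc = 769.82, 23.75, 23.86
main_decoder, hyper_decoder_dec = 769.60, 23.154

# Assumed factorization: every latent variable is looked up in every CDF table.
n_priors, n_channels = 64, 256
index_lookups = locations * n_priors * n_channels
encode_lookups = index_lookups + locations

# GMac ratios between Manypriors (no HP subnetwork) and the HP model.
encode_ratio = main_encoder / (main_encoder + hyper_encoder + hyper_decoder_enc)
decode_ratio = main_decoder / (main_decoder + hyper_decoder_dec)

print(index_lookups)           # 530841600
print(encode_lookups)          # 530874000
print(round(encode_ratio, 3))  # 0.942
print(round(decode_ratio, 3))  # 0.971
```

The two ratios correspond to the 6 % and 3 % GMac savings quoted for encoding and decoding respectively.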
Time complexity is measured for every step on CPU, where it can be reliably profiled due to synchronous execution. It is summarized in Table 2 with the following distinct subcategories: NN (neural network) is the time spent in the AE, CDF generation is the time spent building the CDF tables for a specific image, and entropy is the bitstream generation. All operations are done using the PyTorch framework in Python, except for entropy encoding, which makes use of the torchac range coding library [torchac, torchaccode], written in C++, and the compression of the prior indices, which uses the LZMA library [lzma]. The total encoding time of the 64-priors model is 0.32 times that of the HP model and the decoding time is 0.42 times that of the HP model. The timing is more meaningful when it is broken down by subcategory because each component has a different response time depending on the hardware (and software) architecture in place. The AE (“NN”) encoding time is 0.90 times that of the HP scheme and the decoding time is 0.95 times that of the HP scheme. Both the hyper-encoder and hyper-decoder are called during encoding, thus it appears that each part of the HP subnetwork costs 5 % of the AE time. The time taken to build the CDF tables for the HP model was measured both by estimating the per-variable Laplace distributions (“full-precision”) and with a quantized scale parameter LUT. In any case, finding the best indices of a 64-priors model appears to be relatively inexpensive, and the total CDF table generation time is only 0.17 to 0.48 times that of the HP model (depending on whether the HP model uses full-precision or LUT) for encoding. During decoding, the 64-priors model spends 0.05 to 0.14 times as much time building the CDF tables as the HP model, because the optimal CDF table indices have already been determined during encoding and they are included in the bitstream.
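The handling of the prior indices as side information can be sketched with Python's standard lzma module. The one-byte-per-index serialization, the 180 × 180 layout of the 32400 locations, and the blocky synthetic index map are assumptions made for illustration; the source only states that the indices are compressed with the LZMA library.

```python
import lzma

def pack_indices(indices):
    """Serialize prior indices as one byte each (valid for up to 256 priors), then compress."""
    return lzma.compress(bytes(indices))

def unpack_indices(blob):
    """Recover the list of prior indices from the compressed side information."""
    return list(lzma.decompress(blob))

# Synthetic index map with large uniform regions, loosely mimicking a segmentation.
height = width = 180
index_map = [((r // 30) * 6 + (c // 30)) % 64
             for r in range(height) for c in range(width)]

blob = pack_indices(index_map)
restored = unpack_indices(blob)
print(len(index_map), "indices ->", len(blob), "bytes")
```

Because neighbouring locations often share the same winning prior, a general-purpose compressor is usually enough to keep this side information small.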
| Stage | Step | Hyperprior | Manypriors | Ratio MP/HP |
|---|---|---|---|---|
| Encoding | NN encode: main + hyperprior | 3.81 + 0.41 | 3.79 + 0.00 | 0.90 |
| | entropy encode: main + hyperprior | 0.15 + 0.02 | 0.15 + 0.00 | |
| | CDF: select indices + gather tables | 0.00 + | 1.90 + 0.81 | |
| | encode (total) | | 6.65 | |
| Decoding | NN decode: main + hyperprior | 10.66 + 0.34 | 10.50 | 0.95 |
| | CDF: gather tables | | 0.81 | |
| | entropy decode: main + hyperprior | 0.24 + 0.02 | 0.24 | 0.92 |
| | decode (total) | | 11.54 | |
5 Conclusion
Convolutional autoencoders trained for compression are optimized for both rate and distortion. Rate is estimated with a cumulative probability model, which in turn generates a CDF for every latent variable to be encoded. A single CDF per latent channel is not sufficient to capture the statistics at the output of the encoder, nor to allow the encoder to express a wide variety of features. To support multiple statistics, the hyperprior [balle2018] parametrizes a standard distribution, but this introduces a great deal of complexity in the entropy coding stage because the CDF differs for every latent variable to be encoded. The proposed method uses multiple prior distributions working as a competition of experts to capture the relevant features on which they specialize. This approach is advantageous because the learned CDF tables are stored in a static LUT once training is finished, and a model trained with 64 prior distributions achieves RD performance similar to that of a model trained with an HP subnetwork. Moreover, a learned CDF table includes the CDF for all channels in the latent code. Hence, accessing the CDF table for a spatial location provides the CDF for each of its channels, and the number of lookups is reduced to the number of latent spatial locations. In our experiments, CDF table generation in the encoding step takes 0.17 to 0.48 times as much time with a 64-priors model as it does with the HP model (depending on the precision of the HP model). This ratio is lowered to 0.05 to 0.14 during decoding because the prior indices have already been determined during encoding.
6 Acknowledgements
This research has been funded by the Walloon Region. Computational resources have been provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCL) and the Consortium des Équipements de Calcul Intensif en Fédération Wallonie Bruxelles (CÉCI) funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under convention 2.5020.11 and by the Walloon Region.