L3C-PyTorch
PyTorch Implementation of the CVPR'19 Paper "Practical Full Resolution Learned Lossless Image Compression"
We propose the first practical learned lossless image compression system, L3C, and show that it outperforms the popular engineered codecs, PNG, WebP and JPEG2000. At the core of our method is a fully parallelizable hierarchical probabilistic model for adaptive entropy coding which is optimized end-to-end for the compression task. In contrast to recent autoregressive discrete probabilistic models such as PixelCNN, our method i) models the image distribution jointly with learned auxiliary representations instead of exclusively modeling the image distribution in RGB space, and ii) only requires three forward-passes to predict all pixel probabilities instead of one for each pixel. As a result, L3C obtains over three orders of magnitude speedups compared to the fastest PixelCNN variant (Multiscale-PixelCNN). Furthermore, we find that learning the auxiliary representation is crucial and outperforms predefined auxiliary representations such as an RGB pyramid significantly.
Entropy coding / arithmetic coding for PyTorch
Original: https://github.com/fab-jul/L3C-PyTorch
While it is known that likelihood-based discrete generative models can in theory be used for lossless compression [37], recent work on learned compression using deep neural networks has solely focused on lossy compression [4, 38, 39, 32, 1, 3]. Indeed, the literature on discrete generative models [42, 41, 33, 30, 18] has largely ignored the application as a lossless compression system, with neither bitrates nor runtimes being compared with classical codecs such as PNG [29], WebP [43], JPEG2000-lossless [35], and FLIF [36]. This is not surprising as (lossless) entropy coding using likelihood-based discrete generative models amounts to a decoding complexity essentially identical to the sampling complexity of the model, which renders many of the recent state-of-the-art autoregressive models such as PixelCNN [42], PixelCNN++ [33], and Multiscale-PixelCNN [30] impractical, requiring minutes or hours on GPU to generate moderately large images, typically px (see Table 2). The computational complexity of these models is mainly caused by the sequential nature of the sampling (and thereby decoding) operation, where a forward pass needs to be computed for every single (sub) pixel of the image in a raster scan order.
In this paper, we address these challenges and develop a fully parallelizable learned lossless compression system, outperforming the popular classical systems PNG, WebP and JPEG2000.
Our system (see Fig. 1 for an overview) is based on a hierarchy of fully parallel learned feature extractors and predictors which are trained jointly for the compression task. The role of the feature extractors is to build an auxiliary feature representation which helps the predictors to model both the image and the auxiliary features themselves. Our experiments show that learning the feature representations is crucial, and heuristic (predefined) choices such as a multiscale RGB pyramid lead to suboptimal performance.
In more detail, to encode an image $x$, we feed it to the feature extractors and predictors, and obtain the predictions of the probability distribution of $x$ as well as all auxiliary variables in parallel in a single forward pass. These predictions are then used with an adaptive arithmetic encoder to obtain a compressed bitstream of $x$ and the auxiliary variables (see Sec. 3.1 for an introduction to adaptive arithmetic coding). However, the arithmetic decoder now needs predictions to be able to decode the bitstream. Starting from the lowest level of auxiliary variables (for which we assume a uniform prior), the decoder obtains a prediction of the distribution of the next set of auxiliary variables, and can thus decode them from the bitstream. Prediction and decoding are alternated until finally the arithmetic decoder obtains the image $x$.
In practice, we only need to use three feature extractors and predictors for our model, so when decoding we only need to perform three fully parallel forward passes (which has the same complexity as the single forward pass through the predictors when encoding) in combination with the adaptive arithmetic coding.
The parallel nature of our model enables it to be orders of magnitude faster for decoding than autoregressive models, while learning enables us to obtain compression rates competitive with state-of-the-art engineered lossless codecs.
In summary, our contributions are the following:
We propose a fully parallel hierarchical probabilistic model, learning both the feature extractors that produce an auxiliary feature representation to help the prediction task, as well as the predictors which model the joint distribution of all variables (see Section 3).
We show that entropy coding based on our non-autoregressive probabilistic model optimized for discrete log-likelihood can obtain compression rates outperforming WebP, JPEG2000 and PNG, the latter by a large margin. We are only marginally outperformed by the state-of-the-art, FLIF, while being conceptually much simpler (see Section 5.1).
As already mentioned, essentially all likelihood-based discrete generative models can be used with an arithmetic coder for lossless compression. A prominent group of models that obtain state-of-the-art performance are variants of the autoregressive PixelRNN/PixelCNN [42, 41]. PixelRNN and PixelCNN organize the pixels of the image distribution as a sequence and predict the distribution of each pixel conditionally on (all) previous pixels using an RNN and a CNN with masked convolutions, respectively. These models hence require a number of network evaluations equal to the number of predicted sub-pixels. PixelCNN++ [33] simplifies the structure of PixelCNN and adds regularization, leading to faster training and better predictive performance. MS-PixelCNN [30] parallelizes PixelCNN by reducing dependencies between blocks of pixels and processing them in parallel with shallow PixelCNNs. [18] equips PixelCNN with auxiliary variables (grayscale version of the image or RGB pyramid) to encourage modeling of high-level features, thereby improving the overall modeling performance. [6, 27] propose autoregressive models similar to PixelCNN/PixelRNN, but additionally rely on attention mechanisms to increase the receptive field.
The well-known PNG [29] operates in two stages: first the image is reversibly transformed to a more compressible representation with a simple autoregressive filter that updates pixels based on surrounding pixels, then it is compressed with the deflate algorithm [10]. WebP [43] uses more involved transformations, including the use of entire image fragments to encode new pixels and a custom entropy coding scheme. JPEG 2000 [35] includes a lossless mode where tiles are reversibly transformed before the coding step, instead of irreversibly removing frequencies. The current state-of-the-art (non-learned) algorithm is FLIF [36]. It relies on powerful preprocessing and a sophisticated entropy coding method based on CABAC [31] called MANIAC, which grows a dynamic decision tree per channel as an adaptive context model during encoding.
In lossy compression, context models have been studied as a way to efficiently (losslessly) encode lossy representations of images. Classical approaches are discussed in [22, 24, 25, 46, 44]. Recent learned approaches include [20, 23, 26], where shallow autoregressive models over latents are learned.
The objective of continuous likelihood models, such as VAEs [17] and RealNVP [11], where $p(x')$ is a continuous distribution, is closely related to its discrete counterpart. In particular, by setting $x' = x + u$, where $x$ is the discrete image and $u$ is uniform quantization noise, the continuous likelihood of $x'$ is a lower bound on the likelihood of the discrete $x$ [37]. However, there are two challenges for deploying such models for compression. First, the discrete likelihood needs to be available (which involves a non-trivial integration step). Additionally, the memory complexity of (adaptive) arithmetic coding depends on the size of the domain of the variables of the factorization of $p$ (see Sec. 3.1 on (adaptive) arithmetic coding). Since the domain grows exponentially in the number of pixels in $x$, unless $p$ is factorizable, it is not feasible to use it with adaptive arithmetic coding.
In general, in lossless compression, we are given some stream of symbols drawn independently and identically distributed (i.i.d.) from a set $\mathcal{X}$ according to the probability mass function $p$. The goal is to encode this stream into a bitstream of minimal length using a “code”, s.t. a receiver can decode the symbols from the bitstream. Ideally, an encoder minimizes the expected bits per symbol $\tilde{L} = \sum_{j \in \mathcal{X}} p(j)\,\ell(j)$, where $\ell(j)$ is the length of encoding symbol $j$ (i.e., more probable symbols should obtain shorter codes). Information theory states (e.g., [8]) that $\tilde{L} \geq H(p)$ for any possible code, where $H(p) = \mathbb{E}_{j \sim p}[-\log p(j)]$ is the Shannon entropy.
A strategy that almost achieves the lower bound (for long enough symbol streams) is arithmetic coding [45]. (Footnote 1: We use (adaptive) arithmetic coding for simplicity of exposition, but any adaptive entropy-achieving coder can be used with our method.) It encodes the entire stream into a single number $a' \in [0, 1)$, by subdividing $[0, 1)$ in each step (encoding one symbol) as follows: Let $a, b$ be the bounds of the current step (initialized to $a = 0$ and $b = 1$ for the initial interval $[0, 1)$). We divide the interval $[a, b)$ into one section per symbol, where the length of the $j$-th section is $(b - a)\,p(j)$, with $p(j)$ the probability of the $j$-th symbol. Then we pick the interval corresponding to the current symbol, i.e., we update $a, b$ to be the boundaries of this interval. We proceed recursively until no symbols are left. Finally, we transmit $a'$, which is $a$ rounded to the smallest number of bits s.t. $a' \geq a$. Receiving $a'$ together with the knowledge of the number of encoded symbols and $p$ uniquely specifies the stream and allows the receiver to decode.
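The interval subdivision described above can be illustrated with a minimal sketch. It uses exact fractions instead of the finite-precision bit output of a real coder, and the helper names (`build_intervals`, `encode`, `decode`) and the toy distribution are assumptions made for illustration only, not part of the L3C code base.

```python
from fractions import Fraction


def build_intervals(pmf):
    """Assign each symbol its [low, high) slice of [0, 1) according to the pmf."""
    intervals, low = {}, Fraction(0)
    for symbol, prob in pmf.items():
        intervals[symbol] = (low, low + prob)
        low += prob
    return intervals


def encode(stream, pmf):
    """Narrow [a, b) once per symbol; any number in the final interval identifies the stream."""
    intervals = build_intervals(pmf)
    a, b = Fraction(0), Fraction(1)
    for symbol in stream:
        lo, hi = intervals[symbol]
        width = b - a
        a, b = a + width * lo, a + width * hi
    return a, b


def decode(value, num_symbols, pmf):
    """Replay the same subdivision and pick the section containing `value` at each step."""
    intervals = build_intervals(pmf)
    a, b = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(num_symbols):
        width = b - a
        for symbol, (lo, hi) in intervals.items():
            s_lo, s_hi = a + width * lo, a + width * hi
            if s_lo <= value < s_hi:
                decoded.append(symbol)
                a, b = s_lo, s_hi
                break
    return decoded


# Toy i.i.d. source (illustrative values only).
pmf = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
stream = list("abac")
a, b = encode(stream, pmf)
decoded = decode(a, len(stream), pmf)
print(a, b, decoded)
```

The width of the final interval equals the product of the symbol probabilities, so about $-\log_2(b - a)$ bits suffice to name a number inside it; this is why the code length approaches the entropy bound for long streams.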
In contrast to the i.i.d. setting we just described, we are interested in losslessly encoding the pixels of an image, which are not i.i.d. at all. Let $x_t$ be the sub-pixels of an image $x$, and $p_{\text{img}}(x)$ the joint distribution of all sub-pixels. We can then consider the factorization $p_{\text{img}}(x) = \prod_t p(x_t \mid x_{t-1}, \dots, x_1)$. Now, to encode $x$, we can consider the sub-pixels $x_t$ as our symbol stream and encode the $t$-th symbol/sub-pixel using $p(x_t \mid x_{t-1}, \dots, x_1)$. Note that this corresponds to varying the $p$ of the previous paragraph during encoding, and is in general referred to as adaptive arithmetic coding (AAC) [45]. For AAC the receiver also needs to know the varying $p$ at every step, i.e., they must either be known a priori or the factorization must be causal (as above) so that the receiver can calculate them from already decoded symbols.
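To make the causality requirement concrete, the following sketch computes the ideal code length of a stream under a distribution that changes at every step. The count-based predictor is only a stand-in assumption for a learned model; what matters is that it uses already seen symbols only, so a decoder could recompute the same probabilities.

```python
import math


def adaptive_code_length(stream, alphabet):
    """Ideal bit cost sum_t -log2 p_t(x_t) under a causal, count-based model.

    The model (Laplace-smoothed symbol counts) is a hypothetical stand-in:
    it only looks at symbols that were already coded.
    """
    counts = {symbol: 1 for symbol in alphabet}
    total_bits = 0.0
    for symbol in stream:
        p = counts[symbol] / sum(counts.values())
        total_bits += -math.log2(p)
        counts[symbol] += 1  # update after coding, as the decoder would
    return total_bits


stream = [0, 0, 0, 0, 0, 0, 1, 0]
adaptive_bits = adaptive_code_length(stream, alphabet=[0, 1])
static_bits = len(stream) * math.log2(2)  # fixed uniform model over two symbols
print(f"adaptive: {adaptive_bits:.2f} bits, static uniform: {static_bits:.2f} bits")
```

On this skewed toy stream the adaptive model needs fewer bits than the fixed uniform model, because its predictions sharpen as more symbols are observed.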
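In practice, the exact $p$ is usually unknown, and instead is estimated by a model $\tilde{p}$. Thus, instead of using length $\log 1/p(j)$ to encode a symbol $j$, we need the (sub-optimal) length $\log 1/\tilde{p}(j)$. Then
$$H(p, \tilde{p}) = \mathbb{E}_{j \sim p}\left[-\log \tilde{p}(j)\right] = -\sum_{j \in \mathcal{X}} p(j) \log \tilde{p}(j) \qquad (1)$$
is the resulting expected (sub-optimal) bits per symbol, and is called the cross-entropy [8].
Thus, we can minimize the bitcost needed to encode a symbol stream with symbols distributed according to $p$ by minimizing Eq. (1) with respect to the model $\tilde{p}$. This naturally generalizes to the non-i.i.d. case described in the previous paragraph by using a different $p(x_t)$ and $\tilde{p}(x_t)$ for each symbol $x_t$ and minimizing $\sum_t H(p(x_t), \tilde{p}(x_t))$.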
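As a small numerical illustration of Eq. (1), the sketch below compares the entropy of a toy source with the cross-entropy under a mismatched model. The two distributions are arbitrary example values, not taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy in bits: the lower bound on expected bits per symbol."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)


def cross_entropy(p, p_model):
    """Expected bits per symbol when symbols follow p but are coded with p_model."""
    return -sum(pj * math.log2(qj) for pj, qj in zip(p, p_model) if pj > 0)


p = [0.5, 0.25, 0.25]        # true symbol distribution (toy values)
p_model = [0.25, 0.25, 0.5]  # mismatched model (toy values)

h = entropy(p)
ce = cross_entropy(p, p_model)
print(f"entropy: {h} bits, cross-entropy: {ce} bits, overhead: {ce - h} bits")
```

The gap between the two quantities is the price paid for model mismatch, which is why training the model to minimize the cross-entropy directly minimizes the expected file size.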
The following sections describe how we model a hierarchical causal factorization of for natural images to be able to do learned lossless image compression (L3C) efficiently.
A high-level overview of the architecture is given in Fig. 1. Unlike autoregressive models such as PixelCNN and PixelRNN, which factorize the image distribution autoregressively over sub-pixels $x_t$ as $p(x) = \prod_t p(x_t \mid x_{t-1}, \dots, x_1)$, we jointly model all the sub-pixels and introduce a learned hierarchy of auxiliary feature representations $z^{(1)}, \dots, z^{(S)}$ to simplify the modeling task. (Footnote 2: We fix the dimensions of $z^{(s)}$ to be $C \times H' \times W'$, where the number of channels $C$ is a hyperparameter, and $H' = H/2^s$, $W' = W/2^s$ given an $H \times W$-dimensional image. Considering that $z^{(s)}$ is quantized, this conveniently upper bounds the information that can be contained within each $z^{(s)}$; however, other dimensions could be explored.) Specifically, we model the joint distribution of the image $x$ and the feature representations $z^{(s)}$ as
$$p(x, z^{(1)}, \dots, z^{(S)}) = p(x \mid z^{(1)}, \dots, z^{(S)}) \prod_{s=1}^{S} p(z^{(s)} \mid z^{(s+1)}, \dots, z^{(S)}),$$
where $p(z^{(S)})$ is a uniform distribution. The feature representations can be hand designed or learned. Specifically, on one side, we consider an RGB pyramid with $z^{(s)} = \mathcal{B}_{2^s}(x)$, where $\mathcal{B}_{2^s}$ is the bicubic (spatial) subsampling operator with subsampling factor $2^s$. On the other side, we consider a learned representation $z^{(s)} = F^{(s)}(x)$ using a feature extractor $F^{(s)}$. We use the hierarchical model shown in Fig. 1 using the composition $F^{(s)} = Q \circ E^{(s)} \circ \dots \circ E^{(1)}$, where the $E^{(s)}$ are feature extractor blocks and $Q$ is a scalar differentiable quantization function (see Section 3.3). The $D^{(s)}$ in Fig. 1 are predictor blocks, and we parametrize $E^{(s)}$ and $D^{(s)}$ as convolutional neural networks.
Letting $z^{(0)} = x$, we parametrize the conditional distributions for all $s \in \{0, \dots, S-1\}$ as
$$p(z^{(s)} \mid z^{(s+1)}, \dots, z^{(S)}) = p(z^{(s)} \mid f^{(s+1)}),$$
using the features of the predictor
$$f^{(s)} = D^{(s)}(f^{(s+1)}, z^{(s)}),$$
where $f^{(S+1)} = 0$, i.e., the final predictor only sees $z^{(S)}$.
A detailed description of the feature extractor and predictor architecture is given in Fig. 2. The predictor is based on the super-resolution architecture from EDSR [21], motivated by the fact that our prediction task is somewhat related to super-resolution in that both are dense prediction tasks involving spatial upsampling. We mirror the predictor to obtain the feature extractor, and follow [21] in not using BatchNorm [15]. Inspired by the “atrous spatial pyramid pooling” from [5], we insert a similar layer before predicting the channels of : In , we use three atrous convolutions in parallel, with rates 1, 2, and 4, then concatenate the resulting feature maps to a -dimensional feature map.
We use the scalar quantization approach proposed in [23] to quantize the output of the feature extractor blocks: Given levels $\mathbb{L} = \{\ell_1, \dots, \ell_L\}$, we use nearest neighbor assignments to quantize each entry $z'$ as
$$z = Q(z') := \arg\min_{\ell_j \in \mathbb{L}} \lVert z' - \ell_j \rVert, \qquad (2)$$
but use differentiable “soft quantization”
$$\tilde{Q}(z') = \sum_{j=1}^{L} \frac{\exp(-\sigma_q \lVert z' - \ell_j \rVert)}{\sum_{l=1}^{L} \exp(-\sigma_q \lVert z' - \ell_l \rVert)}\, \ell_j \qquad (3)$$
to compute gradients for the backward pass, where $\sigma_q$ is a hyperparameter relating to the “softness” of the quantization. For simplicity, we fix $\mathbb{L}$ to be $L$ evenly spaced values.
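The two quantizers can be sketched in a few lines. This is a scalar, framework-free illustration: the softmax-weighted form of the soft quantizer follows the soft-to-hard idea and is an assumption here, and the level grid and softness values are arbitrary example choices rather than the paper's settings.

```python
import math


def hard_quantize(z, levels):
    """Nearest-neighbor assignment: return the level closest to z."""
    return min(levels, key=lambda level: abs(z - level))


def soft_quantize(z, levels, sigma):
    """Softmax-weighted average of the levels; larger sigma means closer to hard quantization."""
    weights = [math.exp(-sigma * abs(z - level)) for level in levels]
    total = sum(weights)
    return sum(w * level for w, level in zip(weights, levels)) / total


# Example grid and softness values (hypothetical, for illustration only).
levels = [-1.0, -0.5, 0.0, 0.5, 1.0]
z = 0.3

hard = hard_quantize(z, levels)
soft_smooth = soft_quantize(z, levels, sigma=1.0)
soft_sharp = soft_quantize(z, levels, sigma=200.0)
print(hard, soft_smooth, soft_sharp)
```

In an autograd framework, the hard value would be used in the forward pass while gradients are taken through the soft version; the sketch only shows the two functions themselves and how the softness parameter moves one toward the other.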
Letting again , we model the conditional distributions using a generalization of the discretized logistic mixture model with components proposed in [33]. Specifically, consider the -dimensional feature representation whose entries take values in for and in otherwise. Let index the channel and the spatial location. We assume the entries of to be independent across but we share mixture weights across channels , i.e.,
(4) |
using the mixture distribution
where the mixture weights and the parameters of the mixture components and are predicted from the features . The mixture components follow a logistic distribution,
Here, is the bin width of the quantization grid ( for and otherwise; the edge-cases and occurring for are handled as described in [33, Sec. 2.1]). For the distribution of , we follow [33, Sec. 2.2] by allowing to linearly depend on the RGB pixels of previous channels (for simplicity, this is not reflected in our notation). We emphasize that in contrast to [33], our model is not autoregressive over pixels, i.e., is modelled as independent across .
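The discretized logistic mixture can be sketched as follows for a single entry on an integer grid. The probability of a bin is the logistic CDF evaluated at its upper edge minus the CDF at its lower edge, with the first and last bins absorbing the tails as in [33, Sec. 2.1]. The mixture parameters below are hand-picked assumptions; in the model they would be predicted from the features.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def mixture_probability(x, grid, weights, means, scales):
    """Probability mass of grid value x under a discretized logistic mixture."""
    b = grid[1] - grid[0]  # bin width of the (evenly spaced) grid
    prob = 0.0
    for pi, mu, s in zip(weights, means, scales):
        # Edge bins absorb the tails so that the masses sum to one.
        upper = 1.0 if x == grid[-1] else sigmoid((x + b / 2 - mu) / s)
        lower = 0.0 if x == grid[0] else sigmoid((x - b / 2 - mu) / s)
        prob += pi * (upper - lower)
    return prob


grid = list(range(256))  # 8-bit sub-pixel values
weights = [0.7, 0.3]     # mixture weights (illustrative)
means = [100.0, 180.0]   # component means (illustrative)
scales = [5.0, 12.0]     # component scales (illustrative)

probs = [mixture_probability(x, grid, weights, means, scales) for x in grid]
total_mass = sum(probs)
most_likely = max(grid, key=lambda x: probs[x])
bits_for_100 = -math.log2(probs[100])
print(total_mass, most_likely, bits_for_100)
```

Because the bin masses telescope, they sum to one, so the resulting table can be handed directly to an entropy coder, and $-\log_2$ of a mass is the ideal cost in bits of coding that value.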
We are now ready to define the loss, which is a generalization of the discrete logistic mixture loss introduced in [33]. Recall from Sec. 3.1 that our goal is to model the true joint distribution of and the representations , i.e., as accurately as possible using our model . Thereby, the are defined using the learned feature extractor blocks , and is a product of discretized (conditional) logistic mixture models with parameters defined through the , which are in turn computed using the learned predictor blocks . As discussed in Sec. 3.1, the expected coding cost incurred by coding w.r.t. our model is the cross entropy .
We therefore directly minimize w.r.t. the parameters of the feature extractor blocks and predictor blocks over samples. Specifically, given training samples , let . We minimize
(5) |
Note that the loss decomposes into the sum of the cross-entropies of the different representations. Also note that this loss corresponds to the negative log-likelihood of the data w.r.t. our model which is typically the perspective taken in the generative modeling literature (see, e.g., [42]).
Group | Method | Open Images [bpsp] | RAISE1K [bpsp] | DIV2K [bpsp]
---|---|---|---|---
Ours | L3C | 2.629 | 2.952 | 3.159
Learned Baselines | RGB Shared | 3.060 | 3.588 | 3.870
Learned Baselines | RGB | 3.008 | 3.517 | 3.751
Non-Learned Approaches | PNG | 3.779 | 4.167 | 4.527
Non-Learned Approaches | JPEG2000 | 2.778 | 3.094 | 3.331
Non-Learned Approaches | WebP | 2.666 | 2.972 | 3.234
Non-Learned Approaches | FLIF | 2.473 | 2.822 | 3.046
We emphasize that in contrast to the generative model literature we learn the representations, propagating gradients to both and , since each component of our loss depends on through the parametrization of the logistic distribution and on because of the differentiable . Thereby, our network can autonomously learn to navigate the trade-off between a) making the output of feature extractor more easily estimable for the predictor and b) putting enough information into for predictor to predict .
When the feature extractors in our approach are restricted to a non-learned multiscale RGB pyramid as auxiliary variables (see baselines in the next section), it shares some similarities with MS-PixelCNN [30]. In particular, [30] combines such a pyramid with upscaling networks which play the same role as the predictors in our architecture. Crucially however, they rely on i) combining such predictors with a shallow PixelCNN and ii) upscaling one dimension at a time. While their complexity is reduced from forward passes needed for PixelCNN [42] to , their approach is in practice still over three orders of magnitude slower than ours (see Table 2). Furthermore, we stress that these similarities only apply for our RGB baseline model, whereas our best models are obtained using learned feature extractors trained jointly with the predictors.
The main models are trained on 213 487 images randomly selected from the Open Images Train data set [19]. We evaluate on 500 randomly selected images from Open Images Test, 500 randomly selected images from RAISE1k [9], as well as 100 images from the DIV2K super-resolution data set [2]. All data sets are downscaled to 768 pixels on the longer side to remove potential artifacts from previous compression, where we discarded images where rescaling did not result in at least downscaling. We also discarded any high saturation images, i.e., with mean or in the HSV color space (this removes 3 images from DIV2K; for the other test data sets we do this before selecting 500 images). We additionally investigate the ImageNet64 data set [7] with 1 281 151 training images and 50 000 validation images, each pixels.
We compare our main model (L3C) to two learned baselines: For the RGB Shared baseline we use bicubic subsampling as feature extractors, i.e., , and only train one predictor . During testing, we can obtain multiple using and apply the predictor as often as needed. (Footnote 3: We choose 4 applications as this results in the final requiring a negligible number of bits.) The RGB baseline also uses bicubic subsampling; however, we train predictors, one for each scale, to capture the different distributions of different RGB scales.
For our main model, L3C, we additionally learn feature extractors. We emphasize that the only difference to the RGB baseline is that the representations are learned. We train all these models for 700k iterations, where they converge. Additionally, in order to compare to the PixelCNN literature, we train L3C also on ImageNet64. Due to the smaller images and bigger data set, we increase the batch size to 120 and decay every epoch.
We find that adding BatchNorm slightly degrades performance. Furthermore, replacing the stacked atrous convolutions with a single convolution slightly degrades performance as well. By stopping gradients being propagated through the targets of our loss, we get significantly worse performance; in fact, the optimizer does not manage to pull down the cross-entropy of any of the learned representations significantly.
We find that the choice of for impacts training: [23] suggests setting it s.t. resembles identity, which we found a good starting point. However, we found it beneficial to let be slightly smoother, yielding better gradients for the encoder. We use .
Additionally, we explored the impact of varying the number of channels of the representations and the number of levels, and found it more beneficial to increase the number of levels instead of increasing the number of channels, i.e., it is beneficial for training to have a finer quantization grid.
Method | px | px
---|---|---
BS=1 | |
PixelCNN++ | 47.4 s | min
L3C (Ours) | 0.0142 s | 0.0214 s
BS=30 | |
PixelCNN++ | 11.3 s | min
L3C (Ours) | 0.000514 s | 0.00850 s
PixelCNN [42] | 120 s | hours
MS-PixelCNN [30] | 1.17 s | min
resolution, times from other approaches are interpolated due to the long runtimes or unavailability of code.
Table 1 shows a comparison of our approach (L3C) and the learned baselines to the other codecs, on our test sets. All of our methods outperform the widely-used PNG, which is at least larger on all data sets. We also outperform WebP and JPEG2000-LL everywhere by a smaller margin of up to . We note that FLIF still marginally outperforms our model but remind the reader of the many hand-engineered highly specialized techniques involved in FLIF (see Section 2). In contrast, we use a simple convolutional feed-forward neural network architecture. The RGB baseline with learned predictors outperforms the RGB Shared baseline on all data sets, showing the importance of learning a predictor for each scale. Using our main model (L3C), where we additionally learn the feature extractors, we outperform both baselines: The outputs are at least larger everywhere, showing the benefits of learning the representation.
Method | bpsp | Time for | Learned
---|---|---|---
L3C (Ours) | 4.52 | 0.0021 s | ✓
PixelCNN [42] | 3.57 | min | ✓
MS-PixelCNN [30] | 3.70 | s |
PNG | 5.74 | |
JPEG2000-LL | 4.93 | |
WebP | 4.64 | |
FLIF | 4.54 | |
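The relative size differences discussed for Table 1 can be recomputed directly from the reported bpsp values. The sketch below is plain arithmetic on those numbers; the dictionary layout and the helper name `percent_larger` are illustrative choices.

```python
# bpsp values copied from Table 1 (Open Images, RAISE1K, DIV2K).
bpsp = {
    "L3C": (2.629, 2.952, 3.159),
    "RGB Shared": (3.060, 3.588, 3.870),
    "RGB": (3.008, 3.517, 3.751),
    "PNG": (3.779, 4.167, 4.527),
    "JPEG2000": (2.778, 3.094, 3.331),
    "WebP": (2.666, 2.972, 3.234),
    "FLIF": (2.473, 2.822, 3.046),
}


def percent_larger(method, reference="L3C"):
    """How much larger (in percent) the outputs of `method` are than those of `reference`."""
    return [100.0 * (m / r - 1.0) for m, r in zip(bpsp[method], bpsp[reference])]


relative = {name: percent_larger(name) for name in bpsp if name != "L3C"}
for name, values in relative.items():
    print(name, [f"{v:+.1f}%" for v in values])
```

Positive values mean the method produces larger files than L3C on that data set; only FLIF comes out negative on all three.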
Table 2 shows a speed comparison to three PixelCNN-based approaches. The original PixelCNN [42] conditions the probability of seeing some sub-pixel value on all previously seen pixels in the current channel (R, G, or B) and all pixels from all previously seen channels. To decode a full image, forward passes are needed. PixelCNN++ [33] improves on this in various ways, including modeling the joint distribution of each pixel, thereby eliminating conditioning on previous channels and reducing to forward passes, whereas MS-PixelCNN [30] needs forward passes (see Section 3.6).
We note that the reported times only account for the time spent evaluating when sampling/decoding. For all approaches, a pass with an entropy coder is needed to actually encode/decode images. Note, however, that state-of-the-art adaptive entropy coders typically require on the order of milliseconds per MB (see [12] and in particular [13] for benchmarks on adaptive entropy coding), and the time spent for entropy coding should thus be in the same order as computing using L3C.
We observe that our approach is considerably faster than the three PixelCNN approaches. When comparing on crops, we observe massive speedups of for batch size (BS) 1 and for BS 30 compared to the original PixelCNN [42], and for BS 30 we are faster than MS-PixelCNN, the fastest PixelCNN-based approach. (Footnote 4: We estimate the speedup of L3C compared to MS-PixelCNN from the results reported in [30] for PixelCNN and MS-PixelCNN, assuming PixelCNN++ is not slower than PixelCNN to get a conservative estimate (PixelCNN++ is in fact around faster than PixelCNN).)
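The speedup factors can be recomputed from the timings in the first (smaller crop) column of Table 2. The sketch assumes that the PixelCNN [42] and MS-PixelCNN [30] rows belong to the BS=30 group, as the text above suggests; the dictionary keys are illustrative.

```python
import math

# Seconds per image from the first crop-size column of Table 2.
# Assumption: PixelCNN and MS-PixelCNN are compared at batch size 30.
seconds = {
    ("PixelCNN++", 1): 47.4,
    ("L3C", 1): 0.0142,
    ("PixelCNN++", 30): 11.3,
    ("L3C", 30): 0.000514,
    ("PixelCNN", 30): 120.0,
    ("MS-PixelCNN", 30): 1.17,
}


def speedup(method, batch_size):
    """Runtime of `method` divided by the L3C runtime at the same batch size."""
    return seconds[(method, batch_size)] / seconds[("L3C", batch_size)]


speedups = {
    f"{method} (BS={bs})": speedup(method, bs)
    for (method, bs) in seconds
    if method != "L3C"
}
for name, factor in speedups.items():
    print(f"{name}: {factor:,.0f}x (about 10^{math.log10(factor):.1f})")
```

Every factor exceeds one thousand, which matches the "over three orders of magnitude" statement for MS-PixelCNN and reaches five orders of magnitude for the original PixelCNN.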
To put the runtimes reported in Table 2 into perspective, we further evaluate the bitcost on ImageNet64, for which PixelCNN and MS-PixelCNN were trained, in Table 3. The reported times are interpolated from the results in Table 2 and again refer only to the time needed to evaluate when sampling/decoding. We observe our outputs to be larger than MS-PixelCNN and larger than the original PixelCNN, but smaller than all classical approaches. However, we note again that our model (L3C) is over 3 orders of magnitude faster than MS-PixelCNN.
[Fig. 3 panels: 4.245 bpsp, stored: 0,1,2,3 (top left); 1.662 bpsp, stored: 1,2,3; 0.391 bpsp, stored: 2,3; 0.121 bpsp, stored: 3]
We stress that we study image compression and not image generation. Nevertheless, our method produces models from which the image and the auxiliary representations can be sampled. Therefore, we visualize the output when sampling part of the representations from our model in Fig. 3: the top left shows an image from the RAISE1k test set, when we store all scales (losslessly). When we store the three feature scales (1, 2, and 3 in Fig. 3) but not the image itself (scale 0) and instead sample it from the model, we only need about 39% of the total bits (1.662 of 4.245 bpsp) without noticeably degrading visual quality. Sampling scales 0 and 1 leads to some blur while reducing the number of stored bits to about 9% of the full bitcost (0.391 bpsp). Finally, only storing scale 3 (about 3% of the full bitcost, 0.121 bpsp) and sampling scales 2, 1, and 0 produces significant artifacts. However, the original image is still recognizable, showing the ability of our networks to learn a hierarchical representation capturing global image structure.
We visualize the representations in Fig. 4. It can be seen that the global image structure is preserved over scales, with representations corresponding to smaller modeling more detail. This shows potential for efficiently performing image understanding tasks on partially decoded images similarly as described in [40] for lossy learned compression: instead of training a feature extractor for a given task on , one could directly use the features from our network.
We proposed and evaluated a fully parallel hierarchical probabilistic model with auxiliary feature representations. Our L3C model outperformed PNG, JPEG2000 and WebP on all data sets (see Table 1). Furthermore, it significantly outperformed the RGB Shared and RGB baselines which rely on predefined heuristic feature representations, showing that learning the representations is crucial. Additionally, we observed that using PixelCNN-based methods for losslessly compressing full-resolution images takes three to five orders of magnitude longer than L3C.
To further improve L3C, future work could investigate weak forms of autoregression across pixels and/or dynamic adaptation of the model network to the current image. Furthermore, it would be interesting to explore domain-specific applications, e.g., for medical image data.
Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems (NIPS), pages 1141–1151, 2017.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
Neural networks for machine learning lecture 6a overview of mini-batch gradient descent.
Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pages 1945–1953, 2017.
Lossy image compression with compressive autoencoders. In International Conference on Learning Representations (ICLR), 2017.
Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5435–5443. IEEE, 2017.