We propose a new approach to the problem of optimizing autoencoders for lossy image compression. New media formats, changing hardware technology, as well as diverse requirements and content types create a need for compression algorithms which are more flexible than existing codecs. Autoencoders have the potential to address this need, but are difficult to optimize directly due to the inherent non-differentiabilty of the compression loss. We here show that minimal changes to the loss are sufficient to train deep autoencoders competitive with JPEG 2000 and outperforming recently proposed approaches based on RNNs. Our network is furthermore computationally efficient thanks to a sub-pixel architecture, which makes it suitable for high-resolution images. This is in contrast to previous work on autoencoders for compression using coarser approximations, shallower architectures, computationally expensive methods, or focusing on small images.READ FULL TEXT VIEW PDF
Advances in training of neural networks have helped to improve performance in a number of domains, but neural networks have yet to surpass existing codecs in lossy image compression. Promising first results have recently been achieved using autoencoders(Ballé et al., 2016; Toderici et al., 2016b) – in particular on small images (Toderici et al., 2016a; Gregor et al., 2016; van den Oord et al., 2016b) – and neural networks are already achieving state-of-the-art results in lossless image compression (Theis & Bethge, 2015; van den Oord et al., 2016a).
Autoencoders have the potential to address an increasing need for flexible lossy compression algorithms. Depending on the situation, encoders and decoders of different computational complexity are required. When sending data from a server to a mobile device, it may be desirable to pair a powerful encoder with a less complex decoder, but the requirements are reversed when sending data in the other direction. The amount of computational power and bandwidth available also changes over time as new technologies become available. For the purpose of archiving, encoding and decoding times matter less than for streaming applications. Finally, existing compression algorithms may be far from optimal for new media formats such as lightfield images, 360 video or VR content. While the development of a new codec can take years, a more general compression framework based on neural networks may be able to adapt much quicker to these changing tasks and environments.
Unfortunately, lossy compression is an inherently non-differentiable problem. In particular, quantization
is an integral part of the compression pipeline but is not differentiable. This makes it difficult
to train neural networks for this task. Existing transformations have typically
been manually chosen (e.g., the DCT transformation used in JPEG) or have been optimized for a task different from
lossy compression (e.g. Testa & Rossi, 2016 , used denoising autoencoders for compression)
, used denoising autoencoders for compression). In contrast to most previous work, but in line with Ballé et al. (2016), we here aim at directly optimizing the rate-distortion tradeoff produced by an autoencoder. We propose a simple but effective approach for dealing with the non-differentiability of rounding-based quantization, and for approximating the non-differentiable cost of coding the generated coefficients.
Using this approach, we achieve performance similar to or better than JPEG 2000 when evaluated for perceptual quality. Unlike JPEG 2000, however, our framework can be optimized for specific content (e.g., thumbnails or non-natural images), arbitrary metrics, and is readily generalizable to other forms of media. Notably, we achieve this performance using efficient neural network architectures which would allow near real-time decoding of large images even on low-powered consumer devices.
We define a compressive autoencoder (CAE) to have three components: an encoder , a decoder , and a probabilistic model ,
The discrete probability distribution defined byis used to assign a number of bits to representations based on their frequencies, that is, for entropy coding. All three components may have parameters and our goal is to optimize the tradeoff between using a small number of bits and having small distortion,
Here, controls the tradeoff, square brackets indicate quantization through rounding to the nearest integer, and measures the distortion introduced by coding and decoding. The quantized output of the encoder is the code used to represent an image and is stored losslessly. The main source of information loss is the quantization (Appendix A.3). Additional information may be discarded by the encoder, and the decoder may not perfectly decode the available information, increasing distortion.
Unfortunately we cannot optimize Equation 2 directly using gradient-based techniques, as and are non-differentiable. The following two sections propose a solution to deal with this problem.
The derivative of the rounding function is zero everywhere except at integers, where it is undefined. We propose to replace its derivative in the backward pass of backpropagation(Rumelhart et al., 1986) with the derivative of a smooth approximation, , that is, effectively defining the derivative to be
Importantly, we do not fully replace the rounding function with a smooth approximation but only its derivative, which means that quantization is still performed as usual in the forward pass. If we replaced rounding with a smooth approximation completely, the decoder might learn to invert the smooth approximation, thereby removing the information bottle neck that forces the network to compress information.
Empirically, we found the identity, , to work as well as more sophisticated choices. This makes this operation easy to implement, as we simply have to pass gradients without modification from the decoder to the encoder.
Note that the gradient with respect to the decoder’s parameters can be computed without resorting to approximations, assuming is differentiable. In contrast to related approaches, our approach has the advantage that it does not change the gradients of the decoder, since the forward pass is kept the same.
In the following, we discuss alternative approaches proposed by other authors. Motivated by theoretical links to dithering, Ballé et al. (2016) proposed to replace quantization by additive uniform noise,
where is the floor operator. In the backward pass, the derivative is replaced with the derivative of the expectation,
Figure 1 shows the effect of using these two alternatives as part of JPEG, whose encoder and decoder are based on a block-wise DCT transformation (Pennebaker & Mitchell, 1993). Note that the output is visibly different from the output produced with regular quantization by rounding and that the error signal sent to the autoencoder depends on these images. Whereas in Fig. 1B the error signal received by the decoder would be to remove blocking artefacts, the signal in Fig. 1D will be to remove high-frequency noise. We expect this difference to be less of a problem with simple metrics such as mean-squared error and to have a bigger impact when using more perceptually meaningful measures of distortion.
An alternative would be to use the latter approximations only for the gradient of the encoder but not for the gradients of the decoder. While this is possible, it comes at the cost of increased computational and implementational complexity, since we would have to perform the forward and backward pass through the decoder twice: once using rounding, once using the approximation. With our approach the gradient of the decoder is correct even for a single forward and backward pass.
Since is a discrete function, we cannot differentiate it with respect to its argument, which prevents us from computing a gradient for the encoder. To solve this problem, we use a continuous, differentiable approximation. We upper-bound the non-differentiable number of bits by first expressing the model’s distribution in terms of a probability density ,
An upper bound is given by:
where the second step follows from Jensen’s inequality (see also Theis et al., 2016)
. An unbiased estimate of the upper bound is obtained by samplingfrom the unit cube . If we use a differentiable density, this estimate will be differentiable in and therefore can be used to train the encoder.
In practice we often want fine-gained control over the number of bits used. One way to achieve this is to train an autoencoder for different rate-distortion tradeoffs. But this would require us to train and store a potentially large number of models. To reduce these costs, we finetune a pre-trained autoencoder for different rates by introducing scale parameters111To ensure positivity, we use a different parametrization and optimize log-scales rather than scales. ,
Here, indicates point-wise multiplication and division is also performed point-wise. To reduce the number of trainable scales, they may furthermore be shared across dimensions. Where and are convolutional, for example, we share scale parameters across spatial dimensions but not across channels.
Perhaps most closely related to our work is the work of Ballé et al. (2016). The main differences lie in the way we deal with quantization (see Section 2.1) and entropy rate estimation. The transformations used by Ballé et al. (2016)
consist of a single linear layer combined with a form of contrast gain control, while our framework relies on more standard deep convolutional neural networks.
Toderici et al. (2016a)
proposed to use recurrent neural networks (RNNs) for compression. Instead of entropy coding as in our work, the network tries to minimize the distortion for a given number of bits. The image is encoded in an iterative manner, and decoding is performed in each step to be able to take into account residuals at the next iteration. An advantage of this design is that it allows for progressive coding of images. A disadvantage is that compression is much more time consuming than in our approach, as we use efficient convolutional neural networks and do not necessarily require decoding at the encoding stage.
Gregor et al. (2016) explored using variational autoencoders with recurrent encoders and decoders for compression of small images. This type of autoencoder is trained to maximize the lower bound of a log-likelihood, or equivalently to minimize
where plays the role of the encoder, and plays the role of the decoder. While Gregor et al. (2016)
used a Gaussian distribution for the encoder, we can link their approach to the work ofBallé et al. (2016) by assuming it to be uniform,
. If we also assume a Gaussian likelihood with fixed variance,, the objective function can be written
Here, is a constant which encompasses the negative entropy of the encoder and the normalization constant of the Gaussian likelihood. Note that this equation is identical to a rate-distortion trade-off with and quantization replaced by additive uniform noise. However, not all distortions have an equivalent formulation as a variational autoencoder (Kingma & Welling, 2014). This only works if is normalizable in and the normalization constant does not depend on , or otherwise will not be constant. An direct empirical comparison of our approach with variational autoencoders is provided in Appendix A.5.
Ollivier (2015) discusses variational autoencoders for lossless compression as well as connections to denoising autoencoders.
, who demonstrated that super-resolution can be achieved much more efficiently by operating in the low-resolution space, that is, by convolving images and then upsampling instead of upsampling first and then convolving an image.
The first two layers of the encoder perform preprocessing, namely mirror padding and a fixed pixel-wise normalization. The mirror-padding was chosen such that the output of the encoder has the same spatial extent as an 8 times downsampled image. The normalization centers the distribution of each channel’s values and ensures it has approximately unit variance. Afterwards, the image is convolved and spatially downsampled while at the same time increasing the number of channels to 128. This is followed by three residual blocks(He et al., 2015), where each block consists of an additional two convolutional layers with 128 filters each. A final convolutional layer is applied and the coefficients downsampled again before quantization through rounding to the nearest integer.
The decoder mirrors the architecture of the encoder (Figure 9). Instead of mirror-padding and valid
convolutions, we use zero-padded convolutions. Upsampling is achieved through convolution followed by a reorganization of the coefficients. This reorganization turns a tensor with many channels into a tensor of the same dimensionality but with fewer channels and larger spatial extent(for details, see Shi et al., 2016). A convolution and reorganization of coefficients together form a sub-pixel convolution layer. Following three residual blocks, two sub-pixel convolution layers upsample the image to the resolution of the input. Finally, after denormalization, the pixel values are clipped to the range of 0 to 255. Similar to how we deal with gradients of the rounding function, we redefine the gradient of the clipping function to be 1 outside the clipped range. This ensures that the training signal is non-zero even when the decoded pixels are outside this range (Appendix A.1).
To model the distribution of coefficients and estimate the bit rate, we use independent Gaussian scale mixtures (GSMs),
where and iterate over spatial positions, and iterates over channels of the coefficients for a single image . GSMs are well established as useful building blocks for modelling filter responses of natural images (e.g., Portilla et al., 2003). We used 6 scales in each GSM. Rather than using the more common parametrization above, we parametrized the GSM so that it can be easily used with gradient based methods, optimizing log-weights and log-precisions rather than weights and variances. We note that the leptokurtic nature of GSMs (Andrews & Mallows, 1974) means that the rate term encourages sparsity of coefficients.
All networks were implemented in Python using Theano(2016) and Lasagne (Dieleman et al., 2015). For entropy encoding of the quantized coefficients, we first created Laplace-smoothed histogram estimates of the coefficient distributions across a training set. The estimated probabilities were then used with a publicly available BSD licensed implementation of a range coder222https://github.com/kazuho/rangecoder/.
All models were trained using Adam (Kingma & Ba, 2015) applied to batches of 32 images pixels in size. We found it beneficial to optimize coefficients in an incremental manner (Figure 3B). This is done by introducing an additional binary mask ,
Initially, all but 2 entries of the mask are set to zero. Networks are trained until performance improvements reach below a threshold, and then another coefficient is enabled by setting an entry of the binary mask to 1. After all coefficients have been enabled, the learning rate is reduced from an initial value of to . Training was performed for up to updates but usually reached good performance much earlier.
After a model has been trained for a fixed rate-distortion trade-off (), we introduce and fine-tune scale parameters (Equation 9) for other values of while keeping all other parameters fixed. Here we used an initial learning rate of and continuously decreased it by a factor of , where is the current number of updates performed, , and . Scales were optimized for 10,000 iterations. For even more fine-grained control over the bit rates, we interpolated between scales optimized for nearby rate-distortion tradeoffs.
We trained compressive autoencoders on 434 high quality images licensed under creative commons and obtained from flickr.com. The images were downsampled to below pixels and stored as lossless PNGs to avoid compression artefacts. From these images, we extracted
crops to train the network. Mean squared error was used as a measure of distortion during training. Hyperparameters affecting network architecture and training were evaluated on a small set of held-out Flickr images. For testing, we use the commonly used Kodak PhotoCD dataset of 24 uncompressedpixel images333http://r0k.us/graphics/kodak/.
We compared our method to JPEG (Wallace, 1991), JPEG 2000 (Skodras et al., 2001), and the RNN-based method of (Toderici et al., 2016b)444We used the code which was made available on https://github.com/tensorflow/models/tree/2390974a/compression. We note that at the time of this writing, this implementation does not include entropy coding as in the paper of Toderici et al. (2016b).. Bits for header information were not counted towards the bit rate of JPEG and JPEG 2000. Among the different variants of JPEG, we found that optimized JPEG with 4:2:0 chroma sub-sampling generally worked best (Appendix A.2).
While fine-tuning a single compressive autoencoder for a wide range of bit rates worked well, optimizing all parameters of a network for a particular rate distortion trade-off still worked better. We here chose the compromise of combining autoencoders trained for low, medium or high bit rates (see Appendix A.4 for details).
For each image and bit rate, we choose the autoencoder producing the smallest distortion. This increases the time needed to compress an image, since an image has to be encoded and decoded multiple times. However, decoding an image is still as fast, since it only requires choosing and running one decoder network. A more efficient but potentially less performant solution would be to always choose the same autoencoder for a given rate-distortion tradeoff. We added 1 byte to the coding cost to encode which autoencoder of an ensemble is used.
Rate-distortion curves averaged over all test images are shown in Figure 4. We evaluated the different methods in terms of PSNR, SSIM (Wang et al., 2004a), and multiscale SSIM (MS-SSIM; Wang et al., 2004b). We used the implementation of van der Walt et al. (2014) for SSIM and the implementation of Toderici et al. (2016b) for MS-SSIM. We find that in terms of PSNR, our method performs similar to JPEG 2000 although slightly worse at low and medium bit rates and slightly better at high bit rates. In terms of SSIM, our method outperforms all other tested methods. MS-SSIM produces very similar scores for all methods, except at very low bit rates. However, we also find these results to be highly image dependent. Results for individual images are provided as supplementary material555https://figshare.com/articles/supplementary_zip/4210152.
In Figure 5 we show crops of images compressed to low bit rates. In line with quantitative results, we find that JPEG 2000 reconstructions appear visually more similar to CAE reconstructions than those of other methods. However, artefacts produced by JPEG 2000 seem more noisy than CAE’s, which are smoother and sometimes appear Gábor-filter-like.
To quantify the subjective quality of compressed images, we ran a mean opinion score (MOS) test. While MOS tests have their limitations, they are a widely used standard for evaluating perceptual quality (Streijl et al., 2014). Our MOS test set included the 24 full-resolution uncompressed originals from the Kodak dataset, as well as the same images compressed using each of four algorithms at or near three different bit rates: , and bits per pixel. Only the low-bit-rate CAE was included in this test.
For each image, we chose the CAE setting which produced the highest bit rate but did not exceed the target bit rate. The average bit rates of CAE compressed images were , , and , respectively. We then chose the smallest quality factor for JPEG and JPEG 2000 for which the bit rate exceeded that of the CAE. The average bit rates for JPEG were , and , for JPEG 2000 , and . For some images the bit rate of the CAE at the lowest setting was still higher than the target bit rate. These images were excluded from the final results, leaving 15, 21, and 23 images, respectively.
The perceptual quality of the resulting images was rated by non-expert evaluators. One evaluator did not finish the experiment, so her data was discarded. The images were presented to each individual in a random order. The evaluators gave a discrete opinion score for each image from a scale between (bad) to (excellent). Before the rating began, subjects were presented an uncompressed calibration image of the same dimensions as the test images (but not from the Kodak dataset). They were then shown four versions of the calibration image using the worst quality setting of all four compression methods, and given the instruction “These are examples of compressed images. These are some of the worst quality examples.”
shows average MOS results for each algorithm at each bit rate. 95% confidence intervals were computed via bootstrapping. We found that CAE and JPEG 2000 achieved higher MOS than JPEG or the method ofToderici et al. (2016b) at all bit rates we tested. We also found that CAE significantly outperformed JPEG 2000 at () and ().
We have introduced a simple but effective way of dealing with non-differentiability in training autoencoders for lossy compression. Together with an incremental training strategy, this enabled us to achieve better performance than JPEG 2000 in terms of SSIM and MOS scores. Notably, this performance was achieved using an efficient convolutional architecture, combined with simple rounding-based quantization and a simple entropy coding scheme. Existing codecs often benefit from hardware support, allowing them to run at low energy costs. However, hardware chips optimized for convolutional neural networks are likely to be widely available soon, given that these networks are now key to good performance in so many applications.
While other trained algorithms have been shown to provide similar results as JPEG 2000 (e.g. van den Oord & Schrauwen, 2014), to our knowledge this is the first time that an end-to-end trained architecture has been demonstrated to achieve this level of performance on high-resolution images. An end-to-end trained autoencoder has the advantage that it can be optimized for arbitrary metrics. Unfortunately, research on perceptually relevant metrics suitable for optimization is still in its infancy (e.g., Dosovitskiy & Brox, 2016; Ballé et al., 2016). While perceptual metrics exist which correlate well with human perception for certain types of distortions (e.g., Wang et al., 2004a; Laparra et al., 2016), developing a perceptual metric which can be optimized is a more challenging task, since this requires the metric to behave well for a much larger variety of distortions and image pairs.
In future work, we would like to explore the optimization of compressive autoencoders for different metrics. A promising direction was presented by Bruna et al. (2016), who achieved interesting super-resolution results using metrics based on neural networks trained for image classification. Gatys et al. (2016) used similar representations to achieve a breakthrough in perceptually meaningful style transfer. An alternative to perceptual metrics may be to use generative adversarial networks (GANs; Goodfellow et al., 2014). Building on the work of Bruna et al. (2016) and Dosovitskiy & Brox (2016), Ledig et al. (2016) recently demonstrated impressive super-resolution results by combining GANs with feature-based metrics.
We would like to thank Zehan Wang, Aly Tejani, Clément Farabet, and Luke Alonso for helpful feedback on the manuscript.
Scale mixtures of normal distributions.Journal of the Royal Statistical Society, Series B, 36(1):99–102, 1974.
Journal of Machine Learning Research, 15(1):2061–2086, 2014.
Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine Learning, 8(3-4):229–256, 1992.
We redefine the gradient of the clipping operation to be constant,
Consider how this affects the gradients of a squared loss,
Assume that is larger than 255. Without redefinition of the derivative, the error signal will be 0 and not helpful. Without any clipping, the error signal will depend on the value of , even though any value above 255 will have the same effect on the loss at test time. On the other hand, using clipping but a different signal in the backward pass is intuitive, as it yields an error signal which is proportional to the error that would also be incurred at test time.
We compared optimized and non-optimized JPEG with (4:2:0) and without (4:4:4) chroma sub-sampling. Optimized JPEG computes a Huffman table specific to a given image, while unoptimized JPEG uses a predefined Huffman table. We did not count bits allocated to the header of the file format, but for optimized JPEG we counted the bits required to store the Huffman table. We found that on average, chroma-subsampled and optimized JPEG performed better on the Kodak dataset (Figure 7).
Since a single real number can carry as much information as a high-dimensional entity, dimensionality reduction alone does not amount to compression. However, if we constrain the architecture of the encoder, it may be forced to discard certain information. To better understand how much information is lost due to dimensionality reduction and how much information is lost due to quantization, Figure 8 shows reconstructions produced by a compressive autoencoder with and without quantization. The effect of dimensionality reduction is minimal compared to the effect of quantization.
To bring the parameter controlling the rate-distortion trade-off into a more intuitive range, we rescaled the distortion term and expressed the objective as follows:
Here, is the number of coefficients produced by the encoder and is the dimensionality of (i.e., 3 times the number of pixels).
The high-bit-rate CAE was trained with and 96 output channels, the medium-bit-rate CAE was trained with and 96 output channels, and the low-bit-rate CAE was trained with and 64 output channels.
Below we show complete images corresponding to the crops in Figure 5. For each image, we show the original image (top left), reconstructions using CAE (top right), reconstructions using JPEG 2000 (bottom left) and reconstructions using the method of (Toderici et al., 2016b) (bottom right). The images are best viewed on a monitor screen.