The task of image compression has been thoroughly examined over the years by researchers and teams such as the Joint Photographic Experts Group, who designed the ubiquitous JPEG and JPEG 2000 (jpeg2000) image formats. More recently, the WebP algorithm was proposed to further improve image compression rates (webp:2015), especially for the high-resolution images that have become more common in recent years. All these efforts approach the compression problem from an empirical standpoint: human experts design various heuristics to reduce the amount of information that needs to be retained, then determine ways to transform the resulting data so that it is amenable to lossless compression. As this work is almost exclusively focused on the compression of large images, low-resolution thumbnail images are usually ignored (and even harmed, e.g., by requiring more data in file headers).
Standard image compression algorithms tend to make assumptions about image scale. For example, we usually assume that a patch from a high-resolution natural image will contain a lot of redundant information. In fact, the higher-resolution an image is, the more likely it is that its component patches will contain mostly low-frequency information. This fact is exploited by most image codecs and, as such, these codecs tend to be very efficient at compressing high-resolution images. However, such assumptions are broken when creating thumbnails from high-resolution natural images, as a patch taken from a thumbnail is much more likely to contain difficult-to-compress, high-frequency information.
Large-scale compression of thumbnails (e.g., 32×32 images) is an important application, both in terms of reducing disk storage and making better use of limited Internet bandwidth. Enormous numbers of thumbnails are currently transmitted across the web for page previews, photo galleries, search engine results, and numerous other applications. As such, any improvements to thumbnail compression will significantly improve the experience of users accessing content over low-bandwidth connections.
In recent years, neural networks have become a commonplace tool for tasks that had for decades been accomplished by ad hoc algorithms and heuristics. For instance, in image recognition and object detection, the current state-of-the-art algorithms are all based on neural networks. It is only natural to ask whether we can also employ this powerful class of methods to further improve the task of image compression, especially for image sizes for which we do not have carefully designed, hand-tuned compressors.
If we consider an image codec broadly as an analysis/synthesis problem with a bottleneck in the middle, then we can find a significant body of research aimed at teaching neural networks to discover compressive representations. Most of this work (e.g., denton2015; gregor2015) has been on synthesis of small images, often 32×32, in part due to CIFAR-10 (Krizhevsky09learningmultiple). Much of this work has focused on a class of neural networks known as autoencoders (Krizhevsky2011). However, standard autoencoders operate under a number of hard constraints that have so far made them infeasible as a drop-in replacement for standard image codecs. Some of these constraints are that variable-rate encoding is typically not possible (one network is trained per compression rate); that the visual quality of the output is hard to ensure; and that they are typically trained for a particular scale, being able to capture redundancy only at that scale.
We explore several different ways in which neural network-driven image compression can improve compression rates while allowing similar flexibility to modern codecs. To achieve this flexibility, the network architectures we discuss must meet all of the following requirements: the compression rate should be capable of being restricted to a prior bit budget; the compressor should be able to encode simpler patches more cheaply (analogously to modern codecs which may allocate more bits to areas of the image which contain important visual features); and the model should be able to learn from large amounts of existing imagery in order to optimize this compression process toward real-world data.
2 Related Work
The basic principles of using feed-forward neural networks for image compression have been known for some time (Jiang1999). In this context, networks can assist or even entirely take over many of the processes used as part of a traditional image compression pipeline: to learn more efficient frequency transforms, more effective quantization techniques, improved predictive coding, etc.
More recently, autoencoder architectures (Hinton2006) have become viable as a means of implementing end-to-end compression. A typical compressing autoencoder has three parts: (1) an encoder which consumes an input (e.g., a fixed-dimension image or patch) and transforms it into (2) a bottleneck representing the compressed data, which can then be transformed by (3) a decoder into something resembling the original input. These three elements are trained end-to-end, but during deployment the encoder and decoder are normally used independently.
The bottleneck is often simply a flat neural net layer, which allows the compression rate and visual fidelity of the encoded images to be controlled by adjusting the number of nodes in this layer before training. For some types of autoencoder, encoding the bottleneck as a simple bit vector can be beneficial (Krizhevsky2011). In neural net-based classification tasks, images are repeatedly downsampled through convolution and pooling operations, and the entire output of the network might be contained in just a single node. In the decoder half of an autoencoder, however, the network must proceed in the opposite direction and convert a short bit vector into a much larger image or image patch. When this upsampling process is spatially-aware, resembling a “backward convolution,” it is commonly referred to as deconvolution (Long2014).

Long short-term memory (LSTM) networks are a type of recurrent neural network (lstm:1997) that have proven very successful for tasks such as speech recognition (Graves2013) and machine translation (Sutskever2014). Many extensions to the standard LSTM model are possible, including explicitly incorporating spatial information, which leads to various types of convolutional LSTMs (Shi2015) that may be better suited for image compression. We experiment with such models and also try simpler recurrent architectures that use the residual error of one autoencoder as the input to another.
3 Variable Rate Compression Architectures
We start by describing a general neural network-based compression framework and then discuss the details of multiple instantiations of this architecture. Each subsection describes a different architecture that builds on the previous model and improves the compression results.
For each architecture, we will discuss a function $E$ that takes an image patch as input and produces an encoded representation. This representation is then processed by a binarization function $B$, which is the same across architectures and is discussed in Section 3.2. Finally, for each architecture we also consider a decoder function $D$, which takes the binary representation produced by $B$ and generates a reconstructed output patch. Taken together, these three components form an autoencoder, $D(B(E(x)))$, which is the basic building block for all of the compression networks.
For all architectures, an offset and scale are applied to the 8-bit RGB input images to give a range of values between -0.9 and 0.9. This range is compatible with the values that can be emitted by $\tanh$.
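For concreteness, the stated range corresponds to the linear map sketched below; the paper specifies only the output range, so the exact constants are implied rather than quoted.

```python
import numpy as np

def rescale_rgb(img_uint8):
    """Map 8-bit RGB values in [0, 255] to [-0.9, 0.9] (offset and scale)."""
    return img_uint8.astype(np.float32) / 255.0 * 1.8 - 0.9
```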
3.1 Image Compression Framework
The neural network architectures that we use share the same conceptual stages: an encoder network, followed by a quantizer, and a decoder network. In addition, our framework is tuned for image compression and supports variable compression rates without the need for retraining or for storing multiple encodings of the same image.
To make it possible to transmit incremental information, the design should take into account the fact that image decoding will be progressive. With this design goal in mind, we can consider architectures that are built on top of residuals with the goal of minimizing the residual error in the reconstruction as additional information becomes available to the decoder.
Formally, we chain multiple copies of a residual autoencoder, $F_t$, defined as:

$$F_t(r_{t-1}) = D_t(B(E_t(r_{t-1}))).$$
This chaining is explicit in the case of our feed-forward-only networks (Sections 3.3 and 3.5) and is implicit, through the recurrent structure, in the case of our LSTM networks (described in Sections 3.4 and 3.6). In all cases, we set $r_0$ to be equal to the original input patch, and then $r_t$ for $t > 0$ represents the residual error after $t$ stages. For non-LSTM architectures (described in Sections 3.3 and 3.5), $F_t$ has no memory, and so we only expect it to predict the residual itself. In this case, the full reconstruction is recovered by summing over all of the residuals, and each stage is penalized for the difference between its prediction and the previous residual:

$$r_t = F_t(r_{t-1}) - r_{t-1}.$$
On the other hand, LSTM-based architectures (described in Sections 3.4 and 3.6) do hold state, and so we expect them to predict the original image patch in each stage. Accordingly, we compute the residual relative to the original patch:

$$r_t = F_t(r_{t-1}) - r_0.$$
In both cases, the full, multi-stage network is trained by minimizing $\|r_t\|_2^2$ for $t = 1, \ldots, N$, where $N$ is the total number of residual autoencoders in the model.
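To make the chaining concrete, the following is a minimal NumPy sketch of the two residual-update rules above. The encoder, binarizer, and decoder here are toy stand-ins (random linear maps with a hard sign), not the trained networks described later; only the control flow of the residual chain and the per-stage losses is meant to be illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH_DIM, NUM_BITS, NUM_STAGES = 8 * 8 * 3, 32, 4

# Toy stand-ins for E_t, B, and D_t (weights shared across stages for brevity).
W_enc = rng.normal(scale=0.05, size=(NUM_BITS, PATCH_DIM))
W_dec = rng.normal(scale=0.05, size=(PATCH_DIM, NUM_BITS))

def E(r):                                   # encoder: map residual into [-1, 1]
    return np.tanh(W_enc @ r)

def B(z):                                   # binarizer: hard sign (training uses the
    return np.where(z >= 0.0, 1.0, -1.0)    # stochastic version of Section 3.2)

def D(bits):                                # decoder: map bits back to patch space
    return np.tanh(W_dec @ bits)

def F(r):                                   # one residual autoencoder stage
    return D(B(E(r)))

x = rng.uniform(-0.9, 0.9, size=PATCH_DIM)  # input patch, already rescaled

# Non-LSTM chaining: each stage predicts the previous residual,
#   r_t = F_t(r_{t-1}) - r_{t-1}
r, stage_losses = x, []
for t in range(1, NUM_STAGES + 1):
    r = F(r) - r
    stage_losses.append(float(np.mean(r ** 2)))   # ||r_t||^2, pixel-normalized

# LSTM-style chaining: each stage predicts the original patch,
#   r_t = F_t(r_{t-1}) - r_0
r, stage_losses_lstm = x, []
for t in range(1, NUM_STAGES + 1):
    r = F(r) - x
    stage_losses_lstm.append(float(np.mean(r ** 2)))

print(stage_losses, stage_losses_lstm)
```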
3.2 Binary Representation
In our networks, we employ a binarization technique first proposed by williams1992simple, and similar to the binarization used by Krizhevsky2011. This binarization has three benefits: (1) bit vectors are trivially serializable/deserializable for image transmission over the wire; (2) control of the network compression rate is achieved simply by putting constraints on the bit allowance; and (3) a binary bottleneck helps force the network to learn efficient representations compared to standard floating-point layers, which may have many redundant bit patterns that have no effect on the output.
The binarization process consists of two parts. The first part consists of generating the required number of outputs (equal to the desired number of output bits) in the continuous interval $[-1, 1]$. The second part involves taking this real-valued representation as input and producing a discrete output in the set $\{-1, 1\}$ for each value.
For the first step in the binarization process, we use a fully-connected layer with $\tanh$ activations. For the second part, following raiko:2015, one possible binarization $b(x)$ of $x \in [-1, 1]$ is defined as:

$$b(x) = x + \epsilon \in \{-1, 1\}, \qquad \epsilon = \begin{cases} 1 - x & \text{with probability } \tfrac{1+x}{2}, \\ -x - 1 & \text{with probability } \tfrac{1-x}{2}, \end{cases}$$
where $\epsilon$ corresponds to quantization noise. We will use the regularization provided by the randomized quantization to allow us to cleanly backpropagate gradients through this binarization layer. Therefore, the full binary encoder function is:

$$B(x) = b(\tanh(W^{\mathrm{bin}} x + b^{\mathrm{bin}})),$$

where $W^{\mathrm{bin}}$ and $b^{\mathrm{bin}}$ are the weights and bias of the fully-connected layer.
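As a concrete illustration of the randomized quantization, the sketch below implements the forward pass of $b(x)$ and the full encoder $B(x)$ in NumPy and checks that the quantizer is unbiased ($\mathbb{E}[b(x)] = x$). The straight-through treatment of gradients is only noted in a comment, and the function and variable names are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(x, rng):
    """Stochastic binarization b(x) for x in [-1, 1].

    Emits +1 with probability (1 + x) / 2 and -1 otherwise, so E[b(x)] = x.
    During backpropagation the expectation is used, i.e. gradients are passed
    straight through as if b were the identity.
    """
    prob_plus_one = (1.0 + x) / 2.0
    return np.where(rng.random(x.shape) < prob_plus_one, 1.0, -1.0)

def binary_encoder(x, W, b, rng):
    """Full binary encoder B(x): fully-connected layer with tanh, then b(.)."""
    return binarize(np.tanh(W @ x + b), rng)

# Sanity check that the quantizer is unbiased.
x = np.full(100_000, 0.3)
print(binarize(x, rng).mean())   # approximately 0.3

# Example use of the full encoder on a flattened 8x8 RGB patch.
patch = rng.uniform(-0.9, 0.9, size=8 * 8 * 3)
W, bias = rng.normal(scale=0.05, size=(32, patch.size)), np.zeros(32)
bits = binary_encoder(patch, W, bias, rng)
print(bits[:8])                  # entries are exactly -1.0 or +1.0
```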
3.3 Feed-Forward Fully-Connected Residual Encoder
In the simplest instantiation of our variable rate compression architecture, we set $E$ and $D$ to be composed of stacked fully-connected layers. In order to make the search for architectures more feasible, we decided to set the number of outputs in each fully-connected layer to a constant (512) and only used the $\tanh$ nonlinearity.
Given that $E$ and $D$ can be functions of the encoding stage number, and since the statistics of the residuals change when going from stage $t$ to stage $t+1$, we considered two distinct approaches: in the first, we share weights across all stages, while in the second, we learn distinct weights independently in each stage. The details of this architecture are given in Figure 1.
3.4 LSTM-based Compression
In this architecture, we explore the use of LSTM models for both the encoder and the decoder. In particular, both $E$ and $D$ consist of stacked LSTM layers.
Following the LSTM formulation and notation proposed by zaremba2014recurrent, we use superscripts to indicate the layer number and subscripts to indicate time steps. Let $h_t^l$ denote the hidden state of the $l$-th LSTM layer at time step $t$. We define $T_n^l$ to be an affine transform $T_n^l(x) = W^l x + b^l$. Finally, let $\odot$ denote element-wise multiplication, and let $h_t^0$ be the input to the first LSTM layer at time step $t$.
Using this notation, the LSTM architecture can be written succinctly, as proposed by graves2013generating:

$$\begin{pmatrix} i \\ f \\ o \\ j \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{4n}^{l} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix},$$

$$c_t^l = f \odot c_{t-1}^l + i \odot j,$$

$$h_t^l = o \odot \tanh(c_t^l),$$

where $\mathrm{sigm}(x) = (1 + e^{-x})^{-1}$ denotes the sigmoid function.
In these equations, $\mathrm{sigm}$ and $\tanh$ are applied element-wise. This alternate formulation of LSTM is useful because it reduces the number of separate operations needed to evaluate one step, which allows for an efficient implementation on GPU.
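The appeal of this fused formulation is that all four gate pre-activations come out of a single affine transform, i.e. one matrix multiply per layer per step. Below is a minimal NumPy sketch of one step of one layer; the dimensions and initialization are arbitrary choices for illustration, not the trained configuration.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of one LSTM layer using the fused T_{4n} formulation.

    x:      layer input at this time step (hidden state of the layer below)
    h_prev: this layer's hidden state from the previous time step
    W, b:   weights/bias of the single affine transform producing all 4n gates
    """
    z = W @ np.concatenate([x, h_prev]) + b          # T_{4n}([x; h_prev])
    i, f, o, j = np.split(z, 4)
    i, f, o, j = sigm(i), sigm(f), sigm(o), np.tanh(j)
    c = f * c_prev + i * j                           # new cell state
    h = o * np.tanh(c)                               # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 512, 512
W = rng.normal(scale=0.05, size=(4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)
x = rng.normal(size=n_in)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(x, h, c, W, b)
```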
For the encoder, we use one fully-connected layer followed by two stacked LSTM layers. The decoder has the opposite structure: two stacked LSTM layers followed by a fully-connected layer with a $\tanh$ nonlinearity that predicts RGB values (we omit this layer in the diagrams to reduce clutter). The exact architecture used in the experiments is given in Figure 2 (minus the RGB conversion).
3.5 Feed-Forward Convolutional/Deconvolutional Residual Encoder
Section 3.3 proposed a fully-connected residual autoencoder. We extend this architecture by replacing the fully-connected layers with convolution operators in the encoder $E$ and with deconvolution operators in the decoder $D$. The final layer of the decoder consists of a 1×1 convolution with three filters that converts the decoded representation into RGB values. We depict this architecture in Figure 3 (minus the RGB conversion).
The deconvolutional operator is defined as the transpose of the convolutional operator. Let $\circledast$ denote the convolutional operator with stride, and let $S_k$ denote the stride operator with stride factor $k$, i.e., $S_k(x)(i, j) = x(k \cdot i, k \cdot j)$ for a 2D multi-channel image $x$ and pixel coordinate $(i, j)$. Then $W \circledast_k x = S_k(W \circledast x)$. Note that the transpose of $S_k$ is the “inflation” operator $T_k$:

$$T_k(x)(i, j) = \begin{cases} x(i/k,\, j/k) & \text{if } i, j \text{ are multiples of } k, \\ 0 & \text{otherwise}. \end{cases}$$

Thus we can define the deconvolutional operator $\oslash$ with stride $k$ as follows:

$$W \oslash_k x = W \circledast T_k(x).$$
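The stride and inflation operators are simple subsampling and zero-stuffing maps. The sketch below illustrates them for a single-channel image and a single filter, with scipy.signal.correlate2d standing in for the learned multi-channel convolution; this is a toy rendering of the operator algebra above, not the paper's implementation.

```python
import numpy as np
from scipy.signal import correlate2d

def stride_op(x, k):
    """S_k: keep every k-th pixel, S_k(x)(i, j) = x(k*i, k*j)."""
    return x[::k, ::k]

def inflate_op(x, k):
    """T_k (transpose of S_k): place x(i, j) at (k*i, k*j), zeros elsewhere."""
    out = np.zeros((x.shape[0] * k, x.shape[1] * k), dtype=x.dtype)
    out[::k, ::k] = x
    return out

def conv_stride(W, x, k):
    """Strided convolution: W (*)_k x = S_k(W (*) x)."""
    return stride_op(correlate2d(x, W, mode="same"), k)

def deconv_stride(W, x, k):
    """Strided 'deconvolution': W (/)_k x = W (*) T_k(x)."""
    return correlate2d(inflate_op(x, k), W, mode="same")

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32))      # single-channel toy image
W = rng.normal(size=(3, 3))        # single 3x3 filter
y = conv_stride(W, x, 2)           # 16x16 feature map
x_up = deconv_stride(W, y, 2)      # back to 32x32 spatial size
print(y.shape, x_up.shape)
```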
3.6 Convolutional/Deconvolutional LSTM Compression
The final architecture combines the convolutional and deconvolutional operators with LSTM. We define convolutional LSTM by replacing the affine transformation $T_{4n}^l$ in the LSTM equations of Section 3.4 with convolutions plus bias. Then the transformation function for convolutional LSTM with stride $k$ is:

$$T_{4n}^l(x, h) = W_1 \circledast_k x + W_2 \circledast h + b.$$
The subscript $n$ in $T_{4n}^l$ now refers to the depth (number of features) in the output feature maps. Note that the second convolution term represents the recurrent relation of the convolutional LSTM, so both its input and output must have the same size. Therefore, when a convolutional LSTM has a stride greater than one, the stride is only applied to the first convolution term, while the second term is always computed with a stride of one. Finally, to build the encoder for this architecture, we replace the second and third convolutional layers from Figure 3 with convolutional LSTM layers.
For the decoder, we cannot replace all convolutional operations with deconvolution, because the input to a deconvolution often has a different spatial dimension than its output. For the purposes of defining a deconvolutional LSTM, $T_{4n}^l$ becomes:

$$T_{4n}^l(x, h) = W_d \oslash_k x + W_c \circledast h + b.$$
Here we use the subscripts $c$ and $d$ to differentiate between the weights associated with the convolution and deconvolution operations. To construct the deconvolutional LSTM decoder, we replace the second and third deconvolutional layers of the deconvolutional decoder from Figure 3 with deconvolutional LSTM.
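The stride asymmetry described above can be made explicit in code. The sketch below shows the fused gate pre-activation for a (de)convolutional LSTM on a single channel, with correlate2d again standing in for the learned convolutions; all helper names are illustrative and the multi-channel, four-gate structure is omitted for brevity.

```python
import numpy as np
from scipy.signal import correlate2d

def conv(W, x, k=1):
    """Stride-k convolution: 'same'-size correlation, subsampled by factor k."""
    return correlate2d(x, W, mode="same")[::k, ::k]

def deconv(W, x, k=1):
    """Stride-k deconvolution: zero-stuff x by factor k, then convolve."""
    up = np.zeros((x.shape[0] * k, x.shape[1] * k))
    up[::k, ::k] = x
    return correlate2d(up, W, mode="same")

def conv_lstm_transform(x, h_prev, W1, W2, b, k):
    # Stride applies only to the input term; the recurrent term must preserve
    # the hidden state's spatial size, so it is computed with stride 1.
    return conv(W1, x, k) + conv(W2, h_prev) + b

def deconv_lstm_transform(x, h_prev, W_d, W_c, b, k):
    # Decoder counterpart: the input term upsamples via deconvolution, while
    # the recurrent term remains an ordinary stride-1 convolution.
    return deconv(W_d, x, k) + conv(W_c, h_prev) + b
```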
3.7 Dynamic Bit Assignment
For the non-convolutional approaches presented here, it is natural to assign a varying number of bits per patch by allowing a varying number of iterations of the encoder. This could be determined by a target quality metric (e.g., PSNR). While not as natural, in the case of the convolutional approaches, a similar method may also be employed. The input image needs to be split into patches, and each patch processed independently, thereby allowing a different number of bits per region. However, this approach has disadvantages that will be discussed at the end of this paper.
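One way to realize this dynamic assignment is sketched below: keep running encoder iterations for a patch only until a target quality is met. The callables encode_step/decode_steps and the PSNR threshold are placeholders standing in for a trained model's stages, not part of the architecture above.

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio between two same-shaped arrays."""
    mse = np.mean((x - y) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def encode_progressively(patch, encode_step, decode_steps,
                         target_psnr=30.0, max_steps=16):
    """Emit bits for one patch until the reconstruction reaches target_psnr.

    encode_step(patch, step) -> bits produced by this iteration
    decode_steps(list_of_bits) -> reconstruction from all bits emitted so far
    Both callables are placeholders for a trained (recurrent or feed-forward)
    compressor; they abstract away whether the model is fed residuals or
    carries internal state.
    """
    bits = []
    for step in range(max_steps):
        bits.append(encode_step(patch, step))
        if psnr(patch, decode_steps(bits)) >= target_psnr:
            break
    return bits   # variable-length code: harder patches get more iterations
```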
4 Experiments & Analysis
In order to train the various neural network configurations, we used the Adam algorithm proposed by kingma:2014, experimenting with a range of learning rates. The loss was normalized by the number of pixels in the patch and also by the total number of time steps (i.e., the number of iterations unrolled) needed to fully encode the patch. We employed no perceptual weighting to improve the compression for evaluation under the SSIM measure; during training we used the unmodified $L_2$ error measure.
We experimented with the number of steps needed to encode each patch, varying this from 8 to 16. For the fully connected networks, we chose to use 8 bits per step for an 8×8 patch, allowing us to fine-tune the compression rate in increments of 8 bits. When scaled up to a 32×32 patch size, this allowed us to control the compression in increments of 128 bits.
For the convolutional/deconvolutional networks, the encoders reduce the 32×32 input patch down to 8×8 through convolution operations with strides. We experimented with a binary output of 2 bits per pixel at this resolution, yielding a tunable compression rate with increments of 16 bytes per 32×32 block.
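For reference, the per-iteration increments quoted above follow directly from the patch geometry; a worked check:

```python
# Fully-connected networks: 8 bits per iteration per 8x8 patch.
patches_per_block = (32 // 8) ** 2            # 16 patches in a 32x32 block
fc_bits_per_iter = 8 * patches_per_block      # = 128 bits per 32x32 block

# Conv/deconv networks: encoder output is 8x8 spatially at 2 bits per pixel.
conv_bits_per_iter = 8 * 8 * 2                # = 128 bits
conv_bytes_per_iter = conv_bits_per_iter // 8 # = 16 bytes per 32x32 block

print(fc_bits_per_iter, conv_bytes_per_iter)  # -> 128 16
```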
4.2 Evaluation Protocol and Metrics
Evaluating image compression algorithms is a non-trivial task. The metric commonly used in this context is the peak signal-to-noise ratio (PSNR); however, PSNR is biased toward algorithms that have been tuned to minimize $L_2$ loss. This would not be a fair comparison against methods like JPEG that have been tuned to minimize a form of perceptual loss.
In our evaluation protocol we instead employ the Structural Similarity Index (SSIM), a popular perceptual similarity measure proposed by ssim. Since we are evaluating compression performance on small 32×32 images, we do not smooth the images (a typical preprocess for SSIM). In addition, since we are interested in quantifying how well local details are preserved, we split the images into 8×8 patches and compute the SSIM on each patch and on each color channel independently. The final score is the average SSIM over all patches and channels.
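A sketch of this protocol using scikit-image's structural_similarity is shown below. The use of scikit-image is our own assumption for illustration; the 8×8 patch grid, per-channel scoring, and absence of smoothing follow the description above.

```python
import numpy as np
from skimage.metrics import structural_similarity

def patchwise_ssim(reference, distorted, patch=8):
    """Average SSIM over non-overlapping 8x8 patches and color channels.

    reference, distorted: float arrays of shape (32, 32, 3) with values in [0, 1].
    """
    scores = []
    for y in range(0, reference.shape[0], patch):
        for x in range(0, reference.shape[1], patch):
            for c in range(reference.shape[2]):
                ref = reference[y:y + patch, x:x + patch, c]
                dst = distorted[y:y + patch, x:x + patch, c]
                scores.append(structural_similarity(ref, dst, data_range=1.0))
    return float(np.mean(scores))
```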
[Table 1: average SSIM on the 32×32 benchmark, comparing header-less JPEG, header-less JPEG 2000, WebP, and our LSTM-based compressors at the two target sizes; the Conv/Deconv LSTM Compressor (32×32) scores 0.77 and 0.87.]
When analyzing the results, a higher score implies a better reconstruction, with a score of 1.0 representing a perfect reconstruction. The lowest possible score is 0.0. Note that while there are other metrics (e.g., psnrhvsm) which emulate the human visual system better than SSIM, we chose to use SSIM here due to its ubiquity and ease of comparison with previous work.
4.3 32×32 Benchmark
Our 32×32 benchmark dataset contains 216 million random color images collected from the public internet. To be included in the dataset, each image must originally have more than 32 pixels on both axes. Qualified images were then downsampled to 32×32, losing their original aspect ratios. This downsampling eliminates pre-existing compression artifacts for most images. The final 32×32 images were then stored losslessly (as PNG) before being used for training and testing. For training the LSTM models, one portion of the images was used; the remaining images were set aside for evaluation. For evaluating the image codecs, we use a subset of this data containing 100k random images.
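The dataset preparation described here amounts to a filter-resize-save pipeline. A sketch using Pillow follows; the choice of Pillow and the directory paths are our own assumptions for illustration, not details from the paper.

```python
from pathlib import Path
from PIL import Image

def prepare_thumbnail(src_path: Path, dst_dir: Path) -> bool:
    """Downsample one source image to a 32x32 PNG; skip images that are too small."""
    img = Image.open(src_path).convert("RGB")
    if img.width <= 32 or img.height <= 32:
        return False                        # must originally exceed 32 px on both axes
    img = img.resize((32, 32))              # aspect ratio is intentionally not preserved
    img.save(dst_dir / (src_path.stem + ".png"), format="PNG")  # lossless storage
    return True

# Example usage (hypothetical paths):
# for p in Path("raw_images").glob("*.jpg"):
#     prepare_thumbnail(p, Path("benchmark_32x32"))
```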
Table 1 summarizes the results on the 32×32 benchmark, comparing our two LSTM approaches to two JPEG codecs and to WebP. To avoid unfairly penalizing the codecs due to the unavoidable cost of their file headers, we exclude the header size from all metrics. Note also that since these standard codecs cannot be tuned to an exact byte budget (e.g., 64 bytes excluding the file header), we search for the encoder quality setting that leads to a file whose size is as close as possible to, but never less than, the target size. On average, this leads to each JPEG and WebP image consuming slightly more space than we allow for the LSTM models.
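This quality search can be expressed as a simple scan over the codec's quality settings, keeping the smallest file that still meets or exceeds the byte budget. In the sketch below, encode_fn is a placeholder for an actual JPEG/WebP encoder call and the header size is treated as a fixed, known quantity.

```python
def pick_quality(encode_fn, target_bytes, header_bytes=0, qualities=range(1, 101)):
    """Return (quality, size) for the smallest encoding that is >= target_bytes.

    encode_fn(quality) -> compressed bytes for the image at that quality setting.
    header_bytes is subtracted so fixed file headers are not counted against
    the budget.
    """
    best = None
    for q in qualities:
        size = len(encode_fn(q)) - header_bytes
        if size >= target_bytes and (best is None or size < best[1]):
            best = (q, size)
    return best
```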
These 32×32 images contain considerable detail that is perceptually relevant. As can be seen in Figure 4, compressing these images without destroying salient visual information or hallucinating false details is challenging. At these very low bitrates and spatial resolution, JPEG block artifacts become extremely prominent, and WebP either introduces blocking or overly blurs the image depending on the strength of the internal filter. Color smearing artifacts due to the codecs’ default (4:2:0) chroma subsampling are also clearly visible.
Compared to JPEG, the non-convolutional LSTM model slightly reduces inter-block boundaries on some images but can also lead to increased color bleeding (e.g., on mandrill as shown in Figure 4). Furthermore, its visual quality, as measured by SSIM and shown in Figure 5, never exceeds that of JPEG on average. This motivates the (de)convolutional LSTM model, which eliminates block artifacts while avoiding excessive smoothing. It strikes the best balance between preserving real detail and avoiding color smearing, false gradients, and hallucinated detail not present in the original image.
Note that the (de)convolutional LSTM model exhibits perceptual quality levels that are equal to or better than both JPEG and WebP at a lower average bitrate. We see this improvement despite the fact that, unlike JPEG and WebP, the LSTMs do not perform chroma subsampling as a preprocess. However, at the JPEG quality levels used in Figure 4, disabling subsampling (i.e., using 4:4:4 encoding) leads to a costly increase in JPEG's bitrate. This means that if we desired to preserve chroma fidelity, we would need to drastically reduce JPEG encoding quality in order to produce 4:4:4 JPEGs at a bitrate comparable to the LSTM models.
In terms of coding efficiency, we took an autoencoder architecture (one iteration of the model presented in Section 3.5) with a given bit budget of either 64 or 128 bytes, and compared its SSIM against the (de)convolutional LSTM encoder at these targets. In both cases, the LSTM model produces SSIM values equivalent to those of the autoencoder, even though the LSTM model is the more flexible of the two.
5 Conclusion & Future Work
We describe various methods for variable-length encoding of image patches using neural networks, and demonstrate that for the given benchmark, the fully-connected LSTM model can perform on par with JPEG, while the convolutional/deconvolutional LSTM model is able to significantly outperform JPEG on the SSIM perceptual metric.
While our current approach gives favorable results versus modern codecs on small images, codecs that include an entropy coder element tend to improve (in a bits-per-pixel sense) with greater resolution, meaning that by choosing an arbitrarily large test image it is always possible to defeat an approach like that described in this work. Therefore, an obvious need is to extend the current work to function on arbitrarily large images, taking advantage of spatial redundancy in images in a manner similar to entropy coding.
Although we presented a solution for dynamic bit assignment in the convolutional case, it is not a fully satisfactory solution as it has the potential to introduce encoding artifacts at patch boundaries. Another topic for future work is determining a dynamic bit assignment algorithm that is compatible with the convolutional methods we present, while not creating such artifacts.
The algorithms that we present may also be extended to work on video, which we believe to be the next grand challenge for neural network-based compression.
6 Appendix: Bitwise Encoding & Decoding
In order to better understand the network architecture proposed in Section 3.4, we initially limited it in terms of its capacity (bottleneck size) and target (complexity of reconstruction). Namely, we restricted the output per step to one bit, and trained the network to compress grayscale images. We took this simpler network and encoded a popular image of a cat one bit at a time. Figure 6 shows the effect of the first four steps of this encoding.
Figure 7 depicts the behavior of additional bits on four 8×8 blocks from the cat image using the same network. In this zoomed-in version it is apparent that the network first learns to differentiate between “dark” and “light” patches using the first bit. Given an additional bit, the network is able to introduce new solid shades of gray. One more bit starts introducing simple gradients, which are further refined with a fourth bit, and so on.