Image compression takes advantage of the redundancy of information in an image to reduce its size. Compression reduces both the storage and network bandwidth required for the image, and for large datasets the savings can be substantial. While there are lossless compression formats such as PNG, larger reductions are obtained with lossy formats such as JPEG, JPEG2000 or BPG. These lossy formats are handcrafted (not learned). While learned image compression from data using neural networks is not new [29, 18, 26], there has recently been a resurgence of deep learning based techniques for solving this problem [4, 5, 27, 34, 23]. These compression schemes often consist of an encoder-decoder network, with a loss function that trades off distortion against the bit rate. The encoder creates a latent embedding from the image. With this embedding as input, a combination of a quantizer and an entropy coder generates a compact bit-stream for storage. For decompression, the entropy coding is reversed to produce an embedding, which is then fed into a decoder to give a reconstructed approximate image as output.
To evaluate the quality of the reconstructed image with respect to the original, measures such as structural similarity (SSIM) or PSNR (a function of the mean squared error, MSE) have been proposed in the past. In recent work, multi-scale structural similarity (MS-SSIM) has become more popular. PSNR and MS-SSIM were originally formulated as perceptual metrics but do not seem to completely capture certain types of distortions created by learned compression methods.
The choice of whether PSNR or MS-SSIM is used for evaluation dictates which loss function is used, since optimizing for the evaluation metric ensures that the technique achieves a high number on it and a lower number on the other metric. Several claims have been made that such approaches are better than engineered compression formats due to their higher MS-SSIM, but as Figure 1 shows, this is misleading. From left to right we have four different techniques ranked in descending order of MS-SSIM values. It is clear that the first two images have many more artifacts than the last two (the text is not readable in the first two images). Clearly, PSNR and MS-SSIM scores do not reflect image quality or human perception well. While such scores may be reasonable for measuring engineered codecs, which cannot directly optimize these measures, deep learning techniques can directly optimize such metrics, leading to this situation. Recent work experimentally shows (using human evaluation) that this problem is not confined to one image but occurs across different image compression datasets.
This paper proposes deep perceptual compression (DPC), a deep learning approach to image compression which uses the Learned Perceptual Image Patch Similarity (LPIPS) metric (a deep perceptual metric) as a loss function. Zhang et al. use a CNN to compute this metric; since a CNN computes a function, we use their CNN, which was trained on human judgments of distorted images, to compute the deep perceptual loss. To regularize the network we combine the deep perceptual metric with an MS-SSIM loss in a multi-task learning setup (Figure 3) and train the network end-to-end.
To minimize checkerboard patterns in the reconstructed images, we set up the deconvolution up-sampling so that kernel sizes are divisible by strides, avoiding the overlap issue that causes these patterns.
We show that DPC is better (as judged by humans) than two deep learning techniques [27, 4] as well as JPEG-2000 at a number of bit-rates, through experiments on several standard compression datasets.¹ DPC is better than BPG at some bit-rates while BPG is better at others. Since humans are more sensitive to certain compression artifacts than to others, as an alternative to human judgments we take an object detector (ResNet-101) pre-trained on the COCO dataset and run it on the images output by each compression algorithm. Absent fine-tuning, all algorithms cause some degradation in the object detector's performance, but DPC suffers the least degradation, while at higher bit-rates BPG comes close.

¹Since we use human judgments, and therefore require images from each technique for all datasets, we were constrained to using deep learning techniques for which the researchers made models available, and for this we are thankful.
2 Related Work
We discuss some related work on learned image compression and perceptual image quality. Many compression models use autoencoders. One difference between models is how the entropy of the data is learned. The entropy model is jointly trained with the encoder and decoder with a rate-distortion trade-off as the loss function, i.e. R + βD (where R is the rate and D is the distortion). To learn an optimal R for a particular D, some have used a fully factorized entropy model [4, 42]; others use context in the quantized space to improve compression using auto-regressive approaches [27, 34, 24]; still others jointly use factorized and auto-regressive approaches to learn the entropy.
Apart from innovations in entropy modeling, some papers have improved the encoder-decoder architectures: some use GDN activations instead of ReLU, while others adopt ideas from super-resolution work, using pixel-shuffle in the decoder to better reconstruct the image. [37, 34, 2] use adversarial training. On metrics such as MS-SSIM or PSNR these methods do better than JPEG and JPEG2000, and some do better than BPG. However, as we discussed in the introduction, these metrics are misleading.
A network trained for the classification task on ImageNet may be used as a perceptual loss function. In other contexts, this approach has been used for neural style transfer and for conditional image synthesis [10, 12]. Recently, Zhang et al. [48, 11] investigated the effectiveness of these deep CNNs as a perceptual similarity metric. They first show humans a triplet of images, comprising two distorted versions of an image patch and the original patch, and ask which distorted patch is closer to the original. They then create a network in which the feature responses of standard CNN architectures such as AlexNet or VGG-16 (pre-trained on ImageNet) are fed to layers which learn to output distance metrics reflecting these low-level human judgments.
In our work we show that using the deep perceptual metric as a loss function leads to improved image compression results as judged by humans. We do need to regularize this with an MS-SSIM loss.
3 Deep Perceptual Compression
3.1 Compression Model
We adapt the architecture of Mentzer et al. with certain essential modifications to optimize the deep perceptual loss. We explicitly keep certain components (such as quantization and entropy coding) of the original approach to isolate the effect of the proposed deep perceptual loss. An auto-encoder framework is used, consisting of stacked residual blocks for the encoder and decoder. The model comprises an Encoder E, a Decoder D, a Quantizer Q and an Entropy coding model P. The Encoder and Decoder are deep residual neural networks with learnable parameters; the Quantizer, used in the bottleneck for the lossy data transformation, has L centers; and the entropy model is an auto-regressive 3D pixel-CNN with its own learnable parameters. All these modules are trained and optimized jointly on a rate-distortion loss. Please see Fig. 3 for a high-level illustration of the model architecture.
The Encoder E takes an image x and computes a floating-point latent representation z = E(x). Next, the Quantizer Q discretizes z to ẑ, which lives in a bounded discrete space with L centers; note that this is a lossy transformation. In this work we adopt the differentiable soft-quantization idea from [42, 1] and use nearest-neighbor assignments, ẑ_i = argmin_j ||z_i − c_j|| over the centers {c_1, …, c_L}.
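The nearest-neighbor assignment can be sketched as follows. This is a minimal illustration with hypothetical scalar centers (the paper's center values and count are not given here); at training time the original work uses a differentiable soft relaxation of the argmin, whereas this shows only the hard forward pass.

```python
import numpy as np

def quantize(z, centers):
    """Nearest-neighbor assignment of each latent coefficient to one of the
    L quantization centers. Returns the quantized values and the integer
    index map that the entropy coder compresses."""
    dists = np.abs(z[..., None] - centers)  # broadcast to shape (..., L)
    idx = np.argmin(dists, axis=-1)
    return centers[idx], idx

# Illustrative example: 6 evenly spaced centers in [-1, 1]
z = np.array([0.12, -0.9, 2.3, 0.55])
z_hat, idx = quantize(z, np.linspace(-1.0, 1.0, 6))
# z_hat ≈ [0.2, -1.0, 1.0, 0.6]
```

Out-of-range values (like 2.3 above) simply clamp to the nearest boundary center, which is why the quantized space is bounded.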
Entropy model: The entropy model P learns the probability distribution of the quantized coefficients ẑ, which are then losslessly encoded into a binary bit-stream using arithmetic coding. In this work (as in the original), a variant of an auto-regressive model called PixelCNN is used, which models each quantized value conditioned on the previously seen quantized values. The softmax cross-entropy loss is used as the coding cost, i.e. the rate R.
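The connection between the entropy model and the rate term can be made concrete: arithmetic coding approaches the information-theoretic bound of −Σ log₂ p bits, where p is the model's probability for each symbol actually coded. A minimal sketch (the per-step probability table here stands in for the PixelCNN's conditional predictions):

```python
import numpy as np

def coding_cost_bits(probs, symbols):
    """Rate R: cost in bits of encoding `symbols` under the autoregressive
    model's per-step probabilities `probs` (shape [T, L], row t giving
    p(q_t | q_<t) over the L centers). Arithmetic coding approaches this
    -sum(log2 p) bound; training minimizes the equivalent cross-entropy."""
    p = probs[np.arange(len(symbols)), symbols]
    return -np.sum(np.log2(p))
```

A uniform model over L = 4 centers costs exactly 2 bits per symbol; a model that concentrates probability on the true symbols costs less, which is what the jointly trained entropy model learns to do.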
Importance map: We also use variable bit-allocation, since the information content of an image varies greatly across spatial locations. Specifically, we take the last layer of the Encoder and add a single-channel 2D output, which is then expanded into a mask m with the same dimensionality as z. This mask is point-wise multiplied with z, i.e. z ⊙ m, to give a spatially adaptive quantized feature map. Please refer to the original work for the exact rules used to expand the map.
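One common way to expand such a single-channel map into a channel mask, sketched below, is to let each spatial location keep a number of latent channels proportional to its importance value. This is an illustration of the variable bit-allocation idea; the exact expansion rules in the original work differ in detail.

```python
import numpy as np

def apply_importance_map(z, m):
    """Expand a single-channel importance map m (values in [0, 1], shape
    H x W) into a binary mask over the C channels of the latent z (shape
    C x H x W): location (h, w) keeps its first ceil(m[h, w] * C) channels
    and zeroes the rest, so complex regions get more coefficients."""
    C = z.shape[0]
    keep = np.ceil(m * C).astype(int)                 # channels kept per location
    ch = np.arange(C)[:, None, None]                  # shape (C, 1, 1)
    mask = (ch < keep[None, :, :]).astype(z.dtype)    # shape (C, H, W)
    return z * mask
```

Zeroed coefficients are highly predictable for the entropy model, so masked regions cost almost no bits, which is what makes the allocation spatially adaptive.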
Finally, the Decoder reconstructs the quantized latent representation back into an image, i.e. x̂ = D(ẑ). The goal is to learn a compact quantized latent representation such that the distortion between the original image x and the reconstruction x̂ is minimized. This is achieved using a rate-distortion loss, i.e. a weighted sum of the distortion d(x, x̂) and the rate R. In Mentzer et al., MS-SSIM is used for measuring the distortion between images.
Checkerboard patterns: Image generation models with deconvolution-based up-sampling are known to generate checkerboard patterns, depending on the loss function, and a number of proposals have been made to address them [36, 30]. To minimize checkerboard patterns in the reconstructed images, we set up the deconvolution up-sampling so that kernel sizes are divisible by strides, avoiding the overlap issue; specifically, we use kernels of size 2 and stride 2 in the Decoder. We refer the reader to [36, 30] for more on the checkerboard pattern problem.
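The overlap issue can be seen by counting how many kernel taps contribute to each output position of a transposed convolution. The sketch below does this for the 1-D case (no padding); it is a diagnostic illustration, not part of the compression model.

```python
import numpy as np

def upsample_coverage(n_in, kernel, stride):
    """Count how many kernel taps contribute to each output position of a
    1-D transposed convolution with no padding. Uneven counts produce the
    checkerboard pattern; a kernel size divisible by the stride gives
    uniform coverage."""
    n_out = (n_in - 1) * stride + kernel
    cover = np.zeros(n_out, dtype=int)
    for i in range(n_in):
        cover[i * stride : i * stride + kernel] += 1
    return cover
```

With kernel 2 and stride 2 (our Decoder setting) every output position receives exactly one tap, while kernel 3 and stride 2 alternates between one and two taps, producing the periodic intensity variation seen as a checkerboard.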
The combination of the above modules matches the state-of-the-art compression performance on the Kodak dataset on the MS-SSIM metric (circa CVPR 2018). Newer work since then has shown some improvements on MS-SSIM or PSNR [5, 28, 23]. However, as we pointed out in the introduction, MS-SSIM is not a good evaluation metric; we instead use human judgments, and these newer models are not available for generating images across several datasets for human evaluation. We now discuss the internals of the perceptual loss and its effects on lossy image compression.
3.2 Deep Perceptual Loss
Zhang et al. show the utility of deep CNNs for measuring perceptual similarity. It has been observed that comparing internal activations of deep CNNs such as VGG-16 or AlexNet acts as a better perceptual similarity metric than MS-SSIM or PSNR. We use the deep perceptual metric both for training and as one of the evaluation metrics on test data. We make use of activations from five layers, one after each block of the VGG-16 architecture with batch normalization.
A feed-forward pass is performed on the VGG-16 for both the original image x and the reconstructed image x̂. Let L be the set of layers used for the loss calculation (five in our setup), and let F denote the function computing the network's activations for an input image. F(x) and F(x̂) return two stacks of feature activations over all layers in L. The deep perceptual loss is then computed as follows: the activations are unit-normalized in the channel dimension, giving, for each layer l, maps of spatial dimensions H_l × W_l with C_l channels; these are scaled channel-wise by multiplying with a learned weight vector w_l; the squared distance between the scaled activations of x and x̂ is computed and averaged over the spatial dimensions; finally, a channel-wise and layer-wise sum is taken, yielding the deep perceptual loss.
Equation 1 and Figure 4 summarize the deep perceptual loss computation. Note that the weights of F are learned for image classification on the ImageNet dataset and are kept fixed, while the linear weights w_l are learned on top of F on the Berkeley-Adobe Perceptual Patch Similarity Dataset. We use the trained model provided by Zhang et al. to compute the loss.
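The computation described above can be sketched as follows, given precomputed per-layer activations. This is a minimal numpy illustration of the LPIPS-style distance, not the trained model itself; in practice the activations come from VGG-16 and the weights w_l from the model released by Zhang et al.

```python
import numpy as np

def deep_perceptual_distance(feats_x, feats_y, weights):
    """LPIPS-style distance between two images from their per-layer CNN
    activations. feats_x, feats_y: lists of arrays shaped (C_l, H_l, W_l);
    weights: list of per-channel vectors w_l of length C_l. For each layer,
    activations are unit-normalized along channels, scaled by w_l,
    squared-differenced, averaged over space and summed over channels;
    the per-layer results are summed."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        nx = fx / (np.linalg.norm(fx, axis=0, keepdims=True) + 1e-10)
        ny = fy / (np.linalg.norm(fy, axis=0, keepdims=True) + 1e-10)
        diff2 = (w[:, None, None] * (nx - ny)) ** 2
        total += diff2.mean(axis=(1, 2)).sum()  # spatial mean, channel sum
    return total
```

Identical inputs give a distance of exactly zero, and the learned per-channel weights let the metric emphasize the feature channels that best predict human similarity judgments.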
For training, we use MS-SSIM as a regularizer, so the final distortion loss is a weighted combination of the deep perceptual loss and the MS-SSIM loss; in practice we weight the two equally in our training setup.
3.3 Training Details
We make use of the Adam optimizer with a step-decay learning-rate schedule: the learning rate is decayed by a constant factor every two epochs. The overall loss function is a weighted sum of the rate loss R and the distortion loss D, which for DPC is a linear combination of the deep perceptual loss (see 3.2) and MS-SSIM (weighted equally). Further, similar to the original work, we clip the rate term so that the model converges to a target bit-rate. The training is done on the training set of the ImageNet dataset from the Large Scale Visual Recognition Challenge 2012 (ILSVRC2012); with this setup we observe convergence in six epochs.
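The step-decay schedule can be written compactly. The initial learning rate and decay factor are hyper-parameters of our setup (the values used in the illustrative call below are placeholders, not the paper's settings):

```python
def step_decay_lr(initial_lr, epoch, decay_factor, every=2):
    """Step decay: multiply the learning rate by `decay_factor` once every
    `every` epochs (our setup decays every two epochs)."""
    return initial_lr * decay_factor ** (epoch // every)

# Placeholder values for illustration only:
lrs = [step_decay_lr(1e-3, e, 0.5) for e in range(6)]
```

Because the decay is applied as a discrete step rather than continuously, the optimizer sees a constant rate within each two-epoch window, which keeps training stable between decays.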
By varying model hyper-parameters such as the number of channels in the bottleneck, the weight of the distortion loss, and the target bit-rate, we obtain multiple models across a range of bits-per-pixel (bpp) values. Similarly, we reproduce the models of [27, 4] at different bpp values.²

²Note that in one case we used an MS-SSIM loss instead of the MSE loss used in the original paper, but this does not change the general conclusions of this paper.
We extensively evaluate image compression techniques for human perceptual similarity using a two-alternative forced choice (2AFC) approach. 2AFC is a standard way of performing perceptual similarity evaluation and has been used previously for evaluating super-resolution techniques. The study is conducted on the Amazon MTurk platform, where an evaluator is shown the original image along with the compressed images from two techniques, one on each side. Evaluators are asked to choose the image that is more similar to the original. We show the entire image along with a magnifying glass synchronized across all three images to observe finer details; this gives evaluators the global context of the whole image while providing a quick way to assess local regions. No time limit was placed on this human experiment.
In this setup, we compare the proposed DPC, two engineered techniques (JPEG-2000, BPG) and two learning-based techniques (Mentzer et al., Ballé et al.) by choosing all possible combinations (ten pairs in total). Further, we do this at four different compression levels, i.e. bits-per-pixel (bpp) values. The study is conducted on four standard datasets: Kodak, Urban100, Set14 and Set5.
Combining all pairs of the five methods over the four bpp values and all images in the four datasets yields a large set of comparison pairs; for each such pair we obtain multiple independent evaluations, each a separate HIT.
4.1 Image Compression Results
For each test dataset, we compress all the images using each model. For each model and each bpp value we compute, for each metric, an average over all images. Doing this across bpp values gives multiple points on the deep-perceptual-metric vs. bpp, MS-SSIM vs. bpp and PSNR vs. bpp curves. We interpolate between such points but do not extrapolate outside the measured bpp range. Note that for the deep perceptual metric a lower value is better, while for MS-SSIM and PSNR higher is better.
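The interpolation rule (interpolate between measured points, never extrapolate) can be sketched with `np.interp`; the bpp and metric values below are placeholders for illustration.

```python
import numpy as np

def metric_at_bpp(bpps, metric_vals, query_bpp):
    """Linearly interpolate a metric-vs-bpp curve at query_bpp.
    Returns None outside the measured bpp range (we never extrapolate)."""
    bpps = np.asarray(bpps, dtype=float)
    vals = np.asarray(metric_vals, dtype=float)
    order = np.argsort(bpps)                # np.interp needs increasing x
    bpps, vals = bpps[order], vals[order]
    if not (bpps[0] <= query_bpp <= bpps[-1]):
        return None
    return float(np.interp(query_bpp, bpps, vals))
```

Returning None outside the measured range makes the no-extrapolation policy explicit, so curves for models trained at different bpp ranges are only compared where both have real measurements.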
For human evaluation, for each image and a given bpp value we have pair-wise votes of the form method-A vs. method-B. Since we have all possible pairs of the five methods under consideration, we aggregate these votes and declare the method with the maximum votes the best for the given image (at that bpp value). In Figures 5, 6, 7 and 8, we show the number of images (y-axis) for which each method performs best.
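The aggregation step amounts to counting wins across all pairwise comparisons for one image at one bpp value, as in this minimal sketch (method names are placeholders):

```python
from collections import Counter

def best_method(pairwise_votes):
    """Aggregate 2AFC pairwise votes for one image at one bpp value.
    pairwise_votes: iterable of (method_a, method_b, winner) tuples.
    Returns the method with the most wins across all pairs."""
    wins = Counter(winner for _, _, winner in pairwise_votes)
    return wins.most_common(1)[0][0]

# Placeholder example: "DPC" wins two of its three pairwise comparisons
votes = [("DPC", "BPG", "DPC"), ("DPC", "JPEG2000", "DPC"),
         ("BPG", "JPEG2000", "BPG"), ("DPC", "BPG", "BPG")]
```

Counting wins over all pairs, rather than looking at any single pairwise comparison, gives a per-image ranking that is consistent across the five methods.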
The comparisons for Kodak are shown in Figure 5. We observe that Mentzer et al., despite having the highest MS-SSIM score at all bpp values, ranks poorly in the human study. Ballé et al. obtains a higher MS-SSIM score than DPC, BPG and JPEG-2000 above a certain bpp, despite which it ranks worst among the five. DPC, with the lowest (best) deep perceptual metric scores, performs better than Mentzer et al., Ballé et al. and JPEG-2000 at all bit-rates; its performance is comparable to BPG, and better at one bpp value. PSNR plots on all four datasets are shown in Figure 9; it can be observed that the conventional methods (JPEG-2000, BPG) have significantly higher PSNR scores, although DPC outperforms JPEG-2000 and is comparable to BPG in the human study. These observations show that both PSNR and MS-SSIM are inadequate metrics for judging perceptual similarity for learned compression techniques.
4.2 Object Detection Results
While the compressed images need to be perceptually good, they should also remain useful for subsequent computer vision tasks. It was observed by Dwibedi et al. that for object detectors such as Faster-RCNN, region-based consistency is important and pixel-level artifacts can significantly affect performance. In this section, we evaluate the different compression techniques on the subsequent task of object detection on the MS-COCO validation dataset.
We use a pre-trained Faster-RCNN model with a ResNet-101 based backbone for its relatively high average precision and its ability to detect smaller objects. The performance is measured using average precision (AP), averaged over multiple IoU thresholds (the minimum IoU to consider a detection a positive match). We use AP@[.5:.95], which corresponds to the average AP for IoU thresholds from 0.5 to 0.95 with a step size of 0.05. With the original MS-COCO images, this model attains its reference AP. For each compression method, we compress and reconstruct the images at four different bit-rate values (the same values as used for human evaluation) and then evaluate them for object detection. The performance of the competing compression methods is reported in Figure 10. It can be clearly seen that at low bit-rates the proposed DPC significantly outperforms the competing methods, while at higher bit-rates its performance is very close to that of BPG. Please note that since we did not fine-tune the networks on compressed images, there is degradation in performance relative to the current state-of-the-art.
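The AP@[.5:.95] aggregation is a simple mean over the ten per-threshold AP values, as the following sketch makes explicit (the per-threshold AP numbers would come from the detector's evaluation, not from this function):

```python
import numpy as np

def coco_ap(ap_per_iou):
    """COCO-style AP@[.5:.95]: mean of the per-threshold AP values computed
    at the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.5, 0.95, 10)
    assert len(ap_per_iou) == len(thresholds) == 10
    return float(np.mean(ap_per_iou))
```

Averaging over stricter and stricter IoU thresholds rewards detectors whose boxes stay well localized, which is exactly the property compression artifacts tend to degrade.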
We have demonstrated that by using a deep perceptual metric as a loss with MS-SSIM as a regularizer, one can obtain good image compression as judged by humans on several standard compression datasets. MS-SSIM and PSNR are not good metrics for evaluating image compression; human judgments are more reliable. We also show that DPC compression causes less degradation in a pre-trained object detector than a number of other approaches.
We would like to thank Joel Chan and Peter Hallinan for helping us in setting up the human evaluations.
-  (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pp. 1141–1151. Cited by: §3.1.
-  (2018) Generative adversarial networks for extreme learned image compression. arXiv preprint arXiv:1804.02958. Cited by: §2.
-  (2015) Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281. Cited by: §2.
-  (2016) End-to-end optimized image compression. arXiv preprint arXiv:1611.01704. Cited by: Figure 1, §1, §1, §2, §3.3, §4.1, §4, footnote 2.
Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436. Cited by: §1, §1, §3.1.
-  (2017) BPG image format (http://bellard.org/bpg/) accessed: 2017-01-30. [online] available: http://bellard.org/bpg/. Cited by: §2.
-  (2014) (Website). Cited by: Figure 1, §1, §4.
-  (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. Cited by: Figure 8, Figure 9, §4.1, §4.
-  (1997) PNG (portable network graphics) specification version 1.0. Technical report Cited by: §1.
-  (2017) Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1511–1520. Cited by: §2.
-  (2018) Towards a semantic perceptual image metric. In 2018 25th IEEE International Conference on Image Processing (ICIP), Cited by: §2.
-  (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in neural information processing systems, Cited by: §2.
-  (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1301–1310. Cited by: §4.2.
-  (1999) Kodak lossless true color image suite. source: http://r0k. us/graphics/kodak 4. Cited by: Figure 5, Figure 9, §4.1, §4.
-  Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. Cited by: §2.
-  (2016) Deep residual learning for image recognition. In Proc. Conf. Comput. Vision Pattern Recognition, pp. 770–778. Cited by: §4.2.
-  (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Figure 6, Figure 9, §4.1, §4.
-  (1999) Image compression with neural networks–a survey. Signal processing: image Communication. Cited by: §1.
-  (2017) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. structure 10, pp. 23. Cited by: §1, §2.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
-  (2012) Imagenet classification with deep convolutional neural networks. In Neural Inform. Process. Syst., pp. 1097–1105. Cited by: §2, §3.2.
-  (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.
-  (2019) Context-adaptive entropy model for end-to-end optimized image compression. ICLR. Cited by: §1, §1, §3.1.
-  (2018) Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3214–3223. Cited by: §2.
-  (2014) Microsoft COCO: common objects in context. CoRR abs/1405.0312. Cited by: Figure 10, §4.2.
-  (1988) Image compression using a neural network. In Proc. IGARSS, Cited by: §1.
-  (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 1, §1, §1, §1, §2, §3.1, §3.1, §3.1, §3.1, §3.3, §3.3, §4.1, §4.
-  (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780. Cited by: §2, §3.1.
-  (1989) Image compression by back propagation: an example of extensional programming. Models of cognition: rev. of cognitive science. Cited by: §1.
-  (2016) Deconvolution and checkerboard artifacts. Distill. Cited by: §1, §3.1.
-  (2016) . arXiv preprint arXiv:1601.06759. Cited by: §2, §3.1, §3.1.
-  (2019) Human evaluations for image compression. Cited by: §1.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §4.2, §4.2.
-  (2017) Real-time adaptive image compression. arXiv preprint arXiv:1705.05823. Cited by: §1, §1, §2, §2.
-  (2015) Imagenet large scale visual recognition challenge. International journal of computer vision. Cited by: §2, §3.2, §3.3.
-  (2017) Enhancenet: single image super-resolution through automated texture synthesis. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 4501–4510. Cited by: §1, §3.1, §4.
-  (2018) Generative compression. In 2018 Picture Coding Symposium (PCS), pp. 258–262. Cited by: §2.
-  (1948) A mathematical theory of communication. Bell system technical journal 27 (3), pp. 379–423. Cited by: §1, §2.
-  (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2, §3.2.
-  (2001) The jpeg 2000 still image compression standard. IEEE Signal processing magazine. Cited by: Figure 1, §1, §2, §4.
-  (2017) Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395. Cited by: §2, §2, §3.1.
-  (1992) The jpeg still picture compression standard. IEEE transactions on consumer electronics. Cited by: §1, §2.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing. Cited by: §1.
-  (2003) Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §1, §3.1.
-  (2010) On single image scale-up using sparse-representations. In International conference on curves and surfaces, Cited by: Figure 7, Figure 9, §4.1, §4.
-  (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: §1, §2, §3.2.