Images that capture vivid scenes and events are stored and shared extensively every day. For instance, over 500 million photos are uploaded and exchanged on Facebook per day. Thus, image compression plays a vital role in ensuring efficient storage and sharing at the scale of the entire Internet. Traditional image compression methods, such as JPEG, JPEG2000 [1] and the HEVC Intra Profile based BPG [2], as well as recent deep neural network (DNN) based image compression technologies [3, 4, 5, 6], have presented significant advances in compression efficiency. Typically, these DNN-based schemes exhibit better visual quality than the traditional methodologies at the same bit rate. However, both fail to represent the image efficiently with a pleasant reconstruction quality at a very low bitrate (e.g., below 0.05 bits per pixel (bpp)), as shown at the bottom of Fig. 1.
Fig. 2. Our extreme image compression framework via Multi-Scale AutoEncoder (MSAE) with GAN optimization: (a) overall structure, (b) autoencoder. The encoder network contains one convolutional layer with stride 1 and four convolutional layers with stride 2; all residual blocks in the information augmentation module use the same convolutional kernel size of 3 and stride of 1; the decoder network is a mirrored version of the encoder, containing four transposed convolutional layers with stride 2 and one convolutional layer with stride 1.
This is mainly because visually sensitive information (i.e., perceptual significance) cannot be well preserved using conventional quality optimization criteria, such as peak signal-to-noise ratio (PSNR) and multiscale structural similarity (MS-SSIM) [9], in such an extreme compression scenario. Recent explorations have shown that an adversarial loss could be a tentative solution to capture global semantic information and local texture, yielding pleasant and appealing reconstructions [10, 8]. Thus, Agustsson et al. [8] developed a GAN-based extreme image compression framework with bitrate targets below 0.1 bpp, resulting in noticeable subjective quality improvement compared with the traditional JPEG2000 and BPG. However, it has limitations due to its purely GAN-based structure. First, it is difficult to ensure the generalization of a GAN to capture the variety of distributions across different datasets. In the meantime, a GAN sometimes introduces unexpected textures because of failures of the discriminator.
Therefore, we propose a MultiScale AutoEncoder (MSAE) based extreme image compression structure. "Priors" at different spatial resolution scales, which capture local textures well, are embedded as references to help reconstruction and compression. Meanwhile, multiscale discriminators are used in the GAN and embedded in the end-to-end training framework for overall subjective quality optimization at a specific bit rate. This generally helps to maintain the global semantic structure for visually appealing reconstruction. We evaluate our method on both the Cityscapes and ADE20K datasets, yielding significant perceptual quality margins over images represented by the existing JPEG2000 and BPG, where a snapshot is given in Fig. 1.
2 MultiScale AutoEncoder with Generative Adversarial Optimization
Fig. 2 presents the extreme image compression framework of MSAE with generative adversarial optimization. Let $x_1$ be the original image ($W \times H \times 3$ is the size of the input). We downscale $x_1$ to obtain two more inputs $x_2$ and $x_3$. $s$ denotes the downscaling factor, which is set to 2 in this paper. Let $F_i$ be the autoencoder network at scale $i$ ($i \in \{1, 2, 3\}$), and let $U$ denote the upscaling operator. We then define the overall MSAE framework by

$$\hat{x}_3 = F_3(x_3), \quad (1)$$
$$\hat{x}_2 = F_2\big(x_2 - U(\hat{x}_3)\big) + U(\hat{x}_3), \quad (2)$$
$$\hat{x}_1 = F_1\big(x_1 - U(\hat{x}_2)\big) + U(\hat{x}_2). \quad (3)$$
Our proposed MSAE framework in (1), (2), and (3) performs a coarse-to-fine reconstruction, step by step. At the lowest scale, the autoencoder only takes the twice-downscaled image as input to derive its reconstruction, yielding the coarsest representation of the original. This coarsest reconstruction, as the prior, is then upscaled and aggregated with residuals at each scale to derive the final full-resolution reconstruction. Low-resolution reconstructions are referred to as "priors" to improve the overall rate-distortion performance. In addition, a conditional GAN [12] is integrated into our MSAE system for end-to-end training toward visually appealing reconstruction, by enabling a multiscale discriminator for the high-resolution input images.
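The coarse-to-fine recursion described above can be sketched in a few lines of NumPy. The identity "autoencoder" and the average-pool/nearest-neighbor resamplers below are placeholder assumptions standing in for the learned networks and the actual resampling filters:

```python
import numpy as np

def downscale(img, s=2):
    """Average-pool downscaling by factor s (stand-in for the actual resampler)."""
    h, w, c = img.shape
    return img.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))

def upscale(img, s=2):
    """Nearest-neighbor upscaling by factor s (stand-in for the upscaling operator)."""
    return img.repeat(s, axis=0).repeat(s, axis=1)

def msae_reconstruct(x1, autoencoder, s=2):
    """Coarse-to-fine MSAE reconstruction over three scales.

    `autoencoder` is a placeholder for the shared compress/decompress network;
    it is applied to the lowest-scale image and to the residuals at finer scales.
    """
    x2 = downscale(x1, s)
    x3 = downscale(x2, s)
    xh3 = autoencoder(x3)                                       # coarsest reconstruction
    xh2 = autoencoder(x2 - upscale(xh3, s)) + upscale(xh3, s)   # residual + upscaled prior
    xh1 = autoencoder(x1 - upscale(xh2, s)) + upscale(xh2, s)
    return xh1

x = np.random.rand(64, 64, 3)
identity = lambda t: t  # lossless stand-in autoencoder
out = msae_reconstruct(x, identity)
print(out.shape)  # (64, 64, 3)
```

With the lossless identity stand-in, the residual-plus-prior aggregation recovers the input exactly, which is a useful sanity check on the recursion.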
The same autoencoder is used at each scale of our MSAE framework. Except at the lowest scale, where the downscaled image itself serves as the input, residuals between the upscaled priors and the inputs at the same resolution are fed into the autoencoder for compression. Using residuals, instead of raw textures, generally boosts the coding efficiency at the same bitrate budget due to better energy compaction and redundancy exploitation.
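The energy-compaction argument can be seen in a toy experiment: for a smooth synthetic "image" (the ramp-plus-texture signal below is an arbitrary example), the residual against an upscaled low-resolution prior carries far less energy than the raw pixels:

```python
import numpy as np

# a smooth toy "image": a low-frequency ramp plus mild texture
h = w = 64
yy, xx = np.mgrid[0:h, 0:w]
img = 0.5 * (xx + yy) / (h + w) + 0.02 * np.sin(xx / 3.0)

# prior: 2x-downscaled (average pooling) then nearest-neighbor upscaled
prior = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
prior_up = prior.repeat(2, axis=0).repeat(2, axis=1)

residual = img - prior_up
print(np.var(img), np.var(residual))  # residual energy is orders of magnitude smaller
```

Because the residual concentrates its energy in far fewer significant values, it is cheaper to represent at the same quality than the raw texture.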
Such an autoencoder, shown in Fig. 2, includes an encoder to encode the input into a set of feature maps (fMaps). The fMaps are then passed to the quantizer and quantized into a compressed representation. Specifically, the encoder first compresses the input of size $W \times H \times 3$ to fMaps at 1/16 of the input resolution in each spatial dimension, after the four stride-2 convolutions. Usually, $W$ is the image width, $H$ is the height, and 3 is the number of color channels for the RGB color space. The fMaps are then projected down to $C$ channels at the bottleneck layer prior to being quantized. Note that $C$ varies at different scales.
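As a quick sanity check on the dimensions, the spatial size of the encoder output follows directly from the four stride-2 convolutions (the 512x1024 input below is an arbitrary example, not a resolution used in the paper):

```python
def encoder_fmap_shape(w, h, n_stride2=4):
    """Spatial size of the encoder fMaps: the stride-1 convolution keeps
    W x H, and each of the four stride-2 convolutions halves both dimensions."""
    f = 2 ** n_stride2          # overall spatial downscaling factor: 16
    return w // f, h // f

# e.g., a 512 x 1024 input yields 32 x 64 fMaps (times C bottleneck channels,
# where C varies per scale)
print(encoder_fmap_shape(512, 1024))  # (32, 64)
```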
The decoder, which also serves as the generator, tries to reconstruct the image from the compressed representation. Within the decoder, an information augmentation module based on nine residual blocks [13] is aggregated to learn more information from the data and improve the reconstruction. Decoded fMaps then go through a mirrored network of the encoder to obtain the final reconstruction with the same dimensions, i.e., $W \times H \times 3$, as the input image.
Note that the autoencoder is optimized using PSNR or MS-SSIM by default, often resulting in compression artifacts such as blocking, blurring and contouring effects at a low bitrate. To address this problem, we adopt an adversarial loss in training to reconstruct images with visually pleasant quality.
2.2 End-to-End Rate-Distortion Optimization
We adopt adversarial training in the end-to-end optimization framework for extreme compression. This is mainly because the adversarial loss can address the blurring and contouring problems when the bitrate gets low. In the proposed framework, the decoder (or generator) is conditioned on the compressed representations, so there is no need to feed random noise to the generator. For the discriminator, we use the multiscale architecture following [14], which measures the divergence between the real image and the generated fake image both globally and locally. Here we introduce a loss function that is closer to perceptual similarity instead of relying on pixel-wise distortion, i.e.,

$$\mathcal{L}_{p} = \sum_{j} \sum_{i} \frac{\lambda}{H_{i,j} W_{i,j}} \left\| \phi_{i,j}(x) - \phi_{i,j}(\hat{x}) \right\|^2,$$

with $i$ indexing the convolutional layers and $j$ indexing the discriminator scales. $\phi_{i,j}$ represents the feature map generated by the $i$-th convolution (with stride 2) of the $j$-th scale of the multiscale discriminator. $H_{i,j}$ and $W_{i,j}$ are the dimensional sizes of the respective feature maps. For the coefficient $\lambda$, we set it to 10.
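This feature-space distortion can be sketched in NumPy as below. The nested-list feature layout and the squared-L2 form are illustrative assumptions; in practice the features come from the discriminator's intermediate activations:

```python
import numpy as np

def perceptual_loss(feats_real, feats_fake, lam=10.0):
    """Feature-space distortion summed over discriminator scales (outer list)
    and convolutional layers (inner list), normalized by each feature map's
    spatial size."""
    total = 0.0
    for scale_real, scale_fake in zip(feats_real, feats_fake):
        for fr, ff in zip(scale_real, scale_fake):
            h, w = fr.shape[:2]
            total += lam / (h * w) * np.sum((fr - ff) ** 2)
    return total

# toy features: 2 discriminator scales x 2 layers, 8x8 maps with 4 channels
rng = np.random.default_rng(0)
real = [[rng.normal(size=(8, 8, 4)) for _ in range(2)] for _ in range(2)]
fake = [[f.copy() for f in scale] for scale in real]
print(perceptual_loss(real, fake))  # 0.0 when the features match exactly
```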
The regular GAN [10] hypothesizes the discriminator as a classifier with the sigmoid cross-entropy loss function, which may lead to the gradient vanishing problem. A number of ways have been developed to avoid gradient vanishing [16, 17, 18]. In this paper, we use the objectives developed for the Least-Squares GAN [19], which replace the sigmoid cross entropy with scalar quadratic penalties. This results in the generator loss

$$\mathcal{L}_{G} = \sum_{j} \mathbb{E}\big[ (D_j(\hat{x}) - 1)^2 \big],$$

and the discriminator loss

$$\mathcal{L}_{D} = \sum_{j} \mathbb{E}\big[ (D_j(x) - 1)^2 \big] + \mathbb{E}\big[ D_j(\hat{x})^2 \big],$$

where $D_j$ denotes the discriminator at the $j$-th scale.
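Assuming the common least-squares label choice (0 for fake, 1 for real), the two objectives can be sketched as:

```python
import numpy as np

def lsgan_generator_loss(d_fake_scales):
    """Least-squares generator loss summed over discriminator scales:
    push the discriminator's outputs on fake images toward the real label 1."""
    return sum(np.mean((d - 1.0) ** 2) for d in d_fake_scales)

def lsgan_discriminator_loss(d_real_scales, d_fake_scales):
    """Least-squares discriminator loss: push real outputs to 1, fake outputs to 0."""
    return sum(np.mean((dr - 1.0) ** 2) + np.mean(df ** 2)
               for dr, df in zip(d_real_scales, d_fake_scales))

d_real = [np.ones((4, 4)), np.ones((2, 2))]    # perfect discriminator on real images
d_fake = [np.zeros((4, 4)), np.zeros((2, 2))]  # perfect discriminator on fakes
print(lsgan_discriminator_loss(d_real, d_fake))  # 0.0
print(lsgan_generator_loss(d_fake))              # 2.0 (one unit per scale)
```

The quadratic penalty keeps gradients informative even for confidently classified samples, which is the gradient-vanishing remedy referenced above.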
To backpropagate through the non-differentiable quantizer, we model the entropy rate at the bottleneck layer. We simply add uniform noise during training to ensure differentiability, and replace it with hard rounding, ROUND(·), at inference. To make sure the approximated tensor values are good enough, the training phase must balance the quality of the reconstruction against the bitrate, by adding an entropy rate term to the training loss for optimal rate-distortion efficiency, i.e.,
$$\mathcal{L} = \alpha \cdot D(x, \hat{x}) + \beta \cdot H(\hat{z}).$$

As we can see, rate-distortion trade-offs are adjusted by varying $\alpha$ and $\beta$. The distortion $D(x, \hat{x})$ is measured by the PSNR in this study, and the entropy of the compressed representation, $H(\hat{z})$, is used to approximate the encoding bitrate. Such a compound loss is applied in an end-to-end trainable framework to achieve the optimal rate-distortion performance.
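A minimal NumPy sketch of this objective is shown below, with the uniform-noise quantization surrogate and an empirical histogram entropy standing in for the learned entropy model; the coefficient values and the negative-PSNR distortion convention are placeholder assumptions:

```python
import numpy as np

def quantize(z, training, rng=None):
    """Uniform-noise surrogate in training (differentiable); hard rounding at inference."""
    if training:
        rng = rng or np.random.default_rng()
        return z + rng.uniform(-0.5, 0.5, size=z.shape)
    return np.round(z)

def psnr(x, xh, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    return 10 * np.log10(peak ** 2 / np.mean((x - xh) ** 2))

def entropy_bits(symbols):
    """Empirical entropy (bits/symbol) of the quantized representation."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def rd_loss(x, xh, z_hat, alpha=1.0, beta=0.01):
    """Compound rate-distortion loss: negative PSNR as the distortion term
    (so lower loss means better quality) plus a weighted rate term."""
    return alpha * (-psnr(x, xh)) + beta * entropy_bits(z_hat)

rng = np.random.default_rng(0)
x = rng.random((16, 16, 3))
xh = np.clip(x + rng.normal(scale=0.01, size=x.shape), 0.0, 1.0)   # good reconstruction
zh = quantize(rng.normal(scale=2.0, size=1000), training=False)    # quantized bottleneck
print(rd_loss(x, xh, zh))
```

Raising beta favors lower-entropy (cheaper) representations at the cost of distortion, which is exactly the trade-off the text describes.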
3 Experimental Studies
Datasets: We use two publicly accessible datasets for training: Cityscapes [20] and ADE20K [21]. The Cityscapes dataset contains images of 2048x1024 resolution in the RGB color space. During training, we randomly select 2400 images for training and use the rest for validation. These images are downscaled in our experiments to avoid GPU memory overflow during training. For the ADE20K dataset, we choose a subset of images and segment it randomly into a training set and a validation set; since the image sizes vary, for simplicity we rescale all of them to a fixed resolution for training and validation.
Parameters: We set the loss coefficients per scale accordingly. The number of channels of the bottleneck layer varies between scales. Additionally, we use the Adam optimizer with a fixed learning rate for end-to-end learning.
Performance Evaluation: To evaluate the performance of our proposed multiscale autoencoder based extreme image compression method, we compare it with BPG and JPEG2000, as shown in Fig. 3, where both objective PSNRs and subjective snapshots of four samples are illustrated. For all the images we tested from the Cityscapes and ADE20K datasets, the bitrate is below 0.05 bpp. For quantitative evaluation, we compute the PSNR between the true image and the reconstruction. We have to mention, however, that at such a low bitrate, quantitative measurements such as PSNR or MS-SSIM become meaningless, as they penalize changes in local structure rather than rewarding preservation of the global semantics.
It is clear that our method demonstrates a significant perceptual quality margin, with visually appealing reconstructions, over the traditional JPEG2000 and BPG, even though the PSNR suffers. Similar visual effects are observed for the other images shown in Fig. 3. This also coincides with the observation that learning-based compression often provides better visual quality but worse PSNR [7, 6, 8]. However, recent attempts have achieved better PSNR with learning-based methods than JPEG2000, showing the promising prospects of DNN-based compression schemes.
4 Concluding Remarks
We have developed an extreme image compression framework via a multiscale autoencoder structure with embedded generative adversarial optimization for end-to-end training. Such a multiscale autoencoder is realized by downscaling the original image into various resolution scales to capture the image statistics both locally and globally. Each decoded representation at a lower resolution scale is utilized as the prior for efficient compression at the next higher scale. In addition to the traditional pixel-wise distortion measurements (e.g., PSNR, MS-SSIM), we have introduced the adversarial loss for pleasant image reconstruction at a very low bitrate (i.e., below 0.05 bpp), to preserve the image structure and global semantics. Experimental studies have demonstrated that our method provides significant subjective quality improvement over the existing JPEG2000 and BPG on publicly accessible datasets.
- David Taubman and Michael Marcellin, JPEG2000: Image Compression Fundamentals, Standards and Practice, vol. 642, Springer Science & Business Media, 2012.
- Fabrice Bellard, "BPG image format," https://bellard.org/bpg/.
- George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell, "Full resolution image compression with recurrent neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.
-  Oren Rippel and Lubomir Bourdev, “Real-time adaptive image compression,” arXiv preprint arXiv:1705.05823, 2017.
-  Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston, “Variational image compression with a scale hyperprior,” arXiv preprint arXiv:1802.01436, 2018.
-  Haojie Liu, Tong Chen, Qiu Shen, Tao Yue, and Zhan Ma, “Deep image compression via end-to-end learning,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 2575–2578.
-  Johannes Ballé, Valero Laparra, and Eero P Simoncelli, “End-to-end optimized image compression,” arXiv preprint arXiv:1611.01704, 2016.
-  Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool, “Generative adversarial networks for extreme learned image compression,” arXiv preprint arXiv:1804.02958, 2018.
- Zhou Wang, Eero P Simoncelli, and Alan C Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. IEEE, 2003, vol. 2, pp. 1398–1402.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
- Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy, "Recovering realistic texture in image super-resolution by deep spatial feature transform," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
-  Mehdi Mirza and Simon Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8798–8807.
-  Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 105–114.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
-  Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
-  David Berthelot, Thomas Schumm, and Luke Metz, “Began: boundary equilibrium generative adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.
-  Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
- Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
-  Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, vol. 1, p. 4.
-  Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
-  Johannes Ballé, “Efficient nonlinear transforms for lossy image compression,” arXiv preprint arXiv:1802.00847, 2018.