Extreme Image Compression via Multiscale Autoencoders With Generative Adversarial Optimization

by Chao Huang, et al.
Nanjing University

We propose a MultiScale AutoEncoder (MSAE) based extreme image compression framework to offer visually pleasing reconstruction at a very low bitrate. Our method leverages the "priors" at different resolution scales to improve the compression efficiency, and also employs a generative adversarial network (GAN) with multiscale discriminators to perform end-to-end trainable rate-distortion optimization. We compare the perceptual quality of our reconstructions with that of traditional compression algorithms, namely the High-Efficiency Video Coding (HEVC) based Intra Profile and JPEG2000, on the public Cityscapes and ADE20K datasets, demonstrating a significant subjective quality improvement.





1 Introduction

Images, which capture vivid scenes and events, are stored and shared extensively every day. For instance, over 500 million photos are uploaded and exchanged on Facebook per day. Thus, image compression plays a vital role in ensuring efficient storage and sharing at Internet scale. Traditional image compression methods, such as JPEG, JPEG2000 [1], and the HEVC Intra Profile based BPG [2], as well as recent deep neural network (DNN) based image compression technologies [3, 4, 5, 6], have presented significant advances in image compression efficiency. Typically, these DNN-based schemes exhibit better visual quality than the traditional methodologies at the same bitrate [7]. However, both families fail to represent the image efficiently with a pleasant reconstruction quality at a very low bitrate, measured in bits per pixel (bpp) [8], as shown at the bottom of Fig. 1.

Figure 1: Perceptual quality comparison of our method versus BPG on a sample image from the Cityscapes dataset.
Figure 2: Our extreme image compression framework via MultiScale AutoEncoder (MSAE) with GAN optimization: (a) overall structure, (b) autoencoder. The encoder network contains 1 convolutional layer with stride 1 and 4 convolutional layers with stride 2; all residual blocks in the information augmentation module use the same convolutional kernel size of 3 and stride 1; the decoder network mirrors the encoder, containing 4 transposed convolutional layers with stride 2 and 1 convolutional layer with stride 1.

This is mainly because visually sensitive information (i.e., perceptual significance) cannot be well preserved by conventional quality optimization criteria, such as peak signal-to-noise ratio (PSNR) and multiscale structural similarity (MS-SSIM) [9], in such extreme compression scenarios. Recent explorations have shown that adversarial loss could be a candidate solution for capturing global semantic information and local texture, yielding pleasant and appealing reconstructions [10, 8]. Accordingly, Agustsson et al. [8] developed a GAN-based extreme image compression framework with bitrate targets below 0.1 bpp, resulting in a noticeable subjective quality improvement over the traditional JPEG2000 and BPG. However, adopting a purely GAN-based structure has limitations. First, it is difficult to ensure that the GAN generalizes well enough to capture the varied distributions of different datasets. Second, the GAN sometimes introduces unexpected textures when the discriminator fails [11].

Therefore, we propose a MultiScale AutoEncoder (MSAE) based extreme image compression structure. “Priors” at different spatial resolution scales, which capture the local textures well, are embedded as references to help the reconstruction and compression. Meanwhile, multiscale discriminators are used in the GAN and embedded in the end-to-end training framework for overall subjective quality optimization at a specific bitrate. This generally helps to maintain the global semantic structure for visually appealing reconstruction. We evaluate our method on both the Cityscapes and ADE20K datasets, where it yields significant perceptual quality margins over images coded with JPEG2000 and BPG; a snapshot is given in Fig. 1.

2 MultiScale AutoEncoder with Generative Adversarial Optimization

Fig. 2 presents the extreme image compression framework of MSAE with generative adversarial optimization. Let $x$ be the original image of size $W \times H \times C$. We downscale $x$ to obtain two more inputs at successively lower resolutions, using a downscaling factor of 2 in this paper. Each scale has its own autoencoder network, and an upscaling operator maps a reconstruction to the next higher resolution. The overall MSAE framework is then defined by equations (1), (2), and (3), one per scale.


Our proposed MSAE framework in (1), (2), and (3) performs a coarse-to-fine reconstruction step by step. At the lowest scale, the autoencoder takes only the most downscaled image as its input and derives the corresponding reconstruction, yielding the coarsest representation of the original image. This reconstruction, serving as the prior, is then upscaled and aggregated with the decoded residuals at each higher scale to derive the final reconstruction. Low-resolution reconstructions are referred to as “priors” and improve the overall rate-distortion performance. In addition, a conditional GAN [12] is integrated into our MSAE system for end-to-end training toward visually appealing reconstruction, with multiscale discriminators applied to each high-resolution input image.
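The coarse-to-fine recursion described above can be written compactly. The following is a sketch under assumed notation, which is not necessarily the authors' own: $x_0 = x$ is the original image, $x_1$ and $x_2$ are its versions downscaled by factors of 2 and 4, $A_i$ is the autoencoder at scale $i$, $U$ is the upscaling operator, and $\hat{x}_i$ is the reconstruction at scale $i$.

```latex
\begin{align}
\hat{x}_2 &= A_2(x_2), \\
\hat{x}_1 &= U(\hat{x}_2) + A_1\bigl(x_1 - U(\hat{x}_2)\bigr), \\
\hat{x}_0 &= U(\hat{x}_1) + A_0\bigl(x_0 - U(\hat{x}_1)\bigr).
\end{align}
```

In this form, each higher scale only has to code what the upscaled prior fails to predict.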

2.1 AutoEncoder

The same autoencoder architecture is used at each scale of our MSAE framework. Except at the lowest scale, where the downscaled image itself serves as the input, the residuals between the inputs and the upscaled priors at the same resolution are fed into the autoencoder for compression. Using residuals instead of the original textures generally boosts the coding efficiency at the same bitrate budget, owing to better energy compaction and redundancy exploitation.
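The residual-coding idea can be illustrated with a toy example. The sketch below is not the learned network: `toy_codec` is a hypothetical stand-in that applies uniform quantization, and the 2x2 averaging and nearest-neighbour upscaling are assumed choices for the downscaling and upscaling operators. It only shows how a coarse prior leaves a low-energy residual for the finer scale to code.

```python
def downscale2(img):
    """Halve resolution by averaging each 2x2 block (assumes even dimensions)."""
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, len(img[0]), 2)]
            for r in range(0, len(img), 2)]


def upscale2(img):
    """Double resolution by nearest-neighbour replication."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def toy_codec(img, step):
    """Hypothetical stand-in for a learned autoencoder: uniform quantization."""
    return [[round(v / step) * step for v in row] for row in img]


def subtract(a, b):
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def energy(img):
    """Sum of squared sample values."""
    return sum(v * v for row in img for v in row)


def msae_reconstruct(x0, step=4.0):
    """Coarse-to-fine reconstruction over three scales using residual coding."""
    x1 = downscale2(x0)
    x2 = downscale2(x1)
    rec2 = toy_codec(x2, step)               # coarsest scale: code the image itself
    prior1 = upscale2(rec2)                  # upscaled prior for the middle scale
    rec1 = add(prior1, toy_codec(subtract(x1, prior1), step))
    prior0 = upscale2(rec1)                  # upscaled prior for the finest scale
    residual0 = subtract(x0, prior0)
    rec0 = add(prior0, toy_codec(residual0, step))
    return rec0, residual0


# A smooth 8x8 gradient image as a single-channel test input.
image = [[10.0 * r + 5.0 * c for c in range(8)] for r in range(8)]
reconstruction, finest_residual = msae_reconstruct(image)
```

On this input the finest-scale residual carries far less energy than the image itself, which is the energy compaction effect the paragraph refers to.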

The autoencoder, shown in Fig. 2, includes an encoder that maps the input to a set of feature maps (fMaps), which are then passed to the quantizer and quantized into a compressed representation. Specifically, for an input of size $W \times H \times C$, where $W$ is the image width, $H$ is the height, and $C$ is the number of color channels (e.g., $C = 3$ for the RGB color space), the encoder first produces feature maps at a reduced spatial resolution. The fMaps are then projected down to a smaller number of channels at the bottleneck layer prior to quantization. Note that the number of bottleneck channels varies across scales.
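The spatial size at the bottleneck follows from the layer strides listed in the caption of Fig. 2. The helper below is a minimal sketch that assumes each stride-2 convolution exactly halves the width and height (i.e., "same"-style padding, which the text does not state); the function name and the example sizes are hypothetical.

```python
def bottleneck_shape(width, height, bottleneck_channels, num_stride2_layers=4):
    """Return (width, height, channels) of the bottleneck feature maps.

    Assumes every stride-2 layer halves both spatial dimensions exactly;
    the stride-1 layer leaves them unchanged.
    """
    factor = 2 ** num_stride2_layers
    if width % factor or height % factor:
        raise ValueError("width and height must be multiples of %d" % factor)
    return (width // factor, height // factor, bottleneck_channels)


def num_symbols(shape):
    """Number of values that the quantizer has to code."""
    w, h, c = shape
    return w * h * c


# Hypothetical example: a 1024x512 input with 8 bottleneck channels.
shape = bottleneck_shape(1024, 512, 8)
```

With four stride-2 layers the spatial resolution drops by a factor of 16 in each dimension, so the number of coded symbols is governed mainly by the bottleneck channel count.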

The decoder, which also acts as the generator, tries to reconstruct the image from the compressed representation. Within the decoder, an information augmentation module built from nine residual blocks [13] is used to learn more information from the data and improve the reconstruction. The decoded fMaps then go through a mirror network of the encoder to obtain the final reconstruction, which has the same dimensions, i.e., $W \times H \times C$, as the input image.

Note that the autoencoder is optimized using PSNR or MS-SSIM by default, which often results in compression artifacts such as blocking, blurring, and contouring at a low bitrate. To address this problem, we adopt adversarial loss [10] in training to reconstruct images with visually pleasant quality.

2.2 End-to-End Rate-Distortion Optimization

We adopt adversarial training in an end-to-end optimization framework for extreme compression, mainly because adversarial loss can address the blurring and contouring problems that arise when the bitrate becomes very low [10]. In the proposed framework, the decoder (generator) is conditioned on the compressed representations, so there is no need to feed random noise to the generator [12]. For the discriminator, we use the multiscale architecture following [14], which measures the divergence between the real image and the fake image produced by the generator both globally and locally. Here we introduce a loss function that is closer to perceptual similarity instead of relying on pixel-wise distortion [15]. It is computed on the feature maps of the multiscale discriminator, where each feature map is generated by one convolution (with stride 2) at one discriminator scale, and each term is normalized by the dimensional size of the respective feature map. The associated coefficient is set to 10.
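A feature-based loss of this kind can be sketched as follows. This is an illustration under assumptions, not the authors' exact formula: it uses an L1 distance between discriminator feature maps of the real and reconstructed images, normalizes each term by the number of elements in the feature map, and applies the weight of 10 mentioned in the text. The function name and the nested-list data layout are hypothetical.

```python
def feature_matching_loss(real_feats, fake_feats, weight=10.0):
    """Weighted, size-normalized L1 distance between discriminator features.

    real_feats[j][i] and fake_feats[j][i] are 2-D feature maps (lists of rows)
    from convolution i of discriminator scale j.
    """
    total = 0.0
    for real_scale, fake_scale in zip(real_feats, fake_feats):
        for real_map, fake_map in zip(real_scale, fake_scale):
            size = len(real_map) * len(real_map[0])   # elements in this map
            l1 = sum(abs(r - f)
                     for real_row, fake_row in zip(real_map, fake_map)
                     for r, f in zip(real_row, fake_row))
            total += l1 / size
    return weight * total


# One discriminator scale with a single 2x2 feature map, for illustration.
real = [[[[1.0, 2.0], [3.0, 4.0]]]]
fake = [[[[0.0, 2.0], [3.0, 2.0]]]]
loss = feature_matching_loss(real, fake)
```

Normalizing by the map size keeps large, shallow feature maps from dominating the small, deep ones.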

The regular GAN [10] models the discriminator as a classifier with the sigmoid cross-entropy loss function, which may lead to the vanishing gradient problem. A number of approaches have been developed to avoid vanishing gradients [16, 17, 18]. In this paper, we use the objective measures developed for the Least-Squares GAN [19], which are defined through two scalar functions and yield one loss for the generator and one for the discriminator.
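For reference, the standard Least-Squares GAN objectives can be written as follows, up to constant factors. The notation is assumed here rather than taken from the text: $x$ is a real image, $\hat{x}$ is the generator output, $D$ is the discriminator, and $f(y) = (y - 1)^2$ and $g(y) = y^2$ are the usual least-squares choices for the two scalar functions.

```latex
\begin{align}
\mathcal{L}_{G} &= \mathbb{E}\bigl[f(D(\hat{x}))\bigr]
                 = \mathbb{E}\bigl[(D(\hat{x}) - 1)^2\bigr], \\
\mathcal{L}_{D} &= \mathbb{E}\bigl[f(D(x))\bigr] + \mathbb{E}\bigl[g(D(\hat{x}))\bigr]
                 = \mathbb{E}\bigl[(D(x) - 1)^2\bigr] + \mathbb{E}\bigl[D(\hat{x})^2\bigr].
\end{align}
```

Because the penalty grows quadratically with the distance from the target label, the gradient does not saturate the way it can with a sigmoid cross-entropy classifier.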


In order to backpropagate through the non-differentiable quantizer, we model the entropy rate at the bottleneck layer following [5]. We simply add uniform noise to ensure differentiability during training and replace it with rounding, ROUND(·), at inference. To make sure the approximated tensor values are good enough, training must balance the quality of the reconstruction against the bitrate, which is done by adding an entropy rate term to the training loss for optimal rate-distortion efficiency.


Rate-distortion trade-offs are adjusted by varying the weighting coefficients of the loss terms. Distortion is measured by the PSNR in this study, and the entropy of the compressed representation is used to approximate the encoding bitrate [5]. This compound loss is applied in an end-to-end trainable framework to achieve optimal rate-distortion performance.
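The two quantizer modes can be sketched in a few lines. This is a minimal illustration with assumptions: the noise is drawn uniformly from [-0.5, 0.5], which is a common choice that the text does not specify, and the empirical entropy below is a simple stand-in for the learned entropy model of [5]. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def quantize(values, training, rng=None):
    """Noise-based relaxation in training, hard rounding at inference."""
    if training:
        rng = rng or random.Random()
        # Additive uniform noise keeps the operation differentiable w.r.t. values.
        return [v + rng.uniform(-0.5, 0.5) for v in values]
    return [float(round(v)) for v in values]


def empirical_entropy_bits(symbols):
    """Average bits per symbol under the empirical symbol distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


latents = [0.2, 1.7, -2.4]
noisy = quantize(latents, training=True, rng=random.Random(0))
hard = quantize(latents, training=False)
rate = empirical_entropy_bits(hard)
```

Multiplying the per-symbol entropy by the number of bottleneck symbols and dividing by the number of image pixels gives a rough bpp estimate.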

Figure 3: Performance comparison of our proposed extreme image compression method versus BPG and JPEG2000 on the ADE20K dataset, with objective PSNR and subjective snapshots.

3 Experimental Studies

Datasets: We use two publicly accessible datasets for training: Cityscapes [20] and ADE20K [21]. The Cityscapes images are in RGB color space. We randomly select 2400 of them for training and use the rest for validation. These images are downscaled in our experiments to avoid GPU memory overflow in training. For the ADE20K dataset, we choose a subset of images, which is then split randomly into a training set and a validation set. Since the image sizes vary, we rescale all of them to a common size for training and validation.

Parameters: The training configuration consists of the loss-weighting coefficients, the number of bottleneck channels, and the learning rate. The number of channels of the bottleneck layer varies across scales. We use the Adam optimizer for end-to-end learning.

Performance Evaluation: To evaluate the performance of our proposed multiscale autoencoder based extreme image compression method, we compare it with BPG and JPEG2000, as shown in Fig. 3, where both objective PSNRs and subjective snapshots of four samples are illustrated. For all the images we tested in the Cityscapes and ADE20K datasets, the bitrate is below 0.05 bpp. For quantitative evaluation, we compute the PSNR between the true image and its reconstruction. We note, however, that at such a low bitrate, quantitative measurements such as PSNR or MS-SSIM [22] become meaningless, as they penalize changes in local structure rather than rewarding preservation of the global semantics.
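The two quantities reported here, PSNR and bpp, follow standard definitions. The sketch below assumes 8-bit samples (peak value 255) and flattened pixel lists; the function names and example numbers are hypothetical and not taken from the experiments.

```python
import math


def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf          # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


def bits_per_pixel(total_bits, width, height):
    """Bitrate normalized by the number of pixels in the original image."""
    return total_bits / (width * height)


# Every sample is off by one, so the MSE is exactly 1.
value = psnr([10, 20, 30, 40], [11, 19, 31, 39])
rate = bits_per_pixel(1024, 256, 128)
```

Since PSNR depends only on the per-sample squared error, a reconstruction with plausible but different texture can score poorly even when it looks better.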

It is clear that our method demonstrates a significant perceptual quality margin, with visually appealing reconstructions, over the traditional JPEG2000 and BPG, even though its PSNR suffers. Similar visual effects are observed for images other than those shown in Fig. 3. This also agrees with earlier observations that learning-based compression often provides better visual quality but worse PSNR [7, 6, 8]. However, recent attempts have achieved better PSNR with learning-based methods than with JPEG2000 [23], showing the promise of DNN-based compression schemes.

4 Concluding Remarks

We have developed an extreme image compression framework via a multiscale autoencoder structure with embedded generative adversarial optimization for end-to-end training. The multiscale autoencoder is realized by downscaling the original image into various resolution scales to capture the image statistics locally and globally. Each decoded representation at a lower resolution scale is used as the prior for efficient compression at the next higher scale. In addition to the traditional pixel-wise distortion measurements (e.g., PSNR, MS-SSIM), we have introduced the adversarial loss for pleasant image reconstruction at a very low bitrate (i.e., usually below 0.5 bpp), to preserve the image structure and global semantics. Experimental studies have demonstrated that our method provides a significant subjective quality improvement over the existing JPEG2000 and BPG on publicly accessible datasets.