
ProxIQA: A Proxy Approach to Perceptual Optimization of Learned Image Compression

10/19/2019
by   Li-Heng Chen, et al.

The use of ℓ_p(p=1,2) norms has largely dominated the measurement of loss in neural networks due to their simplicity and analytical properties. However, when used to assess the loss of visual information, these simple norms are not very consistent with human perception. Here, we describe a different "proximal" approach to optimize image analysis networks against quantitative perceptual models. Specifically, we construct a proxy network, broadly termed ProxIQA, which mimics the perceptual model while serving as a loss layer of the network. We experimentally demonstrate how this optimization framework can be applied to train an end-to-end optimized image compression network. By building on top of an existing deep image compression model, we are able to demonstrate a bitrate reduction of as much as 31% over MSE optimization, given a specified perceptual quality (VMAF) level.



I Introduction

Recently, deep neural networks have been successfully and ubiquitously applied to diverse image processing and computer vision tasks, such as semantic segmentation [1], object recognition [2], and optical flow [3]. Many classic image transformation problems can be approached using a deep generative network, which learns to reconstruct high-quality output images from degraded input image(s). Explicitly, the generative network is trained in a supervised manner with a loss function, which is used to measure the fidelity between the output and a ground-truth image. For instance, the denoising task aims to reconstruct a noise-free image from a noisy image, and convolutional neural networks (CNNs) have been shown to provide good noisy-to-pristine mapping functions [4, 5]. Similar tasks where retaining image fidelity is important include deep image compression, super-resolution [6, 7], frame interpolation [8], and so on.

Although a significant amount of research has been applied to deep learning image transformation problems, most of this work has focused on investigating network architectures or improving convergence speed. The selection of an appropriate loss function, however, has not been studied as much. The choice of the loss functions used to guide model training has been largely limited to the ℓ_p norm family, in particular the MSE (squared ℓ_2 norm), the ℓ_1 norm, and variants of these [9]. The structural similarity quality index (SSIM) [10] and its multi-scale version (MS-SSIM) [11] have also been adopted as loss functions for several image reconstruction tasks [12, 13], owing to their perceptual relevance and good analytic properties, such as differentiability.

Fig. 1: General framework of perceptual optimization using a ProxIQA network: a generative network N_G takes an image x as input and outputs a reconstructed image x̂. Here θ and φ are the parameters of N_G and of the ProxIQA network N_P, respectively, while y is the ground-truth image. Given an image quality measurement M, the ProxIQA network is learned as its proxy, so that its output approximates M(x̂, y).

Perceptual image quality assessment has been a long-standing research problem. Although numerous powerful perceptual models have been proposed to predict the perceived quality of a distorted picture, these more advanced quality indices have rarely been adopted as deep network loss functions, because they are generally non-differentiable and functionally complex.

Towards bridging the gap between modern perceptual quality models and deep generative networks, we explore the potential of adopting more powerful and sophisticated perceptual image quality models as loss functions for deep neural networks, by simulating the measurements made by a perceptual model with a proxy network, which we term ProxIQA. As shown in Fig. 1, the main idea is to optimize the parameters θ of the generative network N_G using a ProxIQA network N_P as a perceptual loss function:

(1)   L(θ) = N_P(x̂, y; φ), with x̂ = N_G(x; θ),

where x and x̂ are the input and output of the generative network and y is the ground-truth image. In the image compression problem, x is an uncompressed image whose fidelity we wish to preserve after compression; thus, this is a special case where y = x. The parameters φ of the ProxIQA network are optimized so that it mimics the quality model M.

The outline of this paper is as follows: Section II reviews the relevant literature of image quality assessment, perceptual optimization, and deep image compression. Section III describes the ProxIQA framework, while Section IV provides analysis and experimental results. Finally, Section V concludes the paper.

(a) Determined function loss
(b) Perceptual loss
(c) ProxIQA loss
Fig. 2: Comparison of different perceptual loss layers for generative neural networks. Given a target patch and a reconstructed patch, (a) determined-function based approaches typically use a differentiable function having a certain degree of convexity, such as SSIM and MS-SSIM; (b) perceptual-loss based approaches define the loss on features extracted from intermediate layers of a pre-trained network (such as VGG); (c) our method uses the output of a proxy network that approximates an IQA model as the loss function.

II Related Work

In this section, we provide a literature review of studies that are closely related to our work. The relevant topics of objective image quality assessment, perceptual optimization and deep compression are briefly reviewed.

II-A Perceptual Image Quality Metrics

Over the past decade there has been remarkably increasing interest in developing objective image quality assessment (IQA) methods. Objective IQA models are commonly classified as full-reference, reduced-reference, or no-reference, based on the amount of information they require from a reference image of ostensibly pristine quality. Since ground-truth data may be assumed to be available in our setting, we only consider, and review, full-reference (FR) IQA models here.

Beyond the well-known structural similarity index and other SSIM-type methods, a wide variety of perception-based FR models have been designed, including the visual signal-to-noise ratio index (VSNR) [14], the visual information fidelity (VIF) index [15], the most apparent distortion (MAD) model [16], the feature similarity index (FSIM) and its extension FSIMc [17], and the visual saliency-induced index (VSI) [18].

With the rapid development of machine learning, important data-driven models have also begun to emerge [19, 20, 21, 22, 23, 24]. A particularly successful example is Netflix’s open-source FR video quality engine, Video Multimethod Assessment Fusion (VMAF) [25], which combines multiple quality features to train a Support Vector Regressor (SVR) that predicts subjective judgments of video quality. When applied to still pictures, VMAF treats the data as a video frame having zero motion. Like SSIM, VMAF is used to perceptually optimize tremendous volumes of internet video traffic.

Generally, more advanced, high-performance quality prediction models such as these are difficult to adopt as loss functions for end-to-end network optimization.

II-B Perceptual Optimization

As tractable tools for perceptual optimization, SSIM and MS-SSIM have been widely adopted because of the simple analytical form of their gradients and their computational ease. Moreover, their convexity properties [26] make them feasible targets for optimization. For example, two recent studies adopted structural similarity functions as loss layers of image generation models, obtaining improved results as validated by a human subjective study [12] and by objective evaluation against several other perceptual models [13].

Rather than optimizing a closed-form mathematical function, another approach uses a deep neural network to guide the training. Recent experimental studies suggest that the features extracted from a well-trained image classification network are able to capture information that is useful for other perceptual tasks [27]. As illustrated in Fig. 2(b), the perceptual loss is defined as

(2)   L_perceptual(ŷ, y) = (1 / N_j) ‖Φ_j(ŷ) − Φ_j(y)‖²₂,

where Φ_j(·) denotes the output feature map, having N_j elements, of the j-th layer of a pre-trained network Φ.

In practice, the loss computed from high-level features extracted from a pre-trained VGG classification network [28], often called the VGG loss, has been commonly adopted for diverse computer vision tasks, including style transfer [29, 30], super-resolution [31, 29, 32, 33], and image inpainting [34].
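For concreteness, the following is a minimal sketch of a VGG-based perceptual loss of the form (2), written in Keras-style TensorFlow (the present work used TensorFlow 1.12). The choice of VGG16, the layer name "block3_conv3", and the assumption of inputs scaled to [0, 1] are illustrative only, not settings taken from any of the cited works.

```python
import tensorflow as tf

def make_vgg_perceptual_loss(layer_name="block3_conv3"):
    """Perceptual (VGG) loss in the spirit of Eq. (2): mean squared error
    between feature maps of a fixed, pre-trained classification network."""
    vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
    phi = tf.keras.Model(vgg.input, vgg.get_layer(layer_name).output)
    phi.trainable = False  # Phi stays frozen; only the generator is trained.

    def loss(y_true, y_pred):
        # Inputs are assumed to lie in [0, 1]; rescale to what VGG16 expects.
        f_true = phi(tf.keras.applications.vgg16.preprocess_input(y_true * 255.0))
        f_pred = phi(tf.keras.applications.vgg16.preprocess_input(y_pred * 255.0))
        n_j = tf.cast(tf.reduce_prod(tf.shape(f_true)[1:]), tf.float32)
        return tf.reduce_sum(tf.square(f_true - f_pred), axis=[1, 2, 3]) / n_j

    return loss
```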

Fig. 3: Detailed framework of the proposed optimization strategy. Perceptually training a deep image compression model involves alternating optimization of the compression network (left side of the figure) and the ProxIQA network (right side of the figure). Thin arrows indicate the flow of data in the network, while bold arrows represent the information being delivered to update the complementary network.
Fig. 4: Architecture of the ProxIQA network. The convolutional layer parameters are denoted as: height × width × input channels × output channels, stride, and padding; the max pooling layers are denoted as: vertical pooling size × horizontal pooling size.

II-C End-to-end Optimized Lossy Image Compression

Recently, lossy image compression models have been realized using deep neural network architectures. Most of these have deployed deep auto-encoders. For example, Ballé et al. [35] proposed a general framework for optimizing image compression in an end-to-end manner; unlike other methods, the bitrate is estimated and accounted for during training. In [36], this model was improved by incorporating a scale hyperprior into the compression framework: an additional network estimates the standard deviations of the quantized coefficients, further improving coding efficiency. Later, Minnen et al. [37] exploited a PixelCNN layer, which they combined with an autoregressive hyperprior. Beyond these early efforts, other recent approaches have adopted more complex network architectures, such as recurrent neural networks (RNNs) [38, 39, 40] and generative adversarial networks (GANs) [41, 42]. Some work has also been done to extend these ideas to the deep video compression problem [43, 44].

Unsurprisingly, the idea of optimizing a conventional codec such as H.264/AVC against perceptual models like SSIM, VIF, and VMAF has been deeply studied [45, 46, 47] and implemented in widespread practice [25]. We seek to extend this concept, in a similar manner, to learn an end-to-end perceptually optimized compression model.

III Proposed Perceptual Optimization Framework

Our approach to training an image compression model in a perceptually optimized way is depicted in Fig. 3. This framework involves optimizing two networks: an image compression network N_C, and a sub-network N_P that serves as a proxy of an IQA model, which we refer to as ProxIQA. A source image x is input to the compression network, which produces a reconstructed image:

(3)   x̂ = N_C(x; θ).

Separately, the ProxIQA network maps the image pair (x, x̂) into a proxy of an image quality model M:

(4)   N_P(x, x̂; φ) ≈ M(x, x̂).

In each training iteration, the two networks are alternately updated as follows:

III-1 Deep Compression Model Updating

To integrate N_P into the update of N_C given a mini-batch X, the model parameters φ of N_P are held fixed during this step. In order to minimize distortion, the output of N_P becomes part of the objective used in the optimization of N_C:

(5)   θ̂ = argmin_θ L(θ),

where L is the total loss defined in Section III-B, with φ held constant. By back-propagating through the forward model, the loss derivative is used to drive the update of θ.

III-2 ProxIQA Network Updating

Given a mini-batch pair X and X̂ collected from the most recent update of the compression network, the true quality scores M(X, X̂) are calculated. The ProxIQA network is then updated to optimally fit these scores given the input pair (X, X̂). Note that the compression network is not needed in this part of the training.

As may be seen, the auxiliary sub-network ProxIQA is incorporated into the training of the compression network. However, it is important to understand that the ProxIQA network is not present during the testing (image compression/decompression) phase.
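The alternating schedule can be summarized in a single training-step sketch. All of the argument names below (compression_net, proxiqa_net, iqa_model, total_loss, metric_loss, and the two optimizers) are hypothetical stand-ins for the components described in this section and in Section III-B, not released code.

```python
import tensorflow as tf

def train_step(x, compression_net, proxiqa_net, iqa_model,
               total_loss, metric_loss, opt_c, opt_p):
    """One alternating iteration; all arguments are assumed to be supplied
    by the caller (networks, target IQA model, loss functions, optimizers)."""
    # Step 1: update the compression network with ProxIQA held fixed (Sec. III-1).
    with tf.GradientTape() as tape:
        x_hat, rate_bits = compression_net(x)           # Eq. (3) plus a rate estimate
        proxy_score = proxiqa_net([x, x_hat])           # Eq. (4)
        loss_c = total_loss(x, x_hat, rate_bits, proxy_score)
    grads = tape.gradient(loss_c, compression_net.trainable_variables)
    opt_c.apply_gradients(zip(grads, compression_net.trainable_variables))

    # Step 2: refit ProxIQA on the patches just produced (Sec. III-2).
    x_hat = tf.stop_gradient(x_hat)                     # reconstruction treated as data
    target = iqa_model(x, x_hat)                        # true metric score, a constant
    with tf.GradientTape() as tape:
        loss_p = metric_loss(proxiqa_net([x, x_hat]), target)
    grads = tape.gradient(loss_p, proxiqa_net.trainable_variables)
    opt_p.apply_gradients(zip(grads, proxiqa_net.trainable_variables))
    return loss_c, loss_p
```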

(a) Reconstructed patches from deep compression model.
(b) Waterloo Exploration database.
(c) BAPPS database.
Fig. 5: Comparison of distorted patches from existing databases and three deep compression networks. For (a) and (b), the first column shows the pristine patches while the other three columns are the corresponding distorted patches. In (c), each similar pair includes a reference patch (left) and a distorted patch (right). The distortions were generated from: (a) patches reconstructed during the training of the deep compression network; (b) synthetic distortions at different severity levels applied to the original patches; (c) the outputs of various convolutional neural networks.

III-A Network Architecture

III-A1 ProxIQA Network

The goal is to learn a nonlinear regressor via a CNN. The network takes a reference patch and a distorted patch of equal size as input. They are concatenated into a 6-channel signal, which is fed into the network and reduced to a predicted quality score. As depicted in Fig. 4, the ProxIQA network may be as simple as a shallow CNN consisting of three stages of convolution, ReLU activation, and max pooling. The spatial size is reduced after each stage via the max pooling layers. Finally, the feature maps are flattened and fed to a fully connected layer, which yields the output. The parameterization of each layer is detailed in the figure.
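A minimal Keras sketch of such a regressor is shown below. The kernel sizes, filter counts, pooling factor, and default patch size are placeholders chosen for illustration; the actual parameterization is the one given in Fig. 4.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_proxiqa(patch_size=64, filters=(32, 64, 128)):
    """Shallow CNN regressor in the style of Fig. 4: concatenated 6-channel
    input, three conv/ReLU/max-pool stages, then a fully connected output."""
    ref = layers.Input((patch_size, patch_size, 3), name="reference")
    dst = layers.Input((patch_size, patch_size, 3), name="distorted")
    x = layers.Concatenate(axis=-1)([ref, dst])          # 6-channel input signal
    for f in filters:                                    # three stages
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        x = layers.MaxPool2D(2)(x)                       # assumed 2x2 pooling
    x = layers.Flatten()(x)
    score = layers.Dense(1, name="proxy_score")(x)       # predicted quality score
    return tf.keras.Model([ref, dst], score)
```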

III-A2 Compression Network

We build on the deep image compression model of [35]. As shown in Fig. 3, the image compression network comprises an analysis transform at the encoder side and a synthesis transform at the decoder side. Both transforms are composed of three consecutive layers of convolution, down-(up-)sampling, and activation. Instead of utilizing ReLU, a generalized divisive normalization (GDN) transform is adopted as the activation function [48], similar to the normalization of visual signals carried out by the human visual system.
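For reference, the simplified GDN nonlinearity divides each channel by a learned measure of the pooled energy of all channels. Below is a small NumPy sketch of the forward computation, assuming the commonly used squared-input, square-root form of GDN (a simplification of the general form in [48]):

```python
import numpy as np

def gdn_forward(x, beta, gamma):
    """Simplified GDN: y_c = x_c / sqrt(beta_c + sum_j gamma_{c,j} * x_j^2).

    x:     activations of shape (H, W, C)
    beta:  positive offsets, shape (C,)
    gamma: non-negative channel-coupling weights, shape (C, C)
    """
    pooled = np.tensordot(x ** 2, gamma, axes=([-1], [1]))   # (H, W, C)
    return x / np.sqrt(beta + pooled)
```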

III-B Loss Functions

As illustrated in Fig. 3, let x, z, ẑ, and x̂ be the source batch, the latent representation, the quantized latent representation, and the reconstructed batch, respectively. The model parameters of the analysis and synthesis transforms are collectively denoted by θ, and the ProxIQA network has model parameters φ. Given a perceptual metric M, the goal is to optimize the full set of parameters θ and φ such that the learned image codec generates reconstructed images x̂ that attain high perceptual quality scores M(x, x̂), while the rate remains as small as possible. We therefore train our model using the following losses.

III-B1 Rate loss

The rate loss, representing the bit consumption of an encode, is defined by

(6)   L_rate = E[−log₂ p_ẑ(ẑ)],

where p_ẑ is the entropy model. This entropy term is minimized when the actual marginal distribution of ẑ and the entropy model p_ẑ are identical.

During training, the latent representation z is quantized to ẑ by adding i.i.d. uniform noise Δz ∼ U(−1/2, 1/2). Then, ẑ is used to estimate the rate via (6). Unlike the estimated entropy used when training the network, a variant of the context-adaptive binary arithmetic coder (CABAC) [49] is used to encode the discrete-valued data into the bitstream during testing.
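A sketch of the train-time quantization relaxation and the bit-count estimate of Eq. (6) follows; the entropy model that produces the likelihoods is assumed to exist elsewhere (e.g., the factorized model of [35]) and is not shown.

```python
import tensorflow as tf

def quantize(latents, training=True):
    """Additive U(-0.5, 0.5) noise during training, hard rounding at test
    time, following the relaxation used in [35]."""
    if training:
        return latents + tf.random.uniform(tf.shape(latents), -0.5, 0.5)
    return tf.round(latents)

def rate_loss(likelihoods):
    """Eq. (6)-style rate estimate: expected code length, in bits, of the
    quantized latents under the learned entropy model."""
    return tf.reduce_sum(-tf.math.log(likelihoods)) / tf.math.log(2.0)
```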

III-B2 Pixel loss

The pixel loss is the residual between the source image and the reconstructed image, as measured by a distance function d. Given x and x̂, the pixel loss is defined by

(7)   L_pixel = d(x, x̂),

where d can be the mean squared error (i.e., d(x, x̂) = ‖x − x̂‖²₂) or the mean absolute error (i.e., d(x, x̂) = ‖x − x̂‖₁).

The original work in [35] used the ℓ₂ distance as the pixel loss, thereby maximizing the PSNR of the reconstructed images. When combined with the rate loss, the image compression network is optimized by minimizing the objective function

(8)   L(θ) = λ·L_rate + L_pixel,

which has a form similar to the rate-distortion optimization (RDO) objectives used in conventional codecs. We make use of a pixel loss to encourage training stability.

(a) Source image: Kodim01.
(b) Adversarial example.
Fig. 6: An “adversarial” example produced by the compression network. The true VMAF score calculated between (a) the source image and (b) the decoded image is very low (indicating a very poor-quality image), while the ProxIQA network predicts a much higher quality score.
(a) Pre-trained model.
(b) Model learned from the proposed alternating training process.
Fig. 7: Comparison of true VMAF scores and proxy VMAF scores (quality scores predicted by ProxIQA) using two different optimization strategies during the training process. The two scores are plotted as mean values (lines) and standard deviations (shaded regions).

III-B3 Proxy loss

Instead of merely minimizing an ℓ_p norm between x and x̂, we introduce a novel loss term. The proxy loss is calculated from the output of the ProxIQA network, with its parameters φ held fixed:

(9)   L_proxy = M_max − N_P(x, x̂; φ),

where M_max denotes the upper bound of the quality model M, which is a constant with respect to the loss function.

Finally, the total loss function for optimizing the compression network is the weighted combination of the losses from Eqs. (7), (6), and (9):

(10)   L(θ) = λ·L_rate + L_proxy + η·L_pixel,

where λ balances bitrate against distortion of the encoded bitstream, and η weights the proxy loss against the pixel loss. The pixel loss plays a different role here, as a regularization term: since the ProxIQA network is updated at every step, the loss function itself changes during training, and the pixel loss serves to stabilize the training process.
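Putting Eqs. (6), (7), and (9) together, a sketch of the compression-network objective is given below. The weights lam and eta and the score upper bound are placeholder values for illustration (for VMAF, the upper bound is 100), and the weighting convention follows the reconstruction of Eq. (10) above.

```python
import tensorflow as tf

def total_loss(x, x_hat, rate_bits, proxy_score,
               lam=0.01, eta=0.1, score_upper_bound=100.0):
    """Weighted objective in the spirit of Eq. (10); lam and eta are
    illustrative placeholder weights."""
    l_pixel = tf.reduce_mean(tf.square(x - x_hat))              # Eq. (7), l2 variant
    l_proxy = score_upper_bound - tf.reduce_mean(proxy_score)   # Eq. (9)
    return lam * rate_bits + l_proxy + eta * l_pixel             # rate term from Eq. (6)
```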

III-B4 Metric loss

The ProxIQA network aims to mimic an image quality model M. Given two images x and x̂, we define a metric loss to attain this objective while updating the ProxIQA network:

(11)   L_metric = ‖N_P(x, x̂; φ) − M(x, x̂)‖²₂.

Note that M(x, x̂) is a constant with respect to φ, since it is computed on the reconstructed patches generated during the most recent update of the compression network.
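The corresponding fitting objective for the ProxIQA update can be sketched as follows; the true score is computed outside the computation graph (e.g., by running the VMAF tool on the reconstructed patches) and enters only as a constant target.

```python
import tensorflow as tf

def metric_loss(proxy_pred, true_score):
    """Eq. (11)-style target: squared error between the ProxIQA prediction
    and the true IQA score M(x, x_hat), constant during this update."""
    return tf.reduce_mean(tf.square(proxy_pred - tf.stop_gradient(true_score)))
```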

III-C What’s Wrong with Using a Pre-trained Network?

Another way of attempting to accomplish the same goal is to use a pre-trained network as the loss layer. That is, a proxy network is first learned to predict a metric score given a pristine patch and a distorted patch from an existing dataset. Next, the trained proxy network is inserted into the loss layer of the deep compression network with the goal of maximizing the proxy score. Unfortunately, severe complications can arise when applying this simple methodology.

Optimization ProxIQA-SSIM ProxIQA-MS-SSIM ProxIQA-VMAF
BD-rate Metric PSNR SSIM MSIM VMAF PSNR SSIM MSIM VMAF PSNR SSIM MSIM VMAF
Kodim01 14.4 -17.4 -16.3 -2.1 12.6 -1.4 -3.9 -2.4 4.9 -2.9 -4.3 -25.3
Kodim02 11.0 -21.6 -17.9 15.4 6.5 -13.1 -21.5 3.0 8.6 -2.2 -2.6 -26.7
Kodim03 13.5 -19.5 -16.6 5.8 4.3 -14.6 -24.1 -10.6 7.8 -5.2 -7.2 -33.2
Kodim04 15.6 -23.5 -21.4 9.3 10.0 -15.6 -26.6 0.0 6.3 -6.4 -7.4 -26.9
Kodim05 13.8 -16.8 -14.8 1.4 14.0 -3.9 -11.2 -2.1 3.1 -6.1 -6.3 -18.1
Kodim06 17.0 -22.0 -20.0 0.5 12.2 -10.1 -19.2 -2.3 4.5 -6.0 -7.4 -23.6
Kodim07 11.6 -14.5 -10.9 8.3 7.1 -9.9 -21.1 -0.6 7.0 -10.0 -9.7 -24.7
Kodim08 13.0 -14.9 -15.1 -5.4 16.4 -0.6 -6.6 -7.3 2.5 -3.9 -5.2 -20.3
Kodim09 16.1 -17.6 -14.5 12.3 7.6 -12.9 -23.8 -1.2 6.4 -3.1 -4.4 -24.9
Kodim10 15.4 -25.3 -22.4 12.2 10.4 -17.3 -32.8 3.3 5.0 -9.6 -10.7 -23.2
Kodim11 16.6 -26.6 -25.3 12.8 14.1 -16.1 -25.9 6.9 5.1 -7.1 -8.5 -20.0
Kodim12 10.4 -30.8 -28.8 12.8 2.5 -19.7 -31.3 8.3 6.6 -3.4 -5.9 -22.4
Kodim13 19.9 -25.3 -24.5 -7.1 16.8 -9.1 -16.6 -7.3 1.2 -8.1 -9.2 -19.6
Kodim14 17.5 -22.6 -20.3 8.2 13.0 -10.5 -18.3 0.0 4.0 -8.5 -9.2 -20.1
Kodim15 17.4 -27.1 -26.0 6.0 9.8 -15.6 -30.7 0.1 6.7 -5.7 -7.6 -31.8
Kodim16 14.4 -20.4 -15.9 2.5 6.3 -13.9 -21.3 -3.2 6.5 -4.6 -6.1 -24.9
Kodim17 18.0 -20.7 -19.3 7.3 10.5 -15.0 -29.7 0.1 5.3 -9.6 -10.9 -23.6
Kodim18 19.7 -20.7 -18.3 12.1 17.9 -9.2 -17.9 2.4 2.7 -9.1 -9.8 -22.0
Kodim19 17.6 -18.2 -16.0 13.9 17.7 -8.7 -21.8 6.8 8.7 -4.7 -6.1 -17.8
Kodim20 19.7 -21.3 -22.4 15.3 12.8 -10.3 -25.3 0.3 5.1 -5.4 -8.2 -24.2
Kodim21 18.1 -16.6 -15.0 1.9 16.4 -7.5 -16.9 -4.1 4.1 -5.3 -6.8 -21.6
Kodim22 15.0 -23.0 -21.2 11.8 10.3 -15.5 -24.3 3.5 3.4 -10.9 -12.4 -25.1
Kodim23 14.7 -21.6 -17.5 15.5 10.5 -16.6 -29.6 4.1 7.8 -7.5 -8.7 -16.7
Kodim24 21.3 -23.4 -21.7 2.0 20.6 -11.3 -21.9 -1.7 2.2 -11.4 -12.2 -24.0
TABLE I: BD-rate change (in percent) of the optimization results of deep image compression models on the Kodak dataset, for four different IQA models. The corresponding baseline is the MSE-optimized BLS model [35]. Smaller or negative values mean better coding efficiency.

III-C1 New Distortion Types

The success of a CNN model depends strongly on the size of the training set. This is often an obstacle to learning DNN-based IQA models, owing to the limited size of publicly available IQA databases as compared to image recognition databases. Fortunately, training a proxy network on an existing quality model does not require human-labeled subjective quality scores such as mean opinion scores (MOS) or differential mean opinion scores (DMOS): the ground-truth metric score for training the proxy network is easily obtained from a pristine patch and a distorted patch. Therefore, we can make use of large-scale databases that do not include MOS, such as the Waterloo Exploration database [50].

Nevertheless, the distortion types provided by public-domain databases are generally quite different from the distortions created by a deep compression model. As shown in Fig. 5, most existing databases only provide synthetic distortions, such as JPEG, JPEG2000, Gaussian blur, and white Gaussian noise, applied at discrete severity levels. These distortions are drastically different from the distortions created by CNNs.

The Berkeley-Adobe Perceptual Patch Similarity (BAPPS) database [27] contains many distorted patches collected from the outputs of CNN models. However, these CNN-based distortion types still differ from the patches reconstructed by deep compression networks. In Fig. 5(a), several reconstructed patches output by the deep compression model are shown. We have observed that dissimilar distortions can be generated by using different training steps or parameters. Comparing these distortions suggests that learning a proxy network from previously existing databases might not be the optimal solution to our problem. By contrast, the proposed alternating training resolves this problem immediately: the patches reconstructed by the compression network are directly used to learn the ProxIQA network.

III-C2 Adversarial Examples

We also discovered that the deep compression network often generates “adversarial” examples when its loss layer is the output of a pre-trained network having fixed parameters. Fig. 6 shows such an “adversarial” example generated by the deep compression network using a proxy network as its loss function. In this example, the proxy network was well trained to mimic the VMAF algorithm. However, comparing Fig. 6(a) with Fig. 6(b), it is apparent that the true VMAF score and the proxy VMAF score predicted by the ProxIQA network are very different. This can be understood by viewing the training of the network as an interpolation problem, whereby the neural network maps a test image to an accurate quality score. When the input is too different from the training set, however, the ProxIQA network may produce a poor interpolation result.

To further illustrate, Fig. 7 compares true VMAF scores with proxy VMAF scores. All of the scores were calculated on the reconstructed patches produced during training. Figure 7(a) shows that the proxy VMAF scores quickly saturated at high values, whereas much lower true VMAF scores were assigned to the patches produced by the compression model. This problem becomes significant when the previously discussed training strategy is applied. However, a straightforward way of improving the training stage is to simply feed the adversarial examples, along with their objective quality scores, into the ProxIQA network as additional training data. The ProxIQA network is then updated, which enables it to predict proxy quality much more accurately. As shown in Fig. 7(b), the true and proxy scores become highly coincident early in the training process.

Image Dataset Kodak Tecnick NFLX
BD-rate Metric PSNR SSIM MSIM VIF VMAF PSNR SSIM MSIM VIF VMAF PSNR SSIM MSIM VIF VMAF
JPEG 113.99 129.49 149.86 95.62 78.36 119.33 218.04 171.59 94.59 89.73 102.28 143.99 168.20 79.34 89.95
15.86 21.98 25.04 15.20 20.33 31.65 34.11 38.88 32.73 24.01 28.59 34.94 37.19 25.88 23.53
JPEG2000 -11.51 6.25 -1.02 -1.76 -33.39 -13.06 -1.55 -8.41 -8.91 -34.25 -27.81 1.43 -3.93 -14.55 -38.98
10.78 14.68 13.71 13.19 8.77 22.94 25.92 22.89 22.32 18.47 20.25 26.84 25.77 23.34 17.03
HEVC -26.35 -6.32 -6.12 -13.68 -28.23 -28.32 -8.97 -11.07 -14.87 -27.65 -49.43 -17.12 -16.06 -29.93 -35.03
11.06 18.65 19.05 14.14 12.15 22.48 29.84 27.56 25.07 24.05 16.07 25.63 26.59 21.22 21.52
HEVC -27.33 -25.98 -24.97 -32.73 -42.18 -19.48 -28.97 -33.95 -40.15 -46.67 -37.63 -35.41 -33.88 -47.51 -50.91
8.56 15.05 15.53 11.77 10.32 19.25 22.74 19.93 18.05 18.37 18.13 19.01 20.08 15.66 16.34
BLS ProxIQA-SSIM 15.89 -21.31 -19.25 -4.71 7.19 8.67 -10.79 -16.11 -5.72 8.68 16.79 -19.01 -17.73 -6.61 9.75
2.83 3.93 4.22 3.44 6.52 6.59 7.60 6.13 4.98 6.66 8.99 6.82 6.65 6.83 6.72
BLS ProxIQA-MS-SSIM 11.67 -11.58 -21.77 -3.69 -0.17 4.47 -17.40 -23.50 -5.00 0.19 12.28 -11.59 -23.53 -5.96 4.34
4.50 4.77 7.17 3.51 4.49 8.65 8.55 9.89 5.16 4.78 10.40 9.65 10.85 6.92 6.44
BLS ProxIQA-VMAF 5.23 -6.53 -7.78 -1.90 -23.35 6.23 -8.45 -5.97 -2.24 -23.78 7.00 -4.35 -5.43 -1.96 -21.97
1.97 2.57 2.40 1.82 3.92 2.91 4.34 4.16 2.90 7.16 2.37 5.96 5.57 3.66 5.01
BMSHJ MSE [36] -21.46 -10.94 -10.17 -16.52 -25.78 -26.03 -20.22 -16.71 -22.75 -33.75 -36.64 -21.21 -21.08 -28.92 -38.01
6.02 6.36 5.66 6.22 8.25 10.88 13.83 10.68 11.74 12.70 10.18 11.32 12.18 12.26 12.62
BMSHJ ProxIQA-VMAF [36] -15.90 -13.57 -13.17 -17.02 -47.11 -19.64 -23.14 -16.73 -23.63 -53.18 -29.96 -18.87 -19.29 -29.00 -56.06
5.34 4.22 3.70 4.86 7.72 10.35 10.29 10.82 9.69 11.63 10.34 11.01 10.41 10.87 10.66
TABLE II: Overall comparison of different codecs and of the optimized deep image compression models: average change of BD-rate, with the standard deviation given on the second row of each pair, expressed as percentages, using three different IQA models to train ProxIQA. The baseline of comparison is the MSE-optimized BLS model [35]. Smaller or negative values indicate better coding efficiency.

III-D Implementation and Training Details

The TensorFlow framework (version 1.12) was used to implement the proposed method. We used the Adam solver [51] to optimize both the ProxIQA network and the deep compression network. The learning rates of both networks were held at fixed values for the first 2M steps, then reduced by a factor of 10 for an additional 100K iterations; each network was thus trained over 2.1M iterations of back-propagation. All of the models were trained using NVIDIA 1080-Ti GPU cards.

We used a subset of the processed images from the ImageNet database [52] as training data. As described in [35], small amounts of uniform noise were added to the images. The images were then down-sampled by random factors, to reduce compression artifacts and high-frequency noise, and randomly cropped into patches. In each mini-batch, we randomly sampled image patches from this subset. The source code is publicly available at https://github.com/treammm/Compression.
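A sketch of this pre-processing, written for a tf.data-style pipeline, is shown below; the patch size, noise amplitude, and down-scaling range are illustrative placeholders rather than the settings actually used.

```python
import tensorflow as tf

def preprocess(image, patch_size=256, min_scale=0.75):
    """Add a small amount of uniform noise, down-sample by a random factor,
    and take a random crop (all parameter values here are guesses)."""
    image = tf.cast(image, tf.float32) / 255.0
    image += tf.random.uniform(tf.shape(image), 0.0, 1.0 / 255.0)  # mild dequantization noise
    scale = tf.random.uniform([], min_scale, 1.0)                  # random down-sampling factor
    new_size = tf.cast(scale * tf.cast(tf.shape(image)[:2], tf.float32), tf.int32)
    image = tf.image.resize(image, new_size)
    return tf.image.random_crop(image, [patch_size, patch_size, 3])
```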

IV Experiments and Analysis

We compared the proposed perceptual optimization framework against the original MSE-optimized image compression model, as well as against state-of-the-art image codecs. In order to experimentally evaluate the trained models, we conducted and report the results of a quantitative evaluation, a subjective comparison, and a runtime analysis. We first describe the experimental setup, including the datasets on which the performance evaluation was conducted using standard evaluation criteria. We also performed a separate series of experiments to probe the limitations of our design. In all of the experiments, we denote the deep image compression model of [35] as the BLS model, which we use as the baseline for performance comparison. A model optimized against a given IQA model M using (10) and (11) is denoted by ProxIQA-M (e.g., ProxIQA-VMAF).

Codec Software Version
JPEG JPEG XT Release 1.53 (https://jpeg.org/downloads/jpegxt/reference1367abcd89.zip)
JPEG2000 Kakadu Version 7.10.2 (https://kakadusoftware.com/downloads/)
HEVC HM 16.9 (https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.9/)
Ballé [35] Tensorflow Compression Release 1.0 (https://github.com/tensorflow/compression/releases/tag/v1.0)
Ballé [36] Tensorflow Compression Release 1.2 (https://github.com/tensorflow/compression/releases/tag/v1.2)
TABLE III: Versions of the software image codecs used in the experiments

IV-A Experimental Setup

IV-A1 Evaluation Datasets

To evaluate the various image codecs, we utilized the well-known Kodak dataset of very high quality uncompressed images. This publicly available image set is commonly used to evaluate image compression algorithms and IQA models. We also used a subset of the Tecnick dataset [53], which contains larger-resolution images, as well as billboard images collected from the Netflix library, yielding test content with more diverse resolutions and characteristics. None of the test images were included in the training sets, to avoid bias and overfitting.

Image Metrics Baseline [35] JPEG2000 ProxIQA-VMAF (ours) HEVC
Kodim03 bpp / VMAF 0.250 / 72.56 0.250 / 83.63 0.307 / 84.26 0.233 / 82.10
PSNR / SSIM 33.02 / 0.951 33.39 / 0.953 33.72 / 0.963 34.89 / 0.962
Kodim07 bpp / VMAF 0.579 / 85.17 0.498 / 88.01 0.522 / 88.30 0.468 / 87.30
PSNR / SSIM 35.71 / 0.986 34.74 / 0.979 34.74 / 0.985 35.77 / 0.983
Kodim10 bpp / VMAF 0.052 / 27.82 0.050 / 36.39 0.058 / 37.06 0.054 / 39.67
PSNR / SSIM 26.12 / 0.817 26.53 / 0.806 26.32 / 0.840 27.78 / 0.843
Kodim17 bpp / VMAF 0.052 / 19.03 0.050 / 32.27 0.058 / 30.30 0.061 / 35.82
PSNR / SSIM 26.20 / 0.808 26.03 / 0.795 26.43 / 0.828 27.13 / 0.821
Kodim19 bpp / VMAF 1.361 / 93.25 1.229 / 94.52 0.892 / 94.20 1.139 / 93.52
PSNR / SSIM 36.91 / 0.991 37.52 / 0.986 34.29 / 0.982 37.41 / 0.986
Fig. 8: Visual comparison of decoded images produced by different codecs, together with the corresponding VMAF RD-curves. Image crops from left to right: ground truth, the baseline model [35], JPEG2000, the model of Ballé et al. optimized with ProxIQA-VMAF, and HEVC. Generally, the ProxIQA-VMAF-optimized BLS model achieved visual quality comparable to intra-coded HEVC and JPEG2000. The source images shown in the second column were cropped for display purposes.

IV-A2 Evaluation Criteria

We measured the objective coding efficiency of each image codec using the Bjontegaard-Delta bitrate (BD-rate) [54], which quantifies the difference in bitrate at a fixed distortion level relative to a reference encoder. To calculate BD-rate, we encoded the images at eight different bitrates (in bits per pixel, bpp). The performance of every codec was compared against the same baseline, the MSE-optimized BLS model; a negative BD-rate means that the bitrate was reduced relative to the baseline. To fairly compare deep compression models having different loss layers, we used the same number of filters at every layer and trained all of the models for the same number of steps. To ensure reproducibility, we report the version of each software codec in Table III. The input image formats were YUV444 for JPEG and JPEG2000, and both YUV420 and YUV444 for intra-coded HEVC. Lastly, the IQA algorithms used to evaluate the codecs were computed using FFmpeg 4.0 with libavfilter (for PSNR) and libvmaf 0.6.1 (https://github.com/Netflix/vmaf) (for SSIM, MS-SSIM, VIF, and VMAF). Specifically, the PSNR calculation in libavfilter is defined by

(12)

which is commonly used for combining per-channel PSNRs.
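For reference, BD-rate is commonly computed by fitting low-order polynomials of log-rate as a function of quality and integrating the gap over the overlapping quality range. The sketch below follows that standard recipe and is not the exact implementation used to produce the tables.

```python
import numpy as np

def bd_rate(rates_anchor, quality_anchor, rates_test, quality_test):
    """Approximate Bjontegaard-Delta bitrate [54], returned in percent;
    negative values mean the test codec needs fewer bits than the anchor."""
    log_ra, log_rt = np.log(rates_anchor), np.log(rates_test)
    p_a = np.polyfit(quality_anchor, log_ra, 3)   # cubic fit of log-rate vs. quality
    p_t = np.polyfit(quality_test, log_rt, 3)
    lo = max(np.min(quality_anchor), np.min(quality_test))
    hi = min(np.max(quality_anchor), np.max(quality_test))
    int_a = np.polyval(np.polyint(p_a), [lo, hi])
    int_t = np.polyval(np.polyint(p_t), [lo, hi])
    avg_log_diff = ((int_t[1] - int_t[0]) - (int_a[1] - int_a[0])) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```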

IV-B Experimental Results on the Kodak Dataset

The results on the images of the Kodak dataset are given in Table I. The distortion levels used for the BD-rate calculations were quantified using PSNR, SSIM, MS-SSIM (denoted MSIM in the tables), and VMAF. It may be observed that RD performance measured using PSNR became worse for models optimized under the other IQA models. This is not surprising, since MSE, which is used by the baseline model, is the optimal loss function for PSNR.

It may also be noted that, unlike the other IQA models used as targets of the proposed optimization, ProxIQA-VMAF optimization delivers coding gains with respect to all of the BD-rate measurements except the PSNR BD-rate.

Optimization SSIM, Eq. (14) ProxIQA-SSIM, Eqs. (10)(11) SSIM + pixel loss, Eq. (13)
PSNR BD-Rate 133.79 15.56 15.43
SSIM BD-Rate -29.41 -19.04 -22.39
MSIM BD-Rate -11.74 -17.29 -19.69
VMAF BD-Rate 40.09 7.63 12.03
TABLE IV: Average change in BD-rate (in percent) for different SSIM-driven optimizations on the Kodak dataset.

IV-C Comparison with State-of-the-art Codecs

Table II tabulates the percentage change in BD-rate relative to the BLS baseline, with respect to different quality models. We comprehensively evaluated perceptual deep compression under different perceptual optimization protocols against three conventional image codecs: JPEG, JPEG2000, and intra-coded HEVC using the main-RExt (Format Range Extension) profile. Extensive experiments were carried out on the three aforementioned datasets, using three perceptual IQA models as optimization targets. In addition to the BLS model, we also deployed the proposed ProxIQA optimization framework on a more sophisticated deep compression model [36] to test its generality. We report the BD-rate changes obtained, averaged over all the images in each dataset. Results on the Tecnick and NFLX datasets were similar to those on Kodak, as shown in Tables I and II. These results show that our optimization approach is able to successfully optimize a deep image compression model against different IQA algorithms; indeed, significant BD-rate reductions were obtained in many cases.

In addition to the quantitative results, we also compared the decoded images visually. Fig. 8 plots VMAF rate-distortion (RD) curves for several images, along with crops of the images corresponding to RD points obtained by the various codecs. As a basic test, we subjectively compared results yielding similar bitrates but different objective quality scores. The images Kodim10 and Kodim17 were subjected to extreme compression at bitrates around 0.05 bpp. In these cases, the ProxIQA-VMAF-optimized model significantly outperformed the MSE-optimized baseline model, delivering performance comparable to HEVC and JPEG2000 with respect to both VMAF score and subjective quality. At high bitrates, the distinctions between the codecs become subtle. Therefore, we isolated RD points associated with similar VMAF scores and compared bitrate consumption. Generally, the ProxIQA-VMAF-optimized model yielded subjective and objective (VMAF) quality comparable to the MSE-optimized baseline model, while consuming significantly fewer bits. The encoding results on Kodim19 in Fig. 8 using the ProxIQA-VMAF-optimized model yielded VMAF scores similar to the other codecs, while consuming fewer bits than the Baseline, JPEG2000, and HEVC.

(a) Kodim01.
(b) Kodim21.
Fig. 9: Rate-distortion curves for different SSIM-oriented optimization protocols on two Kodak images. The baseline curve denotes the MSE-optimized BLS model [35].
Source SSIM, Eq. (14) ProxIQA-SSIM, Eqs. (10)(11) SSIM + pixel loss, Eq. (13)
bpp / SSIM 0.124 / 0.819 0.145 / 0.820 0.144 / 0.827
bpp / SSIM 0.504 / 0.974 0.563 / 0.974 0.628 / 0.979
Fig. 10: Visual comparison of model behavior among different SSIM-optimized models. First row: crops from Kodim01. Second row: crops from Kodim21. The corresponding bpp/SSIM values for each model are listed above.

IV-D Limitations

When RD performance is measured using SSIM, directly optimizing a model for SSIM should approach the upper bound of SSIM-measured RD performance. Accordingly, we investigated the performance of our proposed framework by comparing SSIM optimization and ProxIQA-SSIM optimization of the BLS model. The results of this comparison, presented in Table IV, indicate an SSIM BD-rate gap between the two optimization approaches. A likely contributor to this performance drop is the pixel loss in (10). To validate this assumption, we conducted an ablation study to pinpoint the cause of the gap, adding the pixel-loss regularization term, with its weight η held fixed, to the SSIM optimization. The loss function is then

(13)   L(θ) = λ·L_rate + L_SSIM + η·L_pixel,

where

(14)   L_SSIM = 1 − SSIM(x, x̂).

The SSIM BD-rate that resulted from optimizing (13) is given in the fourth column of Table IV. It may be observed that the RD performance becomes very close to that of ProxIQA-SSIM optimization, which confirms that the pixel loss is the main contributor to the performance gap.

Moreover, Fig. 10 shows that similar visual results are obtained using ProxIQA-SSIM optimization and the optimization described in (13), even at heavy compression levels. A close examination shows that true SSIM optimization nicely preserves high-frequency details but loses chromatic fidelity. The RD curves in Fig. 9 further confirm the similar behavior of ProxIQA-SSIM optimization and the optimization of (13).

All of the models described in this subsection were trained using one million steps and a constant learning rate. Thus, the performance results of ProxIQA-SSIM differ slightly from the results reported in Tables I and II.

IV-E Study of Training Steps

The instability introduced by the proxy loss can be further reduced by training longer and by lowering the learning rate. Fig. 11 plots the VMAF BD-rate as a function of the number of training steps. When BD-rate is measured against the same baseline (MSE optimization trained with 1M steps), ProxIQA-VMAF achieves significant improvement relative to MSE optimization by training longer or by lowering the learning rate. For a fair comparison, we also evaluated ProxIQA-VMAF against MSE optimization trained with the same training process as the proxy-optimized models (dotted line). We observed very similar results using the other perceptual optimizers.

Fig. 11: VMAF BD-rate change (improvement) versus the number of training steps and learning rate, on the Kodak images. The error bars represent the standard deviations of the BD-rates. In the first 2M steps, a constant learning rate was used; after that, the learning rate was reduced by a factor of 10. For the BD-rate calculations, the solid lines use MSE optimization trained over 1M steps as the baseline, while the dotted line uses MSE optimization trained with the same procedure as the ProxIQA-VMAF optimization as the baseline.
Codec Device Encode Decode
JPEG CPU 43.02 62.88
JPEG2000 CPU 10.80 36.79
HEVC CPU 4578.57 89.88
BLS MSE CPU 251.01 117.93
GPU 231.62 32.56
BLS ProxIQA-VMAF (ours) CPU 246.57 119.02
GPU 229.26 29.22
TABLE V: Run time comparison of conventional image codecs and deep compression models. Model loading time for deep compression is excluded. All times are given in milliseconds.
Fig. 12: Normalized encoding times for five different bitrate settings.

IV-F Execution Time

The encoding and decoding times of the various compared codecs are summarized in Table V. We compiled the source code of the state-of-the-art (standard) codecs so that all codecs could be compared on the same machine. The results were calculated by averaging the runtime over all Kodak images under different bitrate settings. From Table V, it may be observed that the time complexities of the MSE-optimized and ProxIQA-VMAF-optimized BLS models are nearly identical, as they deploy the same network architecture at test time. Of course, the runtime of the deep compression models can be reduced if they are implemented on a GPU. It should also be noted that the decoding time of HEVC was estimated using the reference software HM, which is slow; this could be improved by using a third-party decoder such as FFmpeg.

We also compared the encoding times under different bitrate settings in Fig. 12. For each encoder, the runtimes were normalized by dividing by the runtime at a fixed reference bitrate. The conventional codecs required more time to encode at high bitrates, whereas the encoding times of the deep compression models vary little with bitrate.

V Concluding Remarks

We have presented a learning framework for perceptually optimizing a learned image compression model, in which the compression network and a ProxIQA network are trained by an alternating optimization method. We experimentally demonstrated that, for a fixed VMAF quality level, the proposed proxy approach achieves substantial average bitrate reductions relative to the MSE-optimized framework.

The idea behind the proposed optimization framework is general. We believe that, with proper modifications of the ProxIQA network architecture, it should be applicable to a wide variety of image enhancement, restoration, and reconstruction problems, such as super-resolution and denoising.

Another future topic could be the study of new types of distortions caused by deep compression models. Like the examples shown in this paper, distorted images created by CNNs are very different from images afflicted by more traditional distortions, such as JPEG compression. Creating databases for assessing the subjective quality of these new types of distortions would be quite valuable.

Acknowledgment

The authors thank Johannes Ballé for providing the training images. The authors also thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu

References

  • [1] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. and Mach. Intell., vol. 39, no. 4, pp. 640–651, Apr. 2017.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2016, pp. 770–778.
  • [3] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vision, Dec. 2015, pp. 2758–2766.
  • [4] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2012, pp. 2392–2399.
  • [5] V. Jain and S. Seung, “Natural image denoising with convolutional networks,” in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 769–776.
  • [6] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. Eur. Conf. Comput. Vision, 2014, pp. 184–199.
  • [7] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2017, pp. 624–632.
  • [8] Y.-L. Liu, Y.-T. Liao, Y.-Y. Lin, and Y.-Y. Chuang, “Deep video frame interpolation using cyclic frame generation,” in Proc. AAAI, 2019, pp. 8794–8802.
  • [9] J. T. Barron, “A general and adaptive robust loss function,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2019, pp. 4331–4339.
  • [10] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.
  • [11] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multi-scale structural similarity for image quality assessment,” in Proc. IEEE Asilomar Conf. on Signals, Syst., and Comput., Nov. 2003, pp. 1398–1402.
  • [12] J. Snell, K. Ridgeway, R. Liao, B. D. Roads, M. C. Mozer, and R. S. Zemel, “Learning to generate images with perceptual similarity metrics,” in Proc. IEEE Int. Conf. Image Process., Sep. 2017, pp. 4277–4281.
  • [13] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Trans. on Computational Imaging, vol. 3, no. 1, pp. 47–57, Mar. 2017.
  • [14] D. Chandler and S. Hemami, “VSNR: A wavelet-based visual signal-to-noise ratio for natural images,” IEEE Trans. Image Processing, vol. 16, no. 9, pp. 2284–2298, Sep. 2007.
  • [15] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Trans. Image Processing, vol. 15, no. 2, pp. 430–444, Feb. 2006.
  • [16] E. C. Larson and D. M. Chandler, “Most apparent distortion: full-reference image quality assessment and the role of strategy,” Journal of Electronic Imaging, vol. 19, no. 1, pp. 011 006:1–011 006:21, Jan. 2010.
  • [17] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Trans. Image Processing, vol. 20, no. 8, pp. 2378–2386, Aug. 2011.
  • [18] L. Zhang, Y. Shen, and H. Li, “VSI: A visual saliency-induced index for perceptual image quality assessment,” IEEE Trans. Image Processing, vol. 23, no. 10, pp. 4270–4281, Oct. 2014.
  • [19] T.-J. Liu, W. Lin, and C.-C. J. Kuo, “Image quality assessment using multi-method fusion,” IEEE Trans. Image Processing, vol. 22, no. 5, pp. 1793–1807, May 2013.
  • [20] S.-C. Pei and L.-H. Chen, “Image quality assessment using human visual DOG model fused with random forest,” IEEE Trans. Image Processing, vol. 24, no. 11, pp. 3282–3292, Nov. 2015.
  • [21] V. V. Lukin, N. N. Ponomarenko, O. I. Ieremeiev, K. O. Egiazarian, and J. Astola, “Combining full-reference image visual quality metrics by neural network,” Proc. SPIE, vol. 9394, p. 93940K, Mar. 2015.
  • [22] M. Oszust, “Decision fusion for image quality assessment using an optimization approach,” IEEE Signal Process. Lett., vol. 23, no. 1, pp. 65–69, Jan. 2016.
  • [23] F. Gao, Y. Wang, P. Li, M. Tan, J. Yu, and Y. Zhu, “DeepSim: Deep similarity for image quality assessment,” Neurocomputing, vol. 257, pp. 104–114, Sep. 2017.
  • [24] S. Bosse, D. Maniry, K.-R. Muller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Trans. Image Processing, vol. 27, no. 1, pp. 206–219, Jan. 2018.
  • [25] Z. Li, C. Bampis, J. Novak, A. Aaron, K. Swanson, A. Moorthy, and J. D. Cock, “VMAF: The journey continues,” The NETFLIX tech blog, 2018. [Online]. Available: https://medium.com/netflix-techblog/vmaf-the-journey-continues-44b51ee9ed12
  • [26] D. Brunet, E. R. Vrscay, and Z. Wang, “On the mathematical properties of the structural similarity index,” IEEE Trans. Image Processing, vol. 21, no. 4, pp. 1488–1499, Apr. 2012.
  • [27] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2018, pp. 586–595.
  • [28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–14.
  • [29] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vision, vol. 9906, 2016, pp. 694–711.
  • [30] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman, “Controlling perceptual factors in neural style transfer,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jul. 2017, pp. 3730–3738.
  • [31] J. Bruna, P. Sprechmann, and Y. LeCun, “Super-resolution with deep convolutional sufficient statistics,” in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–17.
  • [32] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., 2017, pp. 105–114.
  • [33] M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch, “EnhanceNet: Single image super-resolution through automated texture synthesis,” in Proc. IEEE Int. Conf. Comput. Vision, Oct. 2017.
  • [34] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, “High-resolution image inpainting using multi-scale neural patch synthesis,” in Proc. IEEE Conf. Comput. Vision Pattern Recog.   IEEE, Jul. 2017, pp. 4076–4084.
  • [35] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–27.
  • [36] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–23.
  • [37] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems 31, 2018, pp. 10 771–10 780.
  • [38] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” CoRR, vol. abs/1511.06085, 2015.
  • [39] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jul. 2017, pp. 5306–5314.
  • [40] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., June 2018, pp. 4385–4393.
  • [41] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” arXiv preprint arXiv:1804.02958, 2018.
  • [42] J. Löhdefink, A. Bär, N. M. Schmidt, F. Hüger, P. Schlicht, and T. Fingscheidt, “GAN- vs. JPEG2000 image compression for distributed automotive perception: Higher peak snr does not mean better semantic segmentation,” ArXiv, vol. abs/1902.04311, 2019.
  • [43] C.-Y. Wu, N. Singhal, and P. Krähenbühl, “Video compression through image interpolation,” in Proc. Eur. Conf. Comput. Vision, 2018.
  • [44] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learning image and video compression through spatial-temporal energy compaction,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., 2019.
  • [45] S. Channappayya, A. Bovik, and R. Heath, “Rate bounds on SSIM index of quantized images,” IEEE Trans. Image Processing, vol. 17, no. 9, pp. 1624–1639, Sep. 2008.
  • [46] Y.-H. Huang, T.-S. Ou, P.-Y. Su, and H. H. Chen, “Perceptual rate-distortion optimization using structural similarity index as quality metric,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 11, pp. 1614–1624, Nov. 2010.
  • [47] S. Wang, A. Rehman, Z. Wang, S. Ma, and W. Gao, “SSIM-motivated rate-distortion optimization for video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 4, pp. 516–529, 2012.
  • [48] J. Ballé, “Efficient nonlinear transforms for lossy image compression,” in Proc. IEEE Picture Coding Symp., 2018, pp. 248–252.
  • [49] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the h.264/AVC video compression standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 620–636, Jul. 2003.
  • [50] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo Exploration Database: New challenges for image quality assessment models,” IEEE Trans. Image Processing, vol. 26, no. 2, pp. 1004–1016, Feb. 2017.
  • [51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–15.
  • [52] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2009, pp. 248–255.
  • [53] N. Asuni and A. Giachetti, “TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms,” in Proc. Eurographics Italian Chapter Conference, 2014, pp. 63–70.
  • [54] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” document VCEG-M33, ITU-T Video Coding Experts Group (VCEG) Thirteenth Meeting, Austin, TX, April 2001.