I Introduction
Recently, deep neural networks have been successfully and ubiquitously applied to diverse image processing and computer vision tasks, such as semantic segmentation
[1], object recognition [2], and optical flow [3]. Many classic image transformation problems can be approached using a deep generative network, which learns to reconstruct high-quality output images from degraded input image(s). Explicitly, the generative network is trained in a supervised manner with a loss function, which is used to measure the fidelity between the output and a ground-truth image. For instance, the denoising task aims to reconstruct a noise-free image from a noisy image, and Convolutional Neural Networks (CNNs) have been shown to provide good noisy-to-pristine mapping functions
[4, 5]. Similar tasks where retaining image fidelity is important include deep image compression, super-resolution
[6, 7], frame interpolation
[8], and so on.

Although a significant amount of research has been devoted to deep learning image transformation problems, most of this work has focused on investigating network architectures or improving convergence speed. The selection of an appropriate loss function, however, has not been studied as much. The choice of the loss functions used to guide model training has been largely limited to the
$\ell_p$ norm family, in particular the MSE (squared $\ell_2$ norm), the $\ell_1$ norm, and variants of these [9]. The structural similarity quality index (SSIM) [10] and its multi-scale version (MS-SSIM) [11] have also been adopted as loss functions for several image reconstruction tasks [12, 13], owing to their perceptual relevance and good analytic properties, such as differentiability.
Perceptual image quality assessment has been a long-standing research problem. Although numerous powerful perceptual models have been proposed to predict the perceived quality of a distorted picture, most of these image quality indexes have not been adopted as deep network loss functions, because they are generally non-differentiable and functionally complex.
Towards bridging the gap between modern perceptual quality models and deep generative networks, we explore the potential of adapting more powerful and sophisticated perceptual image quality models as loss functions for training deep neural networks to address the aforementioned problems, by simulating the measurements made by a perceptual model with a proxy network, which we call ProxIQA. As shown in Fig. 1, the main idea is to optimize the parameters $\theta$ of the generative network $f_\theta$ using a ProxIQA network $p_w$ as a perceptual loss function:

$$\hat{\theta} = \arg\min_{\theta} \; \mathcal{L}_{p_w}\!\left(f_\theta(x),\, y\right), \quad (1)$$

where $\mathcal{L}_{p_w}$ denotes a loss derived from the ProxIQA network (made precise in Section III), $x$ and $\hat{x} = f_\theta(x)$ are the input and output of the generative network, and $y$ is the ground-truth image. In the image compression problem, $x$ is an uncompressed image and we measure the fidelity of the compressed image against it; thus, this is a special case where $y = x$. The parameters $w$ of the ProxIQA network are optimized so that it mimics the perceptual model $M$.
The outline of this paper is as follows: Section II reviews the relevant literature on image quality assessment, perceptual optimization, and deep image compression. Section III describes the ProxIQA framework, while Section IV provides analysis and experimental results. Finally, Section V concludes the paper.
Fig. 2: Comparison of approaches to perceptual optimization. (a) Predetermined-function-based approaches typically use a differentiable function having a certain degree of convexity, such as SSIM and MS-SSIM. (b) Perceptual-loss-based approaches define the loss on features extracted from intermediate layers of a pre-trained network (such as VGG). (c) Our method uses the output of a proxy network that approximates an IQA model as the loss function.
II Related Work
In this section, we provide a literature review of studies that are closely related to our work. The relevant topics of objective image quality assessment, perceptual optimization and deep compression are briefly reviewed.
II-A Perceptual Image Quality Metrics
Over the past decade there has been remarkably increasing interest in developing objective image quality assessment (IQA) methods. Objective IQA models are commonly classified as full-reference, reduced-reference, or no-reference, based on the amount of information they require from a reference image of ostensibly pristine quality. Since ground-truth data may be assumed to be available in our setting, we only consider and review full-reference (FR) IQA models here.
Beyond the well-known structural similarity index and other SSIM-type methods, a wide variety of perception-based FR models have been designed, including the visual signal-to-noise ratio index (VSNR)
[14], the visual information fidelity (VIF) index [15], the MAD model [16], the feature similarity index (FSIM) and its color extension FSIMc [17], and the visual saliency-induced index (VSI) [18].

With the rapid development of machine learning, important data-driven models have also begun to emerge
[19, 20, 21, 22, 23, 24]. A particularly successful example is Netflix’s announcement of an open-source FR video quality engine called Video Multimethod Assessment Fusion (VMAF)
[25]. VMAF combines multiple quality features to train a Support Vector Regressor (SVR) to predict subjective judgments of video quality. When it is applied to still pictures, VMAF treats the data as a video frame having zero motion. Like SSIM, VMAF is used to perceptually optimize tremendous volumes of internet video traffic.
Generally, more advanced, high-performance quality prediction models such as these are difficult to adopt as loss functions for end-to-end optimization networks.
II-B Perceptual Optimization
As tractable tools for perceptual optimization, SSIM and MS-SSIM have been widely adopted because of the simple analytical form of their gradients and their computational ease. Moreover, their convexity properties [26] make them feasible targets for optimization. For example, two recent studies adopted structural similarity functions as loss layers of image generation models, obtaining improved results as validated by a human subjective study [12] and by objective evaluation against several other perceptual models [13].
Rather than optimizing a closed-form function, another approach uses a deep neural network to guide the training. Recent experimental studies suggest that the features extracted from a well-trained image classification network can capture information that is useful for other perceptual tasks [27]. As illustrated in Fig. 2(b), the perceptual loss is defined as
$$\mathcal{L}_{\mathrm{perceptual}} = \frac{1}{N_l}\left\|\phi_l(\hat{x}) - \phi_l(y)\right\|_2^2, \quad (2)$$

where $\phi_l$ denotes the output feature map of the $l$-th layer, containing $N_l$ elements, of a pre-trained network $\phi$, and $\hat{x}$ and $y$ are the generated and ground-truth images.
In practice, the loss computed from high-level features extracted from a pre-trained VGG classification network [28], also called the VGG loss, has been commonly adopted for diverse computer vision tasks, including style transfer [29, 30], super-resolution [31, 29, 32, 33], and image inpainting [34].
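As a concrete point of reference for (2), the following is a minimal TensorFlow 2/Keras sketch of a VGG-based perceptual loss; the choice of VGG16 and of the block4_conv3 layer is illustrative, not a detail taken from the works cited above.

```python
import tensorflow as tf

# Fixed, pre-trained VGG16 truncated at an intermediate layer (layer choice is illustrative).
_vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
_feat = tf.keras.Model(_vgg.input, _vgg.get_layer("block4_conv3").output)
_feat.trainable = False

def vgg_perceptual_loss(y_true, y_pred):
    """Mean squared error between VGG feature maps, as in (2).

    Inputs are RGB images in [0, 1]; they are rescaled to the range
    expected by Keras' VGG preprocessing.
    """
    pre = tf.keras.applications.vgg16.preprocess_input
    f_true = _feat(pre(y_true * 255.0))
    f_pred = _feat(pre(y_pred * 255.0))
    return tf.reduce_mean(tf.square(f_true - f_pred))
```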
II-C End-to-end Optimized Lossy Image Compression
Recently, lossy image compression models have been realized using deep neural network architectures. Most of these deploy deep auto-encoders. For example, Ballé et al. [35] proposed a general framework for optimizing image compression in an end-to-end manner. Unlike other methods, the bitrate is estimated and accounted for during training. In [36], this model was improved by incorporating a scale hyperprior into the compression framework, using an additional network to estimate the standard deviations of the quantized coefficients, thereby further improving coding efficiency. Later, Minnen et al. [37] exploited a PixelCNN layer, which they combined with an autoregressive hyperprior. Beyond these early efforts, other recent approaches have adopted more complex network architectures, such as recurrent neural networks (RNNs) [38, 39, 40] and generative adversarial networks (GANs) [41, 42]. Some work has also been done to extend these ideas to the deep video compression problem [43, 44].

Unsurprisingly, the idea of optimizing a conventional codec such as H.264/AVC against perceptual models like SSIM, VIF, and VMAF has been studied in depth [45, 46, 47] and implemented in widespread practice [25]. We seek to extend this concept in a similar manner, to learn an end-to-end perceptually optimized compression model.
III Proposed Perceptual Optimization Framework
Our approach to training an image compression model in a perceptually optimized way is depicted in Fig. 3. This framework involves optimizing two networks: an image compression network $f_\theta$, and a sub-network $p_w$, which serves as a proxy of an IQA model and which we refer to as ProxIQA. A source image $x$ is input to the compression network, which produces a reconstructed image

$$\hat{x} = f_\theta(x). \quad (3)$$

Separately, the ProxIQA network maps the image pair into a proxy of an image quality model $M$:

$$p_w(x, \hat{x}) \approx M(x, \hat{x}). \quad (4)$$
In each training iteration, the two networks are alternately updated as follows:
III-1 Deep Compression Model Updating
To integrate $p_w$ into the update of $f_\theta$ given a mini-batch $x$, the model parameters $w$ of $p_w$ are fixed during this step. In order to minimize distortion, the output of $p_w$ becomes part of the objective in the optimization of $\theta$:

$$\hat{\theta} = \arg\min_{\theta} \; \mathcal{L}_f, \quad (5)$$

where $\mathcal{L}_f$ is the compression loss defined in Section III-B. By back-propagating through the forward model, the loss derivative is used to update $\theta$.
III-2 ProxIQA Network Updating
Given a mini-batch pair $(x, \hat{x})$ collected from the most recent update of the compression network, the quality scores $M(x, \hat{x})$ are calculated. The ProxIQA network is then updated to optimally fit $M(x, \hat{x})$ given the input pair. Note that the compression network is not needed in this part of the training.
As may be seen, the auxiliary sub-network ProxIQA is incorporated into the training of the compression network. However, it is important to understand that the ProxIQA network is not present during the testing (image compression/decompression) phase.
III-A Network Architecture
III-A1 ProxIQA Network
The goal is to learn a nonlinear regressor via a CNN. The network takes a reference patch and a distorted patch of the same size as input. They are concatenated into a 6-channel signal, which is fed into the network and reduced to a predicted quality score. As depicted in Fig. 4, the ProxIQA network may be as simple as a shallow CNN consisting of three stages of convolution, ReLU activation, and max pooling. The spatial size is reduced after each stage by the max-pooling layers. Finally, the feature maps are flattened and fed to a fully connected layer, which yields the output. The parameterization of each layer is detailed in the figure.
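A minimal Keras sketch of such a proxy regressor is shown below; since the exact patch size, filter counts, and pooling factor in Fig. 4 did not survive extraction, the values used here are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_proxiqa(patch_size=64, channels=(32, 64, 128)):
    """Shallow CNN regressor: concatenated reference/distorted patches -> quality score.

    Three conv-ReLU-maxpool stages followed by a fully connected output,
    mirroring the structure described above (patch size and filter counts
    are illustrative placeholders).
    """
    ref = layers.Input((patch_size, patch_size, 3), name="reference")
    dis = layers.Input((patch_size, patch_size, 3), name="distorted")
    h = layers.Concatenate(axis=-1)([ref, dis])        # 6-channel input
    for c in channels:
        h = layers.Conv2D(c, 3, padding="same", activation="relu")(h)
        h = layers.MaxPool2D(2)(h)                     # reduce the spatial size
    h = layers.Flatten()(h)
    score = layers.Dense(1, name="proxy_score")(h)
    return tf.keras.Model([ref, dis], score, name="ProxIQA")
```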
III-A2 Compression Network

We build on the deep image compression model of [35]. As shown in Fig. 3, the image compression network comprises an analysis transform ($g_a$) at the encoder side and a synthesis transform ($g_s$) at the decoder side. Both transforms are composed of three consecutive layers of convolution, down-(up-)sampling, and activation. Instead of utilizing ReLU, a generalized divisive normalization (GDN) transform is adopted as the activation function [48], similar to the normalization of visual signals performed by the human visual system.
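For orientation, a simplified sketch of the analysis/synthesis pair is shown below, using the GDN layers provided by the tensorflow-compression package on which [35] builds; the filter counts and kernel sizes are illustrative, and the quantizer and entropy model are omitted here.

```python
import tensorflow as tf
import tensorflow_compression as tfc  # provides GDN and entropy models
from tensorflow.keras import layers

def analysis_transform(num_filters=128):
    """g_a: three conv + downsampling + GDN stages (filter count illustrative)."""
    return tf.keras.Sequential([
        layers.Conv2D(num_filters, 5, strides=2, padding="same"), tfc.GDN(),
        layers.Conv2D(num_filters, 5, strides=2, padding="same"), tfc.GDN(),
        layers.Conv2D(num_filters, 5, strides=2, padding="same"), tfc.GDN(),
    ], name="analysis")

def synthesis_transform(num_filters=128):
    """g_s: mirrored transposed-conv + upsampling + inverse-GDN stages."""
    return tf.keras.Sequential([
        layers.Conv2DTranspose(num_filters, 5, strides=2, padding="same"), tfc.GDN(inverse=True),
        layers.Conv2DTranspose(num_filters, 5, strides=2, padding="same"), tfc.GDN(inverse=True),
        layers.Conv2DTranspose(3, 5, strides=2, padding="same"),  # back to RGB
    ], name="synthesis")
```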
III-B Loss Functions
As illustrated in Fig. 3, let $x$, $y$, $\hat{y}$, and $\hat{x}$ be the source batch, latent representation, quantized latent representation, and reconstructed batch, respectively. The model parameters of the analysis and synthesis transforms are collectively denoted by $\theta$. The ProxIQA network has model parameters $w$. Given a perceptual metric $M$, the goal is to optimize the full set of parameters $\theta$ and $w$ such that the learned image codec generates reconstructed images $\hat{x}$ having high perceptual quality scores $M(x, \hat{x})$. Furthermore, the rate should be as small as possible. Therefore, we train our model using the following losses.
III-B1 Rate loss
The rate loss, representing the bit consumption of an encode, is defined by

$$\mathcal{L}_{\mathrm{rate}} = \mathbb{E}\left[-\log_2 p_{\hat{y}}(\hat{y})\right], \quad (6)$$

where $p_{\hat{y}}$ is the entropy model. This cross-entropy term is minimized when the entropy model matches the actual marginal distribution of the quantized latents $\hat{y}$.
During training, the latent representation $y$ is quantized to $\tilde{y}$ by adding i.i.d. uniform noise $\Delta y \sim \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$. Then, $\tilde{y}$ is used to estimate the rate via (6). Unlike the estimated entropy used when training the network, a variant of the context-adaptive binary arithmetic coder (CABAC) [49] is used to encode the discrete-valued data into the bitstream during testing.
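A sketch of this relaxed rate estimate follows; `entropy_model` stands for any density model that returns per-element likelihoods (such as the factorized prior of [35]) and is an assumption of this sketch rather than a detail of the paper.

```python
import tensorflow as tf

def estimated_rate_bits(y, entropy_model):
    """Differentiable rate proxy for training, following (6).

    During training, hard quantization is replaced by additive i.i.d.
    uniform noise on [-0.5, 0.5); the expected code length is the negative
    log-likelihood of the noisy latents under the entropy model, in bits.
    """
    noise = tf.random.uniform(tf.shape(y), -0.5, 0.5)
    y_tilde = y + noise                      # relaxed "quantized" latents
    likelihoods = entropy_model(y_tilde)     # hypothetical callable: p(y_tilde)
    bits = tf.reduce_sum(-tf.math.log(likelihoods)) / tf.math.log(2.0)
    return bits
```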
III-B2 Pixel loss
The pixel loss is the distance between the source image and the reconstructed image under a distance function $d$. Given $x$ and $\hat{x}$, the pixel loss is defined by

$$\mathcal{L}_{\mathrm{pixel}} = d(x, \hat{x}), \quad (7)$$

where $d$ can be the mean squared error ($d_{\ell_2}$) or the mean absolute error ($d_{\ell_1}$).
The original work in [35] used $d_{\ell_2}$ as the pixel loss to maximize the PSNR of the reconstructed images. When combined with the rate loss, the image compression network is optimized by minimizing the objective function

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{rate}} + \mathcal{L}_{\mathrm{pixel}}, \quad (8)$$

which has a form similar to the rate-distortion optimization (RDO) functions used in conventional codecs. We retain a pixel loss in our framework to encourage training stability.
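Under this reconstruction of (8), the baseline MSE objective can be written as the following small helper; the placement of $\lambda$ on the rate term follows the equation above and is part of that reconstruction.

```python
import tensorflow as tf

def rdo_loss_mse(x, x_hat, rate_bpp, lam):
    """Baseline objective of (8): estimated rate (bits per pixel) weighted by
    lambda, plus an MSE distortion term, analogous to rate-distortion
    optimization in conventional codecs."""
    distortion = tf.reduce_mean(tf.square(x - x_hat))  # d_l2 pixel loss of (7)
    return lam * rate_bpp + distortion
```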
III-B3 Proxy loss
Instead of just minimizing an $\ell_p$ norm between $x$ and $\hat{x}$, we introduce a novel loss term. The proxy loss is calculated from the output of the ProxIQA network $p_w$, with its parameters $w$ fixed:

$$\mathcal{L}_{\mathrm{proxy}} = M_{\max} - p_w(x, \hat{x}). \quad (9)$$

Here $M_{\max}$ denotes the upper bound of the model $M$, which is a constant with respect to the loss function.
Finally, the total loss function for optimizing the compression network is the weighted combination of the losses from Eqs. (7), (6), and (9):

$$\mathcal{L}_f = \lambda\,\mathcal{L}_{\mathrm{rate}} + \mathcal{L}_{\mathrm{pixel}} + \eta\,\mathcal{L}_{\mathrm{proxy}}, \quad (10)$$

where $\lambda$ balances the bitrate against the distortion of the encoded bitstream, and $\eta$ weights the proxy loss against the pixel loss. The pixel loss plays a different role, acting as a regularization term: since the ProxIQA network is updated at each step, the loss function also changes at each step, and the pixel loss serves to stabilize the training process.
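A sketch of the composite objective, following the reconstruction of (9) and (10) above, is given below; the argument names and the convention that the ProxIQA model takes a two-input list are assumptions of this sketch.

```python
import tensorflow as tf

def total_loss(x, x_hat, rate_bpp, proxiqa_net, lam, eta, m_max=100.0):
    """Composite objective of (10): lambda * rate + pixel + eta * proxy.

    rate_bpp is an estimated rate in bits per pixel (e.g., the rate sketch of
    Section III-B1 normalized by pixel count). m_max is the metric's upper
    bound (100 for VMAF, 1 for SSIM-type indices). The ProxIQA parameters are
    treated as fixed inside this loss.
    """
    pixel = tf.reduce_mean(tf.square(x - x_hat))             # regularizing pixel term (7)
    proxy = tf.reduce_mean(m_max - proxiqa_net([x, x_hat]))  # proxy loss (9)
    return lam * rate_bpp + pixel + eta * proxy
```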
III-B4 Metric loss
The ProxIQA network aims to mimic an image quality model $M$. Given two images $x$ and $\hat{x}$, we define a metric loss to attain this objective while updating the ProxIQA network:

$$\mathcal{L}_{\mathrm{metric}} = \left\| p_w(x, \hat{x}) - M(x, \hat{x}) \right\|_2^2. \quad (11)$$

Note that $\hat{x}$ is a constant here, since it is obtained from the reconstructed patches generated during the most recent update of the compression network.
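Putting the two update rules together, one alternating iteration might look as follows; the calling conventions (a codec returning a reconstruction/rate pair, a ProxIQA model taking a two-input list, and an external scorer `iqa_fn` implementing $M$) are assumptions of this sketch, not the paper's implementation.

```python
import tensorflow as tf

def alternating_step(x, codec, proxiqa, codec_loss_fn, iqa_fn, opt_f, opt_p):
    """One iteration of the alternating scheme of Section III.

    codec_loss_fn(x, codec, proxiqa) -> scalar loss of (10) with ProxIQA frozen.
    iqa_fn(x, x_hat)                 -> target metric scores M(x, x_hat) of (11).
    """
    # 1) Update the codec parameters theta; ProxIQA parameters w are not touched.
    with tf.GradientTape() as tape:
        loss_f = codec_loss_fn(x, codec, proxiqa)
    opt_f.apply_gradients(zip(tape.gradient(loss_f, codec.trainable_variables),
                              codec.trainable_variables))

    # 2) Update ProxIQA on the reconstructions just produced by the codec.
    x_hat = tf.stop_gradient(codec(x)[0])        # assumed: codec(x) -> (x_hat, rate)
    target = tf.stop_gradient(iqa_fn(x, x_hat))  # constant w.r.t. w, as in (11)
    with tf.GradientTape() as tape:
        m_pred = proxiqa([x, x_hat])
        loss_p = tf.reduce_mean(tf.square(m_pred - target))  # metric loss (11)
    opt_p.apply_gradients(zip(tape.gradient(loss_p, proxiqa.trainable_variables),
                              proxiqa.trainable_variables))
    return loss_f, loss_p
```

In the terminology above, step 1 holds $w$ fixed and step 2 holds $\theta$ fixed, which is exactly the alternation described at the start of this section.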
III-C What's Wrong with Using a Pre-trained Network?
Another way of attempting to accomplish the same goal is to use a pre-trained network as the loss layer. That is, a proxy network is first learned to predict a metric score given a pristine patch and a distorted patch from an existing dataset. Next, the trained proxy network is inserted into the loss layer of the deep compression network with the goal of maximizing the proxy score. Unfortunately, severe complications can arise when applying this simple methodology.
Table I: BD-rate (%) of the proxy-optimized BLS models on the Kodak images, relative to the MSE-optimized baseline.

Optimization | proxy-SSIM | | | | proxy-MS-SSIM | | | | proxy-VMAF | | |
---|---|---|---|---|---|---|---|---|---|---|---|---
BD-rate Metric | PSNR | SSIM | MSIM | VMAF | PSNR | SSIM | MSIM | VMAF | PSNR | SSIM | MSIM | VMAF
Kodim01 | 14.4 | -17.4 | -16.3 | -2.1 | 12.6 | -1.4 | -3.9 | -2.4 | 4.9 | -2.9 | -4.3 | -25.3 |
Kodim02 | 11.0 | -21.6 | -17.9 | 15.4 | 6.5 | -13.1 | -21.5 | 3.0 | 8.6 | -2.2 | -2.6 | -26.7 |
Kodim03 | 13.5 | -19.5 | -16.6 | 5.8 | 4.3 | -14.6 | -24.1 | -10.6 | 7.8 | -5.2 | -7.2 | -33.2 |
Kodim04 | 15.6 | -23.5 | -21.4 | 9.3 | 10.0 | -15.6 | -26.6 | 0.0 | 6.3 | -6.4 | -7.4 | -26.9 |
Kodim05 | 13.8 | -16.8 | -14.8 | 1.4 | 14.0 | -3.9 | -11.2 | -2.1 | 3.1 | -6.1 | -6.3 | -18.1 |
Kodim06 | 17.0 | -22.0 | -20.0 | 0.5 | 12.2 | -10.1 | -19.2 | -2.3 | 4.5 | -6.0 | -7.4 | -23.6 |
Kodim07 | 11.6 | -14.5 | -10.9 | 8.3 | 7.1 | -9.9 | -21.1 | -0.6 | 7.0 | -10.0 | -9.7 | -24.7 |
Kodim08 | 13.0 | -14.9 | -15.1 | -5.4 | 16.4 | -0.6 | -6.6 | -7.3 | 2.5 | -3.9 | -5.2 | -20.3 |
Kodim09 | 16.1 | -17.6 | -14.5 | 12.3 | 7.6 | -12.9 | -23.8 | -1.2 | 6.4 | -3.1 | -4.4 | -24.9 |
Kodim10 | 15.4 | -25.3 | -22.4 | 12.2 | 10.4 | -17.3 | -32.8 | 3.3 | 5.0 | -9.6 | -10.7 | -23.2 |
Kodim11 | 16.6 | -26.6 | -25.3 | 12.8 | 14.1 | -16.1 | -25.9 | 6.9 | 5.1 | -7.1 | -8.5 | -20.0 |
Kodim12 | 10.4 | -30.8 | -28.8 | 12.8 | 2.5 | -19.7 | -31.3 | 8.3 | 6.6 | -3.4 | -5.9 | -22.4 |
Kodim13 | 19.9 | -25.3 | -24.5 | -7.1 | 16.8 | -9.1 | -16.6 | -7.3 | 1.2 | -8.1 | -9.2 | -19.6 |
Kodim14 | 17.5 | -22.6 | -20.3 | 8.2 | 13.0 | -10.5 | -18.3 | 0.0 | 4.0 | -8.5 | -9.2 | -20.1 |
Kodim15 | 17.4 | -27.1 | -26.0 | 6.0 | 9.8 | -15.6 | -30.7 | 0.1 | 6.7 | -5.7 | -7.6 | -31.8 |
Kodim16 | 14.4 | -20.4 | -15.9 | 2.5 | 6.3 | -13.9 | -21.3 | -3.2 | 6.5 | -4.6 | -6.1 | -24.9 |
Kodim17 | 18.0 | -20.7 | -19.3 | 7.3 | 10.5 | -15.0 | -29.7 | 0.1 | 5.3 | -9.6 | -10.9 | -23.6 |
Kodim18 | 19.7 | -20.7 | -18.3 | 12.1 | 17.9 | -9.2 | -17.9 | 2.4 | 2.7 | -9.1 | -9.8 | -22.0 |
Kodim19 | 17.6 | -18.2 | -16.0 | 13.9 | 17.7 | -8.7 | -21.8 | 6.8 | 8.7 | -4.7 | -6.1 | -17.8 |
Kodim20 | 19.7 | -21.3 | -22.4 | 15.3 | 12.8 | -10.3 | -25.3 | 0.3 | 5.1 | -5.4 | -8.2 | -24.2 |
Kodim21 | 18.1 | -16.6 | -15.0 | 1.9 | 16.4 | -7.5 | -16.9 | -4.1 | 4.1 | -5.3 | -6.8 | -21.6 |
Kodim22 | 15.0 | -23.0 | -21.2 | 11.8 | 10.3 | -15.5 | -24.3 | 3.5 | 3.4 | -10.9 | -12.4 | -25.1 |
Kodim23 | 14.7 | -21.6 | -17.5 | 15.5 | 10.5 | -16.6 | -29.6 | 4.1 | 7.8 | -7.5 | -8.7 | -16.7 |
Kodim24 | 21.3 | -23.4 | -21.7 | 2.0 | 20.6 | -11.3 | -21.9 | -1.7 | 2.2 | -11.4 | -12.2 | -24.0 |
III-C1 New Distortion Types
The success of a CNN model depends highly on the size of its training set. This is often an obstacle to learning DNN-based IQA models, owing to the limited size of publicly available IQA databases as compared with image recognition databases. Fortunately, training a proxy network on an existing model does not require human-labeled subjective quality scores such as mean opinion scores (MOS) or differential mean opinion scores (DMOS). The ground-truth metric score for training the proxy network is easily obtained given a pristine patch and a distorted patch. Therefore, we can make use of large-scale databases that do not include MOS, such as the Waterloo Exploration database [50].
Nevertheless, the distortion types provided by public-domain databases are generally quite different from the distortions created by a deep compression model. As shown in Fig. 5, most existing databases only provide synthetic distortions, such as JPEG, JPEG2000, Gaussian blur, and white Gaussian noise, applied at discrete severity levels. These distortions are drastically different from the distortions created by CNNs.
The Berkeley-Adobe PPS (BAPPS) database [27] contains many distorted patches collected from the outputs of CNN models. However, these CNN-based distortion types are still different from the patches reconstructed by deep compression networks. In Fig. 5(a), several reconstructed patches output by the deep compression model are shown. We have observed that dissimilar distortions can be generated by using different training steps or parameters. Comparing these distortions suggests that learning a network from previously existing databases may not be the optimal solution to our problem. By contrast, the proposed alternating training immediately resolves this problem: the patches reconstructed by the compression network are directly used to train the ProxIQA network.
III-C2 Adversarial Examples
We also discovered that the deep compression network often generates "adversarial" examples when its loss layer is the output of a pre-trained network having fixed parameters. Fig. 6 shows such an "adversarial" example generated by the deep compression network using a proxy network as its loss function. In this example, the proxy network was well-trained to mimic the VMAF algorithm. However, comparing Fig. 6(a) with Fig. 6(b), it is apparent that the true VMAF score and the proxy VMAF score predicted by the ProxIQA network are very different. This can be understood by considering the training of the network as an interpolation problem, whereby the neural network maps a test image to an accurate quality score. When the input is too different from the training set, the ProxIQA network may produce a poor interpolation result.
To further illustrate, Fig. 7 compares true VMAF scores with proxy VMAF scores. All of the scores were calculated on the reconstructed patches produced during training. Figure 7(a) shows that the proxy VMAF scores quickly approached the upper bound of the metric, whereas much lower true VMAF scores were assigned to the patches produced by the compression model. This problem becomes significant when the previously discussed training strategy is applied. However, a straightforward way of improving the training stage is to simply feed the adversarial examples, along with their objective quality scores, into the ProxIQA network as additional training data. The ProxIQA network is then updated, which enables it to predict proxy quality much more accurately. As shown in Fig. 7(b), the true and proxy scores become highly coincident early in the training process.
Table II: BD-rate (%) of various codecs relative to the MSE-optimized BLS baseline, on the Kodak, Tecnick, and NFLX datasets.

Image Dataset | Kodak | | | | | Tecnick | | | | | NFLX | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
BD-rate Metric | PSNR | SSIM | MSIM | VIF | VMAF | PSNR | SSIM | MSIM | VIF | VMAF | PSNR | SSIM | MSIM | VIF | VMAF
JPEG | 113.99 | 129.49 | 149.86 | 95.62 | 78.36 | 119.33 | 218.04 | 171.59 | 94.59 | 89.73 | 102.28 | 143.99 | 168.20 | 79.34 | 89.95 |
15.86 | 21.98 | 25.04 | 15.20 | 20.33 | 31.65 | 34.11 | 38.88 | 32.73 | 24.01 | 28.59 | 34.94 | 37.19 | 25.88 | 23.53 | |
JPEG2000 | -11.51 | 6.25 | -1.02 | -1.76 | -33.39 | -13.06 | -1.55 | -8.41 | -8.91 | -34.25 | -27.81 | 1.43 | -3.93 | -14.55 | -38.98 |
10.78 | 14.68 | 13.71 | 13.19 | 8.77 | 22.94 | 25.92 | 22.89 | 22.32 | 18.47 | 20.25 | 26.84 | 25.77 | 23.34 | 17.03 | |
HEVC | -26.35 | -6.32 | -6.12 | -13.68 | -28.23 | -28.32 | -8.97 | -11.07 | -14.87 | -27.65 | -49.43 | -17.12 | -16.06 | -29.93 | -35.03 |
11.06 | 18.65 | 19.05 | 14.14 | 12.15 | 22.48 | 29.84 | 27.56 | 25.07 | 24.05 | 16.07 | 25.63 | 26.59 | 21.22 | 21.52 | |
HEVC | -27.33 | -25.98 | -24.97 | -32.73 | -42.18 | -19.48 | -28.97 | -33.95 | -40.15 | -46.67 | -37.63 | -35.41 | -33.88 | -47.51 | -50.91 |
8.56 | 15.05 | 15.53 | 11.77 | 10.32 | 19.25 | 22.74 | 19.93 | 18.05 | 18.37 | 18.13 | 19.01 | 20.08 | 15.66 | 16.34 | |
BLS (proxy-SSIM) | 15.89 | -21.31 | -19.25 | -4.71 | 7.19 | 8.67 | -10.79 | -16.11 | -5.72 | 8.68 | 16.79 | -19.01 | -17.73 | -6.61 | 9.75
2.83 | 3.93 | 4.22 | 3.44 | 6.52 | 6.59 | 7.60 | 6.13 | 4.98 | 6.66 | 8.99 | 6.82 | 6.65 | 6.83 | 6.72 | |
BLS (proxy-MS-SSIM) | 11.67 | -11.58 | -21.77 | -3.69 | -0.17 | 4.47 | -17.40 | -23.50 | -5.00 | 0.19 | 12.28 | -11.59 | -23.53 | -5.96 | 4.34
4.50 | 4.77 | 7.17 | 3.51 | 4.49 | 8.65 | 8.55 | 9.89 | 5.16 | 4.78 | 10.40 | 9.65 | 10.85 | 6.92 | 6.44 | |
BLS (proxy-VMAF) | 5.23 | -6.53 | -7.78 | -1.90 | -23.35 | 6.23 | -8.45 | -5.97 | -2.24 | -23.78 | 7.00 | -4.35 | -5.43 | -1.96 | -21.97
1.97 | 2.57 | 2.40 | 1.82 | 3.92 | 2.91 | 4.34 | 4.16 | 2.90 | 7.16 | 2.37 | 5.96 | 5.57 | 3.66 | 5.01 | |
BMSHJ MSE [36] | -21.46 | -10.94 | -10.17 | -16.52 | -25.78 | -26.03 | -20.22 | -16.71 | -22.75 | -33.75 | -36.64 | -21.21 | -21.08 | -28.92 | -38.01 |
6.02 | 6.36 | 5.66 | 6.22 | 8.25 | 10.88 | 13.83 | 10.68 | 11.74 | 12.70 | 10.18 | 11.32 | 12.18 | 12.26 | 12.62 | |
BMSHJ (proxy-VMAF) | -15.90 | -13.57 | -13.17 | -17.02 | -47.11 | -19.64 | -23.14 | -16.73 | -23.63 | -53.18 | -29.96 | -18.87 | -19.29 | -29.00 | -56.06
5.34 | 4.22 | 3.70 | 4.86 | 7.72 | 10.35 | 10.29 | 10.82 | 9.69 | 11.63 | 10.34 | 11.01 | 10.41 | 10.87 | 10.66 |
III-D Implementation and Training Details
The TensorFlow framework (version 1.12) was used to implement the proposed method. We use the Adam solver [51] to optimize both the ProxIQA network and the deep compression network. The initial learning rates of both networks were held at fixed values for the first 2M steps and then lowered for an additional 100K iterations; thus, the networks were trained over 2.1M iterations of back-propagation. All of the models were trained using NVIDIA 1080-Ti GPU cards.

We used a subset of processed images from the ImageNet database [52] as training data. As described in [35], small amounts of uniform noise were added to the images. The images were then down-sampled by random factors, to reduce compression artifacts and high-frequency noise, and randomly cropped to a fixed size. In each mini-batch, we randomly sampled images from this subset and cropped them into patches. The source code is publicly available at https://github.com/treammm/Compression.
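A schematic of the optimizer setup is sketched below; the learning-rate values shown are placeholders (the original values did not survive extraction), and only the 2M-step boundary follows the text.

```python
import tensorflow as tf

# Two independent Adam optimizers, one per network, with a step-wise
# learning-rate drop after the initial training phase. The boundary (2M steps)
# follows the text; the rate values themselves are illustrative placeholders.
def make_optimizer(initial_lr=1e-4, drop_lr=1e-5, boundary=2_000_000):
    schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
        boundaries=[boundary], values=[initial_lr, drop_lr])
    return tf.keras.optimizers.Adam(learning_rate=schedule)

opt_codec = make_optimizer()
opt_proxiqa = make_optimizer()
```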
IV Experiments and Analysis

We compared the proposed perceptual optimization framework against the original MSE-optimized image compression model, and also against state-of-the-art image codecs. To experimentally evaluate the trained models, we conducted and report the results of a quantitative evaluation, a subjective comparison, and a runtime analysis. We first describe the experimental setup, including the datasets on which the performance evaluation was conducted and the evaluation criteria used. We also performed a separate series of experiments to probe the limitations of our design. In all of the experiments, we denote the deep image compression model of [35] as the BLS model, which we use as the baseline for performance comparisons. A model optimized against a given IQA model $M$ using (10) and (11) is denoted as the proxy-$M$ model (e.g., proxy-VMAF).
Table III: Software versions of the evaluated codecs.

Codec | Software | Version
---|---|---
JPEG | JPEG XT | Release 1.53 (https://jpeg.org/downloads/jpegxt/reference1367abcd89.zip)
JPEG2000 | Kakadu | Version 7.10.2 (https://kakadusoftware.com/downloads/)
HEVC | HM | 16.9 (https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.9/)
Ballé [35] | Tensorflow Compression | Release 1.0 (https://github.com/tensorflow/compression/releases/tag/v1.0)
Ballé [36] | Tensorflow Compression | Release 1.2 (https://github.com/tensorflow/compression/releases/tag/v1.2)
IV-A Experimental Setup
IV-A1 Evaluation Datasets
To evaluate the various image codecs, we utilized the well-known Kodak dataset of very high quality uncompressed images. This publicly available image set is commonly used to evaluate image compression algorithms and IQA models. We also used a subset of the Tecnick dataset [53], as well as billboard images collected from the Netflix library, yielding test images having more diverse resolutions and contents. None of the test images were included in the training sets, to avoid bias and overfitting.
Fig. 8: VMAF RD-curves and cropped reconstructions of selected Kodak images, comparing the source, the MSE-optimized baseline, the proxy-VMAF-optimized model, JPEG2000, and HEVC. The bitrate (bpp), VMAF, PSNR, and SSIM values reported in the figure for each reconstruction are listed below.

Kodim03 | bpp / VMAF | 0.250 / 72.56 | 0.250 / 83.63 | 0.307 / 84.26 | 0.233 / 82.10
 | PSNR / SSIM | 33.02 / 0.951 | 33.39 / 0.953 | 33.72 / 0.963 | 34.89 / 0.962
Kodim07 | bpp / VMAF | 0.579 / 85.17 | 0.498 / 88.01 | 0.522 / 88.30 | 0.468 / 87.30
 | PSNR / SSIM | 35.71 / 0.986 | 34.74 / 0.979 | 34.74 / 0.985 | 35.77 / 0.983
Kodim10 | bpp / VMAF | 0.052 / 27.82 | 0.050 / 36.39 | 0.058 / 37.06 | 0.054 / 39.67
 | PSNR / SSIM | 26.12 / 0.817 | 26.53 / 0.806 | 26.32 / 0.840 | 27.78 / 0.843
Kodim17 | bpp / VMAF | 0.052 / 19.03 | 0.050 / 32.27 | 0.058 / 30.30 | 0.061 / 35.82
 | PSNR / SSIM | 26.20 / 0.808 | 26.03 / 0.795 | 26.43 / 0.828 | 27.13 / 0.821
Kodim19 | bpp / VMAF | 1.361 / 93.25 | 1.229 / 94.52 | 0.892 / 94.20 | 1.139 / 93.52
 | PSNR / SSIM | 36.91 / 0.991 | 37.52 / 0.986 | 34.29 / 0.982 | 37.41 / 0.986
IV-A2 Evaluation Criteria
We measured the objective coding efficiency of each image codec using the Bjontegaard-Delta bitrate (BD-rate) [54], which quantifies differences in bitrate at a fixed distortion level relative to a reference encoder. To calculate BD-rate, we encoded each image at eight different bitrates spanning a wide range of bits per pixel (bpp). The performances of all of the codecs were compared against the same baseline: the MSE-optimized BLS model. A negative BD-rate value means that the bitrate was reduced as compared with the baseline. To fairly compare deep compression models having different loss layers, we used the same number of filters at every layer and trained all of the models for the same number of steps. To ensure reproducibility, we report the version of each software codec used in Table III. The input image formats used were YUV444 for JPEG and JPEG2000, and both YUV420 and YUV444 for intra-coded HEVC. Lastly, the IQA algorithms used to evaluate the codecs were computed using FFmpeg 4.0 with libavfilter (for PSNR) and libvmaf 0.6.1 (https://github.com/Netflix/vmaf) (for SSIM, MS-SSIM, VIF, and VMAF). Specifically, the PSNR calculation in libavfilter is computed from the per-channel mean squared errors as

$$\mathrm{PSNR} = 10\log_{10}\frac{255^2}{\left(\mathrm{MSE}_Y + \mathrm{MSE}_U + \mathrm{MSE}_V\right)/3}, \quad (12)$$

which is commonly used for combining per-channel PSNRs.
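For completeness, a small NumPy sketch of the Bjontegaard-Delta bitrate computation [54] is included below; it follows the standard cubic-fit formulation and is not code from the paper or from any of the evaluated tools.

```python
import numpy as np

def bd_rate(rate_ref, dist_ref, rate_test, dist_test):
    """Average bitrate difference (%) of the test codec vs. the reference,
    at equal distortion, via cubic fits in the log-rate domain [54]."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    # Fit log-rate as a cubic polynomial of the quality score.
    p_ref = np.polyfit(dist_ref, lr_ref, 3)
    p_test = np.polyfit(dist_test, lr_test, 3)
    # Integrate both fits over the overlapping quality interval.
    lo = max(min(dist_ref), min(dist_test))
    hi = min(max(dist_ref), max(dist_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (10 ** avg_diff - 1) * 100.0  # negative => test codec saves bitrate
```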
IV-B Experimental Results on the Kodak Dataset
The results on the images of the Kodak dataset are given in Table I. The distortion levels used for the BD-rate calculations were quantified using PSNR, SSIM, MS-SSIM (denoted MSIM in the tables), and VMAF. It may be observed that RD performance measured using PSNR became worse for the models optimized under other IQA models. This is not surprising, since MSE, which is used by the baseline model, is the optimal loss function for PSNR.
It may also be noted that, unlike the other IQA models used as targets of the proposed optimization, proxy-VMAF optimization delivers coding gains with respect to all of the BD-rate measurements except the PSNR BD-rate.
IV-C Comparison with State-of-the-Art Codecs
Table II tabulates the percent change in BD-rate relative to the BLS baseline, with respect to different quality models. We comprehensively evaluated perceptual deep compression under different perceptual optimization protocols (highlighted in boldface) against three conventional image codecs: JPEG, JPEG2000, and intra-coded HEVC using the main-RExt (Format Range Extension) profile. Extensive experiments were carried out on the three aforementioned datasets, using three perceptual IQA models as optimization targets. In addition to the BLS model, we also deployed the proposed optimization framework on a more sophisticated deep compression model [36] to test its generality. We report the BD-rate changes obtained, averaged over all the images in each dataset. Results on the Tecnick and NFLX datasets were similar to those on Kodak, as shown in Table II. These results show that our optimization approach is able to successfully optimize a deep image compression model against different IQA algorithms. Indeed, significant BD-rate reductions were obtained in many cases.
In addition to the quantitative results, we also visually compared the decoded images. Fig. 8 plots VMAF rate-distortion (RD) curves for several images. A subset of the images corresponding to the RD points obtained by the various codecs is also shown. As a basic test, we subjectively compare results yielding similar bitrates but different objective quality scores. The images Kodim10 and Kodim17 were subjected to extreme compression, at bitrates around 0.05 bpp. In these cases, the proxy-VMAF-optimized model significantly outperformed the MSE-optimized baseline model, delivering performance comparable to HEVC and JPEG2000 with respect to VMAF score and subjective quality. At high bitrates, the distinctions between the codecs become subtle. Therefore, we isolated RD points associated with similar VMAF scores and compared bitrate consumption. Generally, the proxy-VMAF-optimized model yielded comparable subjective and objective (VMAF) quality as the MSE-optimized baseline model, while consuming significantly fewer bits. The encoding results on Kodim19 in Fig. 8 using the proxy-VMAF-optimized model yielded VMAF scores similar to those of the other codecs, while consuming fewer bits than the Baseline, JPEG2000, and HEVC.
IV-D Limitations
When measuring RD performance using the SSIM metric, directly optimizing a model with SSIM should approach the theoretical upper bound of SSIM-measured RD performance. Accordingly, we investigated the performance of our proposed framework by comparing direct SSIM optimization against proxy-SSIM optimization of the BLS model. The results of this comparison are presented in Table IV, indicating an SSIM BD-rate performance gap between the two optimization approaches. A noticeable contributor to this performance drop is the pixel loss in (10). To validate this assumption, we conducted an ablation study to pinpoint the cause of the gap. We did this by replacing the proxy term in (10) with the analytic SSIM loss, while retaining the pixel loss, for the SSIM optimization. The loss function is then

$$\mathcal{L}_f = \lambda\,\mathcal{L}_{\mathrm{rate}} + \mathcal{L}_{\mathrm{pixel}} + \eta\,\mathcal{L}_{\mathrm{SSIM}}, \quad (13)$$

where

$$\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(x, \hat{x}). \quad (14)$$
The SSIM BD-rate that resulted from optimizing (13) is given in the fourth column of Table IV. It may be observed that the RD performance becomes very close to that of proxy-SSIM optimization, which confirms that the pixel loss is the main contributor to the performance loss.
Moreover, Fig. 10 shows that similar visual results are obtained using proxy-SSIM optimization and the optimization described in (13), even at heavy compression levels. A close examination shows that true SSIM optimization nicely preserves high-frequency details but loses chromatic fidelity. The RD-curves in Fig. 9 further confirm the similar behavior of proxy-SSIM optimization and the optimization of (13).
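A sketch of the ablation objective (13)-(14), using TensorFlow's built-in differentiable SSIM as the analytic SSIM term, is shown below; treating tf.image.ssim as the SSIM term of (14) is an assumption of this sketch.

```python
import tensorflow as tf

def ablation_loss_ssim(x, x_hat, rate_bpp, lam, eta):
    """Objective (13): the proxy term of (10) is replaced by the analytic
    SSIM loss of (14), while the pixel (regularization) term is retained."""
    ssim = tf.image.ssim(x, x_hat, max_val=1.0)   # per-image SSIM for inputs in [0, 1]
    l_ssim = tf.reduce_mean(1.0 - ssim)           # (14)
    pixel = tf.reduce_mean(tf.square(x - x_hat))  # pixel loss of (7)
    return lam * rate_bpp + pixel + eta * l_ssim
```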
IV-E Study of Training Steps
The instability introduced by the proxy loss can be further reduced by training longer and by reducing the learning rate. Fig. 11 plots the VMAF BD-rate as a function of the number of training steps. When measuring BD-rate against the same baseline (MSE optimization trained with 1M steps), proxy-VMAF optimization achieves significant improvements relative to MSE optimization by training longer or by lowering the learning rate. For a fair comparison, we also evaluated MSE optimization using the same extended training process as the baseline (dotted line). We observed very similar trends using the other perceptual optimization targets.
Table V: Average encoding and decoding times over the Kodak images.

Codec | Platform | Encode | Decode
---|---|---|---
JPEG | CPU | 43.02 | 62.88
JPEG2000 | CPU | 10.80 | 36.79
HEVC | CPU | 4578.57 | 89.88
BLS MSE | CPU | 251.01 | 117.93
 | GPU | 231.62 | 32.56
BLS (ours) | CPU | 246.57 | 119.02
 | GPU | 229.26 | 29.22
IV-F Execution Time
The encoding and decoding times of the compared codecs are summarized in Table V. We compiled the source code of the state-of-the-art (standard) codecs in order to compare all codecs on the same machine. The reported results were calculated by averaging the runtime over all of the Kodak images under different bitrate settings. From Table V, it may be observed that the time complexities of the MSE-optimized and proxy-optimized BLS models are nearly identical, since they deploy the same network architecture at inference time. Of course, the runtime of the deep compression models can be reduced when they are implemented on a GPU. It should also be noted that the decoding time of HEVC was estimated using the reference software HM, which is relatively slow; this could be improved by using a third-party decoder such as FFmpeg.
We also compared the encoding times under different bitrate settings in Fig. 12. For each encoder, all of the runtimes were normalized by the runtime at a reference bitrate. The conventional codecs required more time to encode at high bitrates, whereas the encoding times of the deep compression models did not vary much with bitrate.
V Concluding Remarks
We have presented a learning framework for perceptually optimizing a learned image compression model. To optimize the ProxIQA network, we developed an alternating training method. We experimentally demonstrated that, for fixed VMAF values, our proposed proxy approach achieves significant average bitrate reductions relative to the MSE-based framework.
The idea behind the proposed optimization framework is general. We believe that, with proper modifications of the architecture of the ProxIQA network, the framework should be applicable to a wide variety of image enhancement, restoration, and reconstruction problems, such as super-resolution and denoising.
Another future topic could be the study of new types of distortions caused by deep compression models. Like the examples shown in this paper, distorted images created by CNNs are very different from images afflicted by more traditional distortions, such as JPEG compression. Creating databases for assessing the subjective quality of these new types of distortions would be quite valuable.
Acknowledgment
The authors thank Johannes Ballé for providing the training images. The authors also thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu
References
- [1] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. and Mach. Intell., vol. 39, no. 4, pp. 640–651, Apr. 2017.
- [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2016, pp. 770–778.
- [3] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vision, Dec. 2015, pp. 2758–2766.
- [4] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2012, pp. 2392–2399.
- [5] V. Jain and S. Seung, “Natural image denoising with convolutional networks,” in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 769–776.
- [6] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. Eur. Conf. Comput. Vision, 2014, pp. 184–199.
- [7] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2017, pp. 624–632.
- [8] Y.-L. Liu, Y.-T. Liao, Y.-Y. Lin, and Y.-Y. Chuang, “Deep video frame interpolation using cyclic frame generation,” in Proc. AAAI, 2019, pp. 8794–8802.
- [9] J. T. Barron, “A general and adaptive robust loss function,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2019, pp. 4331–4339.
- [10] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.
- [11] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multi-scale structural similarity for image quality assessment,” in Proc. IEEE Asilomar Conf. on Signals, Syst., and Comput., Nov. 2003, pp. 1398–1402.
- [12] J. Snell, K. Ridgeway, R. Liao, B. D. Roads, M. C. Mozer, and R. S. Zemel, “Learning to generate images with perceptual similarity metrics,” in Proc. IEEE Int. Conf. Image Process., Sep. 2017, pp. 4277–4281.
- [13] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Trans. on Computational Imaging, vol. 3, no. 1, pp. 47–57, Mar. 2017.
- [14] D. Chandler and S. Hemami, “VSNR: A wavelet-based visual signal-to-noise ratio for natural images,” IEEE Trans. Image Processing, vol. 16, no. 9, pp. 2284–2298, Sep. 2007.
- [15] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Trans. Image Processing, vol. 15, no. 2, pp. 430–444, Feb. 2006.
- [16] E. C. Larson and D. M. Chandler, “Most apparent distortion: full-reference image quality assessment and the role of strategy,” Journal of Electronic Imaging, vol. 19, no. 1, pp. 011 006:1–011 006:21, Jan. 2010.
- [17] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Trans. Image Processing, vol. 20, no. 8, pp. 2378–2386, Aug. 2011.
- [18] L. Zhang, Y. Shen, and H. Li, “VSI: A visual saliency-induced index for perceptual image quality assessment,” IEEE Trans. Image Processing, vol. 23, no. 10, pp. 4270–4281, Oct. 2014.
- [19] T.-J. Liu, W. Lin, and C.-C. J. Kuo, “Image quality assessment using multi-method fusion,” IEEE Trans. Image Processing, vol. 22, no. 5, pp. 1793–1807, May 2013.
- [20] S.-C. Pei and L.-H. Chen, “Image quality assessment using human visual DOG model fused with random forest,” IEEE Trans. Image Processing, vol. 24, no. 11, pp. 3282–3292, Nov. 2015.
- [21] V. V. Lukin, N. N. Ponomarenko, O. I. Ieremeiev, K. O. Egiazarian, and J. Astola, “Combining full-reference image visual quality metrics by neural network,” Proc. SPIE, vol. 9394, p. 93940K, Mar. 2015.
- [22] M. Oszust, “Decision fusion for image quality assessment using an optimization approach,” IEEE Signal Process. Lett., vol. 23, no. 1, pp. 65–69, Jan. 2016.
- [23] F. Gao, Y. Wang, P. Li, M. Tan, J. Yu, and Y. Zhu, “DeepSim: Deep similarity for image quality assessment,” Neurocomputing, vol. 257, pp. 104–114, Sep. 2017.
- [24] S. Bosse, D. Maniry, K.-R. Muller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Trans. Image Processing, vol. 27, no. 1, pp. 206–219, Jan. 2018.
- [25] Z. Li, C. Bampis, J. Novak, A. Aaron, K. Swanson, A. Moorthy, and J. D. Cock, “VMAF: The journey continues,” The NETFLIX tech blog, 2018. [Online]. Available: https://medium.com/netflix-techblog/vmaf-the-journey-continues-44b51ee9ed12
- [26] D. Brunet, E. R. Vrscay, and Z. Wang, “On the mathematical properties of the structural similarity index,” IEEE Trans. Image Processing, vol. 21, no. 4, pp. 1488–1499, Apr. 2012.
- [27] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2018, pp. 586–595.
- [28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–14.
- [29] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vision, vol. 9906, 2016, pp. 694–711.
- [30] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman, “Controlling perceptual factors in neural style transfer,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jul. 2017, pp. 3730–3738.
- [31] J. Bruna, P. Sprechmann, and Y. LeCun, “Super-resolution with deep convolutional sufficient statistics,” in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–17.
- [32] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., 2017, pp. 105–114.
- [33] M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch, “EnhanceNet: Single image super-resolution through automated texture synthesis,” in Proc. IEEE Int. Conf. Comput. Vision, Oct. 2017.
- [34] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, “High-resolution image inpainting using multi-scale neural patch synthesis,” in Proc. IEEE Conf. Comput. Vision Pattern Recog. IEEE, Jul. 2017, pp. 4076–4084.
- [35] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–27.
- [36] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–23.
- [37] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems 31, 2018, pp. 10 771–10 780.
- [38] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” CoRR, vol. abs/1511.06085, 2015.
- [39] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jul. 2017, pp. 5306–5314.
- [40] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., June 2018, pp. 4385–4393.
- [41] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” arXiv preprint arXiv:1804.02958, 2018.
- [42] J. Löhdefink, A. Bär, N. M. Schmidt, F. Hüger, P. Schlicht, and T. Fingscheidt, “GAN- vs. JPEG2000 image compression for distributed automotive perception: Higher peak snr does not mean better semantic segmentation,” ArXiv, vol. abs/1902.04311, 2019.
- [43] C.-Y. Wu, N. Singhal, and P. Krähenbühl, “Video compression through image interpolation,” in Proc. Eur. Conf. Comput. Vision, 2018.
- [44] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learning image and video compression through spatial-temporal energy compaction,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., 2019.
- [45] S. Channappayya, A. Bovik, and R. Heath, “Rate bounds on SSIM index of quantized images,” IEEE Trans. Image Processing, vol. 17, no. 9, pp. 1624–1639, Sep. 2008.
- [46] Y.-H. Huang, T.-S. Ou, P.-Y. Su, and H. H. Chen, “Perceptual rate-distortion optimization using structural similarity index as quality metric,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 11, pp. 1614–1624, Nov. 2010.
- [47] S. Wang, A. Rehman, Z. Wang, S. Ma, and W. Gao, “SSIM-motivated rate-distortion optimization for video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 4, pp. 516–529, 2012.
- [48] J. Ballé, “Efficient nonlinear transforms for lossy image compression,” in Proc. IEEE Picture Coding Symp., 2018, pp. 248–252.
- [49] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the h.264/AVC video compression standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 620–636, Jul. 2003.
- [50] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo Exploration Database: New challenges for image quality assessment models,” IEEE Trans. Image Processing, vol. 26, no. 2, pp. 1004–1016, Feb. 2017.
- [51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–15.
- [52] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., Jun. 2009, pp. 248–255.
- [53] N. Asuni and A. Giachetti, “TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms,” in Proc. Eurographics Italian Chapter Conference, 2014, pp. 63–70.
- [54] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” document VCEG-M33, ITU-T Video Coding Experts Group (VCEG) Thirteenth Meeting, Austin, TX, April 2001.