
Perceptually Optimizing Deep Image Compression

Mean squared error (MSE) and ℓ_p norms have largely dominated the measurement of loss in neural networks due to their simplicity and analytical properties. However, when used to assess visual information loss, these simple norms are not highly consistent with human perception. Here, we propose a different proxy approach to optimize image analysis networks against quantitative perceptual models. Specifically, we construct a proxy network, which mimics the perceptual model while serving as a loss layer of the network. We experimentally demonstrate how this optimization framework can be applied to train an end-to-end optimized image compression network. By building on top of modern deep image compression models, we are able to demonstrate an average bitrate reduction of 28.7% over MSE optimization, given a specified perceptual quality (VMAF) level.





1. Introduction

Deep neural networks have made rapid advances on diverse multimedia tasks (Liu et al., 2018; Bosse et al., 2018; Paul et al., 2019), especially image transformation problems including denoising (Burger et al., 2012), super-resolution (Lai et al., 2019), frame interpolation (Liu et al., 2019), and so on. Specifically, a generative network is learned to reconstruct high-quality output images from degraded input images in a supervised manner. A loss function is defined to measure the fidelity between the output and a ground-truth image. For example, the denoising task aims to reconstruct a noise-free image from a noisy image, and Convolutional Neural Networks (CNNs) have been shown to provide good noisy-to-pristine mapping functions. However, despite the tremendous amount of research applied to deep learning image transformation problems, the loss functions used to guide model training have been underexamined and largely limited to the ℓ_p norm family. The structural similarity quality index (SSIM) (Wang et al., 2004) has also been adopted as a loss function for several image reconstruction tasks (Snell et al., 2017; Zhao et al., 2017), owing to its perceptual relevance and good analytic properties, such as differentiability.

Figure 1. Conceptual framework of perceptual optimization using a proxy network: A generative network takes a mini-batch $\mathbf{x}$ as input and outputs a reconstructed batch $\hat{\mathbf{x}}$ during training. The proxy network $f_p$ is learned as a proxy of an image quality model $M$, where the output $f_p(\mathbf{x}, \hat{\mathbf{x}})$ estimates the quality score predicted by $M$. The generative network learns to maximize $f_p(\mathbf{x}, \hat{\mathbf{x}})$.

As a long-standing research problem, predicting picture quality given high-quality reference pictures has achieved remarkable success. Numerous full-reference (FR) perceptual models have been proposed and proven to surpass MSE-based measurements. Examples include other SSIM-type methods (Wang et al., 2003; Wang and Li, 2011; Pei and Chen, 2015), VIF (Sheikh and Bovik, 2006), VSNR (Chandler and Hemami, 2007), MAD (Larson and Chandler, 2010), FSIM (Zhang et al., 2011), and VSI (Zhang et al., 2014). Moreover, learning-based quality predictors such as Video Multimethod Assessment Fusion (VMAF) (Li et al., 2018), a successful open-source example developed by Netflix, have been powerful tools for optimizing tremendous volumes of internet video traffic. Unfortunately, most of the advanced, high-performance image quality indices have never been adopted as loss functions for end-to-end optimization networks, because they are generally non-differentiable and functionally complex.

Recently, lossy image compression models have been realized using deep neural network architectures. This may be viewed as a special case of generative networks, where the input image is equal to the ground-truth image. Unlike conventional image codec standards, which rely on “handcrafted” functional blocks such as transform matrices or in-loop filters, the parameters of learned image compression models are optimized in an end-to-end manner. Most of these have employed deep auto-encoders. For example, Ballé et al. (Ballé et al., 2017) proposed a general infrastructure for optimizing image compression where bitrate is estimated and considered during training. In (Ballé et al., 2018), this model is improved by incorporating an additional network, a scale hyperprior, into the compression framework. The authors use the additional network to estimate the standard deviation of the quantized coefficients to further improve coding efficiency. Later, Minnen et al. (Minnen et al., 2018) exploited a PixelCNN layer, which they combined with an autoregressive hyperprior. Beyond these early efforts, other recent approaches have adopted more complex network architectures such as recurrent neural networks (RNNs) (Johnston et al., 2018), convolutional autoencoders (CAEs) (Cheng et al., 2019b), and generative adversarial networks (GANs) (Agustsson et al., 2019). Some work has also been done to extend these ideas to the deep video compression problem (Cheng et al., 2019c).

In fact, the idea of optimizing conventional codecs such as JPEG or H.264/AVC against perceptual models like SSIM, VIF, or VMAF has been deeply studied (Channappayya et al., 2008; Huang et al., 2010; Wang et al., 2012; Lu et al., 2020) and implemented in widespread practice (Li et al., 2018). We seek both to extend this concept and to bridge the gap between modern perceptual quality models and deep generative networks. To this end, we explore the potential of adapting sophisticated perceptual picture quality models as loss functions in deep image compression networks. In order to address the aforementioned shortcomings, we propose to simulate the measurements made by a perceptual image quality model using a proxy network. Then, the proxy network is adopted as a perceptual loss function, as illustrated in Fig.


(a) Source image (b) Adversarial example.
Figure 2. An “adversarial” example (kodim01 image) produced by the compression network. The true VMAF score calculated from (a) and (b) is (which indicates a very poor-quality image), while the proxy network predicts a quality score of .

2. Related work

Most of the work on deep image transformation problems has focused on investigating novel network architectures or improving convergence speed. The selection of an appropriate loss function that is consistent with human perception, however, has not been studied much. Here, we review studies that are closely related to perceptual optimization. As tractable tools, SSIM and MS-SSIM have been widely adopted because of the simple analytical form of their gradients and their computational ease. Moreover, their convexity properties (Channappayya et al., 2008; Brunet et al., 2012) make them feasible targets for optimization. Two recent studies adopted structural similarity functions as loss layers of image generation models, obtaining improved results, as validated by a human subjective study (Snell et al., 2017) and by objective evaluation against several other perceptual models (Zhao et al., 2017).

Rather than optimizing a mathematical function, another approach uses a deep neural network to guide the training. Recent experimental studies suggest that the features extracted from a well-trained image classification network have the capability to capture information useful for other perceptual tasks (Zhang et al., 2018). Mathematically, the perceptual loss is defined as

$$ L_{\psi,j}(\mathbf{x}, \hat{\mathbf{x}}) = \frac{1}{N_j} \left\| \psi_j(\mathbf{x}) - \psi_j(\hat{\mathbf{x}}) \right\|_2^2, \tag{1} $$

where $\mathbf{x}$ and $\hat{\mathbf{x}}$ are the reference and generated images, and $\psi_j$ denotes the output feature map of the $j$-th layer with $N_j$ elements of a pre-trained network $\psi$. In practice, the loss computed from the high-level features extracted from a pre-trained VGG classification network (Simonyan and Zisserman, 2015), also called VGG loss, has been commonly adopted for diverse computer vision tasks. The VGG loss has been applied to such diverse tasks as style transfer (Johnson et al., 2016; Gatys et al., 2017), super-resolution (Bruna et al., 2016; Johnson et al., 2016; Ledig et al., 2017; Sajjadi et al., 2017), and image inpainting (Yang et al., 2017). Despite its ubiquity, this “unreasonable” perceptual loss is notorious for creating unpleasant artifacts (Johnson et al., 2016). Most importantly, this edge-sharpening loss function is incapable of optimizing a network toward a specific quality model.
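To make the feature-space loss concrete, the following is a minimal sketch rather than the method used in the cited works. It assumes a squared difference averaged over the feature elements, and it replaces the pre-trained classification network with a fixed random projection followed by a ReLU, since trained VGG weights are not available in a self-contained example. The names `toy_features` and `perceptual_loss` and all sizes are hypothetical.

```python
import numpy as np

def toy_features(img, weights):
    """Stand-in for one layer of a pre-trained network: linear map + ReLU.

    A real perceptual loss would use activations of a trained classifier
    (for example VGG); the random projection here is only a placeholder.
    """
    return np.maximum(weights @ img.reshape(-1), 0.0)

def perceptual_loss(ref, rec, weights):
    """Squared feature difference, averaged over the feature elements."""
    f_ref = toy_features(ref, weights)
    f_rec = toy_features(rec, weights)
    return float(np.sum((f_ref - f_rec) ** 2) / f_ref.size)

rng = np.random.default_rng(0)
weights = rng.standard_normal((32, 8 * 8))        # hypothetical 32-unit layer
ref = rng.random((8, 8))                          # reference patch
rec = ref + 0.05 * rng.standard_normal((8, 8))    # distorted patch

loss_same = perceptual_loss(ref, ref, weights)    # identical inputs
loss_diff = perceptual_loss(ref, rec, weights)    # distorted input
```

The loss is zero for identical inputs and grows as the features diverge, but nothing in it ties the result to a particular quality model, which is the limitation noted above.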

Figure 3. Illustration of the proposed optimization framework. Perceptually training a deep image compression model involves alternating optimization of the compression network $f_c$ and the proxy network $f_p$ of an IQA model $M$. Thin arrows indicate the flow of data in the network, while bold arrows represent the information being delivered to update the complementary network. The convolutional parameters of $f_p$ are denoted by “height width input channel output channel stride padding”.


3. Proposed Perceptual Optimization Framework

Learning a successful CNN model depends highly on the size of the training set. Luckily, training a proxy network on an existing model does not require human-labeled subjective quality scores such as mean opinion scores (MOS), which are often the greatest obstacle to learning DNN-based IQA models (Ghadiyaram and Bovik, 2016; Kim et al., 2017; Ying et al., 2020). Ground truth scores for training the proxy network are easily obtained, given the availability of pristine and distorted patches. To start with, we created a simple network trained on existing datasets (Ma et al., 2017) comprising numerous reference and distorted images. The corresponding metric scores were calculated to serve as the ground truth for training. The proxy network is first trained to predict the metric score given a pristine patch and a distorted patch. Next, the trained proxy network is inserted into the loss layer of the deep compression network with the goal of maximizing the proxy score. Unfortunately, severe complications can arise when applying this straightforward methodology.

We discovered that the deep compression network often generates “adversarial” examples when its loss layer is the output of a pre-trained network having fixed parameters. Figure 2 shows such an “adversarial” example generated by the deep compression network using a proxy network as its loss function. In this example, the proxy network was trained to mimic the VMAF algorithm. However, comparing Fig. 2(a) with Fig. 2(b), it is apparent that the true VMAF score and the proxy VMAF score predicted by the network are very different. This can be understood by considering the training of the network to be an interpolation problem, whereby the neural network maps a test image to an accurate quality score. However, when the input is too different from the training set, the proxy network may produce a poor interpolation result. Additionally, as pointed out in (Cheng et al., 2019a), the conventional distortion types in public domain databases are generally quite different from distortions created by deep neural networks. In this regard, training a proxy network on previously created databases might be suboptimal for this problem.

3.1. Alternating Learning Framework

To tackle this problem, our approach to training an image compression model in a perceptually optimized way is depicted in Fig. 3. The idea is simply to use the adversarial examples, along with their objective quality scores, as additional training data for the proxy network. The proxy network is then updated, enabling it to predict proxy quality much more accurately. This framework involves optimizing an image compression network $f_c$ and a proxy network $f_p$ of an IQA model $M$. In each training iteration, the two networks are alternately updated as follows:

Deep Compression Network. To integrate the proxy network $f_p$ into the update of the compression network $f_c$ given a mini-batch $\mathbf{x}$, the model parameters of $f_p$ are fixed during training. In order to minimize perceptual distortion, the output of $f_p$ becomes part of the objective in the optimization of $f_c$:


By back-propagating through the forward model, the loss derivative is used to drive the compression network $f_c$.

Proxy IQA Network. Given a mini-batch pair $\mathbf{x}$ and $\hat{\mathbf{x}}$ collected from the most recent update of the compression network, the quality scores $M(\mathbf{x}, \hat{\mathbf{x}})$ are calculated. The proxy network $f_p$ is updated to optimally fit $M(\mathbf{x}, \hat{\mathbf{x}})$ given the input $(\mathbf{x}, \hat{\mathbf{x}})$. Note that the compression network is not needed in this part of the training. As may be seen, $f_p$ is incorporated into the training of the compression network. However, it is important to understand that it is not present during the inference phase.

By applying the proposed alternating training, the proxy network is capable of adapting on the fly to newly generated adversarial patches. In addition, exotic artifacts created by deep image compression can be “seen” by the proxy IQA network: the patches reconstructed by the compression network are directly used to update the proxy network, so the aforementioned mismatch between training data and generated distortions is addressed directly.
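The alternation described above can be sketched with a deliberately tiny stand-in problem. Everything here is an assumption made for illustration: the "compression network" is a single gain applied to the input, the black-box metric is a made-up function of the mean squared error, and the proxy is a line in that error which is refit by least squares rather than by a gradient step. The point is only the order of operations: the proxy is frozen while the generator is updated, and the generator is frozen while the proxy is refit on the newest reconstructions.

```python
import numpy as np

def true_metric(x, x_hat):
    """Black-box stand-in for an IQA model: one score in (0, 100] per patch."""
    err = np.mean((x - x_hat) ** 2, axis=1)
    return 100.0 * np.exp(-4.0 * err)

def train_alternating(steps=20, lr=0.002, seed=0):
    rng = np.random.default_rng(seed)
    gain = 0.3          # toy "compression network": x_hat = gain * x
    a, b = 0.0, 0.0     # toy proxy: score = a + b * mse(x, x_hat)
    history = []
    for _ in range(steps):
        x = rng.random((16, 16))    # mini-batch of 16 flattened patches

        # 1) Compression step: proxy parameters (a, b) are held fixed.
        energy = np.mean(x ** 2, axis=1)
        # mse_i = (1 - gain)^2 * energy_i, so the derivative of the mean
        # proxy score with respect to gain is:
        grad = b * (-2.0 * (1.0 - gain)) * np.mean(energy)
        gain += lr * grad           # ascend the proxy score

        # 2) Proxy step: score the fresh reconstructions with the true
        #    metric and refit the proxy; the generator is not updated here.
        x_hat = gain * x
        mse = np.mean((x - x_hat) ** 2, axis=1)
        scores = true_metric(x, x_hat)
        b, a = np.polyfit(mse, scores, deg=1)   # slope first, then intercept
        history.append((gain, float(np.mean(scores))))
    return gain, (a, b), history

gain, proxy, history = train_alternating()
```

Because the line is only locally accurate for the curved metric, it has to be refit as the reconstructions change, which loosely mirrors why a proxy trained once on a fixed database can drift away from the true score.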

Image Dataset                  | Kodak                           | Tecnick                         | NFLX Billboard
                               |    PSNR    SSIM    MSIM    VMAF |    PSNR    SSIM    MSIM    VMAF |    PSNR    SSIM    MSIM    VMAF
JPEG                           |  113.99  129.49  149.86   78.36 |  119.33  218.04  171.59   89.73 |  102.28  143.99  168.20   89.95
JPEG2000                       |  -11.51    6.25   -1.02  -33.39 |  -13.06   -1.55   -8.41  -34.25 |  -27.81    1.43   -3.93  -38.98
HEVC                           |  -26.35   -6.32   -6.12  -28.23 |  -28.32   -8.97  -11.07  -27.65 |  -49.43  -17.12  -16.06  -35.03
HEVC                           |  -27.33  -25.98  -24.97  -42.18 |  -19.48  -28.97  -33.95  -46.67 |  -37.63  -35.41  -33.88  -50.91
BLS                            |   15.89  -21.31  -19.25    7.19 |    8.67  -10.79  -16.11    8.68 |   16.79  -19.01  -17.73    9.75
BLS                            |   11.67  -11.58  -21.77   -0.17 |    4.47  -17.40  -23.50    0.19 |   12.28  -11.59  -23.53    4.34
BLS                            |    5.23   -6.53   -7.78  -23.35 |    6.23   -8.45   -5.97  -23.78 |    7.00   -4.35   -5.43  -21.97
BMSHJ MSE (Ballé et al., 2018) |  -21.46  -10.94  -10.17  -25.78 |  -26.03  -20.22  -16.71  -33.75 |  -36.64  -21.21  -21.08  -38.01
BMSHJ                          |  -15.90  -13.57  -13.17  -47.11 |  -19.64  -23.14  -16.73  -53.18 |  -29.96  -18.87  -19.29  -56.06
Table 1. Comparison of conventional codecs and optimized deep image compression: average change of BD-rate expressed as a percentage, using three different IQA models to train the compression network. The baseline of comparison is the MSE-optimized BLS model (Ballé et al., 2017). Smaller or negative values indicate better coding efficiency.

3.2. Network Architecture

The proxy IQA network $f_p$ takes a reference patch $\mathbf{x}$ and a distorted patch $\hat{\mathbf{x}}$ as input, where both have pixels. They are then concatenated into a 6-channel signal, which is fed into the network as raw input and reduced to a predicted quality score. As depicted in Fig. 3, the network may be as simple as a shallow CNN consisting of three stages of convolution, ReLU nonlinearity, and subsampling. The spatial size is reduced by a factor of after each stage via max pooling layers. The size of the convolution kernels is fixed to for all stages, while the number of filters at the first stage is set to and is increased by a factor of 2 for each subsequent stage. Finally, feature maps with size are flattened and fed to a fully connected layer which yields the output. The parameterization of each layer is detailed in the figure.
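As a rough aid for reading the architecture description, the sketch below tracks tensor shapes through three convolution, ReLU, and max-pooling stages. The patch size, first-stage filter count, and pooling factor used in the example call are hypothetical placeholders, not values from this paper. The helper also assumes "same"-padded convolutions so that only pooling changes the spatial size.

```python
def proxy_shapes(patch_size, first_filters, stages=3, pool=2):
    """Return (height, width, channels) after each conv + ReLU + pool stage.

    Assumes 'same'-padded convolutions, so only pooling changes spatial size.
    """
    shapes = []
    size, channels = patch_size, first_filters
    for _ in range(stages):
        size //= pool                  # max pooling shrinks height and width
        shapes.append((size, size, channels))
        channels *= 2                  # filter count doubles at the next stage
    return shapes

def flattened_length(shape):
    """Number of inputs the fully connected layer would receive."""
    height, width, channels = shape
    return height * width * channels

# Hypothetical sizes chosen only for illustration.
shapes = proxy_shapes(patch_size=256, first_filters=16)
fc_inputs = flattened_length(shapes[-1])
```

With these placeholder values the last stage yields a 32 x 32 x 64 feature map, so the fully connected layer would see 65536 inputs.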

The image compression network comprises an analysis transform ($g_a$) at the encoder side, and a synthesis transform ($g_s$) at the decoder side. Both transform units are implemented as consecutive layers of convolution, down(up)-sampling, and activation. Instead of the commonly used ReLU, a generalized divisive normalization (GDN) transform is adopted as the activation function (Ballé, 2018). It is similar to the local gain control behavior in the human visual system, where visual signals are normalized by their rectified neighbors. Lastly, the functional block “Q/EC” denotes quantization and entropy coding. In this work, we build on two different deep image compression models (Ballé et al., 2017, 2018).
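To illustrate the divisive normalization idea, here is a minimal sketch of a GDN-style activation in its commonly used form, where each channel is divided by the square root of an offset plus a weighted sum of squared channel responses at the same pixel. The parameter names `beta` and `gamma`, their values, and the tensor layout are assumptions for the example; in a trained model these parameters are learned.

```python
import numpy as np

def gdn(x, beta, gamma):
    """Generalized divisive normalization across channels.

    x:     array of shape (height, width, channels)
    beta:  array of shape (channels,), positive offsets
    gamma: array of shape (channels, channels), non-negative weights
    Computes y_i = x_i / sqrt(beta_i + sum_j gamma[i, j] * x_j**2) per pixel.
    """
    pooled = np.einsum("hwj,ij->hwi", x ** 2, gamma)   # weighted energy
    return x / np.sqrt(beta + pooled)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3))      # hypothetical 3-channel feature map
beta = np.ones(3)
gamma = 0.1 * np.ones((3, 3))
y = gdn(x, beta, gamma)
```

Strong responses in neighboring channels enlarge the denominator and so suppress the output, which is the gain-control behavior referred to above.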

3.3. Loss Functions

As illustrated in Fig. 3, let $\mathbf{x}$, $\mathbf{y}$, $\tilde{\mathbf{y}}$, and $\hat{\mathbf{x}}$ be the source batch, latent representation, quantized latent representation, and reconstructed batch, respectively. The model parameters in the analysis and synthesis transforms are collectively denoted by $\theta$. The proxy network $f_p$ has model parameters $\phi$. Given a perceptual metric $M$, the goal is to optimize the full set of parameters $\theta$, $\phi$ such that the learned image codec can generate a reconstructed image $\hat{\mathbf{x}}$ that has a high perceptual quality score $M(\mathbf{x}, \hat{\mathbf{x}})$. Furthermore, the rate should be as small as possible.

Generally, a learned image compression network is optimized by minimizing the objective function defined by

$$ L = \lambda L_d + L_r, \tag{3} $$

which is similar in spirit to rate-distortion optimization (RDO) in conventional codecs. Under this scheme, $L_d$ is the residual between the source patch $\mathbf{x}$ and the reconstructed patch $\hat{\mathbf{x}}$, mapped by

$$ L_d = d(\mathbf{x}, \hat{\mathbf{x}}), \tag{4} $$

where $d$ is a distance function such as mean squared error or mean absolute error. On the other hand, $L_r$ is the rate loss representing the bit consumption of an encoder. We followed the original work in (Ballé et al., 2017), where the rate loss is defined by

$$ L_r = \mathbb{E}\left[ -\log_2 p_{\tilde{\mathbf{y}}}(\tilde{\mathbf{y}}) \right]. \tag{5} $$

The term $p_{\tilde{\mathbf{y}}}$ denotes the entropy model. This entropy term is minimized when the actual marginal distribution of $\tilde{\mathbf{y}}$ and the entropy model $p_{\tilde{\mathbf{y}}}$ coincide. During training, the latent representation $\mathbf{y}$ is quantized to $\tilde{\mathbf{y}}$ by adding i.i.d. uniform noise $\Delta\mathbf{y}$. Then, $\tilde{\mathbf{y}}$ is used to estimate the rate via (5). During inference, unlike in the training phase, normal rounding-based quantization is applied to quantize $\mathbf{y}$. Then, entropy coders such as variable length coding or arithmetic coding can be used to losslessly encode the discrete-valued data into the bitstream.
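The relationship between the noisy training proxy and rounding at inference can be checked numerically. The sketch below is an illustration under stated assumptions, not the entropy model of the cited work: it assumes Gaussian latents with a known scale `sigma` in place of a learned density, and uniform noise on the interval from -0.5 to 0.5. Under those assumptions the density of the noisy latent at an integer equals the probability of rounding to that integer, so one likelihood function serves both phases.

```python
import math
import random

def normal_cdf(t, sigma):
    return 0.5 * (1.0 + math.erf(t / (sigma * math.sqrt(2.0))))

def likelihood(t, sigma):
    """Density of y + u for y ~ N(0, sigma^2), u ~ U(-0.5, 0.5).

    At integer t it equals the probability that round(y) == t.
    """
    return normal_cdf(t + 0.5, sigma) - normal_cdf(t - 0.5, sigma)

def rate_bits(values, sigma):
    """Average of -log2 p(value): the rate estimate in bits per element."""
    return sum(-math.log2(likelihood(v, sigma)) for v in values) / len(values)

random.seed(0)
sigma = 3.0     # hypothetical latent scale
latents = [random.gauss(0.0, sigma) for _ in range(20000)]

# Training: additive uniform noise keeps the rate differentiable in y.
noisy = [y + random.uniform(-0.5, 0.5) for y in latents]
train_rate = rate_bits(noisy, sigma)

# Inference: hard rounding gives the symbols an entropy coder would encode.
rounded = [float(round(y)) for y in latents]
test_rate = rate_bits(rounded, sigma)
```

The two estimates agree closely, which is why the noisy rate can stand in for the true coding cost during gradient-based training.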

Rather than just minimizing an $\ell_p$ norm between $\mathbf{x}$ and $\hat{\mathbf{x}}$, we introduce a loss term $L_p$. This proxy loss is defined to maximize the output of the proxy network $f_p$, denoted by $f_p(\mathbf{x}, \hat{\mathbf{x}}; \phi)$, with fixed network parameters $\phi$:

$$ L_p = M_{\max} - f_p(\mathbf{x}, \hat{\mathbf{x}}; \phi). \tag{6} $$

Here $M_{\max}$ denotes the upper bound of the model $M$, which is a constant in the loss function. Finally, the total loss function for optimizing the compression network is the weighted combination of the losses:

$$ L_t = \lambda \left[ \alpha L_p + (1 - \alpha) L_d \right] + L_r, \tag{7} $$

where $\lambda$ controls the trade-off between bitrate and distortion of the encoded bitstream, and $\alpha$ weights $L_p$ against $L_d$. Here, the term $L_d$ plays a different role, acting as a regularization term. Since the proxy network is updated at each step, the loss function also changes. The pixel loss $L_d$ serves to stabilize the training process. In our model, we empirically set and when optimizing for VMAF.
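A small numeric sketch may help show how the three terms interact. It assumes the total loss has the form λ[α·L_p + (1 − α)·L_d] + L_r with L_p equal to the metric's upper bound minus the proxy score, which is one reading of the description above. The numbers in the example calls are hypothetical, and the upper bound of 100 corresponds to the VMAF score range.

```python
def proxy_loss(proxy_score, metric_max=100.0):
    """Distance of the proxy score from the metric's upper bound."""
    return metric_max - proxy_score

def total_loss(proxy_score, pixel_loss, rate, lam, alpha, metric_max=100.0):
    """Weighted combination of proxy, pixel, and rate losses (assumed form)."""
    distortion = (alpha * proxy_loss(proxy_score, metric_max)
                  + (1.0 - alpha) * pixel_loss)
    return lam * distortion + rate

# Hypothetical values for illustration only.
mixed = total_loss(proxy_score=80.0, pixel_loss=10.0, rate=0.5,
                   lam=0.01, alpha=0.5)
pixel_only = total_loss(proxy_score=80.0, pixel_loss=10.0, rate=0.5,
                        lam=0.01, alpha=0.0)
```

Under this assumed form, setting `alpha` to zero removes the proxy term and recovers the conventional rate-distortion objective, while larger values shift weight toward the perceptual target.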

The proxy network $f_p$ aims to mimic an image quality model $M$. While updating $f_p$, we define a metric loss $L_m$ to attain this objective given two image batches $\mathbf{x}$ and $\hat{\mathbf{x}}$:


Note that $\hat{\mathbf{x}}$ is a constant, since it is obtained from the reconstructed patches generated during the most recent update of the compression network.

Figure 4. Rate-Distortion (RD) curves for different image compression algorithms on the image kodim10, measured with VMAF.


bpp / VMAF 0.052 / 27.82 0.050 / 36.39 0.058 / 37.06 0.054 / 39.67
PSNR / SSIM 26.12 / 0.817 26.53 / 0.806 26.32 / 0.840 27.78 / 0.843
Figure 5. Visualization of decoded pictures (kodim10) from different compression models. Each is cropped to patch for display purposes.

4. Experiments

The following subsections thoroughly describe the experimental setup. We also present the quantitative evaluation and subjective comparison.

4.1. Experimental Setup

Implementation Details. We used the TensorFlow framework (version 1.12) to implement the proposed method. The Adam solver (Kingma and Ba, 2015) was used to optimize both the proxy network and the deep compression network, with parameters and a weight decay of . We set the initial learning rates for both networks at fixed values of for the first 2M steps and a lower learning rate of for an additional 100K steps. Thus, the networks were trained for 2.1M iterations of back-propagation. To fairly compare deep image compression models having different loss layers, we used filters at every layer, and trained all of the models using the same number of steps. All of the models were trained using NVIDIA 1080-TI GPU cards.

Training Setup. We used a subset of the processed images from the ImageNet database (Deng et al., 2009) as training data. As described in (Ballé et al., 2017), small amounts of uniform noise were added to the images. The images were then down-sampled by random factors to reduce compression artifacts and high-frequency noise, and randomly cropped to a size of . In each mini-batch, we randomly sampled image patches from the subset. We then cropped the images to patches resulting in an tensor.

Evaluation Datasets. To evaluate various image codecs, we utilized the Kodak dataset of very high quality uncompressed images. This publicly available image set is commonly used to evaluate image compression algorithms and IQA models. We also used a subset of the Tecnick dataset (Asuni and Giachetti, 2014) containing images of resolution , and billboard images collected from the Netflix library (Sinno et al., 2020), yielding images having more diverse resolutions and contents. It should be noted that none of these test images were included in the training sets, to avoid overfitting problems.

Evaluation Setup. As is the common practice in the field of video coding, we measured the objective coding efficiency of each image codec using the Bjøntegaard-Delta bitrate (BD-rate) (G. Bjøntegaard, 2001), which quantifies average differences in bitrate at the same distortion level relative to another reference encoder. To calculate BD-rate, we encoded the images at eight different bitrates, ranging from bpp (bits per pixel) to bpp. In all the experiments conducted, we denote the image compression model (Ballé et al., 2017) as the BLS model. The performances of all of the codecs were compared to the same baseline: the MSE-optimized BLS model. A negative BD-rate means the bitrate was reduced as compared with the baseline. The input image formats used were YUV444 for JPEG and JPEG2000, and both YUV420 and YUV444 for intra-coded HEVC. Lastly, the distortion levels that were used for BD-rate calculation were quantified using PSNR, SSIM, MS-SSIM (also represented by MSIM in the table), and VMAF.
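For readers unfamiliar with BD-rate, the following is a minimal sketch of the usual calculation: fit a cubic polynomial to log-rate as a function of quality for each codec, integrate both fits over the overlapping quality range, and convert the average log-rate difference to a percentage. It is not the evaluation script used for the reported results, and the rate and PSNR values in the example are hypothetical.

```python
import numpy as np

def bd_rate(rate_ref, qual_ref, rate_test, qual_test):
    """Average bitrate change (%) of the test codec at equal quality."""
    log_ref = np.log(np.asarray(rate_ref, dtype=float))
    log_test = np.log(np.asarray(rate_test, dtype=float))

    # Cubic fit of log-rate as a function of quality for each codec.
    fit_ref = np.polyfit(qual_ref, log_ref, 3)
    fit_test = np.polyfit(qual_test, log_test, 3)

    # Integrate both fits over the quality range the curves share.
    lo = max(min(qual_ref), min(qual_test))
    hi = min(max(qual_ref), max(qual_test))
    int_ref = np.polyint(fit_ref)
    int_test = np.polyint(fit_test)
    area_ref = np.polyval(int_ref, hi) - np.polyval(int_ref, lo)
    area_test = np.polyval(int_test, hi) - np.polyval(int_test, lo)

    avg_log_diff = (area_test - area_ref) / (hi - lo)
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)

# Hypothetical eight-point RD curve (bits per pixel, PSNR in dB).
rates = [0.10, 0.15, 0.22, 0.32, 0.46, 0.65, 0.90, 1.20]
psnr = [28.0, 30.0, 32.0, 34.0, 36.0, 38.0, 40.0, 42.0]
halved = [r / 2.0 for r in rates]       # same quality at half the bits

saving = bd_rate(rates, psnr, halved, psnr)
```

A codec that reaches every quality level with half the bits of the reference gets a BD-rate of about -50%, matching the convention that negative values indicate bitrate savings.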

4.2. Comparison with Different Codecs

We comprehensively evaluated perceptual deep compression using different perceptual optimization protocols (highlighted in boldface) against three conventional image codecs: JPEG, JPEG2000, and intra coding of HEVC. Extensive experiments were carried out using three perceptual IQA models as optimization targets. Table 1 tabulates the benchmark study on the aforementioned three datasets. Each cell shows the BD-rate relative to the BLS baseline, with respect to different quality models. We denote an optimized compression model for a given IQA model using (7) and (8) by . In addition to the BLS model, we also deployed the proposed optimization framework on a more sophisticated deep compression model (Ballé et al., 2018) (BMSHJ) to test its generality. We report the BD-rate changes obtained, averaged over all the images in each dataset. These results show that our optimization approach is able to successfully optimize a deep image compression model over different IQA algorithms. Indeed, significant BD-rate reductions were obtained in many cases. Interestingly, unlike the other IQA models used as targets of the proposed optimization, VMAF optimization delivers coding gain with respect to all of the BD-rate measurements except the PSNR BD-rate. This suggests that VMAF is a good optimization target.

As a basic test, we subjectively compare results yielding similar bitrates but different objective quality scores. Figure 5 shows a visual comparison under extreme compression (around 0.05 bpp). The VMAF-optimized model significantly outperformed the MSE-optimized baseline model, delivering performance comparable to HEVC and JPEG2000 with respect to VMAF score and subjective quality. We also plot the corresponding VMAF rate-distortion (RD) curve, a common tool for comparing different encoders, in Fig. 4. We observe that the proposed optimization scheme generally leads to a compression gain in VMAF. At high bitrates, the VMAF-optimized model yielded VMAF scores comparable to those of the baseline MSE-optimized model, while consuming significantly fewer bits. In this particular example, roughly of the bits can be saved without sacrificing perceptual quality.

4.3. Study of Alternating Training

To further understand the behavior of the proposed alternating training, we compared true VMAF scores against proxy VMAF scores in Fig. 6. All of the scores were calculated on the reconstructed patches produced during training. Figure 6(a) shows that the proxy VMAF scores quickly approached , whereas much lower true VMAF scores were assigned to the patches produced by the compression model. This directly reflects the problem we have mentioned (Sec. 3) when a pre-trained proxy network is applied. However, when the reconstructed patches were fed into the proxy network along with their objective quality scores, the proxy network was updated straightaway to predict proxy quality much more accurately. As shown in Fig. 6(b), the true and proxy scores become highly consistent early in the training process.

(a) Model learned with a pre-trained proxy network. (b) Model learned from the proposed alternating training process.
Figure 6. Comparison of two different optimization strategies during the training process. We plot true VMAF scores and proxy VMAF scores (predicted by the proxy network) of the reconstructed batch. The two scores are plotted as mean values (lines) with one standard deviation (shadows).

4.4. Computational Cost

It is critical for a learned image compression model to have execution time comparable to other codecs. We compiled the source code of standard codecs, in order to be able to compare them on the same computer with a GHz CPU and GTX-1080TI GPUs. The results were then calculated by averaging the runtime over all Kodak images under different bitrate settings. The encoding and decoding times of the various compared codecs are reported in Table 2. It may be observed that the time complexity of the MSE-optimized and perceptually optimized models is nearly identical, as they deploy the same network architecture in application. Overall, the runtime of learned image compression models is acceptable and can be further reduced if performed on a GPU. It is worth noting that the BLS models achieve the fastest decoding speed when a GPU is available. It should also be noted that the decoding time of HEVC was estimated from the reference software HM16.9, which might be slow.

Compression model   Device   Encode time   Decode time
JPEG                CPU            43.02         62.88
JPEG2000            CPU            10.80         36.79
HEVC                CPU          4578.57         89.88
BLS MSE             CPU           251.01        117.93
                    GPU           231.62         32.56
BLS                 CPU           246.57        119.02
                    GPU           229.26         29.22
BMSHJ MSE           CPU           351.23        378.68
                    GPU           312.28        344.13
BMSHJ               CPU           367.71        380.04
                    GPU           308.53        341.96
Table 2. Average processing speed in milliseconds for different compression models. Model loading time for deep compression is excluded.

5. Conclusion

In this work, we focused on designing the loss function for deep image compression. In particular, we have presented a framework for perceptual optimization that alternately trains a generative network and a proxy image quality assessment network. When integrated into deep image compression models, our method allows end-to-end training and can provide compression gains with respect to different IQA metrics. We believe that the idea behind the proposed training framework is general. With proper modifications of the framework parameters or the architecture of the proxy network, the approach has the potential to improve a wide variety of image restoration problems that currently rely on weak MSE-based optimization.

This research is supported by Netflix. The authors thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.


  • E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L.V. Gool (2019) Generative adversarial networks for extreme learned image compression. In Proc. IEEE Int. Conf. Comput. Vision, pp. 221–231. Cited by: §1.
  • N. Asuni and A. Giachetti (2014) TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms. In Proc. Eurographics Italian Chapter Conference, pp. 63–70. Cited by: §4.1.
  • J. Ballé, V. Laparra, and E.P. Simoncelli (2017) End-to-end optimized image compression. In Proc. Int. Conf. Learn. Represent., pp. 1–27. Cited by: §1, §3.2, §3.3, Table 1, §4.1, §4.1.
  • J. Ballé, D. Minnen, S. Singh, S.J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. In Proc. Int. Conf. Learn. Represent., pp. 1–23. Cited by: §1, §3.2, Table 1, §4.2.
  • J. Ballé (2018) Efficient nonlinear transforms for lossy image compression. In Proc. IEEE Picture Coding Symp., pp. 248–252. Cited by: §3.2.
  • S. Bosse, D. Maniry, K.-R. Muller, T. Wiegand, and W. Samek (2018) Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Processing 27 (1), pp. 206–219. Cited by: §1.
  • J. Bruna, P. Sprechmann, and Y. LeCun (2016) Super-resolution with deep convolutional sufficient statistics. In Proc. Int. Conf. Learn. Represent., pp. 1–17. Cited by: §2.
  • D. Brunet, E.R. Vrscay, and Z. Wang (2012) On the mathematical properties of the structural similarity index. IEEE Trans. Image Processing 21 (4), pp. 1488–1499. Cited by: §2.
  • H.C. Burger, C.J. Schuler, and S. Harmeling (2012) Image denoising: can plain neural networks compete with BM3D?. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pp. 2392–2399. Cited by: §1.
  • D.M. Chandler and S.S. Hemami (2007) VSNR: a wavelet-based visual signal-to-noise ratio for natural images. IEEE Trans. Image Processing 16 (9), pp. 2284–2298. Cited by: §1.
  • S.S. Channappayya, A.C. Bovik, and R.W. Heath (2008) Rate bounds on SSIM index of quantized images. IEEE Trans. Image Processing 17 (9), pp. 1624–1639. Cited by: §1, §2.
  • Z. Cheng, P. Akyazi, H. Sun, J. Katto, and T. Ebrahimi (2019a) Perceptual quality study on deep learning based image compression. In Proc. IEEE Int. Conf. Image Proc., Cited by: §3.
  • Z. Cheng, H. Sun, M. Takeuchi, and J. Katto (2019b) Energy compaction-based image compression using convolutional AutoEncoder. IEEE Transactions on Multimedia, pp. 1–1. Cited by: §1.
  • Z. Cheng, H. Sun, M. Takeuchi, and J. Katto (2019c) Learning image and video compression through spatial-temporal energy compaction. In Proc. IEEE Conf. Comput. Vision Pattern Recog., Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li (2009) ImageNet: a large-scale hierarchical image database. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pp. 248–255. Cited by: §4.1.
  • G. Bjøntegaard (2001) Calculation of average PSNR differences between RD-curves. document VCEG-M33, ITU-T Video Coding Experts Group (VCEG) Thirteenth Meeting. Cited by: §4.1.
  • L.A. Gatys, A.S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman (2017) Controlling perceptual factors in neural style transfer. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pp. 3730–3738. Cited by: §2.
  • D. Ghadiyaram and A.C. Bovik (2016) Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Processing 25 (1), pp. 372–387. Cited by: §3.
  • Y.-H. Huang, T.-S. Ou, P.-Y. Su, and H.H. Chen (2010) Perceptual rate-distortion optimization using structural similarity index as quality metric. IEEE Trans. Circuits Syst. Video Technol. 20 (11), pp. 1614–1624. Cited by: §1.
  • J. Johnson, A. Alahi, and F.-F. Li (2016) Perceptual losses for real-time style transfer and super-resolution. In Proc. Eur. Conf. Comput. Vision, Vol. 9906, pp. 694–711. Cited by: §2.
  • N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S.J. Hwang, J. Shor, and G. Toderici (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pp. 4385–4393. Cited by: §1.
  • J. Kim, H. Zeng, D. Ghadiyaram, S. Lee, L. Zhang, and A.C. Bovik (2017) Deep convolutional neural models for picture-quality prediction: challenges and solutions to data-driven image quality assessment. IEEE Signal Process. Mag. 34 (6), pp. 130–141. Cited by: §3.
  • D.P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proc. Int. Conf. Learn. Represent., pp. 1–15. Cited by: §4.1.
  • W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang (2019) Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE Trans. on Pattern Anal. Mach. Intell. 41 (11), pp. 2599–2613. Cited by: §1.
  • E.C. Larson and D.M. Chandler (2010) Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging 19 (1), pp. 011006:1–011006:21. Cited by: §1.
  • C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A.P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pp. 105–114. Cited by: §2.
  • Z. Li, C.G. Bampis, J. Novak, A. Aaron, K. Swanson, A. Moorthy, and J.D. Cock (2018) VMAF: the journey continues. The Netflix tech blog. Cited by: §1.
  • T.-M. Liu, C.-H. Tsai, T.-H. Wu, J.-Y. Lin, L.-H. Chen, H.-L. Chou, Y.-C. Chang, and C.-C. Ju (2018) A 0.76 mm2 0.22 nJ/pixel DL-assisted 4k video encoder LSI for quality-of-experience over smartphones. IEEE Solid-State Circuits Lett. 1 (12), pp. 221–224. Cited by: §1.
  • Y.-L. Liu, Y.-T. Liao, Y.-Y. Lin, and Y.-Y. Chuang (2019) Deep video frame interpolation using cyclic frame generation. In Proc. AAAI, pp. 8794–8802. Cited by: §1.
  • K.-S. Lu, A. Ortega, D. Mukherjee, and Y. Chen (2020) Cited by: §1.
  • K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang (2017) Waterloo Exploration Database: new challenges for image quality assessment models. IEEE Trans. Image Processing 26 (2), pp. 1004–1016. Cited by: §3.
  • D. Minnen, J. Ballé, and G.D. Toderici (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems 31, pp. 10771–10780. Cited by: §1.
  • S. Paul, A. Norkin, and A.C. Bovik (2019) Cited by: §1.
  • S.-C. Pei and L.-H. Chen (2015) Image quality assessment using human visual DOG model fused with random forest. IEEE Trans. Image Processing 24 (11), pp. 3282–3292. Cited by: §1.
  • M.S.M. Sajjadi, B. Scholkopf, and M. Hirsch (2017) EnhanceNet: single image super-resolution through automated texture synthesis. In Proc. IEEE Int. Conf. Comput. Vision. Cited by: §2.
  • H.R. Sheikh and A.C. Bovik (2006) Image information and visual quality. IEEE Trans. Image Processing 15 (2), pp. 430–444. Cited by: §1.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In Proc. Int. Conf. Learn. Represent., pp. 1–14. Cited by: §2.
  • Z. Sinno, A.K. Moorthy, J.D. Cock, Z. Li, and A.C. Bovik (2020) Quality measurement of images on mobile streaming interfaces deployed at scale. IEEE Trans. Image Processing 29, pp. 2536–2551. Cited by: §4.1.
  • J. Snell, K. Ridgeway, R. Liao, B.D. Roads, M.C. Mozer, and R.S. Zemel (2017) Learning to generate images with perceptual similarity metrics. In Proc. IEEE Int. Conf. Image Process., pp. 4277–4281. Cited by: §1, §2.
  • S. Wang, A. Rehman, Z. Wang, S. Ma, and W. Gao (2012) SSIM-motivated rate-distortion optimization for video coding. IEEE Trans. Circuits Syst. Video Technol. 22 (4), pp. 516–529. Cited by: §1.
  • Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing 13 (4), pp. 600–612. Cited by: §1.
  • Z. Wang and Q. Li (2011) Information content weighting for perceptual image quality assessment. IEEE Trans. Image Processing 20 (5), pp. 1185–1198. Cited by: §1.
  • Z. Wang, E.P. Simoncelli, and A.C. Bovik (2003) Multi-scale structural similarity for image quality assessment. In Proc. IEEE Asilomar Conf. on Signals, Syst., and Comput., pp. 1398–1402. Cited by: §1.
  • C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li (2017) High-resolution image inpainting using multi-scale neural patch synthesis. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pp. 4076–4084. Cited by: §2.
  • X. Ying, H. Niu, P. Gupta, D. Mahajan, D. Ghadiyaram, and A.C. Bovik (2020) From patches to pictures (PaQ-2-PiQ): mapping the perceptual space of picture quality. In Proc. IEEE Conf. Comput. Vision Pattern Recog. Cited by: §3.
  • L. Zhang, Y. Shen, and H. Li (2014) VSI: a visual saliency-induced index for perceptual image quality assessment. IEEE Trans. Image Processing 23 (10), pp. 4270–4281. Cited by: §1.
  • L. Zhang, L. Zhang, X. Mou, and D. Zhang (2011) FSIM: a feature similarity index for image quality assessment. IEEE Trans. Image Processing 20 (8), pp. 2378–2386. Cited by: §1.
  • R. Zhang, P. Isola, A.A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pp. 586–595. Cited by: §2.
  • H. Zhao, O. Gallo, I. Frosio, and J. Kautz (2017) Loss functions for image restoration with neural networks. IEEE Trans. on Computational Imaging 3 (1), pp. 47–57. Cited by: §1, §2.