
A Loss Function for Generative Neural Networks Based on Watson's Perceptual Model

Training Variational Autoencoders (VAEs) to generate realistic imagery requires a loss function that reflects human perception of image similarity. We propose such a loss function based on Watson's perceptual model, which computes a weighted distance in frequency space and accounts for luminance and contrast masking. We extend the model to color images, increase its robustness to translation by using the Fourier Transform, remove artifacts due to splitting the image into blocks, and make it differentiable. In experiments, VAEs trained with the new loss function generated realistic, high-quality image samples. Compared to using the Euclidean distance and the Structural Similarity Index, the images were less blurry; compared to deep neural network based losses, the new approach required fewer computational resources and generated images with fewer artifacts.




1 Introduction

Variational Autoencoders (VAEs) Kingma and Welling (2013) are generative neural networks that learn a probability distribution $p(x)$ over a data space $\mathcal{X}$ from training data. New samples are generated by drawing a latent variable $z$ from a prior distribution $p(z)$ and using $z$ to sample from a conditional decoder distribution $p(x|z)$. The choice of $p(x|z)$ induces a similarity measure on $\mathcal{X}$. A generic choice is a normal distribution $\mathcal{N}(x \mid g(z), \sigma^2 I)$ with a fixed variance $\sigma^2$, where $g$ denotes the decoder network. In this case the underlying energy function is the squared loss $\|x - g(z)\|_2^2$. Thus, the model assumes that for two samples which are sufficiently close to each other, the similarity measure can be well approximated by the squared loss. The choice of $p(x|z)$ is crucial for the generative model. For image generation, traditional pixel-by-pixel losses such as the squared loss are popular because of their simplicity, ease of use and efficiency Hou et al. (2017). However, they perform poorly at modeling the human perception of image similarity Zhang et al. (2018), and most VAEs trained with such losses produce images that look blurred Dosovitskiy and Brox (2016); Hou et al. (2017).
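The correspondence between a fixed-variance Gaussian decoder and the squared loss can be checked numerically. The following sketch (ours, not part of the paper) verifies that the Gaussian negative log-likelihood and the squared loss differ only by a constant, so minimizing one minimizes the other:

```python
import numpy as np

def gaussian_nll(x, mean, sigma=1.0):
    """Negative log-likelihood of x under an isotropic Gaussian N(mean, sigma^2 I)."""
    d = x.size
    return (0.5 * np.sum((x - mean) ** 2) / sigma ** 2
            + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2))

rng = np.random.default_rng(0)
x, g_z = rng.standard_normal(16), rng.standard_normal(16)

# The NLL equals half the squared loss plus a term independent of x and g(z),
# so maximizing the likelihood is equivalent to minimizing the squared loss.
nll = gaussian_nll(x, g_z)
squared_loss = 0.5 * np.sum((x - g_z) ** 2)
offset = 0.5 * x.size * np.log(2.0 * np.pi)
```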

Accordingly, perceptual loss functions for VAEs are an active research area. These loss functions fall into two broad categories: explicit models, as exemplified by the Structural Similarity Index (SSIM) Wang et al. (2004), and learned models. The latter include models based on deep feature embeddings extracted from image classification networks Hou et al. (2017); Zhang et al. (2018); Kettunen et al. (2019), as well as combinations of VAEs with discriminator networks of Generative Adversarial Networks (GANs) Goodfellow et al. (2014); Larsen et al. (2016); Mathieu et al. (2016).

Perceptual loss functions based on deep neural networks, which we refer to as deeploss approaches, have produced promising results. However, features optimized for one task need not be a good choice for a different task. Our experimental results suggest that deeploss metrics optimized on specific datasets may not generalize to broader categories of images. We argue that using features from networks pre-trained for image classification in loss functions for training VAEs for image generation may be problematic, because invariance properties beneficial for classification make it difficult to capture details required to generate realistic images.

In this work, we introduce a loss function based on Watson’s visual perception model Watson (1993), an explicit perceptual model used in image compression and digital watermarking Li and Cox (2007). The model accounts for the perceptual phenomena of sensitivity, luminance masking, and contrast masking. It computes the loss as a weighted distance in frequency space based on a Discrete Cosine Transform (DCT). We optimize the Watson model for image generation by (i) replacing the DCT with the discrete Fourier Transform (DFT) to improve robustness against translational shifts, (ii) extending the model to color images, (iii) replacing the fixed grid in the block-wise computations by a randomized grid to avoid artifacts, and (iv) replacing the maximum operator with a smooth maximum to make the loss function differentiable. We trained the free parameters of our model and several competitors using human similarity judgement data (Zhang et al. (2018); see Figure 1 for examples). We applied the trained similarity measures to image generation of numerals and celebrity faces. The modified Watson model generalized well to the different image domains and resulted in imagery exhibiting less blur and far fewer artifacts compared to alternative approaches.

Figure 1: Similarity judgement of tested metrics on selected images (see Section 4.1 for details). The proposed Watson-DFT metric can model spatial variations, yet penalizes image degradation through noise and graphic artifacts. A deep neural network based metric pre-trained on classification tasks is largely invariant to image quality, leading to more artifacts when employed in generation tasks. We refer to Supplement G for additional random examples.

2 Background

In this section we briefly review variational autoencoders and Watson’s perceptual model.

Variational Autoencoders

Samples from VAEs Kingma and Welling (2013) are drawn from $p(x|z)$ with $z \sim p(z)$, where $p(z)$ is a prior distribution that can be freely chosen and $p(x|z)$ is typically modeled by a deep neural network. The model is trained using a variational lower bound on the likelihood

$$\mathcal{L}(x) = \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] - \lambda\, D_{\mathrm{KL}}\big(q(z|x)\,\|\,p(z)\big) \qquad (1)$$

where $q(z|x)$ is an encoder function designed to approximate $p(z|x)$ and $\lambda$ is a scaling factor. We choose $p(z) = \mathcal{N}(0, I)$ and $q(z|x) = \mathcal{N}(\mu(x), \Sigma(x))$, where the covariance matrix $\Sigma(x)$ is restricted to be diagonal and both $\mu$ and $\Sigma$ are modelled by deep neural networks.

Loss functions for VAEs

It is possible to incorporate a wide range of loss functions into VAE training. If we choose $p(x|z) \propto \exp(-d(x, g(z)))$, where $g$ is a neural network and we ensure that $d$ leads to a proper probability distribution, the first term of (1) becomes

$$\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] = -\mathbb{E}_{q(z|x)}\big[d(x, g(z))\big] + \mathrm{const.} \qquad (2)$$

Choosing $d$ freely comes at the price that we typically lose the ability to sample from $p(x|z)$ directly; therefore, Markov Chain Monte Carlo methods are applied. In most applications, however, it is assumed that $g(z)$ is a good approximation of a sample from $p(x|z)$, and most articles present means instead of samples. Typical choices for $d$ are the squared loss and $L_p$-norms $\|x - g(z)\|_p$. A more advanced choice is the Structural Similarity Index (SSIM) Wang et al. (2004), which models perceived image fidelity. We refer to Section A in the supplementary material for a description of SSIM.

Another approach to define loss functions is to extract features using a deep neural network and to measure the differences between the features of the original and reconstructed images Hou et al. (2017). In Hou et al. (2017), it is proposed to consider the first five layers of VGGNet Simonyan and Zisserman (2015). In Zhang et al. (2018), different feature extraction networks, including AlexNet Krizhevsky et al. (2012) and SqueezeNet Iandola et al. (2016), are tested. Furthermore, the metrics are improved by weighting each feature based on data from human perception experiments (see Section 4.1). With adaptive weights $w_l$ for each feature map, the resulting loss function reads

$$d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \big\| w_l \odot \big(\hat{y}^l_{hw} - \hat{y}^l_{0hw}\big) \big\|_2^2 \qquad (3)$$

where $H_l$, $W_l$ and $C_l$ are the height, width and number of channels (feature maps) in layer $l$. The normalized $C_l$-dimensional feature vectors are denoted by $\hat{y}^l_{hw}$ and $\hat{y}^l_{0hw}$, where $\hat{y}^l_{hw}$ contains the features of image $x$ in layer $l$ at spatial coordinates $(h, w)$ (see Zhang et al. (2018) for details).
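For concreteness, the weighted feature distance can be sketched in NumPy as follows. This is our illustration, not the reference implementation: the feature maps passed in are stand-ins for real network activations, and the helper name `weighted_feature_distance` is ours.

```python
import numpy as np

def weighted_feature_distance(feats_x, feats_y, weights):
    """LPIPS-style distance: channel-weighted squared difference of
    unit-normalized feature vectors, averaged over spatial positions
    and summed over layers. Each feats_*[l] has shape (H_l, W_l, C_l)."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        # unit-normalize each spatial feature vector along the channel axis
        fx = fx / (np.linalg.norm(fx, axis=-1, keepdims=True) + 1e-10)
        fy = fy / (np.linalg.norm(fy, axis=-1, keepdims=True) + 1e-10)
        H, W, C = fx.shape
        total += np.sum(w * (fx - fy) ** 2) / (H * W)
    return total
```

In the learned variant, the per-channel weights `w` are fit to human judgement data rather than fixed.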

Watson’s Perceptual Model

Watson’s perceptual model of the human visual system Watson (1993) describes an image as a composition of base images of different frequencies. It accounts for the perceptual impact of luminance masking, contrast masking, and sensitivity. Input images are first divided into $K$ disjoint blocks of $B \times B$ pixels, where $B = 8$. Each block is then transformed into frequency space using the DCT. We denote the $(i,j)$-th DCT coefficient of the $k$-th block by $c_{ijk}$ for $i, j \in \{0, \dots, B-1\}$ and $k \in \{1, \dots, K\}$.
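The block decomposition and transform can be sketched as follows (our illustration, assuming a greyscale image whose dimensions are cropped to full blocks): an orthonormal DCT-II basis is built as a matrix so that each 8×8 block is transformed by two matrix products.

```python
import numpy as np

def dct_matrix(B=8):
    """Orthonormal DCT-II basis matrix of size B x B."""
    j = np.arange(B)
    M = np.sqrt(2.0 / B) * np.cos(np.pi * (2 * j[None, :] + 1) * j[:, None] / (2 * B))
    M[0, :] = np.sqrt(1.0 / B)  # constant (d.c.) basis row
    return M

def block_dct(img, B=8):
    """Split a greyscale image into disjoint B x B blocks and DCT-transform
    each block. Returns an array of shape (K, B, B) of DCT coefficients."""
    H, W = img.shape
    M = dct_matrix(B)
    blocks = (img[:H // B * B, :W // B * B]
              .reshape(H // B, B, W // B, B).swapaxes(1, 2).reshape(-1, B, B))
    return M @ blocks @ M.T  # 2D DCT broadcast over all blocks
```

Because `M` is orthogonal, the inverse transform is simply `M.T @ C @ M` per block.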

The Watson model computes the loss as a weighted $L_p$-norm (typically $p = 4$) in frequency space

$$D(x, \tilde{x}) = \left( \sum_{i,j,k} \left| \frac{c_{ijk} - \tilde{c}_{ijk}}{s_{ijk}} \right|^p \right)^{1/p} \qquad (4)$$

where $s_{ijk}$ is derived from the DCT coefficients $c_{ijk}$ of the original image. The loss is not symmetric, as $\tilde{x}$ does not influence $s_{ijk}$. To compute $s_{ijk}$, an image-independent sensitivity table $T = (t_{ij})$ is defined. It stores the sensitivity of the image to changes in its individual DCT components. The table is a function of a number of parameters, including the image resolution and the distance of an observer to the image. It can be chosen freely depending on the application; a popular choice is given in Cox et al. (2008). Watson’s model adjusts $T$ for each block according to the block’s luminance. The luminance-masked threshold is given by

$$t^L_{ijk} = t_{ij} \left( \frac{c_{00k}}{\bar{c}_{00}} \right)^{\alpha} \qquad (5)$$

where $\alpha$ is a constant with a suggested value of $0.649$, $c_{00k}$ is the d.c. coefficient (average brightness) of the $k$-th block in the original image, and $\bar{c}_{00}$ is the average luminance of the entire image. As a result, brighter regions of an image are less sensitive to changes.

Contrast masking accounts for the reduction in visibility of one image component by the presence of another. If a DCT frequency is strongly present, an absolute change in its coefficient is less perceptible compared to when the frequency is less pronounced. Contrast masking gives

$$s_{ijk} = \max\!\Big( t^L_{ijk}, \; |c_{ijk}|^{w} \big(t^L_{ijk}\big)^{1-w} \Big) \qquad (6)$$

where the constant $w$ has a suggested value of $0.7$.
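Putting the sensitivity, luminance-masking and contrast-masking steps together, a NumPy sketch of the base model might look as follows. This is our illustration: it assumes the DCT coefficients are already computed, uses the suggested constants, and (in the usage below) takes a uniform sensitivity table rather than a calibrated one.

```python
import numpy as np

def watson_distance(C_dist, C_ref, T, alpha=0.649, w=0.7, p=4.0):
    """Sketch of Watson's distance between the DCT coefficients of a distorted
    image (C_dist) and the original (C_ref), both of shape (K, 8, 8), with an
    (8, 8) sensitivity table T. alpha, w and p use the suggested values."""
    c00 = C_ref[:, 0, 0]                  # d.c. coefficient (brightness) per block
    c00_bar = c00.mean()                  # average luminance of the whole image
    # luminance masking: brighter blocks tolerate larger changes
    t_lum = T[None, :, :] * (c00[:, None, None] / c00_bar) ** alpha
    # contrast masking: strongly present frequencies tolerate larger changes
    s = np.maximum(t_lum, np.abs(C_ref) ** w * t_lum ** (1.0 - w))
    # weighted p-norm over all frequencies and blocks; only C_ref shapes s,
    # which makes the distance asymmetric, as noted in the text
    return float(np.sum(np.abs((C_dist - C_ref) / s) ** p) ** (1.0 / p))
```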

3 Modified Watson’s Perceptual Model

A differentiable model

To make the loss function differentiable we replace the maximization in the computation of by a smooth-maximum function and the equation for becomes


For numerical stability, we introduce a small constant and arrive at the trainable Watson-loss for the coefficients of a single channel


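The smooth maximum can be illustrated with a small helper. The softmax-weighted mean below is one common differentiable surrogate for the maximum; the exact variant used is a modeling choice.

```python
import numpy as np

def smooth_max(a, b):
    """Differentiable surrogate for max(a, b): the softmax-weighted mean.
    Gradients flow through both arguments, unlike the hard maximum."""
    m = np.maximum(a, b)                  # subtract the max for numerical stability
    ea, eb = np.exp(a - m), np.exp(b - m)
    return (a * ea + b * eb) / (ea + eb)
```

The surrogate is symmetric, lies between `min(a, b)` and `max(a, b)`, and approaches the hard maximum as the arguments move apart.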
Extension to color images

Watson’s perceptual model is defined for a single channel (i.e., greyscale). To make the model applicable to color images, we aggregate the losses calculated on the separate channels into a single loss value.¹ We represent color images in the YCbCr format, consisting of the luminance channel Y and the chroma channels Cb and Cr. We calculate the single-channel losses separately and weight the results. Let $D_Y$, $D_{Cb}$, $D_{Cr}$ be the loss values in the luminance, blue-difference and red-difference components for any greyscale loss function. Then the corresponding multi-channel loss is calculated as

$$D_{\mathrm{color}}(x, \tilde{x}) = \lambda_Y D_Y + \lambda_{Cb} D_{Cb} + \lambda_{Cr} D_{Cr} \qquad (9)$$

where the weighting coefficients $\lambda_Y$, $\lambda_{Cb}$, $\lambda_{Cr}$ are learned from data, see below.

¹ Many perceptually oriented image processing domains choose color representations that separate luminance from chroma. For example, the HSV color model distinguishes between hue, saturation, and value, and formats such as Lab or YCbCr distinguish between a luminance value and two color planes Smith (1978). The separation of brightness from color information is motivated by a difference in perception: the luminance of an image has a larger influence on human perception than the chromatic components Schwarz et al. (1987). Perceptual image processing standards such as JPEG compression exploit this by encoding chroma at a lower resolution than luminance Wallace (1992).
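The conversion and aggregation can be sketched as follows. The BT.601 conversion coefficients are standard; the channel weights in `lam` are placeholders for the values learned from data, not the paper's fitted values.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Full-range ITU-R BT.601 RGB -> YCbCr conversion for an (H, W, 3) array."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def multi_channel_loss(x, x_tilde, greyscale_loss, lam=(1.0, 0.1, 0.1)):
    """Weighted sum of a greyscale loss applied to Y, Cb and Cr separately.
    The weights lam are placeholders for coefficients learned from data."""
    xc, tc = rgb_to_ycbcr(x), rgb_to_ycbcr(x_tilde)
    return sum(lam_c * greyscale_loss(xc[..., c], tc[..., c])
               for c, lam_c in enumerate(lam))
```

Any single-channel loss, e.g. the Watson distance above or a simple mean squared error, can be plugged in as `greyscale_loss`.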

Fourier transform

In order to be less sensitive to small translational shifts, we replace the DCT with a discrete Fourier Transform (DFT), which is in accordance with Watson’s original work (e.g., Watson and Ahumada (1985); Watson (1987)); the later use of the DCT was most likely motivated by its application within JPEG Wallace (1992); Watson (1994). The DFT separates a signal into amplitude and phase information. Translation of an image affects the phase, but not the amplitude. We apply Watson’s model to the amplitudes, while we use the cosine distance for changes in phase information. Let $a_{ijk}$ be the amplitudes of the DFT and let $\phi_{ijk}$ be the phase information. We then obtain

$$D_{\mathrm{DFT}}(x, \tilde{x}) = D_{\mathrm{DCT}}(a, \tilde{a}) + \sum_{i,j,k} \lambda^{\phi}_{ij} \big( 1 - \cos(\phi_{ijk} - \tilde{\phi}_{ijk}) \big) \qquad (10)$$

where the $\lambda^{\phi}_{ij}$ are individual weights of the phase distances that can be learned (see below), and $D_{\mathrm{DCT}}$ denotes the loss (8) applied to the DFT amplitudes.

The change of representation from DCT to DFT disentangles amplitude and phase information, but does not increase the number of parameters, as the DFT of a real image results in a Hermitian coefficient matrix (the element in row $i$ and column $j$ is the complex conjugate of the element in row $B - i$ and column $B - j$, with indices taken modulo $B$).
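The shift-robustness argument is easy to verify with NumPy: a circular translation changes only the phase of the 2D DFT and leaves the amplitudes untouched. (Shifts of image content against a block grid are only approximately circular, so in practice the robustness is approximate.)

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))
shifted = np.roll(img, shift=(2, 3), axis=(0, 1))  # circular translation

F, Fs = np.fft.fft2(img), np.fft.fft2(shifted)
amp, amp_s = np.abs(F), np.abs(Fs)
phase, phase_s = np.angle(F), np.angle(Fs)

# Translation changes only the phase; the amplitude spectra are identical.
assert np.allclose(amp, amp_s)
assert not np.allclose(phase, phase_s)

# A cosine distance on the phases, as in the phase term of the loss:
phase_dist = 1.0 - np.cos(phase - phase_s)
```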

Grid translation

Computing the loss from disjoint blocks works for the original application of Watson’s perceptual model, lossy compression. However, a powerful generative model can take advantage of the static blocks, leading to noticeable artifacts at block boundaries. We solve this problem by randomly shifting the block grid in the loss computation during training. The offsets are drawn uniformly from $\{0, \dots, B-1\}$ in both dimensions. In expectation, this is equivalent to computing the loss via a sliding window as in SSIM.
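A minimal sketch of the randomized grid (our illustration; in practice the same offset must be applied to both the original and the reconstructed image so that corresponding blocks are compared):

```python
import numpy as np

def random_block_grid(img, block=8, rng=None):
    """Crop the image to a randomly offset grid of full blocks, so that block
    boundaries differ between training steps. Offsets are drawn uniformly
    from {0, ..., block-1} in both dimensions."""
    if rng is None:
        rng = np.random.default_rng()
    dy, dx = rng.integers(0, block, size=2)
    H, W = img.shape[:2]
    h = (H - dy) // block * block
    w = (W - dx) // block * block
    return img[dy:dy + h, dx:dx + w]
```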

Free parameters

When benchmarking Watson’s perceptual model with the suggested parameters on data from a Two-Alternative Forced-Choice (2AFC) task measuring human perception of image similarity (see Subsection 4.1), we found that the model underestimated differences in images with strong high-frequency components. This allows compression algorithms to improve compression ratios by omitting noisy image patterns, but it does not model the full range of human perception and can be detrimental in image generation tasks, where the underestimation of errors in these frequencies may lead to the generation of an unnatural amount of noise. We solve this problem by training all parameters of all loss variants on the 2AFC dataset (see Section 4.1), including the sensitivity table and masking constants and, for color images, the channel weights $\lambda_Y$, $\lambda_{Cb}$, $\lambda_{Cr}$ and the phase weights.

4 Experiments

We empirically compared our loss functions to traditional as well as deeploss approaches. First, we trained the free parameters of the proposed Watson model as well as of loss functions based on VGGNet Simonyan and Zisserman (2015) and SqueezeNet Iandola et al. (2016) to mimic human perception on data of human perceptual judgements. Next, we applied the similarity metrics as loss functions of VAEs in two image generation tasks. Finally, we evaluated the perceptual performance and investigated individual error cases.

4.1 Training on data from human perceptual experiments

The modified Watson model, referred to as Watson-DFT, as well as Deeploss-VGG and Deeploss-Squeeze have trainable parameters, which we adapted using the same data. For Deeploss-VGG and Deeploss-Squeeze, we followed the methodology called LPIPS (linear) in Zhang et al. (2018) and trained feature weights according to (3) for the first 5 or 7 layers, respectively.

Figure 2: Optimization of a loss function on the 2AFC dataset. The inputs are the original image $x$ and two distorted versions $x_0$ and $x_1$. The loss function is used to calculate the perceptual distances $d_0$ and $d_1$, which are converted into a ranking probability $\hat{h}$. The training loss is the binary cross-entropy between the true human ranking probability $h$ and the predicted ranking probability $\hat{h}$.

We trained on the Two-Alternative Forced-Choice (2AFC) dataset of perceptual judgements published as part of the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset Zhang et al. (2018). Participants were asked which of two distortions $x_0$ and $x_1$ of a color image is more similar to the reference $x$. A human reference judgement $h \in [0, 1]$ is provided, indicating the fraction of human judges who deemed $x_1$ more similar to $x$ than $x_0$.² The dataset contains a total of 151,400 training records and 36,500 test records. Each training record was judged by 2 humans, each test record by 5. The dataset is based on a total of 20 different distortions, with the strength of each distortion randomized per sample. Some distortions can be combined, giving 308 combinations. Figure 1 and Fig. B.7 in the supplementary material show examples.

² The three image patches and the label form a record.

To train a loss function on the 2AFC dataset, we follow the schema outlined in Figure 2. We first compute the perceptual distances $d_0 = d(x, x_0)$ and $d_1 = d(x, x_1)$. Then these distances are converted into a probability $\hat{h}$ that $x_1$ is perceptually more similar to the reference than $x_0$. To calculate the probability based on distance measures, we use

$$\hat{h} = \sigma\!\left( w\, \frac{d_0 - d_1}{d_0 + d_1} \right)$$

where $\sigma$ is the sigmoid function with a learned weight $w$ modelling the steepness of the slope. This computation is invariant to linear transformations of the loss functions.

The training loss between the predicted judgement $\hat{h}$ and the human judgement $h$ is the binary cross-entropy

$$\ell(h, \hat{h}) = -h \log \hat{h} - (1 - h) \log(1 - \hat{h})$$

This objective function was used to adapt the parameters of all considered metrics (used as loss functions in the VAE experiments). We trained the DCT-based loss Watson-DCT and the DFT-based loss Watson-DFT, see (8) and (10), respectively, both for single-channel greyscale input as well as for color images with the multi-channel aggregator (9). We compared our results to the linearly weighted deep loss functions from Zhang et al. (2018), which we reproduced using the original methodology; it differs from (3) only in modeling the feature weighting as a shallow neural network with positive weights.
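The training objective can be sketched as follows. The exact parameterization of the ranking probability is our assumption, chosen so that the prediction is invariant to rescaling of the metric, as described above.

```python
import numpy as np

def ranking_probability(d0, d1, w=1.0, eps=1e-10):
    """Predicted probability that x1 is judged more similar to the reference
    than x0, from distances d0 = d(x, x0) and d1 = d(x, x1). Normalizing by
    d0 + d1 makes the prediction invariant to rescaling of the metric; the
    learned weight w sets the steepness of the slope."""
    return 1.0 / (1.0 + np.exp(-w * (d0 - d1) / (d0 + d1 + eps)))

def bce(h, h_hat, eps=1e-10):
    """Binary cross-entropy between the human judgement h and prediction h_hat."""
    return -(h * np.log(h_hat + eps) + (1.0 - h) * np.log(1.0 - h_hat + eps))
```

Because only the loss-function parameters and `w` enter the computation graph, any differentiable similarity metric can be tuned this way.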

(a) Watson-DFT
(b) SSIM
(c) Deeploss-VGG
(d) Deeploss-Squeeze
Figure 3: Manifolds extracted from the 2-dimensional latent space of VAEs trained with different loss functions. The underlying $z$-values lie on a regular grid in latent space.

4.2 Application to VAEs

We evaluated VAEs trained with loss functions based on the modified Watson model as well as SSIM, Deeploss-VGG and Deeploss-Squeeze. Since quantitative evaluation of generative models is challenging Theis et al. (2016), we qualitatively assessed the generation, reconstruction and latent-value interpolation of each model on two independent datasets.³

We considered the greyscale MNIST dataset LeCun et al. (1998) and the celebA dataset Liu et al. (2015) of celebrity faces. The images of the celebA dataset are of higher resolution and visual complexity than those of MNIST. The latent space dimensionalities of the two models, MNIST-VAE and celebA-VAE, were 2 and 256, respectively.⁴ The optimization algorithm was Adam Kingma and Ba (2015); the initial learning rate was decreased exponentially throughout training, with a separate decay schedule for each dataset. For all models, we first performed a hyper-parameter search over the regularization parameter $\lambda$ in (1): candidate values were trained for a reduced number of epochs on each dataset, and the best performing value was selected by visual inspection of generated samples. The values selected for training the full models are shown in Table C.3 in the supplement. For each loss function, we then trained the MNIST-VAE and the celebA-VAE to completion.

³ We provide the source code for our methods and the experiments, including the scripts that randomly sample from the models to generate the plots in this article. We encourage readers to run the code and generate more samples to verify that the presented results are representative.

⁴ The full architectures are given in supplementary material Appendix C.

Reconstructions of test samples from models trained on celebA are given in Fig. 5. Generated images of all models are given in Fig. 4 and Supplement E. For the two-dimensional latent space of the MNIST model, Fig. 3 shows reconstructions from $z$-values that lie on a regular grid. Additional results showing interpolations and reconstructions of the models are given in Supplement E.

Handwritten digits

The VAE trained with the Watson-DFT captured the MNIST dataset well (see Fig. 3 and supplementary Fig. E.8). The visualization of the latent space shows natural-looking handwritten digits. All generated samples are clearly identifiable as numbers. The model trained with SSIM produced similar results, but edges are slightly less sharp (Fig. E.8). The VAE trained with the Deeploss-VGG metric produced unnatural looking samples, very distinct from the original dataset. Samples generated by VAEs trained with Deeploss-Squeeze were not recognizable as digits. Both deep feature based metrics performed badly on this simple task; they did not generalize to this domain of images, which differs from the 2AFC images used to tune the learned similarity metrics.

Celebrity photos

The model trained with the Watson-DFT metric generated samples of high visual fidelity. Background patterns and haircuts were defined and recognizable, and even strands of hair were partially visible. The images showed no blurring and few artifacts. However, objects lacked fine details like skin imperfections, leading to a smooth appearance. Samples from this generative model overall looked very good and covered the full range of diversity of the original dataset.

The VAE trained with SSIM showed the typical problems of training with traditional losses. Well-aligned components of the images, such as eyes and mouth, were realistically generated. More specific features such as the background and glasses, or features with a greater amount of spatial uncertainty, such as hair, were very blurry or not generated at all. The samples were bland and did not capture the full diversity of the training data. The VAE trained with the Deeploss-VGG metric generated samples and visual patterns of the original dataset very well. Minor details such as strands of hair, skin imperfections, and reflections were generated very accurately. However, very strong artifacts were present (e.g., in the form of grid-like patterns, see Fig. 5 (c)). The VAE trained with Deeploss-Squeeze showed very strong artifacts in reconstructed images as well as generated images (see supplementary Fig. E.11).

(a) Watson-DFT
(b) SSIM
(c) Deeploss-VGG
Figure 4: Random samples decoded from latent values for VAEs trained with different loss functions. For results of Deeploss-Squeeze, we refer to supplementary material Appendix E.
Figure 5: Reconstructions from the celebA test set using VAEs trained with different loss functions.

4.3 Perceptual score

We used the validation part of the 2AFC dataset to compute perceptual scores and investigated similarity judgements on individual samples of the set. Agreement with human judgements is measured as in Zhang et al. (2018): a metric is granted, for each record, a score equal to the fraction of humans that agreed with its prediction.⁵ A human reference score was calculated analogously from the human judgements themselves. The results are summarized in Figure 6. Overall, the scores were similar to the results in Zhang et al. (2018), which verifies our methodology. The explicit approaches (Watson-DCT and SSIM) performed similarly; Watson-DFT performed considerably better, but not as well as Deeploss-VGG or Deeploss-Squeeze. We observe that the ability of metrics to learn perceptual judgement grows with the degrees of freedom (>1000 parameters for deeploss metrics, <100 for Watson-based metrics, none for traditional metrics).

⁵ For example, when a fraction $h$ of humans judged $x_1$ to be more similar to the reference and the metric also predicted $x_1$ to be closer, the metric receives a score of $h$ for this judgement; otherwise it receives $1 - h$.

Inspecting the errors revealed qualitative differences between the metrics; some representative examples are shown in Fig. 1. We observed that the deep networks are good at semantic matching (see the biker in Fig. 1), but underestimate the perceptual impact of graphical artifacts such as noise (see the treeline) and blur. We argue that this is because the features were originally optimized for object recognition, where invariance against distortions and spatial shifts is beneficial. In contrast, the Watson-based metric is sensitive to changes in frequency content (noise, blur) and to large translations.

4.4 Resource requirements

During training, deeploss approaches require considerably more computation time and GPU memory (resources that are then unavailable for the VAE model and data) compared to the other approaches. Section D in the supplementary material summarizes an experimental comparison. For example, evaluation of Watson-DFT was 17 times faster than Deeploss-VGG on greyscale images and required only a few megabytes of GPU memory instead of two gigabytes.

5 Discussion and conclusions


The 2AFC dataset is suitable to evaluate and tune perceptual similarity measures, but it considers a special, limited, partially artificial set of images and transformations. On the 2AFC task, our metric based on Watson’s perceptual model outperformed the simple $L_1$ and $L_2$ metrics as well as the popular structural similarity index SSIM Wang et al. (2004). Learning a metric using deep neural networks on the 2AFC data gave better results on the corresponding test data. This does not come as a surprise given the high flexibility of this purely data-driven approach. However, the resulting neural networks did not work well when used as loss functions for training VAEs, indicating weak generalization beyond the images and transformations in the training data. This is in accordance with (1) the fact that the higher flexibility of Deeploss-Squeeze compared to Deeploss-VGG yields a better fit in the 2AFC task (see also Zhang et al. (2018)) but even worse results in the VAE experiments, and (2) the fact that deeploss approaches profit from extensive regularization, especially by including the squared error in the loss function (e.g., Kettunen et al. (2019)).

Figure 6: Metrics evaluated on the validation part of the 2AFC dataset (mean and variance). Black: Human reference. Grey: Metrics evaluated on greyscale images. Red: Metrics evaluated on color images. Shades group metrics into the categories of non-learned ‘Traditional’ pixel-based metrics, our modified Watson Model and the LPIPS (linear) metrics from Zhang et al. (2018). We refer to supplementary material Appendix F for an evaluation by transformation group.

In contrast, our approach based on Watson’s perceptual model is not very complex (in terms of degrees of freedom) and has a strong inductive bias to match human perception. Therefore it extrapolates much better, in the way expected from a perceptual metric.

Deep neural networks for object recognition are trained to be invariant against translation, noise, blur, and other visual distortions. We observed this invariance against noise and artifacts even after tuning on the data from human experiments (see Fig. 1). While these properties are important for performing well in many computer vision tasks, they are not desirable for image generation: the generator/decoder can exploit these areas of ‘blindness’ of the similarity metric, leading to significantly more visual artifacts in generated samples, as we observed in the image generation experiments.

Furthermore, the computational and memory requirements of neural network based loss functions are much higher compared to SSIM or Watson’s model, to an extent that limits their applicability in generative neural network training.


We introduced a novel image similarity metric and corresponding loss function based on Watson’s perceptual model, which we transformed into a trainable model and extended to color images. We replaced the underlying DCT by a DFT, which disentangles amplitude and phase information, in order to increase robustness against small shifts.

The novel loss function, optimized on data from human experiments, can be used to train deep generative neural networks to produce realistic looking, high-quality samples. It is fast to compute and requires little memory. The new perceptual loss function does not suffer from the blurring effects of traditional similarity metrics like the Euclidean distance or SSIM, and it generates fewer visual artifacts than current state-of-the-art losses based on deep neural networks.

CI acknowledges support by the Villum Foundation through the project Deep Learning and Remote Sensing for Unlocking Global Ecosystem Resource Dynamics (DeReEco).


  • [1] I. Cox, M. Miller, J. Bloom, J. Fridrich, and T. Kalker (2008) Digital watermarking and steganography. The Morgan Kaufmann series in multimedia information and systems, Morgan Kaufmann Publishers. Cited by: §2.
  • [2] A. Dosovitskiy and T. Brox (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 658–666. Cited by: §1.
  • [3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2672–2680. Cited by: §1.
  • [4] X. Hou, L. Shen, K. Sun, and G. Qiu (2017) Deep feature consistent variational autoencoder. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1133–1141. Cited by: §1, §2.
  • [5] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. In Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2, §4.
  • [6] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456. Cited by: Table C.2.
  • [7] M. Kettunen, E. Härkönen, and J. Lehtinen (2019) E-LPIPS: robust perceptual image similarity via random transformation ensembles. CoRR abs/1906.03973. External Links: 1906.03973 Cited by: §1, §5.
  • [8] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference for Learning Representations (ICLR). Cited by: footnote 4.
  • [9] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR). Cited by: §1, §2.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105. Cited by: §2.
  • [11] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2016) Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning (ICML), pp. 1558–1566. Cited by: §1.
  • [12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: Table C.1, §4.2.
  • [13] Q. Li and I. Cox (2007) Using perceptual models to improve fidelity and provide resistance to valumetric scaling for quantization index modulation watermarking. IEEE Transactions on Information Forensics and Security 2 (2), pp. 127–139. Cited by: §1.
  • [14] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: Table C.2, §4.2.
  • [15] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning (ICML), Cited by: Table C.1, Table C.2.
  • [16] M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • [17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems (NeurIPS), Workshop on Automatic Differentiation. Cited by: Appendix D.
  • [18] M. W. Schwarz, W. B. Cowan, and J. C. Beatty (1987-04) An experimental comparison of RGB, YIQ, LAB, HSV, and opponent color models. ACM Transactions on Graphics 6 (2), pp. 123–158. Cited by: footnote 1.
  • [19] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), Cited by: §2, §4.
  • [20] A. R. Smith (1978) Color gamut transform pairs. ACM SIGGRAPH Computer Graphics 12 (3), pp. 12–19. Cited by: footnote 1.
  • [21] L. Theis, A. van den Oord, and M. Bethge (2016) A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR), pp. 1–10. Cited by: §4.2.
  • [22] G. K. Wallace (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38 (1), pp. xviii–xxxiv. Cited by: §3, footnote 1.
  • [23] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: Appendix A, §1, §2, §5.
  • [24] A. B. Watson and A. J. Ahumada (1985) Model of human visual-motion sensing. Journal of the Optical Society of America A 2 (2), pp. 322–342. Cited by: §3.
  • [25] A. B. Watson (1987) The cortex transform: rapid computation of simulated neural images. Computer Vision, Graphics, and Image Processing 39 (3), pp. 311–327. Cited by: §3.
  • [26] A. B. Watson (1993) DCT quantization matrices visually optimized for individual images. In Human vision, visual processing, and digital display IV, Vol. 1913, pp. 202–217. Cited by: §1, §2.
  • [27] A. B. Watson (1994) Image compression using the discrete cosine transform. Mathematica Journal 4 (1), pp. 81–88. Cited by: §3.
  • [28] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure F.12, §1, §1, §2, §4.1, §4.1, §4.1, §4.3, Figure 6, §5.

Appendix A Structural Similarity Loss Function

The Structural Similarity index (SSIM) Wang et al. (2004), which models perceived image fidelity, is a popular loss function for VAE training. In SSIM, a sample is decomposed into blocks and individual channels. Errors are calculated per channel and finally averaged over the entire image. The structural similarity between two blocks x, y is defined as

SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},

with \mu_x denoting the average of x, \mu_y the average of y, \sigma_x^2 the variance of x, \sigma_y^2 the variance of y, and \sigma_{xy} the covariance of x and y. The constants C_1 and C_2 stabilize the division and are calculated depending on the dynamic range of pixel values. We use the recommended values for the parameters K_1, K_2, and block size Wang et al. (2004). Blocks are weighted by a Gaussian sampling function and moved pixel-by-pixel over the image.
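The per-block computation above can be sketched in NumPy. This is a minimal illustration only: the function name `ssim_block` is ours, the Gaussian block weighting is omitted, and the defaults K_1 = 0.01, K_2 = 0.03 are the values commonly recommended in Wang et al. (2004).

```python
import numpy as np

def ssim_block(x, y, dynamic_range=1.0, k1=0.01, k2=0.03):
    """Structural similarity between two equally sized image blocks.

    C1 and C2 are derived from the stabilization parameters k1, k2
    and the dynamic range of pixel values, as in Wang et al. (2004).
    """
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

Identical blocks yield a similarity of exactly 1; the full loss averages 1 − SSIM over all Gaussian-weighted blocks and channels.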

Appendix B 2AFC Data

Figure B.7: Example records from the 2AFC dataset. Top row: original image patches. Rows 2 and 3: distortions (Distortion 1 and Distortion 2). The distortion judged closer to the reference in human trials is marked red.

Appendix C Model Training

MNIST-VAE Input Size Layer
Encoder Conv. , leaky ReLU

Conv. , leaky ReLU
Fully-connected , leaky ReLU
Fully-connected , leaky ReLU
Decoder Fully-connected , leaky ReLU
Fully-connected , leaky ReLU
Conv. , leaky ReLU
Bilinear Upsampling
Conv. , leaky ReLU
Conv. , leaky ReLU
Conv. , Sigmoid
Table C.1: Architecture of the VAE for the MNIST dataset LeCun et al. (1998). All convolutional layers use a stride of 1 and padding of 1. “Leaky ReLU” denotes leaky Rectified Linear Units Maas et al. (2013). Fully-connected layers state the number of hidden neurons.

celebA-VAE Input Size Layer
Encoder Conv. , leaky ReLU
Maxpool, Batch Normalization
Conv. , leaky ReLU
Maxpool, Batch Normalization
Conv. , leaky ReLU
Fully-connected , leaky ReLU
Fully-connected , leaky ReLU
Decoder Fully-connected , leaky ReLU
Fully-connected , leaky ReLU
Conv. , leaky ReLU
Bilinear Upsampling, Batch Normalization
Conv. , leaky ReLU
Bilinear Upsampling, Batch Normalization
Conv. , leaky ReLU
Conv. , leaky ReLU
Conv. , Sigmoid
Table C.2: Architecture of the VAE for the celebA dataset Liu et al. (2015). All convolutional layers use a stride of 1 and padding of 1. “Leaky ReLU” denotes leaky Rectified Linear Units Maas et al. (2013). Fully-connected layers state the number of hidden neurons. We use batch normalization Ioffe and Szegedy (2015).
Model Similarity Metric Hyper-parameter
Table C.3: Hyper-parameters for models trained.

Appendix D Resource Requirements

When applied for training a generative model, the time and memory requirements of computing a loss function and its derivative are important. We measure these requirements by considering a typical learning scenario. Mini-batches of 128 images of size with either one (greyscale) or three (color) channels were forward-fed through the tested loss functions. The loss with respect to one input image was back-propagated, and the image was updated accordingly using stochastic gradient descent. We measured the time for 500 iterations and the maximum GPU memory allocated. Results are averaged over five runs of the experiment. We used PyTorch Paszke et al. (2017), 32-bit precision, and a Tesla P100 GPU. The results are shown in Table D.4. For example, evaluating Watson-DFT took 13 s, five times faster than Deeploss-VGG on color images; this factor increased to 17 on greyscale images. Furthermore, Watson-DFT required only a few megabytes of GPU memory, compared to the 2 gigabytes required by Deeploss-VGG.

Input Metric Runtime (s) Mem. (MB)
Table D.4: Time and GPU memory required for 500 feed-forward and backward-propagation cycles of a batch of 128 images of size with either 1 (greyscale) or 3 (color) channels. Lower values are better.

Appendix E Additional Results

Figure E.8: Reconstruction of samples from the MNIST test set using VAEs trained with different loss functions (top row: ground truth).

Figure E.9: Latent space interpolation between two samples from the MNIST test set (top row: ground truth). Comparison of VAEs trained with different loss functions.

Figure E.10: Latent space interpolation between two samples from the celebA test set (top row: ground truth). Comparison of VAEs trained with different loss functions.
Figure E.11: Random samples decoded from latent values for VAEs trained with Deeploss-Squeeze.

Appendix F Additional 2AFC Metrics

(a) Algorithms
(b) Distortions
Figure F.12: Metrics evaluated on transformation groups of the validation part of the 2AFC dataset (mean and variance). Transformations in (a) were generated by established algorithms (superresolution, frame interpolation, video deblurring, colorization); transformations in (b) by distortions (blur, compression, noise, CNN-based distortions). For more details on data generation see Zhang et al. (2018).


Appendix G Additional 2AFC Judgements













Figure G.13: Similarity judgements on the 2AFC dataset. First row: reference image. Second row: image judged more similar to the reference by the Watson-DFT metric. Third row: image judged more similar by the Deeploss-VGG metric. Red frame: image judged more similar by 5 human judges. The pictured images were selected at random.