1 Introduction
Variational Autoencoders (VAEs) Kingma and Welling (2013)
are generative neural networks that learn a probability distribution over the data from a set of training samples. New samples are generated by drawing a latent variable from a prior distribution and using it to sample from a conditional decoder distribution. The decoder distribution induces a similarity measure on the data space. A generic choice is a normal distribution with a fixed variance; in this case the underlying energy function is the squared Euclidean distance. Thus, the model assumes that for two samples which are sufficiently close to each other, the similarity measure can be well approximated by the squared loss. The choice of similarity measure is crucial for the generative model. For image generation, traditional pixel-by-pixel loss metrics such as the squared loss are popular because of their simplicity, ease of use and efficiency Hou et al. (2017). However, they perform poorly at modeling the human perception of image similarity Zhang et al. (2018). Most VAEs trained with such losses produce images that look blurred Dosovitskiy and Brox (2016); Hou et al. (2017). Accordingly, perceptual loss functions for VAEs are an active research area. These loss functions fall into two broad categories, namely explicit models, as exemplified by the Structural Similarity Index Model (SSIM)
Wang et al. (2004), and learned models. The latter include models based on deep feature embeddings extracted from image classification networks
Hou et al. (2017); Zhang et al. (2018); Kettunen et al. (2019) as well as combinations of VAEs with discriminator networks of Generative Adversarial Networks (GANs) Goodfellow et al. (2014); Larsen et al. (2016); Mathieu et al. (2016).

Perceptual loss functions based on deep neural networks, which we refer to as deep-loss approaches, have produced promising results. However, features optimized for one task need not be a good choice for a different task. Our experimental results suggest that deep-loss metrics optimized on specific datasets may not generalize to broader categories of images. We argue that using features from networks pretrained for image classification in loss functions for training VAEs for image generation may be problematic, because invariance properties beneficial for classification make it difficult to capture the details required to generate realistic images.
In this work, we introduce a loss function based on Watson’s visual perception model Watson (1993), an explicit perceptual model used in image compression and digital watermarking Li and Cox (2007). The model accounts for the perceptual phenomena of sensitivity, luminance masking, and contrast masking. It computes the loss as a weighted distance in frequency space based on a Discrete Cosine Transform (DCT). We optimize the Watson model for image generation by (i) replacing the DCT with the discrete Fourier Transform (DFT) to improve robustness against translational shifts, (ii) extending the model to color images, (iii) replacing the fixed grid in the blockwise computations by a randomized grid to avoid artifacts, and (iv) replacing the maximum operator to make the loss function differentiable. We trained the free parameters of our model and several competitors using human similarity judgement data (Zhang et al. (2018), see Figure 1 for examples). We applied the trained similarity measures to image generation of numerals and celebrity faces. The modified Watson model generalized well to the different image domains and resulted in imagery exhibiting less blur and far fewer artifacts compared to alternative approaches.
2 Background
In this section we briefly review variational autoencoders and Watson’s perceptual model.
Variational Autoencoders
Samples from VAEs Kingma and Welling (2013) are drawn from $p_\theta(x|z)\,p(z)$, where $p(z)$ is a prior distribution that can be freely chosen and $p_\theta(x|z)$ is typically modeled by a deep neural network. The model is trained using a variational lower bound on the likelihood
$\mathcal{L}(x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \lambda\, D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right)$ (1)
where $q_\phi(z|x)$ is an encoder function designed to approximate $p_\theta(z|x)$ and $\lambda$ is a scaling factor. We choose $p(z) = \mathcal{N}(0, I)$ and $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \Sigma_\phi(x))$, where the covariance matrix $\Sigma_\phi(x)$ is restricted to be diagonal and both $\mu_\phi$ and $\Sigma_\phi$ are modelled by deep neural networks.
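As a concrete sketch, the lower bound above can be evaluated in closed form for a diagonal-Gaussian encoder, a standard-normal prior and a fixed-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to constants). The function and variable names below are illustrative, not taken from the paper:

```python
import numpy as np

def neg_elbo(x, x_hat, mu, log_var, lam=1.0):
    """Negative variational lower bound of Eq. (1) for a Gaussian encoder
    q(z|x) = N(mu, diag(exp(log_var))), prior N(0, I) and a fixed-variance
    Gaussian decoder with mean x_hat."""
    # Reconstruction term: -log p(x|z) up to additive/multiplicative constants
    rec = np.sum((x - x_hat) ** 2)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return rec + lam * kl
```

When the encoder matches the prior (mu = 0, log_var = 0), the KL term vanishes and only the squared reconstruction error remains.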
Loss functions for VAEs
It is possible to incorporate a wide range of loss functions into VAE training. If we choose $p_\theta(x|z) \propto \exp(-d(x, g_\theta(z)))$, where $g_\theta$ is a neural network, and we ensure that this choice leads to a proper probability distribution, the first term of (1) becomes
$\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] = -\,\mathbb{E}_{q_\phi(z|x)}\left[d(x, g_\theta(z))\right] + \mathrm{const}$ (2)
Choosing $d$ freely comes at the price that we typically lose the ability to sample from $p_\theta(x|z)$ directly. Therefore, Markov Chain Monte Carlo methods are applied. In most applications, however, it is assumed that the decoder output $g_\theta(z)$ is a good approximation of a sample, and most articles present means instead of samples. Typical choices for $d$ are the squared loss and $\ell_p$ norms. A more advanced choice is the Structural Similarity (SSIM) Wang et al. (2004), which models perceived image fidelity. We refer to Section A in the supplementary material for a description of SSIM.

Another approach to defining loss functions is to extract features using a deep neural network and to measure the differences between the features from original and reconstructed images Hou et al. (2017). In Hou et al. (2017), it is proposed to consider the first five layers of VGGNet Simonyan and Zisserman (2015). In Zhang et al. (2018)
, different feature extraction networks, including AlexNet Krizhevsky et al. (2012) and SqueezeNet Iandola et al. (2016), are tested. Furthermore, the metrics are improved by weighting each feature based on data from human perception experiments (see Section 4.1). With adaptive weights for each feature map, the resulting loss function reads

$d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^l_{hw} - \hat{y}^l_{0,hw} \right) \right\|_2^2$ (3)
where $H_l$, $W_l$ and $C_l$ are the height, width and number of channels (feature maps) in layer $l$. The normalized $C_l$-dimensional feature vectors are denoted by $\hat{y}^l_{hw}$ and $\hat{y}^l_{0,hw}$, where $\hat{y}^l_{hw}$ contains the features of image $x$ in layer $l$ at spatial coordinates $(h, w)$ (see Zhang et al. (2018) for details).

Watson’s Perceptual Model
Watson’s perceptual model of the human visual system Watson (1993) describes an image as a composition of base images of different frequencies. It accounts for the perceptual impact of luminance masking, contrast masking, and sensitivity. Input images are first divided into disjoint blocks of $B \times B$ pixels, where $B = 8$. Each block is then transformed into frequency space using the DCT. We denote DCT coefficient $(i, j)$ of the $k$-th block by $C_{ijk}$, for $i, j \in \{0, \dots, B-1\}$ and $k = 1, \dots, K$.
The Watson model computes the loss as a weighted $p$-norm (typically $p = 4$) in frequency space
$D(X, \tilde{X}) = \left( \sum_{i,j,k} \frac{|C_{ijk} - \tilde{C}_{ijk}|^p}{s_{ijk}^p} \right)^{1/p}$ (4)
where the masking threshold $s_{ijk}$ is derived from the DCT coefficients $C_{ijk}$ of the original image $X$. The loss is not symmetric, as $\tilde{C}_{ijk}$ does not influence $s_{ijk}$. To compute $s_{ijk}$, an image-independent sensitivity table $T = (t_{ij})$ is defined. It stores the sensitivity of the image to changes in its individual DCT components. The table is a function of a number of parameters, including the image resolution and the distance of an observer to the image. It can be chosen freely depending on the application; a popular choice is given in Cox et al. (2008). Watson’s model adjusts $T$ for each block according to the block’s luminance. The luminance-masked threshold is given by
$t^L_{ijk} = t_{ij} \left( \frac{C_{00k}}{\bar{C}_{00}} \right)^{\alpha}$ (5)
where $\alpha$ is a constant with a suggested value of $0.649$, $C_{00k}$ is the d.c. coefficient (average brightness) of the $k$-th block in the original image, and $\bar{C}_{00}$ is the average luminance of the entire image. As a result, brighter regions of an image are less sensitive to changes.
Contrast masking accounts for the reduction in visibility of one image component by the presence of another. If a DCT frequency is strongly present, an absolute change in its coefficient is less perceptible compared to when the frequency is less pronounced. Contrast masking gives
$s_{ijk} = \max\left( t^L_{ijk},\; |C_{ijk}|^{w} \left( t^L_{ijk} \right)^{1-w} \right)$ (6)
where the constant $w$ has a suggested value of $0.7$.
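The pipeline above (blockwise DCT, luminance masking, contrast masking, weighted p-norm pooling) can be sketched as follows. The block size of 8 and the values 0.649, 0.7 and p = 4 follow the suggested values quoted in the text; the helper names and the uniform sensitivity table used in the usage note are our own illustrative choices:

```python
import numpy as np

def dct2(block):
    """Orthonormal 2D DCT-II of a square block (8x8 in Watson's model)."""
    n = block.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    m = scale[:, None] * basis
    return m @ block @ m.T

def watson_distance(x, y, t, alpha=0.649, w=0.7, p=4.0):
    """Sketch of Watson's distance between two greyscale images, Eqs. (4)-(6):
    luminance masking of the sensitivity table t, then contrast masking, then
    a weighted p-norm over all DCT coefficients. Not the paper's exact code."""
    b = t.shape[0]
    nrows, ncols = x.shape[0] // b, x.shape[1] // b
    # average d.c. coefficient (average luminance) over all blocks of x
    c_bar = np.mean([dct2(x[i*b:(i+1)*b, j*b:(j+1)*b])[0, 0]
                     for i in range(nrows) for j in range(ncols)])
    total = 0.0
    for i in range(nrows):
        for j in range(ncols):
            c = dct2(x[i*b:(i+1)*b, j*b:(j+1)*b])
            c_tilde = dct2(y[i*b:(i+1)*b, j*b:(j+1)*b])
            t_l = t * (c[0, 0] / c_bar) ** alpha                   # Eq. (5)
            s = np.maximum(t_l, np.abs(c) ** w * t_l ** (1 - w))   # Eq. (6)
            total += np.sum(np.abs(c - c_tilde) ** p / s ** p)     # Eq. (4)
    return total ** (1.0 / p)
```

With a uniform sensitivity table, identical images have distance zero and any perturbation yields a strictly positive distance.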
3 Modified Watson’s Perceptual Model
A differentiable model
To make the loss function differentiable, we replace the maximization in the computation of $s_{ijk}$ by a smooth-maximum function, $\mathrm{smax}(x_1, \dots, x_n) = \sum_i x_i e^{x_i} / \sum_i e^{x_i}$, and the equation for $s_{ijk}$ becomes
$\tilde{s}_{ijk} = \mathrm{smax}\left( t^L_{ijk},\; |C_{ijk}|^{w} \left( t^L_{ijk} \right)^{1-w} \right)$ (7)
For numerical stability, we introduce a small constant $\varepsilon > 0$ and arrive at the trainable Watson loss for the coefficients of a single channel
$\mathcal{L}_{\mathrm{Watson}}(X, \tilde{X}) = \left( \sum_{i,j,k} \frac{|C_{ijk} - \tilde{C}_{ijk}|^p}{(\tilde{s}_{ijk} + \varepsilon)^p} \right)^{1/p}$ (8)
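One common smooth maximum, a softmax-weighted mean, can stand in for the hard maximum; it is differentiable everywhere and close to max() when one argument dominates. Whether this is exactly the paper's choice is an assumption of this sketch:

```python
import numpy as np

def smax(values):
    """Smooth maximum via softmax weighting. A hedged stand-in for the
    smooth-maximum used in Eq. (7); differentiable, unlike np.max."""
    e = np.exp(values - np.max(values))  # subtract max for numerical stability
    return float(np.sum(values * e) / np.sum(e))
```

For well-separated inputs the result is very close to the hard maximum, and it reduces to the common value when all inputs are equal.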
Extension to color images
Watson’s perceptual model is defined for a single channel (i.e., greyscale). To make the model applicable to color images, we aggregate the losses calculated on multiple separate channels into a single loss value.¹

¹ Many perceptually oriented image processing domains choose color representations that separate luminance from chroma. For example, the HSV color model distinguishes between hue, saturation, and value, and formats such as Lab or YCbCr distinguish between a luminance value and two color planes Smith (1978). The separation of brightness from color information is motivated by a difference in perception: the luminance of an image has a larger influence on human perception than the chromatic components Schwarz et al. (1987). Perceptual image processing standards such as JPEG compression utilize this by encoding chroma at a lower resolution than luminance Wallace (1992).

We represent color images in the YCbCr format, consisting of the luminance channel Y and the chroma channels Cb and Cr. We calculate the single-channel losses separately and weight the results. Let $\mathcal{L}_Y$, $\mathcal{L}_{Cb}$, $\mathcal{L}_{Cr}$ be the loss values in the luminance, blue-difference and red-difference components for any greyscale loss function. Then the corresponding multi-channel loss is calculated as
$\mathcal{L}_{\mathrm{color}} = \lambda_Y \mathcal{L}_Y + \lambda_{Cb} \mathcal{L}_{Cb} + \lambda_{Cr} \mathcal{L}_{Cr}$ (9)
where the weighting coefficients $\lambda_Y$, $\lambda_{Cb}$ and $\lambda_{Cr}$ are learned from data, see below.
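A minimal sketch of the channel aggregation, assuming the standard BT.601 RGB-to-YCbCr conversion and placeholder weights (in the paper the weights are learned from the 2AFC data):

```python
import numpy as np

def rgb_to_ycbcr(img):
    """BT.601 RGB -> YCbCr for values in [0, 1]; one standard realization of
    the luminance/chroma split described in the text."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)
    cr = 0.713 * (r - y)
    return np.stack([y, cb, cr], axis=-1)

def multichannel_loss(loss_fn, a, b, lam=(0.7, 0.15, 0.15)):
    """Eq. (9): weight the single-channel losses on Y, Cb and Cr.
    The weights lam are illustrative placeholders, not learned values."""
    ya, yb = rgb_to_ycbcr(a), rgb_to_ycbcr(b)
    return sum(l * loss_fn(ya[..., c], yb[..., c]) for c, l in enumerate(lam))
```

With a squared-error `loss_fn`, a purely achromatic difference is weighted almost entirely by the luminance coefficient, illustrating the intended emphasis on brightness.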
Fourier transform
In order to be less sensitive to small translational shifts, we replace the DCT with a discrete Fourier Transform (DFT), which is in accordance with Watson’s original work (e.g., Watson and Ahumada (1985); Watson (1987)). The later use of the DCT was most likely motivated by its application within JPEG Wallace (1992); Watson (1994). The DFT separates a signal into amplitude and phase information. Translation of an image affects the phase, but not the amplitude. We apply Watson’s model to the amplitudes, while we use the cosine distance for changes in the phase information. Let $A_{ijk}$ be the amplitudes of the DFT and let $\varphi_{ijk}$ be the phase information. We then obtain
$\mathcal{L}_{\mathrm{Watson\text{-}DFT}}(X, \tilde{X}) = \left( \sum_{i,j,k} \frac{|A_{ijk} - \tilde{A}_{ijk}|^p}{(\tilde{s}_{ijk} + \varepsilon)^p} \right)^{1/p} + \sum_{i,j,k} w_{ij} \left( 1 - \cos(\varphi_{ijk} - \tilde{\varphi}_{ijk}) \right)$ (10)
where the $w_{ij}$ are individual weights of the phase distances that can be learned (see below).
The change of representation from DCT to DFT disentangles amplitude and phase information, but does not increase the number of parameters, as the DFT of a real image results in a Hermitian complex coefficient matrix (i.e., the element in row $i$ and column $j$ is the complex conjugate of the element in row $-i$ and column $-j$, indices taken modulo the block size).
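The shift-robustness argument can be checked directly: a circular translation leaves the DFT amplitudes untouched and moves only the phase, and real input yields the Hermitian symmetry noted above. `amplitude_phase` is an illustrative helper, not the paper's code:

```python
import numpy as np

def amplitude_phase(block):
    """Split a real-valued block into DFT amplitude and phase."""
    f = np.fft.fft2(block)
    return np.abs(f), np.angle(f)
```

Shifting a block with `np.roll` changes the phase while the amplitude spectrum is preserved, and `F[i, j] == conj(F[-i, -j])` holds for real input.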
Grid translation
Computing the loss from disjoint blocks works for the original application of Watson’s perceptual model, lossy compression. However, a powerful generative model can take advantage of the static blocks, leading to noticeable artifacts at block boundaries. We solve this problem by randomly shifting the block grid in the loss computation during training. The offsets are drawn uniformly from $[0, B)$ in both dimensions. In expectation, this is equivalent to computing the loss via a sliding window as in SSIM.
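A sketch of the randomized grid, assuming partial boundary tiles are simply dropped (the paper's exact cropping convention is not specified here):

```python
import numpy as np

def random_grid_blocks(img, block=8, rng=None):
    """Cut an image into block x block tiles under a grid shifted by a random
    offset drawn uniformly per axis, as described for training."""
    if rng is None:
        rng = np.random.default_rng()
    dy, dx = rng.integers(0, block, size=2)  # offsets in [0, block)
    h, w = img.shape[:2]
    return [img[y:y + block, x:x + block]
            for y in range(dy, h - block + 1, block)
            for x in range(dx, w - block + 1, block)]
```

Every returned tile is a full block; only the grid origin varies between training steps.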
Free parameters
When benchmarking Watson’s perceptual model with the suggested parameters on data from a Two-Alternative Forced-Choice (2AFC) task measuring human perception of image similarity (see Subsection 4.1), we found that the model underestimated differences in images with strong high-frequency components. This allows compression algorithms to improve compression ratios by omitting noisy image patterns, but it does not model the full range of human perception and can be detrimental in image generation tasks, where the underestimation of errors in these frequencies might lead to the generation of an unnatural amount of noise. We solve this problem by training all parameters of all loss variants, including the sensitivity table and, for color images, the channel weights and phase weights, on the 2AFC dataset (see Section 4.1).
4 Experiments
We empirically compared our loss functions to traditional as well as deep-loss approaches. First, we trained the free parameters of the proposed Watson model as well as of loss functions based on VGGNet Simonyan and Zisserman (2015) and SqueezeNet Iandola et al. (2016) to mimic human perception on data of human perceptual judgements. Next, we applied the similarity metrics as loss functions of VAEs in two image generation tasks. Finally, we evaluated the perceptual performance and investigated individual error cases.
4.1 Training on data from human perceptual experiments
The modified Watson model, referred to as Watson-DFT, as well as Deeploss-VGG and Deeploss-Squeeze have trainable parameters, which we adapted using the same data. For Deeploss-VGG and Deeploss-Squeeze, we followed the methodology called LPIPS (linear) in Zhang et al. (2018) and trained feature weights according to (3) for the first 5 or 7 layers, respectively.
We trained on the Two-Alternative Forced-Choice (2AFC) dataset of perceptual judgements published as part of the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset Zhang et al. (2018). Participants were asked which of two distortions of a color image is more similar to the reference. A human reference judgement is provided, indicating which of the two distortions the human judges on average deemed more similar.² The dataset contains a total of 151,400 training records and 36,500 test records. Each training record was judged by 2 humans, each test record by 5. The dataset is based on a total of 20 different distortions, with the strength of each distortion randomized per sample. Some distortions can be combined, giving 308 combinations. Figure 1 and Fig. B.7 in the supplementary material show examples.

² The three image patches and the label form a record.
To train a loss function on the 2AFC dataset, we follow the scheme outlined in Figure 2. We first compute the perceptual distances $d_0$ and $d_1$ between the reference and the two distorted images. These distances are then converted into a probability $\hat{h}$ that judges which of the two distortions is perceptually more similar. To calculate this probability from the distance measures, we use
$\hat{h}(d_0, d_1) = \sigma\!\left( w \left( d_0 - d_1 \right) \right)$ (11)

where $\sigma$ is the sigmoid function with a learned weight $w$ modelling the steepness of the slope. This computation is invariant to linear transformations of the loss functions.
The training loss between the predicted judgement $\hat{h}$ and the human judgement $h$ is calculated by the binary cross-entropy:

$\mathcal{L}_{\mathrm{2AFC}} = -\, h \log \hat{h} - (1 - h) \log\left(1 - \hat{h}\right)$ (12)
This objective function was used to adapt the parameters of all considered metrics (used as loss functions in the VAE experiments). We trained the DCT-based loss Watson-DCT and the DFT-based loss Watson-DFT, see (8) and (10), respectively, both for single-channel greyscale input as well as for color images with the multi-channel aggregator (9). We compared our results to the linearly weighted deep loss functions from Zhang et al. (2018), which we reproduced using the original methodology; it differs from (3) only in modelling the feature weighting as a shallow neural network with all positive weights.
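Eqs. (11) and (12) combine into a small training objective. The convention that h = 1 means the second distortion was judged more similar is an assumption of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_afc_loss(d0, d1, h, w=1.0):
    """Eqs. (11)-(12): map the gap between the two perceptual distances
    through a sigmoid with learned steepness w, then score the prediction
    against the human judgement h with binary cross-entropy."""
    p = sigmoid(w * (d0 - d1))  # larger d0 -> second patch predicted closer
    eps = 1e-12                 # guard the logarithms
    return -(h * np.log(p + eps) + (1.0 - h) * np.log(1.0 - p + eps))
```

Equal distances with a split human vote give the maximum-entropy loss log 2, and predictions that agree with a unanimous vote are penalized far less than predictions that contradict it.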
4.2 Application to VAEs
We evaluated VAEs trained with loss functions based on the modified Watson model as well as SSIM, Deeploss-VGG and Deeploss-Squeeze. Since quantitative evaluation of generative models is challenging Theis et al. (2016), we qualitatively assessed the generation, reconstruction and latent-value interpolation of each model on two independent datasets.³
³ We provide the source code for our methods and the experiments, including the scripts that randomly sampled from the models to generate the plots in this article. We encourage readers to run the code and generate more samples to verify that the presented results are representative.

We considered the greyscale MNIST dataset LeCun et al. (1998) and the celebA dataset Liu et al. (2015) of celebrity faces. The images of the celebA dataset are of higher resolution and visual complexity than those of MNIST. The feature-space dimensionalities of the two models, MNIST-VAE and celebA-VAE, were 2 and 256, respectively.⁴

⁴ The full architectures are given in supplementary material Appendix C.

The optimization algorithm was Adam Kingma and Ba (2015). The initial learning rate was and decreased exponentially throughout training by a factor of every epochs for the MNIST-VAE, and every epochs for the celebA-VAE. For all models, we first performed a hyperparameter search over the regularization parameter in (1). We tested for for epochs on the MNIST set and epochs on the celebA set, then selected the best performing hyperparameter by visual inspection of generated samples. The values selected for training the full model are shown in Table C.3 in the supplement. For each loss function, we trained the MNIST-VAE for epochs and the celebA-VAE for epochs.

Results of reconstructed samples from models trained on celebA are given in Fig. 5. Generated images of all models are given in Fig. 5 and Supplement E. For the two-dimensional feature space of the MNIST model, Fig. 3 shows reconstructions from values that lie on a grid over the latent space. Additional results showing interpolations and reconstructions of the models are given in Supplement E.
Handwritten digits
The VAE trained with the Watson-DFT loss captured the MNIST dataset well (see Fig. 3 and supplementary Fig. E.8). The visualization of the latent space shows natural-looking handwritten digits. All generated samples are clearly identifiable as numbers. The model trained with SSIM produced similar results, but edges are slightly less sharp (Fig. E.8). The VAE trained with the Deeploss-VGG metric produced unnatural-looking samples, very distinct from the original dataset. Samples generated by VAEs trained with Deeploss-Squeeze were not recognizable as digits. Both deep-feature-based metrics performed badly on this simple task; they did not generalize to this domain of images, which differs from the 2AFC images used to tune the learned similarity metrics.
Celebrity photos
The model trained with the Watson-DFT metric generated samples of high visual fidelity. Background patterns and haircuts were defined and recognizable, and even strands of hair were partially visible. The images showed no blurring and few artifacts. However, objects lacked fine details like skin imperfections, leading to a smooth appearance. Overall, samples from this generative model looked very good and covered the full range of diversity of the original dataset.
The VAE trained with SSIM showed the typical problems of training with traditional losses. Well-aligned components of the images, such as eyes and mouth, were realistically generated. More specific features such as the background and glasses, or features with a greater amount of spatial uncertainty, such as hair, were very blurry or not generated at all. The samples were bland and did not capture the full diversity of the training data. The VAE trained with the Deeploss-VGG metric generated samples that reproduced the visual patterns of the original dataset very well. Minor details such as strands of hair, skin imperfections, and reflections were generated very accurately. However, very strong artifacts were present (e.g., in the form of grid-like patterns, see Fig. 5 (c)). The VAE trained with Deeploss-Squeeze showed very strong artifacts in reconstructed as well as generated images (see supplementary Fig. E.11).
4.3 Perceptual score
We used the validation part of the 2AFC dataset to compute perceptual scores and investigated similarity judgements on individual samples of the set. The agreement with human judgements is measured as in Zhang et al. (2018).⁵ A human reference score was calculated using . The results are summarized in Figure 6. Overall, the scores were similar to the results in Zhang et al. (2018), which verifies our methodology. We can see that the explicit approaches ( and SSIM) performed similarly. Watson-DFT performed considerably better, but not as well as Deeploss-VGG or Deeploss-Squeeze. We observe that the ability of the metrics to learn perceptual judgement grows with the degrees of freedom (>1000 parameters for deep-loss metrics, <100 for Watson-based metrics, none for traditional metrics).

⁵ For example, when of humans judged to be more similar to the reference, we have . If the metric predicted to be closer, , and we grant it score for this judgement.
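The per-record scoring rule sketched in the footnote can be written compactly; the vote convention (h as the share of votes for the second distortion) and the form of the human baseline are assumptions here:

```python
def afc_score(d0, d1, h):
    """Per-record agreement between a metric and the human votes: the metric
    earns the fraction of votes that picked the patch it also picked."""
    return h if d1 < d0 else 1.0 - h

def human_baseline(h):
    """Expected score of a judge who votes like the human population itself,
    h^2 + (1 - h)^2; it equals 1 only when the humans are unanimous."""
    return h * h + (1.0 - h) * (1.0 - h)
```

The baseline explains why even perfect metrics cannot reach a score of 1 on records where humans disagree.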
Inspecting the errors revealed qualitative differences between the metrics; some representative examples are shown in Fig. 1. We observed that the deep networks are good at semantic matching (see the biker in Fig. 1), but underestimate the perceptual impact of graphical artifacts such as noise (see the treeline) and blur. We argue that this is because the features were originally optimized for object recognition, where invariance against distortions and spatial shifts is beneficial. In contrast, the Watson-based metric is sensitive to changes in frequency content (noise, blur) and to large translations.
4.4 Resource requirements
During training, deep-loss approaches require considerably more computation time and GPU memory (resources that are then unavailable to the VAE model and the data) than the other approaches. Section D in the supplementary material summarizes an experimental comparison. For example, evaluation of Watson-DFT was 17 times faster than Deeploss-VGG on greyscale images and required only a few megabytes of GPU memory instead of two gigabytes.
5 Discussion and conclusions
Discussion
The 2AFC dataset is suitable for evaluating and tuning perceptual similarity measures, but it considers a special, limited, partially artificial set of images and transformations. On the 2AFC task, our metric based on Watson’s perceptual model outperformed the simple pixel-wise metrics as well as the popular structural similarity index SSIM Wang et al. (2004). Learning a metric using deep neural networks on the 2AFC data gave better results on the corresponding test data. This does not come as a surprise given the high flexibility of this purely data-driven approach. However, the resulting neural networks did not work well when used as loss functions for training VAEs, indicating weak generalization beyond the images and transformations in the training data. This is in accordance with (1) the fact that the higher flexibility of Deeploss-Squeeze compared to Deeploss-VGG yields a better fit in the 2AFC task (see also Zhang et al. (2018)) but even worse results in the VAE experiments; and (2) the fact that deep-loss approaches profit from extensive regularization, especially by including the squared error in the loss function (e.g., Kettunen et al. (2019)).
In contrast, our approach based on Watson’s perceptual model is not very complex (in terms of degrees of freedom) and has a strong inductive bias towards matching human perception. It therefore extrapolates much better, in the way expected of a perceptual metric and loss.
Deep neural networks for object recognition are trained to be invariant against translation, noise, blur, distortions, and other visual artifacts. We observed this invariance against noise and artifacts even after tuning on the data from the human experiments, see Fig. 1. While these properties are important for performing well in many computer vision tasks, they are not desirable for image generation. The generator/decoder can exploit these areas of ‘blindness’ of the similarity metric, leading to significantly more visual artifacts in generated samples, as we observed in the image generation experiments.
Furthermore, the computational and memory requirements of neural-network-based loss functions are much higher than those of SSIM or Watson’s model, to an extent that limits their applicability in generative neural network training.
Conclusion
We introduced a novel image similarity metric and corresponding loss function based on Watson’s perceptual model, which we transformed into a trainable model and extended to color images. We replaced the underlying DCT by a DFT, which disentangles amplitude and phase information, in order to increase robustness against small shifts.

The novel loss function, optimized on data from human experiments, can be used to train deep generative neural networks to produce realistic-looking, high-quality samples. It is fast to compute and requires little memory. The new perceptual loss function does not suffer from the blurring effects of traditional similarity metrics like the Euclidean distance or SSIM, and it generates fewer visual artifacts than current state-of-the-art losses based on deep neural networks.
CI acknowledges support by the Villum Foundation through the project Deep Learning and Remote Sensing for Unlocking Global Ecosystem Resource Dynamics (DeReEco).
References
 [1] Cox et al. (2008) Digital watermarking and steganography. The Morgan Kaufmann Series in Multimedia Information and Systems, Morgan Kaufmann Publishers.
 [2] Dosovitskiy and Brox (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 658–666.
 [3] Goodfellow et al. (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2672–2680.
 [4] Hou et al. (2017) Deep feature consistent variational autoencoder. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1133–1141.
 [5] Iandola et al. (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. In Conference on Computer Vision and Pattern Recognition (CVPR).
 [6] Ioffe and Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pp. 448–456.
 [7] Kettunen et al. (2019) E-LPIPS: robust perceptual image similarity via random transformation ensembles. CoRR abs/1906.03973.
 [8] Kingma and Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
 [9] Kingma and Welling (2013) Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).
 [10] Krizhevsky et al. (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105.
 [11] Larsen et al. (2016) Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning (ICML), pp. 1558–1566.
 [12] LeCun et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
 [13] Li and Cox (2007) Using perceptual models to improve fidelity and provide resistance to valumetric scaling for quantization index modulation watermarking. IEEE Transactions on Information Forensics and Security 2 (2), pp. 127–139.
 [14] Liu et al. (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
 [15] Maas et al. (2013) Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning (ICML).
 [16] Mathieu et al. (2016) Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations (ICLR).
 [17] Paszke et al. (2017) Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems (NeurIPS), Workshop on Automatic Differentiation.
 [18] Schwarz et al. (1987) An experimental comparison of RGB, YIQ, LAB, HSV, and opponent color models. ACM Transactions on Graphics 6 (2), pp. 123–158.
 [19] Simonyan and Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
 [20] Smith (1978) Color gamut transform pairs. ACM SIGGRAPH Computer Graphics 12 (3), pp. 12–19.
 [21] Theis et al. (2016) A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR), pp. 1–10.
 [22] Wallace (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38 (1), pp. xviii–xxxiv.
 [23] Wang et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
 [24] Watson and Ahumada (1985) Model of human visual-motion sensing. Journal of the Optical Society of America A 2 (2), pp. 322–342.
 [25] Watson (1987) The cortex transform: rapid computation of simulated neural images. Computer Vision, Graphics, and Image Processing 39 (3), pp. 311–327.
 [26] Watson (1993) DCT quantization matrices visually optimized for individual images. In Human Vision, Visual Processing, and Digital Display IV, Vol. 1913, pp. 202–217.
 [27] Watson (1994) Image compression using the discrete cosine transform. Mathematica Journal 4 (1), pp. 81–88.
 [28] Zhang et al. (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Conference on Computer Vision and Pattern Recognition (CVPR).
Appendix A Structural Similarity Loss Function
The Structural Similarity index (SSIM) Wang et al. (2004), which models perceived image fidelity, is a popular loss function for VAE training. In SSIM, a sample is decomposed into blocks and individual channels. Errors are calculated per channel and finally averaged over the entire image. The structural similarity between two blocks $x$, $y$ is defined as
$\mathrm{SSIM}(x, y) = \frac{\left( 2 \mu_x \mu_y + c_1 \right) \left( 2 \sigma_{xy} + c_2 \right)}{\left( \mu_x^2 + \mu_y^2 + c_1 \right) \left( \sigma_x^2 + \sigma_y^2 + c_2 \right)}$ (A.13)
with $\mu_x$ denoting the average of $x$, $\mu_y$ the average of $y$, $\sigma_x^2$ the variance of $x$, $\sigma_y^2$ the variance of $y$, and $\sigma_{xy}$ the covariance of $x$ and $y$. The constants $c_1$ and $c_2$ stabilize the division and are calculated depending on the dynamic range of the pixel values. We use the recommended values for the parameters $c_1$, $c_2$ and the block size Wang et al. (2004). Blocks are weighted by a Gaussian sampling function and moved pixel-by-pixel over the image.
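Eq. (A.13) for a single block can be sketched as follows, assuming pixel values in [0, 1] so that the common stabilizer choices (0.01 L)² and (0.03 L)² reduce to fixed constants:

```python
import numpy as np

def ssim_block(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural similarity of two blocks, Eq. (A.13), for values in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical blocks score 1, the maximum; anti-correlated blocks score well below 1 because the covariance term turns negative.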
Appendix B 2AFC Data
Appendix C Model Training
MNISTVAE  Input Size  Layer 

Encoder  Conv. , leaky ReLU 

Maxpool  
Conv. , leaky ReLU  
Fullyconnected , leaky ReLU  
Fullyconnected , leaky ReLU  
Decoder  Fullyconnected , leaky ReLU  
Fullyconnected , leaky ReLU  
Conv. , leaky ReLU  
Bilinear Upsampling  
Conv. , leaky ReLU  
Conv. , leaky ReLU  
Conv. , Sigmoid  
Table C.1: MNIST-VAE architecture. All convolutional layers use a stride of 1 and padding of 1. “Leaky ReLU” denotes leaky Rectified Linear Units Maas et al. (2013). Fully-connected layers state the number of hidden neurons.
celebAVAE  Input Size  Layer 

Encoder  Conv. , leaky ReLU  
Maxpool, Batch Normalization  
Conv. , leaky ReLU  
Maxpool, Batch Normalization  
Conv. , leaky ReLU  
Fullyconnected , leaky ReLU  
Fullyconnected , leaky ReLU  
Decoder  Fullyconnected , leaky ReLU  
Fullyconnected , leaky ReLU  
Conv. , leaky ReLU  
Bilinear Upsampling, Batch Normalization  
Conv. , leaky ReLU  
Bilinear Upsampling, Batch Normalization  
Conv. , leaky ReLU  
Conv. , leaky ReLU  
Conv. , Sigmoid  
Model  Similarity Metric  Hyperparameter 

MNIST-VAE  SSIM  
Watson-DFT  
Deeploss-VGG  
Deeploss-Squeeze  
celebA-VAE  SSIM  
Watson-DFT  
Deeploss-VGG  
Deeploss-Squeeze

Table C.3: Hyperparameters selected for training the full models.
Appendix D Resource Requirements
When applied for training a generative model, the time and memory requirements of computing a loss function and its derivative are important. We measure these requirements by considering a typical learning scenario. Minibatches of 128 images with either one (greyscale) or three (color) channels were forward-fed through the tested loss functions. The loss with regard to one input image was backpropagated, and the image was updated accordingly using stochastic gradient descent. We measured the time for a fixed number of iterations and the maximum GPU memory allocated. Results are averaged over five runs of the experiment. We used PyTorch Paszke et al. (2017), 32-bit precision, and a Tesla P100 GPU. The results are shown in Fig. D.4. For example, evaluation of Watson-DFT took 13 s, which was 5 times faster than Deeploss-VGG on color images. This factor increased to 17 on greyscale images. Furthermore, Watson-DFT required only a few megabytes of GPU memory, compared to the 2 gigabytes required for Deeploss-VGG.

Input  Metric  Runtime (s)  Mem. (MB)

Grey  
SSIM  
Watson-DCT  
Watson-DFT  
Deeploss-VGG  
Deeploss-Squeeze  
Color  
SSIM  
Watson-DCT  
Watson-DFT  
Deeploss-VGG  
Deeploss-Squeeze