, achieving great success in numerous applications. Image-to-image translation is one such application, where the task involves the translation of one scene representation into another representation. It has been shown that neural network architectures are able to generalise to different datasets and learn various translations between scene representations. Further, semantic labels have been used to generate realistic looking scenes which can then be used for data augmentation, e.g., in an autonomous car system, where new scenes can be generated by handcrafted semantic label maps.
Most state-of-the-art methods in image-to-image translation use a Generative Adversarial Network (GAN) loss with regularisation. The aim of this regularisation is to maintain the overall structure of the input image in the output image. This is typically achieved with loss functions such as the L1 loss, the L2 loss or the mean squared error (MSE). However, these do not account for the human visual system’s perception of quality. For example, the L1 loss uses a pixel-to-pixel similarity which fails to capture the global or local structure of the image.
The main objective of these methods is to generate images that look perceptually indistinguishable from the training data to humans. Despite this, metrics which attempt to capture different aspects of images that are important to humans are ignored. Although neural networks seem to transform the data to a domain where the Euclidean distance induces a spatially invariant image similarity metric, given a diverse enough training dataset, we believe that explicitly including key attributes of human perception is an important step when designing similarity metrics for image generation.
Therefore, in this paper we propose the use of a perceptual distance measure based on the human visual system that encapsulates the structure of the image at various scales whilst locally normalising the energy of the image: the Normalised Laplacian Pyramid Distance (NLPD). This distance was found to correlate with human perceptual quality when images are subjected to perturbations such as Gaussian noise, mean shift and compression. NLPD has been shown to be superior in predicting human perceptual similarity compared to a number of well-known metrics such as MS-SSIM and MSE.
The main contributions of this paper are as follows:
We argue that human perception should be used in the objective function of cGANs.
We propose a regulariser for cGANs that measures human perceptual quality in the form of NLPD.
We evaluate our proposed method, comparing it with the L1 loss using no-reference image quality metrics, image segmentation accuracy and an Amazon Mechanical Turk survey.
We show improved performance over L1 regularisation, demonstrating the benefits of an image quality metric inspired by the human visual system in the objective function.
Previously, image-to-image translation systems were designed by experts and could only be applied to their respective representations, while being unable to learn different translations [9, 4]. Neural networks are often able to generalise and learn a variety of mappings, and have proven to be successful in image generation.
Conditional Generative Adversarial Networks
Generative Adversarial Networks (GANs) aim to generate data indistinguishable from the training data. The generator network $G$ learns a mapping from a noise vector $z$ to target data $y$, and the discriminator network $D$ learns a mapping from data $y$ to a label in $[0, 1]$, corresponding to whether the data is real or generated. GANs have become very successful in complex tasks such as image generation. Conditional GANs (cGANs) aim to learn a generative model that will sample data according to some attribute, e.g. ‘generate data from class A’. This attribute is used to build a conditional generative model where the generator generates the data with respect to the attribute and the discriminator predicts whether the data is real or generated subject to the attribute.
Laplacian Pyramid Generative Adversarial Networks (LAPGANs) use the Laplacian pyramid framework in order to generate images of increasing resolution. At each stage of the pyramid, a separate GAN is trained to generate a higher resolution image, given the output of the previous stage. Although this algorithm uses the same underlying framework, the method is very different to what is proposed in this paper. Training a GAN at each stage of a Laplacian pyramid requires a large number of parameters and a large amount of computation time and, given that GANs are troublesome to train on their own, training a cascade of GANs is extremely time consuming. As such, we suggest the use of a similar loss function, using only a single GAN and with an additional normalisation step at each stage of the pyramid. This substantially reduces the number of parameters and the computation time.
One application of cGANs is image-to-image translation, where the generator is conditioned on an input image to generate a corresponding output image. Isola et al. proposed that the cGAN objective function has a structured loss, whereby the GAN considers the structure of the output space and pixels are conditionally dependent on all other pixels in the image.
Optimising for the GAN objective alone creates images that lack outlines for the objects in the semantic label map, and a common practice is to use either the L2 or L1 loss as a reconstruction loss. Isola et al. preferred the L1 loss, finding that the L2 loss encouraged smoothing in the generated images. The L1 loss is a pixel-level similarity metric, meaning it only considers the distance between single pixel values and ignores the local structure that could capture perceptual similarity.
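The limitation of a pixel-level loss can be illustrated with a small numerical sketch. The test image, the two distortions and the gradient-based structure measure below are hypothetical choices made for illustration only; they are not taken from the experiments in this paper.

```python
import numpy as np


def l1_loss(target, generated):
    """Mean absolute difference between corresponding pixels."""
    return float(np.mean(np.abs(target - generated)))


def mean_abs_gradient(image):
    """Crude proxy for local structure: mean absolute horizontal difference."""
    return float(np.mean(np.abs(np.diff(image, axis=1))))


# Hypothetical 8x8 test image: a horizontal intensity ramp.
target = np.tile(np.linspace(0.2, 0.8, 8), (8, 1))

# Distortion A: a uniform brightness shift, which leaves every edge intact.
shifted = target + 0.1

# Distortion B: a checkerboard of +0.1 / -0.1, which adds spurious texture.
rows, cols = np.indices(target.shape)
checker = target + 0.1 * np.where((rows + cols) % 2 == 0, 1.0, -1.0)

# Both distortions change every pixel by exactly 0.1, so L1 cannot separate them,
# even though the checkerboard clearly alters the local structure.
loss_shift = l1_loss(target, shifted)
loss_checker = l1_loss(target, checker)
```

Here `loss_shift` and `loss_checker` coincide, while `mean_abs_gradient(checker)` is larger than `mean_abs_gradient(shifted)`, which is the kind of structural difference a pixel-level loss does not register.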
Further, using a related method, it has been shown that the style of one image can be changed to match the style of a specified image. CycleGAN is an extension of pix2pix in which image-to-image translation is performed bidirectionally; the distance between a ground truth image and the same image translated to the other domain and then back again is calculated and used in the objective function. As a form of regularisation, a loss that aims to measure perceptual similarity, often called the Visual Geometry Group (VGG) network loss, is introduced.
When the output of a machine learning algorithm will be evaluated by human observers, the image quality metric (IQM) used in the optimisation objective should take into account human perception.
In the deep learning community, the VGG loss has been used to address the issue of generating images using perceptual similarity metrics. This method relies on using a network trained to predict perceptual similarity between two images. It has been shown to be robust to small structural perturbations, such as rotations, which is a weakness of more traditional image quality metrics such as the structural similarity index (SSIM). However, the architecture design and the optimisation take no inspiration from human perceptual systems and treat the problem as a simple regression task: given image A and image B, output a similarity that mimics the human perceptual score.
There is a long tradition of IQMs based on human perception. Probably the best known is SSIM, or its multi-scale version (MS-SSIM). While these distances focus on predicting human perceptual similarity, their formulation is disconnected from the processing pipeline followed by the human visual system. In contrast, metrics like the one proposed by Laparra et al. are inspired by the early stages of the human visual cortex and show better performance in mimicking human perception than SSIM and MS-SSIM on different human-rated databases. In this work we use an improved version of this metric, the Normalised Laplacian Pyramid Distance (NLPD), proposed by Laparra et al.
Normalised Laplacian Pyramid
The Laplacian Pyramid is a well-known image processing algorithm for image compression and encoding. The image is encoded by convolving it with a low-pass filter and subtracting the result from the original image; this is repeated multiple times, downsampling the image each time. The resulting filtered versions of the image have low variance and entropy and as such can be stored using less information.
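A minimal NumPy sketch of this encoding is given below. The 5-tap binomial filter, the reflect padding and the function names are assumptions made for illustration; the text does not specify which low-pass filter is used.

```python
import numpy as np

# Assumed low-pass filter: a 5-tap binomial kernel (not specified in the text).
KERNEL_1D = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0


def blur(image):
    """Separable low-pass filter with reflect padding; output keeps the input shape."""
    pad = len(KERNEL_1D) // 2
    padded = np.pad(image, pad, mode="reflect")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, KERNEL_1D, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, KERNEL_1D, mode="valid"), 0, rows)


def downsample(image):
    """Low-pass filter, then keep every second pixel in each direction."""
    return blur(image)[::2, ::2]


def upsample(image, shape):
    """Insert zeros between pixels, then low-pass filter to interpolate."""
    up = np.zeros(shape)
    up[::2, ::2] = image
    return 4.0 * blur(up)  # factor 4 compensates for the inserted zeros


def laplacian_pyramid(image, n_stages):
    """Return n_stages bands: band-pass residuals plus a final low-pass image."""
    bands = []
    current = image.astype(float)
    for _ in range(n_stages - 1):
        low = downsample(current)
        bands.append(current - upsample(low, current.shape))
        current = low
    bands.append(current)
    return bands


def reconstruct(bands):
    """Invert the pyramid by upsampling and adding back each residual."""
    current = bands[-1]
    for band in reversed(bands[:-1]):
        current = band + upsample(current, band.shape)
    return current
```

Because each residual stores exactly what the upsampled low-pass image misses, `reconstruct(laplacian_pyramid(image, n))` recovers `image` up to floating-point error.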
The Normalised Laplacian Pyramid (NLP) extends the Laplacian pyramid with a local normalisation step on the output of each stage. These two steps are similar to the early stages of the human visual system. Laparra et al. proposed an IQM based on computing distances in the NLP transformed domain, the NLPD. It has been shown that NLPD correlates better with human perception than previously proposed IQMs. NLPD has been employed successfully to optimise image processing algorithms, for instance to design an image compression algorithm and to perceptually optimise image rendering processes. It has also been shown that the NLP reduces the correlation and mutual information between the image coefficients, which is in agreement with the efficient coding hypothesis, proposed as a principle followed by the human brain.
Specifically, NLPD uses a series of low-pass filters, downsampling and local energy normalisation to transform the image into a ‘perceptual space’. A distance is then computed between two images within this space. The normalisation step divides by a local estimate of the amplitude. The local amplitude is a weighted sum of neighbouring pixels, where the weights are pre-computed by optimising a prediction of the local amplitude using undistorted images from a different dataset. The downsampling and normalisation are done at $N$ stages, a parameter set by the user. An overview of the architecture is given in Figure 1.
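The normalisation step can be sketched as follows. The divisive form with an additive constant `sigma`, the zero padding, and the uniform example weights are assumptions for illustration; in the method described here the weights are pre-computed from undistorted images rather than chosen by hand.

```python
import numpy as np


def local_amplitude(band, weights):
    """Weighted sum of neighbouring absolute values (zero padding at the borders)."""
    kh, kw = weights.shape
    padded = np.pad(np.abs(band), ((kh // 2, kh // 2), (kw // 2, kw // 2)),
                    mode="constant")
    out = np.zeros(band.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += weights[i, j] * padded[i:i + band.shape[0], j:j + band.shape[1]]
    return out


def normalise(band, weights, sigma=0.1):
    """Divide each coefficient by a local amplitude estimate.

    sigma is a hypothetical constant that avoids division by zero in flat regions.
    """
    return band / (sigma + local_amplitude(band, weights))


# Placeholder weights: a uniform 3x3 neighbourhood (assumed, not learned).
example_weights = np.ones((3, 3)) / 9.0
```

With this form, multiplying the contrast of a band by a large factor changes the normalised output only slightly, which is what makes the resulting representation sensitive to relative rather than absolute local contrast.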
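After computing the output at every stage of the pyramid, the final distance is the root mean square error between the outputs of two images, averaged over the stages:
$$\mathrm{NLPD}(x_1, x_2) = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{\sqrt{N_s^{(k)}}} \left\lVert y_1^{(k)} - y_2^{(k)} \right\rVert_2,$$
where $N$ is the number of stages in the pyramid, $N_s^{(k)}$ is the number of coefficients at stage $k$, $y_1^{(k)}$ is the output at stage $k$ when the input is a training image $x_1$ and $y_2^{(k)}$ is the output at stage $k$ when the input is a generated image $x_2$.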
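As a sketch, the distance can be computed from two lists of per-stage outputs, taking the root mean square error at each stage and averaging over the stages. The function name and the list-of-arrays interface are assumptions for illustration.

```python
import numpy as np


def nlpd_from_bands(bands_a, bands_b):
    """Average over stages of the per-stage root mean square error.

    bands_a, bands_b: lists of arrays holding the normalised pyramid outputs
    of the two images, one array per stage, with matching shapes.
    """
    if len(bands_a) != len(bands_b):
        raise ValueError("both pyramids must have the same number of stages")
    per_stage = [np.sqrt(np.mean((a - b) ** 2)) for a, b in zip(bands_a, bands_b)]
    return float(np.mean(per_stage))
```

Dividing the Euclidean norm of the difference by the square root of the number of coefficients is the same as taking the square root of the mean squared difference, so coarse stages with few coefficients contribute on the same scale as fine stages.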
Qualitatively, the transformation to the perceptual space defined by NLPD transforms images such that the local contrast is normalised by the contrast of each pixel’s neighbours. This leads to NLPD heavily penalising differences in local contrast. Using NLPD as a regulariser enforces a more realistic local contrast and, because NLPD observes multiple resolutions of the image, it also improves global contrast.
In image generation, perceptual similarity is the overall goal; fooling a human into thinking a generated image is real. As such, NLPD would be an ideal candidate regulariser for generative models, GANs in particular.
NLPD as a Regulariser
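For cGANs, the objective function is given by
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right],$$
where $G$ maps image $x$ and noise $z$ to target image $y$, and $D$ maps image $x$ and target image $y$ to a label in $[0, 1]$.
With the L1 regulariser proposed by Isola et al. for image-to-image translation, this becomes
$$\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathbb{E}_{x,y,z}\left[\lVert y - G(x, z) \rVert_1\right],$$
where $\lambda$ is a tunable hyperparameter.
In this paper we propose replacing the L1 regulariser with an NLPD regulariser. In doing so, the entire objective function is given by
$$\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathbb{E}_{x,y,z}\left[\mathrm{NLPD}(y, G(x, z))\right].$$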
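The generator's side of the regularised objective can be sketched with the regulariser passed in as a function, so that L1 and NLPD are interchangeable. All names below are hypothetical, the adversarial term uses the log form of the cGAN objective rather than the least-squares variant, and the L1 term is averaged over pixels rather than summed, which only rescales the weight `lam`.

```python
import numpy as np


def cgan_generator_term(d_fake, eps=1e-12):
    """Mean of log(1 - D(x, G(x, z))), the term the generator tries to minimise.

    d_fake: discriminator outputs in [0, 1] for generated images.
    eps: small constant (an assumption) to keep the logarithm finite.
    """
    return float(np.mean(np.log(1.0 - d_fake + eps)))


def l1_regulariser(target, generated):
    """Mean absolute pixel difference between target and generated image."""
    return float(np.mean(np.abs(target - generated)))


def generator_objective(d_fake, target, generated, regulariser, lam):
    """Adversarial term plus lam times a reconstruction regulariser."""
    return cgan_generator_term(d_fake) + lam * regulariser(target, generated)
```

Replacing `l1_regulariser` with any function that returns a perceptual distance between `target` and `generated` leaves the rest of the objective unchanged.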
NLPD involves convolution operations at each stage of the pyramid, with the same convolution applied independently to each colour channel of the input. Although this is more computationally expensive than the L1 loss, the increase in computation time is negligible relative to the full procedure of training a GAN.
In addition, with computational packages like TensorFlow and PyTorch, the process of transforming images into the perceptual space via a Laplacian pyramid can simply be appended to the generator computation graph as extra convolutional layers, with a very low number of parameters compared to traditional convolutional layers. The number of convolution filters that must be stored in memory depends on $N$, the number of stages in the pyramid, and is several orders of magnitude smaller than the number of filters stored in a network.
We evaluated our method on three public datasets, each varying in difficulty and subject matter: the Facades dataset, the Cityscapes dataset and a Maps dataset. Colour images were generated from semantic label maps for both the Facades dataset and the Cityscapes dataset. The Facades dataset is a set of architectural label drawings and the corresponding colour images for various buildings. The Cityscapes dataset is a collection of label maps and colour images taken from a front-facing car camera as it drives around various cities. For the Cityscapes dataset, images were resized to a resolution of and after generating the images they were resized to the original dataset aspect ratio of , as the network architecture used works best on square images. The third dataset is a Maps dataset of images taken from Google Maps that was constructed by Isola et al. It contains a map layout image of an area and the corresponding aerial image resized to a resolution of .
The objective of all of these tasks is to generate an RGB image from the textureless label map. For all datasets, the same train and test splits were used as in the pix2pix paper, in order to ensure a fair comparison.
For all experiments, the architecture of both the generator and discriminator is the same as defined by Isola et al. The generator is a U-net with skip connections between each mirroring layer. The discriminator is a patch discriminator which observes pixel patches at a time, with dropout applied at training. The full architecture can be found in the paper by Isola et al. or in the pix2pix repository (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix). In our method we use the least-squares adaptation of the GAN loss as it improves stability. We also used the Adam optimiser with learning rate and trained each network for epochs. A batch size of 1 was used with batch normalisation, and ReLU activations were applied to each layer. Using batch normalisation with a batch size of 1 is essentially instance normalisation, which has been found to be ideal in training image-to-image translation models. Random cropping and mirroring were applied during training.
For the L1 regulariser, a value of was used, the optimal value found by Isola et al. For NLPD, was found to be best after a hyperparameter search. The number of stages was chosen as ensuring that at the final stage the resolution of the output image will be . The normalisation filters were found by optimising the weights to recover the original local amplitude from various perturbed images using the McGill dataset. As these weights were found by optimising over black and white images, we apply the normalisation to each channel independently.
We vary the objective function that the network is trained with in order to highlight the effect of including the Normalised Laplacian Pyramid Distance as a regulariser.
Evaluating generative models is a difficult task. Therefore we have performed several experiments to illustrate the improvement in performance when using NLPD as a regulariser. In image-to-image translation, there is additional information in the form of the label map that the images were generated from. A common metric involves evaluating how well a network trained on the ground truth performs at a task such as image segmentation on the generated images [10, 27]. Naturally, generated images which achieve higher performance at this task can be considered more realistic. One architecture that has been successfully used for image segmentation is the fully convolutional network (FCN).
In traditional image classification networks, the final layers often involve fully connected layers. FCNs replace these fully connected layers with fully convolutional layers to represent label heat maps. As such, most image classification networks can be adapted into image segmentation networks.
We use the typical approach from the literature and train an FCN-8 for image segmentation on the Cityscapes dataset at a resolution. Generated images are then produced from label maps in the validation set of the Cityscapes dataset. Following this, three image segmentation accuracy metrics are calculated. Per-pixel accuracy is the percentage of pixels correctly classified, per-class accuracy is the mean of the accuracies for all classes, and class IOU is the intersection over union, which measures the percentage overlap between the ground truth label map and the predicted one.
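The three scores can all be read off a confusion matrix, as in the sketch below. The function names are hypothetical, class IOU is reported as the mean over classes, and classes that never occur are skipped when averaging; both are assumptions about details the text leaves open.

```python
import numpy as np


def confusion_matrix(gt, pred, n_classes):
    """Rows index the ground truth class, columns the predicted class."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    index = n_classes * gt + pred
    return np.bincount(index, minlength=n_classes ** 2).reshape(n_classes, n_classes)


def segmentation_scores(gt, pred, n_classes):
    """Per-pixel accuracy, per-class accuracy and class IOU from integer label maps."""
    cm = confusion_matrix(gt, pred, n_classes).astype(float)
    correct = np.diag(cm)
    gt_count = cm.sum(axis=1)    # pixels that truly belong to each class
    pred_count = cm.sum(axis=0)  # pixels predicted as each class
    with np.errstate(divide="ignore", invalid="ignore"):
        class_acc = correct / gt_count
        class_iou = correct / (gt_count + pred_count - correct)
    return {
        "per_pixel_acc": float(correct.sum() / cm.sum()),
        "per_class_acc": float(np.nanmean(class_acc)),
        "class_iou": float(np.nanmean(class_iou)),
    }
```

For example, with ground truth labels `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, three of four pixels are correct, the class accuracies are 1/2 and 1, and the class IOUs are 1/2 and 2/3.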
We note that the ground truth accuracy is lower due to the network being trained on images of resolution , which are then upsampled to the full resolution of the label map, .
No-Reference Image Quality Metrics
Traditional image quality metrics often require a reference image, e.g., measuring the root mean square error between a generated image and the ground truth. However, when generating an image from a label map, the ground truth is just one possible solution.
There exist many images that could be feasibly generated from one label map and, as such, reference image quality metrics are unsuitable. Therefore we include two no-reference image quality metrics to more thoroughly evaluate the generated images, namely BRISQUE and NIQE.
Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) is an image quality metric that aims to measure the ‘naturalness’ of an image using statistics of locally normalised luminance coefficients. For natural images, these coefficients normally follow a Gaussian distribution, and BRISQUE measures how well the mean subtracted contrast normalised (MSCN) coefficients fit a generalised Gaussian distribution. BRISQUE also measures how well a set of pairwise products between four orientations of the MSCN image fit an asymmetric generalised Gaussian distribution. The four orientations are vertical, horizontal, right-diagonal and left-diagonal, in order to capture the relationship between a pixel and its neighbours. Overall, BRISQUE was found to be an improvement over some full-reference image quality metrics, e.g., the structural similarity index (SSIM).
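The MSCN coefficients and the four pairwise products can be sketched as below. The 7-tap Gaussian window with standard deviation 7/6, the stabilising constant `c = 1`, the reflect padding and the mapping of the two diagonal names to array offsets are assumptions for illustration; the distribution fitting that BRISQUE performs on these quantities is not shown.

```python
import numpy as np


def gaussian_kernel(size=7, sigma=7.0 / 6.0):
    """Normalised 1-D Gaussian window (assumed size and width)."""
    x = np.arange(size) - size // 2
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def smooth(image, kernel):
    """Separable filtering with reflect padding; output keeps the input shape."""
    pad = len(kernel) // 2
    padded = np.pad(image, pad, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def mscn(image, c=1.0):
    """Subtract the local mean and divide by the local standard deviation plus c."""
    image = image.astype(float)
    kernel = gaussian_kernel()
    mu = smooth(image, kernel)
    sigma = np.sqrt(np.abs(smooth(image ** 2, kernel) - mu ** 2))
    return (image - mu) / (sigma + c)


def pairwise_products(m):
    """Products of each MSCN coefficient with one neighbour per orientation."""
    return {
        "horizontal": m[:, :-1] * m[:, 1:],
        "vertical": m[:-1, :] * m[1:, :],
        "right_diagonal": m[:-1, :-1] * m[1:, 1:],
        "left_diagonal": m[:-1, 1:] * m[1:, :-1],
    }
```

A perfectly flat image has MSCN coefficients of zero everywhere, since every pixel equals its local mean.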
Natural Image Quality Evaluator (NIQE) is a fully blind image quality metric in that it has no knowledge of the types of distortions applied to the images. NIQE selects patches of the image that provide the most information and computes statistics such as local variance inside the set of patches. The distribution of these statistics for a query image is then compared to the distribution of natural images and a score is calculated.
Amazon Mechanical Turk
As our objective is to generate images which, to humans, look perceptually similar to the original images, we also evaluate the performance by asking humans to judge the quality of the generated images.
Experiments were conducted using Amazon Mechanical Turk (AMT), and users were asked to choose “Which image looks more natural?” when presented with one image generated using the L1 regulariser and another generated using the NLPD regulariser. A random subset of images was chosen from the validation set of each dataset and unique decisions were gathered per image. The left or right placement of the images for each regulariser was randomised.
|Loss Function|BRISQUE (NIQE) Scores|
|cGAN+L1|30.08 (5.23)|26.57 (3.86)|30.63 (4.71)|
|cGAN+NLPD|30.06 (5.21)|24.54 (3.57)|28.99 (4.59)|
|Ground Truth|37.29 (7.33)|25.40 (3.12)|28.48 (3.35)|
Table 1 shows results for the FCN-scores for the images generated using the Cityscapes database. In general the images generated using NLPD show improvement over the L1 regularisation, in particular in the per-pixel accuracy and class IOU. As such, it can be seen that the NLPD images contain more features of the original dataset according to the FCN image segmentation network.
Table 2 shows the scores for both the BRISQUE and NIQE image quality metrics. The two no-reference image quality metrics aim to measure the naturalness of an image. A lower value means a more natural image. On average, NLPD regularisation achieves lower values in both metrics. For Cityscapes and Maps, NLPD is close to the scores achieved by the ground truth. The ground truth scores for the Facades dataset can be worse than the generated images due to the large grey or black triangles that are in the Facades training set, included to crop out some of the sky and neighbouring buildings. These triangles are very unnatural textures and as such could cause the scores to be significantly worse.
Using Amazon Mechanical Turk we tested the human perceived quality by querying users regarding the naturalness of the presented images. The percentage of users that found the NLPD images more natural was above chance for the Maps () and Cityscapes datasets (), while similar for Facades (). Visual inspection of Fig. 2(a) shows that when generating from a map that contains a large building, NLPD produces more realistic textures, whereas L1 contains repeating patterns. In the Cityscapes dataset the contrast appears slightly more realistic, e.g., the white in the sky is lighter in Fig. 2, which could result in users preferring these images. In images generated using the Facades dataset, it is hard to visually find differences in Fig. 2(b) and therefore difficult to measure a preference between the two regularisers.
Taking into account human perception in machine learning algorithms is challenging and usually ignored in automatic image generation. In this paper we detailed a procedure to take into account human perception in a conditional GAN framework. We propose to modify the standard objective by incorporating a term that accounts for perceptual quality by using the Normalised Laplacian Pyramid Distance (NLPD). We illustrate its behaviour in the image-to-image translation task for a variety of datasets. The suggested objective shows better performance in all the evaluation procedures. Interestingly, it also has a better segmentation accuracy using a network trained on the original dataset, and produces more natural images according to two no-reference image quality metrics. In human perceptual experiments, users showed a preference for the images generated using the NLPD regulariser over those generated using L1 regularisation.
- (2016) End-to-end optimization of nonlinear transform codes for perceptual quality. In Proceedings of the PCS, Nuremberg, Germany.
- (1961) Possible principles underlying the transformation of sensory messages. Sensory Communication, pp. 217–234.
- (1983) The Laplacian pyramid as a compact image code. IEEE Transactions on Communications 31 (4), pp. 532–540.
- (2009) Sketch2Photo: internet image montage. In ACM Transactions on Graphics, Vol. 28, pp. 124.
- (2016) The Cityscapes dataset for semantic urban scene understanding. In IEEE CVPR, pp. 3213–3223.
- (2015) Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pp. 1486–1494.
- (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666.
- (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- (2001) Image analogies. In Computer Graphics and Interactive Techniques, pp. 327–340.
- Image-to-image translation with conditional adversarial networks. In IEEE CVPR.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2016) Perceptual image quality assessment using a normalized Laplacian pyramid. Electronic Imaging 2016 (16), pp. 1–6.
- (2017) Perceptually optimized image rendering. Journal of the Optical Society of America A.
- (2010) Divisive normalization image quality metric revisited. Journal of the Optical Society of America A 27 (4), pp. 852–864.
- (2015) Fully convolutional networks for semantic segmentation. In IEEE CVPR, pp. 3431–3440.
- (2017) Least squares generative adversarial networks. In IEEE ICCV, pp. 2794–2802.
- (2014) Conditional generative adversarial nets. CoRR abs/1411.1784.
- (2012) No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing 21 (12), pp. 4695–4708.
- (2013) Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters 20 (3), pp. 209–212.
- (2017) Conditional image synthesis with auxiliary classifier GANs. In ICML, pp. 2642–2651.
- (2004) A biologically inspired algorithm for the recovery of shading and reflectance images. Perception 33 (12), pp. 1463–1473.
- (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434.
- (1994) The statistics of natural images. Network: Computation in Neural Systems 5 (4), pp. 517–548.
- (2015) A note on the evaluation of generative models. International Conference on Learning Representations.
- Spatial pattern templates for recognition of objects with regular structure. German Conference on Pattern Recognition, pp. 364–374.
- Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6924–6932.
- (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In IEEE CVPR, pp. 8798–8807.
- (2003) Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers, Vol. 2, pp. 1398–1402.
- The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE CVPR, pp. 586–595.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE ICCV, pp. 2223–2232.