A representation of the human visual system (HVS) is necessary to establish a robust image quality metric, which is needed for computer vision applications [10.1117/12.135952, Mohamadi2020DeepBA]. Classical approaches used hand-crafted strategies to mimic the properties of the HVS [Wang2004ImageQA, Xue2014GradientMS, Zhang2011FSIMAF, Wang2015AnOB] by implementing a stream of computational functions that are combined to identify the key perceptual properties of images. While these techniques played their role effectively, scaling them up has proven to be a daunting task, especially for applications with huge datasets. The introduction of neural networks has, however, eased this task considerably [Isola2017ImagetoImageTW, Karras2018ProgressiveGO, Osahor2020QualityGS, Kancharla2019QualityAG, Aghdaie2021AttentionAW, Goodfellow2014GenerativeAN, Mostofa2020JointSRVDNetJS]. These deep networks consist of non-linear filters configured to extract key perceptual features from data within user-defined constraints.
In our work, we focus on Full-Reference Image Quality Assessment (FR-IQA) models [Zhang2012SRSIMAF, Liang2016ImageQA]. These models are mostly used to represent the HVS, with the aim of deriving a quality measure for images by comparing the perceptual similarity between distorted images and their respective reference images. A standard FR-IQA model seeks to imitate the HVS by exploiting photographic computational algorithms that represent contrast sensitivity, visual masking, luminance, etc. A number of FR-IQA metrics have since been derived, including the Structural Similarity Index Measure (SSIM) [Wang2004ImageQA], Multi-Scale Structural Similarity Index Measure (MS-SSIM) [10.1007/978-3-319-13671-4_40], Visual Information Fidelity (VIF) [Han2013ANI], Feature Similarity Index Measure (FSIM) [Zhang2011FSIMAF], Mean Deviation Similarity Index (MDSI) [Nafchi2016MeanDS], etc.
In this paper, we present a novel approach for improving the quality of GAN-synthesised images by combining the benefits of established FR-IQA metrics and the features of a deep convolutional neural network (DCNN). We extend the performance of popular GAN-based baseline approaches by introducing a novel image quality map fusion network that computes the perceptual properties of images and fuses them with a perceptual attention mechanism, as shown in Figure 1. We also introduce novel quality loss functions derived via Banach spaces to boost image quality. Our technique shows impressive results compared to state-of-the-art methods. Our key contributions are as follows:
We introduce a new perceptual quality map fusion network that harnesses the perceptual qualities of computationally derived quality assessment metrics.
We propose a new norm implemented via Banach Wasserstein GAN (BWGAN) instead of the popular $\ell^2$ norm computed using the Wasserstein metric.
We also propose a perceptual attention mechanism (PAM) that augments image features to boost the overall visual appeal of the synthesised images.
2 Related Work
FR-IQA models try to simulate the characteristics of the HVS and achieve good performance measures [Pei2015ImageQA, Reisenhofer2018AHW, Ye2012UnsupervisedFL, Zhang2014VSIAV]. The success of FR-IQA can be attributed to two main factors: the deep-learning-based perceptual properties of the reference image, and the hand-crafted features derived from statistical metrics that are similar to the HVS. Hence, it becomes easier to build a system that minimises the difference between these two corresponding sets of features. Several related systems have been proposed to model the properties of the HVS effectively. Zhang et al. [Zhang2011FSIMAF] proposed a similarity index metric that uses phase congruency and gradient magnitude to represent the HVS, while [Xue2014GradientMS] implemented an efficient standard deviation pooling strategy, demonstrating that the gradient magnitude of an image remains a valid technique for representing the HVS. [Nafchi2016MeanDS] adopted a novel deviation pooling technique to compute the quality score from gradient and chromaticity similarities as a measure of local structural distortions.
The Banach Wasserstein GAN (BWGAN) is a framework that uses arbitrary norms, rather than the popularly used $\ell^2$ norm, as the underlying metric in adversarial training. Adler et al. [Adler2018BanachWG] translated the WGAN-GP model to Banach spaces, which can accommodate norms that capture desired image features such as edges and texture.
3 Our Approach
We present a single GAN model capable of implementing image-to-image synthesis. We combine the benefits of established FR-IQA metrics [Zhang2011FSIMAF, 10.1007/978-3-319-13671-4_40, Nafchi2016MeanDS] and the low-level salient features of a deep convolutional neural network (DCNN) to aid adversarial image synthesis aimed at producing perceptually appealing images. In our model, we introduce an attention schema that exploits the salient perceptual features in a channel-wise fashion and the spatial map representation embeddings of standard FR-IQA metrics (SSIM, MDSI and FSIM). Our framework consists of five main components. The first is a quality-aware generator network made up of an encoder and a decoder section. The encoder and decoder are coupled with a perceptual attentive mechanism (PAM) for quality encoding and a perceptual quality map fusion network at the latent space, as shown in Figure 1. The discriminative network critiques the images produced by the generator in an adversarial manner, without compromising image quality or the perceptual consistency with the reference image. The perceptual quality map generator combines the core quality metric functions that capture the sensitive perceptual features of a given image, while the score regression network pools the images synthesized by the generator to estimate a reference quality score. Our overall objective consists of a Wasserstein Gradient Penalty, a Structural Similarity Index Gradient Penalty (SSIM-GP) and a Natural Image Quality Estimator (NIQE) term, as defined in Section 4.
3.1 Perceptual Attention Mechanism (PAM)
The attention mechanism augments perceptual features from a prior generator encoder network computed over the input images. The aim is to establish a convex combination of quality-enhanced condensed representations of the input image for real-time training. We begin by describing the channel attention in PAM, which is based on the CBAM module [Woo2018CBAMCB]. PAM involves two steps. First, per-channel “summary statistics” obtained from a 2-layer residual block are calculated to yield the global feature attention vector. Second, a multi-head network applies a non-linear multi-head attention transformation, which allows the model to jointly attend to information from different representation sub-spaces of the bottleneck [Vaswani2017AttentionIA]. The channel-based attention output is obtained through a softmax, multiplied with the encoder output, and processed by the residual block to produce the channel-based attention embeddings.
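The channel-wise reweighting described above can be illustrated with a minimal sketch. It is not the PAM implementation: global average pooling stands in for the learned 2-layer residual block and the multi-head transformation, and the names `softmax` and `channel_attention` are hypothetical helpers introduced only for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(features):
    """Reweight channels by a softmax over per-channel summary statistics.

    `features` is a list of channels, each a flat list of activations.
    Assumption: the summary statistic is the channel mean; in the paper
    it comes from a learned residual block.
    """
    summaries = [sum(channel) / len(channel) for channel in features]
    weights = softmax(summaries)
    attended = [[w * v for v in channel] for w, channel in zip(weights, features)]
    return weights, attended


# Three channels with two activations each; the middle channel is strongest.
weights, attended = channel_attention([[1.0, 1.0], [3.0, 3.0], [1.0, 1.0]])
```

Because the weights come from a softmax, they are positive and sum to one, so the attended output is a convex reweighting of the encoder channels.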
3.2 Perceptual Map Generator
We selected the FSIM [Zhang2011FSIMAF], MS-SSIM [10.1007/978-3-319-13671-4_40], and MDSI [Nafchi2016MeanDS] image quality metrics to generate similarity maps because the trio collectively captures key image characteristics that are similar to the HVS [Wang2004ImageQA, Nafchi2016MeanDS, 10.1007/978-3-319-13671-4_40, Zhang2011FSIMAF, Aghdaie2021DetectionOM], as shown in Figure 2. The FSIM metric captures the luminance, contrast and structural information. For the MS-SSIM metric, we considered multiple scales of the synthesised image and its reference for contrast and structure, while the MDSI map is derived by extracting the gradient and chromaticity of the pair of synthesised and reference images. We use an intensity coefficient to specify the intensity of the maps. The map fusion network is divided into three stages. First, at each iteration we extract the feature similarity representations between the reference image and the generated images, using the similarity index map functions of the aforementioned metrics (MS-SSIM, FSIM and MDSI). Second, the generated maps are concatenated and pre-processed by two-layer MLP networks to form a spatial-based perceptual map representation. In the last stage, the predicted future state is computed as the expectation of the spatial features and the channel-based features, and is then summed with the output of the encoder. The resulting output, which is fed to the decoder, represents latent features that are optimized for better image quality.
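As a rough illustration of the first two stages, the sketch below builds pointwise similarity maps and fuses them. It rests on stated assumptions: the ratio (2ab + c) / (a^2 + b^2 + c) is the generic similarity form used by components of SSIM, FSIM and MDSI, not any one of those metrics in full, and a plain average replaces the learned two-layer MLP. The names `similarity_map`, `fuse_maps` and the `intensity` parameter are hypothetical.

```python
def similarity_map(reference, generated, c=1e-4):
    """Pointwise similarity (2ab + c) / (a^2 + b^2 + c) of two 2-D feature maps."""
    return [
        [(2 * a * b + c) / (a * a + b * b + c) for a, b in zip(row_ref, row_gen)]
        for row_ref, row_gen in zip(reference, generated)
    ]


def fuse_maps(maps, intensity=1.0):
    """Average same-sized maps and scale the result by an intensity coefficient.

    Assumption: averaging stands in for concatenation followed by an MLP.
    """
    count = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [intensity * sum(m[i][j] for m in maps) / count for j in range(cols)]
        for i in range(rows)
    ]


reference = [[0.2, 0.4], [0.6, 0.8]]
generated = [[0.2, 0.5], [0.1, 0.8]]
fused = fuse_maps(
    [similarity_map(reference, generated), similarity_map(reference, reference)],
    intensity=0.5,
)
```

For non-negative inputs each map value lies in (0, 1] and equals 1 exactly where the two feature maps agree, so low values mark the regions where the generated image deviates from its reference.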
4 Banach Space Gradient Penalty
Quality assessment metrics for the distance between images have been limited to cost functions that take the form of $\ell^1$ or $\ell^2$ norms. However, non-convexity and complications in gradient computation (vanishing gradients, exploding gradients, etc.) are among the difficulties encountered when formulating the optimization problem. To mitigate these computational shortcomings, the Wasserstein distance was introduced in [Arjovsky2017WassersteinG].
However, a wide variety of untapped metrics [10.1007/978-3-319-13671-4_40, Wang2004ImageQA] exist that can be used to compare and emphasize key features of interest. In this regard, we extended the Wasserstein distance beyond the popular WGAN with gradient penalty (WGAN-GP), which is constrained to $\ell^2$ norms, and instead adopted the more general setting of Banach spaces [Adler2018BanachWG]. Our technique, similar to [Arjovsky2017WassersteinG, Kancharla2019QualityAG, Adler2018BanachWG], shows that the characterisation of $\gamma$-Lipschitz functions via the norm of the differential can be extended from the $\ell^2$ setting to arbitrary Banach spaces by considering the gradient as an element in the dual of $B$. Such a loss function is given as:
where $\lambda$ and $\gamma$ are regularization parameters. These Banach space norms give room for specific image features such as texture, structure, contrast and luminance, which drive the perceptual appeal of an image to a human observer, as described in Sections 4.1 and 4.2.
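For orientation, the gradient penalty in the BWGAN formulation of [Adler2018BanachWG] has the form $\lambda\,\mathbb{E}_{\hat{X}}\big(\tfrac{1}{\gamma}\lVert \partial D(\hat{X}) \rVert_{B^*} - 1\big)^2$, where $B^*$ is the dual space of $B$. The sketch below evaluates that term for a single gradient vector in the finite-dimensional $\ell^p$ case, whose dual is $\ell^q$ with $1/p + 1/q = 1$. It is an illustration under those assumptions, not the loss used in this work; the function names and default values are hypothetical.

```python
import math


def dual_exponent(p):
    """Return q with 1/p + 1/q = 1, so that the dual of l^p is l^q."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)


def lp_norm(values, p):
    """l^p norm of a flat list of floats (p may be math.inf)."""
    if p == math.inf:
        return max(abs(v) for v in values)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def banach_gradient_penalty(gradient, p=2.0, gamma=1.0, lam=10.0):
    """Penalty lam * (||gradient||_dual / gamma - 1)^2 for one sample.

    `gradient` is the critic gradient at an interpolated sample, flattened.
    Assumption: the image space carries an l^p norm, so the gradient is
    measured in the dual l^q norm.
    """
    dual_norm = lp_norm(gradient, dual_exponent(p))
    return lam * (dual_norm / gamma - 1.0) ** 2


# With p = 2 the dual norm is again l^2, which recovers the WGAN-GP penalty.
penalty = banach_gradient_penalty([0.6, 0.8], p=2.0)
```

Choosing a different `p` changes which gradient directions are penalised most, which is the mechanism that lets a Banach norm emphasise particular image features.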
4.1 Structural Similarity (SSIM) Index
The SSIM index measures the perceptual difference between two similar images [Wang2004ImageQA]. It computes changes in the local mean, local variance and local structure between the two images to obtain a local quality score at each pixel. The local scores are then averaged across the image to obtain the image quality score.
Here, the two images are the input and synthesised images, and the local mean and standard deviation at each pixel index define the local luminance, contrast and structure scores at that pixel. Furthermore, since the SSIM index is bounded, the Lipschitz constant can be imposed directly by introducing a gradient penalty regularization term given as:
This makes the SSIM a good candidate for quality awareness which is beneficial for regularizing GANs. The complete mathematical properties are described in [Brunet2012OnTM].
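The local-score-then-average computation can be sketched with the standard SSIM definition from [Wang2004ImageQA], in its common two-factor form $\frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$, which merges the contrast and structure terms. Two simplifications are assumptions of this example: local statistics are taken over non-overlapping square blocks rather than a sliding Gaussian window, and images are greyscale nested lists at least one block in size. The names `ssim_patch` and `mean_ssim` are hypothetical.

```python
def ssim_patch(x, y, dynamic_range=255.0):
    """SSIM of two equally sized flat patches of pixel intensities."""
    c1 = (0.01 * dynamic_range) ** 2  # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilises the contrast/structure term
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_ssim(img_x, img_y, window=4, dynamic_range=255.0):
    """Average patch SSIM over non-overlapping window x window blocks."""
    height, width = len(img_x), len(img_x[0])
    scores = []
    for top in range(0, height - window + 1, window):
        for left in range(0, width - window + 1, window):
            rows = range(top, top + window)
            cols = range(left, left + window)
            patch_x = [img_x[r][c] for r in rows for c in cols]
            patch_y = [img_y[r][c] for r in rows for c in cols]
            scores.append(ssim_patch(patch_x, patch_y, dynamic_range))
    return sum(scores) / len(scores)


image = [[float((7 * r + 13 * c) % 256) for c in range(8)] for r in range(8)]
inverted = [[255.0 - v for v in row] for row in image]
score_same = mean_ssim(image, image)
score_inverted = mean_ssim(image, inverted)
```

The score is 1 for identical images and is bounded in magnitude by 1, which is the boundedness property that the gradient penalty argument relies on.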
4.2 Natural Image Quality Estimator (NIQE)
The NIQE [Mittal2013MakingA] is a no-reference IQA (NR-IQA) metric built on perceptually relevant spatial-domain Natural Scene Statistics (NSS) features, which are extracted from local image patches and capture the essential low-order statistics of natural images. The equation is given as:
where the local mean and standard deviation are computed at each pixel index. The NIQE captures the naturalness of a pristine reference image by fitting a generalized Gaussian distribution (GGD) [Ruderman1994TheSO], and models the products of neighbouring pixel coefficients using an asymmetric GGD (AGGD). The parameters of both the GGD and the AGGD are then modelled using a multivariate Gaussian (MVG) model [Moorthy2010StatisticsON]. The quality of the test image is measured in terms of the “distance” between its MVG parameters and the pristine MVG parameters. Finally, discriminator gradients computed for both the pristine reference and the synthesised images are used to compute the distance between the pair. The expression is given as:
where the parameters are the means and covariances of the reference and synthesised images, respectively. In addition to the SSIM and NIQE metrics, we also used a 1-GP regularizer [Gulrajani2017ImprovedTO], designed to force the local statistics of the discriminator gradient to be as close as possible to those of real images. Our claim is that such a regularization strategy improves the visual quality of the generated images, especially for attributes like hair, age and skin colour. We worked in the WGAN-GP framework to demonstrate our method. The overall discriminator cost function includes the NIQE regularizer, the SSIM regularizer and the 1-GP regularizer, and is defined as:
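The distance between two MVG models in [Mittal2013MakingA] is $\sqrt{(\nu_1 - \nu_2)^\top \big(\tfrac{\Sigma_1 + \Sigma_2}{2}\big)^{-1} (\nu_1 - \nu_2)}$, with $\nu$ and $\Sigma$ the mean vectors and covariance matrices. The following sketch evaluates it with a small linear solver. It assumes the pooled covariance is invertible (the original NIQE uses a pseudo-inverse), and the names `solve_linear` and `niqe_distance` are hypothetical; applying it to discriminator gradients, as described above, is outside the scope of the example.

```python
import math


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def niqe_distance(mean_ref, cov_ref, mean_test, cov_test):
    """Distance between two multivariate Gaussian models.

    Assumption: the averaged covariance matrix is non-singular.
    """
    diff = [a - b for a, b in zip(mean_ref, mean_test)]
    pooled = [
        [(a + b) / 2.0 for a, b in zip(row_ref, row_test)]
        for row_ref, row_test in zip(cov_ref, cov_test)
    ]
    solved = solve_linear(pooled, diff)
    return math.sqrt(sum(d * s for d, s in zip(diff, solved)))


identity = [[1.0, 0.0], [0.0, 1.0]]
distance = niqe_distance([0.0, 0.0], identity, [3.0, 4.0], identity)
```

Solving the linear system avoids forming the inverse explicitly; a larger distance indicates statistics further from those of the pristine model.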
The full objective is given as:
where the score generated by the regression network is minimised against the ground-truth scores of the images. We use two weighting coefficients to tune the objective functions and achieve better results.
5 Training Strategy
We trained our model using the Adam optimizer, with momentum values set at $\beta_1$ = 0.5 and $\beta_2$ = 0.99. We used a batch size of 8 for most experiments on CelebA [liu2015faceattributes], CelebA-HQ [Lee2019MaskGANTD] and FairFace [Krkkinen2019FairFaceFA]. We used a learning rate of 0.0001 for the first 10 epochs, which was then linearly decayed to 0 over the subsequent epochs. We trained the entire model on three NVIDIA Titan X GPUs.
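A constant-then-linear-decay schedule of this kind can be written as a small function. The base rate of 0.0001 and the 10 constant epochs come from the text; the length of the decay phase is not stated, so `decay_epochs=10` below is purely an assumption, as is the function name.

```python
def learning_rate(epoch, base_lr=1e-4, constant_epochs=10, decay_epochs=10):
    """Learning rate for a zero-based epoch index.

    The rate stays at `base_lr` for `constant_epochs` epochs and then
    decays linearly, reaching 0 after `decay_epochs` further epochs.
    """
    if epoch < constant_epochs:
        return base_lr
    progress = (epoch - constant_epochs) / decay_epochs
    return base_lr * max(0.0, 1.0 - progress)


schedule = [learning_rate(epoch) for epoch in range(20)]
```

In a training loop the returned value would be assigned to the optimizer at the start of each epoch.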
We evaluated the efficacy of our proposed technique on the following datasets. The CelebFaces Attributes (CelebA) dataset [liu2015faceattributes] contains 202,599 celebrity face images. We cropped the initial images to 178x178, then resized them to 64x64. The CIFAR-10 dataset [Krizhevsky09learningmultiple] consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. The FairFace image dataset [Krkkinen2021FairFaceFA] contains 108,501 images, with an emphasis on balanced race composition across 7 race groups: White, African, Indian, East Asian, Southeast Asian, Middle East, and Hispanic. For evaluations, we used the LIVE dataset [Sheikh2006ASE], which consists of 982 distorted images with 5 different distortions, and the TID2008 dataset [Reisenhofer2018AHW], which contains 25 reference images and a total of 1,700 distorted images. We also used the Edges-to-Shoes set of 50,000 training images from the UT Zappos50K dataset [Yu2014FineGrainedVC] and the Edges-to-Handbag set of 137,000 Amazon handbag images from [Zhu2016GenerativeVM], trained for 15 epochs with a batch size of 8.
To evaluate the performance of the synthesized images shown in Figure 3, we adopted two key evaluation criteria: Spearman’s Rank Order Correlation Coefficient (SROCC) and the Linear Correlation Coefficient (LCC) [Forthofer1981]. SROCC measures the monotonic relationship between the ground truth and the model prediction, while LCC measures the linear correlation between them. Tables 1 and 2 show the SROCC and LCC performance, respectively, of the competing IQA methods for different distortion types. In general, our model performs competitively across most distortion types. Compared with BPSQM, our model improves performance by about 4.5% overall on the AGN, SCN, HFN, JPEG and MN distortions, as indicated in Table 1. For comparison with previous models, we computed three quantitative measures: Inception Score (IS), Fréchet Inception Distance (FID) and the Feature Similarity (FSIM) index. IS measures sample quality and diversity by finding the entropy of the predicted labels. FID measures the similarity between real and fake samples by fitting a multivariate Gaussian (MVG) model to the intermediate representation. The FSIM index computes quality estimates for the real and fake samples using phase congruency as the primary feature and gradient magnitude as the complementary feature. Table 3 shows the quantitative comparison of GAN-metric performance for BPGAN [Wang2017BackPA], CAGAN [Yu2017TowardsRF], CGAN [CycleGAN2017], WAGAN [Arjovsky2017WassersteinG], QAGAN [Kancharla2019QualityAG] and our model on the CelebA dataset. We also carried out a pixel variation analysis using the second-order features of the synthesised images, which are based on the gray level co-occurrence matrix (GLCM) [Haralick1979StatisticalAS].
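The two correlation criteria, SROCC and LCC, can be computed as follows. This is a generic sketch of the standard definitions (LCC as the Pearson coefficient, SROCC as the Pearson coefficient of average ranks), not the evaluation code behind the reported tables; the function names are hypothetical, and both inputs are assumed to be non-constant lists of equal length.

```python
import math


def pearson(x, y):
    """Linear correlation coefficient (LCC) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank order correlation (SROCC): Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


ground_truth = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]
srocc = spearman(ground_truth, predicted)
lcc = pearson(ground_truth, predicted)
```

In this example the prediction is a monotonic but non-linear function of the ground truth, so the rank correlation is perfect while the linear correlation falls below 1, which illustrates why the two criteria are reported together.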
We used this technique to determine the Entropy, Homogeneity and Correlation of the synthesised images in comparison with state-of-the-art methods, as shown in Figure 4. Entropy is useful for assessing sharpness, while Homogeneity and Correlation are useful for evaluating the contrast of an image. Entropy and Correlation increase with image quality, whereas Homogeneity energy values decrease as image quality increases. From the Entropy plot in 5(a), our model outperforms QAGAN and WAGAN by over 3.5%. The contrast level improves markedly for our approach compared to the other methods, which are closely matched within a tolerance of about 2%. We also observed that most models possess similar Homogeneity values, except for our model and QAGAN, which reflect significant performance values.
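The three GLCM statistics can be illustrated with a small sketch. Its assumptions are labelled here and in the code: the co-occurrence matrix uses only horizontally adjacent pixels and is made symmetric, the image is already quantised to integer grey levels, and Homogeneity follows the inverse difference moment form with weight 1 / (1 + (i - j)^2). Other offsets and definitions are in common use, so this is not necessarily the configuration behind Figure 4. The function names are hypothetical.

```python
import math


def glcm(image, levels):
    """Normalised symmetric co-occurrence matrix for horizontal neighbours.

    `image` is a 2-D list of integer grey levels in range(levels).
    """
    counts = [[0.0] * levels for _ in range(levels)]
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1.0
            counts[right][left] += 1.0
    total = sum(sum(r) for r in counts)
    return [[v / total for v in r] for r in counts]


def glcm_features(p):
    """Return (entropy, homogeneity, correlation) of a normalised GLCM."""
    size = range(len(p))
    entropy = -sum(v * math.log2(v) for r in p for v in r if v > 0)
    homogeneity = sum(p[i][j] / (1.0 + (i - j) ** 2) for i in size for j in size)
    mu_i = sum(i * p[i][j] for i in size for j in size)
    mu_j = sum(j * p[i][j] for i in size for j in size)
    var_i = sum((i - mu_i) ** 2 * p[i][j] for i in size for j in size)
    var_j = sum((j - mu_j) ** 2 * p[i][j] for i in size for j in size)
    if var_i == 0 or var_j == 0:
        correlation = math.nan  # undefined for a constant image
    else:
        cov = sum((i - mu_i) * (j - mu_j) * p[i][j] for i in size for j in size)
        correlation = cov / math.sqrt(var_i * var_j)
    return entropy, homogeneity, correlation


checkerboard = [[0, 1, 0, 1], [1, 0, 1, 0]]
entropy, homogeneity, correlation = glcm_features(glcm(checkerboard, levels=2))
```

A checkerboard is a useful sanity check because every neighbouring pair differs, which drives Homogeneity down and makes the grey levels of neighbours perfectly anti-correlated.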
5.3 Ablation Study
Ablation studies on our loss functions were conducted to test general model robustness on the CIFAR-10, FairFace and CelebA datasets. The Lagrange coefficients of the SSIM and NIQE losses were also varied empirically within a bounded range to check the effect on the perceptual appeal of the synthesised images. We inferred that reducing the coefficients towards the lower limit weakens the discriminative power, which in turn reduces the quality of the images synthesised by the generator. We also conducted a Histogram of Oriented Gradients (HOG) similarity analysis with the Inception v3 model [Szegedy2016RethinkingTI] on the input and synthesised images of the FairFace dataset, in order to obtain the layer-wise performance of our model baseline at specific iterations. Figure 6(a) shows the HOG similarity performance of our model at different training iterations, compared to other quality metric techniques.
Our results show that our approach is closest to MDSI [Nafchi2016MeanDS], while RVSIM [Ye2012UnsupervisedFL], GMSD [Xue2014GradientMS], SRSIM [Zhang2012SRSIMAF] and FSIMc [Zhang2011FSIMAF] perform slightly below our model. The SROCC plot in Figure 5 depicts the rank correlation performance for both the FairFace and CelebA datasets. The values confirm that our model performs favourably compared with the other techniques. We also observed decent image quality improvements at about 20k - 30k iterations, as shown in Figure 6(b) for the FairFace dataset. Figures 7 and 8 show further results obtained from a combination of different loss functions and from other competitive models, respectively.
We computed the FID and IS scores of synthesised images for the CelebA and CIFAR-10 datasets with resolutions of 64 x 64 and 32 x 32, respectively. Table 4 shows the performance of our model baseline (BL) for different combinations of attention schemes (PAM and the perceptual map fusion module) and the IQA losses (NIQE and SSIM). We see from Table 4 that including the perceptual map fusion module significantly boosts image quality. This confirms that perceptual spatial salient maps are crucial in GAN models for better image quality [Yang2018BlindIQ, Saad2012BlindIQ, Johnson2016PerceptualLF].
Furthermore, we applied the PAM and perceptual map fusion attention modules to the StyleGAN2 [Karras2020AnalyzingAI] architecture. We also added the proposed Banach space norms (SSIM and NIQE) to compare the overall model performance with our model. In Table 5, we show the trade-offs of QAGAN [Kancharla2019QualityAG], StyleGAN2 [Karras2020AnalyzingAI] and our model baseline (BL). We used different combinations of the standard IQA metrics discussed in Sections 4.1 and 4.2. Our findings confirm that our approach is competitive with the state of the art. Most importantly, we see improved performance of our model at lower resolutions (32 x 32); this improvement can be attributed to the attention schema employed. In Figure 9, we showcase the performance of QAGAN [Kancharla2019QualityAG] and our model on image synthesis for the CelebA dataset. Our results show that our model performs well overall, and Table 5 gives a clearer representation of the performance levels.
In this paper, we introduced a novel quality encoding protocol that harnesses image quality maps mimicking the HVS and the perceptual properties of a deep convolutional neural network (DCNN) to provide perceptually consistent features that translate to better image quality. We identified visually sensitive parameters and adapted a quality perceptual attention scheme that narrows these features down to a localised embedding, which incentivises perceptual representations over other features. The aim was to target the most relevant intrinsic features responsible for image texture, structural contrast and luminance, which we use to guide the adversarial model towards high-quality image synthesis. We also introduced a critic model that monitors perceptual consistency for each image representation. We demonstrated state-of-the-art or comparable performance relative to other approaches.