Learning models can be divided into discriminative and generative models [ng2002discriminative]. Many generative models and inferential autoencoders produce blurry images, for different reasons. The Variational Autoencoder (VAE) [doersch2016tutorial] has this flaw possibly because of its lower-bound approximation or the restriction on its distribution. However, another reason might be the use of a non-perceptual distance in its objective [wang2009mean]. Unconditional and conditional Generative Moment Matching Networks (GMMNs) [li2015generative, ren2016conditional] also use the radial basis function kernel, which is built on the $\ell_2$ norm. The Adversarial Autoencoder (AAE) [makhzani2015adversarial] and unconditional/conditional Generative Adversarial Networks (GANs) [goodfellow2014generative, mirza2014conditional] also use non-perceptual metrics in their objectives for comparing real and fake data. The Least Squares GAN (LSGAN) [mao2017least] uses the $\ell_2$ norm, or Mean Squared Error (MSE), in its loss function. However, MSE has been shown to be an imperfect measure for image quality assessment [wang2009mean]. The Structural Similarity Index (SSIM) [wang2004image] is a perceptual measure of image quality. In this paper, we theoretically explain how SSIM can be used in different generative models and inferential autoencoders. Using SSIM can improve the perceptual quality of the images generated by these models. This is a poster paper and, in line with what the conference expects from a short poster paper, we limit ourselves to the theoretical analysis and defer the empirical results to future work.
2 Structural Similarity Index, Image Structure Subspace, and SSIM Kernel
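Consider two reshaped images $\boldsymbol{x}_1, \boldsymbol{x}_2 \in \mathbb{R}^d$. The SSIM between two reshaped image blocks $\breve{\boldsymbol{x}}_1 \in \mathbb{R}^q$ and $\breve{\boldsymbol{x}}_2 \in \mathbb{R}^q$, in color intensity range $[0, l]$, is:
$$\text{SSIM}(\breve{\boldsymbol{x}}_1, \breve{\boldsymbol{x}}_2) := \Big(\frac{2\mu_1\mu_2 + c_1}{\mu_1^2 + \mu_2^2 + c_1}\Big) \Big(\frac{2\sigma_1\sigma_2 + c_2}{\sigma_1^2 + \sigma_2^2 + c_2}\Big) \Big(\frac{\sigma_{1,2} + c_3}{\sigma_1\sigma_2 + c_3}\Big),$$
where $\mu_1 := \frac{1}{q}\sum_{i=1}^{q} \breve{x}_1^{(i)}$, $\sigma_1^2 := \frac{1}{q-1}\sum_{i=1}^{q} (\breve{x}_1^{(i)} - \mu_1)^2$, $\sigma_{1,2} := \frac{1}{q-1}\sum_{i=1}^{q} (\breve{x}_1^{(i)} - \mu_1)(\breve{x}_2^{(i)} - \mu_2)$, $c_1 := (0.01\, l)^2$, $c_2 := 2 c_3 := (0.03\, l)^2$, and $\mu_2$ and $\sigma_2$ are defined similarly for $\breve{\boldsymbol{x}}_2$ [wang2004image]. Since $c_3 = c_2 / 2$, we can simplify SSIM to $\text{SSIM}(\breve{\boldsymbol{x}}_1, \breve{\boldsymbol{x}}_2) = s_1 \times s_2$, where $s_1 := (2\mu_1\mu_2 + c_1)/(\mu_1^2 + \mu_2^2 + c_1)$ and $s_2 := (2\sigma_{1,2} + c_2)/(\sigma_1^2 + \sigma_2^2 + c_2)$. If the vectors $\breve{\boldsymbol{x}}_1$ and $\breve{\boldsymbol{x}}_2$ have zero mean, i.e., $\mu_1 = \mu_2 = 0$, then $s_1 = 1$ and the SSIM becomes $(2\breve{\boldsymbol{x}}_1^\top \breve{\boldsymbol{x}}_2 + c)/(\|\breve{\boldsymbol{x}}_1\|_2^2 + \|\breve{\boldsymbol{x}}_2\|_2^2 + c)$, where $c := (q-1)\, c_2$ [otero2014unconstrained]. The distance based on SSIM, which we denote by $\|\breve{\boldsymbol{x}}_1 - \breve{\boldsymbol{x}}_2\|_S$, is [otero2014unconstrained, brunet2012mathematical]:
$$\|\breve{\boldsymbol{x}}_1 - \breve{\boldsymbol{x}}_2\|_S := 1 - \text{SSIM}(\breve{\boldsymbol{x}}_1, \breve{\boldsymbol{x}}_2) = \frac{\|\breve{\boldsymbol{x}}_1 - \breve{\boldsymbol{x}}_2\|_2^2}{\|\breve{\boldsymbol{x}}_1\|_2^2 + \|\breve{\boldsymbol{x}}_2\|_2^2 + c}, \tag{1}$$
where $\|\breve{\boldsymbol{x}}_1 - \breve{\boldsymbol{x}}_2\|_S \in [0, 2]$. Note that if the means of the blocks $\breve{\boldsymbol{x}}_1$ and $\breve{\boldsymbol{x}}_2$ are not very different, Eq. (1) is still a good approximation to the SSIM distance even without centering the blocks [brunet2012mathematical]. Some papers use this approximation and do not center the patches (cf. [zhao2016loss]).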
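As a minimal sketch of Eq. (1), the function below centers two reshaped patches and returns the ratio of their squared Euclidean distance to the sum of their squared norms plus a constant $c$. The function names, the default intensity range of 255, and the choice $c = (q-1)\,c_2$ with $c_2 = (0.03\, l)^2$ (which assumes sample variances normalized by $q-1$) are illustrative assumptions, not the authors' implementation.

```python
def center(patch):
    """Return a zero-mean copy of a reshaped patch (list of floats)."""
    mean = sum(patch) / len(patch)
    return [v - mean for v in patch]


def ssim_distance(x1, x2, intensity_range=255.0):
    """SSIM distance of Eq. (1) between two reshaped patches of equal length.

    Assumption: c = (q - 1) * c2 with c2 = (0.03 * intensity_range) ** 2.
    """
    if len(x1) != len(x2):
        raise ValueError("patches must have the same length")
    q = len(x1)
    a, b = center(x1), center(x2)  # Eq. (1) assumes zero-mean patches
    c = (q - 1) * (0.03 * intensity_range) ** 2
    numerator = sum((u - v) ** 2 for u, v in zip(a, b))
    denominator = sum(u * u for u in a) + sum(v * v for v in b) + c
    return numerator / denominator
```

Identical patches give a distance of zero, and a patch compared with its contrast-inverted version approaches the upper end of the range. Because the patches are centered first, two patches that differ only by a constant brightness offset also have distance zero under this sketch.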
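Some works have used SSIM in machine learning to learn the image structure subspace [ghojogh2019image], which captures the intrinsic features of an image in terms of structural similarity and distortions. In [ghojogh2019image], a kernel, named the SSIM kernel, is proposed which can be used in kernel methods in machine learning [hofmann2008kernel]. This kernel is $\boldsymbol{K} := -\frac{1}{2} \boldsymbol{H} \boldsymbol{D} \boldsymbol{H} \in \mathbb{R}^{n \times n}$, where $\boldsymbol{H} := \boldsymbol{I} - \frac{1}{n} \boldsymbol{1}\boldsymbol{1}^\top$ is the centering matrix, $\boldsymbol{D} \in \mathbb{R}^{n \times n}$ is the distance matrix, and $n$ is the sample size of data. Let $\boldsymbol{\Psi} \in \mathbb{R}^d$ be the distance map of two images $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$, whose entry for every patch of these images is [brunet2012mathematical]:
$$\sqrt{2 - s_1 - s_2}, \tag{2}$$
where $s_1$ and $s_2$ denote the two factors of the simplified SSIM of the patch pair (the mean term and the variance-covariance term, respectively).
Note that one may use Eq. (1) for the entries of $\boldsymbol{\Psi}$ (and thus in the SSIM kernel), but every patch should then be centered, whereas Eq. (2) does not require this preprocessing but may be harder to compute. The $(i,j)$-th element of the distance matrix is $\boldsymbol{D}(i,j) := \|\boldsymbol{\Psi}_{i,j}\|_F$, where $\boldsymbol{\Psi}_{i,j}$ is the distance map of the $i$-th and $j$-th images and $\|\cdot\|_F$ is the Frobenius norm. Furthermore, note that the SSIM distance is quasi-convex [brunet2012mathematical], so it is suitable for optimization [brunet2018optimizing] in different applications such as machine learning [ghojogh2019image].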
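The construction of a kernel from a distance matrix and a centering matrix can be sketched as follows. This assumes the kernel is obtained by double-centering, $-\frac{1}{2}\boldsymbol{H}\boldsymbol{D}\boldsymbol{H}$ with $\boldsymbol{H} = \boldsymbol{I} - \frac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top$, which is our reading of the description above. The function name is hypothetical, and the distance matrix is taken as given (a nested list) rather than computed from image distance maps.

```python
def kernel_from_distances(dist):
    """Return K = -0.5 * H D H for a square distance matrix (nested lists).

    Assumption: H = I - (1/n) 1 1^T, so (H D H)[i][j] equals
    D[i][j] - row_mean[i] - col_mean[j] + grand_mean.
    """
    n = len(dist)
    if any(len(row) != n for row in dist):
        raise ValueError("distance matrix must be square")
    row_mean = [sum(row) / n for row in dist]
    col_mean = [sum(dist[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (dist[i][j] - row_mean[i] - col_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]
```

As a sanity check of the double-centering step itself, feeding in the squared Euclidean distances of the points 0, 1, 2 on a line returns the Gram matrix of the centered points $-1, 0, 1$. A symmetric distance matrix always yields a symmetric kernel whose rows sum to zero.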
3 Generative Moment Matching Network
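Maximum Mean Discrepancy (MMD), or the kernel two-sample test, measures the difference between two distributions by comparing their moments [gretton2012kernel]. Let $\{\boldsymbol{x}_i\}_{i=1}^{n}$ and $\{\boldsymbol{y}_j\}_{j=1}^{m}$ be two samples of the distributions $P_x$ and $P_y$, respectively. The MMD is defined as $\text{MMD}(\mathcal{F}, P_x, P_y) := \sup_{f \in \mathcal{F}} \big(\mathbb{E}_{\boldsymbol{x} \sim P_x}[f(\boldsymbol{x})] - \mathbb{E}_{\boldsymbol{y} \sim P_y}[f(\boldsymbol{y})]\big)$, where $\mathcal{F}$ is a class of functions. If $\mathcal{F}$ is a unit ball in a universal reproducing kernel Hilbert space $\mathcal{H}$, we have:
$$\text{MMD}^2 = \Big\| \frac{1}{n}\sum_{i=1}^{n} \phi(\boldsymbol{x}_i) - \frac{1}{m}\sum_{j=1}^{m} \phi(\boldsymbol{y}_j) \Big\|_{\mathcal{H}}^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{i'=1}^{n} k(\boldsymbol{x}_i, \boldsymbol{x}_{i'}) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(\boldsymbol{x}_i, \boldsymbol{y}_j) + \frac{1}{m^2}\sum_{j=1}^{m}\sum_{j'=1}^{m} k(\boldsymbol{y}_j, \boldsymbol{y}_{j'}), \tag{3}$$
where $\phi(\cdot)$ is the pulling function (feature map) and $k(\cdot, \cdot)$ is the kernel [hofmann2008kernel].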
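The kernel form of the squared MMD in Eq. (3) can be sketched in a few lines. The code below is an illustrative estimator that accepts any kernel function of two vectors. The RBF kernel and its `bandwidth` parameter are included only as an example, and all names are assumptions rather than code from the cited works.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """RBF kernel exp(-||x - y||^2 / (2 * bandwidth^2)) on lists of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel):
    """Squared MMD between two samples via kernel averages.

    Computes mean k(x, x') - 2 * mean k(x, y) + mean k(y, y'),
    where each mean runs over all pairs, including i == i'.
    """
    n, m = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (n * n)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (m * m)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (n * m)
    return k_xx - 2.0 * k_xy + k_yy
```

Because the kernel is passed as an argument, swapping the RBF kernel for another kernel (for example, one built from SSIM distances) would only require supplying a different function with the same two-argument signature.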
GMMN [li2015generative] uses Eq. (3) as the loss for training a network, with the Radial Basis Function (RBF) kernel. The GMMN is a network which accepts random uniform samples as input and tries to match the moments of the network's output with those of a batch of training data. It has two versions, operating in data space and in code space. In the latter, the output layer of the GMMN is the latent space of an autoencoder which is trained beforehand. Conditional GMMN [ren2016conditional] uses the Conditional MMD (CMMD), in which non-uniform weights are used in the MMD. The CMMD is defined as $\text{CMMD}^2 := \| \mathcal{C}_{x|c} - \mathcal{C}_{y|c} \|_{\mathcal{F} \otimes \mathcal{G}}^2$, where $c$ is the variable conditioned on, $\mathcal{C}_{x|c}$ and $\mathcal{C}_{y|c}$ are the conditional embedding operators of the two distributions, $\mathcal{F}$ and $\mathcal{G}$ are the reproducing kernel Hilbert spaces involved, and $\otimes$ is the tensor product (see [ren2016conditional] for more details).
In GMMN and conditional GMMN, the RBF kernel is used. The SSIM kernel (see Section 2) can be used as the kernel in these two generative models. Note that only a universal kernel can be used in MMD [gretton2012kernel, li2015generative] and CMMD [ren2016conditional]. Using the Stone-Weierstrass theorem [de1959stone], [steinwart2001influence] shows that kernels which can be expanded in certain types of Taylor or Fourier series are universal. The RBF kernel is an example. It is shown in [ghojogh2019image] that the SSIM kernel can be expanded in a Taylor series similarly to the RBF kernel; hence, the SSIM kernel is a universal kernel and thus can be used in GMMN and conditional GMMN.
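For intuition about the kind of Taylor expansion meant here, consider the RBF kernel written with a width parameter $\gamma > 0$ (a symbol introduced only for this illustration). Expanding the cross term of the squared distance gives a power series in the inner product $\boldsymbol{x}^\top \boldsymbol{y}$ whose coefficients are all strictly positive:

$$k(\boldsymbol{x}, \boldsymbol{y}) = \exp\big(-\gamma \|\boldsymbol{x} - \boldsymbol{y}\|_2^2\big) = \exp\big(-\gamma \|\boldsymbol{x}\|_2^2\big)\, \exp\big(-\gamma \|\boldsymbol{y}\|_2^2\big) \sum_{r=0}^{\infty} \frac{(2\gamma)^r}{r!}\, (\boldsymbol{x}^\top \boldsymbol{y})^r.$$

This follows from $\|\boldsymbol{x} - \boldsymbol{y}\|_2^2 = \|\boldsymbol{x}\|_2^2 + \|\boldsymbol{y}\|_2^2 - 2\boldsymbol{x}^\top \boldsymbol{y}$ and the series $\exp(t) = \sum_{r \geq 0} t^r / r!$ with $t = 2\gamma\, \boldsymbol{x}^\top \boldsymbol{y}$.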
4 Variational Autoencoder
VAE [doersch2016tutorial] can be considered as the nonlinear generalization of factor analysis [harman1976modern]. Training of its encoder, with weights $\theta_e$, and decoder, with weights $\theta_d$, can be seen as the E-step and M-step in the expectation maximization algorithm, respectively. It maximizes the Evidence Lower Bound (ELBO) of the log-likelihood of data [kingma2013auto]. The loss to be minimized in VAE is:
$$\text{KL}\big(q(\boldsymbol{z}|\boldsymbol{x})\, \|\, p(\boldsymbol{z})\big) - \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}\big[\log p(\boldsymbol{x}|\boldsymbol{z})\big], \tag{4}$$
where $\text{KL}(\cdot \| \cdot)$ is the KL-divergence, $\boldsymbol{z}$ is the latent variable, $\boldsymbol{x}$ is the input or re-generated data, and $q(\boldsymbol{z}|\boldsymbol{x})$ and $p(\boldsymbol{x}|\boldsymbol{z})$ are the conditional distributions in the encoder and decoder, respectively. The first and second terms in Eq. (4) are responsible for tuning the distribution of the latent variable and for better generation of data from the latent variable, respectively. The second term, which takes care of data reconstruction, is usually replaced by the cross-entropy or the $\ell_2$ norm of the difference between the data and the generated data. However, the $\ell_2$ norm is not perfect for image fidelity [wang2009mean]. The fact that the images generated by VAE are not perceptually satisfactory has been noted in the literature [goodfellow2014generative]. We can use the SSIM distance, i.e., Eq. (1) or (2), for the second term to measure how perceptually good the generated images are.
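One possible way to read this proposal is sketched below: the reconstruction term is replaced by the SSIM distance of Eq. (1), averaged over corresponding patches of the input and its reconstruction. Several choices here are assumptions and are not stated in the text: a diagonal Gaussian encoder with a standard normal prior (which gives the closed-form KL term), an intensity range of $[0, 1]$, non-weighted addition of the two terms, and all function names.

```python
import math


def ssim_distance(x1, x2, intensity_range=1.0):
    """SSIM distance of Eq. (1) on centered patches (lists of floats)."""
    q = len(x1)
    m1, m2 = sum(x1) / q, sum(x2) / q
    a = [v - m1 for v in x1]
    b = [v - m2 for v in x2]
    c = (q - 1) * (0.03 * intensity_range) ** 2  # assumed constant
    num = sum((u - v) ** 2 for u, v in zip(a, b))
    return num / (sum(u * u for u in a) + sum(v * v for v in b) + c)


def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), assuming a Gaussian encoder."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def vae_loss_with_ssim(patches, recon_patches, mu, log_var):
    """KL term plus mean SSIM distance over corresponding patches."""
    if len(patches) != len(recon_patches):
        raise ValueError("need one reconstructed patch per input patch")
    recon = sum(ssim_distance(p, r)
                for p, r in zip(patches, recon_patches)) / len(patches)
    return gaussian_kl(mu, log_var) + recon
```

In an actual training loop the same expression would be written with a differentiable array library so that gradients flow to the encoder and decoder weights; the plain-Python version above only illustrates the arithmetic of the loss.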
5 Generative Adversarial Networks & Adversarial Autoencoder
As mentioned, GAN [goodfellow2014generative] was proposed to address the perceptual shortcomings of VAE. The argument is that the $\ell_2$ norm used in VAE is a man-made distance, whereas GAN uses a more complicated distance measured by a classifier network. This makes the images generated by GAN perceptually better. We expect the generated images to become perceptually better still when the power of the complicated classifier and the SSIM metric are combined. The loss in GAN is a game-theoretical min-max problem:
$$\min_{G} \max_{D}\; V(D, G) := \mathbb{E}_{\boldsymbol{x} \sim p_{\text{data}}(\boldsymbol{x})}\big[\log D(\boldsymbol{x})\big] + \mathbb{E}_{\boldsymbol{z} \sim p_z(\boldsymbol{z})}\big[\log\big(1 - D(G(\boldsymbol{z}))\big)\big], \tag{5}$$
where $p_{\text{data}}(\boldsymbol{x})$ and $p_z(\boldsymbol{z})$ are the distributions of the data and the latent variable, respectively, and $D$ and $G$ are the discriminator (classifier) and generator, respectively. The probability of data coming from the real distribution is denoted by $D(\boldsymbol{x})$. The first term in Eq. (5) is the log-likelihood of real data and the second term measures how different the generated and real data are. We can replace the second term by Eq. (1) or (2), i.e., by the expected SSIM distance $\mathbb{E}_{\boldsymbol{x} \sim p_{\text{data}}(\boldsymbol{x}),\, \boldsymbol{z} \sim p_z(\boldsymbol{z})}\big[\|\boldsymbol{x} - G(\boldsymbol{z})\|_S\big]$ between real and generated data, which is minimized and maximized by the generator and discriminator, respectively. This term measures how perceptually different the generated and real data are, resulting in perceptually better generated images because both the generator and discriminator become more powerful in terms of perceptual differences. Conditional GAN [mirza2014conditional] can use the same idea in its loss to obtain better generated images. AAE [makhzani2015adversarial], which is trained in an adversarial way like GAN, can also use the SSIM distance in its loss in the same manner. Note that [kancharla2018improving] has used a similar technique in GAN but with multi-scale SSIM (MS-SSIM) [wang2003multiscale] used in Eq. (1) (see also [snell2017learning]). Other ideas are used in some papers such as [wu2017srpgan], which utilizes the middle-layer features of the network to obtain perceptually better generated images in GAN. On the other hand, LSGAN [mao2017least] uses the $\ell_2$ norm, or MSE, in its objective function to be optimized by the discriminator and generator. However, MSE is not suitable for perceptual image assessment. We propose to use the SSIM distance, i.e., Eq. (1) or (2), in place of the MSE terms in the LSGAN objectives to obtain perceptually better generated images.
We theoretically analyzed how to use SSIM in generative models and inferential autoencoders including GMMN, VAE, AAE, GAN, and LSGAN. The use of SSIM in these models can improve the perceptual quality of the generated images as these models use non-perceptual distance metrics in their loss functions.