SSD-GAN: Measuring the Realness in the Spatial and Spectral Domains. AAAI2021.
This paper observes that there is an issue of high frequencies missing in the discriminator of standard GANs, and we reveal that it stems from the downsampling layers employed in the network architecture. This issue deprives the generator of the incentive from the discriminator to learn the high-frequency content of data, resulting in a significant spectrum discrepancy between generated images and real images. Since the Fourier transform is a bijective mapping, we argue that reducing this spectrum discrepancy would boost the performance of GANs. To this end, we introduce SSD-GAN, an enhancement of GANs to alleviate the spectral information loss in the discriminator. Specifically, we propose to embed a frequency-aware classifier into the discriminator to measure the realness of the input in both the spatial and spectral domains. With the enhanced discriminator, the generator of SSD-GAN is encouraged to learn the high-frequency content of real data and generate exact details. The proposed method is general and can be easily integrated into most existing GAN frameworks without excessive cost. The effectiveness of SSD-GAN is validated on various network architectures, objective functions, and datasets. Code will be available at https://github.com/cyq373/SSD-GAN.
Generative Adversarial Networks (GANs) Goodfellow et al. (2014) involve training a generator and a discriminator network in an adversarial manner, such that the generator learns to reproduce the desired data distribution. Despite the remarkable achievements in image generation tasks Isola et al. (2017); Karras et al. (2019), recent works Zhang et al. (2019); Durall et al. (2020); Dzanic and Witherden (2019); Frank et al. (2020) show that GAN-generated images can be efficiently distinguished from real images in the frequency domain, which indicates that existing GANs fail to learn the spectral distribution of real data.
Recent studies Dzanic and Witherden (2019); Frank et al. (2020) show that the frequency spectrum discrepancy mainly exists at high frequencies. Because high-frequency components of images influence the exactness of details, the discrepancy cannot be ignored for generative tasks where details matter. As shown in Fig. 1, when real data contains significant high frequencies, standard GAN might fail to reproduce the desired data distribution. Moreover, since the Fourier transform is a bijective mapping, the frequency spectrum discrepancy between real data and the generated samples also indicates that the data distribution in image space is not well captured. We believe that reducing the spectrum discrepancy would boost the performance of GANs.
In this paper, we first attempt to explore why there is a spectrum discrepancy between real data and the generated samples. By investigating the downsampling techniques widely used in the discriminator networks, we reveal that both of these downsampling strategies, downsampling with anti-aliasing and downsampling without anti-aliasing, would lead to high frequencies missing in the discriminator. Since the training of GANs is a two-player minimax game, the generator lacks incentives from the discriminator to learn the high-frequency information of the data.
To address the issue of high frequencies missing in the discriminator, we propose SSD-GAN, whose discriminator can measure the realness of the input in both the spatial and spectral domains. To instantiate the idea, we introduce an additional spectral classifier to detect the frequency spectrum discrepancy between real and generated images and integrate it into the discriminator of GANs. With the enhanced discriminator, the generator of SSD-GAN is encouraged to reduce the frequency spectrum discrepancy and generate images that are realistic in both the spatial and spectral domains. Since a lightweight spectral classifier can be effective, the proposed method is general and can be easily integrated into most existing GAN frameworks without excessive cost. In the experiments, the effectiveness of the proposed method is validated on various network architectures, objective functions, and datasets.
The contributions of the paper can be summarized as follows:
We observe there is an issue of high frequencies missing in the discriminator of GANs and reveal it stems from downsampling layers employed in the network architecture, which results in a significant spectrum discrepancy between generated images and real images.
We introduce SSD-GAN, an enhancement of GANs to alleviate spectral information loss in the discriminator. With a discriminator that can measure both the spatial and spectral realness of an input sample, SSD-GAN can better capture the data distribution than standard GANs.
We show experimentally that the quality of the generations can be improved by reducing the frequency spectrum discrepancy, which emphasizes the necessity of learning in the frequency domain.
Recent rapid advances in Generative Adversarial Networks (GANs) Goodfellow et al. (2014)et al. (2018); Ren et al. (2019) et al. (2017) et al. (2019), etc. To enhance the quality of generated samples, PG-GAN Karras et al. (2018) trains the model in a progressive growing manner to increase the resolution of synthesized images. StyleGAN Karras et al. (2019) proposes a style-based generator for finer control over the image synthesis. Other lines of work focus mainly on improving the discriminator of GANs. As details are important for generative tasks, the PatchGAN discriminator Isola et al. (2017) utilizes local discriminator feedback to better capture local structures. SNGAN Miyato et al. (2018) limits the spectral norm of the weight matrices in the discriminator to enforce a Lipschitz constraint. To counter discriminator forgetting and stabilize the training process, SS-GAN Chen et al. (2019) proposes to rotate the image and ask the discriminator to predict the rotation angle. To effectively balance the performance of the generator and discriminator, the variational discriminator bottleneck Peng et al. (2019) constrains information flow in the discriminator. In this paper, after observing the issue of high frequencies missing in the discriminator, we aim to enhance the ability of the discriminator in the frequency domain, so that the generator is encouraged to learn the spectral distribution of real data.
Even though some GAN-generated images seem flawless to human perception, recent studies Zhang et al. (2019); Dzanic and Witherden (2019); Frank et al. (2020); Durall et al. (2020) find that frequency analysis is effective for image forensics. They also show that existing GAN-based models always fail to reproduce the spectral distribution of real data. AutoGAN Zhang et al. (2019) first identifies that spectral artifacts stem from the upsampling modules included in the GAN pipeline. To compensate for spectral distortions, a spectral regularization term Durall et al. (2020) is added to the generator loss. Frank et al. (2020) examine StyleGAN instances using different upsampling techniques and find that bilinear sampling followed by anti-aliasing filters helps to alleviate the problem. In this paper, we investigate another source of spectral distortions: the issue of high frequencies missing in the discriminator.
Apart from frequency analysis for image forensics, researchers show that neural networks exhibit a spectral bias Rahaman et al. (2019); Xu et al. (2019): they learn filters with a strong bias towards lower frequencies. Based on this observation, to effectively control the resource usage, the band-limited convolutional layer Dziedzic et al. (2019) is introduced to constrain the frequency spectra of filters and data while retaining high performance for classification tasks. However, high-frequency components cannot be ignored for generative tasks where details matter. To guarantee that all the information is kept in the model, MWCNN Liu et al. (2018) utilizes the discrete wavelet transform (DWT) as a downsampling module in the network architecture for image restoration. In contrast, we introduce a spectral classifier to compensate for the high-frequency information loss in the discriminator of GANs.
In the standard GAN Goodfellow et al. (2014), the adversarial loss for the discriminator is defined as:

$$L_D = -\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] - \mathbb{E}_{x \sim p_g(x)}[\log(1 - D(x))] \tag{1}$$

$D(x)$ represents the probability that $x$ comes from $p_{data}$ rather than the generator's distribution $p_g$. In other words, $D(x)$ measures the realness of the sample $x$. If $x$ is realistic, then it is realistic in all aspects, such as in the spatial and frequency domains. However, as pointed out in recent works Zhang et al. (2019); Durall et al. (2020); Dzanic and Witherden (2019); Wang et al. (2020b); Frank et al. (2020), existing GAN-based models usually fail to synthesize samples that are realistic in the frequency domain. This suggests that measuring the realness in the spatial domain alone is not sufficient.
Why do these GAN-based models fail to reproduce the spectral distributions? We suspect that the generator lacks incentives from the discriminator to learn the high-frequency information of the data, since the training of GANs is a two-player minimax game. To validate the assumption, we first randomly sample 1,000 images from the real dataset. Then we modulate the amplitude of the high frequencies in different bands and apply the inverse Fourier transform to the modified spectra. Finally, we compute the mean output of the StyleGAN Karras et al. (2019) discriminator on the resulting images. As shown in Fig. 2, unless we change the spectrum over a large bandwidth, the discriminator cannot tell the difference between these spectra and its outputs are roughly the same. As a result, if the generated images contain some unusual high-frequency components, the discriminator may not identify them as fake, which makes StyleGAN fail to reproduce the spectral distribution.
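The band modulation step of this probe can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function names `dft2` and `modulate_band`, the radial band definition, and the tiny 8×8 checkerboard input are all assumptions made for the example, and the discriminator evaluation itself is omitted.

```python
import cmath
import math


def dft2(a, sign):
    """Naive 2D DFT (sign=-1) or unnormalized inverse DFT (sign=+1).

    Only suitable for tiny inputs; a real experiment would use an FFT library.
    """
    M, N = len(a), len(a[0])
    return [[sum(a[m][n] * cmath.exp(sign * 2j * math.pi * (k * m / M + l * n / N))
                 for m in range(M) for n in range(N))
             for l in range(N)]
            for k in range(M)]


def modulate_band(img, r_min, r_max, gain):
    """Scale the amplitude of all frequencies whose radius lies in [r_min, r_max)."""
    M, N = len(img), len(img[0])
    F = dft2(img, -1)
    for k in range(M):
        for l in range(N):
            # Signed frequencies, so the radius measures distance from the DC term.
            fk = k if k <= M // 2 else k - M
            fl = l if l <= N // 2 else l - N
            if r_min <= math.hypot(fk, fl) < r_max:
                F[k][l] *= gain
    out = dft2(F, +1)
    # The radial mask is symmetric, so the result is real up to rounding error.
    return [[v.real / (M * N) for v in row] for row in out]


# Hypothetical input: an 8x8 checkerboard, whose only non-DC energy sits at the
# highest representable frequency (radius about 5.66 on this grid).
board = [[float((r + c) % 2) for c in range(8)] for r in range(8)]
unchanged = modulate_band(board, 5.5, 6.5, 1.0)   # gain 1 leaves the image intact
suppressed = modulate_band(board, 5.5, 6.5, 0.0)  # removing the band flattens it
```

In the experiment described above, images modified in this way would then be passed to the trained discriminator and its mean output compared across bands.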
Why does the discriminator fail to distinguish the high-frequency content of the images? We believe that there is an issue of high frequencies missing in the architecture of the discriminator. Specifically, this issue stems from the downsampling modules of the discriminator. When downsampling an input image, based on the classical sampling criterion Nyquist (1928), a reasonable approach is to anti-alias by low-pass filtering the image. Some networks adopt this form of blurred downsampling LeCun et al. (1990); Karras et al. (2019). However, the low-pass filter removes the high frequencies of the input images. Since low-pass filtering before downsampling usually results in performance degradation, it is rarely used today. Another line of downsampling methods, such as max-pooling, strided convolution, and average-pooling, abandons the use of the low-pass filter. However, these commonly used downsampling methods ignore the sampling theorem Zhang (2019), so high-frequency components are aliased and become invalid. To sum up, both of these downsampling strategies, downsampling with anti-aliasing and downsampling without anti-aliasing, lead to high frequencies missing in the discriminator.
As shown in Fig. 3, we provide evidence for the above statement. For an input image, we first enhance the amplitude of high frequencies using a sharpening filter. Then we downsample the raw image and the modulated image and compare the results. For Gaussian blur followed by average-pooling, which belongs to downsampling with anti-aliasing, the high-frequency components are attenuated, and there is no significant difference between the spectra before and after the amplitude modulation. For average-pooling, since it does not provide the anti-aliasing capability, the results exhibit aliasing (e.g., see shirt in the images), which indicates high-frequency details become invalid. The more downsampling modules the deep network has, the wider the bandwidth of the lost high frequencies, which indicates that the high frequencies missing issue cannot be ignored, especially for generative tasks where details matter.
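The loss of the highest frequencies under downsampling without anti-aliasing can be reproduced on a toy input. The sketch below is an illustration under assumed names (`checkerboard`, `avg_pool2`, `stride2`), not the experiment behind Fig. 3: it shows that a checkerboard, which carries only the highest spatial frequency, becomes indistinguishable from a flat image after a single 2×2 downsampling step.

```python
def checkerboard(n):
    """n x n image alternating 0 and 1: the highest spatial frequency on the grid."""
    return [[float((r + c) % 2) for c in range(n)] for r in range(n)]


def avg_pool2(img):
    """2x2 average pooling with stride 2, with no separate anti-aliasing filter."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]


def stride2(img):
    """Strided subsampling: keep every second pixel in each direction."""
    return [row[::2] for row in img[::2]]


board = checkerboard(8)
flat = [[0.5] * 8 for _ in range(8)]

pooled_board = avg_pool2(board)   # every 2x2 block averages to 0.5
pooled_flat = avg_pool2(flat)     # identical to pooled_board
subsampled = stride2(board)       # the pattern aliases to a constant image
```

After either operation, a later layer receives no evidence that the input contained a high-frequency pattern, which is the situation the paragraph above describes for deep discriminators with many downsampling modules.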
In this section, we first introduce a spectral classifier to detect the frequency spectrum discrepancy between real and generated images. Then we integrate it into the discriminator of GANs to enhance its ability in the spectral domain, thereby reducing the spectrum discrepancy.
To address the issue of high frequencies missing in the discriminator, a straightforward approach is to discriminate in the frequency domain rather than the spatial domain. For a discrete two-dimensional signal $f(m, n)$ representing an image of size $M \times N$, we first compute its discrete Fourier transform,

$$F(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n)\, e^{-2\pi i \left( \frac{km}{M} + \frac{ln}{N} \right)} \tag{2}$$

for the spectral coordinates $k = 0, \dots, M-1$ and $l = 0, \dots, N-1$. Then we convert it from Cartesian coordinates $k$ and $l$ to polar coordinates $r$ and $\theta$ to better represent the frequencies of different bands,

$$F(r, \theta) = F(k, l), \quad r = \sqrt{k^2 + l^2}, \quad \theta = \operatorname{atan2}(l, k) \tag{3}$$

Recent works Durall et al. (2020); Dzanic and Witherden (2019) have shown that a simple 1D representation of the Fourier power spectrum is effective at highlighting the difference between the spectral characteristics of real images and images generated by deep networks. Following these works, we obtain the reduced spectral representation by azimuthally averaging over $\theta$,

$$v(r) = \frac{1}{2\pi} \int_{0}^{2\pi} \left| F(r, \theta) \right| d\theta \tag{4}$$

which represents the mean intensity of the signal with respect to the radial distance $r$. The reduced spectral representation smooths the fluctuations in the spectrum at high frequencies.
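A discrete version of this reduced spectral representation can be sketched as below. It is a standard-library illustration under stated assumptions, not the authors' code: the function name `reduced_spectrum`, the rounding of the radius to integer bins, the use of the magnitude rather than the power, and the naive DFT (adequate only for tiny inputs) are choices made for the example.

```python
import cmath
import math


def reduced_spectrum(img):
    """Azimuthal average of |F(k, l)| over integer radius bins.

    Returns a dict mapping each radius bin r to the mean magnitude in that bin.
    """
    M, N = len(img), len(img[0])
    sums, counts = {}, {}
    for k in range(M):
        for l in range(N):
            # Naive DFT coefficient F(k, l); a real pipeline would use an FFT.
            coeff = sum(img[m][n] * cmath.exp(-2j * math.pi * (k * m / M + l * n / N))
                        for m in range(M) for n in range(N))
            # Signed frequencies, so the radius is measured from the DC term.
            fk = k if k <= M // 2 else k - M
            fl = l if l <= N // 2 else l - N
            r = int(round(math.hypot(fk, fl)))
            sums[r] = sums.get(r, 0.0) + abs(coeff)
            counts[r] = counts.get(r, 0) + 1
    return {r: sums[r] / counts[r] for r in sorted(sums)}


# Hypothetical inputs: a flat image and a checkerboard of the same mean.
flat = [[0.5] * 8 for _ in range(8)]
board = [[float((r + c) % 2) for c in range(8)] for r in range(8)]

v_flat = reduced_spectrum(flat)    # all energy in the r = 0 bin
v_board = reduced_spectrum(board)  # energy at r = 0 and in the outermost bin
```

The two images share the same mean, so they agree at $r = 0$ and differ only in the outermost radius bin, which is the kind of difference a classifier operating on this vector can pick up.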
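For an input image $x$, we use its grayscale component to compute its spectral vector $v$ and denote the process as $v = \phi(x)$. The spectral classification loss is:

$$L_C = -\mathbb{E}_{x \sim p_{data}(x)}[\log C(\phi(x))] - \mathbb{E}_{x \sim p_g(x)}[\log(1 - C(\phi(x)))] \tag{5}$$

where $C(\phi(x))$ measures the spectral realness of $x$, and $p_g(x)$ is the generator $G$'s distribution.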
Since a sample is realistic if and only if it is realistic in both the spatial and frequency domains, we propose to measure the realness of $x$ with the combination of spatial realness and spectral realness. We integrate the spectral classifier into the discriminator of GANs to encourage the generator to learn the high-frequency content of the data. As shown in Fig. 4, our enhanced discriminator $D^{ss}$ consists of two modules: a vanilla discriminator $D$, which measures the spatial realness, and a spectral classifier $C$. Thus, $D^{ss}$ is a discriminator measuring the realness of the input in both the spatial and spectral domains, and the overall realness of a sample $x$ is represented as:

$$D^{ss}(x) = \lambda D(x) + (1 - \lambda)\, C(\phi(x)) \tag{6}$$

$\lambda$ is a hyperparameter that controls the relative importance of the spatial realness and the spectral realness. The adversarial loss of the framework can be written as:

$$L_D = -\mathbb{E}_{x \sim p_{data}(x)}[\log D^{ss}(x)] - \mathbb{E}_{x \sim p_g(x)}[\log(1 - D^{ss}(x))] \tag{7}$$

where $p_g(x)$ represents the generator $G$'s distribution.
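The combination of the two realness scores and the resulting adversarial loss can be illustrated numerically. The sketch below assumes the overall realness is the convex combination $\lambda D(x) + (1 - \lambda) C(\phi(x))$ with both scores in $(0, 1)$; the function names, the default `lam=0.5`, and the small `eps` added for numerical safety are assumptions for illustration only.

```python
import math


def overall_realness(d_spatial, c_spectral, lam=0.5):
    """Combine spatial realness D(x) and spectral realness C(phi(x))."""
    return lam * d_spatial + (1.0 - lam) * c_spectral


def adversarial_loss(real_scores, fake_scores, eps=1e-12):
    """-E[log D_ss(x_real)] - E[log(1 - D_ss(x_fake))] over two score batches."""
    real_term = -sum(math.log(s + eps) for s in real_scores) / len(real_scores)
    fake_term = -sum(math.log(1.0 - s + eps) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


# A fake image that looks plausible spatially (0.8) but has a poor spectrum (0.2)
# receives a lower overall realness than its spatial score alone would suggest.
score = overall_realness(0.8, 0.2, lam=0.5)
baseline = overall_realness(0.8, 0.2, lam=1.0)   # lam = 1 ignores the spectrum
loss = adversarial_loss([0.5], [0.5])
```

With `lam=1.0` the spectral score has no influence, so the combined score reduces to the output of the vanilla discriminator.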
To train our model, we alternately update the spectral classifier $C$ with the gradient of the spectral classification loss, and the discriminator $D$ and the generator $G$ with the gradients of the adversarial loss.
Since much of the information in the image is discarded in the spectral vector $v$, we found that it cannot provide an effective gradient for the adversarial training, which degrades the performance of the model. We therefore propose that the backpropagation process of Eq. 7 does not pass through the spectral classifier $C$, so that $C(\phi(x))$ serves as a spectral modulating factor for the adversarial loss.
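We compare the gradients of the standard GAN (SGAN) and the proposed method for further insight. For a generated image $x$, the gradients of the discriminator and generator in non-saturating SGAN are respectively:

$$\nabla_{\theta_d} L_D = \frac{\nabla_{\theta_d} D(x)}{1 - D(x)}, \qquad \nabla_{\theta_g} L_G = -\frac{\nabla_x D(x)\, J_{\theta_g} x}{D(x)}$$

where $\theta_d$ and $\theta_g$ are the parameters of $D$ and $G$, and $J_{\theta_g} x$ is the Jacobian. As for our method, with the overall realness $D^{ss}(x) = \lambda D(x) + (1 - \lambda)\, C(\phi(x))$ and no backpropagation through $C$, the gradients are:

$$\nabla_{\theta_d} L_D = \frac{\lambda\, \nabla_{\theta_d} D(x)}{1 - D^{ss}(x)}, \qquad \nabla_{\theta_g} L_G = -\frac{\lambda\, \nabla_x D(x)\, J_{\theta_g} x}{D^{ss}(x)}$$

From these gradients, it can be observed that our method performs hard example mining, where "hard" is defined in the frequency domain. For the discriminator, if $C(\phi(x)) > D(x)$, the generated sample $x$ has good spectral characteristics and is a hard example to be classified as fake. For the generator, $x$ is a hard example when $C(\phi(x)) < D(x)$. This means that $x$ has poor spectral realness and needs more attention from the generator. In our model, when $x$ is a hard example in the frequency domain, the gradients of the discriminator and generator are up-weighted, which induces the model to learn the spectral distribution of the real data.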
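The hard example mining effect can be checked with a small numerical sketch. It assumes the overall realness is $\lambda D(x) + (1 - \lambda) C(\phi(x))$ and that the spectral score is treated as a constant during backpropagation, so that the gradient of $D$ is scaled by $\lambda / (1 - D^{ss}(x))$ in the discriminator update and by $\lambda / D^{ss}(x)$ in the non-saturating generator update. The function names and the example scores are hypothetical.

```python
def overall_realness(d, c, lam=0.5):
    """Assumed combination of spatial score d and spectral score c."""
    return lam * d + (1.0 - lam) * c


def discriminator_weight(d, c, lam=0.5):
    """Scale applied to grad D(x) when the discriminator sees a generated x."""
    return lam / (1.0 - overall_realness(d, c, lam))


def generator_weight(d, c, lam=0.5):
    """Scale applied to grad D(x) in the non-saturating generator update."""
    return lam / overall_realness(d, c, lam)


# Same spatial score, different spectral scores for a generated sample.
d = 0.3
good_spectrum, poor_spectrum = 0.9, 0.1

# A fake with a convincing spectrum is hard for the discriminator.
w_d_hard = discriminator_weight(d, good_spectrum)   # 0.5 / 0.4
w_d_easy = discriminator_weight(d, poor_spectrum)   # 0.5 / 0.8

# A fake with a poor spectrum is hard for the generator.
w_g_hard = generator_weight(d, poor_spectrum)       # 0.5 / 0.2
w_g_easy = generator_weight(d, good_spectrum)       # 0.5 / 0.6
```

At a fixed spatial score, the weight on the discriminator update grows with the spectral score of the fake sample, while the weight on the generator update grows as the spectral score falls.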
Since the spectral vector $\phi(x)$ is easy to compute and $C$ can be a lightweight classifier Durall et al. (2020), the proposed method is general and can be easily integrated into most existing GAN frameworks without excessive cost. In the experiments, we show that for various GAN frameworks with different objective functions, network architectures, and datasets, our method can reduce the frequency spectrum discrepancy and improve the performance in the spatial domain.
Based on StyleGAN Karras et al. (2019), we evaluate the effectiveness of our method on the FFHQ Karras et al. (2019) dataset. It consists of 70,000 high-quality images at 1024×1024 resolution. We use the same implementation as StyleGAN. In the discriminator, the activations are blurred before each downsampling layer for anti-aliasing. Training proceeds in a progressive growing manner Karras et al. (2018), starting from 8×8 and ending at 1024×1024. We apply the non-saturating loss Goodfellow et al. (2014) as our adversarial loss with $R_1$ regularization Mescheder et al. (2018). We train all our models with the Adam optimizer Kingma and Ba (2015). The total training length is 25M images. The hyperparameter $\lambda$ is set to 0.5. The experiments are conducted on 4 Tesla V100 GPUs.
We first validate whether our proposed SSD-StyleGAN can reduce spectral distortions. We estimate the average spectrum by averaging over 5,000 images. Then we plot the absolute difference between the two average spectra. As depicted in Fig. 6, compared to StyleGAN, the frequency spectrum discrepancy between images generated by SSD-StyleGAN and real data is significantly reduced. We also notice that both models have spectral distortions at the corners of the spectra, which represent extremely high frequencies of images. In practice, due to image compression algorithms applied to real data, these high-frequency bands contain little information. Therefore, to discourage overfitting, GANs tend not to learn these extremely high frequencies, and these components of the generated images behave like white noise, which has constant power density Dzanic and Witherden (2019).
Table 1 reports the performance in the spatial domain. We adopt Fréchet Inception Distance (FID) Heusel et al. (2017) to evaluate the perceptual quality of generated images, and perceptual path length (PPL) Karras et al. (2019) to measure the degree of disentanglement of representations. Because the feature extractors used in these metrics are neural networks that map from a high-dimensional input space to a low-dimensional space, they also suffer from some degree of high-frequency loss and mainly measure the characteristics in the spatial domain. It is evident that our method performs better than StyleGAN on both these metrics. We attribute the performance improvement to alleviating the high frequencies missing problem in the discriminator. By reducing spectral distortions, it helps to reproduce the spatial distribution of real data, since the Fourier transform is a bijective mapping.
For qualitative evaluation, we utilize the recent embedding algorithm Abdal et al. (2019) to map a given image into the latent space of a pre-trained StyleGAN and then reconstruct it for comparison. We note that the models may memorize images during training and therefore produce good reconstructions. To avoid rewarding such overfitting, we instead compare the interpolations of StyleGAN and SSD-StyleGAN. As shown in Fig. 7, compared to StyleGAN, the results of our proposed method show smoother morphing and better details, which is consistent with the quantitative evaluation. Fig. 5 shows a collection of images generated by SSD-StyleGAN.
In this section, we evaluate the performance of the spectral classifier at estimating the spectral quality of samples. As shown in Fig. 8, for both real data and images generated by SSD-StyleGAN, we present two images with high and low spectral quality scores. In general, the samples with high spectral quality scores display a clear portrait. However, the images with low spectral quality scores are often overexposed and lose details, or have some unusual high-frequency components (e.g., see the background and headwear in the right column). This observation shows the effectiveness of the spectral classifier $C$, which is beneficial to the learning process of SSD-StyleGAN.
Based on SNGAN Miyato et al. (2018), we evaluate the proposed method on a range of datasets including CIFAR100 Krizhevsky and Hinton (2009), STL10 Coates et al. (2011), and LSUN-bedroom Yu et al. (2015). Since these datasets have different resolutions, we mark them with the resolution: CIFAR100-32, STL10-48, and LSUN-128. We use the same training configurations as Lee and Town (2020). We train all our models with the Adam optimizer Kingma and Ba (2015). The learning rate is set to 0.0002, and the minibatch size is 64. The hyperparameter $\lambda$ is set to 0.5. All models are trained on a single Tesla V100 GPU.
We compare our method against three baselines, including:
SNGAN Miyato et al. (2018) limits the spectral norm of the weight matrices in the discriminator to enforce a Lipschitz constraint. It adopts a ResNet He et al. (2016) backbone and uses average-pooling as the downsampling layer in the discriminator. Different from StyleGAN, it utilizes the hinge version Miyato et al. (2018) of the adversarial loss.
SNGAN+REG Durall et al. (2020) adds a spectral regularization loss to the generator loss to penalize the synthesis of samples with abnormal spectra. The regularization term is measured with the binary cross entropy.
SNGAN+DWT Liu et al. (2018) adopts the discrete wavelet transform (DWT) as the downsampling layer to avoid information loss. We use the 2D Haar wavelet transform to decompose an input into a low-pass representation and three directions of high-frequency coefficients. Specifically, this DWT downsampling layer transforms the input raw images or a group of feature maps with height H, width W, and C channels into a tensor of shape H/2 × W/2 × 4C.
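The DWT downsampling used by the third baseline can be sketched with a single-level orthonormal 2D Haar transform. This standard-library example is illustrative only: the function name `haar_downsample`, the list-of-lists layout (a list of C maps, each H × W), and the sub-band ordering are assumptions, and it is not the MWCNN or SNGAN+DWT implementation.

```python
def haar_downsample(channels):
    """Single-level 2D Haar transform used as a downsampling layer.

    channels: list of C feature maps, each an H x W list of lists (H, W even).
    Returns 4*C maps of size H/2 x W/2: for each input map, one low-pass band
    followed by three high-frequency detail bands.
    """
    out = []
    for ch in channels:
        H, W = len(ch), len(ch[0])
        bands = [[], [], [], []]
        for r in range(0, H, 2):
            rows = [[], [], [], []]
            for c in range(0, W, 2):
                a, b = ch[r][c], ch[r][c + 1]
                e, f = ch[r + 1][c], ch[r + 1][c + 1]
                rows[0].append((a + b + e + f) / 2.0)  # low-pass average
                rows[1].append((a - b + e - f) / 2.0)  # difference across columns
                rows[2].append((a + b - e - f) / 2.0)  # difference across rows
                rows[3].append((a - b - e + f) / 2.0)  # diagonal difference
            for band, row in zip(bands, rows):
                band.append(row)
        out.extend(bands)
    return out


# One 4x4 checkerboard channel: the pattern survives in the diagonal band.
board = [[float((r + c) % 2) for c in range(4)] for r in range(4)]
subbands = haar_downsample([board])

energy_in = sum(v * v for row in board for v in row)
energy_out = sum(v * v for band in subbands for row in band for v in row)
```

Unlike pooling, the transform keeps all of the input energy, here split between the low-pass band and the diagonal detail band, which is the sense in which this layer avoids information loss.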
Table 2 reports the FID scores on CIFAR100-32, STL10-48, and LSUN-128. SNGAN+REG shows a performance improvement over the baseline SNGAN, indicating that utilizing spectral information is effective. Compared with SNGAN, the FID scores of SNGAN+DWT are higher (worse), especially for LSUN-128. We conjecture that because the DWT downsampling layer retains all the high-frequency information of the input, which contains both details and noise, it is difficult for the model to learn meaningful semantic representations. Moreover, as pointed out by Wang et al. (2020a), learning high-frequency information may degrade the robustness and generalization of a model. Since LSUN-128 has a higher resolution and contains more high-frequency information, the performance degradation of SNGAN+DWT on this dataset is more dramatic. Our method achieves lower FID scores than SNGAN+REG and SNGAN+DWT, indicating that it is a more effective way to utilize the high-frequency information of images. Notably, SSD-SNGAN+REG achieves the best FID scores on all the datasets. These two techniques have complementary advantages, since they encourage the model to utilize high-frequency information from the generator side and the discriminator side respectively. Fig. 9 shows a collection of images generated by SSD-SNGAN+REG on LSUN-128.
In Eq. 6, we introduce a hyperparameter $\lambda$ to control the relative importance of spatial realness and spectral realness. Here, we evaluate different hyperparameter settings on CIFAR100-32 to investigate the robustness to the choice of $\lambda$. The baseline model is SNGAN, and other training settings remain the same as in the previous section. Fig. 10 compares FID scores over the course of training for different values of $\lambda$. Note that the case of $\lambda = 1$ is the baseline model. We observe that the proposed approach yields consistent performance improvements and has considerable tolerance for the selection of the hyperparameter $\lambda$.
In this paper, we delve into why existing GANs fail to reproduce the spectral distribution of real data and reveal the issue of high frequencies missing in the discriminator. To alleviate the issue, we introduce SSD-GAN, whose discriminator is enhanced to measure the realness of samples in both the spatial and spectral domains. We provide empirical evidence that the proposed SSD-GAN can reduce frequency spectrum discrepancy, thus achieving performance improvement in the image domain.
Frequency analysis provides a novel perspective for analyzing and understanding GANs. It also opens some avenues for future research. First, although recent GAN-based models achieve good performance under existing metrics, the generated samples of these models can easily be distinguished from real data in the frequency domain. A metric that quantitatively measures the performance of generative models in the frequency domain would benefit the image synthesis community. Moreover, besides the discriminator in GANs, many machine learning tasks involve learning a mapping from a high-dimensional input space to a low-dimensional space. Constructing a general network architecture that learns semantic representations while taking high frequencies into consideration is an interesting and challenging direction for future work.
Dziedzic et al. (2019). Band-limited training and inference for convolutional neural networks. In ICML.
Peng et al. (2019). Variational discriminator bottleneck: improving imitation learning, inverse RL, and GANs by constraining information flow. In ICLR.
Yu et al. (2015). LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
In the toy example, we describe a simple yet prototypical counterexample to show that the standard GAN (SGAN) fails to learn the high frequencies of real data. The real data distribution is a Dirac distribution concentrated at a single image, which has a checkerboard pattern with significant high-frequency information. We train SGAN and SSD-GAN with the Adam optimizer. The learning rate is set to 0.0002. The models are trained for 10K iterations. For SSD-GAN, the hyperparameter $\lambda$ is 0.5.
The spectral classifier $C$ in SSD-GAN has only one fully connected layer. The following notation is used: N: the number of output channels, K: kernel size, S: stride size, P: padding size, FC: fully connected layer, BN: batch normalization, SN: spectral normalization, Up: upsampling using bilinear interpolation.
|Layer|Input → Output Shape|
|FC-(8,1024), Reshape|(8) → (64,4,4)|

|Layer|Input → Output Shape|
|ReLU, GlobalSumPool|(128,4,4) → (128)|
|FC-(128,1), SN|(128) → (1)|
We provide more qualitative results for interpolations in Fig. 11 and generated samples for multiple datasets in Fig. 12. As shown in Fig. 11, compared to StyleGAN, the results of our proposed method show smoother morphing and better details. Fig. 12 shows more generated samples of our proposed method trained on multiple datasets including LSUN-CAT, CIFAR100, and STL-10.