Unsupervised learning with generative adversarial networks (GANs) has proven hugely successful. Regular GANs hypothesize the discriminator as a classifier with the sigmoid cross entropy loss function. However, we found that this loss function may lead to the vanishing gradients problem during the learning process. To overcome this problem, we propose the Least Squares Generative Adversarial Networks (LSGANs), which adopt the least squares loss function for the discriminator. We show that minimizing the objective function of LSGANs yields minimizing the Pearson χ^2 divergence. We also present a theoretical analysis of the properties of LSGANs and the χ^2 divergence. LSGANs have two benefits over regular GANs. First, LSGANs are able to generate higher quality images than regular GANs. Second, LSGANs are more stable during the learning process. To evaluate image quality, we train LSGANs on several datasets including LSUN and a cat dataset, and the experimental results show that the images generated by LSGANs are of better quality than those generated by regular GANs. Furthermore, we evaluate the stability of LSGANs in two groups. The first compares LSGANs and regular GANs without gradient penalty. We conduct three experiments to illustrate the stability of LSGANs: learning a Gaussian mixture distribution, training with difficult architectures, and a newly proposed method, training on datasets with small variance. The second compares LSGANs with gradient penalty and WGANs with gradient penalty (WGANs-GP). The experimental results show that LSGANs with gradient penalty succeed in training for all the difficult architectures used in WGANs-GP, including 101-layer ResNet.
These tasks obviously fall into the scope of supervised learning, which means that a lot of labeled data are provided for the learning processes. Compared with supervised learning, however, unsupervised learning tasks, such as generative models, have so far benefited less from deep learning. Although some deep generative models, e.g., RBM, DBM, and VAE, have been proposed, these models face the difficulty of intractable functions or intractable inference, which in turn restricts their effectiveness.
Recently, generative adversarial networks (GANs) have demonstrated impressive performance for unsupervised learning tasks. Unlike other deep generative models, which usually adopt approximation methods for intractable functions or inference, GANs do not require any approximation and can be trained end-to-end through the differentiable networks. The basic idea of GANs is to simultaneously train a discriminator and a generator: the discriminator aims to distinguish between real samples and generated samples, while the generator tries to generate fake samples as real as possible, making the discriminator believe that the fake samples are from real data. So far, plenty of works have shown that GANs can play a significant role in various tasks, such as image generation, image super-resolution, and semi-supervised learning.
In spite of the great progress of GANs in image generation, the quality of the generated images is still limited for some realistic tasks. Regular GANs adopt the sigmoid cross entropy loss function for the discriminator. We argue that this loss function, however, leads to the problem of vanishing gradients when updating the generator using fake samples that are on the correct side of the decision boundary but are still far from the real data. As Figure 1(b) shows, when we use the fake samples (in magenta) to update the generator by making the discriminator believe they are from real data, they cause almost no error because they are on the correct side, i.e., the real data side, of the decision boundary. However, these samples are still far from the real data, and we want to pull them closer to the real data. Based on this observation, we propose the Least Squares Generative Adversarial Networks (LSGANs), which adopt the least squares loss function for the discriminator. The idea is simple yet powerful: the least squares loss function is able to move the fake samples toward the decision boundary, because it penalizes samples that lie a long way on the correct side of the decision boundary. As Figure 1(c) shows, the least squares loss function penalizes the fake samples (in magenta) and pulls them toward the decision boundary even though they are correctly classified. Based on this property, LSGANs are able to generate samples that are closer to the real data.
Another benefit of LSGANs is the improved stability of the learning process. Generally speaking, training GANs is difficult in practice because of the instability of GAN learning. Recently, several papers have pointed out that this instability is partially caused by the objective function [12, 13, 14]. Specifically, minimizing the objective function of regular GANs suffers from vanishing gradients, which makes it hard to update the generator. LSGANs can relieve this problem because they penalize samples based on their distances to the decision boundary, which yields larger gradients for updating the generator.
Recently, Arjovsky et al. have proposed a method to evaluate the stability of GANs by using difficult architectures. However, in practice, one will always select stable architectures for one's tasks. Motivated by this, we propose to use difficult datasets but stable architectures to evaluate the stability of GANs. Specifically, we create two synthetic digit datasets with small variance by rendering digits in the Times-New-Roman font. We find that datasets with small variance are difficult for GANs to learn, since the discriminator can distinguish the real samples very easily for such datasets.
Gradient penalty has been shown to be effective for improving the stability of GAN training [15, 16]. We find that gradient penalty is also able to improve the stability of LSGANs. By adding a gradient penalty, LSGANs are able to train successfully for all the difficult architectures used in WGANs-GP.
Our contributions in this paper can be summarized as follows:
We propose LSGANs, which adopt the least squares loss function for the discriminator. We show that minimizing the objective function of LSGANs yields minimizing the Pearson χ^2 divergence.
We evaluate LSGANs on several datasets including LSUN and a cat dataset, and the experimental results demonstrate that LSGANs can generate higher quality images than regular GANs.
We propose a new method for evaluating the stability of GANs. Two synthetic digit datasets with small variance are created and published. We train LSGANs and regular GANs with a stable architecture on these synthetic datasets. LSGANs succeed in generating high quality digits, while regular GANs suffer from a severe mode collapse problem. Two further comparison experiments are also conducted to demonstrate the stability of LSGANs.
We evaluate LSGANs with gradient penalty for six difficult architectures which are used in WGANs-GP. The experimental results show that LSGANs with gradient penalty succeed in training for all the six architectures including 101-layer ResNet.
This paper extends our previous conference work in a number of ways. First, we present more theoretical analysis of the properties of LSGANs and the χ^2 divergence. Second, we provide new results on a cat dataset, which also show that LSGANs are able to generate higher quality images than regular GANs. Third, we propose a new method for evaluating the stability of GANs. In addition to using difficult architectures, we propose to use difficult datasets but stable architectures to evaluate the stability of GANs. This experiment also shows that LSGANs are more stable than regular GANs. Finally, we present a new comparison experiment between LSGANs with gradient penalty and WGANs with gradient penalty (WGANs-GP). The results show that LSGANs with gradient penalty succeed in training for all the architectures used in WGANs-GP, including 101-layer ResNet, and that the stability of LSGANs with gradient penalty is comparable to that of WGANs-GP.
Generative Adversarial Networks (GANs) were proposed by Goodfellow et al., who explained the theory of GANs learning based on a game theoretic scenario. Showing powerful capability for unsupervised tasks, GANs have been applied to many specific tasks, like image generation, image super-resolution, text to image synthesis, and image to image translation. By combining the traditional content loss and the adversarial loss, super-resolution generative adversarial networks achieve state-of-the-art performance for the task of image super-resolution. Reed et al. proposed a model to synthesize images given text descriptions based on the conditional GANs. Isola et al. also used the conditional GANs to transfer images from one representation to another. In addition to unsupervised learning tasks, GANs also show potential for semi-supervised learning tasks. Salimans et al. proposed a GAN-based framework for semi-supervised learning, in which the discriminator not only outputs the probability that an input image is from real data but also outputs the probabilities of belonging to each class.
Despite the great successes GANs have achieved, improving the quality of generated images is still a challenge. A lot of works have been proposed to improve the quality of images for GANs. Radford et al.  first introduced convolutional layers to GANs architecture, and proposed a network architecture called deep convolutional generative adversarial networks (DCGANs). Denton et al.  proposed another framework called Laplacian pyramid of generative adversarial networks (LAPGANs). They constructed a Laplacian pyramid to generate high-resolution images starting from low-resolution images. Further, Salimans et al.  proposed a technique called feature matching to get better convergence. The idea is to make the generated samples match the statistics of the real data by minimizing the mean square error on an intermediate layer of the discriminator.
Another critical issue for GANs is the stability of the learning process. Many works have been proposed to address this problem by analyzing the objective functions of GANs [12, 23, 13, 24, 14]. Viewing the discriminator as an energy function, Zhao et al. used an auto-encoder architecture to improve the stability of GANs learning. To make the generator and the discriminator more balanced, Metz et al. created an unrolled objective function to enhance the generator. Che et al. incorporated a reconstruction module and used the distance between real samples and reconstructed samples as a regularizer to get more stable gradients. Nowozin et al. pointed out that the objective of the original GAN, which is related to the Jensen-Shannon divergence, is a special case of divergence estimation, and generalized it to arbitrary f-divergences. Arjovsky et al. extended this by analyzing the properties of four different divergences or distances over two distributions and concluded that the Wasserstein distance is nicer than the Jensen-Shannon divergence. Qi proposed the Loss-Sensitive GAN, whose loss function is based on the assumption that real samples should have smaller losses than fake samples, and proved that this loss function has non-vanishing gradient almost everywhere.
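The learning process of GANs is to train a discriminator $D$ and a generator $G$ simultaneously. The target of $G$ is to learn the distribution $p_g$ over data $x$. $G$ starts from sampling input variables $z$ from a uniform or Gaussian distribution $p_z(z)$, then maps the input variables $z$ to data space $G(z; \theta_g)$ through a differentiable network. On the other hand, $D$ is a classifier $D(x; \theta_d)$ that aims to recognize whether an image is from training data or from $G$. The minimax objective for GANs can be formulated as follows:

$$\min_G \max_D V_{\text{GAN}}(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log(1 - D(G(z)))\big]. \tag{1}$$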
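As a small illustration of the minimax objective, the following sketch estimates the value function from a batch of discriminator outputs. It assumes that the discriminator outputs are already probabilities strictly between 0 and 1; the function name `gan_value` and the example inputs are hypothetical and not part of our implementation.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator probabilities on real samples, each in (0, 1).
    d_fake: discriminator probabilities on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives the value 2 * log(0.5) = -log(4).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator pushes the value toward 0 from below.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator tries to increase this quantity, while the generator tries to decrease it through the second term.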
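Viewing the discriminator as a classifier, regular GANs adopt the sigmoid cross entropy loss function. As stated in Section 1, when updating the generator, this loss function will cause the problem of vanishing gradients for the samples that are on the correct side of the decision boundary, but are still far from the real data. To remedy this problem, we propose the Least Squares Generative Adversarial Networks (LSGANs). Suppose we use the $a$-$b$ coding scheme for the discriminator, where $a$ and $b$ are the labels for fake data and real data, respectively. Then the objective functions for LSGANs can be defined as follows:

$$\begin{aligned}
\min_D V_{\text{LSGAN}}(D) &= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[(D(x) - b)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - a)^2\big], \\
\min_G V_{\text{LSGAN}}(G) &= \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - c)^2\big],
\end{aligned} \tag{2}$$

where $c$ denotes the value that $G$ wants $D$ to believe for fake data.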
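The two least squares objectives can be sketched directly from their definitions. The snippet below is a minimal pure-Python illustration that takes raw discriminator outputs as lists of numbers; the function names and the default labels (`a=0.0`, `b=1.0`, `c=1.0`) are assumptions made for the example, not a description of our TensorFlow implementation.

```python
def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Discriminator loss: 0.5 * E[(D(x) - b)^2] + 0.5 * E[(D(G(z)) - a)^2]."""
    real_term = sum((d - b) ** 2 for d in d_real) / len(d_real)
    fake_term = sum((d - a) ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term


def lsgan_g_loss(d_fake, c=1.0):
    """Generator loss: 0.5 * E[(D(G(z)) - c)^2]."""
    fake_term = sum((d - c) ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * fake_term


# Raw (unbounded) discriminator outputs on a batch of real and fake samples.
d_real = [0.9, 1.1, 1.0]
d_fake = [0.2, -0.1, 0.0]

d_loss = lsgan_d_loss(d_real, d_fake)
g_loss = lsgan_g_loss(d_fake)

# A fake sample scored far beyond the target c is still penalized.
far_sample_loss = lsgan_g_loss([3.0], c=1.0)
```

Note that the generator loss is nonzero for any output different from $c$, including outputs that lie far on the real-data side of the decision boundary.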
The benefits of LSGANs can be derived from two aspects. First, unlike regular GANs, which cause almost no loss for samples that lie a long way on the correct side of the decision boundary (Figure 1(b)), LSGANs penalize those samples even though they are correctly classified (Figure 1(c)). When we update the generator, the parameters of the discriminator are fixed, i.e., the decision boundary is fixed. As a result, the penalization makes the generator generate samples toward the decision boundary. On the other hand, the decision boundary should go across the manifold of real data for successful GAN learning; otherwise, the learning process will saturate. Thus moving the generated samples toward the decision boundary brings them closer to the manifold of real data.
Second, penalizing the samples lying a long way from the decision boundary yields larger gradients when updating the generator, which in turn relieves the problem of vanishing gradients. This allows LSGANs to be more stable during the learning process. This benefit can also be derived from another perspective: as shown in Figure 2, the least squares loss function is flat only at one point, while the sigmoid cross entropy loss function saturates when $x$ is relatively large. Furthermore, we provide a theoretical analysis of the stability of LSGANs in Section 3.3.2.
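The saturation argument can be checked numerically. The sketch below compares the gradient of the two losses with respect to the discriminator's raw output $x$ for a fake sample. It assumes the generator of a regular GAN is updated with the loss $-\log \sigma(x)$ (making the discriminator believe the sample is real) and that the least squares target is $c = 1$; both choices and all function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sce_grad(x):
    """Gradient of the sigmoid cross entropy loss -log(sigmoid(x)) w.r.t. x."""
    return sigmoid(x) - 1.0


def ls_grad(x, c=1.0):
    """Gradient of the least squares loss 0.5 * (x - c)^2 w.r.t. x."""
    return x - c


# Raw discriminator outputs for fake samples that lie increasingly far
# on the real-data side of the decision boundary.
outputs = [1.0, 2.0, 5.0, 10.0]
sce_magnitudes = [abs(sce_grad(x)) for x in outputs]
ls_magnitudes = [abs(ls_grad(x)) for x in outputs]
```

Under these assumptions the cross entropy gradient shrinks toward zero as $x$ grows, while the least squares gradient vanishes only at $x = c$ and grows linearly with the distance from it.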
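Here we also explore the relation between LSGANs and f-divergence. Consider the following extension of Equation 2:

$$\begin{aligned}
\min_D V_{\text{LSGAN}}(D) &= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[(D(x) - b)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - a)^2\big], \\
\min_G V_{\text{LSGAN}}(G) &= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[(D(x) - c)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - c)^2\big].
\end{aligned} \tag{4}$$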
Figure 3: (a) Generated images by LSGANs. (b) Generated images by DCGANs. (c) Generated images by DCGANs (as reported in the cited work).
Figure 4: Model architecture. “$K \times K$, conv/deconv, $C$, stride = $S$” denotes a convolutional/deconvolutional layer with $K \times K$ kernel, $C$ output filters and stride = $S$. The layer with BN means that the layer is followed by a batch normalization layer. “fc, $N$” denotes a fully-connected layer with $N$ output nodes. The activation layers are omitted. (a): The generator. (b): The discriminator.
Note that adding the term $\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[(D(x) - c)^2\big]$ to $V_{\text{LSGAN}}(G)$ does not change the optimal values since this term does not contain parameters of $G$.
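We first prove that the optimal discriminator $D$ for a fixed $G$ is as below:

$$D^*(x) = \frac{b\,p_{\text{data}}(x) + a\,p_g(x)}{p_{\text{data}}(x) + p_g(x)}. \tag{5}$$

Proof. Given any generator $G$, we try to minimize $V_{\text{LSGAN}}(D)$ with respect to the discriminator $D$:

$$\begin{aligned}
V_{\text{LSGAN}}(D) &= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\big[(D(x) - b)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(G(z)) - a)^2\big] \\
&= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\big[(D(x) - b)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{x \sim p_g}\big[(D(x) - a)^2\big] \\
&= \int_{\mathcal{X}} \tfrac{1}{2}\Big(p_{\text{data}}(x)\,(D(x) - b)^2 + p_g(x)\,(D(x) - a)^2\Big)\,dx.
\end{aligned} \tag{6}$$

Consider the internal function:

$$\tfrac{1}{2}\Big(p_{\text{data}}(x)\,(D(x) - b)^2 + p_g(x)\,(D(x) - a)^2\Big). \tag{7}$$

It achieves the minimum at $\frac{b\,p_{\text{data}}(x) + a\,p_g(x)}{p_{\text{data}}(x) + p_g(x)}$ with respect to $D(x)$, concluding the proof.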
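The minimizer follows from setting the derivative of the internal function with respect to $D(x)$ to zero, i.e., $p_{\text{data}}(x)(D(x) - b) + p_g(x)(D(x) - a) = 0$. As a sanity check, the sketch below compares this closed form with a brute-force search over a grid at a single point $x$. The density values, the labels, and the grid resolution are arbitrary assumptions chosen for illustration.

```python
def pointwise_loss(d, p_data, p_g, a, b):
    """Internal function 0.5 * (p_data * (d - b)^2 + p_g * (d - a)^2)."""
    return 0.5 * (p_data * (d - b) ** 2 + p_g * (d - a) ** 2)


def optimal_d(p_data, p_g, a, b):
    """Closed-form minimizer (b * p_data + a * p_g) / (p_data + p_g)."""
    return (b * p_data + a * p_g) / (p_data + p_g)


# Hypothetical density values at one point x, and labels a (fake), b (real).
p_data, p_g, a, b = 0.3, 0.7, -1.0, 1.0

d_closed_form = optimal_d(p_data, p_g, a, b)

# Brute-force search over candidate outputs in [-2, 2] with step 0.001.
grid = [i / 1000.0 for i in range(-2000, 2001)]
d_grid = min(grid, key=lambda d: pointwise_loss(d, p_data, p_g, a, b))
```

The optimal output is a density-weighted average of the two labels, so it equals the midpoint of $a$ and $b$ wherever $p_{\text{data}}(x) = p_g(x)$.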
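In the following equations we use $p_d$ to denote $p_{\text{data}}$ for simplicity. Then we can reformulate $V_{\text{LSGAN}}(G)$ in Equation 4 as follows:

$$\begin{aligned}
2C(G) &= \mathbb{E}_{x \sim p_d}\big[(D^*(x) - c)^2\big] + \mathbb{E}_{x \sim p_g}\big[(D^*(x) - c)^2\big] \\
&= \mathbb{E}_{x \sim p_d}\Big[\Big(\tfrac{b\,p_d(x) + a\,p_g(x)}{p_d(x) + p_g(x)} - c\Big)^2\Big] + \mathbb{E}_{x \sim p_g}\Big[\Big(\tfrac{b\,p_d(x) + a\,p_g(x)}{p_d(x) + p_g(x)} - c\Big)^2\Big] \\
&= \int_{\mathcal{X}} \frac{\big((b - c)\,p_d(x) + (a - c)\,p_g(x)\big)^2}{p_d(x) + p_g(x)}\,dx \\
&= \int_{\mathcal{X}} \frac{\big((b - c)\,(p_d(x) + p_g(x)) - (b - a)\,p_g(x)\big)^2}{p_d(x) + p_g(x)}\,dx.
\end{aligned} \tag{8}$$

If we set $b - c = 1$ and $b - a = 2$, then

$$2C(G) = \int_{\mathcal{X}} \frac{\big(2\,p_g(x) - (p_d(x) + p_g(x))\big)^2}{p_d(x) + p_g(x)}\,dx = \chi^2_{\text{Pearson}}(p_d + p_g \,\|\, 2p_g), \tag{9}$$

where $\chi^2_{\text{Pearson}}$ is the Pearson $\chi^2$ divergence. Thus minimizing Equation 4 yields minimizing the Pearson $\chi^2$ divergence between $p_d + p_g$ and $2p_g$ if $a$, $b$, and $c$ satisfy the conditions $b - c = 1$ and $b - a = 2$.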
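The relation to the Pearson $\chi^2$ divergence can be verified numerically on discrete distributions, where the integrals become sums. The sketch below evaluates the generator objective under the optimal discriminator and compares it with the $\chi^2$ form for labels satisfying $b - c = 1$ and $b - a = 2$. The two probability vectors are arbitrary assumptions, and the function names are hypothetical.

```python
def generator_objective_2c(p_d, p_g, a, b, c):
    """2C(G) = sum over x of (p_d + p_g) * (D*(x) - c)^2 with the optimal D*."""
    total = 0.0
    for pd, pg in zip(p_d, p_g):
        d_star = (b * pd + a * pg) / (pd + pg)
        total += (pd + pg) * (d_star - c) ** 2
    return total


def pearson_chi2_form(p_d, p_g):
    """sum over x of (2 * p_g - (p_d + p_g))^2 / (p_d + p_g)."""
    return sum((2 * pg - (pd + pg)) ** 2 / (pd + pg) for pd, pg in zip(p_d, p_g))


# Two hypothetical discrete distributions over three outcomes.
p_d = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

lhs = generator_objective_2c(p_d, p_g, a=-1.0, b=1.0, c=0.0)
rhs = pearson_chi2_form(p_d, p_g)

# Another label choice that satisfies b - c = 1 and b - a = 2.
lhs_shifted = generator_objective_2c(p_d, p_g, a=0.0, b=2.0, c=1.0)
```

Both label choices give the same value as the $\chi^2$ form, and the value is zero exactly when the two distributions coincide.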
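The Jensen-Shannon divergence can be viewed as an interpolation between $\mathrm{KL}(p_g \| p_d)$ and $\mathrm{KL}(p_d \| p_g)$ controlled by a mixing weight $\pi$, where Equation 3 corresponds to $\pi = 0.5$. It has also been found that optimizing Equation 3 tends to perform similarly to $\mathrm{KL}(p_g \| p_d)$. $\mathrm{KL}(p_g \| p_d)$ is widely used in variational inference due to the convenient evidence lower bound. However, optimizing $\mathrm{KL}(p_g \| p_d)$ has the problem of mode-seeking behavior or under-dispersed approximations [28, 29, 27]. This problem also appears in GAN learning, where it is known as the mode collapse problem. The definition of $\mathrm{KL}(p_g \| p_d)$ is given as follows:

$$\mathrm{KL}(p_g \| p_d) = -\int_{\mathcal{X}} p_g(x)\,\ln\frac{p_d(x)}{p_g(x)}\,dx.$$

The mode-seeking behavior of $\mathrm{KL}(p_g \| p_d)$ can be understood by noting that $p_g$ will be close to zero where $p_d$ is near zero, because $\mathrm{KL}(p_g \| p_d)$ will be infinite if $p_d = 0$ and $p_g > 0$. This is called the zero-forcing property.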
Figure 5: (a) Church outdoor. (b) Dining room. (c) Kitchen. (d) Conference room.
Recently, the $\chi^2$ divergence has drawn researchers’ attention in variational inference since it is able to produce over-dispersed approximations. The objective function in Equation 9 would become infinite only if $p_d + p_g = 0$ and $p_g - p_d \neq 0$, which cannot happen since $p_g \geq 0$ and $p_d \geq 0$. Thus $\chi^2_{\text{Pearson}}$ has no zero-forcing property. This makes LSGANs less mode-seeking and relieves the mode collapse problem.
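The difference between the two divergences can be made concrete on a discrete example in which the generator places mass where the data distribution has none. The sketch below is illustrative only: the distributions are arbitrary assumptions, and the convention that outcomes with zero mass contribute nothing to either sum is stated in the comments.

```python
import math


def kl_pg_pd(p_d, p_g):
    """KL(p_g || p_d) = sum over x of p_g * ln(p_g / p_d), with 0 * ln(0) := 0."""
    total = 0.0
    for pd, pg in zip(p_d, p_g):
        if pg == 0.0:
            continue  # outcomes the generator never produces contribute nothing
        if pd == 0.0:
            return math.inf  # generator mass where the data has none
        total += pg * math.log(pg / pd)
    return total


def pearson_chi2_form(p_d, p_g):
    """sum over x of (p_g - p_d)^2 / (p_d + p_g), skipping outcomes with no mass."""
    total = 0.0
    for pd, pg in zip(p_d, p_g):
        if pd + pg == 0.0:
            continue  # numerator is also zero here, so the term is dropped
        total += (pg - pd) ** 2 / (pd + pg)
    return total


# The generator puts 20% of its mass on an outcome the data never produces.
p_d = [0.5, 0.5, 0.0]
p_g = [0.4, 0.4, 0.2]

kl_value = kl_pg_pd(p_d, p_g)
chi2_value = pearson_chi2_form(p_d, p_g)
```

In this example the KL term is infinite, which forces $p_g$ to zero wherever $p_d$ is zero, whereas the $\chi^2$ form stays finite.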
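One method to determine the values of $a$, $b$, and $c$ in Equation 2 is to satisfy the conditions $b - c = 1$ and $b - a = 2$, such that minimizing Equation 2 yields minimizing the Pearson $\chi^2$ divergence between $p_d + p_g$ and $2p_g$. For example, by setting $a = -1$, $b = 1$, and $c = 0$, we get the following objective functions:

$$\begin{aligned}
\min_D V_{\text{LSGAN}}(D) &= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[(D(x) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) + 1)^2\big], \\
\min_G V_{\text{LSGAN}}(G) &= \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)))^2\big].
\end{aligned}$$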
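Another method is to make $G$ generate samples as real as possible by setting $c = b$. For example, by using the 0-1 binary coding scheme, we get the following objective functions:

$$\begin{aligned}
\min_D V_{\text{LSGAN}}(D) &= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[(D(x) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)))^2\big], \\
\min_G V_{\text{LSGAN}}(G) &= \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - 1)^2\big].
\end{aligned}$$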
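The two ways of choosing the labels can be compared with a short check of the conditions $b - c = 1$ and $b - a = 2$. The helper below is a hypothetical illustration; it only tests the conditions and makes no claim about which scheme performs better in practice.

```python
def satisfies_pearson_conditions(a, b, c, tol=1e-12):
    """Return True when b - c = 1 and b - a = 2 hold up to a small tolerance."""
    return abs((b - c) - 1.0) < tol and abs((b - a) - 2.0) < tol


# Scheme 1: a = -1, b = 1, c = 0 (chosen to satisfy the conditions).
scheme_pearson = satisfies_pearson_conditions(a=-1.0, b=1.0, c=0.0)

# Scheme 2: 0-1 binary coding with c = b (the generator targets the real label).
scheme_binary = satisfies_pearson_conditions(a=0.0, b=1.0, c=1.0)
```

Only the first scheme satisfies both conditions; the 0-1 scheme instead follows from setting $c = b$ and is not covered by the $\chi^2$ derivation.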
Figure 6: (a) Generated cats by LSGANs. (b) Generated cats by DCGANs.
In this section, we first present some details of our implementation. Next, we present the results of the qualitative evaluation and quantitative evaluation of LSGANs. Then we compare the stability between LSGANs and DCGANs without gradient penalty by three comparison experiments. Finally, we evaluate the stability of LSGANs with gradient penalty.
The implementation of our proposed models is based on a public implementation of DCGANs (https://github.com/carpedm20/DCGAN-tensorflow) using TensorFlow. The same learning rate is used for all datasets except LSUN-scenes, which uses a separate learning rate. Following DCGANs, $\beta_1$ for the Adam optimizer is set to 0.5. Our implementation is available at https://github.com/xudonmao/improved_LSGAN; some of the implementation is available in the project of the conference work: https://github.com/xudonmao/LSGAN.
Scenes Generation We train LSGANs and DCGANs with the same network architecture (Figure 4) and the same resolution on the LSUN-bedroom dataset. The images generated by the two methods are presented in Figure 3. Compared with the images generated by DCGANs, the texture details (e.g., the textures of beds) of the images generated by LSGANs are finer, and the images generated by LSGANs look sharper. We also train LSGANs on four other scene datasets: church, dining room, kitchen, and conference room. The generated results are shown in Figure 5.
Cats Generation We also evaluate LSGANs on a cat dataset. We first use the preprocessing methods in a public project (https://github.com/AlexiaJM/Deep-learning-with-cats) to get cat head images above a minimum resolution, and then resize all the images to a common resolution. The network architecture used in this task consists of four transposed convolutional layers for the generator and four convolutional layers for the discriminator. We use the following evaluation protocol for comparing the performance of LSGANs and DCGANs. First, we train LSGANs and DCGANs using the same architecture on the cat dataset. During training, we save a checkpoint of the model and a batch of generated images at regular iteration intervals. Second, we select the best models of LSGANs and DCGANs by checking the quality of the images saved at each interval. Finally, we use the selected best models to randomly generate cat images and compare the quality of the generated images. The selected models of LSGANs and DCGANs are available at https://github.com/xudonmao/improved_LSGAN. Figure 6 shows the generated cat images of LSGANs and DCGANs, and more results can be found in the appendix. We observe that LSGANs generate some cats (e.g., the second and fourth cats of row 1 in Figure 6) with sharper and finer fur and faces than the ones generated by DCGANs. By checking more samples in the appendix or using the above saved models to generate more samples, we also observe that the overall quality of the images generated by LSGANs is better than that of DCGANs.
Table I: Inception scores on CIFAR-10 for LSGANs and DCGANs, including the DCGAN score reported in the cited work.
Inception Score We train LSGANs and DCGANs with the same network architecture on CIFAR-10  and use the models to randomly generate images for calculating the inception scores . The evaluated inception scores of LSGANs and DCGANs are shown in Table I. As we observe that the inception scores vary for different trained models, the reported inception scores in Table I are averaged over different trained models for both LSGANs and DCGANs. For this quantitative evaluation of inception score, LSGANs show comparable performance to DCGANs.
Human Subjective Study To further evaluate the performance of LSGANs, we conduct a human subjective study using the generated bedroom images from LSGANs and DCGANs with the same network architectures. We randomly construct image pairs, where one image is from LSGANs and the other is from DCGANs. We ask Amazon Mechanical Turk annotators to judge which image looks more realistic. With 4,000 votes in total, DCGANs get 43.6% of the votes and LSGANs get 56.4%, i.e., LSGANs get 12.8 percentage points more votes than DCGANs.
Figure 8: (a) LSGANs: without BN in G using Adam. (b) DCGANs: without BN in G using Adam. (c) LSGANs: without BN in G and D using RMSProp. (d) DCGANs: without BN in G and D using RMSProp.
Figure 9: (a) Training on a synthetic digit dataset with random horizontal shift. (b) Training on a synthetic digit dataset with random horizontal shift and rotation.
Figure 10: (a) G: No BN and a constant number of filters, D: DCGAN. (b) G: 4-layer 512-dim ReLU MLP, D: DCGAN. (c) No normalization in either G or D. (d) Gated multiplicative nonlinearities everywhere in G and D. (e) Tanh nonlinearities everywhere in G and D. (f) 101-layer ResNet G and D.
In this section, we evaluate the stability of our proposed LSGANs and compare with two baselines including DCGANs and WGANs-GP. Gradient penalty has been proved to be effective for improving the stability of GANs training [15, 16]. We find that gradient penalty is also able to improve the stability of LSGANs. But it also has some obvious disadvantages such as additional computational cost and memory cost. Thus we evaluate the stability of LSGANs in two groups. One is to compare with the model without gradient penalty, DCGANs. The other one is to compare with the model with gradient penalty, WGANs-GP.
We first compare LSGANs with DCGANs without gradient penalty. Three comparison experiments are conducted: 1) learning on a Gaussian mixture distribution; 2) learning with difficult architectures; and 3) learning on datasets with small variance.
Gaussian Mixture Distribution Learning on a Gaussian mixture distribution to evaluate stability was proposed in the literature. A model with mode-seeking behavior will generate samples around only one mode. We train LSGANs and DCGANs on a 2D mixture of eight Gaussians using a simple network architecture, where both the generator and the discriminator contain three fully-connected layers. Figure 7 shows the dynamic results of Gaussian kernel density estimation. We can see that DCGANs suffer from mode collapse during training: they only generate samples around a single valid mode of the data distribution. In contrast, LSGANs learn the Gaussian mixture distribution successfully.
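For readers who wish to reproduce a similar toy setting, the sketch below samples a 2D mixture of eight Gaussians and counts how many modes a set of samples covers. The arrangement of the modes on a circle, the radius, the standard deviation, and the coverage tolerance are all assumptions made for illustration; they are not the settings of our experiment.

```python
import math
import random


def sample_eight_gaussians(n, radius=2.0, std=0.02, seed=0):
    """Draw n points from eight equally weighted Gaussians placed on a circle."""
    rng = random.Random(seed)
    centers = [
        (radius * math.cos(2 * math.pi * k / 8), radius * math.sin(2 * math.pi * k / 8))
        for k in range(8)
    ]
    samples = []
    for _ in range(n):
        cx, cy = rng.choice(centers)
        samples.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return centers, samples


def covered_modes(centers, samples, tol=0.1):
    """Count the modes that have at least one sample within distance tol."""
    hit = set()
    for x, y in samples:
        for k, (cx, cy) in enumerate(centers):
            if math.hypot(x - cx, y - cy) < tol:
                hit.add(k)
    return len(hit)


centers, real_samples = sample_eight_gaussians(2000)
real_coverage = covered_modes(centers, real_samples)

# A collapsed generator that only produces points at a single mode.
collapsed_samples = [centers[0]] * 100
collapsed_coverage = covered_modes(centers, collapsed_samples)
```

Counting covered modes gives a simple numeric proxy for mode collapse that complements the kernel density plots.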
Difficult Architectures Another experiment is to train GANs with difficult architectures, as proposed in the literature. A model that suffers from the mode collapse problem will generate very similar images. Based on an existing network architecture, two architectures are designed to compare the stability. The first excludes batch normalization in the generator ($BN_G$ for short), and the second excludes batch normalization in both the generator and the discriminator ($BN_{GD}$ for short). As pointed out in prior work, the choice of optimizer is critical to model performance. Thus we evaluate the two architectures with two optimizers, Adam and RMSProp. In summary, we have the following four training settings: (1) $BN_G$ with Adam, (2) $BN_G$ with RMSProp, (3) $BN_{GD}$ with Adam, and (4) $BN_{GD}$ with RMSProp. We train the above models on the LSUN-bedroom dataset using LSGANs and DCGANs separately and make the following four major observations. First, for $BN_G$ with Adam, there is a chance for LSGANs to generate relatively good quality images: over repeated runs, only a fraction succeed in generating relatively good quality images. For DCGANs, however, we never observe successful learning; DCGANs suffer from a severe degree of mode collapse. The images generated by LSGANs and DCGANs are shown in Figure 8. Second, for $BN_{GD}$ with RMSProp, as Figure 8 shows, LSGANs generate higher quality images than DCGANs, which have a slight degree of mode collapse. Third, LSGANs and DCGANs have similar performance for $BN_G$ with RMSProp and $BN_{GD}$ with Adam. Specifically, for $BN_G$ with RMSProp, both LSGANs and DCGANs are able to generate relatively good images. For $BN_{GD}$ with Adam, both have a slight degree of mode collapse. Last, RMSProp is more stable than Adam for DCGANs, since DCGANs succeed in generating relatively good images for $BN_G$ with RMSProp, but fail to learn with Adam.
Datasets with Small Variance Using difficult architectures is an effective way to evaluate the stability of GANs. However, in practice, one will always select stable architectures for one's tasks. The difficulty of a practical task lies in the task itself. Motivated by this, we propose to use difficult datasets but stable architectures to evaluate the stability of GANs. We find that datasets with small variance are difficult for GANs to learn, since the discriminator can distinguish the real samples very easily for such datasets. Specifically, we construct the datasets by rendering digits using the Times-New-Roman font. Two datasets are created (available at https://github.com/xudonmao/improved_LSGAN): 1) one with random horizontal shift; and 2) one with random horizontal shift and random rotation within a fixed range of degrees. Each category contains one thousand samples for both datasets. Note that the second dataset has larger variance than the first one. Examples of the two synthetic datasets are shown in the first column of Figure 9. We use a stable architecture for digit generation and follow suggestions from the literature, where the discriminator is similar to LeNet and the generator contains three transposed convolutional layers. We train DCGANs and LSGANs on the above two datasets, and the generated images are shown in Figure 9. We have two major observations. First, LSGANs succeed in generating digits for both datasets, while DCGANs suffer from a severe mode collapse problem. Second, the quality of the images generated by LSGANs is better on the second dataset than on the first. This implies that increasing the variance of the dataset can improve the generated image quality and relieve the mode collapse problem. Based on this observation, applying data augmentation such as shifting, cropping, and rotation is an effective way to improve GAN learning.
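The notion of dataset variance used above can be illustrated with a toy computation. The sketch below represents each image by a single row of pixels, applies a random horizontal shift, and measures the mean per-pixel variance across the dataset. The one-dimensional template, the shift range, and the dataset size are simplifying assumptions for illustration; the actual datasets are rendered digit images.

```python
import random


def shift_row(row, offset):
    """Shift a 1-D pixel row horizontally by offset pixels, padding with zeros."""
    n = len(row)
    out = [0.0] * n
    for i, value in enumerate(row):
        j = i + offset
        if 0 <= j < n:
            out[j] = value
    return out


def mean_pixel_variance(rows):
    """Average, over pixel positions, of the variance across the dataset."""
    n = len(rows)
    width = len(rows[0])
    total = 0.0
    for j in range(width):
        column = [r[j] for r in rows]
        mu = sum(column) / n
        total += sum((v - mu) ** 2 for v in column) / n
    return total / width


rng = random.Random(0)
template = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]  # a stand-in for one rendered glyph

# Dataset without augmentation: every sample is identical.
identical = [template[:] for _ in range(100)]

# Dataset with a random horizontal shift of up to two pixels in either direction.
shifted = [shift_row(template, rng.randint(-2, 2)) for _ in range(100)]

variance_identical = mean_pixel_variance(identical)
variance_shifted = mean_pixel_variance(shifted)
```

Without augmentation the per-pixel variance is exactly zero, and each additional source of randomness (shift, then rotation) increases it.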
Gradient penalty has been shown to be effective for improving the stability of GAN training [15, 16]. To compare with WGANs with gradient penalty (WGANs-GP), which is the state-of-the-art GAN model in terms of stability, we adopt a gradient penalty from the literature for LSGANs, with its two hyper-parameters held fixed. For this experiment, our implementation is based on the official implementation of WGANs-GP (https://github.com/igul222/improved_wgan_training). We follow the evaluation method in WGANs-GP and train with six difficult architectures: 1) no normalization and a constant number of filters in the generator; 2) a 4-layer 512-dimension ReLU MLP generator; 3) no normalization in either the generator or the discriminator; 4) gated multiplicative nonlinearities in both the generator and the discriminator; 5) tanh nonlinearities in both the generator and the discriminator; and 6) 101-layer ResNet for both the generator and the discriminator. The results are presented in Figure 10, where the images generated by WGANs-GP are reproduced from their paper. We have the following two major observations. First, like WGANs-GP, LSGANs with gradient penalty also succeed in training for each architecture, including 101-layer ResNet. Second, LSGANs with 101-layer ResNet generate higher quality images than the other five architectures.
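To make the idea of a gradient penalty concrete, the sketch below computes a penalty of the generic form $\lambda\,(\lVert \nabla_x D(x) \rVert_2 - k)^2$ averaged over a set of points. This generic form is an assumption, as are the toy discriminator, the finite-difference gradient, and the values `lam=10.0` and `target=1.0`; none of these are the settings used in our experiments, and a real implementation would obtain the gradient by automatic differentiation.

```python
import math


def toy_discriminator(x, w=(0.6, -0.8), bias=0.1):
    """Hypothetical scalar discriminator: tanh of a linear score of a 2-D input."""
    return math.tanh(w[0] * x[0] + w[1] * x[1] + bias)


def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def gradient_penalty(f, points, lam=10.0, target=1.0):
    """lam * mean over points of (||grad f(x)||_2 - target)^2."""
    total = 0.0
    for x in points:
        g = numerical_grad(f, x)
        norm = math.sqrt(sum(v * v for v in g))
        total += (norm - target) ** 2
    return lam * total / len(points)


points = [[0.0, 0.0], [1.0, 2.0], [-0.5, 0.3]]
penalty_toy = gradient_penalty(toy_discriminator, points)

# A linear function with unit-norm weights has gradient norm 1 everywhere,
# so its penalty with target=1.0 is (numerically) zero.
penalty_linear = gradient_penalty(lambda x: 0.6 * x[0] - 0.8 * x[1], points)
```

In training, such a term would be added to the discriminator loss; the extra gradient computation is the source of the additional computational and memory cost.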
Based on the above experiments, we have the following suggestions for practical tasks. First, we suggest using LSGANs without gradient penalty if it works, because gradient penalty introduces additional computational and memory cost and may affect the image quality. Second, we observe that the quality of the images generated by LSGANs may shift between good and bad during the training process. Thus we suggest keeping a record of generated images every hundred or thousand iterations and selecting the model manually by checking the image quality. Third, if LSGANs without gradient penalty do not work, we suggest adding the gradient penalty and setting the hyper-parameters according to the suggestions in the literature. In our experiments, we find that a single hyper-parameter setting works for all the tasks.
In this paper, we have proposed the Least Squares Generative Adversarial Networks (LSGANs). The experimental results show that LSGANs generate higher quality images than regular GANs. Three comparison experiments for evaluating the stability are also conducted, and the results demonstrate that LSGANs are more stable than regular GANs. We also compare the stability of LSGANs with gradient penalty and WGANs-GP. We train LSGANs with gradient penalty on LSUN-bedroom using six difficult architectures, and they train successfully for all six. Based on the present findings, we hope to extend LSGANs to more complex datasets such as ImageNet in the future. Instead of pulling the generated samples toward the decision boundary, designing a method to pull the generated samples toward the real data directly is also worth further investigation.
G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
R. Salakhutdinov and G. Hinton, “Deep Boltzmann machines,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, vol. 5, 2009, pp. 448–455.
Proceedings of The 33rd International Conference on Machine Learning (ICML), 2016.