Improving GAN with neighbors embedding and gradient matching

11/04/2018 · by Ngoc-Trung Tran, et al.

We propose two new techniques for training Generative Adversarial Networks (GANs). Our objectives are to alleviate mode collapse in GAN and improve the quality of the generated samples. First, we propose a neighbors embedding constraint, a manifold learning-based regularization that explicitly retains local structures of latent samples in the generated samples. This prevents the generator from producing nearly identical data samples from different latent samples, and reduces mode collapse. We propose an inverse t-SNE regularizer to achieve this. Second, we propose a new technique, gradient matching, to align the distributions of the generated samples and the real samples. As it is challenging to work with high-dimensional sample distributions, we propose to align these distributions through the scalar discriminator scores. We constrain the difference between the discriminator scores of the real samples and the generated ones, and we further constrain the difference between the gradients of these discriminator scores. We derive these constraints from Taylor approximations of the discriminator function. We perform experiments to demonstrate that our proposed techniques are computationally simple and easy to incorporate into existing systems. When gradient matching and neighbors embedding are applied together, our GN-GAN achieves outstanding results on 1D/2D synthetic, CIFAR-10 and STL-10 datasets, e.g. an FID score of 30.80 on the STL-10 dataset. Our code is available at: https://github.com/tntrung/gan


Introduction

Generative Adversarial Networks (GANs) [Goodfellow et al.2014, Goodfellow2016] are popular methods for training generative models. GAN training is a two-player minimax game between the discriminator and the generator: while the discriminator learns to distinguish between real and generated (fake) samples, the generator creates samples to fool the discriminator into accepting them as “real”. This is an attractive approach; however, stabilizing GAN training is still an ongoing and important research problem.

Mode collapse is one of the most challenging issues when training GANs. Many advanced GANs have been proposed to improve training stability [Nowozin, Cseke, and Tomioka2016, Arjovsky, Chintala, and Bottou2017, Gulrajani et al.2017]. However, mode collapse remains an issue.

In this work, we propose two techniques to improve GAN training. First, inspired by t-distributed stochastic neighbor embedding (t-SNE) [Maaten and Hinton2008], a well-known dimensionality reduction method, we propose an inverse t-SNE regularizer to reduce mode collapse. Specifically, while t-SNE aims to preserve the structure of the high-dimensional data samples in the reduced-dimensional manifold of latent samples, we reverse the procedure of t-SNE to explicitly retain local structures of latent samples in the high-dimensional generated samples. This prevents the generator from producing nearly identical data samples from different latent samples, and reduces mode collapse. Second, we propose a new objective function for the generator that aligns the real and generated sample distributions, in order to generate realistic samples. We achieve the alignment by minimizing the difference between the discriminator scores of the real samples and the generated ones. By using the discriminator and its scores, we avoid working with the high-dimensional data distribution directly. We further constrain the difference between the gradients of the discriminator scores. We derive these constraints from Taylor approximations of the discriminator function. Our principled approach is significantly different from the standard GAN [Goodfellow et al.2014]: our generator does not attempt to directly fool the discriminator; instead, our generator produces fake samples that have similar discriminator scores to the real samples. We found that with this technique the distribution of the generated samples approximates that of the real samples well, and the generator can produce more realistic samples.

Related Works

Addressing the issues of GANs [Goodfellow2016], including gradient vanishing and mode collapse, is an important research topic. A popular direction is to improve the discriminator objective. The discriminator can be formed via f-divergences [Nowozin, Cseke, and Tomioka2016] or distance metrics [Arjovsky and Bottou2017, Bellemare et al.2017], and the generator is trained by fooling the discriminator via the zero-sum game. Many methods in this direction have to regularize their discriminators; otherwise, they would suffer instability issues, as the discriminator often converges much faster than the generator. Some regularization techniques are weight clipping [Arjovsky and Bottou2017], gradient penalty constraints [Gulrajani et al.2017, Roth et al.2017, Kodali et al.2017, Petzka, Fischer, and Lukovnicov2017, Liu2018], consensus constraints [Mescheder, Nowozin, and Geiger2017, Mescheder, Geiger, and Nowozin2018], and spectral normalization [Miyato et al.2018]. However, over-constraining the discriminator may cause cycling issues [Nagarajan and Kolter2017, Mescheder, Geiger, and Nowozin2018].

Issues of GAN can also be tackled via optimizer regularization: changing the optimization process [Metz et al.2017], using two time-scale update rules for better convergence [Heusel et al.2017], or averaging network parameters [Yazıcı et al.2018].

Regularizing the generator is another direction: i) it can be achieved by modifying the generator objective function with feature matching [Salimans et al.2016] or the discriminator-score distance [Tran, Bui, and Cheung2018]; ii) or by using Auto-Encoders (AE) or latent codes to regularize the generator. AAE [Makhzani et al.2016] uses an AE to constrain the generator; the goal is to match the encoded latent distribution to some given prior distribution via a minimax game. The problem of AAE is that pixel-wise reconstruction with the ℓ2-norm causes blurry results, and the minimax game on the latent samples has the same problems (e.g., mode collapse) as on the data samples, because the AE alone is not powerful enough to overcome these issues. VAE/GAN [Larsen et al.2015] combines VAE and GAN into a single model and uses a feature-wise distance for the reconstruction to avoid blur; the generator is regularized by the VAE model to reduce mode collapse. Nevertheless, VAE/GAN has limitations similar to VAE [Kingma and Welling2013], including the re-parameterization trick for back-propagation and the requirement of access to an exact functional form of the prior distribution. ALI [Dumoulin et al.2016] and BiGAN [Donahue, Krähenbühl, and Darrell2016] jointly train the data/latent samples in the GAN framework; these methods learn the AE model implicitly after training. MDGAN [Che et al.2016] requires two discriminators for two separate steps: manifold and diffusion. The manifold step aims to learn a good AE; the diffusion step is similar to the original GAN, except that the reconstructed samples are used as real samples instead. InfoGAN [Chen et al.2016] learns disentangled representations by maximizing the mutual information of inducing latent codes. MMGAN [Park et al.2018] makes the strong assumption that the manifolds of real and fake samples are spheres: it first aligns real and fake sample statistics by matching the two manifold spheres (centre and radius), and then applies a correlation matrix to reduce mode collapse. Dist-GAN [Tran, Bui, and Cheung2018] constrains the generator by a regularized auto-encoder; furthermore, the authors use the reconstructed samples to regularize the convergence of the discriminator.

Auto-encoders can also be used in the discriminator objective. EBGAN [Zhao, Mathieu, and LeCun2017] introduces an energy-based model in which the discriminator is considered as an energy function minimized via reconstruction errors. BEGAN [Berthelot, Schumm, and Metz2017] extends EBGAN by optimizing the Wasserstein distance between the AE loss distributions.

Proposed method

Our proposed system with gradient matching (GM) and neighbors embedding (NE) constraints, namely GN-GAN, consists of three main components: the auto-encoder, the discriminator and the generator. In our model, we first train the auto-encoder, then the discriminator, and finally the generator, as presented in Algorithm 1.

1:  Initialize the discriminator D, the encoder E and the generator G. N_iter is the number of training iterations.
2:  repeat
3:      x ← random mini-batch of data points from the dataset.
4:      z ← random samples from the noise distribution P_z.
5:      // Train the auto-encoder (E, G) using x and z by Eqn. 1
6:      Update E and G according to Eq. 1.
7:      // Train the discriminator according to Eqn. 7 on x, z
8:      Update D according to Eq. 7.
9:      // Train the generator on x, z according to Eqn. 13.
10:     Update G according to Eq. 13.
11:  until N_iter iterations are reached.
12:  return D, E, G
Algorithm 1 Our GN-GAN model
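For concreteness, the following is a minimal PyTorch-style sketch of the three alternating updates in Algorithm 1 (the reference implementation linked above is in TensorFlow). The loss callables ae_loss, d_loss and g_loss are placeholders standing in for Eqs. 1, 7 and 13, and the helper name train_gn_gan is ours, not from the released code.

```python
import torch

def train_gn_gan(E, G, D, data_loader, noise_dim, n_iters,
                 ae_loss, d_loss, g_loss, lr=2e-4, device="cpu"):
    """Sketch of Algorithm 1: alternately update the auto-encoder (E, G),
    the discriminator D, and the generator G. The three loss callables
    stand in for Eqs. 1, 7 and 13 respectively. The loader is assumed to
    yield batches of data samples."""
    opt_ae = torch.optim.Adam(list(E.parameters()) + list(G.parameters()), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)

    data_iter = iter(data_loader)
    for _ in range(n_iters):
        try:
            x = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x = next(data_iter)
        x = x.to(device)
        z = torch.randn(x.size(0), noise_dim, device=device)  # prior samples

        # 1) Train the auto-encoder (Eq. 1: reconstruction + NE regularizer).
        opt_ae.zero_grad()
        ae_loss(E, G, x, z).backward()
        opt_ae.step()

        # 2) Train the discriminator (Eq. 7) on real, reconstructed and fake samples.
        opt_d.zero_grad()
        d_loss(D, E, G, x, z).backward()
        opt_d.step()

        # 3) Train the generator (Eq. 13: gradient-matching objective).
        opt_g.zero_grad()
        g_loss(D, G, x, z).backward()
        opt_g.step()

    return E, G, D
```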

Neighbors embedding constraint for Auto-encoder

We use an auto-encoder (AE) in our model for two reasons: i) to prevent the generator from being severely collapsed, and ii) to regularize the generator to produce samples that resemble real ones. However, using an AE alone is not adequate to avoid mode collapse, especially for high-dimensional data. Therefore, we propose an additional regularization as in Eq. 1:

L_AE(E, G) = ||x − G(E(x))||² + λ_r · L_r(E, G)    (1)

Eq. 1 is the objective of our regularized AE. The first term is the reconstruction error of a conventional AE. The second term, L_r, is our proposed neighbors embedding constraint, to be discussed below. Here, G is the GAN generator (the decoder of the AE), E is the encoder, and λ_r is a constant weight.

Figure 1: Illustration of the neighbor-embedding (NE) constraint. NE regularizes the generator to produce high-dimensional data samples such that latent sample distance and data sample distance are consistent.

Mode collapse is a failure case of GAN in which the generator often generates similar samples, so that the diversity of the generated samples is small compared with that of the original dataset. As discussed in previous work (e.g. [Tran, Bui, and Cheung2018]), with mode collapse, the generator would map two far-apart latent samples to nearby data points in the high-dimensional data space with high probability. This observation motivates our idea to constrain the distances between generated data points in order to alleviate mode collapse. In particular, the data point distances and the corresponding latent sample distances should be consistent.

The motivation of our neighbors-embedding constraint is to constrain the relative distances among data points and their corresponding latent points within the data and latent manifolds respectively (Fig. 1). In our model, we apply the probabilistic relative distance (PRDist) of t-SNE [Maaten and Hinton2008], which takes into account the distributions of the latent sample structure and the data sample structure. t-SNE has been shown to preserve both the local structure of the data space (the relation inside each cluster) and the global structure (the relation between pairs of clusters). Notably, our method applies PRDist in the reverse direction of t-SNE for a different purpose: while t-SNE aims to preserve significant structures of the high-dimensional data in the reduced-dimensional samples, we aim to preserve the structures of the low-dimensional latent samples in their high-dimensional mappings via the generator. Specifically, the objective is as shown in Eq. 2:

L_r(E, G) = KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (2)

The probability distribution P = {p_ij} of the latent structure is a joint, symmetric distribution, computed as below:

p_ij = (p_{i|j} + p_{j|i}) / 2n    (3)

p_{i|j} and p_{j|i} are the conditional probabilities, whose center points are z_j and z_i respectively. Here, i and j are the indices of the i-th and j-th samples in a mini-batch of training data, and z_i and z_j are the corresponding i-th and j-th latent samples. n is the number of samples in the mini-batch. The conditional probability p_{j|i} is given by:

p_{j|i} = exp(−||z_j − z_i||² / 2σ_z²) / Σ_{k≠i} exp(−||z_k − z_i||² / 2σ_z²)    (4)

where σ_z² is the variance of all pairwise distances in a mini-batch of latent samples. Similar to the t-SNE method, the symmetric joint distribution p_ij is used to prevent the problem of outliers in high-dimensional space.

Similarly, the probability distribution Q = {q_ij} of the data sample structure is the joint, symmetric distribution computed from two conditional probabilities as below:

q_ij = (q_{i|j} + q_{j|i}) / 2n    (5)

where q_{j|i} is the conditional probability of the pairwise distance between the generated sample G(z_j) and the center point G(z_i), computed as follows:

q_{j|i} = exp(−||G(z_j) − G(z_i)||² / 2σ_x²) / Σ_{k≠i} exp(−||G(z_k) − G(z_i)||² / 2σ_x²)    (6)

σ_x² is the variance of all pairwise distances of data samples in the mini-batch. The regularization term L_r is the dissimilarity between the two joint distributions P and Q, where each distribution represents a neighbor distance distribution. Similar to t-SNE, we set the values of p_ii and q_ii to zero. The dissimilarity is the Kullback-Leibler (KL) divergence, as in Eq. 2. The latent mini-batch is a merged set of encoded and random latent samples, and the data mini-batch is the corresponding merged set of reconstructed and generated samples. Here, the reconstructed samples and their latent samples serve as anchor points of the data and latent manifolds respectively to regularize the generation process.
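The regularizer can be computed entirely from one mini-batch. Below is a minimal PyTorch sketch of the inverse t-SNE term under the assumptions stated above (Gaussian conditionals in both spaces, with a single bandwidth per mini-batch given by the variance of all pairwise distances); the function names are ours, not from the released code.

```python
import torch
import torch.nn.functional as F

def joint_neighbor_probs(batch, eps=1e-12):
    """Symmetric joint neighbor distribution over one mini-batch (Eqs. 3-6 style).
    batch: (n, d) tensor of latent codes, or data samples flattened per sample."""
    n = batch.size(0)
    flat = batch.view(n, -1)
    dist = torch.cdist(flat, flat, p=2)                     # pairwise distances
    off_diag = ~torch.eye(n, dtype=torch.bool, device=batch.device)
    sigma2 = dist[off_diag].var() + eps                     # bandwidth: variance of all pairwise distances
    logits = -(dist ** 2) / (2.0 * sigma2)
    logits = logits.masked_fill(~off_diag, float("-inf"))   # sets p_{i|i} = 0
    cond = F.softmax(logits, dim=1)                         # conditional p_{j|i}
    return (cond + cond.t()) / (2.0 * n)                    # symmetric joint p_{ij}

def ne_regularizer(latent_batch, data_batch, eps=1e-12):
    """Inverse t-SNE regularizer (Eq. 2): KL(P || Q) between the latent neighbor
    distribution P and the generated-sample neighbor distribution Q. In training,
    latent_batch would be the merged encoded/random codes and data_batch the
    corresponding reconstructed/generated samples."""
    p = joint_neighbor_probs(latent_batch, eps)
    q = joint_neighbor_probs(data_batch, eps)
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum()
```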

Discriminator objective

(7)

Our discriminator objective is shown in Eq. 7. Our model considers the reconstructed samples as “real”, so that the gradients from the discriminator are not saturated too quickly. This constraint slows down the convergence of the discriminator, a goal similar to [Arjovsky, Chintala, and Bottou2017], [Miyato et al.2018] and [Tran, Bui, and Cheung2018]. In our method, we use a small weight for the reconstruction term in the discriminator objective. We observe that this term is important at the beginning of training. However, towards the end, especially for complex image datasets, the reconstructed samples may not be as good as the real samples, which lowers the quality of the generated images. The objective also contains a gradient penalty term, computed on samples interpolated between real and generated samples with a uniform random weight in [0, 1]; this term enforces sufficient gradients from the discriminator even when approaching convergence. Fig. 2 illustrates the gradients at convergence time.

We also apply the hinge loss, similar to [Miyato et al.2018], in place of the logarithmic terms. We empirically found that the hinge loss could also improve the quality of generated images in our model. The hinge-loss version of Eq. 7 (ignoring constants) is as follows:

(8)
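As an illustration only, the sketch below shows a generic hinge-loss discriminator objective with the ingredients described in this section: real and reconstructed samples scored as "real", generated samples as "fake", and a WGAN-GP-style gradient penalty on samples interpolated with a uniform random weight. It is not the exact form of Eq. 8, and lambda_rec and lambda_p are placeholder weights.

```python
import torch

def d_hinge_loss(D, E, G, x, z, lambda_rec=0.1, lambda_p=1.0):
    """Illustrative hinge-loss discriminator objective in the spirit of Eq. 8
    (not the exact formulation). Real and reconstructed samples are pushed to
    score high, generated samples low, with a gradient penalty on interpolated
    samples; lambda_rec and lambda_p are placeholder weights."""
    x_rec = G(E(x)).detach()     # reconstructed samples, treated as "real"
    x_fake = G(z).detach()

    loss_real = torch.relu(1.0 - D(x)).mean()
    loss_rec = torch.relu(1.0 - D(x_rec)).mean()
    loss_fake = torch.relu(1.0 + D(x_fake)).mean()

    # WGAN-GP-style penalty on interpolations between real and generated samples,
    # keeping discriminator gradients informative near convergence.
    mu = torch.rand(x.size(0), *([1] * (x.dim() - 1)), device=x.device)
    x_hat = (mu * x + (1.0 - mu) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = ((grad.view(grad.size(0), -1).norm(2, dim=1) - 1.0) ** 2).mean()

    return loss_real + lambda_rec * loss_rec + loss_fake + lambda_p * penalty
```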

Generator objective with gradient matching

In this work, we propose to train the generator by aligning the distributions of generated samples and real samples. However, it is challenging to work with high-dimensional sample distributions directly. We propose to overcome this issue in GAN by using the scalar discriminator scores. In GAN, the discriminator differentiates real and fake samples; thus, the discriminator score can be viewed as the probability that a sample is drawn from the real data distribution. Although the exact form of the data distribution is unknown, the scores at some data points (from the training data) can be computed via the discriminator network. Therefore, we align the two distributions by minimizing the difference between the discriminator scores of real and generated samples. In addition, we constrain the gradients of these discriminator scores. These constraints can be derived from Taylor approximations of the discriminator function as follows.

Assume that the first derivative of the discriminator D exists, and that the training set has data samples x drawn from the real data distribution. For a sample point x, we can approximate D(x) by a first-order Taylor expansion (TE) at a data point v:

D(x) = D(v) + ∇_x D(v)ᵀ (x − v) + ε_1    (9)

Here ε_1 is the TE approximation error. Alternatively, we can approximate D(x) with a TE at a generated sample G(z):

D(x) = D(G(z)) + ∇_x D(G(z))ᵀ (x − G(z)) + ε_2    (10)

Our goal is to enforce the distribution of generated samples G(z) to be similar to that of the real samples x. For a given x, its discriminator score D(x) can be approximated by a first-order TE at v with error ε_1, or at G(z) with error ε_2. In other words, ε_1 is the approximation error of D(x) with a first-order TE at the point v, and ε_2 is the approximation error of D(x) with a first-order TE at the point G(z). If v and G(z) were drawn from the same distribution, the expected errors would match. Therefore, we propose to enforce E[ε_1] = E[ε_2] when training the generator. Note that the difference ε_1 − ε_2 does not involve the unknown value D(x) itself, because D(x) is common to both expansions and is independent of v and G(z). Enforcing E[ε_1] = E[ε_2] thus aligns the generated sample distribution to the real sample distribution. From Eq. 9, we have:

E[ε_1] = E_x[D(x)] − E_v[D(v)] − E_v[∇_x D(v)]ᵀ E_x[x] + E_v[∇_x D(v)ᵀ v]    (11)

From Eq. 10, we have:

E[ε_2] = E_x[D(x)] − E_z[D(G(z))] − E_z[∇_x D(G(z))]ᵀ E_x[x] + E_z[∇_x D(G(z))ᵀ G(z)]    (12)

To equate Eqs. 11 and 12, we enforce equality of corresponding terms. This leads to minimization of the following objective function for the generator:

min_G L_G = |E_x[D(x)] − E_z[D(G(z))]| + ||E_x[∇_x D(x)] − E_z[∇_x D(G(z))]||² + ||E_x[∇_x D(x)ᵀ x] − E_z[∇_x D(G(z))ᵀ G(z)]||²    (13)

Here, we use the ℓ1-norm for the first term of the generator objective and the ℓ2-norm for the two last terms. Empirically, we observe that using the ℓ1-norm for the score term is more stable than using the ℓ2-norm. In practice, we also found that a slightly modified implementation of the expectations in the second and third terms of Eq. 13 gives more stable training. Note that this proposed objective can be used in other GAN models. Note also that a recent work [Tran, Bui, and Cheung2018] has also used the discriminator score as a constraint; however, our motivation and formulation are significantly different. In the experiments, we show improved performance compared to [Tran, Bui, and Cheung2018].
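A minimal PyTorch sketch of the gradient-matching objective in Eq. 13 is given below, with mini-batch means standing in for the expectations; the helper name is ours, and the released TensorFlow code should be consulted for the exact implementation.

```python
import torch

def gm_generator_loss(D, G, x_real, z):
    """Sketch of the gradient-matching generator objective (Eq. 13): match mean
    discriminator scores (l1-norm), mean score gradients (l2-norm) and mean
    gradient-sample inner products (l2-norm) between real and generated batches."""
    # Real branch: all quantities are constants w.r.t. the generator.
    x_r = x_real.detach().requires_grad_(True)
    d_real = D(x_r)
    grad_real = torch.autograd.grad(d_real.sum(), x_r)[0].detach()
    score_real = d_real.detach().mean()
    flat_r = grad_real.flatten(1)
    ip_real = (flat_r * x_r.detach().flatten(1)).sum(1).mean()

    # Fake branch: keep the graph so the loss can backpropagate into G.
    x_fake = G(z)
    d_fake = D(x_fake)
    grad_fake = torch.autograd.grad(d_fake.sum(), x_fake, create_graph=True)[0]
    flat_f = grad_fake.flatten(1)
    score_fake = d_fake.mean()
    ip_fake = (flat_f * x_fake.flatten(1)).sum(1).mean()

    t1 = (score_real - score_fake).abs()                 # score difference (l1)
    t2 = (flat_r.mean(0) - flat_f.mean(0)).pow(2).sum()  # gradient difference (l2)
    t3 = (ip_real - ip_fake).pow(2)                      # grad^T x difference (l2)
    return t1 + t2 + t3
```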

Experimental Results

Synthetic 1D dataset

Figure 2: We compare our method and Dist-GAN (Tran, Bui and Cheung 2018) on the 1D synthetic dataset of three Gaussian modes. The figures are the last frames of the demo videos (which can be found here: https://github.com/tntrung/gan). The blue curve is the discriminator scores, and the green and orange modes are the training data and the generated data respectively.

For the 1D synthetic dataset, we compare our model to Dist-GAN [Tran, Bui, and Cheung2018], a recent state-of-the-art GAN. We use the published code (https://github.com/tntrung/gan) for this 1D experiment. Here, we construct the 1D synthetic data with 3 Gaussian modes (green) as shown in Fig. 2. It is more challenging than the one-mode demo by Dist-GAN.

We use small networks for both methods. Specifically, we create the encoder and generator networks with three fully-connected layers and the discriminator network with two fully-connected layers. We use ReLU for the hidden layers and sigmoid for the output layer of the discriminator. The discriminator is smaller than the generator to make the training more challenging. The number of neurons in each hidden layer is 4, and the learning rate is 0.001 for both methods.

Fig. 2 shows that our model can recover all three modes well, while Dist-GAN cannot (see the attached video demos in the supplementary material). Although both methods have good gradients of the discriminator scores (decision boundary) for the middle mode, it is difficult for Dist-GAN to recover this mode, as the gradients computed over generated samples are not explicitly forced to resemble those of real samples as in our proposed method. Note that for this 1D experiment and the 2D experiment in the next section, we only evaluate our model with gradient matching (+GM), since we find that our new generator objective with gradient matching alone is already good enough; neighbors embedding is more useful for high-dimensional data samples, as will be discussed.

Synthetic 2D dataset

Figure 3: The number of covered modes (classes) and registered points of the compared methods during training.
Figure 4: Our 2D synthetic data has 25 Gaussian modes (red dots). The black arrows are gradient vectors of the discriminator computed around the ground-truth modes. The figures from left to right are examples of the gradient maps of GAN, WGAN-GP, Dist-GAN and ours.

For the 2D synthetic data, we follow the experimental setup of [Tran, Bui, and Cheung2018] on the same 2D synthetic dataset. The dataset has 25 Gaussian modes in a grid layout (red points in Fig. 4) and contains 50K training points. We draw 2K generated samples for evaluating the generator. However, the performance reported in [Tran, Bui, and Cheung2018] is nearly saturated: for example, it can recover all 25 modes and register more than 90% of the total number of points. It is therefore hard to see a significant improvement of our method in that setting. We thus decrease the number of hidden layers and the number of neurons per layer to make the networks more challenging to train. For a fair comparison, we use equivalent encoder, generator and discriminator networks for all compared methods.

Network d_in d_out N_h d_h
Encoder (E) 2 2 2 64
Generator (G) 2 2 2 64
Discriminator (D) 2 1 2 64
Table 1: Network structures for the 2D synthetic dataset in our experiments.

The details of the network architectures are presented in Table 1. d_in, d_out and d_h are the dimensions of the input, output and hidden layers respectively, and N_h is the number of hidden layers. We use ReLU for hidden layers and sigmoid for output layers. To have a fair comparison, we carefully fine-tune the other methods to ensure that they perform their best on the synthetic data. For evaluation, a mode is considered missed if fewer than 20 generated samples are registered to it; registration is measured with respect to the mode's mean and its variance of 0.01. A method suffers mode collapse if there are missing modes. For this experiment, the prior distribution is a 2D uniform distribution. We use the Adam optimizer with learning rate lr = 0.001. The learning rate is decayed every 10K steps; this schedule avoids the learning rate saturating too quickly, which would be unfair to slow-converging methods. The mini-batch size is 128. Training stops after 500 epochs.
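For reference, the following NumPy sketch shows one way to count registered modes and points under the evaluation protocol described above; the registration radius of k_sigma standard deviations is an assumption, as the exact threshold follows the referenced setup.

```python
import numpy as np

def count_registered(samples, modes, var=0.01, k_sigma=3.0, min_per_mode=20):
    """Assign each generated sample to its nearest ground-truth mode and count it
    as registered if it lies within k_sigma standard deviations (sqrt(var)) of
    that mode. A mode is covered if at least min_per_mode samples register to it."""
    samples = np.asarray(samples)                    # (m, 2) generated samples
    modes = np.asarray(modes)                        # (25, 2) grid of mode centres
    d = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    registered = d[np.arange(len(samples)), nearest] <= k_sigma * np.sqrt(var)
    per_mode = np.bincount(nearest[registered], minlength=len(modes))
    n_registered_points = int(registered.sum())
    n_registered_modes = int((per_mode >= min_per_mode).sum())
    return n_registered_modes, n_registered_points
```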

In this experiment, we compare our model to several state-of-the-art methods. ALI [Dumoulin et al.2016], VAE-GAN [Larsen et al.2015] and Dist-GAN [Tran, Bui, and Cheung2018] are recent works using an encoder/decoder in their models. WGAN-GP [Gulrajani et al.2017] is one of the state-of-the-art GANs. We also compare to VAE-based methods: VAE [Kingma and Welling2014] and β-VAE [Higgins et al.2016]. The numbers of covered (registered) modes and registered points during training are presented in Fig. 3, and the quantitative numbers at the last epoch are in Table 2. In this table, we also report the Total Variation scores to measure the mode balance [Tran, Bui, and Cheung2018]. The result for each method is the average of eight runs. Our method outperforms all others on the number of covered modes. Although WGAN-GP and Dist-GAN are stable with larger networks and the original experimental setup [Tran, Bui, and Cheung2018], they are less stable with our network architecture, miss many modes and sometimes diverge. VAE-based methods usually address mode collapse well, but in our experimental setup the small networks may affect the reconstruction quality, which consequently reduces their performance. Our method does not suffer serious mode collapse in any of the eight runs. Furthermore, we achieve a higher number of registered samples than all others. Our method is also better than the rest on Total Variation (TV).

Method #registered modes #registered points TV (True) TV (Differential)
GAN [Goodfellow et al.2014] 14.25 ± 2.49 1013.38 ± 171.73 1.00 ± 0.00 0.90 ± 0.22
ALI [Dumoulin et al.2016] 17.81 ± 1.80 1281.43 ± 117.84 0.99 ± 0.01 0.72 ± 0.19
VAEGAN [Larsen et al.2015] 12.75 ± 3.20 1042.38 ± 170.17 1.35 ± 0.70 1.34 ± 0.98
VAE [Kingma and Welling2014] 13.48 ± 2.31 1265.45 ± 72.47 1.81 ± 0.71 2.16 ± 0.72
β-VAE [Higgins et al.2016] 18.00 ± 2.33 1321.17 ± 95.61 1.17 ± 0.24 1.47 ± 0.28
WGAN-GP [Gulrajani et al.2017] 21.71 ± 1.35 1180.25 ± 158.63 0.90 ± 0.07 0.51 ± 0.06
Dist-GAN [Tran, Bui, and Cheung2018] 20.71 ± 4.42 1188.62 ± 311.91 0.82 ± 0.19 0.43 ± 0.12
Ours 24.39 ± 0.44 1461.83 ± 222.86 0.57 ± 0.17 0.31 ± 0.12
Table 2: Results on the 2D synthetic data. Columns indicate the number of covered modes, the number of registered samples among 2000 generated samples, and two types of Total Variation (TV). We compare our model to state-of-the-art models, including WGAN-GP and Dist-GAN.

In addition, we follow [Thanh-Tung, Tran, and Venkatesh2018] to explore the gradient maps of the discriminator scores of the compared methods: standard GAN, WGAN-GP, Dist-GAN and ours, as shown in Fig. 4. This map is important because it shows the gradient directions that pull the generated samples towards the real samples (red points). The gradient map of standard GAN is noisy, uncontrolled and vanishes in many regions. The gradient map of WGAN-GP has more meaningful directions than GAN: its gradients concentrate around the centroids (red points) of the training data for most of the modes. However, WGAN-GP still has some modes where the gradients do not point towards the ground-truth centroids. Both Dist-GAN and our method show better gradients than WGAN-GP. The gradients of our method are the most informative for the generator, as they point towards all of the ground-truth modes.
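Such a gradient map can be produced by querying the discriminator on a regular grid. The following is a small PyTorch sketch, assuming a discriminator D that takes 2D points; plotting (e.g. with a quiver plot) is omitted.

```python
import numpy as np
import torch

def gradient_map(D, xlim=(-3.0, 3.0), ylim=(-3.0, 3.0), n=20, device="cpu"):
    """Evaluate the gradient of the discriminator score on a regular 2D grid
    (Fig. 4 style). Returns the grid points and the gradient vectors, which
    show where generated samples would be pulled."""
    xs = np.linspace(xlim[0], xlim[1], n)
    ys = np.linspace(ylim[0], ylim[1], n)
    grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
    pts = torch.tensor(grid, dtype=torch.float32, device=device, requires_grad=True)
    scores = D(pts)
    grads = torch.autograd.grad(scores.sum(), pts)[0]
    return grid, grads.detach().cpu().numpy()
```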

CIFAR-10 and STL-10 datasets

Figure 5: FID scores of our method compared to Dist-GAN.

For the CIFAR-10 and STL-10 datasets, we measure performance with the FID score [Heusel et al.2017]. FID can detect intra-class mode dropping and measures both the diversity and the quality of generated samples. We follow the experimental procedure and model architecture of [Miyato et al.2018] to compare methods. FID is computed from 10K real samples and 5K generated samples. Our default parameters are used for all experiments. The Adam optimizer is used with learning rate lr = 0.0002. The generator is trained with 350K updates for the logarithmic loss version (Eq. 7) and 200K updates for the hinge loss version (Eq. 8), which converges faster. The dimension of the prior input is 128. All our experiments are conducted in the unsupervised setting.
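FID itself is the standard metric of Heusel et al. (2017); the NumPy/SciPy sketch below illustrates the computation from pre-extracted Inception features and is not the evaluation code used in this work.

```python
import numpy as np
from scipy import linalg

def fid_from_features(feat_real, feat_gen, eps=1e-6):
    """Frechet Inception Distance between Gaussians fitted to real and generated
    feature sets. feat_real, feat_gen: (n, d) arrays of pooled Inception activations."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    # Matrix square root of the covariance product; eps stabilizes near-singular cases.
    covmean, _ = linalg.sqrtm(cov_r.dot(cov_g) + eps * np.eye(cov_r.shape[0]), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff.dot(diff) + np.trace(cov_r + cov_g - 2.0 * covmean))
```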

Method CIFAR STL CIFAR (R)
GAN-GP 37.7 - -
WGAN-GP 40.2 55.1 -
SN-GAN 25.5 43.2 21.7 ± .21
Dist-GAN 22.95 36.19 -
Ours 21.70 30.80 16.47 ± .28
Table 3: FID scores compared to the state of the art (smaller is better), for methods with the CNN and ResNet (R) architectures. The FID scores of SN-GAN, Dist-GAN and our method are reported with the hinge loss. Results of the compared methods are from [Miyato et al.2018, Tran, Bui, and Cheung2018].
Figure 6: Generated samples of our method. The first two panels show CIFAR-10 samples with the CNN and ResNet architectures, and the last one shows STL-10 samples with the CNN architecture.

In the first experiment, we conduct an ablation study of our newly proposed techniques to understand the contribution of each component to the model. The experiments use the standard CNN architecture [Miyato et al.2018] on the CIFAR-10 dataset with the logarithmic version of the discriminator objective (Eq. 7). Our original model is similar to the Dist-GAN model, with some modifications, such as using lower weights for the reconstruction constraint, which we find improves FID scores. We consider Dist-GAN as the baseline for this comparison. FID is computed every 10K iterations and shown in Fig. 5. Our original model converges a little slowly at the beginning, but in the end it achieves a better FID score than the Dist-GAN model. Once we add each proposed technique separately, either neighbors embedding (+NE) or gradient matching (+GM), to our original model, it converges faster and reaches a better FID score than the original one. Combining the two proposed techniques further speeds up convergence and reaches a better FID score than the other versions. This experiment demonstrates that our proposed techniques can improve the diversity of generated samples. Note that in Fig. 5, Dist-GAN and our original model are compared using only discriminator scores; with GM, our model converges faster and achieves better FID scores.

We compare our best setting (NE + GM) with the hinge loss version (Eq. 8) against other methods. Results are shown in Table 3. The FID scores of SN-GAN and Dist-GAN are also obtained with the hinge loss. We also report our performance with the ResNet (R) architecture [Miyato et al.2018] for the CIFAR-10 dataset. For both the standard CNN and ResNet architectures, our model outperforms the other state-of-the-art methods in FID score, with especially large margins on the STL-10 dataset with the CNN and on the CIFAR-10 dataset with ResNet. For the STL-10 dataset and the ResNet architecture, the generator is trained with 200K iterations to reduce training time; training longer does not significantly improve the FID score. Fig. 6 shows generated samples of our method trained on the CIFAR-10 and STL-10 datasets.

Our proposed techniques are not only usable in our model but can also be applied to other GAN models. We demonstrate this by applying them to the standard GAN [Goodfellow et al.2014]. This experiment is conducted on the CIFAR-10 dataset using the same CNN architecture as [Miyato et al.2018]. We regularize the generator of GAN by our proposed neighbors embedding and gradient matching, separately and in combination, replacing the original generator objective of GAN. When applied separately, each of NE and GM by itself significantly improves FID, as shown in Fig. 7. Moreover, GM+NE achieves an FID of 26.05 (last iteration), a significant improvement compared to GM alone with an FID of 31.50 and NE alone with an FID of 38.10. It is interesting that GM can also reduce mode collapse; we leave further investigation of this to future work. Although both can handle mode collapse, NE and GM are very different ideas: NE is a manifold-learning-based regularization that explicitly prevents mode collapse, while GM aligns the distributions of generated and real samples. The results (Figs. 5 and 7) show that GM+NE leads to better convergence and FID scores than the individual techniques.

Figure 7: FID scores of GAN when applying our proposed techniques to the generator, with a zoomed-in view on the right.

To examine the computational cost of gradient matching in our proposed generator objective, we measure the training time for one mini-batch (size 64) with and without GM (computer: Intel Xeon octa-core CPU E5-1260 3.7GHz, 64GB RAM, Nvidia 1080Ti GPU) using the CNN architecture for CIFAR-10. It takes about 53ms and 43ms to train the generator for one mini-batch with and without the GM term respectively. For 300K iterations (one mini-batch per iteration), training with GM takes about one hour more than training without GM. The difference is not significant. Note that GM involves the ℓ1- and ℓ2-norms of the differences of discriminator scores and gradients, which can be computed easily in TensorFlow.

Conclusion

We propose two new techniques to address mode collapse and improve the diversity of generated samples. First, we propose an inverse t-SNE regularizer to explicitly retain local structures of latent samples in the generated samples, which reduces mode collapse. Second, we propose a new gradient matching regularization for the generator objective, which improves convergence and the quality of generated images. We derived this gradient matching constraint from a Taylor expansion of the discriminator function. Extensive experiments demonstrate that both constraints can improve GAN. The combination of our proposed techniques leads to state-of-the-art FID scores on benchmark datasets. In future work, we will apply our model to other applications, such as person re-identification [Guo and Cheung2012] and anomaly detection [Lim et al.2018].

Acknowledgement

This work was supported by both ST Electronics and the National Research Foundation (NRF), Prime Minister's Office, Singapore, under the Corporate Laboratory @ University Scheme (Programme Title: STEE Infosec - SUTD Corporate Laboratory).

References