Introduction
Generative Adversarial Networks (GANs) [Goodfellow et al.2014, Goodfellow2016] are popular methods for training generative models. GAN training is a two-player minimax game between the discriminator and the generator. While the discriminator learns to distinguish between real and generated (fake) samples, the generator creates samples to confuse the discriminator into accepting them as “real”. This is an attractive approach, but stabilizing GAN training remains an important open research problem.
Mode collapse is one of the most challenging issues when training GANs. Many advanced GANs have been proposed to improve the stability [Nowozin, Cseke, and Tomioka2016, Arjovsky, Chintala, and Bottou2017, Gulrajani et al.2017]. However, mode collapse is still an issue.
In this work, we propose two techniques to improve GAN training. First, inspired by t-distributed stochastic neighbor embedding (t-SNE) [Maaten and Hinton2008], a well-known dimensionality reduction method, we propose an inverse t-SNE regularizer to reduce mode collapse. Specifically, while t-SNE aims to preserve the structure of high-dimensional data samples in the reduced-dimensional manifold of latent samples, we reverse the procedure of t-SNE to explicitly retain the local structure of latent samples in the high-dimensional generated samples. This prevents the generator from mapping different latent samples to nearly identical data samples, and reduces mode collapse. Second, we propose a new objective function for the generator that aligns the real and generated sample distributions, in order to generate realistic samples. We achieve the alignment by minimizing the difference between the discriminator scores of real samples and generated ones. By using the discriminator and its scalar scores, we avoid working directly with the high-dimensional data distribution. We further constrain the difference between the gradients of the discriminator scores. We derive these constraints from a Taylor approximation of the discriminator function. Our principled approach differs significantly from the standard GAN [Goodfellow et al.2014]: our generator does not attempt to directly fool the discriminator; instead, it produces fake samples whose discriminator scores are similar to those of real samples. We found that with this technique the distribution of the generated samples approximates that of the real samples well, and the generator can produce more realistic samples.

Related Works
Addressing the issues of GANs [Goodfellow2016], including gradient vanishing and mode collapse, is an important research topic. A popular direction is to improve the discriminator objective. The discriminator can be formed via f-divergences [Nowozin, Cseke, and Tomioka2016] or distance metrics [Arjovsky and Bottou2017, Bellemare et al.2017], and the generator is trained by fooling the discriminator via the zero-sum game. Many methods in this direction have to regularize their discriminators; otherwise, they would suffer instability issues, as the discriminator often converges much faster than the generator. Regularization techniques include weight clipping [Arjovsky and Bottou2017], gradient penalty constraints [Gulrajani et al.2017, Roth et al.2017, Kodali et al.2017, Petzka, Fischer, and Lukovnicov2017, Liu2018], consensus constraints [Mescheder, Nowozin, and Geiger2017, Mescheder, Geiger, and Nowozin2018], and spectral normalization [Miyato et al.2018]. However, over-constraining the discriminator may cause cycling issues [Nagarajan and Kolter2017, Mescheder, Geiger, and Nowozin2018].
GAN issues can also be tackled by regularizing the optimizer: changing the optimization process [Metz et al.2017], using two time-scale update rules for better convergence [Heusel et al.2017], or averaging network parameters [Yazıcı et al.2018].
Regularizing the generator is another direction: i) modifying the generator objective function with feature matching [Salimans et al.2016] or discriminator-score distance [Tran, Bui, and Cheung2018]; or ii) using Auto-Encoders (AE) or latent codes to regularize the generator. AAE [Makhzani et al.2016] uses an AE to constrain the generator; the goal is to match the encoded latent distribution to some given prior distribution via a minimax game. The problem with AAE is that pixel-wise reconstruction (e.g., with the ℓ2 norm) causes blurry outputs, and the minimax game on latent samples suffers the same problems (e.g., mode collapse) as on data samples, because the AE alone is not powerful enough to overcome these issues. VAE/GAN [Larsen et al.2015] combined VAE and GAN into a single model and used a feature-wise distance for the reconstruction to avoid blur. The generator is regularized by the VAE model to reduce mode collapse. Nevertheless, VAE/GAN inherits the limitations of VAE [Kingma and Welling2013], including the reparameterization trick for back-propagation and the requirement of an exact functional form of the prior distribution. ALI [Dumoulin et al.2016] and BiGAN [Donahue, Krähenbühl, and Darrell2016] jointly train data/latent samples in the GAN framework; these methods learn the AE model implicitly after training. MDGAN [Che et al.2016] requires two discriminators for two separate steps: manifold and diffusion. The manifold step learns a good AE; the diffusion step is similar to the original GAN, except that reconstructed samples are used as real samples instead. InfoGAN [Chen et al.2016] learns disentangled representations by maximizing the mutual information of induced latent codes. MMGAN [Park et al.2018] makes the strong assumption that the manifolds of real and fake samples are spheres. First, it aligns real and fake sample statistics by matching the two manifold spheres (centre and radius), and then it applies a correlation matrix to reduce mode collapse. Dist-GAN [Tran, Bui, and Cheung2018] constrains the generator by a regularized auto-encoder; furthermore, the authors use the reconstructed samples to regularize the convergence of the discriminator.
Auto-encoders can also be used in the discriminator objective. EBGAN [Zhao, Mathieu, and LeCun2017] introduces an energy-based model in which the discriminator is considered an energy function minimized via reconstruction errors. BEGAN [Berthelot, Schumm, and Metz2017] extends EBGAN by optimizing the Wasserstein distance between AE loss distributions.

Proposed method
Our proposed system with gradient matching (GM) and neighbors embedding (NE) constraints, namely GN-GAN, consists of three main components: the auto-encoder, the discriminator, and the generator. In our model, we first train the auto-encoder, then the discriminator, and finally the generator, as presented in Algorithm 1.
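As a minimal sketch (not the paper's released code), the three-step alternating update of Algorithm 1 can be organized as follows; `ae_step`, `d_step`, and `g_step` stand in for the parameter updates minimizing Eqs. 1, 7 (or 8), and 13 respectively, and all names are illustrative:

```python
# Illustrative skeleton of the GN-GAN training loop (Algorithm 1).
# `ae_step`, `d_step`, `g_step`, and `sample_prior` are hypothetical
# callables: they update the auto-encoder, discriminator, and generator,
# and draw a latent minibatch from the prior, respectively.
def train_gn_gan(data_loader, n_iters, ae_step, d_step, g_step, sample_prior):
    for it in range(n_iters):
        x = next(data_loader)   # minibatch of real samples
        z = sample_prior()      # minibatch of latent samples
        ae_step(x)              # 1) update encoder/generator on the AE objective (Eq. 1)
        d_step(x, z)            # 2) update discriminator (Eq. 7 or Eq. 8)
        g_step(x, z)            # 3) update generator with gradient matching (Eq. 13)
```

The only structural point the sketch makes is the fixed update order per iteration: auto-encoder, then discriminator, then generator.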
Neighbors embedding constraint for the auto-encoder
We use an auto-encoder (AE) in our model for two reasons: i) to prevent the generator from being severely collapsed, and ii) to regularize the generator so that it produces samples that resemble real ones. However, an AE alone is not adequate to avoid mode collapse, especially for high-dimensional data. Therefore, we propose an additional regularization as in Eq. 1:

V_{AE}(E, G) = \mathbb{E}_x \|x - G(E(x))\|^2 + \lambda_r V_R(E, G)   (1)

Eq. 1 is the objective of our regularized AE. The first term is the reconstruction error of a conventional AE. The second term V_R is our proposed neighbors embedding constraint, to be discussed below. Here, G is the GAN generator (the decoder of the AE), E is the encoder, and \lambda_r is a constant weight.
Mode collapse is a failure case of GAN in which the generator often generates similar samples: the diversity of generated samples is small compared with that of the original dataset. As discussed in previous work (e.g., [Tran, Bui, and Cheung2018]), with mode collapse, the generator maps two far-apart latent samples to nearby data points in the high-dimensional data space with high probability. This observation motivates our idea to constrain the distances between generated data points in order to alleviate mode collapse. In particular, the data point distances and the corresponding latent sample distances should be consistent.
The motivation of our neighbors-embedding constraint is to constrain the relative distances among data points and among their corresponding latent points within the data and latent manifolds, respectively (Fig. 1). In our model, we apply the probabilistic relative distance (PRDist) of t-SNE [Maaten and Hinton2008], which takes into account the distributions of the latent sample structure and the data sample structure. t-SNE has been shown to preserve both the local structure of the data space (the relations inside each cluster) and the global structure (the relations between pairs of clusters). Notably, our method applies PRDist in the reverse direction of t-SNE for a different purpose: while t-SNE aims to preserve significant structures of the high-dimensional data in the reduced-dimensional samples, we aim to preserve the structure of the low-dimensional latent samples in their high-dimensional mappings via the generator. Specifically, the objective is shown in Eq. 2:

V_R(E, G) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}   (2)

where p_{ij} and q_{ij} are the joint distributions of pairwise distances in the latent space and the data space, respectively, defined below.
The probability distribution of the latent structure, p_{ij}, is a joint, symmetric distribution, computed as below:

p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}   (3)

p_{i|j} and p_{j|i} are the conditional probabilities, whose center points are z_j and z_i respectively. Here, i and j are the indices of the i-th and j-th samples in a minibatch of training data. Accordingly, z_i and z_j are the i-th and j-th latent samples. n is the number of samples in the minibatch. The conditional probability p_{j|i} is given by:

p_{j|i} = \frac{\exp(-\|z_i - z_j\|^2 / 2\sigma_z^2)}{\sum_{k \neq i} \exp(-\|z_i - z_k\|^2 / 2\sigma_z^2)}   (4)

where \sigma_z^2 is the variance of all pairwise distances in a minibatch of latent samples. As in the t-SNE method, the symmetrized joint distribution p_{ij} prevents the problem of outliers in high-dimensional space.
Similarly, the probability distribution of the data sample structure, q_{ij}, is the joint, symmetric distribution computed from two conditional probabilities as below:

q_{ij} = \frac{q_{i|j} + q_{j|i}}{2n}   (5)

where q_{j|i} is the conditional probability of the pairwise distance between a sample x_j and the center point x_i, computed as follows:

q_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_x^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_x^2)}   (6)

\sigma_x^2 is the variance of all pairwise distances of data samples in the minibatch. The regularization term V_R is the dissimilarity between the two joint distributions p_{ij} and q_{ij}, where each distribution represents a neighbor distance distribution. As in t-SNE, we set the values of p_{i|i} and q_{i|i} to zero. The dissimilarity is the Kullback-Leibler (KL) divergence, as in Eq. 2. The latent minibatch is a merged set of encoded and random latent samples, {E(x), z}, and the data minibatch is the corresponding merged set of reconstructed and generated samples, {G(E(x)), G(z)}. Here, the reconstructed samples and their latent samples are considered anchor points of the data and latent manifolds, respectively, to regularize the generation process.
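For concreteness, the neighbors-embedding regularizer of Eqs. 2–6 can be sketched in NumPy as below. This is our illustrative reconstruction, not the authors' implementation; it assumes Gaussian conditionals on both sides and uses the variance of pairwise distances within the minibatch as σ²:

```python
import numpy as np

def joint_probabilities(S):
    """Symmetric joint distribution over pairwise distances (Eqs. 3-4 / 5-6)."""
    n = S.shape[0]
    # squared pairwise distances ||s_i - s_j||^2
    d2 = np.square(S[:, None, :] - S[None, :, :]).sum(-1)
    off = ~np.eye(n, dtype=bool)
    var = np.sqrt(d2[off]).var()            # variance of all pairwise distances
    logits = -d2 / (2.0 * var)
    np.fill_diagonal(logits, -np.inf)       # p_{i|i} = 0, as in t-SNE
    p_cond = np.exp(logits)
    p_cond /= p_cond.sum(axis=1, keepdims=True)   # conditional p_{j|i}
    return (p_cond + p_cond.T) / (2.0 * n)        # symmetric joint p_{ij}

def ne_regularizer(Z, X, eps=1e-12):
    """KL(P || Q) between latent-structure P and data-structure Q (Eq. 2)."""
    p, q = joint_probabilities(Z), joint_probabilities(X)
    off = ~np.eye(Z.shape[0], dtype=bool)
    return float(np.sum(p[off] * np.log((p[off] + eps) / (q[off] + eps))))
```

When the generator preserves the latent neighborhood structure, p and q coincide and the regularizer vanishes; collapsing far-apart latent samples onto nearby data points increases the KL term.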
Discriminator objective
V_D = -\mathbb{E}_x \log D(x) - \lambda_r \mathbb{E}_x \log D(G(E(x))) - \mathbb{E}_z \log(1 - D(G(z))) + \lambda_p \mathbb{E}_{\hat{x}} (\|\nabla_{\hat{x}} D(\hat{x})\| - 1)^2   (7)

Our discriminator objective is shown in Eq. 7. Our model considers the reconstructed samples as “real”, represented by the term \mathbb{E}_x \log D(G(E(x))), so that the gradients from the discriminator do not saturate too quickly. This constraint slows down the convergence of the discriminator, a goal similar to [Arjovsky, Chintala, and Bottou2017], [Miyato et al.2018] and [Tran, Bui, and Cheung2018]. In our method, we use a small weight \lambda_r for this term in the discriminator objective. We observe that this term is important at the beginning of training; however, towards the end, especially for complex image datasets, the reconstructed samples may not be as good as the real samples, resulting in low quality of generated images. Here, \mathbb{E} denotes the expectation, \lambda_p is a constant, and \hat{x} = \epsilon x + (1 - \epsilon) G(z), where \epsilon is a uniform random number in [0, 1]. The gradient-penalty term enforces sufficient gradients from the discriminator even when approaching convergence. Fig. 2 illustrates gradients at convergence time.
We also apply a hinge loss, similar to [Miyato et al.2018], by replacing \log D(\cdot) with \min(0, -1 + D(\cdot)) and \log(1 - D(G(z))) with \min(0, -1 - D(G(z))). We empirically found that the hinge loss can also improve the quality of generated images in our model. The hinge loss version of Eq. 7 (ignoring constants) is as follows:

V_D = -\mathbb{E}_x \min(0, -1 + D(x)) - \lambda_r \mathbb{E}_x \min(0, -1 + D(G(E(x)))) - \mathbb{E}_z \min(0, -1 - D(G(z))) + \lambda_p \mathbb{E}_{\hat{x}} (\|\nabla_{\hat{x}} D(\hat{x})\| - 1)^2   (8)
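As an illustration, the score terms of the hinge objective in Eq. 8 can be written as follows. This is a NumPy sketch: the gradient-penalty term is omitted because it requires automatic differentiation, and the default value of `lam_r` is an assumed placeholder, not a value stated in the text:

```python
import numpy as np

def d_hinge_loss(d_real, d_rec, d_fake, lam_r=0.1):
    """Hinge discriminator loss on scalar scores (Eq. 8, without the
    gradient penalty). `d_real`, `d_rec`, `d_fake` are discriminator
    scores on real, reconstructed, and generated minibatches; `lam_r`
    is the small reconstruction weight (assumed value here)."""
    loss_real = -np.minimum(0.0, -1.0 + d_real).mean()
    # reconstructed samples are treated as "real", with a small weight
    loss_rec = -lam_r * np.minimum(0.0, -1.0 + d_rec).mean()
    loss_fake = -np.minimum(0.0, -1.0 - d_fake).mean()
    return loss_real + loss_rec + loss_fake
```

Scores above +1 on real/reconstructed samples and below −1 on fakes incur zero loss, which is the usual saturation behavior of the hinge formulation.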
Generator objective with gradient matching
In this work, we propose to train the generator by aligning the distributions of generated samples and real samples. However, it is challenging to work with high-dimensional sample distributions directly. We propose to overcome this issue in GAN by using the scalar discriminator scores. In GAN, the discriminator differentiates real and fake samples; thus, the discriminator score D(x) can be viewed as the probability that sample x is drawn from the real data distribution. Although the exact form of D(x) is unknown, the scores at some data points (from the training data) can be computed via the discriminator network. Therefore, we align the distributions by minimizing the difference between the discriminator scores of real and generated samples. In addition, we constrain the gradients of these discriminator scores. These constraints can be derived from the Taylor approximation of the discriminator function as follows.
Assume that the first derivative of D exists, and that the training set has data samples {x}. For a sample point s, by first-order Taylor expansion (TE), we can approximate D(s) with a TE at a data point x:

D(s) = D(x) + \nabla_x D(x)^T (s - x) + \epsilon(s, x)   (9)

Here \epsilon(s, x) is the TE approximation error. Alternatively, we can approximate D(s) with a TE at a generated sample G(z):

D(s) = D(G(z)) + \nabla_x D(G(z))^T (s - G(z)) + \epsilon(s, G(z))   (10)
Our goal is to enforce the distribution of generated samples G(z) to be similar to that of real samples x. For a given s, its discriminator score D(s) can be approximated by a first-order TE at x with error \epsilon(s, x). Note that here we define \epsilon(s, x) to be the approximation error of D(s) with a first-order TE at point x. Likewise, \epsilon(s, G(z)) is the approximation error of D(s) with a first-order TE at point G(z). If x and G(z) were from the same distribution, then \mathbb{E}_x \epsilon(s, x) = \mathbb{E}_z \epsilon(s, G(z)). Therefore, we propose to enforce \mathbb{E}_x \epsilon(s, x) = \mathbb{E}_z \epsilon(s, G(z)) when training the generator, in order to align the generated distribution to the real sample distribution. Note that D(s) is the same in Eqs. 9 and 10, because it is a constant, independent of x and z. From Eq. 9, we have:

\mathbb{E}_x \epsilon(s, x) = D(s) - \mathbb{E}_x D(x) - \mathbb{E}_x [\nabla_x D(x)]^T s + \mathbb{E}_x [\nabla_x D(x)^T x]   (11)

From Eq. 10, we have:

\mathbb{E}_z \epsilon(s, G(z)) = D(s) - \mathbb{E}_z D(G(z)) - \mathbb{E}_z [\nabla_x D(G(z))]^T s + \mathbb{E}_z [\nabla_x D(G(z))^T G(z)]   (12)

To equate Eqs. 11 and 12, we enforce the equality of corresponding terms. This leads to the minimization of the following objective function for the generator:

V_G = \|\mathbb{E}_x D(x) - \mathbb{E}_z D(G(z))\|_1 + \|\mathbb{E}_x \nabla_x D(x) - \mathbb{E}_z \nabla_x D(G(z))\|_2 + \|\mathbb{E}_x [\nabla_x D(x)^T x] - \mathbb{E}_z [\nabla_x D(G(z))^T G(z)]\|_2   (13)
Here, we use the ℓ1 norm for the first term of the generator objective, and the ℓ2 norm for the last two terms; empirically, we observe that the ℓ1 norm is more stable than the ℓ2 norm for the first term. In practice, our method is also more stable with our implementation of the second and third terms of Eq. 13. Note that this proposed objective can be used in other GAN models. Note also that a recent work [Tran, Bui, and Cheung2018] has also used the discriminator score as a constraint; however, our motivation and formulation are significantly different. In the experiments, we show improved performance compared to [Tran, Bui, and Cheung2018].
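The three terms of Eq. 13 can be sketched as follows, assuming the discriminator scores and their input gradients have already been computed (in a real implementation they come from the framework's autodiff, e.g. TensorFlow); all names are illustrative:

```python
import numpy as np

def gm_loss(d_x, grad_x, x, d_gz, grad_gz, gz):
    """Gradient-matching generator objective (Eq. 13).
    d_x, d_gz:      discriminator scores, shape (n,)
    grad_x, grad_gz: score gradients w.r.t. inputs, shape (n, dim)
    x, gz:          real and generated samples, shape (n, dim)"""
    # term 1: l1 distance between mean scores
    t1 = np.abs(d_x.mean() - d_gz.mean())
    # term 2: l2 distance between mean score gradients
    t2 = np.linalg.norm(grad_x.mean(axis=0) - grad_gz.mean(axis=0))
    # term 3: distance between means of the <gradient, sample> inner products
    t3 = np.abs((grad_x * x).sum(axis=1).mean() - (grad_gz * gz).sum(axis=1).mean())
    return float(t1 + t2 + t3)
```

If generated samples match the real ones in score, score gradient, and gradient-sample inner product, the objective is zero, which is exactly the term-by-term matching derived from Eqs. 11 and 12.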
Experimental Results
Synthetic 1D dataset
For the 1D synthetic dataset, we compare our model to Dist-GAN [Tran, Bui, and Cheung2018], a recent state-of-the-art GAN. We use the authors' code (https://github.com/tntrung/gan) for this 1D experiment. Here, we construct 1D synthetic data with 3 Gaussian modes (green), as shown in Fig. 2. This is more challenging than the one-mode demo of Dist-GAN.
We use small networks for both methods. Specifically, we create the encoder and generator networks with three fully-connected layers and the discriminator network with two fully-connected layers. We use ReLU for hidden layers and sigmoid for the output layer of the discriminator. The discriminator is smaller than the generator to make the training more challenging. The number of neurons in each hidden layer is 4, and the learning rate is 0.001 for both methods. Fig. 2 shows that our model can recover all three modes well, while Dist-GAN cannot (see the attached video demos in the supplementary material). Although both methods have good gradients of the discriminator scores (decision boundary) for the middle mode, it is difficult for Dist-GAN to recover this mode, as the gradients computed over generated samples are not explicitly forced to resemble those of real samples, as in our proposed method. Note that for this 1D experiment and the 2D experiment in the next section, we only evaluate our model with gradient matching (+GM), since we find that our new generator with gradient matching alone is already good enough; neighbors embedding is more useful for high-dimensional data samples, as will be discussed.
Synthetic 2D dataset
(Figure 4 caption) Our 2D synthetic data has 25 Gaussian modes (red dots). The black arrows are gradient vectors of the discriminator computed around the ground-truth modes. Panels from left to right: gradient maps of GAN, WGAN-GP, Dist-GAN, and ours.
For 2D synthetic data, we follow the experimental setup of [Tran, Bui, and Cheung2018] on the same 2D synthetic dataset. The dataset has 25 Gaussian modes in a grid layout (red points in Fig. 4) and contains 50K training points. We draw 2K generated samples for evaluating the generator. However, the performance reported in [Tran, Bui, and Cheung2018] is nearly saturated: for example, that method recovers all 25 modes and registers more than 90% of the total number of points, so it is hard to see a significant improvement of our method in that setting. Therefore, we decrease the number of hidden layers and the number of neurons per layer to make the networks more challenging. For a fair comparison, we use equivalent encoder, generator, and discriminator networks for all compared methods.
Network  d_in  d_out  N_h  d_h
Encoder (E)  2  2  2  64
Generator (G)  2  2  2  64
Discriminator (D)  2  1  2  64

The details of the network architecture are presented in Table 1. d_in, d_out, and d_h are the dimensions of the input, output, and hidden layers, respectively; N_h is the number of hidden layers. We use ReLU for hidden layers and sigmoid for output layers. To have a fair comparison, we carefully fine-tune the other methods to ensure that they perform their best on the synthetic data. For evaluation, a mode is missed if fewer than 20 generated samples are registered to it; registration is measured with respect to the mode's mean and a variance of 0.01. A method has mode collapse if there are missing modes. For this experiment, the prior distribution is a 2D uniform distribution. We use the Adam optimizer with learning rate lr = 0.001. The learning rate is decayed every 10K steps; this decay schedule avoids the learning rate saturating too quickly, which would be unfair to slowly-converging methods. The minibatch size is 128. The training stops after 500 epochs.
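The mode-coverage evaluation described above can be sketched as follows; the 3σ registration radius is our assumption, since the text only specifies the mode variance (0.01, i.e. σ = 0.1) and the 20-sample threshold:

```python
import numpy as np

def count_registered_modes(samples, modes, sigma=0.1, min_points=20):
    """Register each sample to its nearest ground-truth mode if it lies
    within 3*sigma of that mode (assumed radius); a mode is covered once it
    registers at least `min_points` samples. Returns (#covered modes,
    #registered points)."""
    d = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    registered = d[np.arange(len(samples)), nearest] <= 3.0 * sigma
    counts = np.bincount(nearest[registered], minlength=len(modes))
    return int((counts >= min_points).sum()), int(counts.sum())
```

With the 25-mode grid, a run's score is then simply the pair returned for the 2K generated samples.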
In this experiment, we compare our model to several state-of-the-art methods. ALI [Dumoulin et al.2016], VAE-GAN [Larsen et al.2015] and Dist-GAN [Tran, Bui, and Cheung2018] are recent works using an encoder/decoder in their models. WGAN-GP [Gulrajani et al.2017] is one of the state of the art. We also compare to VAE-based methods: VAE [Kingma and Welling2014] and β-VAE [Higgins et al.2016]. The numbers of covered (registered) modes and registered points during training are presented in Fig. 3, and the quantitative numbers of the last epochs are in Table 2. In this table, we also report Total Variation scores to measure mode balance [Tran, Bui, and Cheung2018]. The result for each method is the average of eight runs. Our method outperforms all others on the number of covered modes. Although WGAN-GP and Dist-GAN are stable with larger networks and the experimental setup of [Tran, Bui, and Cheung2018], they are less stable with our network architecture, miss many modes, and sometimes diverge. VAE-based methods often address mode collapse well, but in our experimental setup the small networks may affect reconstruction quality and consequently reduce their performance. Our method does not suffer serious mode collapse in any of the eight runs. Furthermore, we achieve a higher number of registered samples than all others, and our method is also the best on Total Variation (TV).
Method  #registered modes  #registered points  TV (True)  TV (Differential)
GAN [Goodfellow et al.2014]  14.25 ± 2.49  1013.38 ± 171.73  1.00 ± 0.00  0.90 ± 0.22
ALI [Dumoulin et al.2016]  17.81 ± 1.80  1281.43 ± 117.84  0.99 ± 0.01  0.72 ± 0.19
VAE-GAN [Larsen et al.2015]  12.75 ± 3.20  1042.38 ± 170.17  1.35 ± 0.70  1.34 ± 0.98
VAE [Kingma and Welling2014]  13.48 ± 2.31  1265.45 ± 72.47  1.81 ± 0.71  2.16 ± 0.72
β-VAE [Higgins et al.2016]  18.00 ± 2.33  1321.17 ± 95.61  1.17 ± 0.24  1.47 ± 0.28
WGAN-GP [Gulrajani et al.2017]  21.71 ± 1.35  1180.25 ± 158.63  0.90 ± 0.07  0.51 ± 0.06
Dist-GAN [Tran, Bui, and Cheung2018]  20.71 ± 4.42  1188.62 ± 311.91  0.82 ± 0.19  0.43 ± 0.12
Ours  24.39 ± 0.44  1461.83 ± 222.86  0.57 ± 0.17  0.31 ± 0.12
In addition, we follow [Thanh-Tung, Tran, and Venkatesh2018] to explore the gradient maps of the discriminator scores of the compared methods: standard GAN, WGAN-GP, Dist-GAN and ours, as shown in Fig. 4. These maps are important because they show the potential gradients that pull generated samples towards the real samples (red points). The gradient map of standard GAN is noisy, uncontrolled, and vanishes in many regions. The gradient map of WGAN-GP has more meaningful directions than GAN's: its gradients concentrate on the centroids (red points) of the training data, and it has gradients around most of the centroids. However, WGAN-GP still has some modes where the gradients do not point towards the ground-truth centroids. Both Dist-GAN and our method show better gradients than WGAN-GP; the gradients of our method are more informative for the generator to learn, as they guide better directions towards all real ground-truth modes.
CIFAR-10 and STL-10 datasets
For the CIFAR-10 and STL-10 datasets, we measure performance with FID scores [Heusel et al.2017]. FID can detect intra-class mode dropping, and measures the diversity as well as the quality of generated samples. We follow the experimental procedure and model architecture of [Miyato et al.2018] to compare methods. FID is computed from 10K real samples and 5K generated samples. Our default parameters are used for all experiments. The learning rate for Adam is lr = 0.0002. The generator is trained with 350K updates for the logarithm loss version (Eq. 7) and 200K for the hinge loss version (Eq. 8) to converge better. The dimension of the prior input is 128. All our experiments are conducted in the unsupervised setting.
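For reference, given feature vectors of real and generated samples, the FID [Heusel et al.2017] used above is the Fréchet distance between two Gaussians fitted to the features. A NumPy/SciPy sketch (illustrative, not the evaluation code used in the paper; in the actual protocol the features are Inception activations of 10K real and 5K generated images):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(f_real, f_fake):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1 S2)^(1/2)),
    with (mu, S) the mean and covariance of each feature set."""
    mu1, mu2 = f_real.mean(axis=0), f_fake.mean(axis=0)
    s1 = np.cov(f_real, rowvar=False)
    s2 = np.cov(f_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):      # numerical noise can yield tiny complex parts
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))
```

Lower is better: identical feature distributions give an FID of zero, and both mode dropping and low sample quality shift the fitted Gaussian and increase the score.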
Method  CIFAR  STL  CIFAR (R)
GAN-GP  37.7  -  -
WGAN-GP  40.2  55.1  -
SN-GAN  25.5  43.2  21.7 ± .21
Dist-GAN  22.95  36.19  -
Ours  21.70  30.80  16.47 ± .28
In the first experiment, we conduct an ablation study of our newly proposed techniques to understand the contribution of each component to the model. Experiments use the standard CNN architecture [Miyato et al.2018] on the CIFAR-10 dataset, with the logarithm version of the discriminator objective (Eq. 7). Our original model is similar to the Dist-GAN model, but with some modifications, such as using lower weights for the reconstruction constraint, which we find improves FID scores. We consider Dist-GAN as the baseline for this comparison. FID is computed every 10K iterations and shown in Fig. 5. Our original model converges a little slowly at the beginning, but in the end it achieves a better FID score than the Dist-GAN model. Once we add each proposed technique separately, either neighbors embedding (+NE) or gradient matching (+GM), the model converges faster and reaches a better FID score than the original one. Combining the two proposed techniques further speeds up convergence and reaches a better FID score than the other versions. This experiment shows that our proposed techniques can improve the diversity of generated samples. Note that in Fig. 5, we compare Dist-GAN and our original model using only discriminator scores; with GM, our model converges faster and achieves better FID scores.
We compare our best setting (+NE+GM) with the hinge loss version (Eq. 8) against other methods; results are shown in Table 3. The FID scores of SN-GAN and Dist-GAN are also with the hinge loss function. We also report our performance with the ResNet (R) architecture [Miyato et al.2018] for the CIFAR-10 dataset. For both the standard CNN and ResNet architectures, our model outperforms other state-of-the-art methods in FID score, with especially significant margins on the STL-10 dataset with CNN and on the CIFAR-10 dataset with ResNet. For the STL-10 dataset and the ResNet architecture, the generator is trained with 200K iterations to reduce training time; training longer does not significantly improve the FID score. Fig. 6 shows some generated samples of our method trained on the CIFAR-10 and STL-10 datasets.

Our proposed techniques are not only usable in our model, but can also be applied to other GAN models. We demonstrate this by applying them to the standard GAN [Goodfellow et al.2014]. This experiment is conducted on the CIFAR-10 dataset using the same CNN architecture as [Miyato et al.2018]. We regularize the generator of GAN with our proposed neighbors embedding or gradient matching constraints, separately or in combination, replacing the original generator objective of GAN. When applied separately, each of NE and GM by itself can significantly improve FID, as shown in Fig. 7. In addition, from Fig. 7, GM+NE achieves an FID of 26.05 (last iteration), a significant improvement over GM alone (FID of 31.50) and NE alone (FID of 38.10). It is interesting that GM can also reduce mode collapse; we leave further investigation to future work. Although both can handle mode collapse, NE and GM are very different ideas: NE is a manifold-learning-based regularization that explicitly prevents mode collapse, while GM aligns the distributions of generated and real samples. The results (Figs. 5 and 7) show that GM+NE leads to better convergence and FID scores than the individual techniques.
To examine the computational cost of the gradient-matching term in our proposed generator objective, we measure the training time for one minibatch (size 64) with and without GM (computer: Intel Xeon Octa-core CPU E5-1260 3.7GHz, 64GB RAM, Nvidia 1080Ti GPU) using the CNN architecture for CIFAR-10. It takes about 53ms and 43ms to train the generator for one minibatch with and without the GM term, respectively. For 300K iterations (one minibatch per iteration), training with GM takes about one more hour compared to without GM; the overhead is not significant. Note that GM involves ℓ1 and ℓ2 norms of the differences of discriminator scores and of their gradients, which can be computed easily in TensorFlow.
Conclusion
We propose two new techniques to address mode collapse and improve the diversity of generated samples. First, we propose an inverse t-SNE regularizer to explicitly retain the local structure of latent samples in the generated samples, reducing mode collapse. Second, we propose a new gradient-matching regularization for the generator objective, which improves convergence and the quality of generated images. We derived this gradient-matching constraint from a Taylor expansion. Extensive experiments demonstrate that both constraints can improve GAN. The combination of our proposed techniques leads to state-of-the-art FID scores on benchmark datasets. Future work will apply our model to other applications, such as person re-identification [Guo and Cheung2012] and generative data augmentation [Lim et al.2018].

Acknowledgement

This work was supported by both ST Electronics and the National Research Foundation (NRF), Prime Minister's Office, Singapore under the Corporate Laboratory @ University Scheme (Programme Title: STEE Infosec - SUTD Corporate Laboratory).
References
 [Arjovsky and Bottou2017] Arjovsky, M., and Bottou, L. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.
 [Arjovsky, Chintala, and Bottou2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. ICML.
 [Bellemare et al.2017] Bellemare, M. G.; Danihelka, I.; Dabney, W.; Mohamed, S.; Lakshminarayanan, B.; Hoyer, S.; and Munos, R. 2017. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743.
 [Berthelot, Schumm, and Metz2017] Berthelot, D.; Schumm, T.; and Metz, L. 2017. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
 [Che et al.2016] Che, T.; Li, Y.; Jacob, A. P.; Bengio, Y.; and Li, W. 2016. Mode regularized generative adversarial networks. CoRR.
 [Chen et al.2016] Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172–2180.
 [Donahue, Krähenbühl, and Darrell2016] Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.
 [Dumoulin et al.2016] Dumoulin, V.; Belghazi, I.; Poole, B.; Lamb, A.; Arjovsky, M.; Mastropietro, O.; and Courville, A. 2016. Adversarially learned inference. arXiv preprint arXiv:1606.00704.
 [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
 [Goodfellow2016] Goodfellow, I. 2016. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160.
 [Gulrajani et al.2017] Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, 5767–5777.
 [Guo and Cheung2012] Guo, Y., and Cheung, N.-M. 2012. Efficient and deep person re-identification using multi-level similarity. In CVPR.
 [Heusel et al.2017] Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 6626–6637.
 [Higgins et al.2016] Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2016. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR.
 [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
 [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. ICLR.
 [Kodali et al.2017] Kodali, N.; Abernethy, J.; Hays, J.; and Kira, Z. 2017. On convergence and stability of gans. arXiv preprint arXiv:1705.07215.
 [Larsen et al.2015] Larsen, A. B. L.; Sønderby, S. K.; Larochelle, H.; and Winther, O. 2015. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
 [Lim et al.2018] Lim, S. K.; Loo, Y.; Tran, N.T.; Cheung, N.M.; Roig, G.; and Elovici, Y. 2018. Doping: Generative data augmentation for unsupervised anomaly detection. In Proceeding of IEEE International Conference on Data Mining (ICDM).
 [Liu2018] Liu, K. 2018. Varying k-Lipschitz constraint for generative adversarial networks. arXiv preprint arXiv:1803.06107.

 [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
 [Makhzani et al.2016] Makhzani, A.; Shlens, J.; Jaitly, N.; and Goodfellow, I. 2016. Adversarial autoencoders. In International Conference on Learning Representations.
 [Mescheder, Geiger, and Nowozin2018] Mescheder, L.; Geiger, A.; and Nowozin, S. 2018. Which training methods for gans do actually converge? In International Conference on Machine Learning, 3478–3487.
 [Mescheder, Nowozin, and Geiger2017] Mescheder, L.; Nowozin, S.; and Geiger, A. 2017. The numerics of gans. In Advances in Neural Information Processing Systems, 1825–1835.
 [Metz et al.2017] Metz, L.; Poole, B.; Pfau, D.; and Sohl-Dickstein, J. 2017. Unrolled generative adversarial networks. ICLR.
 [Miyato et al.2018] Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018. Spectral normalization for generative adversarial networks. ICLR.
 [Nagarajan and Kolter2017] Nagarajan, V., and Kolter, J. Z. 2017. Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, 5585–5595.
 [Nowozin, Cseke, and Tomioka2016] Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS.
 [Park et al.2018] Park, N.; Anand, A.; Moniz, J. R. A.; Lee, K.; Chakraborty, T.; Choo, J.; Park, H.; and Kim, Y. 2018. MMGAN: manifold matching generative adversarial network for generating images. CoRR abs/1707.08273.
 [Petzka, Fischer, and Lukovnicov2017] Petzka, H.; Fischer, A.; and Lukovnicov, D. 2017. On the regularization of wasserstein gans. arXiv preprint arXiv:1709.08894.
 [Roth et al.2017] Roth, K.; Lucchi, A.; Nowozin, S.; and Hofmann, T. 2017. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, 2018–2028.
 [Salimans et al.2016] Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training gans. In NIPS, 2234–2242.
 [ThanhTung, Tran, and Venkatesh2018] ThanhTung, H.; Tran, T.; and Venkatesh, S. 2018. On catastrophic forgetting and mode collapse in generative adversarial networks. In Workshop on Theoretical Foundation and Applications of Deep Generative Models.
 [Tran, Bui, and Cheung2018] Tran, N.-T.; Bui, T.-A.; and Cheung, N.-M. 2018. Dist-GAN: An improved GAN using distance constraints. In ECCV.
 [Yazıcı et al.2018] Yazıcı, Y.; Foo, C.S.; Winkler, S.; Yap, K.H.; Piliouras, G.; and Chandrasekhar, V. 2018. The unusual effectiveness of averaging in gan training. arXiv preprint arXiv:1806.04498.
 [Zhao, Mathieu, and LeCun2017] Zhao, J.; Mathieu, M.; and LeCun, Y. 2017. Energy-based generative adversarial network. ICLR.