One of the most fundamental problems in machine learning is the unsupervised learning of high-dimensional data. A class of problems in unsupervised learning is density estimation, where it is assumed that there exists a class of probabilistic models underlying the observed data and the goal of learning is to infer the right model(s). The generative adversarial network (GAN) proposed by Goodfellow et al. (goodfellow2014generative, ) is an elegant framework that transforms the problem of density estimation into an adversarial process in a minimax game between a generative network $G$ and a discriminative network $D$. However, despite their simplicity, GANs are notoriously difficult to train.
Mode collapse— There are different schools of thought on diagnosing and addressing the problems with training GANs, which have resulted in a variety of algorithms, network architectures, training procedures, and novel objective functions (radford2015unsupervised, ; salimans2016improved, ; zhao2016energy, ; arjovsky2017wasserstein, ; nowozin2016f, ). The roots of the problems in training GANs lie in the unbalanced nature of the game being played, the difficulty of high-dimensional minimax optimization, and the fact that the data manifold is highly structured in the ambient space $\mathcal{X}$. Perhaps the biggest challenge is that natural data reside on a very low-dimensional manifold of their ambient space (narayanan2010sample, ). Early in training, the generative network is far from this low-dimensional manifold and the discriminative network quickly learns to reject the generated samples, leaving the generator little room to improve. This was analyzed in depth by Arjovsky & Bottou (arjovsky2017towards, ), who highlighted the deficiencies of $f$-divergences when the generative network has a low-dimensional support. The other challenging issue is that the optimal point of a GAN is a saddle point. We have a good understanding of, and a variety of optimization methods for, finding local minima/maxima of objective functions, but minimax optimization in high-dimensional spaces has proven to be challenging. Because of these two obstacles, i.e. the nature of high-dimensional data and the nature of the optimization, GANs suffer from stability issues and the ubiquitous problem of mode collapse, where the generator completely ignores parts of the low-dimensional data manifold.
$\beta$-GAN— In this work, we address these two issues at the same time by lifting the minimax game, where the initial objective is to find the GAN equilibrium in an easier game of learning to map uniform noise $z$ to a uniformly distributed sample $x_0 \in \mathcal{X}$. Here, $z$ is the noise variable corresponding to the latent space, and $d$ is the dimension of the ambient space $\mathcal{X}$. The subscript in $x_0$ refers to the inverse temperature $\beta = 0$, which is defined in the next section. After arriving at the equilibrium for $\beta = 0$, we anneal the uniform distribution towards the data distribution while performing the adversarial training simultaneously. Our assumption in this work is that once the GAN is stable for the uniform distribution, it will remain stable in the continuous annealing limit irrespective of the divergence measure being used in the objective function. In this work, we used the original Jensen-Shannon formulation of Goodfellow et al. (goodfellow2014generative, ). The objective of learning the uniform distribution puts constraints on the architecture of the generative network, most importantly $\dim(z) = d$, which has deep consequences for the adversarial training as discussed below.
Related works— There are similarities between our approach and recent proposals for stabilizing GAN training by adding noise to samples from the generator and to the data points (kaae2016amortised, ; arjovsky2017towards, ). This was called instance noise in (kaae2016amortised, ). The key insight was provided in (arjovsky2017towards, ), where the role of the noise was to enlarge the support of the generative network and of the data distribution, which leads to stronger learning signals for the generative network during training. The crucial difference in this work is that we approach the problem from the perspective of annealing distributions, and our starting point is to generate the uniform distribution, whose support is the whole ambient space $\mathcal{X}$. This simple starting point is a straightforward solution to the theoretical problems raised in (arjovsky2017towards, ) in using $f$-divergences for adversarial training, where it was assumed that the support of the generative network has measure zero in the ambient space $\mathcal{X}$. Since the uniform distribution is not normalized in $\mathbb{R}^d$, we assumed $\mathcal{X}$ to be a finite $d$-dimensional box in $\mathbb{R}^d$. A good physical picture to have is to imagine the data manifold diffusing to the uniform distribution like ink in a $d$-dimensional vase filled with water. What $\beta$-GAN achieves during annealing is to shape the space-filling samples, step by step, into samples that lie on the low-dimensional manifold of the data distribution. Therefore, in our framework, there is no need to add any noise to samples from the generator (in contrast to (kaae2016amortised, ; arjovsky2017towards, )), since the generator support is initialized to be the ambient space. Finally, one can also motivate $\beta$-GAN from the perspective of curriculum learning (bengio2009curriculum, ), where learning the uniform distribution is the initial task in the curriculum.
In this section, we define the parameter $\beta$, which plays the role of inverse temperature and parametrizes annealing from the uniform distribution ($\beta = 0$) to the data distribution ($\beta = \infty$). We provide a new algorithm for training GANs based on a conjecture with stability guarantees in the continuous annealing limit. We used the Jensen-Shannon formulation of GANs (goodfellow2014generative, ) below, but the conjecture holds for other measures including $f$-divergences (nowozin2016f, ) and the Wasserstein metric (arjovsky2017wasserstein, ).
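We assume the generative and discriminative networks $G$ and $D$ have very large capacity, parameterized by deep neural networks $G(z;\theta_G)$ and $D(x;\theta_D)$. Here, $z$ is the (noise) input to the generative network $G$, and $D(x;\theta_D)$ is the discriminative network, which performs logistic regression. The discriminative network is trained with the binary classification labels $D = 1$ for the $N$ observations $\{x^{(1)}, \dots, x^{(N)}\}$, and $D = 0$ otherwise. The GAN objective is to find $\theta_G^*$ such that $G(z;\theta_G^*) \sim p_{\rm data}(x)$. This is achieved at the Nash equilibrium of the following minimax objective:
$$\theta_G^* = \arg\min_{\theta_G} \max_{\theta_D} f(\theta_D, \theta_G), \qquad f(\theta_D, \theta_G) = \mathbb{E}_{x \sim p_{\rm data}(x)} \log D(x;\theta_D) + \mathbb{E}_{z \sim p(z)} \log\big(1 - D(G(z;\theta_G);\theta_D)\big),$$
where at the equilibrium $D(G(z)) = 1/2$ (goodfellow2014generative, ). One way to introduce $\beta$ is to go back to the empirical distribution and rewrite it as a mixture of Gaussians with zero widths:
$$p_{\rm data}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta\big(x - x^{(i)}\big) = \lim_{\beta \to \infty} \frac{1}{N} \left(\frac{\beta}{2\pi}\right)^{d/2} \sum_{i=1}^{N} \exp\left(-\frac{\beta \, \|x - x^{(i)}\|^2}{2}\right).$$
The heated data distribution at finite $\beta$ is therefore given by:
$$p_{\rm data}(x;\beta) = \frac{1}{N} \left(\frac{\beta}{2\pi}\right)^{d/2} \sum_{i=1}^{N} \exp\left(-\frac{\beta \, \|x - x^{(i)}\|^2}{2}\right).$$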
The $d$-dimensional box— The starting point in $\beta$-GAN is to learn to sample from the uniform distribution. Since the uniform distribution is not normalized in $\mathbb{R}^d$, we set $\mathcal{X}$ to be a finite $d$-dimensional interval. The uniform distribution sets the scale in our framework, and the samples drawn from the heated data distribution are rescaled to the same interval. This hard $d$-dimensional box for the data particles is thus assumed throughout the paper. Its presence is conceptually equivalent to a diffusion process of the data particles in the box $\mathcal{X}$, where they diffuse to the uniform distribution like ink dropped in water (sohl2015deep, ). In this work, we parametrized the distributions with $\beta$ instead of the diffusion time. We also mention a non-Gaussian path to the uniform distribution in the discussion section.
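Sampling from a Gaussian mixture centred on the data points amounts to picking a data point at random and adding isotropic Gaussian noise. The following is a minimal sketch of that step, assuming a per-coordinate variance of $1/\beta$ (consistent with $\beta$ acting as an inverse temperature). The function name `sample_heated` and its arguments are illustrative and do not come from the paper, and the rescaling of the samples into the box is not shown because the text does not specify how it is done.

```python
import math
import random
from typing import List, Sequence


def sample_heated(data: Sequence[Sequence[float]], beta: float, m: int,
                  rng: random.Random) -> List[List[float]]:
    """Draw m samples from a Gaussian mixture centred on the data points.

    Assumption: every component is isotropic with variance 1/beta per
    coordinate, so a very large beta reproduces the data points themselves.
    """
    if beta <= 0:
        raise ValueError("beta must be positive; beta = 0 is the uniform limit")
    sigma = 1.0 / math.sqrt(beta)
    samples = []
    for _ in range(m):
        centre = rng.choice(data)  # pick one data point uniformly at random
        samples.append([c + rng.gauss(0.0, sigma) for c in centre])
    return samples
```

At small $\beta$ the samples spread far from the data points, and as $\beta$ grows they concentrate on them, which is the behaviour the annealing relies on.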
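With this setup, the minimax optimization task at each $\beta$ is:
$$\theta_G^*(\beta) = \arg\min_{\theta_G} \max_{\theta_D} f(\theta_D, \theta_G; \beta), \qquad f(\theta_D, \theta_G; \beta) = \mathbb{E}_{x \sim p_{\rm data}(x;\beta)} \log D(x;\theta_D) + \mathbb{E}_{z \sim p(z)} \log\big(1 - D(G(z;\theta_G);\theta_D)\big).$$
Note that the optimal parameters $\theta_G^*$ and $\theta_D^*$ depend on $\beta$ implicitly. In $\beta$-GAN, the first task is to learn to sample the uniform distribution. The GAN is then trained while the uniform distribution is smoothly annealed to the empirical distribution by increasing $\beta$. We chose a simple fixed geometric schedule for annealing in this work. The algorithm is given below (see Fig. 1 for the schematic):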
[Figure 1: schematic of $\beta$-GAN. Surviving caption text: "... for the final epochs."]
Algorithm 1: Minibatch stochastic gradient descent training of annealed generative adversarial networks. The inner loop can be replaced with other GAN architectures and/or other divergence measures. The version used here takes the Jensen-Shannon formulation of Goodfellow et al. as the objective, as do all experiments in this paper.
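The outer structure described in the text (first learn the uniform distribution, then train while $\beta$ increases on a fixed geometric schedule) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' listing: the names `geometric_schedule`, `anneal`, `train_uniform`, `train_at_beta`, and `steps_per_level` are hypothetical, and the inner GAN update is left as a caller-supplied function so that any architecture or divergence can be plugged in.

```python
from typing import Callable, List, Sequence


def geometric_schedule(beta_first: float, beta_last: float,
                       num_levels: int) -> List[float]:
    """Inverse temperatures beta_1 < ... < beta_K with a constant ratio."""
    if num_levels < 2 or not 0.0 < beta_first < beta_last:
        raise ValueError("need num_levels >= 2 and 0 < beta_first < beta_last")
    ratio = (beta_last / beta_first) ** (1.0 / (num_levels - 1))
    return [beta_first * ratio ** k for k in range(num_levels)]


def anneal(train_uniform: Callable[[], None],
           train_at_beta: Callable[[float], None],
           betas: Sequence[float],
           steps_per_level: int) -> None:
    """Outer annealing loop around a user-supplied GAN update.

    train_uniform() is assumed to bring G and D to equilibrium on the
    uniform distribution (beta = 0). train_at_beta(beta) is assumed to run
    one minibatch update of D and G against the heated data distribution,
    keeping the network parameters from the previous step (warm start).
    """
    train_uniform()
    for beta in betas:
        for _ in range(steps_per_level):
            train_at_beta(beta)
```

For example, `geometric_schedule(0.1, 10.0, 3)` yields levels close to 0.1, 1 and 10; these numbers are placeholders and not the annealing parameters used in the experiments.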
The convergence of the algorithm is based on the following conjecture:
In the continuous annealing limit from the uniform distribution to the data distribution, the GAN remains stable at the equilibrium, assuming $G$ and $D$ have large capacity and that they are initialized at the minimax equilibrium for generating the uniform distribution in the ambient space $\mathcal{X}$ (this requires $\dim(z) = d$).
Figure 2: [...] which is the total number of gradient evaluations from the start. We use the architecture G:[z(3) | ReLU(128) | ReLU(128) | Linear(3)] and D:[x(3) | Tanh(128) | Tanh(128) | Tanh(128) | Sigmoid(1)] for the generator and the discriminator, where the numbers in parentheses show the number of units in each layer. The annealing parameters are .
$\beta$-GAN starts with learning to generate the uniform distribution in the ambient space of the data. The mapping that transforms the uniform distribution (we used the uniform prior for $z$ in all our experiments) to the uniform distribution of the same dimension is an affine function. We therefore used only the ReLU nonlinearity in the generative network to make the job of the generator easier. The performance of the network in generating the uniform distribution was degraded by using smooth nonlinearities like Tanh, which led to immediate mode collapse to frozen noise instead of generating high-entropy noise (see Figure 4). The mode collapse to frozen noise was especially prominent in high dimensions.
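One way to see why a piecewise-linear generator suits this first task is that a single hidden ReLU layer can represent an affine map exactly, through the identity $z = \mathrm{ReLU}(z) - \mathrm{ReLU}(-z)$. The sketch below hard-codes such weights for illustration; the function `relu_affine_generator` and its `scale` and `shift` arguments are hypothetical and are not the trained networks from the experiments.

```python
from typing import List


def relu(v: float) -> float:
    return v if v > 0.0 else 0.0


def relu_affine_generator(z: List[float], scale: float = 1.0,
                          shift: float = 0.0) -> List[float]:
    """One hidden ReLU layer with 2*d units followed by a linear layer.

    The hidden units are relu(z_i) and relu(-z_i). The linear layer outputs
    scale * (relu(z_i) - relu(-z_i)) + shift, which equals scale * z_i + shift,
    so uniform noise in the input box is mapped to uniform samples in a
    rescaled box of the same dimension.
    """
    d = len(z)
    hidden = [relu(zi) for zi in z] + [relu(-zi) for zi in z]
    return [scale * (hidden[i] - hidden[d + i]) + shift for i in range(d)]
```

A smooth saturating unit such as Tanh has no exact counterpart of this construction, which is consistent with the degradation reported above, although the sketch itself does not demonstrate the frozen-noise collapse.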
3.1 Toy examples
To check the stability of $\beta$-GAN, we ran experiments on mixtures of 1D, 2D, and 3D Gaussians, and on a mixture of two cubic frames in 3D. The 3D results are presented here. The reported results for vanilla GAN (top row of Fig. 2) were the best among many runs; in most experiments vanilla GAN captured only one mode or failed to capture any mode. In contrast, $\beta$-GAN produced similar results consistently. In addition, vanilla GAN requires the modification of the generator loss to $-\log D(G(z))$ to avoid saturation of the discriminator (goodfellow2014generative, ), while in $\beta$-GAN we did not make any modification, staying with the generator loss $\log(1 - D(G(z)))$. In the experiments, the total number of training iterations in $\beta$-GAN was the same as in vanilla GAN, but distributed over many intermediate temperatures, thus curbing the computational cost. We characterized the computational cost by the total number of gradient evaluations reported in Fig. 2. We also compared the training curves of $\beta$-GAN and vanilla GAN for mixtures of five and ten Gaussians (see Fig. 3).
We also synthesized a dataset that is a mixture of two cubic frames, one enclosed by the other. This dataset is interesting since the data is located on disjoint 1D manifolds within the 3D ambient space. $\beta$-GAN performs well in this case in every run of the algorithm (see bottom row of Fig. 2).
We should emphasize that different GAN architectures can be easily augmented with $\beta$-GAN as the outer loop. In the 3D experiments here, we chose the original architecture of the generative adversarial network from (goodfellow2014generative, ) as the inner loop (see Algorithm 1). In the next section we show the results for more sophisticated GAN architectures.
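The saturation issue behind the two generator losses can be made concrete by comparing their gradients with respect to the discriminator logit $a$, where $D(G(z)) = \sigma(a)$. The sketch below is a small numerical illustration under that assumption (a sigmoid output unit); the function names are hypothetical.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def grad_original_loss(a: float) -> float:
    """d/da of log(1 - sigmoid(a)), the original generator loss."""
    return -sigmoid(a)


def grad_modified_loss(a: float) -> float:
    """d/da of -log(sigmoid(a)), the commonly used modified generator loss."""
    return -(1.0 - sigmoid(a))


# When the discriminator confidently rejects a generated sample (a << 0),
# the original loss gives an almost vanishing gradient, while the modified
# loss keeps a gradient of magnitude close to one.
for a in (-6.0, 0.0, 6.0):
    print(a, grad_original_loss(a), grad_modified_loss(a))
```

Under this reading, training with the unmodified loss only works if the discriminator never rejects the generated samples too confidently, which is the regime that starting from the uniform distribution is meant to provide.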
3.2 High-dimensional examples
To check the performance of our method in higher dimensions, we applied $\beta$-GAN to the MNIST dataset (lecun1998gradient, ), with dimension $d = 784$, and to the CelebA dataset (liu2015faceattributes, ). Once again, we start from generating the uniform distribution in the ambient space of the data, and we use only piecewise linear activation functions for the generative network due to the frozen noise mode collapse that we discussed earlier.
The performance of $\beta$-GAN for the MNIST dataset with a fully connected network is shown in Fig. 5. As $\beta$ gradually increases, the network learns to generate noisy images corresponding to each temperature. The results converge to clean MNIST images in the last epochs of training, where the data distribution is cooled down at a high value of $\beta$. During intermediate epochs, noisy digits are generated, which are still diverse. This behavior is in contrast with the training of vanilla GAN, where collapse to a single mode is common in intermediate iterations. The same experiment was performed for the CelebA dataset with the same annealing procedure, starting from the uniform distribution and annealing to the data distribution. The results are reported in Figure 6.
Regarding annealing from the uniform distribution to the data distribution, we used the same annealing schedule in all our experiments (mixtures of Gaussians with different numbers of modes, the mixture of interlaced cubes, MNIST, and CelebA), and we consistently achieved the results reported here. This highlights the stability of $\beta$-GAN. We think this stability is due to the $\beta$-GAN conjecture (see Section 2), even though the annealing is not continuous in the experiments.
We emphasize that both MNIST and CelebA images were generated with $\dim(z)$ equal to the dimension of the respective ambient space. At the beginning, the support of the generated distribution (i.e. the uniform distribution) is the ambient space. During annealing, $\beta$-GAN learns, step by step, to shape the space-filling samples into samples that lie on the manifold of MNIST digits and CelebA faces.
Figure 3: The curves shown here are the output of the discriminator (which is a classifier in this case) for the real and generated samples. For $\beta$-GAN, the training curves show a more stable behavior, with more robustness to the complexity of the input data (a,b). However, when the data gets more complex, vanilla GAN performance gets worse, as signified by the growing gap between $D(x)$ and $D(G(z))$ (c,d).
In this work, we departed from the current practices in training adversarial networks by giving the generative network the capacity to fill the ambient space in the form of the uniform distribution. The uniform distribution was motivated by statistical mechanics, where we imagined the data particles diffusing like ink dropped in water. The parameter $\beta$ can be thought of as a surrogate for this diffusion process. There are in fact many ways to transform the data distribution to the uniform distribution. A non-Gaussian approach is flipping bits randomly in the bit representation (saremi2013hierarchical, ; saremi2016correlated, ); this process will take any distribution to the uniform distribution in the limit of many bit flips. The starting point in $\beta$-GAN has deep consequences for the adversarial training. It is a straightforward solution to the theoretical problems raised in (arjovsky2017towards, ), since the results there were based on $\dim(z) < d$. However, despite the success of $\beta$-GAN in our experiments, the brute-force approach with $\dim(z) = d$ may not be practical in large dimensions. We are working on ideas to incorporate multi-scale representations (denton2015deep, ) into this framework, and are considering dimensionality reduction as a pre-processing step before feeding data into $\beta$-GAN. To emphasize the robustness of $\beta$-GAN, we reported results with a fixed annealing schedule, but we have also explored ideas from feedback control (BerthelotSM17BEGAN, ) to make the annealing adaptive.
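The bit-flipping path to the uniform distribution can be illustrated with a short calculation. Assuming, for illustration only, that every bit is flipped independently with probability $p$ at each step (the cited works may use a different flipping process), the probability that a bit differs from its initial value after $t$ steps is $\tfrac{1}{2}\big(1 - (1 - 2p)^t\big)$, which tends to $1/2$ whatever the starting value. The function name below is hypothetical.

```python
def prob_bit_changed(p: float, t: int) -> float:
    """Probability that a bit differs from its initial value after t steps,
    when it is flipped independently with probability p at every step."""
    if not 0.0 <= p <= 1.0 or t < 0:
        raise ValueError("need 0 <= p <= 1 and t >= 0")
    return 0.5 * (1.0 - (1.0 - 2.0 * p) ** t)


# The probability approaches 1/2, i.e. every bit becomes a fair coin and
# the bit string becomes uniformly distributed, independent of the data.
for t in (0, 1, 10, 100):
    print(t, prob_bit_changed(0.1, t))
```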
Figure 5: [...] We use the fully connected architecture G:[z(784) | BNReLU(256) | BNReLU(256) | BNReLU(256) | Linear(784)] and D:[x(784) | BNReLU(256) | BNReLU(512) | BNReLU(512) | Sigmoid(1)] for the generator and the discriminator, where the numbers in parentheses show the number of units in each layer. BNReLU is batch normalization (IoffeS15batchnormal, ) followed by a ReLU activation. The annealing parameters are the same as in the 3D experiment in Fig. 2.
SS acknowledges the support by CIFAR. We also acknowledge comments by Brian Cheung on the manuscript.
-  Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
-  Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
-  David Berthelot, Tom Schumm, and Luke Metz. BEGAN: boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017.
-  Emily L Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
-  Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised map inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
-  Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pages 1786–1794, 2010.
-  Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.
-  Saeed Saremi and Terrence J Sejnowski. Hierarchical model of natural images and the origin of scale invariance. Proceedings of the National Academy of Sciences, 110(8):3071–3076, 2013.
-  Saeed Saremi and Terrence J Sejnowski. Correlated percolation, fractal structures, and scale-invariant distribution of clusters in natural images. IEEE transactions on pattern analysis and machine intelligence, 38(5):1016–1020, 2016.
-  Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.
-  Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.