Since the introduction of generative adversarial networks (GANs) [Goodfellow et al.2014], researchers have delved deeply into improving the quality of generated images. Recently, a number of new approaches have been proposed for high-quality image generation, e.g., ProgressiveGAN [Karras et al.2017], SplittingGAN [Grinblat et al.2017], SGAN [Huang et al.2017], and WGAN-GP [Salimans et al.2016].
We propose a novel GAN model equipped with multiple generators, each of which specializes in learning a certain modality of the dataset (see Fig. 1). In addition to the generators, we employ an auxiliary network that determines which generator will be trained from a given training instance. We call this auxiliary network the gating networks, following the precedent of [Jacobs et al.1991].
Ensembling multiple neural networks coupled with gating networks was first introduced to achieve higher performance in multi-speaker phoneme recognition [Hampshire and Waibel1990]. In their method, the design of the loss function caused the neural networks to cooperate. Later research introduced a new loss function that stimulates competition among the neural networks, so that each network attempts to specialize in a certain task rather than redundantly learning the same features [Jacobs et al.1991]. The algorithm is now called the mixture of experts in various machine learning domains. Reminiscent of their work, we name our proposed GAN approach MEGAN, short for the mixture of experts GAN.
The gating networks in our proposed MEGAN are responsible for selecting one particular generator that would perform best given a certain condition. The gating networks consist of two submodules, an assignment module and Straight-Through Gumbel-Softmax [Jang et al.2016], which we will discuss in detail in Section 4.2.
Although MEGAN inherits the idea of multiple generators and the gating networks, we do not adopt the loss function proposed by [Jacobs et al.1991]; instead, we utilize adversarial learning to leverage the latest success of GANs.
Our work has two contributions. First, we build a mixture-of-experts GAN algorithm that is capable of encouraging generators to learn different modalities existing in our data. Second, we utilize the recently proposed Gumbel-Softmax reparameterization trick and develop a load-balancing regularization to further stabilize the training of MEGAN. We evaluate our model using various criteria, notably achieving an MS-SSIM score of 0.2470 for CelebA, which suggests that MEGAN generates more diverse images compared to other baseline models. Our generated samples also achieve a competitive inception score of 8.33 in an unsupervised setting.
2 Related Work
Several studies on GANs have been proposed to stabilize the learning process and improve the quality of generated samples. Some of these studies incorporated novel distance metrics to achieve better results. For instance, the original GAN [Goodfellow et al.2014] suffers from the vanishing gradient problem arising from the sigmoid cross-entropy loss function used by the discriminator. LSGAN [Mao et al.2017] addresses this problem by substituting the cross-entropy loss with the least-squares loss function. WGAN [Arjovsky et al.2017] adopts the Earth mover’s distance, which enables more stable training and alleviates the infamous mode collapse problem. WGAN-GP goes one step further by adopting a gradient penalty term for stable training and higher performance. Meanwhile, BEGAN [Berthelot et al.2017] aims to match auto-encoder loss distributions using a loss derived from the Wasserstein distance, instead of matching the data distributions directly. In addition, DRAGAN [Kodali et al.2017] prevents mode collapse using a no-regret algorithm.
Other algorithms such as AdaGAN [Tolstikhin et al.2017] and MGAN [Hoang et al.2017] employ ensembling approaches, using multiple generators to learn complex distributions. Based on the idea of boosting in the context of ensemble models, AdaGAN trains generators sequentially, adding a new generator to the mixture of generators at each step. While AdaGAN gradually decreases the importance of generators as more generators are added into the model, our proposed MEGAN pursues an equal balance between generators by explicitly regularizing the model, which avoids the problem of being dominated by a particular generator.
MGAN adopts a predefined mixture weight of generators and trains all generators simultaneously; in contrast, our proposed MEGAN dynamically switches between generators through the gating networks and trains the generators one at a time. MGAN’s fixed mixture model is sub-optimal compared to our trainable mixture model. Our proposed MEGAN differs from these models in that each generator can generate images on its own and learn different and salient features.
MAD-GAN [Ghosh et al.2017] has strong relevance to our work. Having multiple generators and a carefully designed discriminator, MAD-GAN overcomes the mode-collapse problem by explicitly forcing each generator to learn different mode clusters of a dataset. Our MEGAN and MAD-GAN are similar in that both models allow the generators to specialize in different submodalities. However, MEGAN differs from MAD-GAN in two aspects. First, all generators in MEGAN share the same latent vector space, while the generators of MAD-GAN are built on separate latent vector spaces. Second, the generators of MAD-GAN can theoretically learn identical mode clusters; however, the gating networks built into our MEGAN ensure by design that each generator learns different modes.
3 Categorical Reparameterization
Essentially, GANs generate images when given latent vectors. Given $n$ generators and a latent vector $z$, our model aims to select a particular generator that will produce the best-quality image. This raises the question of how a categorical decision is made. The Gumbel-Max trick [Gumbel1954, Maddison et al.2014] allows us to sample a one-hot vector $s$ based on the underlying probability distribution $\pi$:

$$s = \text{one\_hot}\Big(\arg\max_i \,\big(g_i + \log \pi_i\big)\Big), \qquad g_i = -\log(-\log u_i), \tag{1}$$

where $u_i$ is sampled from Uniform(0,1). However, the $\arg\max$ operator in the Gumbel-Max trick is a stumbling block when training via backpropagation because it gives zero gradients passing through the stochastic variable and precludes gradients from flowing further. A recent finding [Jang et al.2016] suggests an efficient detour to backpropagate even in the presence of discrete random variables by a categorical reparameterization trick. The Gumbel-Softmax function generates a sample $y$ that approximates $s$ as follows:

$$y_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j=1}^{n} \exp\big((\log \pi_j + g_j)/\tau\big)}, \tag{2}$$

where $y_i$ is the $i$-th component of the vector $y$, and $\tau$ is the temperature that determines how closely the function approximates the sample $s$. It is noteworthy that in practice, the underlying distribution is predicted directly by the assignment module that we will discuss in Section 4.2.
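The two sampling schemes described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `sample_gumbel`, `gumbel_max`, `gumbel_softmax`, `log_pi`, `noise`, and `tau`, as well as the toy three-class distribution, are chosen here for illustration only. The same Gumbel noise is reused for both schemes so that their outputs can be compared directly.

```python
import math
import random


def sample_gumbel(rng):
    """Draw g = -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = max(rng.random(), 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))


def gumbel_max(log_pi, noise):
    """One-hot sample: 1 at argmax_i (log_pi[i] + noise[i]), 0 elsewhere."""
    scores = [lp + g for lp, g in zip(log_pi, noise)]
    k = max(range(len(scores)), key=scores.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(scores))]


def gumbel_softmax(log_pi, noise, tau):
    """Relaxed sample: softmax of (log_pi + noise) / tau."""
    scores = [(lp + g) / tau for lp, g in zip(log_pi, noise)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
log_pi = [math.log(p) for p in (0.7, 0.2, 0.1)]  # toy distribution (assumption)
noise = [sample_gumbel(rng) for _ in log_pi]      # shared Gumbel noise

hard = gumbel_max(log_pi, noise)                  # discrete one-hot sample
warm = gumbel_softmax(log_pi, noise, tau=5.0)     # high temperature: smooth
cold = gumbel_softmax(log_pi, noise, tau=0.1)     # low temperature: near one-hot
```

With shared noise, the largest entry of the relaxed sample sits at the same index as the one-hot sample, and lowering the temperature concentrates more mass on that entry.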
3.1 Straight-Through Gumbel-Softmax
The Gumbel-Softmax method approximates the discrete sampling by gradually annealing the temperature $\tau$. This is problematic in our setting because, when the temperature is high, a Gumbel-Softmax distribution is not categorical, leading all generators to be engaged in producing a fake image for a given latent vector $z$. Our objective is to choose the most appropriate generator. Therefore, we do not use the Gumbel-Softmax but adopt the following Straight-Through Gumbel-Softmax (STGS).
The STGS always generates discrete outputs (even when the temperature is high) while allowing the gradients to flow. In practice, the STGS calculates $y$ but returns $y_{hard}$:

$$y_{hard} = \text{one\_hot}\big(\arg\max_i y_i\big) - y_{const} + y, \tag{3}$$

where $y_{const}$ is a variable having the same value as $y$ but is detached from the computation graph. With this trick, the gradients flow through $y$, which allows the networks to be trained with the annealing temperature.
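The straight-through behavior can be illustrated without an automatic differentiation library by writing the forward and backward passes by hand. The following is a sketch under assumptions: the function names, the example scores, and the explicit softmax Jacobian are illustrative and are not taken from the released code. The forward pass returns a one-hot vector, whereas the backward pass uses the Jacobian of the soft sample, because the one-hot term and the detached copy contribute no gradient.

```python
import math


def softmax(scores, tau):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stgs_forward(scores, tau):
    """Compute the soft sample y, but return a one-hot vector with the same argmax."""
    y = softmax(scores, tau)
    k = max(range(len(y)), key=y.__getitem__)
    one_hot = [1.0 if i == k else 0.0 for i in range(len(y))]
    y_const = list(y)  # same values as y, treated as a constant (no gradient)
    y_hard = [oh - c + s for oh, c, s in zip(one_hot, y_const, y)]
    return y_hard, y


def stgs_backward(y, tau):
    """Jacobian d y_hard[i] / d scores[j]; only the trailing '+ y' term contributes."""
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) / tau for j in range(n)]
            for i in range(n)]


scores = [1.2, -0.3, 0.4]  # hypothetical perturbed logits (log-probability + Gumbel noise)
y_hard, y = stgs_forward(scores, tau=2.0)
jac = stgs_backward(y, tau=2.0)
```

Even at a temperature of 2.0, where the soft sample is far from one-hot, the returned vector is discrete while the Jacobian remains non-zero.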
4 Mixture of Experts GAN
In this section, we illustrate the details of our proposed MEGAN and discuss how generators become specialized in generating images with particular characteristics following the notion of the mixture of experts [Jacobs et al.1991].
4.1 Proposed Network Architecture
Let $G_i$ denote a generator in a set $G = \{G_1, G_2, \dots, G_n\}$, and let $z$ be a random latent vector. The latent vector is fed into each generator, yielding images $\{x_1, x_2, \dots, x_n\}$ and their feature vectors $\{f_1, f_2, \dots, f_n\}$. Each feature vector $f_i$ is produced in the middle of $G_i$. In our experiments, we used the ReLU activation map from the second transposed convolution layer of the generator as the representative feature vector. The latent vector $z$ and all of the feature vectors are then passed to the gating networks to measure how well $z$ fits each generator. The gating networks produce a one-hot vector $g = \{g_1, g_2, \dots, g_n\}$ where $g_i \in \{0, 1\}$. We formulate the entire process as follows:

$$g = \mathrm{GN}(z, f_1, f_2, \dots, f_n), \qquad x_{fake} = \sum_{i=1}^{n} g_i \, x_i,$$

where $x_{fake}$ denotes the generated fake image that will be delivered to the discriminator, and GN is the gating network. Fig. 2 provides an overview of the proposed networks.
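The selection step can be sketched as a gate-weighted sum over the candidate images, which reduces to picking exactly one image when the gate is one-hot. The snippet below is a toy illustration: `select_fake_image`, the flattened three-pixel "images", and the fixed gate are assumptions made for this example rather than parts of the original implementation.

```python
def select_fake_image(images, gate):
    """Combine per-generator images with a one-hot gate: sum_i gate[i] * images[i]."""
    assert len(images) == len(gate) and sum(gate) == 1
    n_pixels = len(images[0])
    return [sum(g * img[p] for g, img in zip(gate, images)) for p in range(n_pixels)]


# Three hypothetical generators, each producing a flattened 3-pixel image.
images = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
gate = [0, 1, 0]  # one-hot vector from the gating networks: generator 2 is selected
x_fake = select_fake_image(images, gate)
```

Writing the selection as a sum rather than an index lookup keeps the operation compatible with gradient-based training, since the gate enters the output as an ordinary multiplicative factor.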
4.2 Gating Networks
In the context of the mixture of experts, the gating networks play a central role in the specialization of submodules [Jacobs et al.1991]. We use auxiliary gating networks that assign each latent vector $z$ to a single generator so as to motivate each generator to learn different features. Concretely, we aim to train each generator to (1) be in charge of the images with certain characteristics potentially corresponding to a particular area in the entire feature space, and (2) learn the specialized area accordingly. The gating networks consist of two distinct modules: an assignment module that measures how well the latent vector $z$ fits each generator, and an STGS module that samples a generator based on the underlying distribution given by the assignment module.
The assignment module uses the feature vectors $f_i$ and first encodes each of them into a hidden state $h_i$ of a smaller dimension $d_h$ through $W_i$, a linear transformation for feature vector $f_i$. Encoding each feature vector reduces the total complexity significantly, because $d_f$, the dimension of a feature map $f_i$, is typically large, e.g., $d_f$ = 8192 in our implementation. The reduced dimension $d_h$ is a hyperparameter that we set as 100.
The assignment module then outputs $l$, an unnormalized density that determines the generator that most adequately fits the latent vector. The STGS samples a one-hot vector with $l$ as an underlying distribution. We denote the sampled one-hot vector as $g$, which corresponds to $y_{hard}$ illustrated in Eq. (3). It strictly yields one-hot vectors. Thus, with the STGS, we can select a particular generator among many, enabling each generator to focus on a sub-area of the latent vector space decided by the gating networks. It is noteworthy that the assignment module is updated by the gradients flowing through the STGS module.
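A rough sketch of the assignment module follows. Only the per-generator linear encoding of the feature vectors is taken from the description above; how the hidden states are combined with the latent vector is an assumption made here (concatenation followed by a single linear layer), and the dimensions are shrunk to toy sizes instead of 8192 and 100. All names (`linear`, `random_matrix`, `assignment_logits`, `encoders`, `head`) are hypothetical.

```python
import random


def linear(weights, x):
    """Matrix-vector product: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weight matrix standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def assignment_logits(z, features, encoders, head):
    """Encode each feature vector, join with z, and map to one logit per generator."""
    hidden = [linear(W, f) for W, f in zip(encoders, features)]  # per-generator encoding
    joint = list(z)
    for h in hidden:
        joint.extend(h)           # assumed: concatenate z with all hidden states
    return linear(head, joint)    # assumed: single linear layer producing the logits


rng = random.Random(0)
n, z_dim, feat_dim, hid_dim = 3, 4, 16, 5  # toy sizes, not the paper's values
z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
features = [[rng.random() for _ in range(feat_dim)] for _ in range(n)]
encoders = [random_matrix(hid_dim, feat_dim, rng) for _ in range(n)]
head = random_matrix(n, z_dim + n * hid_dim, rng)
logits = assignment_logits(z, features, encoders, head)
```

Encoding first keeps the final layer small: its input has `z_dim + n * hid_dim` entries rather than `z_dim + n * feat_dim`, which is the complexity reduction the text refers to.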
4.3 Load-Balancing Regularization
We observed that the gating networks converge too quickly, often resorting to only a few generators. The networks tend to be strongly affected by the first few data samples and favor the generators chosen in the initial training stages over the others. The fast convergence of the gating networks is undesirable because it leaves little room for other generators to learn in the later stages. Our goal is to assign the data space to all the generators involved.
To prevent the networks from resorting to a few generators, we force the networks to choose the generators with equal frequency in a mini-batch. Thus, we introduce a load-balancing regularization loss $L_{LB}$ to the model, based on

$$p_i = \frac{1}{m} \sum_{j=1}^{m} \mathbb{1}\big[g_i^{(j)} = 1\big],$$

where $m$ is the mini-batch size, and $g_i^{(j)}$ is the $i$-th element of the one-hot vector for the $j$-th data of a training mini-batch. $p_i$ is the probability that generator $G_i$ will be chosen. The indicator function returns 1 if and only if $g_i^{(j)} = 1$. Concretely speaking, we train the model with mini-batch data; for all the data in a mini-batch, we count every assignment to each generator. Thus, the regularization loss pushes the gating networks to select the generators equally.
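The counting procedure can be sketched as follows. The selection frequencies follow the description above; the penalty itself is an assumption, taken here to be the squared deviation of each frequency from the uniform share, and may differ from the exact form used by the authors. The function names and the toy mini-batches are illustrative.

```python
def selection_frequencies(gates):
    """Fraction of the mini-batch assigned to each generator (counts of 1s per column)."""
    m = len(gates)
    n = len(gates[0])
    return [sum(1 for g in gates if g[i] == 1) / m for i in range(n)]


def load_balancing_loss(gates):
    """Assumed penalty: squared deviation of each frequency from the uniform share 1/n."""
    n = len(gates[0])
    p = selection_frequencies(gates)
    return sum((p_i - 1.0 / n) ** 2 for p_i in p)


balanced = [[1, 0], [0, 1], [1, 0], [0, 1]]  # each of two generators chosen half the time
skewed = [[1, 0], [1, 0], [1, 0], [1, 0]]    # one generator takes the whole mini-batch

loss_balanced = load_balancing_loss(balanced)
loss_skewed = load_balancing_loss(skewed)
```

The penalty vanishes for the balanced mini-batch and grows as the assignments collapse onto a single generator. Counting hard assignments is not differentiable on its own, so in a trainable version the gradient would have to reach the assignment module through the straight-through estimator.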
4.4 Total Loss
The total loss of our model set for training is as follows:

$$L = L_{adv} + \lambda L_{LB},$$

where $L_{adv}$ is any adversarial loss computed through an existing GAN model. We do not specify $L_{adv}$ in this section, because it may vary based on the GAN framework used for training. We set $\lambda$ to control the impact of the load-balancing regularization.
In this section, we discuss a couple of potential difficulties in the mixture model.
Mechanism of Specialization
In MEGAN, what forces the generators to specialize? We presume it is the implicit dynamics between multiple generators and the STGS. No explicit loss function exists to teach the generators to be specialized. Nevertheless, they should learn how to generate realistic images because the STGS isolates a generator from the others by categorical sampling. The gating networks learn the type of $z$ that best suits a certain generator and keep assigning similar ones to that generator. The generators learn that specializing in a particular subset of the data distribution helps to fool the discriminator by generating more realistic images. As the training iterations proceed, the generators converge to different local clusters of the data distribution.
Effect of Load-Balancing on Specialization
Another important aspect in training MEGAN is determining the hyperparameter $\lambda$ for the load-balancing regularization. A desired outcome from the assignment module is a logit vector $l$ with high variance among its elements, while maintaining the training of generators in a balanced manner. Although the load-balancing regularization is designed to balance workloads between generators, it slightly nudges the assignment module to yield a logit vector closer to a uniform distribution. Thus, we observe that when an extremely large value is set for $\lambda$ (e.g., 1000), the logit values follow a uniform distribution. This is not a desired consequence, because a uniform distribution of $l$ means that the gating networks failed to properly perform the generator assignment, and the specialization effect of the generators is minimized. To prevent this, we suggest two solutions.
The first solution is to find a value of $\lambda$ at which training is stable and the logit values are not too uniform. It is a simple but reliable remedy, because finding such a value is not demanding. The second possible solution is to adjust $\lambda$ when the logit values follow a uniform distribution. Most of our experiments were performed with the first method, in which we fix $\lambda$, because a stable point could be found quickly, and it allows us to focus more on the general capability of the model.
Some may claim that our model lacks data efficiency because each generator focuses on a small subset of a dataset. When trained with our algorithm, a single generator is exposed to a smaller number of images, because the generators specialize in a certain subset of the images. However, it also means that each generator can focus on learning fewer modes. Consequently, we observed that our model produces images with an improved quality, as described in detail in Section 6.
5 Experiment Details
In this section, we describe our experiment environments and objectives. All code is available at https://github.com/heykeetae/MEGAN.
5.1 Experiment Environments
We describe the detailed experiment environments in this section, such as baseline methods, datasets, etc.
We apply our algorithm on both DCGAN and WGAN-GP (DCGAN layer architecture) frameworks, chosen based on their stability and high performance. The experiments consist of visual inspections, visual expertise analysis, quantitative evaluations, and user studies for generalized qualitative analyses.
We compared our quantitative results with many state-of-the-art GAN models such as BEGAN, LSGAN, WGAN-GP, improved GAN (-L+HA) [Salimans et al.2016], MGAN, and SplittingGAN. AdaGAN could not be included in our evaluation, because the official code for AdaGAN does not provide stable training on the datasets we used.
We used three datasets for our evaluation: CIFAR-10, CelebA, and LSUN. CIFAR-10 has 60,000 images from 10 different object classes. CelebA has 202,599 facial images of 10,177 celebrities. LSUN contains various scene images, but we evaluated only on the church outdoor subset, which consists of 126,227 images.
We evaluated our model on two standard metrics to quantitatively measure the quality of the generated images: inception score (IS) and multiscale structural similarity (MS-SSIM). The IS is calculated through the Inception image classification network and returns high scores when diverse and high-quality images are generated. MS-SSIM is a widely used measure of the similarity between two images. We generated 2,000 images and checked their average pairwise MS-SSIM scores. The lower the score, the better the algorithm in terms of diversity.
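The averaging step of the diversity measurement can be sketched as below. This is an illustration under assumptions: `toy_similarity` is a simple placeholder and is not MS-SSIM, and averaging over all unordered pairs is one possible reading of "average pairwise"; the original evaluation may have sampled pairs instead. In practice, `similarity` would be replaced by an actual MS-SSIM implementation.

```python
from itertools import combinations


def average_pairwise_score(samples, similarity):
    """Mean of similarity(a, b) over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def toy_similarity(a, b):
    """Placeholder for MS-SSIM: 1 minus the mean absolute pixel difference."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


samples = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]  # three tiny "images" with pixels in [0, 1]
score = average_pairwise_score(samples, toy_similarity)
```

A set of near-identical samples drives the average toward 1, whereas a diverse set lowers it, which is why a lower score is read as higher diversity.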
We also conducted web-based user studies. On our test website (http://gantest.herokuapp.com; a test run can be made by entering the following key: 5B3309), randomly selected real and fake images are displayed, and users are asked to downvote images that they think are fake. Nine images were provided per test, and each user repeated the test 100 times. Regarding the CelebA dataset, we observed that the participants were good at detecting generated facial images of their own race. Therefore, we diversified the ethnicity in our user groups by having thirty participants from three different continents.
We tested the following hyperparameter setups: the number of generators as 3, 5, and 10; the mini-batch size = ; annealing temperature [Jang et al.2016] where denotes the iteration number in the training algorithm; load-balancing parameter $\lambda$ = 100; and feature vector of dimension for CIFAR-10 and for both LSUN and CelebA.
6 Experimental Results
6.1 Evaluation on Specialization
We describe our results based on various visual inspections.
Throughout our evaluations, each generator is found to learn different contexts and features, for at least up to the 10 generators that we have inspected. The decision to assign a particular subset of data to a particular generator is typically based on visually recognizable features, such as background colors or the shape of the primary objects. Figs. 3 and 4 show the generated samples drawn from different generators trained with 10 generators on CelebA and LSUN-Church outdoor, respectively. Each block of four images comes from the same generator. We chose six generators that have learned the most conspicuous and distinctive features, readily captured even by the human eye.
The four images in each block share some features in common, while having at least one characteristic that distinguishes them from the other blocks of images. For instance, the top-left celebrities in Fig. 3 have black hair without noticeable facial expressions. In contrast, the top-right celebrities have light-colored hair with smiling faces. Among the samples from LSUN in Fig. 4, we also detected distinguishing patterns specific to each generator.
Visual Expertise Analysis
If the model learns properly, a desirable outcome is that each generator produces images with different features. We generated 2,000 CIFAR-10 images from MEGAN trained for 20 epochs, fed them to a pretrained VGG19 network, and extracted the feature vectors from the relu4_2 layer. Subsequently, the 8,192-dimensional feature vectors are reduced to two-dimensional vectors using the t-SNE algorithm [Maaten and Hinton2008]. Fig. 5 shows the results. Each two-dimensional vector is represented as a dot in the figure, and samples from the same generator are of the same color. The colored shades indicate the clusters of images that are generated by the same generator. We tested MEGAN with 5 generators and 10 generators, confirming that each generator occupies its own region in the feature vector space. It is noteworthy that the regions overlap in the figure owing to dimensionality reduction for visualization purposes. In the original 8,192-dimensional space, they may overlap much less.
[Table 1: Inception scores on CIFAR-10. Recoverable rows: Improved GAN (-L+HA); AdaGAN (not properly trained).]
6.2 Quantitative analysis
We introduce our quantitative experiment results using the inception score, MS-SSIM, and user study.
Table 1 lists the inception scores (Section 5.1) of various models on CIFAR-10. MEGAN trained on the DCGAN framework records an inception score of 8.33, and it shows a slightly lower variance than MGAN, i.e., 0.09 for MEGAN vs. 0.1 for MGAN. The official code for AdaGAN does not provide stable training on the CIFAR-10 dataset, and its inception score is therefore not comparable to those of the other baseline methods.
CelebA and LSUN-Church outdoor
The MS-SSIM scores (Section 5.1) measured for CelebA and LSUN-Church outdoor are reported in Table 3. As the MS-SSIM scores of the baseline models are not reported in their papers, we evaluated them after generating samples using their official code. In this experiment, MEGAN is trained to minimize the WGAN-GP loss function, which was found to perform best in our preliminary experiments. We report that MEGAN outperforms all baseline models in terms of the diversity of generated images, as shown by its lowest MS-SSIM scores. Notably, MEGAN with five generators achieves the lowest MS-SSIM scores for both datasets.
Table 3 shows the result of the web-based user study on CelebA and LSUN-Church outdoor datasets. The score is computed by dividing the number of downvoted fake images by the total number of fake images shown to users. Thus, a low score indicates that users struggle to distinguish generated images from real images. For both datasets, MEGAN records competitive performance, and especially for LSUN-Church outdoor it outperforms all the baseline models.
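The score described above can be written as a short function. The data layout, one `(is_fake, downvoted)` pair per displayed image, and the name `detection_rate` are assumptions made for this sketch; the responses are made up for illustration and are not results from the study.

```python
def detection_rate(responses):
    """Share of displayed fake images that were downvoted.

    responses: list of (is_fake, downvoted) pairs, one per displayed image.
    Downvotes on real images do not enter the score.
    """
    fake_votes = [downvoted for is_fake, downvoted in responses if is_fake]
    return sum(fake_votes) / len(fake_votes)


# Hypothetical responses: three fake images (one downvoted) and two real images.
responses = [(True, True), (True, False), (False, True), (True, False), (False, False)]
rate = detection_rate(responses)
```

Because only fake images are counted, a participant who downvotes everything would reach a rate of 1.0, so the score is best read alongside how often real images are mistakenly downvoted.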
In conjunction with the previous MS-SSIM results, MEGAN’s low detection rates indicate that it can generate more diverse images of better quality than other baseline methods. We observed that BEGAN achieves the lowest detection rate for the CelebA dataset at the expense of low diversity in its generated images, as indicated by the high MS-SSIM score of BEGAN in Table 3.
This paper proposed a novel generative adversarial networks model called MEGAN, for learning the complex underlying modalities of datasets. Both our quantitative and qualitative analyses suggest that our method is suitable for various datasets. Future work involves extending our algorithm to other variants of GANs and a broader range of generative models.
- [Arjovsky et al.2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
- [Berthelot et al.2017] David Berthelot, Tom Schumm, and Luke Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
- [Ghosh et al.2017] Arnab Ghosh, Viveka Kulharia, Vinay Namboodiri, Philip HS Torr, and Puneet K Dokania. Multi-agent diverse generative adversarial networks. arXiv preprint arXiv:1704.02906, 2017.
- [Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- [Grinblat et al.2017] Guillermo L Grinblat, Lucas C Uzal, and Pablo M Granitto. Class-splitting generative adversarial networks. arXiv preprint arXiv:1709.07359, 2017.
- [Gumbel1954] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications. Nat. Bur. Standards Appl. Math. Ser. 33, 1954.
- [Hampshire and Waibel1990] John B Hampshire and Alex H Waibel. The meta-pi network: Connectionist rapid adaptation for high-performance multi-speaker phoneme recognition. In Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on, pages 165–168. IEEE, 1990.
- [Hoang et al.2017] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. Multi-generator generative adversarial nets. arXiv preprint arXiv:1708.02556, 2017.
- [Huang et al.2017] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and Serge Belongie. Stacked generative adversarial networks. In , volume 2, page 4, 2017.
- [Jacobs et al.1991] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
- [Jang et al.2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- [Karras et al.2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
- [Kodali et al.2017] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. How to train your dragan. arXiv preprint arXiv:1705.07215, 2017.
- [Maaten and Hinton2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
- [Maddison et al.2014] Chris J Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In Advances in Neural Information Processing Systems, pages 3086–3094, 2014.
- [Mao et al.2017] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813–2821. IEEE, 2017.
- [Salimans et al.2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
- [Tolstikhin et al.2017] Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. Adagan: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5430–5439, 2017.