1 Introduction
Since the introduction of generative adversarial networks (GANs) [Goodfellow et al. 2014], researchers have delved deeply into improving the quality of generated images. Recently, a number of new approaches have been proposed for high-quality image generation, e.g., Progressive GAN [Karras et al. 2017], Splitting GAN [Grinblat et al. 2017], SGAN [Huang et al. 2017], and Improved GAN [Salimans et al. 2016].
We propose a novel GAN model equipped with multiple generators, each of which specializes in learning a certain modality of the dataset (see Fig. 1). In addition to the generators, we employ an auxiliary network that determines which generator should be trained on a given training instance. We name this auxiliary network the gating network, following precedent [Jacobs et al. 1991].
Ensembling multiple neural networks coupled with gating networks was first introduced to achieve higher performance in multi-speaker phoneme recognition [Hampshire and Waibel 1990]. In their method, the design of the loss function caused the neural networks to cooperate. Later research introduced a new loss function that stimulates competition among the neural networks, encouraging each network to specialize in a certain task rather than redundantly learn the same features [Jacobs et al. 1991]. This algorithm is now called the mixture of experts in various machine learning domains. Reminiscent of their work, we name our proposed GAN approach MEGAN, short for Mixture of Experts GAN.
The gating networks in our proposed MEGAN are responsible for selecting the one particular generator that would perform best under a given condition. The gating networks consist of two submodules, an assignment module and the Straight-Through Gumbel-Softmax [Jang et al. 2016], which we discuss in detail in Section 4.2.
Although MEGAN inherits the idea of multiple generators and gating networks, we do not adopt the originally proposed loss function [Jacobs et al. 1991]; instead, we utilize adversarial learning to leverage the latest successes of GANs.
Our work makes two contributions. First, we build a mixture-of-experts GAN algorithm capable of encouraging its generators to learn the different modalities existing in the data. Second, we utilize the recently introduced Gumbel-Softmax reparameterization trick and develop a load-balancing regularization to further stabilize the training of MEGAN. We evaluate our model using various criteria, notably achieving an MS-SSIM score of 0.2470 on CelebA, which suggests that MEGAN generates more diverse images than the baseline models. Our generated samples also achieve a competitive inception score of 8.33 in an unsupervised setting.
2 Related Work
Several studies on GANs have been proposed to stabilize the learning process and improve the quality of generated samples. Some of these studies incorporate novel distance metrics to achieve better results. For instance, the original GAN [Goodfellow et al. 2014] suffers from the vanishing-gradient problem arising from the sigmoid cross-entropy loss function used by the discriminator. LSGAN [Mao et al. 2017] solves this problem by substituting the cross-entropy loss with a least-squares loss function. WGAN [Arjovsky et al. 2017] adopts the Earth Mover's distance, which enables more stable training and alleviates the infamous mode collapse. WGAN-GP progresses one step further by adopting a gradient-penalty term for stable training and higher performance. Meanwhile, BEGAN [Berthelot et al. 2017] aims to match autoencoder loss distributions using a loss derived from the Wasserstein distance, instead of matching the data distributions directly. In addition, DRAGAN [Kodali et al. 2017] prevents mode collapse using a no-regret algorithm.
Other algorithms, such as AdaGAN [Tolstikhin et al. 2017] and MGAN [Hoang et al. 2017], employ ensembling approaches, using multiple generators to learn complex distributions. Based on the idea of boosting in the context of ensemble models, AdaGAN trains generators sequentially, adding a new generator to the mixture at each step. Whereas AdaGAN gradually decreases the importance of generators as more of them are added to the model, our proposed MEGAN pursues an equal balance between generators by explicitly regularizing the model, which prevents the model from being dominated by a particular generator.
MGAN adopts predefined mixture weights for its generators and trains all generators simultaneously; in contrast, our proposed MEGAN dynamically selects a generator through the gating networks and trains the generators one at a time. MGAN's fixed mixture model is suboptimal compared to our trainable mixture model. More broadly, MEGAN differs from these models in that each of its generators can generate images on its own and learns different, salient features.
MAD-GAN [Ghosh et al. 2017] is strongly relevant to our work. With multiple generators and a carefully designed discriminator, MAD-GAN overcomes the mode-collapsing problem by explicitly forcing each generator to learn different mode clusters of a dataset. MEGAN and MAD-GAN are similar in that both models allow the generators to specialize in different submodalities. However, MEGAN differs from MAD-GAN in two aspects. First, all generators in MEGAN share the same latent vector space, while the generators of MAD-GAN are built on separate latent vector spaces. Second, the generators of MAD-GAN can, in theory, learn identical mode clusters; in contrast, the gating networks of MEGAN ensure by design that each generator learns different modes.
3 Categorical Reparameterization
Essentially, GANs generate images from latent vectors. Given $N$ generators and a latent vector $z$, our model aims to select the particular generator that will produce the best-quality image. This raises the question of how such a categorical decision can be made. The Gumbel-Max trick [Gumbel 1954; Maddison et al. 2014] allows us to sample a one-hot vector $s$ based on an underlying probability distribution $\pi$:

$$ s = \mathrm{one\_hot}\left(\arg\max_i \left[g_i + \log \pi_i\right]\right), \qquad (1) $$

where $g_i = -\log(-\log(u_i))$ and $u_i$ is sampled from Uniform(0, 1). However, the argmax operator in the Gumbel-Max trick is a stumbling block when training via backpropagation, because it gives zero gradients through the stochastic variable and precludes gradients from flowing further. A recent finding [Jang et al. 2016] suggests an efficient detour that backpropagates even in the presence of discrete random variables via a categorical reparameterization trick. The Gumbel-Softmax function generates a sample $y$ that approximates $s$ as follows:

$$ y_i = \frac{\exp\left((\log \pi_i + g_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left((\log \pi_j + g_j)/\tau\right)}, \qquad (2) $$

where $y_i$ is the $i$-th component of the vector $y$, and $\tau$ is the temperature that determines how closely the function approximates the discrete sample $s$. It is noteworthy that, in practice, we directly predict the logits $\log \pi$ through the assignment module that we discuss in Section 4.2.
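As a concrete illustration, the sampling in Eqs. (1) and (2) can be sketched in a few lines of NumPy; the class probabilities below are hypothetical values chosen only for the example.

```python
import numpy as np

def gumbel_softmax_sample(log_pi, tau, rng):
    """Draw one Gumbel-Softmax sample y from unnormalized log-probabilities.

    g_i = -log(-log(u_i)) with u_i ~ Uniform(0,1) is standard Gumbel noise.
    A low temperature tau makes y close to a one-hot sample; a high tau
    makes it close to uniform.
    """
    u = rng.uniform(low=1e-12, high=1.0, size=log_pi.shape)
    g = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
    scaled = (log_pi + g) / tau
    scaled -= scaled.max()                   # subtract max for numerical stability
    y = np.exp(scaled)
    return y / y.sum()

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.2, 0.5, 0.3]))   # hypothetical class probabilities
y = gumbel_softmax_sample(log_pi, tau=0.1, rng=rng)   # nearly one-hot sample
```

Taking the argmax of `log_pi + g` alone recovers the exact discrete Gumbel-Max sample of Eq. (1); the softmax with temperature is its differentiable relaxation.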
3.1 Straight-Through Gumbel-Softmax
The Gumbel-Softmax method approximates discrete sampling by gradually annealing the temperature $\tau$. This is problematic in our setting because when the temperature is high, a Gumbel-Softmax sample is not categorical, leading all generators to be engaged in producing a fake image for a given latent vector $z$. Our objective is to choose the single most appropriate generator. Therefore, we do not use the plain Gumbel-Softmax but instead adopt the following Straight-Through Gumbel-Softmax (STGS).
The STGS always generates discrete outputs (even when the temperature is high) while still allowing the gradients to flow. In practice, the STGS computes the soft sample $y$ but returns its discretized version:

$$ y_{\mathrm{ST}} = y_{\mathrm{hard}} - \bar{y} + y, \quad \text{where } y_{\mathrm{hard}} = \mathrm{one\_hot}\left(\arg\max_i y_i\right), \qquad (3) $$

where $\bar{y}$ is a variable having the same value as $y$ but detached from the computation graph. With this trick, the value of $y_{\mathrm{ST}}$ equals $y_{\mathrm{hard}}$ in the forward pass, while the gradients flow through $y$, allowing the networks to be trained as the temperature is annealed.
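A minimal NumPy sketch of the forward pass of Eq. (3). NumPy has no autograd, so the detached copy is only simulated; the sketch demonstrates the numerical identity that the returned value equals the one-hot vector.

```python
import numpy as np

def straight_through(y):
    """Forward pass of the Straight-Through Gumbel-Softmax (Eq. 3).

    y_hard is the one-hot argmax of the soft sample y.  In an autograd
    framework, returning y_hard - detach(y) + y evaluates to y_hard in
    the forward pass while gradients flow only through the trailing +y.
    """
    y_hard = np.zeros_like(y)
    y_hard[np.argmax(y)] = 1.0
    y_detached = y.copy()            # stand-in for a detached (stop-gradient) copy
    return y_hard - y_detached + y   # numerically identical to y_hard

y = np.array([0.2, 0.7, 0.1])
out = straight_through(y)            # a one-hot vector selecting index 1
```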
4 Mixture of Experts GAN
In this section, we present the details of our proposed MEGAN and discuss how the generators become specialized in generating images with particular characteristics, following the notion of the mixture of experts [Jacobs et al. 1991].
4.1 Proposed Network Architecture
Let $G_i$ denote a generator in the set $\{G_1, \dots, G_N\}$, and let $z$ be a random latent vector. The latent vector $z$ is fed into each generator, yielding images $\{x_1, \dots, x_N\}$ and their feature vectors $\{f_1, \dots, f_N\}$. Each feature vector $f_i$ is produced in the middle of $G_i$; in our experiments, we use the ReLU activation map from the second transposed-convolution layer of each generator as the representative feature vector. The latent vector $z$ and all of the feature vectors are then passed to the gating networks, which measure how well $z$ fits each generator and produce a one-hot vector $o$ with $o_i \in \{0, 1\}$. We formulate the entire process as follows:

$$ x_i, f_i = G_i(z), \quad i = 1, \dots, N, \qquad (4) $$
$$ o = \mathrm{GN}(z, f_1, \dots, f_N), \qquad (5) $$
$$ x = \sum_{i=1}^{N} o_i x_i, \qquad (6) $$

where $x$ denotes the generated fake image that will be delivered to the discriminator, and GN denotes the gating networks. Fig. 2 provides an overview of the proposed networks.
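A toy end-to-end sketch of Eqs. (4)-(6): linear maps stand in for the transposed-convolution generators, and a hypothetical scoring rule stands in for the gating networks, which in the actual model are the assignment module plus STGS of Section 4.2.

```python
import numpy as np

rng = np.random.default_rng(42)
N, z_dim, img_dim, feat_dim = 3, 8, 16, 4

# Toy stand-ins for the N generators: each maps z to a fake image x_i and
# a mid-layer ReLU feature vector f_i (Eq. 4); real G_i are deconv nets.
Wx = [rng.standard_normal((img_dim, z_dim)) for _ in range(N)]
Wf = [rng.standard_normal((feat_dim, z_dim)) for _ in range(N)]

def generator(i, z):
    return Wx[i] @ z, np.maximum(Wf[i] @ z, 0.0)

def gating_network(z, feats):
    # Hypothetical stand-in for GN (Eq. 5): score each generator and
    # return a one-hot assignment vector o.
    scores = np.array([f.sum() for f in feats])
    o = np.zeros(N)
    o[np.argmax(scores)] = 1.0
    return o

z = rng.standard_normal(z_dim)
xs, fs = zip(*(generator(i, z) for i in range(N)))
o = gating_network(z, fs)
x = sum(o[i] * xs[i] for i in range(N))   # Eq. (6): image sent to the discriminator
```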
4.2 Gating Networks
In the context of the mixture of experts, the gating networks play a central role in the specialization of submodules [Jacobs et al. 1991]. We use auxiliary gating networks that assign each latent vector $z$ to a single generator, motivating each generator to learn different features. Concretely, we train each generator to (1) take charge of the images with certain characteristics, potentially corresponding to a particular area of the entire feature space, and (2) learn that specialized area accordingly. The gating networks consist of two distinctive modules: an assignment module that measures how well the latent vector $z$ fits each generator, and an STGS module that samples a generator from the distribution given by the assignment module.
Assignment Module
The assignment module takes the feature vectors $f_i$ and first encodes each of them into a hidden state $h_i$ of a smaller dimension $d'$:

$$ h_i = W_i f_i, \qquad (7) $$

where $W_i$ denotes a linear transformation for feature vector $f_i$. Encoding each feature vector reduces the total complexity significantly, because $d$, the dimension of a feature map $f_i$, is typically large, e.g., $d = 8192$ in our implementation. The reduced dimension $d'$ is a hyperparameter that we set to 100.
The hidden states $h_i$ are then concatenated along with the latent vector $z$, and the merged vector is passed to a three-layer perceptron consisting of batch normalizations and ReLU activations:

$$ l = \mathrm{MLP}([h_1; \dots; h_N; z]), \qquad (8) $$

where the resulting $l$ is a logit vector, the input for the STGS; $l$ also corresponds to $\log \pi$ explained in Section 3.
STGS Module
$l$ is an unnormalized density that determines which generator most adequately fits the latent vector. The STGS samples a one-hot vector with $l$ as the underlying distribution. We denote the sampled one-hot vector by $o$, which corresponds to $y_{\mathrm{hard}}$ in Eq. (3). Because the STGS strictly yields one-hot vectors, we can select a single generator among many, enabling each generator to focus on a sub-area of the latent vector space decided by the gating networks. It is noteworthy that the assignment module is updated by the gradients flowing through the STGS module.
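The assignment module can be sketched as below. The dimensions ($d = 8192$, $d' = 100$) follow the paper; the MLP width, the random weight initialization, and the omission of batch normalization are simplifications of this sketch, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d_hid, z_dim = 3, 8192, 100, 128

# Per-generator linear encoders W_i mapping f_i from d to d_hid (Eq. 7).
W = [rng.standard_normal((d_hid, d)) * 0.01 for _ in range(N)]

# Three-layer perceptron mapping the merged vector to N logits (Eq. 8);
# the hidden width 256 is an arbitrary choice for this sketch.
in_dim = N * d_hid + z_dim
M1 = rng.standard_normal((256, in_dim)) * 0.01
M2 = rng.standard_normal((256, 256)) * 0.01
M3 = rng.standard_normal((N, 256)) * 0.01

def assignment_module(z, feats):
    h = [W[i] @ feats[i] for i in range(N)]   # encode each feature vector
    v = np.concatenate(h + [z])               # concatenate with the latent vector
    v = np.maximum(M1 @ v, 0.0)               # ReLU layers (batch norm omitted)
    v = np.maximum(M2 @ v, 0.0)
    return M3 @ v                             # logit vector l, one entry per generator

feats = [np.abs(rng.standard_normal(d)) for _ in range(N)]
z = rng.standard_normal(z_dim)
l = assignment_module(z, feats)               # input to the STGS module
```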
4.3 Load-Balancing Regularization
We observed that the gating networks converge too quickly, often resorting to only a few generators. The networks tend to be strongly affected by the first few batches of data and favor the generators chosen in the initial training stages over the others. This fast convergence is undesirable because it leaves little room for the other generators to learn in later stages; our goal is to distribute the data space across all the generators involved.
To prevent the networks from resorting to a few generators, we force them to choose the generators with equal frequency within a mini-batch. Thus, we introduce a regularization term:

$$ L_{LB} = \sum_{i=1}^{N} \left( p_i - \frac{1}{N} \right)^2, \quad p_i = \frac{1}{b} \sum_{j=1}^{b} \mathbb{1}\left[o_i^j = 1\right], \qquad (9) $$

where $L_{LB}$ indicates the load-balancing loss, $b$ is the mini-batch size, $o_i^j$ is the $i$-th element of the one-hot vector for the $j$-th instance of a training mini-batch, and $p_i$ is the empirical probability that generator $i$ is chosen. The indicator function $\mathbb{1}[\cdot]$ returns 1 if and only if $o_i^j = 1$. Concretely, for all the data in a mini-batch, we count every assignment to each generator; the regularization loss then pushes the gating networks to select the generators equally.
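A NumPy sketch of this regularizer, under our reading of Eq. (9): the per-generator assignment frequencies in a mini-batch are pushed toward the uniform frequency $1/N$.

```python
import numpy as np

def load_balancing_loss(O):
    """Load-balancing regularizer for a mini-batch of one-hot assignments.

    O has shape (b, N): row j is the one-hot vector o^j selecting the
    generator for the j-th training instance.  p_i is the empirical
    frequency of generator i; the loss is zero exactly when all N
    generators are chosen equally often.
    """
    b, N = O.shape
    p = O.sum(axis=0) / b                    # empirical assignment frequencies
    return float(np.sum((p - 1.0 / N) ** 2))

# Balanced mini-batch (b=4, N=2) vs. a collapsed one that always
# picks the first generator.
balanced = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
collapsed = np.array([[1, 0], [1, 0], [1, 0], [1, 0]], dtype=float)
```

With the balanced batch the loss vanishes; with the collapsed batch it is strictly positive, penalizing the gating networks for favoring one generator.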
4.4 Total Loss
The total loss of our model is as follows:

$$ L = L_{adv} + \lambda L_{LB}, \qquad (10) $$

where $L_{adv}$ is any adversarial loss computed through an existing GAN model. We do not specify $L_{adv}$ here, because it may vary based on the GAN framework used for training. The coefficient $\lambda$ controls the impact of the load-balancing regularization.
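Eq. (10) is then a single weighted sum. The default $\lambda = 100$ below matches the value used in our experiments; the adversarial term is whatever loss the underlying GAN (e.g., DCGAN or WGAN-GP) defines, and the numeric inputs here are illustrative only.

```python
def total_loss(adv_loss, lb_loss, lam=100.0):
    """Total training objective (Eq. 10): the adversarial loss of the
    underlying GAN plus the weighted load-balancing regularizer."""
    return adv_loss + lam * lb_loss

# Illustrative values: a small load-balancing term still contributes
# noticeably because of the large weight lambda.
loss = total_loss(adv_loss=1.25, lb_loss=0.002)
```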
4.5 Discussions
In this section, we discuss a couple of potential issues concerning the mixture model.
Mechanism of Specialization
In MEGAN, what forces the generators to specialize? We presume it is the implicit dynamics between the multiple generators and the STGS. No explicit loss function teaches the generators to specialize. Nevertheless, they must learn how to generate realistic images, because the STGS isolates each generator from the others through categorical sampling. The gating networks learn which type of $z$ best suits a certain generator and keep assigning similar latent vectors to that generator. The generators, in turn, learn that specializing in a particular subset of the data distribution helps fool the discriminator by generating more realistic images. As training proceeds, the generators converge to different local clusters of the data distribution.
Effect of Load-Balancing on Specialization
Another important aspect of training MEGAN is determining the hyperparameter $\lambda$ of the load-balancing regularization. A desired outcome from the assignment module is a logit vector $l$ with high variance among its elements, while the generators are still trained in a balanced manner. Although the load-balancing regularization is designed to balance the workloads between the generators, it also nudges the assignment module toward yielding a logit vector closer to a uniform distribution. We observe that when an extremely large value is set for $\lambda$ (e.g., 1000), the logit values follow a uniform distribution. This is not a desired consequence, because a uniform distribution of $l$ means that the gating networks fail to properly perform the generator assignment, and the specialization effect of the generators is minimized. To prevent this, we suggest two solutions. The first is to find a value of $\lambda$ at which training is stable and the logit values are not too uniform; this is a simple but reliable remedy, as finding such a $\lambda$ is not demanding. The second is to adjust $\lambda$ downward when the logit values begin to follow a uniform distribution. Most of our experiments were performed with the first method, in which we fix $\lambda$, because a stable point could be found quickly, allowing us to focus on the general capability of the model.
Data Efficiency
One may argue that our model lacks data efficiency because each generator focuses on a small subset of the dataset. Indeed, when trained with our algorithm, a single generator is exposed to fewer images, because the generators specialize in certain subsets of the images. However, this also means that each generator can focus on learning fewer modes. Consequently, we observe that our model produces images of improved quality, as described in detail in Section 6.
5 Experiment Details
In this section, we describe our experimental environments and objectives. All program code is available at https://github.com/heykeetae/MEGAN.
5.1 Experiment Environments
In this section we describe the detailed experimental environment, including the baseline methods and datasets.
Underlying GANs
We apply our algorithm to both the DCGAN and WGAN-GP (with the DCGAN layer architecture) frameworks, chosen for their stability and high performance. The experiments consist of visual inspections, a visual expertise analysis, quantitative evaluations, and user studies for generalized qualitative analyses.
Baseline Algorithms
We compared our quantitative results with several state-of-the-art GAN models, including BEGAN, LSGAN, WGAN-GP, Improved GAN (-L+HA) [Salimans et al. 2016], MGAN, and Splitting GAN. AdaGAN could not be included in our evaluation, because its official code does not provide stable training for the datasets we used.
Datasets
We used three datasets for our evaluation: CIFAR-10, CelebA, and LSUN. CIFAR-10 contains 60,000 images from 10 object classes. CelebA contains 202,599 facial images of 10,177 celebrities. LSUN contains various scenic images; we evaluated on the church-outdoor subset, which consists of 126,227 images.
Evaluation Metric
We evaluated our model on two standard metrics to quantitatively measure the quality of the generated images: the inception score (IS) and the multi-scale structural similarity (MS-SSIM). The IS is computed with the Inception network and returns high scores when varied, high-quality images are generated. MS-SSIM is a widely used measure of the similarity between two images. We generated 2,000 images and computed their average pairwise MS-SSIM score; the lower the score, the better the algorithm in terms of diversity.
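The average pairwise score over generated samples can be computed as below. The `similarity` callable is a placeholder for any MS-SSIM implementation (assumed external and not reimplemented here); the `toy_similarity` function exists only to exercise the helper.

```python
import itertools

import numpy as np

def average_pairwise_similarity(images, similarity):
    """Mean similarity over all unordered pairs of generated images.

    `similarity` is a callable such as an MS-SSIM implementation; a lower
    average indicates a more diverse set of generated samples.
    """
    scores = [similarity(a, b) for a, b in itertools.combinations(images, 2)]
    return float(np.mean(scores))

# Toy stand-in similarity on flat vectors: 1.0 for identical inputs,
# decaying toward 0 as they move apart.
def toy_similarity(a, b):
    return float(np.exp(-np.linalg.norm(a - b)))

rng = np.random.default_rng(0)
samples = [rng.standard_normal(8) for _ in range(5)]
score = average_pairwise_similarity(samples, toy_similarity)
```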
User Study
We also conducted web-based user studies. On our test website (http://gantest.herokuapp.com; a test run can be made by entering the key 5B3309), randomly selected real and fake images are displayed, and users are asked to downvote the images they think are fake. Nine images were provided per test, and each user repeats the test 100 times. Regarding the CelebA dataset, we observed that participants were better at detecting generated facial images of the same race as themselves. Therefore, we diversified the ethnicity of our user groups by recruiting thirty participants from three different continents.
Hyperparameters
We tested the following hyperparameter settings: the number of generators as 3, 5, and 10; the mini-batch size; the annealing temperature $\tau$ [Jang et al. 2016], where $t$ denotes the iteration number in the training algorithm; the load-balancing parameter $\lambda = 100$; and the dimension of the feature vector for CIFAR-10 and for both LSUN and CelebA.
6 Experimental Results
6.1 Evaluation on Specialization
We describe our results based on various visual inspections.
Visual Inspection
Throughout our evaluations, each generator is found to learn different contexts and features, at least up to the 10 generators that we inspected. The decision to assign a particular subset of the data to a particular generator is typically based on visually recognizable features, such as background colors or the shape of the primary objects. Figs. 3 and 4 show generated samples drawn from different generators, trained with 10 generators on CelebA and LSUN-Church outdoor, respectively. Each block of four images comes from the same generator; we chose the six generators that learned the most conspicuous and distinctive features, readily captured even by the human eye.
Within each block, the four images share some features in common while having at least one characteristic that distinguishes them from the other blocks. For instance, the top-left celebrities in Fig. 3 have black hair and no noticeable facial expressions, whereas the top-right celebrities have light-colored hair and smiling faces. Among the samples from LSUN in Fig. 4, we likewise detected distinguishing patterns specific to each generator.
Visual Expertise Analysis
If the model learns properly, a desirable outcome is that each generator produces images with different features. We generated 2,000 CIFAR-10 images from a MEGAN trained for 20 epochs, fed them to a pretrained VGG19 network, and extracted the feature vectors from the relu4_2 layer. Subsequently, the 8,192-dimensional feature vectors were reduced to two-dimensional vectors using the t-SNE algorithm [Maaten and Hinton 2008]. Fig. 5 shows the results. Each two-dimensional vector is represented as a dot in the figure, and samples from the same generator share the same color. The colored shades indicate the clusters of images generated by the same generator. We tested MEGAN with 5 generators and with 10 generators, confirming that each generator occupies its own region of the feature vector space. It is noteworthy that the clusters overlap in the figure owing to the dimensionality reduction applied for visualization; in the original 8,192-dimensional space, they may overlap much less.

Table 1: Inception scores on CIFAR-10.

Method                   Score
DCGAN
Improved GAN (-L+HA)
WGAN-GP (ResNet)
Splitting GAN
AdaGAN                   not properly trained
MGAN
MEGAN (DCGAN)            8.33 ± 0.09
Table 2: MS-SSIM scores on CelebA and LSUN-Church outdoor, with the number of generators used by each model.

Method   #Generators   CelebA   LSUN
BEGAN    1
DRAGAN   1
LSGAN    1
MGAN     5
MGAN     10
MEGAN    3
MEGAN    5
MEGAN    10
Table 3: User-study detection rates (average, minimum, and maximum across users); lower means users were fooled more often.

                 CelebA               LSUN
Method        Avg   Min   Max     Avg   Min   Max
BEGAN         0.70  0.50  0.89    0.73  0.49  0.93
DRAGAN        0.91  0.71  0.98    0.81  0.50  0.96
LSGAN         0.88  0.71  0.96    0.59  0.36  0.91
WGAN-GP       0.82  0.70  0.96    0.58  0.33  0.85
MEGAN         0.76  0.58  0.95    0.49  0.20  0.71
MEGAN         0.73  0.57  0.93    0.61  0.40  0.81
MEGAN         0.74  0.60  0.92    0.58  0.28  0.92
6.2 Quantitative Analysis
We introduce our quantitative experiment results using the inception score, MSSSIM, and user study.
CIFAR-10
Table 1 lists the inception scores (Section 5.1) of various models on CIFAR-10. MEGAN trained on the DCGAN framework records an inception score of 8.33, with a slightly smaller variance than MGAN (0.09 for MEGAN vs. 0.10 for MGAN). The official code for AdaGAN does not provide stable training on the CIFAR-10 dataset, so its inception score is not comparable to the other baseline methods.
CelebA and LSUN-Church outdoor
The MS-SSIM scores (Section 5.1) measured on CelebA and LSUN-Church outdoor are reported in Table 2. As the MS-SSIM scores of the baseline models are not reported in their papers, we evaluated them after generating many samples using their official code. In this experiment, MEGAN was trained to minimize the WGAN-GP loss, which performed best in our preliminary experiments. MEGAN outperforms all baseline models in terms of the diversity of generated images, as shown by its lowest MS-SSIM scores; notably, MEGAN with five generators achieves the lowest MS-SSIM scores on both datasets.
User Study
Table 3 shows the results of the web-based user study on the CelebA and LSUN-Church outdoor datasets. The score is computed by dividing the number of downvoted fake images by the total number of fake images shown to users; a low score thus indicates that users struggled to distinguish generated images from real ones. For both datasets, MEGAN records competitive performance, and for LSUN-Church outdoor it outperforms all the baseline models.
In conjunction with the MS-SSIM results above, MEGAN's low detection rates indicate that it generates more diverse images of better quality than the baseline methods. We note that BEGAN achieves the lowest detection rate on the CelebA dataset, but at the cost of low diversity in its generated images, as indicated by its high MS-SSIM score.
7 Conclusion
This paper proposed a novel generative adversarial network model, MEGAN, for learning the complex underlying modalities of datasets. Both our quantitative and qualitative analyses suggest that our method is suitable for various datasets. Future work involves extending our algorithm to other variants of GANs and to a broader range of generative models.
References
[Arjovsky et al. 2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[Berthelot et al. 2017] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[Ghosh et al. 2017] Arnab Ghosh, Viveka Kulharia, Vinay Namboodiri, Philip H. S. Torr, and Puneet K. Dokania. Multi-agent diverse generative adversarial networks. arXiv preprint arXiv:1704.02906, 2017.
[Goodfellow et al. 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[Grinblat et al. 2017] Guillermo L. Grinblat, Lucas C. Uzal, and Pablo M. Granitto. Class-splitting generative adversarial networks. arXiv preprint arXiv:1709.07359, 2017.
[Gumbel 1954] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications. Nat. Bur. Standards Appl. Math. Ser. 33, 1954.
[Hampshire and Waibel 1990] John B. Hampshire and Alex H. Waibel. The Meta-Pi network: Connectionist rapid adaptation for high-performance multi-speaker phoneme recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 165–168. IEEE, 1990.
[Hoang et al. 2017] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. Multi-generator generative adversarial nets. arXiv preprint arXiv:1708.02556, 2017.
[Huang et al. 2017] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and Serge Belongie. Stacked generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, page 4, 2017.
[Jacobs et al. 1991] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
[Jang et al. 2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
[Karras et al. 2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[Kodali et al. 2017] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. How to train your DRAGAN. arXiv preprint arXiv:1705.07215, 2017.
[Maaten and Hinton 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[Maddison et al. 2014] Chris J. Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In Advances in Neural Information Processing Systems, pages 3086–3094, 2014.
[Mao et al. 2017] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pages 2813–2821. IEEE, 2017.
[Salimans et al. 2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[Tolstikhin et al. 2017] Ilya O. Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5430–5439, 2017.