MEGAN: Mixture of Experts of Generative Adversarial Networks for Multimodal Image Generation

by   David Keetae Park, et al.
UNC Charlotte
Korea University

Recently, generative adversarial networks (GANs) have shown promising performance in generating realistic images. However, they often struggle in learning complex underlying modalities in a given dataset, resulting in poor-quality generated images. To mitigate this problem, we present a novel approach called mixture of experts GAN (MEGAN), an ensemble approach of multiple generator networks. Each generator network in MEGAN specializes in generating images with a particular subset of modalities, e.g., an image class. Instead of incorporating a separate step of handcrafted clustering of multiple modalities, our proposed model is trained through an end-to-end learning of multiple generators via gating networks, which is responsible for choosing the appropriate generator network for a given condition. We adopt the categorical reparameterization trick for a categorical decision to be made in selecting a generator while maintaining the flow of the gradients. We demonstrate that individual generators learn different and salient subparts of the data and achieve a multiscale structural similarity (MS-SSIM) score of 0.2470 for CelebA and a competitive unsupervised inception score of 8.33 in CIFAR-10.



1 Introduction

Since the introduction of generative adversarial networks (GANs) [Goodfellow et al.2014], researchers have delved deeply into improving the quality of generated images. Recently, a number of new approaches have been proposed for high-quality image generation, e.g., ProgressiveGAN [Karras et al.2017], SplittingGAN [Grinblat et al.2017], SGAN [Huang et al.2017], and WGAN-GP [Salimans et al.2016].

Figure 1: Multiple generators specialized in particular data clusters

We propose a novel GAN model equipped with multiple generators, each of which specializes in learning a certain modality of the dataset (see Fig. 1). In addition to the generators, we employ an auxiliary network that determines which generator will be trained on a given training instance. We call this auxiliary network the gating networks, following the precedent of [Jacobs et al.1991].

Ensembling multiple neural networks coupled with gating networks was first introduced to achieve higher performance in multi-speaker phoneme recognition [Hampshire and Waibel1990]. In their method, the design of the loss function caused the neural networks to cooperate. A later study introduced a new loss function that stimulates competition among neural networks, where the involved neural networks attempt to specialize in a certain task rather than redundantly learn the same feature [Jacobs et al.1991]. The algorithm is now called mixture of experts in various machine learning domains. Reminiscent of their work, we name our proposed GAN approach MEGAN, short for the mixture of experts GAN.

The gating networks in our proposed MEGAN are responsible for selecting one particular generator that would perform best given a certain condition. The gating networks consist of two submodules, an assignment module and Straight-Through Gumbel-Softmax [Jang et al.2016], which we will discuss in detail in Section 4.2.

Although MEGAN inherits the idea of multiple generators and the gating networks, we will not adopt the proposed loss function [Jacobs et al.1991] but utilize adversarial learning to leverage the latest success of GANs.

Our work has two contributions. First, we build a mixture-of-experts GAN algorithm that encourages generators to learn the different modalities existing in our data. Second, we utilize the recently proposed Gumbel-Softmax reparameterization trick and develop a load-balancing regularization to further stabilize the training of MEGAN. We evaluate our model using various criteria, notably achieving an MS-SSIM score of 0.2470 for CelebA, which suggests that MEGAN generates more diverse images than the baseline models. Our generated samples also achieve a competitive inception score of 8.33 in an unsupervised setting.

2 Related Work

Several studies on GANs have been proposed to stabilize the learning process and improve the quality of generated samples. Some of these studies incorporated novel distance metrics to achieve better results. For instance, the original GAN [Goodfellow et al.2014] suffers from the vanishing gradient problem arising from the sigmoid cross-entropy loss function used by the discriminator. LSGAN [Mao et al.2017] solves this problem by substituting the cross-entropy loss with the least-squares loss function. WGAN [Arjovsky et al.2017] adopts the Earth mover’s distance, which enables optimal training and addresses the infamous mode collapse. WGAN-GP goes one step further by adopting a gradient penalty term for stable training and higher performance. Meanwhile, BEGAN [Berthelot et al.2017] aims to match auto-encoder loss distributions using a loss derived from the Wasserstein distance, instead of matching the data distributions directly. In addition, DRAGAN [Kodali et al.2017] prevents mode collapse using a no-regret algorithm.

Other algorithms such as AdaGAN [Tolstikhin et al.2017] and MGAN [Hoang et al.2017] employ ensembling approaches, using multiple generators to learn complex distributions. Based on the idea of boosting in the context of the ensemble model, AdaGAN trains generators sequentially while adding a new individual generator into a mixture of generators. While AdaGAN gradually decreases the importance of generators as more generators are added into the model, our proposed MEGAN pursues an equal balance between generators by explicitly regularizing the model, which avoids the problem of being dominated by a particular generator.

MGAN adopts a predefined mixture weight of generators and trains all generators simultaneously; in contrast, our proposed MEGAN dynamically switches between generators through gating networks and trains the generators one at a time. MGAN’s fixed mixture model is sub-optimal compared to our trainable mixture model. Our proposed MEGAN is different from these models in that each generator can generate images on its own and learn different and salient features.

MAD-GAN [Ghosh et al.2017] has strong relevance to our work. Having multiple generators and a carefully designed discriminator, MAD-GAN overcomes the mode-collapsing problem by explicitly forcing each generator to learn different mode clusters of a dataset. Our MEGAN and MAD-GAN are similar in that both models allow the generators to specialize in different submodalities. However, MEGAN is differentiated from MAD-GAN in two aspects. First, all generators in MEGAN share the same latent vector space, while the generators of MAD-GAN are built on separated latent vector spaces. Second, the generators of MAD-GAN can theoretically learn the identical mode clusters; however, the gating networks built in our MEGAN ensure that each generator learns different modes by its design.

3 Categorical Reparameterization

Essentially, GANs generate images when given latent vectors. Given $n$ generators and a latent vector $z$, our model aims to select a particular generator that will produce the best-quality image. This raises the question of how a categorical decision is made. The Gumbel-Max trick [Gumbel1954, Maddison et al.2014] allows us to sample a one-hot vector $s$ based on the underlying probability distribution $\pi$:

$$ s = \mathrm{one\_hot}\Big(\arg\max_{i}\big[\log \pi_i - \log(-\log u_i)\big]\Big), \tag{1} $$

where $u_i$ is sampled from Uniform(0,1). However, the $\arg\max$ operator in the Gumbel-Max trick is a stumbling block when training via backpropagation because it gives zero gradients passing through the stochastic variable and precludes gradients from flowing further. A recent finding [Jang et al.2016] suggests an efficient detour to backpropagate even in the presence of discrete random variables by a categorical reparameterization trick. The Gumbel-Softmax function generates a sample $y$ that approximates $s$ as follows:

$$ y_i = \frac{\exp\big((\log \pi_i - \log(-\log u_i))/\tau\big)}{\sum_{j=1}^{n} \exp\big((\log \pi_j - \log(-\log u_j))/\tau\big)}, \tag{2} $$

where $y_i$ is the $i$-th component of the vector $y$, and $\tau$ is the temperature that determines how closely the function approximates the sample $s$. It is noteworthy that in practice, we directly predict $\log \pi$ through the assignment module that we will discuss in Section 4.2.
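The two sampling rules can be compared side by side in a few lines of code. The following is a minimal sketch using only the Python standard library; the function names, the example probabilities, and the two temperatures are illustrative assumptions, not values from our implementation. Both functions receive the same Gumbel noise so that the relaxed sample can be checked against the discrete one.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one Gumbel(0, 1) sample as -log(-log(u)), with u ~ Uniform(0, 1)."""
    u = max(rng.random(), 1e-12)  # clamp away from 0 so that log(u) is finite
    return -math.log(-math.log(u))


def gumbel_max_sample(log_pi, noise):
    """Gumbel-Max: one-hot vector at argmax_i (log_pi[i] + noise[i])."""
    scores = [lp + g for lp, g in zip(log_pi, noise)]
    winner = max(range(len(scores)), key=scores.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(scores))]


def gumbel_softmax_sample(log_pi, noise, tau):
    """Gumbel-Softmax: softmax((log_pi + noise) / tau), a relaxed one-hot vector."""
    scaled = [(lp + g) / tau for lp, g in zip(log_pi, noise)]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
log_pi = [math.log(p) for p in (0.2, 0.5, 0.3)]  # illustrative distribution
noise = [gumbel_noise(rng) for _ in log_pi]

s = gumbel_max_sample(log_pi, noise)                    # strictly one-hot
y_warm = gumbel_softmax_sample(log_pi, noise, tau=5.0)  # spread over all entries
y_cold = gumbel_softmax_sample(log_pi, noise, tau=0.05) # close to one-hot
```

With the noise held fixed, lowering the temperature concentrates the relaxed sample on the same index that the Gumbel-Max rule selects.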

3.1 Straight-Through Gumbel-Softmax

The Gumbel-Softmax method approximates the discrete sampling by gradually annealing the temperature $\tau$. This is problematic in our setting because, when the temperature is high, a Gumbel-Softmax distribution is not categorical, leading all generators to be engaged in producing a fake image for a given latent vector $z$. Our objective is to choose the most appropriate generator. Therefore, we do not use the Gumbel-Softmax but adopt the following Straight-Through Gumbel-Softmax (STGS).

The STGS always generates discrete outputs (even when the temperature is high) while allowing the gradients to flow. In practice, the STGS calculates $y$ but returns $y_{hard}$:

$$ y_{hard} = \mathrm{one\_hot}\big(\arg\max_{i} y_i\big) - \bar{y} + y, \tag{3} $$

where $\bar{y}$ is a variable having the same value as $y$ but detached from the computation graph. With this trick, the gradients flow through $y$ and allow the networks to be trained with the annealing temperature.
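The straight-through behavior can be made concrete without an automatic differentiation library. The sketch below is an illustration only, with hypothetical function names and example numbers: the forward pass returns a strict one-hot vector, while the backward pass computes the gradient with respect to the (already Gumbel-perturbed) scores through the soft sample, using the softmax Jacobian $\partial y_i / \partial a_j = y_i(\delta_{ij} - y_j)/\tau$.

```python
import math


def softmax(scores, tau):
    """Temperature-scaled softmax with max subtraction for stability."""
    scaled = [v / tau for v in scores]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def stgs_forward(scores, tau):
    """Return (y_hard, y_soft): a strict one-hot vector and its soft relaxation."""
    y_soft = softmax(scores, tau)
    winner = max(range(len(y_soft)), key=y_soft.__getitem__)
    y_hard = [1.0 if i == winner else 0.0 for i in range(len(y_soft))]
    return y_hard, y_soft


def stgs_backward(grad_output, y_soft, tau):
    """Gradient w.r.t. the scores, taken through y_soft instead of y_hard."""
    dot = sum(g * y for g, y in zip(grad_output, y_soft))
    return [y * (g - dot) / tau for g, y in zip(grad_output, y_soft)]


scores = [0.3, 1.2, -0.5]        # illustrative perturbed logits
y_hard, y_soft = stgs_forward(scores, tau=5.0)
upstream = [1.0, 0.0, 0.0]       # illustrative gradient arriving from the loss
grad = stgs_backward(upstream, y_soft, tau=5.0)
```

Even at a high temperature the forward output is one-hot, yet every score receives a nonzero gradient, which is what allows the module producing the scores to keep learning.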

Figure 2: The proposed architecture of MEGAN; (a) shows the overview of our main networks. Given a latent vector $z$, each of the $n$ generators produces an output $o_i$. The latent vector $z$ and feature vectors $f_i$ (denoted in yellow) extracted from the generators are given as input to the gating networks that produce a one-hot vector $g$, as shown in the middle. The image chosen by the one-hot vector (marked as “Fake Image”) will be fed into the discriminator that measures the adversarial loss with regard to both real and fake classes. (b) illustrates an in-depth view of the gating networks. The gating networks output a one-hot vector $g$.

4 Mixture of Experts GAN

In this section, we illustrate the details of our proposed MEGAN and discuss how generators become specialized in generating images with particular characteristics following the notion of the mixture of experts [Jacobs et al.1991].

4.1 Proposed Network Architecture

Let $G_i$ denote a generator in a set $G = \{G_1, G_2, \ldots, G_n\}$, and let $z$ be a random latent vector. The latent vector is fed into each generator, yielding images $\{o_1, \ldots, o_n\}$ and their feature vectors $\{f_1, \ldots, f_n\}$. Each feature vector $f_i$ is produced in the middle of $G_i$. In our experiments, we used the ReLU activation map from the second transposed convolution layer of the generator as the representative feature vector. The latent vector $z$ and all of the feature vectors are then passed to the gating networks to measure how well $z$ fits each generator. The gating networks produce a one-hot vector $g = (g_1, \ldots, g_n)$ where $g_i \in \{0, 1\}$. We formulate the entire process as follows:

$$ o_i = G_i(z), \qquad g = \mathrm{GN}(z, f_1, \ldots, f_n), \qquad x_{fake} = \sum_{i=1}^{n} g_i\, o_i, $$

where $x_{fake}$ denotes the generated fake image that will be delivered to the discriminator, and GN is the gating network. Fig. 2 provides an overview of the proposed networks.
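The selection step amounts to a sum of generator outputs weighted by a one-hot vector, so exactly one output survives. A minimal sketch follows; the three lambda functions are hypothetical stand-ins for trained generators, and images are represented as short lists of numbers purely for illustration.

```python
def select_output(gate, outputs):
    """Element-wise sum_i gate[i] * outputs[i] for a one-hot gate vector."""
    if sum(gate) != 1 or any(g not in (0, 1) for g in gate):
        raise ValueError("gate must be one-hot")
    length = len(outputs[0])
    return [sum(g * out[p] for g, out in zip(gate, outputs)) for p in range(length)]


# Hypothetical stand-ins for three generators sharing one latent space.
generators = [
    lambda z: [v + 1.0 for v in z],
    lambda z: [v * 2.0 for v in z],
    lambda z: [-v for v in z],
]

z = [0.5, -1.0]                       # the same latent vector goes to every generator
outputs = [G(z) for G in generators]  # every generator produces a candidate
fake = select_output([0, 1, 0], outputs)  # only the second candidate is kept
```

Because the gate is one-hot, the weighted sum equals the output of the selected generator, and only that generator's output reaches the discriminator.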

4.2 Gating Networks

In the context of the mixture of experts, the gating networks play a central role in the specialization of submodules [Jacobs et al.1991]. We use auxiliary gating networks that assign each $z$ to a single generator so as to motivate each generator to learn different features. Concretely, we aim to train each generator to (1) be in charge of the images with certain characteristics potentially corresponding to a particular area in the entire feature space, and (2) learn the specialized area accordingly. The gating networks consist of two distinct modules: an assignment module that measures how well the latent vector $z$ fits each generator, and an STGS module that samples a generator based on the underlying distribution given by the assignment module.

Assignment Module

The assignment module uses the feature vectors $f_1, \ldots, f_n$ and first encodes each of them into a hidden state $h_i$ of a smaller dimension $k$:

$$ h_i = W_i f_i, $$

where $W_i$ denotes a linear transformation for feature vector $f_i$. Encoding each feature vector reduces the total complexity significantly, because $d$, the dimension of a feature map $f_i$, is typically large, e.g., $d$ = 8192 in our implementation. The reduced dimension $k$ is a hyperparameter that we set as 100. The hidden states $h_1, \ldots, h_n$ are then concatenated along with the latent vector. The merged vector is then passed to a three-layer perceptron, which consists of batch normalizations and ReLU activations:

$$ l = \mathrm{MLP}\big([z; h_1; \ldots; h_n]\big), $$

where the resulting $l$ is a logit vector, an input for the STGS. $l$ also corresponds to $\log \pi$ explained in Section 3.
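The effect of the per-generator encoding on the size of the perceptron input can be checked with simple arithmetic. In the sketch below, the feature dimension 8192 and the reduced dimension 100 are the values stated above, 10 is one of the generator counts we tested, and the latent dimension of 100 is an assumption made only for this illustration.

```python
def gating_input_dim(n_generators, per_generator_dim, latent_dim):
    """Length of the concatenation of the latent vector and one vector per generator."""
    return latent_dim + n_generators * per_generator_dim


n_generators = 10   # one of the tested settings
feature_dim = 8192  # dimension of a raw feature map
reduced_dim = 100   # dimension after the linear encoding
latent_dim = 100    # ASSUMPTION for illustration; not stated in the text

raw = gating_input_dim(n_generators, feature_dim, latent_dim)      # no encoding
encoded = gating_input_dim(n_generators, reduced_dim, latent_dim)  # with encoding
```

Under these assumptions the perceptron would receive 82,020 inputs without the encoding and 1,100 with it, which is where most of the saving in the first perceptron layer comes from.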

STGS Module

$l$ is an unnormalized density that determines the generator that most adequately fits the latent vector. The STGS samples a one-hot vector with $l$ as an underlying distribution. We denote the sampled one-hot vector as $g$, which corresponds to $y_{hard}$ in Eq. (3). It strictly yields one-hot vectors. Thus, with the STGS, we can select a particular generator among many, enabling each generator to focus on a sub-area of the latent vector space decided by the gating networks. It is noteworthy that the assignment module is updated by the gradients flowing through the STGS module.

4.3 Load-Balancing Regularization

We observed that the gating networks converge too quickly, often resorting to only a few generators. The networks tend to be strongly affected by the first few training samples and favor the generators chosen in the initial training stages over others. The fast convergence of the gating networks is undesirable because it leaves little room for the other generators to learn in the later stages. Our goal is to assign the data space to all the generators involved.

To prevent the networks from resorting to a few generators, we force the networks to choose the generators in equal frequencies in a mini-batch. Thus, we introduce a regularization to the model as follows:


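$$ L_{LB} = \sum_{i=1}^{n}\Big(p(G_i) - \frac{1}{n}\Big)^2, \qquad p(G_i) = \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\big[g_i^{(j)} = 1\big], $$

where $L_{LB}$ indicates the load-balancing loss, $m$ is the mini-batch size, and $g_i^{(j)}$ is the $i$-th element of the one-hot vector $g$ for the $j$-th data of a training mini-batch. $p(G_i)$ is the probability that a certain generator $G_i$ will be chosen. The indicator function $\mathbb{1}[g_i^{(j)} = 1]$ returns 1 if and only if $g_i^{(j)} = 1$. Concretely speaking, we train the model with mini-batch data; further, for all the data in a mini-batch, we count every assignment to each generator. Thus, the regularization loss pushes the gating networks to select the generators equally often.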
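The counting step can be sketched as follows. The code assumes that the penalty is the squared deviation of each generator's selection frequency from the uniform share; this form is an assumption made for illustration, and the function names are hypothetical. The sketch covers the counting only and ignores how gradients reach the gating networks.

```python
def selection_frequencies(gates):
    """Fraction of the mini-batch assigned to each generator.

    gates: list of one-hot lists, one per sample in the mini-batch.
    """
    batch_size = len(gates)
    n_generators = len(gates[0])
    return [
        sum(1 for g in gates if g[i] == 1) / batch_size
        for i in range(n_generators)
    ]


def load_balancing_loss(gates):
    """ASSUMED form: sum of squared deviations from the uniform share 1/n."""
    freqs = selection_frequencies(gates)
    uniform = 1.0 / len(freqs)
    return sum((p - uniform) ** 2 for p in freqs)


balanced = [[1, 0], [0, 1], [1, 0], [0, 1]]  # both generators used equally
collapsed = [[1, 0, 0, 0]] * 8               # one of four generators takes everything

loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

The loss is zero when every generator is selected equally often within the mini-batch and grows as the assignments concentrate on fewer generators.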


4.4 Total Loss

The total loss of our model set for training is as follows:


$$ L = L_{adv} + \lambda L_{LB}, $$

where $L_{adv}$ is any adversarial loss computed through an existing GAN model. We do not specify $L_{adv}$ in this section, because it may vary based on the GAN framework used for training. We set $\lambda$ to control the impact of the load-balancing regularization.

4.5 Discussions

In this section, we discuss a few potential issues with the mixture model.

Mechanism of Specialization

In MEGAN, what forces the generators to specialize? We presume it is the implicit dynamics between multiple generators and the STGS. No explicit loss function exists to teach the generators to be specialized. Nevertheless, they should learn how to generate realistic images because the STGS isolates a generator from others by a categorical sampling. The gating networks learn the type of $z$ that best suits a certain generator and keep assigning similar ones to the generator. The generators learn that specializing on a particular subset of data distribution helps to obfuscate the discriminator by generating more realistic images. As the training iterations proceed, the generators converge to different local clusters of data distribution.

Effect of Load-Balancing on Specialization

Another important aspect in training MEGAN is determining the hyperparameter $\lambda$ for the load-balancing regularization. A desired outcome from the assignment module is a logit vector $l$ with high variance among its elements, while maintaining the training of generators in a balanced manner. Although the load-balancing regularization is designed to balance workloads between generators, it slightly nudges the assignment module to yield a logit vector closer to a uniform distribution. Thus, we observe that when an extremely large value is set for $\lambda$ (e.g., 1000), the logit values follow a uniform distribution. This is not a desired consequence, because a uniform distribution of $l$ means the gating networks failed to properly perform the generator assignment, and the specialization effect of generators is minimized. To prevent this, we suggest two solutions.

The first solution is to obtain an optimal value of $\lambda$ where training is stable and the logit values are not too uniform. It is a simple but reliable remedy, because finding the optimal $\lambda$ is not demanding. The second possible solution is to adjust $\lambda$ during training when the logit values follow a uniform distribution. Most of our experiments were performed by the first method, in which we fix $\lambda$, because a stable point could be found quickly, and it allows us to focus more on the general capability of the model.

Data Efficiency

Some may claim that our model lacks data efficiency because each generator focuses on a small subset of a dataset. When trained with our algorithm, a single generator is exposed to a smaller number of images, because the generators specialize in a certain subset of the images. However, it also means that each generator can focus on learning fewer modes. Consequently, we observed that our model produces images with an improved quality, as described in detail in Section 6.

5 Experiment Details

In this section, we describe our experiment environments and objectives. All the program codes are available in

5.1 Experiment Environments

We describe the detailed experiment environments in this section, such as baseline methods, datasets, etc.

Underlying GANs

We apply our algorithm on both DCGAN and WGAN-GP (DCGAN layer architecture) frameworks, chosen based on their stability and high performance. The experiments consist of visual inspections, visual expertise analysis, quantitative evaluations, and user studies for generalized qualitative analyses.

Figure 3: Visual Inspection; CelebA dataset, 64x64 samples from MEGAN with each block of four images generated by the same generator. Noticeable differences between each block indicate that different generators produce images with different features.

Baseline algorithms

We compared our quantitative results with many state-of-the-art GAN models such as BEGAN, LSGAN, WGAN-GP, improved GAN (-L+HA) [Salimans et al.2016], MGAN and SplittingGAN. AdaGAN could not be included for our evaluation, because the official code for AdaGAN does not provide a stable training for the datasets we used.


Datasets

We used three datasets for our evaluation: CIFAR-10, CelebA, and LSUN. CIFAR-10 has 60,000 images from 10 different object classes. CelebA has 202,599 facial images of 10,177 celebrities. LSUN contains various scene images; we evaluated on its church outdoor subset, which consists of 126,227 images.

Evaluation Metric

We evaluated our model on two standard metrics to quantitatively measure the quality of the generated images: inception score (IS) and multiscale structural similarity (MS-SSIM). The IS is calculated through the Inception image classification network and returns high scores when diverse and high-quality images are generated. MS-SSIM is a widely used measure of the similarity between two images. We generated 2,000 images and computed their average pairwise MS-SSIM score. The lower the score, the more diverse the generated images.
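The averaging step of the diversity measurement can be sketched as follows, assuming that every unordered pair of generated images is scored. The similarity function here is a toy placeholder and is not MS-SSIM; a real evaluation would substitute an actual MS-SSIM implementation. All names are hypothetical.

```python
from itertools import combinations
from math import comb


def average_pairwise(items, similarity):
    """Mean of similarity(a, b) over every unordered pair of distinct items."""
    pairs = list(combinations(items, 2))
    if not pairs:
        raise ValueError("need at least two items")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def toy_similarity(a, b):
    """Placeholder in (0, 1]; NOT MS-SSIM. Equals 1.0 for identical inputs."""
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mean_abs_diff)


identical = [[0.2, 0.4]] * 4  # a collapsed generator: every sample is the same
varied = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

score_identical = average_pairwise(identical, toy_similarity)
score_varied = average_pairwise(varied, toy_similarity)

# Number of unordered pairs among 2,000 generated images.
pairs_for_2000 = comb(2000, 2)
```

A collapsed set of samples gives the maximum average similarity, while a varied set scores lower. Under the all-pairs assumption, 2,000 images yield 1,999,000 pairs.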

User Study

We also conducted web-based user studies. On our test website, randomly selected real and fake images are displayed, and users are asked to downvote images that they think are fake (a test run can be made by entering the following key: 5B3309). Nine images were provided per test, and each user repeated the test 100 times. Regarding the CelebA dataset, we observed that the participants were good at detecting generated facial images of their own race. Therefore, we diversified the ethnicity in our user groups by having thirty participants from three different continents.


Hyperparameters

We tested the following hyperparameter setups: the number of generators $n$ as 3, 5, and 10; the mini-batch size $m$; the annealing temperature $\tau$, scheduled over the iteration number $t$ as in [Jang et al.2016]; the load-balancing parameter $\lambda$ = 100; and the feature-vector dimension $d$, set separately for CIFAR-10 and for both LSUN and CelebA.

6 Experimental Results

Figure 4: Visual Inspection; LSUN-Church outdoor dataset, 64x64 samples from MEGAN with each block of four images generated by the same generator. Distinguishable features include the church architectural style, the location, and the cloud cover.

6.1 Evaluation on Specialization

We describe our results based on various visual inspections.

Figure 5: Visual Expertise Analysis; 2,000 images are generated by MEGAN on CIFAR-10 dataset and feature vectors of those images are extracted from the relu4_2 layer of VGG-19 networks. All 2,000 feature vectors are visualized in a two-dimensional space by the t-SNE algorithm.

Visual Inspection

Throughout our evaluations, each generator is found to learn different context and features, for at least up to the 10 generators that we have inspected. The decision to assign a particular subset of data to a particular generator is typically based on visually recognizable features, such as background colors or the shape of the primary objects. Figs. 3 and 4 show the generated samples drawn from different generators trained with 10 generators on CelebA and LSUN-Church outdoor, respectively. Each block of four images is from the same generator. We chose six generators that have learned the most conspicuous and distinctive features, readily captured even by the human eye.

All four images share some features in common, while having at least one distinguishing characteristic from other blocks of images. For instance, the top-left celebrities in Fig. 3 have black hair without noticeable facial expressions. On the contrary, the top-right celebrities have light-colored hair with smiling faces. Among the samples from LSUN in Fig. 4, we also detected distinguishing patterns specific to each generator.

Visual Expertise Analysis

If the model learns properly, a desirable outcome is that each generator produces images of different features. We generated 2,000 CIFAR-10 images from MEGAN trained for 20 epochs, fed them to a pretrained VGG19 network, and extracted the feature vectors from the relu4_2 layer. Subsequently, the 8,192-dimensional feature vectors are reduced to two-dimensional vectors using the t-SNE algorithm [Maaten and Hinton2008]. Fig. 5 shows the results. Each two-dimensional vector is represented as a dot in the figure, and samples from the same generator are of the same color. The colored shades indicate the clusters of images that are generated by the same generator. We tested MEGAN with 5 generators and 10 generators, confirming that each generator occupies its own region in the feature vector space. It is noteworthy that they overlap in the figure owing to dimensionality reduction for visualization purposes. In the original 8,192-dimensional space, they may overlap much less.

Method                  Score
Improved GAN (-L+HA)
WGAN-GP (Resnet)
AdaGAN                  Not properly trained

Table 1: Inception Score on CIFAR-10 (trained without labels)

Table 2: MS-SSIM Score on CelebA

Method     CelebA               LSUN
           Avg   Min   Max      Avg   Min   Max
BEGAN      0.70  0.50  0.89     0.73  0.49  0.93
DRAGAN     0.91  0.71  0.98     0.81  0.50  0.96
LSGAN      0.88  0.71  0.96     0.59  0.36  0.91
WGAN-GP    0.82  0.70  0.96     0.58  0.33  0.85
           0.76  0.58  0.95     0.49  0.20  0.71
           0.73  0.57  0.93     0.61  0.40  0.81
           0.74  0.60  0.92     0.58  0.28  0.92

Table 3: User study results

6.2 Quantitative analysis

We introduce our quantitative experiment results using the inception score, MS-SSIM, and user study.


CIFAR-10

Table 1 lists the inception scores (Section 5.1) of various models on CIFAR-10. MEGAN trained on the DCGAN framework records an inception score of 8.33, with a slightly lower variance than MGAN (0.09 for MEGAN vs. 0.1 for MGAN). The official code for AdaGAN does not provide stable training on the CIFAR-10 dataset, so its inception score is not comparable to those of the other baseline methods.

CelebA and LSUN-Church outdoor

The MS-SSIM scores (Section 5.1) measured for CelebA and LSUN-Church outdoor are reported in Table 2. As the MS-SSIM scores of the baseline models are missing in their papers, we evaluate them after generating many samples using their official codes. In this experiment, MEGAN is trained to minimize the WGAN-GP loss function, which was found to perform best in our preliminary experiments. We report that MEGAN outperforms all baseline models in terms of the diversity of generated images, as shown by its lowest MS-SSIM scores. Notably, MEGAN with five generators achieves the lowest MS-SSIM scores for both datasets.

User Study

Table 3 shows the result of the web-based user study on CelebA and LSUN-Church outdoor datasets. The score is computed by dividing the number of downvoted fake images by the total number of fake images shown to users. Thus, a low score indicates that users struggle to distinguish generated images from real images. For both datasets, MEGAN records competitive performance, and especially for LSUN-Church outdoor it outperforms all the baseline models.
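The score defined above can be computed from raw votes as in the following sketch. The vote format, a list of (is_fake, was_downvoted) pairs, and the example votes are hypothetical and chosen only to illustrate the ratio; downvotes on real images do not enter the score.

```python
def detection_rate(votes):
    """Downvoted fake images divided by all fake images shown.

    votes: iterable of (is_fake, was_downvoted) booleans, one per displayed image.
    """
    fake_votes = [downvoted for is_fake, downvoted in votes if is_fake]
    if not fake_votes:
        raise ValueError("no fake images were shown")
    return sum(fake_votes) / len(fake_votes)


# Hypothetical votes: four fake images (two caught) and two real images.
votes = [
    (True, True),
    (True, False),
    (False, True),   # a real image wrongly downvoted; ignored by the score
    (True, True),
    (False, False),
    (True, False),
]
rate = detection_rate(votes)
```

In this example two of the four fake images were downvoted, giving a detection rate of 0.5.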

In conjunction with the previous MS-SSIM results, MEGAN’s low detection rates indicate that it can generate more diverse images of better quality than the other baseline methods. We observed that BEGAN achieves the lowest detection rate for the CelebA dataset, in exchange for the low diversity of generated images, as indicated by the high MS-SSIM score of BEGAN in Table 2.

7 Conclusion

This paper proposed a novel generative adversarial network model called MEGAN for learning the complex underlying modalities of datasets. Both our quantitative and qualitative analyses suggest that our method is suitable for various datasets. Future work involves extending our algorithm to other variants of GANs and a broader range of generative models.