In generative modeling, we are given a data set $\mathcal{X}$ sampled from some unknown probability distribution $p(x)$ and we want to be able to generate new instances from $p(x)$. This is an unsupervised learning problem and the usual approach is to first build an estimator for $p(x)$ and then sample from it. The generative adversarial network (GAN) [5] is interesting in that it learns a generative model without explicitly modeling $p(x)$, using instead an auxiliary discriminative model, thereby transforming an unsupervised learning problem into a supervised learning problem.
A GAN model is composed of two learners, a generator $G$ and a discriminator $D$. $G$ takes as input a random $z$ drawn from some simple parametric distribution $p(z)$ of relatively low dimensionality, e.g., a zero-mean Gaussian with unit covariance, and learns to transform it to a valid instance $x$ from (the unknown) $p(x)$. $G$ is implemented as a deep neural network that takes $z$ as input, generates $G(z)$ as output, and has as many layers in between as are necessary for the transformation; the weights in $G$ are denoted by $\theta_G$. The $G(z)$ that are generated by $G$ are called fake because they are synthetic. The discriminator $D$ is a two-class classifier that learns to discriminate such fakes from true $x$ sampled from the training set $\mathcal{X}$. $D$ is another deep neural network with either $G(z)$ or $x$ as input and 0 or 1 as the desired output respectively for the single sigmoid output. Again, $D$ has as many hidden layers as necessary for the task; the weights in $D$ are denoted by $\theta_D$.
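The objective function is

$$\min_G \max_D \; \mathbb{E}_{x \sim p(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$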
We train the weights of both $G$ and $D$ using gradient-based optimization, alternating between the two. $D$ wants to maximize the likelihood for true instances (drawn from the unknown $p(x)$ as represented by the training set $\mathcal{X}$) and minimize the likelihood for fake instances generated by $G$. At the same time, $G$ wants to generate fakes to which $D$ assigns as high a likelihood as possible. As $G$ gets better at generating fakes to which $D$ assigns high likelihood, $D$ is forced to better separate them from true instances, which in turn forces $G$ to generate even better fakes, and so on.
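The quantity that the two networks push in opposite directions can be estimated on a mini-batch from the discriminator outputs alone. The following is a minimal sketch, assuming the standard GAN value function and a discriminator whose outputs are probabilities in (0, 1); the function name `gan_value` and the toy outputs are illustrative and not taken from the paper.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Mini-batch estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on true instances.
    d_fake: discriminator outputs on generated (fake) instances.
    The discriminator ascends this value; the generator descends it.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

# A discriminator that cannot tell true from fake outputs 0.5 everywhere.
confused = gan_value(np.full(4, 0.5), np.full(4, 0.5))
# A discriminator that separates the two classes well reaches a higher value.
sharp = gan_value(np.full(4, 0.9), np.full(4, 0.1))
```

In an actual training loop these two roles alternate: one or more ascent steps on the discriminator weights, then a descent step on the generator weights.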
GANs have been used successfully, especially in image generation. A well-trained GAN can generate images that humans can hardly distinguish from real ones [12, 2, 13]; still, there are two main difficulties in training. First, $G$ sometimes learns only a part of the true $p(x)$ and can generate only a subset of the possible $x$; this is called mode collapse because it indicates that $G$ does not cover all the modes of $p(x)$. The second problem is that of vanishing gradients, which we always have in training deep neural networks; note that here, because both $G$ and $D$ are deep, $G$ is doubly deep because its gradient needs to be back-propagated through $D$.
There is recent work in the literature that focuses on these problems. To solve problems related to training, it has been proposed to use either different objective functions, regularization methods, or architectures; see [3, 9, 15] for good surveys of the state of the art.
The direction we pursue in this study is to use multiple generators, each one responsible for generating a local region of $p(x)$. Different local generators will learn to cover different modes and this will help alleviate the mode collapse problem. They also help with the problem of vanishing gradients because local generators are simpler, i.e., shallower, and hence the paths through which the gradient is back-propagated are shorter. We review three previously proposed approaches from the literature, namely multi-agent diverse GAN (MADGAN), mixture GAN (MGAN), and mixtures of experts GAN (MEGAN).
We propose the hierarchical mixture of generators, which has a tree structure with internal decision nodes that divide up the latent space and leaves that are local generators, each responsible for a local region of the latent space and generating a subset of $x$. Since the splits are soft, given the tree structure, the split parameters at the internal nodes as well as those of the generators in the leaves can be updated using gradient descent. Note that it is only $G$ that is modeled this way and split locally; there is still a single $D$ implemented as a deep fully connected neural network as usual.
The rest of this paper is organized as follows. In Section II, we discuss the previously proposed models in the literature that also use multiple generators. We explain our proposed model of hierarchical mixture of generators in detail in Section III. Our experimental results on a toy two-dimensional data set and five real-world image data sets are given in Section IV. We conclude in Section V.
II Combining multiple generators in GAN
II-A Multi-agent diverse GAN
In the multi-agent diverse GAN (MADGAN) [4], there are $K$ generators, each of which labels the fake data it generates with its index. $D$ does not learn a two-class, true vs. fake classification problem, but a $(K+1)$-class problem where class 0 is for true instances and classes 1 to $K$ are the different ways of generating fakes; in terms of implementation, $D$ has $K+1$ softmax outputs instead of the one sigmoid output of the original GAN.
The model is shown in Figure 1. Given $z$, first a shared neural network block produces an intermediate representation. Here $z$ is a low-dimensional unstructured vector, and the shared block contains successive deconvolution layers that generate a two-dimensional image using multiple filters; this intermediate representation is given as input to a set of generators $G_k$, $k = 1, \dots, K$, one of which we choose at random. The discriminator sees either a true $x$ with class code 0 or one of the generated fakes with the index $k$ of its generator as the class code. The discriminator should push the different generators to different modes to solve the classification problem successfully: For correct classification, instances from the same class (i.e., fakes generated by the same generator) need to be more similar than instances from different classes (i.e., fakes generated by different generators).
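More formally, for the discriminator, the objective is to

$$\max_D \; \mathbb{E}_{x \sim p(x)}\left[\log D_0(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\sum_{k=1}^{K} \delta_k \log D_k(G_k(z))\right] \tag{2}$$

where $\delta_k$ is 1 if $z$ is processed by generator $G_k$ and 0 otherwise, $G_k(z)$ is the fake that generator $k$ produces from $z$, and $D_k(\cdot)$ denotes the output of the discriminator for class $k$.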
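In updating the weights of the generators, the objective is to make the discriminator assign the fakes to the true class 0:

$$\min_G \; \mathbb{E}_{z \sim p(z)}\left[\sum_{k=1}^{K} \delta_k \log\left(1 - D_0(G_k(z))\right)\right] \tag{3}$$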
Note that though there are multiple generators, their outputs are not combined in a cooperative manner. We do not partition the $z$ space and use each local partition for a different generator; for any $z$, any of the generators can be used. It is more as if each generator produces its own interpretation of $z$; instead of partitioning the $z$ space, we learn alternative generator functions for the same region in it.
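The $(K+1)$-class discriminator described in this section can be trained with an ordinary softmax cross-entropy, where label 0 marks a true instance and label $k$ marks a fake from generator $k$. The sketch below assumes exactly that setup; the function names and the random logits are hypothetical and do not come from the MADGAN implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def madgan_discriminator_loss(logits, labels):
    """Mean cross-entropy over K+1 classes.

    labels[i] == 0 means instance i is true; labels[i] == k (1..K) means
    it is a fake produced by generator k.
    """
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

K = 3
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, K + 1))           # stand-in for discriminator outputs
labels = np.array([0, 0, 1, 2, 3, 1])          # two true instances, four fakes
loss = madgan_discriminator_loss(logits, labels)
# An untrained discriminator with equal logits pays log(K + 1) per instance.
uniform_loss = madgan_discriminator_loss(np.zeros((6, K + 1)), labels)
```

Minimizing this loss is equivalent to maximizing the log-probability of the correct class, so fakes from different generators must become distinguishable from each other as well as from true data.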
II-B Mixture GAN
The mixture GAN (MGAN) [8] has some similarities with MADGAN, the main difference being that the classifier and the discriminator are separated. The discriminator is a two-class classifier as usual, differentiating between true and fake examples, and there is an additional $K$-class classifier, used only for the fake examples, that learns the index of the generator used.
The model is shown in Figure 2. There is also the difference that the split of the generators is earlier and the shared deconvolution block comes afterwards. The generators transform $z$ to intermediate representations in parallel and, for all of them, the shared block produces the final output. Training is formalized as a multi-task learning problem where the discriminator is trained to discriminate between the fake and the real data as usual, and at the same time, for a fake, the $K$-class classifier tries to predict the index of the generator that produced it.
The overall objective is defined as follows:
where $C$, which has its own parameters, is the $K$-class classifier for the fakes whose output for class $k$ is denoted by $C_k(\cdot)$.
II-C Mixtures of experts GAN
The mixtures of experts GAN (MEGAN) [20] is shown in Figure 3. In addition to the generators $G_k$, there is also a gating network that takes as its input $z$ and the first-layer activations of the generators; these activations are believed to provide additional information as to how best to choose the responsible generator. Then Straight-Through Gumbel Softmax is applied, which selects only one expert while retaining differentiability. The discriminator is still two-class. The gating model also has its parameters that are updated together with the generators. Although all generators generate an output, it is the gating model that decides which one is to be used. Except that the generator is written as a weighted sum of generators, the training objective is the same as Equation (1) used in the original GAN model.
Different from MADGAN and MGAN, here the $z$ space is partitioned into local regions which map to local regions of $x$. Thanks to the gating network outputs, each generator is only responsible for a local region of $z$ and generates the corresponding local region of $x$. However, this partitioning is hard since we allow only one generator to be used. Besides, the gating network takes processed features as extra inputs and this may lead to a partitioning that is not smooth in the $z$ space.
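To make the hard selection concrete, here is a minimal NumPy sketch of the forward pass of a straight-through Gumbel-softmax choice among generator outputs. It is an illustration under assumptions, not MEGAN's code: the gating logits and generator outputs are random stand-ins, the temperature `tau` is arbitrary, and plain NumPy cannot show the backward pass, in which gradients would flow through the soft sample instead of the one-hot vector.

```python
import numpy as np

def gumbel_softmax_hard(logits, tau, rng):
    """Forward pass of a straight-through Gumbel-softmax selection.

    Returns the one-hot choice and the underlying soft sample. In an autodiff
    framework the one-hot vector is used going forward while gradients are
    taken with respect to the soft sample.
    """
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                        # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    soft = y / y.sum(axis=-1, keepdims=True)            # relaxed sample
    hard = np.eye(logits.shape[-1])[soft.argmax(axis=-1)]  # one-hot choice
    return hard, soft

rng = np.random.default_rng(0)
batch, n_gen, out_dim = 4, 3, 2
gate_logits = rng.normal(size=(batch, n_gen))            # hypothetical gating outputs
gen_outputs = rng.normal(size=(batch, n_gen, out_dim))   # every generator produces an output
hard, soft = gumbel_softmax_hard(gate_logits, tau=1.0, rng=rng)
selected = np.einsum("bk,bkd->bd", hard, gen_outputs)    # only the chosen output is kept
```

Because `hard` is one-hot, the weighted sum reduces to picking a single generator per instance, which is the hard partitioning discussed above.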
III Hierarchical mixtures of generators
All previous approaches use multiple generators, yet these generators do not work cooperatively, and they all train a flat set of generators. We propose the hierarchical mixture of generators, inspired by the hierarchical mixtures of experts [11], where the generators are organized at the leaves of a tree and cooperate as defined by the tree structure.
Let us think of a binary decision tree. The generators are at the leaves of this tree. At each internal node $m$ of the tree, there is a gating function $g_m(z)$ with parameters $w_m$ that calculates the probability that we take the left child:

$$g_m(z) = \frac{1}{1 + \exp\left(-w_m^\top z\right)} \tag{5}$$

and $1 - g_m(z)$ is the probability that we take the right child. If $m$ is a leaf node, the response is given by the generator at that leaf, $y_m(z) = G_m(z)$. If $m$ is an internal node, its response is a weighted sum of the responses of its left and right children, weighted by the gating values:

$$y_m(z) = g_m(z)\, y_m^L(z) + \left(1 - g_m(z)\right) y_m^R(z) \tag{6}$$

where $y_m^L(z)$ and $y_m^R(z)$ are the responses of the left and the right children respectively, calculated recursively until we get to leaf nodes; see Figure 4.
The generators at the leaves are simple linear models:

$$G_m(z) = V_m z \tag{7}$$

where $V_m$ is the weight matrix of leaf $m$.
Because the gating in Equation (5) is a sigmoid, we take a soft combination of the generators at the leaves. This has two uses: First, we have a smooth transition from one local generator to another, smoothly interpolating in between. Second, the model is differentiable and therefore, given a tree structure, we can use gradient-based optimization to learn all the gating parameters in the decision nodes and the parameters of the generators at the leaves.
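The recursive response of the soft tree can be sketched in a few lines. This is only an illustration under assumptions that the text does not spell out: the tree is a full binary tree stored in heap order (root at index 0, children of node `m` at `2m+1` and `2m+2`), each gate is a sigmoid of a linear function of the latent vector without a bias term, and each leaf is a bias-free linear map. The names `gate_w` and `leaf_V` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def response(z, node, depth, gate_w, leaf_V):
    """Response of `node` for one latent vector z in a full binary tree.

    gate_w: (2**depth - 1, z_dim) gating vectors, one per internal node.
    leaf_V: (2**depth, x_dim, z_dim) linear generators, one per leaf.
    """
    first_leaf = 2 ** depth - 1
    if node >= first_leaf:                          # leaf: linear generator
        return leaf_V[node - first_leaf] @ z
    g = sigmoid(gate_w[node] @ z)                   # probability of the left child
    left = response(z, 2 * node + 1, depth, gate_w, leaf_V)
    right = response(z, 2 * node + 2, depth, gate_w, leaf_V)
    return g * left + (1.0 - g) * right             # soft combination of children

rng = np.random.default_rng(0)
depth, z_dim, x_dim = 3, 2, 2
gate_w = rng.normal(size=(2 ** depth - 1, z_dim))
leaf_V = rng.normal(size=(2 ** depth, x_dim, z_dim))
z = rng.normal(size=z_dim)
x_fake = response(z, 0, depth, gate_w, leaf_V)      # response of the root
```

Every operation here is differentiable in `gate_w` and `leaf_V`, which is what allows the whole tree to be trained by back-propagating the discriminator's signal.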
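We train the model with the Wasserstein loss [1]:

$$\min_G \max_D \; \mathbb{E}_{x \sim p(x)}\left[D(x)\right] - \mathbb{E}_{z \sim p(z)}\left[D(G(z))\right] \tag{8}$$

Here, $D$ estimates the score of “trueness” for a sample, and the Wasserstein loss measures the difference between the average scores for true samples and the generated fake samples. $D$, which is a regressor and not a classifier, is trained to maximize this, and $G$ is trained to minimize it.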
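As a small illustration of the difference-of-average-scores idea, the sketch below computes the quantity on a mini-batch and derives the two losses that would be minimized in an alternating scheme. The function names are hypothetical, the scores are toy numbers, and the gradient penalty used in the experiments is deliberately left out.

```python
import numpy as np

def wasserstein_objective(scores_true, scores_fake):
    """Average score of true samples minus average score of fakes."""
    return float(np.mean(scores_true) - np.mean(scores_fake))

def critic_loss(scores_true, scores_fake):
    """The discriminator maximizes the objective, i.e. minimizes its negative."""
    return -wasserstein_objective(scores_true, scores_fake)

def generator_loss(scores_fake):
    """Only the fake term depends on the generator, which wants high fake scores."""
    return -float(np.mean(scores_fake))

scores_true = np.array([2.0, 4.0])    # unbounded regression outputs, not probabilities
scores_fake = np.array([1.0, 1.0])
gap = wasserstein_objective(scores_true, scores_fake)
```

Note that the scores are unbounded real numbers, which is why a discriminator trained this way is a regressor and cannot double as the classifier that MADGAN and MGAN require.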
Note that Equation (5) defines a binary tree; we can also have a $K$-ary tree by using the softmax instead of the sigmoid in the gating nodes. At the extreme, as a special case, we can have a tree of depth 1 with $K$ generator leaves; see Figure 5. This is the (flat) mixture of generators (MoG), which is similar to MEGAN with two differences: we keep the softmax gating so that the combination is soft, just like the original mixture of experts model [10], and the input to the gating uses only $z$ without any extra features extracted from the generators.
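The flat special case can be sketched as a single softmax gate over linear leaves. As with the tree sketch, the bias-free linear forms and the names `gate_W` and `leaf_V` are assumptions made for illustration only.

```python
import numpy as np

def mog_forward(z, gate_W, leaf_V):
    """Flat mixture of generators: softmax gating over K linear generators.

    z: (batch, z_dim); gate_W: (K, z_dim); leaf_V: (K, x_dim, z_dim).
    Returns the mixed outputs (batch, x_dim) and the gating values (batch, K).
    """
    logits = z @ gate_W.T
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    gates = np.exp(logits)
    gates /= gates.sum(axis=1, keepdims=True)
    outputs = np.einsum("kdz,bz->bkd", leaf_V, z)          # each generator's output
    return np.einsum("bk,bkd->bd", gates, outputs), gates  # soft combination

rng = np.random.default_rng(0)
K, z_dim, x_dim = 4, 2, 3
gate_W = rng.normal(size=(K, z_dim))
leaf_V = rng.normal(size=(K, x_dim, z_dim))
z = rng.normal(size=(5, z_dim))
x_fake, gates = mog_forward(z, gate_W, leaf_V)
```

The gating here depends on `z` alone, so each generator's region of responsibility is a smooth function of the latent input.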
IV Experiments
IV-A Results on toy data
We begin with experiments on a toy two-dimensional data set sampled from a mixture of five Gaussians. The latent $z$ is drawn from a two-dimensional Gaussian distribution with zero mean and unit variance, and we use eight generators. The data and how the generators split the data among themselves are shown in Figure 6 for MADGAN, MGAN, and MEGAN. We color the different regions of $z$ and the corresponding $x$ so as to show which parts of $z$ generate which part of $x$; we also show, in smaller plots below, samples generated by the individual generators, similarly color coded.
We see that because MADGAN and MGAN do not use any gating function, with these two, the output regions of the generators overlap with each other. Each color in the $z$-space corresponds to eight different points in the $x$-space, one from each of the eight generators. MEGAN does use a gating function, but because the gating also uses the extra generator features, the output regions of the generators still overlap with each other. Note that all three miss parts of the underlying distribution; MADGAN and MEGAN miss the top component, and MGAN misses the one at the bottom.
The tree learned by our proposed hierarchical mixture of generators, HMoG, is shown in Figure 7; this is a tree of depth three that also has eight generators at its leaves. We calculate each decision node’s responsibility by counting the (soft) gating values, and an instance is drawn in the box of the expert having the highest responsibility. We see that the tree has learned a hierarchical soft clustering of the data, with the leaves learning parts of $x$, each corresponding to a part of $z$. We see that this model covers the data completely and has not missed any of the components.
The results using a flat mixture of generators, MoG, are shown in Figure 8. Because the combination is soft and depends only on the input $z$, here too each generator operates in a local region of $z$. This model too learns the distribution without dropping any modes of the data; note, however, that here some generators do more of the work while some are not used at all. This, we believe, is the advantage of a hierarchical model, which dissects the problem into two at each level, easing it in a divide-and-conquer fashion. The hierarchical organization also lends itself to discovering structure in the data.
IV-B Results on image data sets
We test and compare our proposed mixture models HMoG and MoG with MADGAN, MGAN, and MEGAN on five image data sets that are widely used in the GAN literature: MNIST [16], FashionMNIST [24], UTZap50K [25], Oxford Flowers [19], and CelebA [17]. We resize MNIST and FashionMNIST data set to and the other data sets with more detail to . All image pixels are normalized to the range .
It is known that using a convolutional architecture for tasks that involve images increases the performance dramatically, and we incorporate transposed convolutional (also known as deconvolutional or fractionally strided convolution) layers in each model. More specifically, we use the (transposed) convolutional part of DCGAN [21] as the shared part of the generators, i.e., the shared block described above. Instead of generating samples directly in the data domain, each model generates an abstract representation which is given to the shared block that produces the output $x$. For any data set, the same shared block is used in all models.
All these variants combine multiple local models; we also define the fully connected (FC) model that uses a fully connected layer, which stands for the standard distributed alternative having one global generator, which we take as the baseline against which we compare all the localized variants.
In training HMoG, MoG, and the baseline model FC, the Wasserstein loss with gradient penalty [6] is used. For MADGAN and MGAN, we use the original likelihood-based loss; the Wasserstein loss is not applicable with these since they require $D$ to be a classifier. MEGAN can be used with either the Wasserstein loss or the original loss; we use the original loss because it performed better in our preliminary experiments. For all methods, we used the Adam optimizer [14] with the amsgrad option [22]. The learning rate is set to with beta values of Adam set to . The batch size is set to 128.
As evaluation measures, we use the Fréchet inception distance (FID) [7] and the classifier two-sample test [18], here, 5-nearest neighbor (5-NN) leave-one-out accuracy. Both FID and 5-NN accuracy are calculated with the activations before the softmax layer (2048-dim) of InceptionV3 [23]. Lower FID scores are better and 5-NN accuracies that are close to 50% are better. All models are run five times with different random seeds, and we report the means and standard deviations.
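The reason an accuracy near 50% is the target is that a nearest-neighbor classifier can only tell true from fake if the two samples occupy different regions of feature space. The following sketch of a 5-NN leave-one-out test assumes Euclidean distance and majority voting; the feature vectors are random stand-ins for the 2048-dimensional Inception activations, and the function name is hypothetical.

```python
import numpy as np

def knn_loo_accuracy(real_feats, fake_feats, k=5):
    """Leave-one-out k-NN accuracy for separating real from fake features."""
    feats = np.vstack([real_feats, fake_feats])
    labels = np.concatenate([np.ones(len(real_feats)), np.zeros(len(fake_feats))])
    sq = np.sum(feats ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T   # squared distances
    np.fill_diagonal(d2, np.inf)            # leave-one-out: never vote for yourself
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = labels[nearest].mean(axis=1)    # fraction of neighbors that are real
    predicted = (votes > 0.5).astype(float) # k is odd, so there are no ties
    return float(np.mean(predicted == labels))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
fake_same = rng.normal(size=(200, 4))           # drawn from the same distribution
fake_far = rng.normal(size=(200, 4)) + 100.0    # trivially distinguishable
acc_same = knn_loo_accuracy(real, fake_same)    # expected to be near chance level
acc_far = knn_loo_accuracy(real, fake_far)      # expected to be perfect
```

Accuracy well above 50% signals that fakes are distinguishable from true data; accuracy well below 50% can indicate that the generator has memorized training instances.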
For flat models, we experiment with 4, 8, 16, and 32 generators, which for the hierarchical model translates to trees of depth 2, 3, 4, and 5. We also report the parameter count of each model; these do not include the shared deconvolution block used in all models.
Our experimental results on the five data sets are shown in Figure 9. We see that in terms of FID score, both of our proposed models, MoG and HMoG, outperform the other approaches. We also see that MADGAN and MGAN perform worse than the baseline FC; only on MNIST does MADGAN perform better than the baseline. This might suggest that forcing the discriminator to classify generators, which is the idea behind MADGAN and MGAN, may not always work. On the other hand, MEGAN seems to perform on par with the baseline, sometimes even better. Note that unlike MADGAN and MGAN, MEGAN uses a gating function to select among its generators. This hints at the importance of training different generators in different input regions and combining them based on the input, instead of relying on the discriminator to force multiple generators to different modes.
If we compare our mixture of generators formulation (MoG) with MEGAN, we see that our model gets better results in terms of FID scores and 5-NN accuracies. As opposed to MEGAN, our mixture of generators is a soft, cooperative one. The input to the gating model is only the latent $z$, which also reduces the number of parameters significantly.
Some samples generated from HMoG with depth four are shown in Figure 10. For the sampling procedure, we randomly draw $z$ and disregard the least likely percent to get rid of possible outliers. A visual inspection of these also shows that HMoG is able to generate realistic and diverse samples on all data sets.
Because both MoG and HMoG use a soft combination, we can check whether there is any correlation between the outputs of the local generators. For the flat MoG, the probability that a local model is used is given by the softmax gating; for the HMoG, it is the product of all the (binary) gatings on the path to the root. We calculate the correlation between these probabilities for pairs of local models for the case of 16 generators (or a tree of depth 4) on the CelebA data set. The correlation matrices for both are shown color coded in Figure 11.
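The path-product probabilities and their pairwise correlations can be computed as follows. This is a sketch on random gating parameters, not on a trained model: it assumes a full binary tree in heap order with bias-free sigmoid gates, and `gate_w` and the latent dimensionality are placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def leaf_probabilities(z, gate_w, depth):
    """Probability of reaching each leaf: product of gatings along its path.

    z: (batch, z_dim); gate_w: (2**depth - 1, z_dim) in heap order
    (root = 0, children of node m are 2m+1 and 2m+2).
    Returns an array of shape (batch, 2**depth) whose rows sum to 1.
    """
    probs = np.ones((z.shape[0], 1))
    first = 0                                        # index of first node on this level
    for level in range(depth):
        n = 2 ** level
        g = sigmoid(z @ gate_w[first:first + n].T)   # left-child probabilities
        # Each node splits its mass between its left and right child.
        probs = np.stack([probs * g, probs * (1.0 - g)], axis=2).reshape(z.shape[0], 2 * n)
        first += n
    return probs

rng = np.random.default_rng(0)
depth, z_dim = 4, 8
gate_w = rng.normal(size=(2 ** depth - 1, z_dim))    # stand-in for learned gates
z = rng.normal(size=(1000, z_dim))
probs = leaf_probabilities(z, gate_w, depth)         # (1000, 16)
corr = np.corrcoef(probs, rowvar=False)              # (16, 16) leaf-by-leaf correlations
```

For a flat MoG, `probs` would simply be the softmax gating values, and the same `np.corrcoef` call applies.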
We see that with the flat MoG, correlations are randomly scattered. In HMoG however, we see that the correlations are gathered around the diagonal; we can see spectral squares of sizes and corresponding to subtrees, which is an implication that generators that have the same ancestor on the second or the third level of the tree are frequently used together indicating that they learn semantically correlated samples.
In Figures 12 and 13, the average responses of the decision nodes in the tree are visualized by taking the weighted average of the generated samples on MNIST and CelebA respectively. For a given node, the weights are found by multiplying the gating probabilities along the path to the node. At the bottom of the tree, under each leaf, we show five random samples generated from the corresponding generator. To find these, we sample 10,000 random $z$ vectors and select the five most likely for each generator. Here, the most likely point for a generator is the point that maximizes the probability that the corresponding leaf is chosen. We see the data set mean at the root, and as we go down the tree the blurriness decreases and each node becomes more specialized to a specific region of the data. We see in Figure 12 that digits that are similar in shape are generated by leaves that are nearby in the tree. For CelebA too, as we see in Figure 13, the examples are distributed over the leaves in terms of similarity in orientation, color, or background.
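Both visualization steps reduce to simple array operations once the leaf probabilities are available. The sketch below uses random stand-ins for the leaf probabilities and the generated samples; it relies on the fact that the path probability of an internal node equals the sum of the leaf probabilities in its subtree, since the gatings below the node sum to one. The function names are hypothetical.

```python
import numpy as np

def top_latents_per_leaf(leaf_probs, n_top=5):
    """Indices of the n_top latent vectors most likely to reach each leaf.

    leaf_probs: (n_samples, n_leaves). Returns an (n_leaves, n_top) index array.
    """
    order = np.argsort(-leaf_probs, axis=0)     # per leaf, most likely first
    return order[:n_top].T

def node_average(samples, leaf_probs, leaves_under_node):
    """Weighted average of generated samples for one node of the tree."""
    w = leaf_probs[:, leaves_under_node].sum(axis=1)   # path probability of the node
    return (w[:, None] * samples).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
n_samples, n_leaves, x_dim = 100, 4, 6
leaf_probs = rng.dirichlet(np.ones(n_leaves), size=n_samples)  # rows sum to 1
samples = rng.normal(size=(n_samples, x_dim))                  # stand-in for generated images
top = top_latents_per_leaf(leaf_probs)
root_mean = node_average(samples, leaf_probs, [0, 1, 2, 3])    # root covers every leaf
left_mean = node_average(samples, leaf_probs, [0, 1])          # left child of the root
```

At the root every weight is one, so the average is the plain mean of the generated samples, which is why the top of the tree shows the data set mean.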
We believe that this interpretability is the advantage of the HMoG model over the MoG model, as well as other approaches that train a flat set of generators. As in soft hierarchical clustering, the division at each level, which may be interpreted as an architectural inductive bias, lets us view the data in different levels of granularity and understand the decisive features of the data through a divide-and-conquer type of approach.
V Conclusions
We propose the hierarchical mixture of generators, HMoG, and a special case, MoG, which is a flat mixture of generators. There are GAN variants in the literature that also combine multiple generators but they are limited in the way they force the generators to different modes. Our formulation is the first to our knowledge that learns a cooperative mixture of generators, either organized in a flat manner or hierarchically.
An important advantage of the hierarchical model is its interpretability. Since it is a tree architecture, we can make a post-hoc analysis of the learned tree to gain insight about the data. At each level of the tree, nodes can be seen as clusters, or modes, at different levels of granularity, where as we go down the tree, clusters get more local. At the same time, splits are soft and what the tree learns is a hierarchical soft clustering of the data. In the generative setting that we have here, the leaves are generators, each responsible for generating one local cluster.
Our experimental results on five data sets show that the proposed models can generate samples that are realistic and diverse. Our proposed models have better FID score and 5-NN accuracy with lower variance when compared with other methods that incorporate multiple generators as well as the fully-connected standard GAN implementation.
This work is partially supported by Boğaziçi University Research Funds with Grant Number 18A01P7. The numerical calculations reported in this work were partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).
References
[1] Wasserstein generative adversarial networks. International Conference on Machine Learning 34, pp. 214–223.
[2] (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
[3] (2018) Generative adversarial networks: an overview. IEEE Signal Processing Magazine 35 (1), pp. 53–65.
[4] (2018) Multi-agent diverse generative adversarial networks. In , pp. 8513–8521.
[5] (2014) Generative adversarial nets. In Neural Information Processing Systems 27, pp. 2672–2680.
[6] (2017) Improved training of Wasserstein GANs. In Neural Information Processing Systems 30, pp. 5767–5777.
[7] (2017) GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500.
[8] (2018) MGAN: training generative adversarial nets with multiple generators. In International Conference on Learning Representations 6.
[9] (2019) How generative adversarial networks and their variants work: an overview. ACM Computing Surveys 52 (1), pp. 10.
[10] (1991) Adaptive mixtures of local experts. Neural Computation 3 (1), pp. 79–87.
[11] (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6 (2), pp. 181–214.
[12] (2017) Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
[13] (2019) A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition 32, pp. 4401–4410.
[14] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[15] (2018) The GAN landscape: losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720.
[16] The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[17] (2015) Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision 15, pp. 3730–3738.
[18] (2016) Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545.
[19] (2008) Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing 6, pp. 722–729.
[20] (2018) MEGAN: mixture of experts of generative adversarial networks for multimodal image generation. arXiv preprint arXiv:1805.02481.
[21] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[22] (2019) On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
[23] (2016) Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition 29, pp. 2818–2826.
[24] (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
[25] (2014) Fine-grained visual comparisons with local learning. In IEEE Conference on Computer Vision and Pattern Recognition 27, pp. 192–199.